jmlr jmlr2005 jmlr2005-7 knowledge-graph by maker-knowledge-mining

7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression


Source: pdf

Author: Joaquin Quiñonero-Candela, Carl Edward Rasmussen

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints. Keywords: Gaussian process, probabilistic regression, sparse approximation, Bayesian committee machine Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and cube respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see Rasmussen and Williams, 2006, chapter 8) of how they relate to each other, and which can be expected to give rise to the best algorithms. In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as “exact inference with an approximated prior”, rather than the existing (ubiquitous) interpretation “approximate inference with the exact prior”. This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them. In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse c 2005 Joaquin Qui˜ onero-Candela and Carl Edward Rasmussen. n ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN algorithms in Sections 4-7. The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10 and conclusions are drawn at the end. 1. Gaussian Processes for Regression Probabilistic regression is usually formulated as follows: given a training set D = {(xi , yi ), i = 1, . . . , n} of n pairs of (vectorial) inputs xi and noisy (real, scalar) outputs yi , compute the predictive distribution of the function values f∗ (or noisy y∗ ) at test locations x∗ . In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f (x) and the observed noisy targets y are given by yi = f (xi ) + εi , where εi ∼ N (0, σ2 ) , noise (1) where σ2 is the variance of the noise. noise Definition 1 A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent1 joint Gaussian distributions. Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior2 over functions, i.e. assumes a priori that function values behave according to p(f|x1 , x2 , . . . , xn ) = N (0, K) , (2) where f = [ f1 , f2 , . . . , fn ] is a vector of latent function values, fi = f (xi ) and K is a covariance matrix, whose entries are given by the covariance function, Ki j = k(xi , x j ). Note that the GP treats the latent function values fi as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves. Remark 2 Note, that to adhere to a strict Bayesian formalism, the GP covariance function,3 which defines the prior, should not depend on the data (although it can depend on additional parameters). As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases. Definition 3 A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues. 1. By consistency is meant simply that the random variables obey the usual rules of marginalization, etc. 2. For notational simplicity we exclusively use zero-mean priors. 3. The covariance function itself shouldn’t depend on the data, though its value at a specific pair of inputs of course will. 1940 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. with squared exponential or RBF covariance function) do not. The prior for a finite m dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n m. Note however, that non-degeneracy on its own doesn’t guarantee the existence of the “right kind” of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example that of Rasmussen and Williams (2006). Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f∗ 4 , and combine it with the likelihood5 p(y|f) using Bayes rule, to obtain the joint posterior p(f, f∗ )p(y|f) . p(y) p(f, f∗ |y) = (3) The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables: p(f∗ |y) = Z 1 p(y) p(f, f∗ |y)df = Z p(y|f) p(f, f∗ ) df , (4) or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian p(f, f∗ ) = N 0, Kf,f K∗,f Kf,∗ K∗,∗ , and p(y|f) = N (f, σ2 I) , noise (5) where K is subscript by the variables between which the covariance is computed (and we use the asterisk ∗ as shorthand for f∗ ) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution p(f∗ |y) = N K∗,f (Kf,f + σ2 I)−1 y, K∗,∗ − K∗,f (Kf,f + σ2 I)−1 Kf,∗ , noise noise (6) see the relevant Gaussian identity in appendix A. The problem with the above expression is that it requires inversion of a matrix of size n × n which requires O (n3 ) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases. 2. A New Unifying View We now seek to modify the joint prior p(f∗ , f) from (5) in ways which will reduce the computational requirements from (6). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u1 , . . . , um ] , which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f∗ ), corresponding to a set of input locations Xu , which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. 4. We will mostly consider a vector of test cases f∗ (rather than a single f∗ ). 5. You may have been expecting the likelihood written as p(y|f∗ , f) but since the likelihood is conditionally independent of everything else given f, this makes no difference. 1941 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as “support points”, “active set” or “pseudo-inputs”. Particular sparse algorithms choose the inducing variables in various different ways; some algorithms chose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. For now consider any arbitrary inducing variables. Due to the consistency of Gaussian processes, we know that we can recover p(f∗ , f) by simply integrating (marginalizing) out u from the joint GP prior p(f∗ , f, u) p(f∗ , f) = Z p(f∗ , f, u) du = Z p(f∗ , f|u) p(u) du, where p(u) = N (0, Ku,u ) . (7) This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f∗ and f are conditionally independent given u, see Figure 1, such that p(f∗ , f) q(f∗ , f) = Z q(f∗ |u) q(f|u) p(u) du . (8) The name inducing variable is motivated by the fact that f and f∗ can only communicate though u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f∗ |u) of the integral in (8). It will be useful for future reference to specify here the exact expressions for the two conditionals training conditional: test conditional: −1 p(f|u) = N (Kf,u Ku,u u, Kf,f − Qf,f ) , −1 p(f∗ |u) = N (K∗,u Ku,u u, K∗,∗ − Q∗,∗ ) , (9a) (9b) −1 where we have introduced the shorthand notation6 Qa,b Ka,u Ku,u Ku,b . We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6) with u playing the role of (noise free) observations. Note that the (positive semi-definite) covariance matrices in (9) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f∗ ). We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior p(y|f) = N (f, σ2 I) , and p(u) = N (0, Ku,u ) . (10) noise 3. The Subset of Data (SoD) Approximation Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn’t fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O (m3 ), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the 6. Note, that Qa,b depends on u although this is not explicit in the notation. 1942 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗ Figure 1: Graphical model of the relation between the inducing variables u, the training latent functions values f = [ f1 , . . . , fn ] and the test function value f∗ . The thick horizontal line represents a set of fully connected nodes. The observations y1 , . . . , yn , y∗ (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (5). Left graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Right graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (8). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f∗ via the inducing variables u. uncertainties, when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations. In Figure 5 top, left we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this paper we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into actually practical algorithms. 4. The Subset of Regressors (SoR) Approximation The Subset of Regressors (SoR) algorithm was given by Silverman (1985), and mentioned again by Wahba et al. (1999). It was then adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression. SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x∗ , the corresponding function value f∗ is given by: f∗ = K∗,u wu , with −1 p(wu ) = N (0, Ku,u ) , (11) where there is one weight associated to each inducing input in Xu . Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, 1943 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN which is Gaussian with zero mean and covariance u = Ku,u wu ⇒ uu = Ku,u wu wu Ku,u = Ku,u . (12) −1 Using the effective prior on u and the fact that wu = Ku,u u we can redefine the SoR model in an equivalent, more intuitive way: −1 f∗ = K∗,u Ku,u u , with u ∼ N (0, Ku,u ) . (13) We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f∗ and u, the approximate conditional distributions in the integral in eq. (8) are given by: −1 qSoR (f|u) = N (Kf,u Ku,u u, 0) , and −1 qSoR (f∗ |u) = N (K∗,u Ku,u u, 0) , (14) with zero conditional covariance, compare to (9). The effective prior implied by the SoR approximation is easily obtained from (8), giving qSoR (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f Q∗,∗ , (15) −1 where we recall Qa,b Ka,u Ku,u Ku,b . A more descriptive name for this method, would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior. The m + 1-th one is a linear combination of the previous. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases. The degeneracy of the prior causes unreasonable predictive distributions. Indeed, the approximate prior over functions is so restrictive, that given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Qui˜ onero-Candela, 2005). Figure 5, top, right panel, illustrates the unreasonable n predictive uncertainties of the SoR approximation on a toy dataset.7 The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with: qSoR (f∗ |y) = N Q∗,f (Qf,f + σ2 I)−1 y, Q∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ , noise noise = N σ K∗,u Σ Ku,f y, K∗,u ΣKu,∗ , −2 (16a) (16b) where we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 . Equation (16a) is readily recognized as the regular prediction equation (6), except that the covariance K has everywhere been replaced by Q, which was already suggested by (15). This corresponds to replacing the covariance function k with −1 kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The new covariance function has rank (at most) m. Thus we have the following 7. Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate computationally costly alternative (more details are given by Qui˜ onero-Candela, 2004, Chapter 3). n 1944 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Remark 4 The SoR approximation is equivalent to exact inference in the degenerate Gaussian −1 process with covariance function kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The equivalent (16b) is computationally cheaper, and with (11) in mind, Σ is the covariance of the posterior on the weights wu . Note that as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O (nm2 ) initially, and O (m) and O (m2 ) per test case for the predictive mean and variance respectively. 5. The Deterministic Training Conditional (DTC) Approximation Taking up ideas already contained in the work of Csat´ and Opper (2002), Seeger et al. (2003) o recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on −1 the projection f = Kf,u Ku,u u: p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, σ2 I) . noise (17) The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. (9b) −1 qDTC (f|u) = N (Kf,u Ku,u u, 0), and qDTC (f∗ |u) = p(f∗ |u) . (18) This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (14). A systematic name for this approximation is the Deterministic Training Conditional (DTC). The fundamental difference with SoR is that DTC uses the exact test conditional (9b) instead of the deterministic relation between f∗ and u of SoR. The joint prior implied by DTC is given by: qDTC (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f K∗,∗ , (19) which is surprisingly similar to the effective prior implied by the SoR approximation (15). The fundamental difference is that under the DTC approximation f∗ has a prior variance of its own, given by K∗,∗ . This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. The predictive distribution is now given by: qDTC (f∗ |y) = N (Q∗,f (Qf,f + σ2 I)−1 y, K∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ noise noise = N σ K∗,u Σ Ku,f y, K∗,∗ − Q∗,∗ + K∗,u ΣK∗,u , −2 (20a) (20b) where again we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 as in (16). The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q∗,∗ from SoR with K∗,∗ (which is larger, since K∗,∗ − Q∗,∗ is positive definite). This added term is the predictive variance of the posterior of f∗ conditioned on u. It grows to the prior variance K∗,∗ as x∗ moves far from the inducing inputs in Xu . 1945 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r f1 f2 fn f∗ Figure 2: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u. Remark 5 The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR. Note, that since the covariances for training cases and test cases are computed differently, see (19), it follows that Remark 6 The DTC approximation does not correspond exactly to a Gaussian process, as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1. The computational complexity has the same order as for SoR. 6. The Fully Independent Training Conditional (FITC) Approximation Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Gaussian Processes using Pseudo-inputs (SGPP). While the DTC is based on the likelihood approximation given by (17), the SGPP proposes a more sophisticated likelihood approximation with a richer covariance p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, diag[Kf,f − Qf,f ] + σ2 I) , noise (21) where diag[A] is a diagonal matrix whose elements match the diagonal of A. As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC) based on the inducing conditionals: n qFITC (f|u) = ∏ p( fi |u) = N i=1 −1 Kf,u Ku,u u, diag[Kf,f −Qf,f ] , and qFITC ( f∗ |u) = p( f∗ |u) . (22) We see that as opposed to SoR and DTC, FITC does not impose a deterministic relation between f and u. Instead of ignoring the variance, FITC proposes an approximation to the training conditional distribution of f given u as a further independence assumption. In addition, the exact test conditional from (9b) is used in (22), although for reasons which will become clear towards the end of this 1946 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION section, we initially consider only a single test case, f∗ . The corresponding graphical model is given in Figure 2. The effective prior implied by the FITC is given by qFITC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ . (23) Note, that the sole difference between the DTC and FITC is that in the top left corner of the implied prior covariance, FITC replaces the approximate covariances of DTC by the exact ones on the diagonal. The predictive distribution is qFITC ( f∗ |y) = N Q∗,f (Qf,f + Λ)−1 y, K∗,∗ − Q∗,f (Qf,f + Λ)−1 Qf,∗ (24a) = N K∗,u ΣKu,f Λ−1 y, K∗,∗ − Q∗,∗ + K∗,u ΣKu,∗ , (24b) where we have defined Σ = (Ku,u + Ku,f Λ−1 Kf,u )−1 and Λ = diag[Kf,f − Qf,f + σ2 I ]. The compunoise tational complexity is identical to that of SoR and DTC. So far we have only considered a single test case. There are two options for joint predictions, either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don’t explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariance are computed differently, and thus this does not correspond to our strict definition of a GP model, but Remark 7 Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function kFIC (xi , x j ) = kSoR (xi , x j ) + δi, j [k(xi , x j ) − kSoR (xi , x j )], where δi, j is Kronecker’s delta. A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is: qFIC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f Q∗,∗ − diag[Q∗,∗ − K∗,∗ ] . (25) 7. The Partially Independent Training Conditional (PITC) Approximation In the previous section we saw how to improve the DTC approximation by approximating the training conditional with an independent distribution, i.e. one with a diagonal covariance matrix. In this section we will further improve the approximation (while remaining computationally attractive) by extending the training conditional to have a block diagonal covariance: −1 qPITC (f|u) = N Kf,u Ku,u u, blockdiag[Kf,f − Qf,f ] , and qPITC (f∗ |u) = p(f∗ |u) . (26) where blockdiag[A] is a block diagonal matrix (where the blocking structure is not explicitly stated). We represent graphically the PITC approximation in Figure 3. Developing this analogously to the FITC approximation from the previous section, we get the joint prior qPITC (f, f∗ ) = N 0, Qf,f − blockdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ 1947 , (27) ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r fI fI fI f∗ 1 2 k Figure 3: Graphical representation of the PITC approximation. The set of latent function values fIi indexed by the the set of indices Ii is fully connected. The PITC differs from FITC (see graph in Fig. 2) in that conditional independence is now between the k groups of training latent function values. This corresponds to the block diagonal approximation to the true training conditional given in (26). and the predictive distribution is identical to (24), except for the alternative definition of Λ = blockdiag[Kf,f − Qf,f + σ2 I ]. An identical expression was obtained by Schwaighofer and Tresp noise (2003, Sect. 3), developing from the original Bayesian committee machine (BCM) by Tresp (2000). The relationship to the FITC was pointed out by Lehel Csat´ . The BCM was originally proposed as o a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs Xu were chosen to be the test inputs. We discuss transduction in detail in the next section. It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second setting the inducing inputs to be the test inputs. These two ideas can be used independently and in Section 8 we propose using the first without the second. The computational complexity of the PITC approximation depends on the blocking structure imposed in (26). A reasonable choice, also recommended by Tresp (2000) may be to choose k = n/m blocks, each of size m × m. The computational complexity remains O (nm2 ). Since in the PITC model the covariance is computed differently for training and test cases Remark 8 The PITC approximation does not correspond exactly to a Gaussian process. This is because computing covariances requires knowing whether points are from the training- or test-set, (27). One can obtain a Gaussian process from the PITC by extending the partial conditional independence assumption to the test conditional, as we did in Remark 7 for the FITC. 8. Transduction and Augmentation The idea of transduction is that one should restrict the goal of learning to prediction on a prespecified set of test cases, rather than trying to learn an entire function (induction) and then evaluate it at the test inputs. There may be no universally agreed upon definition of transduction. In this paper we use Definition 9 Transduction occurs only if the predictive distribution depends on other test inputs. This operational definition excludes models for which there exist an equivalent inductive counterpart. According to this definition, it is irrelevant when the bulk of the computation takes place. 1948 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u f∗   e  ¡     ¡ e   e ¡     ¡     e e   ¡       e ¡   r r r fI fI fI 1 2 k Figure 4: Two views on Augmentation. One view is to see that the test latent function value f∗ is now part of the inducing variables u and therefore has access to the training latent function values. An equivalent view is to consider that we have dropped the assumption of conditional independence between f∗ and the training latent function values. Even if f∗ has now direct access to each of the training fi , these still need to go through u to talk to each other if they fall in conditionally independent blocks. We have in this figure decided to recycle the graph for PITC from Figure 3 to show that all approximations we have presented can be augmented, irrespective of what the approximation for the training conditional is. There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995), 2) the test inputs may reveal important information, which should be used during training. This motivation drives models in semi-supervised learning (studied mostly in the context of classification) and 3) for approximate algorithms one may be able to limit the discrepancies of the approximation at the test points. For exact GP models it seems that the first reason doesn’t really apply. If you make predictions at the test points that are consistent with a GP, then it is trivial inside the GP framework to extend these to any other input points, and in effect we have done induction. The second reason seems more interesting. However, in a standard GP setting, it is a consequence of the consistency property, see Remark 2, that predictions at one test input are independent of the location of any other test inputs. Therefore transduction can not be married with exact GPs: Remark 10 Transduction can not occur in exact Gaussian process models. Whereas this holds for the usual setting of GPs, it could be different in non-standard situations where e.g. the covariance function depends on the empirical input densities. Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. The BCM from the previous section, where Xu = X∗ (where X∗ are the test inputs) is an example of this. Since the inducing variables are connected to all other nodes (see Figure 3) we would expect the approximation to be good at u = f∗ , which is what we care about for predictions, relating to reason 3) above. While this reasoning is sound, it is not necessarily a sufficient consideration for getting a good model. The model has to be able to simultaneously explain the training targets as well and if the choice of u makes this difficult, the posterior at the points of interest may be distorted. Thus, the choice of u should be governed by the ability to model the conditional of the latents given the inputs, and not solely by the density of the (test) inputs. The main drawback of transduction is that by its nature it doesn’t provide a predictive model in the way inductive models do. In the usual GP model one can do the bulk of the computation 1949 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN involved in the predictive distributions (e.g. matrix inversion) before seeing the test cases, enabling fast computation of test predictions. It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. The quality of the BCM approximation depends then on the particular location of the test inputs, upon which one usually does not have any control. We now see that there may be a better method, eliminating the drawback of transduction, namely use the PITC approximation, but choose the u’s carefully (see Section 9), don’t just use the test set. 8.1 Augmentation An idea closely related to transduction, but not covered by our definition, is augmentation, which in contrast to transduction is done individually for each test case. Since in the previous sections, we haven’t assumed anything about u, we can simply augment the set of inducing variables by f∗ (i.e. have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. Let’s first investigate the consequences for the test conditional from (9b). Note, the interpretation of the covariance matrix K∗,∗ − Q∗,∗ was “the prior covariance minus the information which u provides about f∗ ”. It is clear that the augmented u (with f∗ ) provides all possible information about f∗ , and consequently Q∗,∗ = K∗,∗ . An equivalent view on augmentation is that the assumption of conditional independence between f∗ and f is dropped. This is seen trivially by adding edges between f∗ and the fi in the graphical model, Figure 4. Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Qui˜ onero-Candela (2004). Because the SoR is a finite linear model, and n the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. For example, when making predictions far away from the center of any basis function, all basis functions have insignificant magnitudes, and the prediction (averaged over the posterior) will be close to zero, with very small error-bars; this is the opposite of the desired behaviour, where we would expect the error-bars to grow as we move away from the training cases. Here augmentation makes a particularly big difference turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. Compare the nonaugmented to the augmented SoR in Figure 5. An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Qui˜ onero-Candela, 2005). n Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn’t make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere). Remark 11 A full non-degenerate Gaussian process cannot be augmented, since the corresponding f∗ would already be connected to all other variables in the graphical model. But augmentation does make sense for sparse approximations to GPs. The more general process view on augmentation has several advantages over the basis function view. It is not completely clear from the basis function view, which basis function should be used for augmentation. For example, Rasmussen and Qui˜ onero-Candela (2005) successfully apply augn mentation using basis functions that have a zero contribution at the test location! In the process view 1950 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION however, it seems clear that one would chose the additional inducing variable to be f∗ , to minimize the effects of the approximations. Let us compute the effective prior for the augmented SoR. Given that f∗ is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior: Z qASoR (f∗ , f) = qSoR (f| f∗ , u) p( f∗ , u) du . (28) It is interesting to notice that this is also the effective prior that would result from augmenting the DTC approximation, since qSoR (f| f∗ , u) = qDTC (f| f∗ , u). Remark 12 Augmented SoR (ASoR) is equivalent to augmented DTC (ADTC). Augmented DTC only differs from DTC in the additional presence of f∗ among the inducing variables in the training conditional. We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. Therefore Remark 13 DTC is a less accurate (but cheaper) approximation than augmented SoR. We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. The predictive distribution of the augmented SoR is obtained by adding f∗ to u in (20). Prediction with an augmented sparse model comes at a higher computational cost, since now f∗ directly interacts with all of f and not just with u. For each new test case, updating the augmented Σ in the predictive equation (for example (20b) for DTC) implies computing the vector matrix product K∗,f Kf,u with complexity O (nm). This is clearly higher than the O (m) for the mean, and O (m2 ) for the predictive distribution of all the non-augmented methods we have discussed. Augmentation seems to be only really necessary for methods that make a severe approximation to the test conditional, like the SoR. For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. However, one can see by giving f∗ access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) than standard DTC. Note however that in the figure we have purposely chosen a too short lengthscale to enhance visualization. Quantitatively, this superiority was experimentally assessed by Qui˜ onero-Candela (2004, Table 3.1). Augmentation hasn’t been n compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006). Transductive methods like the BCM can be seen as joint augmentation, and one could potentially use it for any of the methods presented. It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. Joint augmentation might bring some computational advantage, but won’t change the scaling: note that augmenting m times at a cost of O (nm) apiece implies the same O (nm2 ) total cost as the jointly augmented BCM. 1951 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN SoD SoR 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 DTC 0 5 10 15 5 10 15 5 10 15 ASoR/ADTC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 FITC 0 PITC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 0 Figure 5: Toy example with identical covariance function and hyperparameters. The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. Augmented DTC (ADTC) is equivalent to augmented SoR (ASoR), see Remark 12. 1952 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION 9. On the Choice of the Inducing Variables We have until now assumed that the inducing inputs Xu were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (Cortes and Vapnik, 1995). In sparse Gaussian processes it has also been suggested to select the inducing inputs Xu from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csat´ and o Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and probably more. As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And perhaps the less restrictive choice can lead to better performance in very sparse models. Which optimality criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment which would involve defining priors on Xu , one could maximize the marginal likelihood (also called the evidence) with respect to Xu , an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs q(y|Xu ) = ZZ p(y|f) q(f|u) p(u|Xu )du df = Z p(y|f) q(f|Xu )df , (29) which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs Xu . Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge σ2 I (from the likelihood) to the covariance of effective noise prior on f. Using the appropriate definitions of Λ, the log marginal likelihood becomes 1 log q(y|Xu ) = − 2 log |Qf,f + Λ| − 1 y (Qf,f + Λ)−1 y − n log(2π) , 2 2 (30) where ΛSoR = ΛDTC = σ2 I, ΛFITC = diag[Kf,f − Qf,f ] + σ2 I, and ΛPITC = blockdiag[Kf,f − noise noise Qf,f ] + σ2 I. The computational cost of the marginal likelihood is O (nm2 ) for all methods, that of noise its gradient with respect to one element of Xu is O (nm). This of course implies that the complexity of computing the gradient wrt. to the whole of Xu is O (dnm2 ), where d is the dimension of the input space. It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis see Qui˜ onero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2). n The marginal likelihood has traditionally been used to learn the hyperparameters of GPs in the non fully Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning Xu it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization, and there is no need to interleave optimization of u with learning of the hyperparameters as it has been proposed for example by Seeger et al. (2003). 1953 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN 10. Other Methods In this section we briefly mention two approximations which don’t fit in our unifying scheme, since one doesn’t correspond to a proper probabilistic model, and the other one uses a particular construction for the covariance function, rather than allowing any general covariance function. 10.1 The Nystr¨ m Approximation o The Nystr¨ m Approximation for speeding up GP regression was originally proposed by Williams o and Seeger (2001), and then questioned by Williams et al. (2002). Like SoR and DTC, the Nystr¨ m o Approximation for GP regression approximates the prior covariance of f by Qf,f . However, unlike these methods, the Nystr¨ m Approximation is not based on a generative probabilistic model. The o prior covariance between f∗ and f is taken to be exact, which is inconsistent with the prior covariance on f: Qf,f Kf,∗ . (31) q(f, f∗ ) = N 0, K∗,f K∗,∗ As a result we cannot derive this method from our unifying framework, nor represent it with a graphical model. Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. Notice that replacing Kf,∗ by Qf,∗ in (31) is enough to make the prior covariance positive definite, and one obtains the DTC approximation. Remark 14 The Nystr¨ m Approximation does not correspond to a well-formed probabilistic model. o Ignoring any quibbles about positive definiteness, the predictive distribution of the Nystr¨ m Apo proximation is given by: p( f∗ |y) = N Kf,∗ [Qf,f + σ2 I]−1 y, K∗,∗ − Kf,∗ [Qf,f + σ2 I]−1 Kf,∗ , noise noise (32) but the predictive variance is not guaranteed to be positive. The computational cost is O (nm2 ). 10.2 The Relevance Vector Machine The relevance vector machine, introduced by Tipping (2001), is a finite linear model with an independent Gaussian prior imposed on the weights. For any input x∗ , the corresponding function output is given by: f∗ = φ∗ w , with p(w|A) = N (0, A) , (33) where φ∗ = [φ1 (x), . . . , φm (x)] is the (row) vector of responses of the m basis functions, and A = diag(α1 , . . . , αm ) is the diagonal matrix of joint prior precisions (inverse variances) of the weights. The αi are learnt by maximizing the RVM evidence (obtained by also assuming Gaussian additive iid. noise, see (1)), and for the typical case of rich enough sets of basis functions many of the precisions go to infinity effectively pruning out the corresponding weights (for a very interesting analysis see Wipf et al., 2004). The RVM is thus a sparse method and the surviving basis functions are called relevance vectors. Note that since the RVM is a finite linear model with Gaussian priors on the weights, it can be seen as a Gaussian process: Remark 15 The RVM is equivalent to a degenerate Gaussian process with covariance function kRVM (xi , x j ) = φi A−1 φ j = ∑m α−1 φk (xi ) φk (x j ), k=1 k 1954 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION q(f∗ |u) q(f|u) GP exact exact SoR determ. determ. DTC exact determ. FITC (exact) fully indep. PITC exact partially indep. Method joint prior covariance Kf,f Kf,∗ K∗,f K∗,∗ Qf,f Qf,∗ Q∗,f Q∗,∗ Qf,f Qf,∗ Q∗,f K∗,∗ Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ Qf,f − blokdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ GP? √ √ √ ( ) Table 1: Summary of the way approximations are built. All these methods are detailed in the previous sections. The initial cost and that of the mean and variance per test case are respectively n2 , n and n2 for the exact GP, and nm2 , m and m2 for all other methods. The “GP?” column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7. as was also pointed out by Tipping (2001, eq. (59)). Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM does also suffer from unreasonable predictive variances. Rasmussen and Qui˜ onero-Candela (2005) show that the predictive distributions of RVMs can also be healed by n augmentation, see Section 8. Once the αi have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O (m) for mean and O (m2 ) for the variance. RVMs are often used with radial basis functions centered on the training inputs. One potentially interesting extension to the RVM would be to learn the locations of the centers of the basis functions, in the same way as proposed by Snelson and Ghahramani (2006) for the FITC approximation, see Section 6. This is a curious reminiscence of learning the centers in RBF Networks. 11. Conclusions We have provided a unifying framework for sparse approximations to Gaussian processes for regression. Our approach consists of two steps, first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. We recover all existing sparse methods by making further simplifications of the covariances of the training and test conditionals, see Table 1 for a summary. Previous methods were presented based on different approximation paradigms (e.g. likelihood approximations, projection methods, matrix approximations, minimization of Kullback-Leibler divergence, etc), making direct comparison difficult. Under our unifying view we deconstruct methods, making it clear which building blocks they are based upon. For example, the SGPP by Snelson 1955 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN and Ghahramani (2006) contains two ideas, 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. Similarly, the BCM by Tresp (2000) contains two independent ideas 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. Finally we note that although all three ideas of 1) transductively setting u = f∗ , 2) augmentation and 3) continuous optimization of Xu have been proposed in very specific settings, in fact they are completely general ideas, which can be applied to any of the approximation schemes considered. We have ranked the approximation according to how close they are to the corresponding full GP. However, the performance in practical situations may not always follow this theoretical ranking since the approximations might exhibit properties (not present in the full GP) which may be particularly suitable for specific datasets. This may make the interpretation of empirical comparisons challenging. A further complication arises when adding the necessary heuristics for turning the theoretical constructs into practical algorithms. We have not described full algorithms in this paper, but are currently working on a detailed empirical study (in preparation, see also Rasmussen and Williams, 2006, chapter 8). We note that the order of the computational complexity is identical for all the methods considered, O (nm2 ). This highlights that there is no computational excuse for using gross approximations, such as assuming deterministic relationships, in particular one should probably think twice before using SoR or even DTC. Although augmentation has attractive predictive properties, it is computationally expensive. It remains unclear whether augmentation could be beneficial on a fixed computational budget. We have only considered the simpler case of regression in this paper, but sparseness is also commonly sought in classification settings. It should not be difficult to cast probabilistic approximation methods such as Expectation Propagation (EP) or the Laplace method (for a comparison, see Kuss and Rasmussen, 2005) into our unifying framework. Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. This would correspond to a non-transductive version of the BCM. We would evade the necessity of knowing the test set before doing the bulk of the computation, and we could hope to supersede the superior performance reported by Snelson and Ghahramani (2006) for very sparse approximations. Acknowledgments Thanks to Neil Lawrence for arranging the 2005 Gaussian Process Round Table meeting in Sheffield, which provided much inspiration to this paper. Special thanks to Olivier Chapelle, Lehel Csat´ , o Zoubin Ghahramani, Matthias Seeger, Ed Snelson and Chris Williams for helpful discussions, and to three anonymous reviewers. Both authors were supported by the German Research Council (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. 1956 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Appendix A. Gaussian and Matrix Identities In this appendix we provide identities used to manipulate matrices and Gaussian distributions throughout the paper. Let x and y be jointly Gaussian x y µx µy ∼ N , A C C B , (34) then the marginal and the conditional are given by x ∼ N (µx , A) , and x|y ∼ N µx +C B−1 (y − µy ), A −C B−1C (35) Also, the product of a Gaussian in x with a Gaussian in a linear projection P x is again a Gaussian, although unnormalized N (x|a, A) N (P x|b, B) = zc N (x|c,C) , (36) where C = A−1 + P B−1 P −1 c = C A−1 a + P B−1 b . , The normalizing constant zc is gaussian in the means a and b of the two Gaussians: m 1 1 zc = (2 π)− 2 |B + P A P |− 2 exp − 2 (b − P a) B+PAP −1 (b − P a) . (37) The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula states that: (Z +UWV )−1 = Z −1 − Z −1U(W −1 +V Z −1U)−1V Z −1 , (38) assuming the relevant inverses all exist. Here Z is n × n, W is m × m and U and V are both of size n × m; consequently if Z −1 is known, and a low rank (ie. m < n) perturbation are made to Z as in left hand side of eq. (38), considerable speedup can be achieved. References Corinna Cortes and Vladimir Vapnik. Support-vector network. Machine Learning, 20(3):273–297, 1995. Lehel Csat´ and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3): o 641–669, 2002. Sathiya Keerthi and Wei Chu. A Matching Pursuit approach to sparse Gaussian process regression. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, Cambridge, Massachussetts, 2006. The MIT Press. Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, pages 1679–1704, 2005. Joaquin Qui˜ onero-Candela. Learning with Uncertainty – Gaussian Processes and Relevance Vecn tor Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004. Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002. 1957 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN Carl Edward Rasmussen and Joaquin Qui˜ onero-Candela. Healing the relevance vector machine by n augmentation. In International Conference on Machine Learning, 2005. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT press, 2006. Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 953–960, Cambridge, Massachussetts, 2003. The MIT Press. Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003. Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Roy. Stat. Soc. B, 47(1):1–52, 1985. (with discussion). Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625, Cambridge, Massachussetts, 2001. The MIT Press. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems o 18, Cambridge, Massachussetts, 2006. The MIT Press. Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001. Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995. Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, and Barbara Klein. The biasvariance tradeoff and the randomized GACV. In Michael S. Kerns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 620–626, Cambridge, Massachussetts, 1999. The MIT Press. Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, Massachussetts, 1996. The MIT Press. Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nystr¨ m method for Gaussiam process prediction. Technical report, University o of Edinburgh, Edinburgh, Scotland, 2002. 1958 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Christopher K. I. Williams and Mathias Seeger. Using the Nystr¨ m method to speed up kernel o machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, Massachussetts, 2001. The MIT Press. David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16, Cambridge, Massachussetts, 2004. The MIT Press. 1959

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 The relation of transduction and augmentation to our sparse framework is covered in Section 8. [sent-23, score-0.366]

2 All our approximations are written in terms of a new set of inducing variables. [sent-24, score-0.409]

3 , n} of n pairs of (vectorial) inputs xi and noisy (real, scalar) outputs yi , compute the predictive distribution of the function values f∗ (or noisy y∗ ) at test locations x∗ . [sent-31, score-0.386]

4 , fn ] is a vector of latent function values, fi = f (xi ) and K is a covariance matrix, whose entries are given by the covariance function, Ki j = k(xi , x j ). [sent-42, score-0.363]

5 That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases. [sent-48, score-0.288]

6 Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f∗ 4 , and combine it with the likelihood5 p(y|f) using Bayes rule, to obtain the joint posterior p(f, f∗ )p(y|f) . [sent-64, score-0.337]

7 Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution p(f∗ |y) = N K∗,f (Kf,f + σ2 I)−1 y, K∗,∗ − K∗,f (Kf,f + σ2 I)−1 Kf,∗ , noise noise (6) see the relevant Gaussian identity in appendix A. [sent-67, score-0.348]

8 These latent variables are values of the Gaussian process (as also f and f∗ ), corresponding to a set of input locations Xu , which we call the inducing inputs. [sent-76, score-0.467]

9 Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. [sent-77, score-0.773]

10 1941 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as “support points”, “active set” or “pseudo-inputs”. [sent-82, score-0.347]

11 Particular sparse algorithms choose the inducing variables in various different ways; some algorithms chose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. [sent-83, score-0.872]

12 (8) The name inducing variable is motivated by the fact that f and f∗ can only communicate though u, and u therefore induces the dependencies between training and test cases. [sent-89, score-0.43]

13 As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f∗ |u) of the integral in (8). [sent-90, score-0.412]

14 We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6) with u playing the role of (noise free) observations. [sent-92, score-0.25]

15 Note that the (positive semi-definite) covariance matrices in (9) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f∗ ). [sent-93, score-0.275]

16 We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior p(y|f) = N (f, σ2 I) , and p(u) = N (0, Ku,u ) . [sent-94, score-0.681]

17 The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. [sent-110, score-0.442]

18 Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f∗ via the inducing variables u. [sent-113, score-0.525]

19 In Figure 5 top, left we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. [sent-116, score-0.278]

20 For any input x∗ , the corresponding function value f∗ is given by: f∗ = K∗,u wu , with −1 p(wu ) = N (0, Ku,u ) , (11) where there is one weight associated to each inducing input in Xu . [sent-127, score-0.39]

21 Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, 1943 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN which is Gaussian with zero mean and covariance u = Ku,u wu ⇒ uu = Ku,u wu wu Ku,u = Ku,u . [sent-128, score-0.503]

22 The degeneracy of the prior causes unreasonable predictive distributions. [sent-139, score-0.34]

23 Indeed, the approximate prior over functions is so restrictive, that given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. [sent-140, score-0.317]

24 Figure 5, top, right panel, illustrates the unreasonable n predictive uncertainties of the SoR approximation on a toy dataset. [sent-142, score-0.345]

25 7 The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). [sent-143, score-0.384]

26 Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate computationally costly alternative (more details are given by Qui˜ onero-Candela, 2004, Chapter 3). [sent-149, score-0.25]

27 The computational complexity is O (nm2 ) initially, and O (m) and O (m2 ) per test case for the predictive mean and variance respectively. [sent-153, score-0.305]

28 (2003) o recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. [sent-156, score-0.699]

29 One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. [sent-160, score-0.294]

30 The joint prior implied by DTC is given by: qDTC (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f K∗,∗ , (19) which is surprisingly similar to the effective prior implied by the SoR approximation (15). [sent-166, score-0.303]

31 This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. [sent-168, score-0.317]

32 The predictive distribution is now given by: qDTC (f∗ |y) = N (Q∗,f (Qf,f + σ2 I)−1 y, K∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ noise noise = N σ K∗,u Σ Ku,f y, K∗,∗ − Q∗,∗ + K∗,u ΣK∗,u , −2 (20a) (20b) where again we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 as in (16). [sent-169, score-0.348]

33 The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q∗,∗ from SoR with K∗,∗ (which is larger, since K∗,∗ − Q∗,∗ is positive definite). [sent-170, score-0.535]

34 This added term is the predictive variance of the posterior of f∗ conditioned on u. [sent-171, score-0.282]

35 It grows to the prior variance K∗,∗ as x∗ moves far from the inducing inputs in Xu . [sent-172, score-0.495]

36 Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. [sent-174, score-0.598]

37 Remark 5 The only difference between the predictive distribution of DTC and SoR is the variance. [sent-176, score-0.25]

38 The predictive variance of DTC is never smaller than that of SoR. [sent-177, score-0.25]

39 As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC) based on the inducing conditionals: n qFITC (f|u) = ∏ p( fi |u) = N i=1 −1 Kf,u Ku,u u, diag[Kf,f −Qf,f ] , and qFITC ( f∗ |u) = p( f∗ |u) . [sent-183, score-0.385]

40 The predictive distribution is qFITC ( f∗ |y) = N Q∗,f (Qf,f + Λ)−1 y, K∗,∗ − Q∗,f (Qf,f + Λ)−1 Qf,∗ (24a) = N K∗,u ΣKu,f Λ−1 y, K∗,∗ − Q∗,∗ + K∗,u ΣKu,∗ , (24b) where we have defined Σ = (Ku,u + Ku,f Λ−1 Kf,u )−1 and Λ = diag[Kf,f − Qf,f + σ2 I ]. [sent-190, score-0.25]

41 There are two options for joint predictions, either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. [sent-193, score-0.249]

42 and the predictive distribution is identical to (24), except for the alternative definition of Λ = blockdiag[Kf,f − Qf,f + σ2 I ]. [sent-211, score-0.25]

43 where the test inputs have to be known before training), and the inducing inputs Xu were chosen to be the test inputs. [sent-217, score-0.619]

44 It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second setting the inducing inputs to be the test inputs. [sent-219, score-0.537]

45 In this paper we use Definition 9 Transduction occurs only if the predictive distribution depends on other test inputs. [sent-230, score-0.305]

46 One view is to see that the test latent function value f∗ is now part of the inducing variables u and therefore has access to the training latent function values. [sent-234, score-0.62]

47 There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995), 2) the test inputs may reveal important information, which should be used during training. [sent-238, score-0.252]

48 Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. [sent-248, score-0.506]

49 Since the inducing variables are connected to all other nodes (see Figure 3) we would expect the approximation to be good at u = f∗ , which is what we care about for predictions, relating to reason 3) above. [sent-250, score-0.382]

50 The main drawback of transduction is that by its nature it doesn’t provide a predictive model in the way inductive models do. [sent-254, score-0.366]

51 In the usual GP model one can do the bulk of the computation 1949 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN involved in the predictive distributions (e. [sent-255, score-0.273]

52 It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. [sent-258, score-0.402]

53 Since in the previous sections, we haven’t assumed anything about u, we can simply augment the set of inducing variables by f∗ (i. [sent-263, score-0.347]

54 have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. [sent-265, score-0.652]

55 Note, the interpretation of the covariance matrix K∗,∗ − Q∗,∗ was “the prior covariance minus the information which u provides about f∗ ”. [sent-267, score-0.275]

56 An equivalent view on augmentation is that the assumption of conditional independence between f∗ and f is dropped. [sent-269, score-0.283]

57 Because the SoR is a finite linear model, and n the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. [sent-272, score-0.25]

58 Here augmentation makes a particularly big difference turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. [sent-274, score-0.542]

59 An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Qui˜ onero-Candela, 2005). [sent-276, score-0.239]

60 n Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. [sent-277, score-0.243]

61 Of course, augmentation doesn’t make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i. [sent-278, score-0.31]

62 But augmentation does make sense for sparse approximations to GPs. [sent-282, score-0.312]

63 In the process view 1950 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION however, it seems clear that one would chose the additional inducing variable to be f∗ , to minimize the effects of the approximations. [sent-286, score-0.372]

64 Let us compute the effective prior for the augmented SoR. [sent-287, score-0.242]

65 Given that f∗ is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior: Z qASoR (f∗ , f) = qSoR (f| f∗ , u) p( f∗ , u) du . [sent-288, score-0.579]

66 Augmented DTC only differs from DTC in the additional presence of f∗ among the inducing variables in the training conditional. [sent-291, score-0.375]

67 We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. [sent-292, score-0.521]

68 We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. [sent-294, score-0.318]

69 The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. [sent-295, score-0.459]

70 The predictive distribution of the augmented SoR is obtained by adding f∗ to u in (20). [sent-296, score-0.389]

71 For each new test case, updating the augmented Σ in the predictive equation (for example (20b) for DTC) implies computing the vector matrix product K∗,f Kf,u with complexity O (nm). [sent-298, score-0.444]

72 This is clearly higher than the O (m) for the mean, and O (m2 ) for the predictive distribution of all the non-augmented methods we have discussed. [sent-299, score-0.25]

73 For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. [sent-301, score-0.271]

74 However, one can see by giving f∗ access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. [sent-302, score-0.582]

75 Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) than standard DTC. [sent-303, score-0.528]

76 Augmentation hasn’t been n compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006). [sent-307, score-0.514]

77 It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. [sent-309, score-0.483]

78 Joint augmentation might bring some computational advantage, but won’t change the scaling: note that augmenting m times at a cost of O (nm) apiece implies the same O (nm2 ) total cost as the jointly augmented BCM. [sent-310, score-0.32]

79 The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. [sent-336, score-0.354]

80 The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. [sent-337, score-0.403]

81 The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. [sent-338, score-0.25]

82 On the Choice of the Inducing Variables We have until now assumed that the inducing inputs Xu were given. [sent-341, score-0.428]

83 In sparse Gaussian processes it has also been suggested to select the inducing inputs Xu from among the training inputs. [sent-344, score-0.555]

84 As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. [sent-347, score-0.605]

85 Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. [sent-348, score-0.37]

86 Which optimality criterion should be used to set the inducing inputs? [sent-351, score-0.347]

87 We have in the above equation explicitly conditioned on the inducing inputs Xu . [sent-354, score-0.428]

88 Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge σ2 I (from the likelihood) to the covariance of effective noise prior on f. [sent-355, score-0.331]

89 The o prior covariance between f∗ and f is taken to be exact, which is inconsistent with the prior covariance on f: Qf,f Kf,∗ . [sent-378, score-0.342]

90 Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. [sent-380, score-0.421]

91 o Ignoring any quibbles about positive definiteness, the predictive distribution of the Nystr¨ m Apo proximation is given by: p( f∗ |y) = N Kf,∗ [Qf,f + σ2 I]−1 y, K∗,∗ − Kf,∗ [Qf,f + σ2 I]−1 Kf,∗ , noise noise (32) but the predictive variance is not guaranteed to be positive. [sent-383, score-0.598]

92 Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. [sent-412, score-0.339]

93 Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM does also suffer from unreasonable predictive variances. [sent-413, score-0.295]

94 Rasmussen and Qui˜ onero-Candela (2005) show that the predictive distributions of RVMs can also be healed by n augmentation, see Section 8. [sent-414, score-0.25]

95 Once the αi have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O (m) for mean and O (m2 ) for the variance. [sent-415, score-0.283]

96 Our approach consists of two steps, first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. [sent-421, score-0.546]

97 For example, the SGPP by Snelson 1955 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN and Ghahramani (2006) contains two ideas, 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. [sent-427, score-0.525]

98 Similarly, the BCM by Tresp (2000) contains two independent ideas 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. [sent-428, score-0.532]

99 Although augmentation has attractive predictive properties, it is computationally expensive. [sent-437, score-0.431]

100 Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. [sent-441, score-0.417]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dtc', 0.367), ('sor', 0.367), ('inducing', 0.347), ('predictive', 0.25), ('fitc', 0.2), ('gp', 0.188), ('augmentation', 0.181), ('xu', 0.168), ('pitc', 0.167), ('augmented', 0.139), ('qf', 0.121), ('transduction', 0.116), ('andela', 0.111), ('nonero', 0.111), ('pproximate', 0.111), ('covariance', 0.104), ('qui', 0.103), ('bcm', 0.1), ('latent', 0.095), ('asmussen', 0.093), ('parse', 0.093), ('rocess', 0.093), ('snelson', 0.093), ('rasmussen', 0.091), ('inputs', 0.081), ('conditional', 0.077), ('egression', 0.075), ('massachussetts', 0.075), ('rvm', 0.075), ('carl', 0.075), ('unifying', 0.069), ('sparse', 0.069), ('prior', 0.067), ('qsor', 0.067), ('sod', 0.067), ('nystr', 0.066), ('conditionals', 0.065), ('gaussian', 0.065), ('edward', 0.063), ('gps', 0.063), ('approximations', 0.062), ('aussian', 0.061), ('kf', 0.06), ('williams', 0.059), ('blockdiag', 0.056), ('qdtc', 0.056), ('test', 0.055), ('remark', 0.052), ('noise', 0.049), ('volker', 0.047), ('joaquin', 0.045), ('ksor', 0.045), ('qfitc', 0.045), ('ui', 0.043), ('wu', 0.043), ('transductive', 0.041), ('diag', 0.041), ('likelihood', 0.039), ('fi', 0.038), ('seeger', 0.038), ('uncertainties', 0.037), ('doesn', 0.037), ('effective', 0.036), ('approximation', 0.035), ('csat', 0.035), ('fully', 0.035), ('implied', 0.034), ('cheaper', 0.033), ('nonsensical', 0.033), ('qpitc', 0.033), ('sgpp', 0.033), ('relevance', 0.033), ('ghahramani', 0.033), ('tresp', 0.033), ('exact', 0.032), ('posterior', 0.032), ('joint', 0.03), ('processes', 0.03), ('du', 0.029), ('training', 0.028), ('lehel', 0.028), ('df', 0.028), ('christopher', 0.027), ('conditionally', 0.026), ('block', 0.026), ('process', 0.025), ('independence', 0.025), ('deterministic', 0.025), ('schwaighofer', 0.025), ('zc', 0.025), ('hyperparameters', 0.024), ('ideas', 0.023), ('bulk', 0.023), ('committee', 0.023), ('turning', 0.023), ('unreasonable', 0.023), ('fn', 0.022), ('degenerate', 0.022), ('adtc', 0.022), ('anton', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression

Author: Joaquin Quiñonero-Candela, Carl Edward Rasmussen

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints. Keywords: Gaussian process, probabilistic regression, sparse approximation, Bayesian committee machine Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and cube respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see Rasmussen and Williams, 2006, chapter 8) of how they relate to each other, and which can be expected to give rise to the best algorithms. In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as “exact inference with an approximated prior”, rather than the existing (ubiquitous) interpretation “approximate inference with the exact prior”. This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them. In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse c 2005 Joaquin Qui˜ onero-Candela and Carl Edward Rasmussen. n ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN algorithms in Sections 4-7. The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10 and conclusions are drawn at the end. 1. Gaussian Processes for Regression Probabilistic regression is usually formulated as follows: given a training set D = {(xi , yi ), i = 1, . . . , n} of n pairs of (vectorial) inputs xi and noisy (real, scalar) outputs yi , compute the predictive distribution of the function values f∗ (or noisy y∗ ) at test locations x∗ . In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f (x) and the observed noisy targets y are given by yi = f (xi ) + εi , where εi ∼ N (0, σ2 ) , noise (1) where σ2 is the variance of the noise. noise Definition 1 A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent1 joint Gaussian distributions. Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior2 over functions, i.e. assumes a priori that function values behave according to p(f|x1 , x2 , . . . , xn ) = N (0, K) , (2) where f = [ f1 , f2 , . . . , fn ] is a vector of latent function values, fi = f (xi ) and K is a covariance matrix, whose entries are given by the covariance function, Ki j = k(xi , x j ). Note that the GP treats the latent function values fi as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves. Remark 2 Note, that to adhere to a strict Bayesian formalism, the GP covariance function,3 which defines the prior, should not depend on the data (although it can depend on additional parameters). As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases. Definition 3 A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues. 1. By consistency is meant simply that the random variables obey the usual rules of marginalization, etc. 2. For notational simplicity we exclusively use zero-mean priors. 3. The covariance function itself shouldn’t depend on the data, though its value at a specific pair of inputs of course will. 1940 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. with squared exponential or RBF covariance function) do not. The prior for a finite m dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n m. Note however, that non-degeneracy on its own doesn’t guarantee the existence of the “right kind” of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example that of Rasmussen and Williams (2006). Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f∗ 4 , and combine it with the likelihood5 p(y|f) using Bayes rule, to obtain the joint posterior p(f, f∗ )p(y|f) . p(y) p(f, f∗ |y) = (3) The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables: p(f∗ |y) = Z 1 p(y) p(f, f∗ |y)df = Z p(y|f) p(f, f∗ ) df , (4) or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian p(f, f∗ ) = N 0, Kf,f K∗,f Kf,∗ K∗,∗ , and p(y|f) = N (f, σ2 I) , noise (5) where K is subscript by the variables between which the covariance is computed (and we use the asterisk ∗ as shorthand for f∗ ) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution p(f∗ |y) = N K∗,f (Kf,f + σ2 I)−1 y, K∗,∗ − K∗,f (Kf,f + σ2 I)−1 Kf,∗ , noise noise (6) see the relevant Gaussian identity in appendix A. The problem with the above expression is that it requires inversion of a matrix of size n × n which requires O (n3 ) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases. 2. A New Unifying View We now seek to modify the joint prior p(f∗ , f) from (5) in ways which will reduce the computational requirements from (6). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u1 , . . . , um ] , which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f∗ ), corresponding to a set of input locations Xu , which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. 4. We will mostly consider a vector of test cases f∗ (rather than a single f∗ ). 5. You may have been expecting the likelihood written as p(y|f∗ , f) but since the likelihood is conditionally independent of everything else given f, this makes no difference. 1941 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as “support points”, “active set” or “pseudo-inputs”. Particular sparse algorithms choose the inducing variables in various different ways; some algorithms chose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. For now consider any arbitrary inducing variables. Due to the consistency of Gaussian processes, we know that we can recover p(f∗ , f) by simply integrating (marginalizing) out u from the joint GP prior p(f∗ , f, u) p(f∗ , f) = Z p(f∗ , f, u) du = Z p(f∗ , f|u) p(u) du, where p(u) = N (0, Ku,u ) . (7) This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f∗ and f are conditionally independent given u, see Figure 1, such that p(f∗ , f) q(f∗ , f) = Z q(f∗ |u) q(f|u) p(u) du . (8) The name inducing variable is motivated by the fact that f and f∗ can only communicate though u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f∗ |u) of the integral in (8). It will be useful for future reference to specify here the exact expressions for the two conditionals training conditional: test conditional: −1 p(f|u) = N (Kf,u Ku,u u, Kf,f − Qf,f ) , −1 p(f∗ |u) = N (K∗,u Ku,u u, K∗,∗ − Q∗,∗ ) , (9a) (9b) −1 where we have introduced the shorthand notation6 Qa,b Ka,u Ku,u Ku,b . We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6) with u playing the role of (noise free) observations. Note that the (positive semi-definite) covariance matrices in (9) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f∗ ). We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior p(y|f) = N (f, σ2 I) , and p(u) = N (0, Ku,u ) . (10) noise 3. The Subset of Data (SoD) Approximation Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn’t fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O (m3 ), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the 6. Note, that Qa,b depends on u although this is not explicit in the notation. 1942 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗ Figure 1: Graphical model of the relation between the inducing variables u, the training latent functions values f = [ f1 , . . . , fn ] and the test function value f∗ . The thick horizontal line represents a set of fully connected nodes. The observations y1 , . . . , yn , y∗ (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (5). Left graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Right graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (8). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f∗ via the inducing variables u. uncertainties, when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations. In Figure 5 top, left we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this paper we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into actually practical algorithms. 4. The Subset of Regressors (SoR) Approximation The Subset of Regressors (SoR) algorithm was given by Silverman (1985), and mentioned again by Wahba et al. (1999). It was then adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression. SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x∗ , the corresponding function value f∗ is given by: f∗ = K∗,u wu , with −1 p(wu ) = N (0, Ku,u ) , (11) where there is one weight associated to each inducing input in Xu . Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, 1943 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN which is Gaussian with zero mean and covariance u = Ku,u wu ⇒ uu = Ku,u wu wu Ku,u = Ku,u . (12) −1 Using the effective prior on u and the fact that wu = Ku,u u we can redefine the SoR model in an equivalent, more intuitive way: −1 f∗ = K∗,u Ku,u u , with u ∼ N (0, Ku,u ) . (13) We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f∗ and u, the approximate conditional distributions in the integral in eq. (8) are given by: −1 qSoR (f|u) = N (Kf,u Ku,u u, 0) , and −1 qSoR (f∗ |u) = N (K∗,u Ku,u u, 0) , (14) with zero conditional covariance, compare to (9). The effective prior implied by the SoR approximation is easily obtained from (8), giving qSoR (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f Q∗,∗ , (15) −1 where we recall Qa,b Ka,u Ku,u Ku,b . A more descriptive name for this method, would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior. The m + 1-th one is a linear combination of the previous. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases. The degeneracy of the prior causes unreasonable predictive distributions. Indeed, the approximate prior over functions is so restrictive, that given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Qui˜ onero-Candela, 2005). Figure 5, top, right panel, illustrates the unreasonable n predictive uncertainties of the SoR approximation on a toy dataset.7 The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with: qSoR (f∗ |y) = N Q∗,f (Qf,f + σ2 I)−1 y, Q∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ , noise noise = N σ K∗,u Σ Ku,f y, K∗,u ΣKu,∗ , −2 (16a) (16b) where we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 . Equation (16a) is readily recognized as the regular prediction equation (6), except that the covariance K has everywhere been replaced by Q, which was already suggested by (15). This corresponds to replacing the covariance function k with −1 kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The new covariance function has rank (at most) m. Thus we have the following 7. Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate computationally costly alternative (more details are given by Qui˜ onero-Candela, 2004, Chapter 3). n 1944 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Remark 4 The SoR approximation is equivalent to exact inference in the degenerate Gaussian −1 process with covariance function kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The equivalent (16b) is computationally cheaper, and with (11) in mind, Σ is the covariance of the posterior on the weights wu . Note that as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O (nm2 ) initially, and O (m) and O (m2 ) per test case for the predictive mean and variance respectively. 5. The Deterministic Training Conditional (DTC) Approximation Taking up ideas already contained in the work of Csat´ and Opper (2002), Seeger et al. (2003) o recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on −1 the projection f = Kf,u Ku,u u: p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, σ2 I) . noise (17) The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. (9b) −1 qDTC (f|u) = N (Kf,u Ku,u u, 0), and qDTC (f∗ |u) = p(f∗ |u) . (18) This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (14). A systematic name for this approximation is the Deterministic Training Conditional (DTC). The fundamental difference with SoR is that DTC uses the exact test conditional (9b) instead of the deterministic relation between f∗ and u of SoR. The joint prior implied by DTC is given by: qDTC (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f K∗,∗ , (19) which is surprisingly similar to the effective prior implied by the SoR approximation (15). The fundamental difference is that under the DTC approximation f∗ has a prior variance of its own, given by K∗,∗ . This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. The predictive distribution is now given by: qDTC (f∗ |y) = N (Q∗,f (Qf,f + σ2 I)−1 y, K∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ noise noise = N σ K∗,u Σ Ku,f y, K∗,∗ − Q∗,∗ + K∗,u ΣK∗,u , −2 (20a) (20b) where again we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 as in (16). The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q∗,∗ from SoR with K∗,∗ (which is larger, since K∗,∗ − Q∗,∗ is positive definite). This added term is the predictive variance of the posterior of f∗ conditioned on u. It grows to the prior variance K∗,∗ as x∗ moves far from the inducing inputs in Xu . 1945 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r f1 f2 fn f∗ Figure 2: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u. Remark 5 The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR. Note, that since the covariances for training cases and test cases are computed differently, see (19), it follows that Remark 6 The DTC approximation does not correspond exactly to a Gaussian process, as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1. The computational complexity has the same order as for SoR. 6. The Fully Independent Training Conditional (FITC) Approximation Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Gaussian Processes using Pseudo-inputs (SGPP). While the DTC is based on the likelihood approximation given by (17), the SGPP proposes a more sophisticated likelihood approximation with a richer covariance p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, diag[Kf,f − Qf,f ] + σ2 I) , noise (21) where diag[A] is a diagonal matrix whose elements match the diagonal of A. As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC) based on the inducing conditionals: n qFITC (f|u) = ∏ p( fi |u) = N i=1 −1 Kf,u Ku,u u, diag[Kf,f −Qf,f ] , and qFITC ( f∗ |u) = p( f∗ |u) . (22) We see that as opposed to SoR and DTC, FITC does not impose a deterministic relation between f and u. Instead of ignoring the variance, FITC proposes an approximation to the training conditional distribution of f given u as a further independence assumption. In addition, the exact test conditional from (9b) is used in (22), although for reasons which will become clear towards the end of this 1946 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION section, we initially consider only a single test case, f∗ . The corresponding graphical model is given in Figure 2. The effective prior implied by the FITC is given by qFITC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ . (23) Note, that the sole difference between the DTC and FITC is that in the top left corner of the implied prior covariance, FITC replaces the approximate covariances of DTC by the exact ones on the diagonal. The predictive distribution is qFITC ( f∗ |y) = N Q∗,f (Qf,f + Λ)−1 y, K∗,∗ − Q∗,f (Qf,f + Λ)−1 Qf,∗ (24a) = N K∗,u ΣKu,f Λ−1 y, K∗,∗ − Q∗,∗ + K∗,u ΣKu,∗ , (24b) where we have defined Σ = (Ku,u + Ku,f Λ−1 Kf,u )−1 and Λ = diag[Kf,f − Qf,f + σ2 I ]. The compunoise tational complexity is identical to that of SoR and DTC. So far we have only considered a single test case. There are two options for joint predictions, either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don’t explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariance are computed differently, and thus this does not correspond to our strict definition of a GP model, but Remark 7 Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function kFIC (xi , x j ) = kSoR (xi , x j ) + δi, j [k(xi , x j ) − kSoR (xi , x j )], where δi, j is Kronecker’s delta. A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is: qFIC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f Q∗,∗ − diag[Q∗,∗ − K∗,∗ ] . (25) 7. The Partially Independent Training Conditional (PITC) Approximation In the previous section we saw how to improve the DTC approximation by approximating the training conditional with an independent distribution, i.e. one with a diagonal covariance matrix. In this section we will further improve the approximation (while remaining computationally attractive) by extending the training conditional to have a block diagonal covariance: −1 qPITC (f|u) = N Kf,u Ku,u u, blockdiag[Kf,f − Qf,f ] , and qPITC (f∗ |u) = p(f∗ |u) . (26) where blockdiag[A] is a block diagonal matrix (where the blocking structure is not explicitly stated). We represent graphically the PITC approximation in Figure 3. Developing this analogously to the FITC approximation from the previous section, we get the joint prior qPITC (f, f∗ ) = N 0, Qf,f − blockdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ 1947 , (27) ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r fI fI fI f∗ 1 2 k Figure 3: Graphical representation of the PITC approximation. The set of latent function values fIi indexed by the the set of indices Ii is fully connected. The PITC differs from FITC (see graph in Fig. 2) in that conditional independence is now between the k groups of training latent function values. This corresponds to the block diagonal approximation to the true training conditional given in (26). and the predictive distribution is identical to (24), except for the alternative definition of Λ = blockdiag[Kf,f − Qf,f + σ2 I ]. An identical expression was obtained by Schwaighofer and Tresp noise (2003, Sect. 3), developing from the original Bayesian committee machine (BCM) by Tresp (2000). The relationship to the FITC was pointed out by Lehel Csat´ . The BCM was originally proposed as o a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs Xu were chosen to be the test inputs. We discuss transduction in detail in the next section. It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second setting the inducing inputs to be the test inputs. These two ideas can be used independently and in Section 8 we propose using the first without the second. The computational complexity of the PITC approximation depends on the blocking structure imposed in (26). A reasonable choice, also recommended by Tresp (2000) may be to choose k = n/m blocks, each of size m × m. The computational complexity remains O (nm2 ). Since in the PITC model the covariance is computed differently for training and test cases Remark 8 The PITC approximation does not correspond exactly to a Gaussian process. This is because computing covariances requires knowing whether points are from the training- or test-set, (27). One can obtain a Gaussian process from the PITC by extending the partial conditional independence assumption to the test conditional, as we did in Remark 7 for the FITC. 8. Transduction and Augmentation The idea of transduction is that one should restrict the goal of learning to prediction on a prespecified set of test cases, rather than trying to learn an entire function (induction) and then evaluate it at the test inputs. There may be no universally agreed upon definition of transduction. In this paper we use Definition 9 Transduction occurs only if the predictive distribution depends on other test inputs. This operational definition excludes models for which there exist an equivalent inductive counterpart. According to this definition, it is irrelevant when the bulk of the computation takes place. 1948 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u f∗   e  ¡     ¡ e   e ¡     ¡     e e   ¡       e ¡   r r r fI fI fI 1 2 k Figure 4: Two views on Augmentation. One view is to see that the test latent function value f∗ is now part of the inducing variables u and therefore has access to the training latent function values. An equivalent view is to consider that we have dropped the assumption of conditional independence between f∗ and the training latent function values. Even if f∗ has now direct access to each of the training fi , these still need to go through u to talk to each other if they fall in conditionally independent blocks. We have in this figure decided to recycle the graph for PITC from Figure 3 to show that all approximations we have presented can be augmented, irrespective of what the approximation for the training conditional is. There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995), 2) the test inputs may reveal important information, which should be used during training. This motivation drives models in semi-supervised learning (studied mostly in the context of classification) and 3) for approximate algorithms one may be able to limit the discrepancies of the approximation at the test points. For exact GP models it seems that the first reason doesn’t really apply. If you make predictions at the test points that are consistent with a GP, then it is trivial inside the GP framework to extend these to any other input points, and in effect we have done induction. The second reason seems more interesting. However, in a standard GP setting, it is a consequence of the consistency property, see Remark 2, that predictions at one test input are independent of the location of any other test inputs. Therefore transduction can not be married with exact GPs: Remark 10 Transduction can not occur in exact Gaussian process models. Whereas this holds for the usual setting of GPs, it could be different in non-standard situations where e.g. the covariance function depends on the empirical input densities. Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. The BCM from the previous section, where Xu = X∗ (where X∗ are the test inputs) is an example of this. Since the inducing variables are connected to all other nodes (see Figure 3) we would expect the approximation to be good at u = f∗ , which is what we care about for predictions, relating to reason 3) above. While this reasoning is sound, it is not necessarily a sufficient consideration for getting a good model. The model has to be able to simultaneously explain the training targets as well and if the choice of u makes this difficult, the posterior at the points of interest may be distorted. Thus, the choice of u should be governed by the ability to model the conditional of the latents given the inputs, and not solely by the density of the (test) inputs. The main drawback of transduction is that by its nature it doesn’t provide a predictive model in the way inductive models do. In the usual GP model one can do the bulk of the computation 1949 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN involved in the predictive distributions (e.g. matrix inversion) before seeing the test cases, enabling fast computation of test predictions. It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. The quality of the BCM approximation depends then on the particular location of the test inputs, upon which one usually does not have any control. We now see that there may be a better method, eliminating the drawback of transduction, namely use the PITC approximation, but choose the u’s carefully (see Section 9), don’t just use the test set. 8.1 Augmentation An idea closely related to transduction, but not covered by our definition, is augmentation, which in contrast to transduction is done individually for each test case. Since in the previous sections, we haven’t assumed anything about u, we can simply augment the set of inducing variables by f∗ (i.e. have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. Let’s first investigate the consequences for the test conditional from (9b). Note, the interpretation of the covariance matrix K∗,∗ − Q∗,∗ was “the prior covariance minus the information which u provides about f∗ ”. It is clear that the augmented u (with f∗ ) provides all possible information about f∗ , and consequently Q∗,∗ = K∗,∗ . An equivalent view on augmentation is that the assumption of conditional independence between f∗ and f is dropped. This is seen trivially by adding edges between f∗ and the fi in the graphical model, Figure 4. Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Qui˜ onero-Candela (2004). Because the SoR is a finite linear model, and n the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. For example, when making predictions far away from the center of any basis function, all basis functions have insignificant magnitudes, and the prediction (averaged over the posterior) will be close to zero, with very small error-bars; this is the opposite of the desired behaviour, where we would expect the error-bars to grow as we move away from the training cases. Here augmentation makes a particularly big difference turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. Compare the nonaugmented to the augmented SoR in Figure 5. An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Qui˜ onero-Candela, 2005). n Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn’t make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere). Remark 11 A full non-degenerate Gaussian process cannot be augmented, since the corresponding f∗ would already be connected to all other variables in the graphical model. But augmentation does make sense for sparse approximations to GPs. The more general process view on augmentation has several advantages over the basis function view. It is not completely clear from the basis function view, which basis function should be used for augmentation. For example, Rasmussen and Qui˜ onero-Candela (2005) successfully apply augn mentation using basis functions that have a zero contribution at the test location! In the process view 1950 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION however, it seems clear that one would chose the additional inducing variable to be f∗ , to minimize the effects of the approximations. Let us compute the effective prior for the augmented SoR. Given that f∗ is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior: Z qASoR (f∗ , f) = qSoR (f| f∗ , u) p( f∗ , u) du . (28) It is interesting to notice that this is also the effective prior that would result from augmenting the DTC approximation, since qSoR (f| f∗ , u) = qDTC (f| f∗ , u). Remark 12 Augmented SoR (ASoR) is equivalent to augmented DTC (ADTC). Augmented DTC only differs from DTC in the additional presence of f∗ among the inducing variables in the training conditional. We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. Therefore Remark 13 DTC is a less accurate (but cheaper) approximation than augmented SoR. We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. The predictive distribution of the augmented SoR is obtained by adding f∗ to u in (20). Prediction with an augmented sparse model comes at a higher computational cost, since now f∗ directly interacts with all of f and not just with u. For each new test case, updating the augmented Σ in the predictive equation (for example (20b) for DTC) implies computing the vector matrix product K∗,f Kf,u with complexity O (nm). This is clearly higher than the O (m) for the mean, and O (m2 ) for the predictive distribution of all the non-augmented methods we have discussed. Augmentation seems to be only really necessary for methods that make a severe approximation to the test conditional, like the SoR. For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. However, one can see by giving f∗ access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) than standard DTC. Note however that in the figure we have purposely chosen a too short lengthscale to enhance visualization. Quantitatively, this superiority was experimentally assessed by Qui˜ onero-Candela (2004, Table 3.1). Augmentation hasn’t been n compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006). Transductive methods like the BCM can be seen as joint augmentation, and one could potentially use it for any of the methods presented. It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. Joint augmentation might bring some computational advantage, but won’t change the scaling: note that augmenting m times at a cost of O (nm) apiece implies the same O (nm2 ) total cost as the jointly augmented BCM. 1951 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN SoD SoR 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 DTC 0 5 10 15 5 10 15 5 10 15 ASoR/ADTC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 FITC 0 PITC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 0 Figure 5: Toy example with identical covariance function and hyperparameters. The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. Augmented DTC (ADTC) is equivalent to augmented SoR (ASoR), see Remark 12. 1952 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION 9. On the Choice of the Inducing Variables We have until now assumed that the inducing inputs Xu were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (Cortes and Vapnik, 1995). In sparse Gaussian processes it has also been suggested to select the inducing inputs Xu from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csat´ and o Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and probably more. As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And perhaps the less restrictive choice can lead to better performance in very sparse models. Which optimality criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment which would involve defining priors on Xu , one could maximize the marginal likelihood (also called the evidence) with respect to Xu , an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs q(y|Xu ) = ZZ p(y|f) q(f|u) p(u|Xu )du df = Z p(y|f) q(f|Xu )df , (29) which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs Xu . Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge σ2 I (from the likelihood) to the covariance of effective noise prior on f. Using the appropriate definitions of Λ, the log marginal likelihood becomes 1 log q(y|Xu ) = − 2 log |Qf,f + Λ| − 1 y (Qf,f + Λ)−1 y − n log(2π) , 2 2 (30) where ΛSoR = ΛDTC = σ2 I, ΛFITC = diag[Kf,f − Qf,f ] + σ2 I, and ΛPITC = blockdiag[Kf,f − noise noise Qf,f ] + σ2 I. The computational cost of the marginal likelihood is O (nm2 ) for all methods, that of noise its gradient with respect to one element of Xu is O (nm). This of course implies that the complexity of computing the gradient wrt. to the whole of Xu is O (dnm2 ), where d is the dimension of the input space. It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis see Qui˜ onero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2). n The marginal likelihood has traditionally been used to learn the hyperparameters of GPs in the non fully Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning Xu it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization, and there is no need to interleave optimization of u with learning of the hyperparameters as it has been proposed for example by Seeger et al. (2003). 1953 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN 10. Other Methods In this section we briefly mention two approximations which don’t fit in our unifying scheme, since one doesn’t correspond to a proper probabilistic model, and the other one uses a particular construction for the covariance function, rather than allowing any general covariance function. 10.1 The Nystr¨ m Approximation o The Nystr¨ m Approximation for speeding up GP regression was originally proposed by Williams o and Seeger (2001), and then questioned by Williams et al. (2002). Like SoR and DTC, the Nystr¨ m o Approximation for GP regression approximates the prior covariance of f by Qf,f . However, unlike these methods, the Nystr¨ m Approximation is not based on a generative probabilistic model. The o prior covariance between f∗ and f is taken to be exact, which is inconsistent with the prior covariance on f: Qf,f Kf,∗ . (31) q(f, f∗ ) = N 0, K∗,f K∗,∗ As a result we cannot derive this method from our unifying framework, nor represent it with a graphical model. Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. Notice that replacing Kf,∗ by Qf,∗ in (31) is enough to make the prior covariance positive definite, and one obtains the DTC approximation. Remark 14 The Nystr¨ m Approximation does not correspond to a well-formed probabilistic model. o Ignoring any quibbles about positive definiteness, the predictive distribution of the Nystr¨ m Apo proximation is given by: p( f∗ |y) = N Kf,∗ [Qf,f + σ2 I]−1 y, K∗,∗ − Kf,∗ [Qf,f + σ2 I]−1 Kf,∗ , noise noise (32) but the predictive variance is not guaranteed to be positive. The computational cost is O (nm2 ). 10.2 The Relevance Vector Machine The relevance vector machine, introduced by Tipping (2001), is a finite linear model with an independent Gaussian prior imposed on the weights. For any input x∗ , the corresponding function output is given by: f∗ = φ∗ w , with p(w|A) = N (0, A) , (33) where φ∗ = [φ1 (x), . . . , φm (x)] is the (row) vector of responses of the m basis functions, and A = diag(α1 , . . . , αm ) is the diagonal matrix of joint prior precisions (inverse variances) of the weights. The αi are learnt by maximizing the RVM evidence (obtained by also assuming Gaussian additive iid. noise, see (1)), and for the typical case of rich enough sets of basis functions many of the precisions go to infinity effectively pruning out the corresponding weights (for a very interesting analysis see Wipf et al., 2004). The RVM is thus a sparse method and the surviving basis functions are called relevance vectors. Note that since the RVM is a finite linear model with Gaussian priors on the weights, it can be seen as a Gaussian process: Remark 15 The RVM is equivalent to a degenerate Gaussian process with covariance function kRVM (xi , x j ) = φi A−1 φ j = ∑m α−1 φk (xi ) φk (x j ), k=1 k 1954 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION q(f∗ |u) q(f|u) GP exact exact SoR determ. determ. DTC exact determ. FITC (exact) fully indep. PITC exact partially indep. Method joint prior covariance Kf,f Kf,∗ K∗,f K∗,∗ Qf,f Qf,∗ Q∗,f Q∗,∗ Qf,f Qf,∗ Q∗,f K∗,∗ Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ Qf,f − blokdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ GP? √ √ √ ( ) Table 1: Summary of the way approximations are built. All these methods are detailed in the previous sections. The initial cost and that of the mean and variance per test case are respectively n2 , n and n2 for the exact GP, and nm2 , m and m2 for all other methods. The “GP?” column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7. as was also pointed out by Tipping (2001, eq. (59)). Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM does also suffer from unreasonable predictive variances. Rasmussen and Qui˜ onero-Candela (2005) show that the predictive distributions of RVMs can also be healed by n augmentation, see Section 8. Once the αi have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O (m) for mean and O (m2 ) for the variance. RVMs are often used with radial basis functions centered on the training inputs. One potentially interesting extension to the RVM would be to learn the locations of the centers of the basis functions, in the same way as proposed by Snelson and Ghahramani (2006) for the FITC approximation, see Section 6. This is a curious reminiscence of learning the centers in RBF Networks. 11. Conclusions We have provided a unifying framework for sparse approximations to Gaussian processes for regression. Our approach consists of two steps, first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. We recover all existing sparse methods by making further simplifications of the covariances of the training and test conditionals, see Table 1 for a summary. Previous methods were presented based on different approximation paradigms (e.g. likelihood approximations, projection methods, matrix approximations, minimization of Kullback-Leibler divergence, etc), making direct comparison difficult. Under our unifying view we deconstruct methods, making it clear which building blocks they are based upon. For example, the SGPP by Snelson 1955 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN and Ghahramani (2006) contains two ideas, 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. Similarly, the BCM by Tresp (2000) contains two independent ideas 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. Finally we note that although all three ideas of 1) transductively setting u = f∗ , 2) augmentation and 3) continuous optimization of Xu have been proposed in very specific settings, in fact they are completely general ideas, which can be applied to any of the approximation schemes considered. We have ranked the approximation according to how close they are to the corresponding full GP. However, the performance in practical situations may not always follow this theoretical ranking since the approximations might exhibit properties (not present in the full GP) which may be particularly suitable for specific datasets. This may make the interpretation of empirical comparisons challenging. A further complication arises when adding the necessary heuristics for turning the theoretical constructs into practical algorithms. We have not described full algorithms in this paper, but are currently working on a detailed empirical study (in preparation, see also Rasmussen and Williams, 2006, chapter 8). We note that the order of the computational complexity is identical for all the methods considered, O (nm2 ). This highlights that there is no computational excuse for using gross approximations, such as assuming deterministic relationships, in particular one should probably think twice before using SoR or even DTC. Although augmentation has attractive predictive properties, it is computationally expensive. It remains unclear whether augmentation could be beneficial on a fixed computational budget. We have only considered the simpler case of regression in this paper, but sparseness is also commonly sought in classification settings. It should not be difficult to cast probabilistic approximation methods such as Expectation Propagation (EP) or the Laplace method (for a comparison, see Kuss and Rasmussen, 2005) into our unifying framework. Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. This would correspond to a non-transductive version of the BCM. We would evade the necessity of knowing the test set before doing the bulk of the computation, and we could hope to supersede the superior performance reported by Snelson and Ghahramani (2006) for very sparse approximations. Acknowledgments Thanks to Neil Lawrence for arranging the 2005 Gaussian Process Round Table meeting in Sheffield, which provided much inspiration to this paper. Special thanks to Olivier Chapelle, Lehel Csat´ , o Zoubin Ghahramani, Matthias Seeger, Ed Snelson and Chris Williams for helpful discussions, and to three anonymous reviewers. Both authors were supported by the German Research Council (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. 1956 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Appendix A. Gaussian and Matrix Identities In this appendix we provide identities used to manipulate matrices and Gaussian distributions throughout the paper. Let x and y be jointly Gaussian x y µx µy ∼ N , A C C B , (34) then the marginal and the conditional are given by x ∼ N (µx , A) , and x|y ∼ N µx +C B−1 (y − µy ), A −C B−1C (35) Also, the product of a Gaussian in x with a Gaussian in a linear projection P x is again a Gaussian, although unnormalized N (x|a, A) N (P x|b, B) = zc N (x|c,C) , (36) where C = A−1 + P B−1 P −1 c = C A−1 a + P B−1 b . , The normalizing constant zc is gaussian in the means a and b of the two Gaussians: m 1 1 zc = (2 π)− 2 |B + P A P |− 2 exp − 2 (b − P a) B+PAP −1 (b − P a) . (37) The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula states that: (Z +UWV )−1 = Z −1 − Z −1U(W −1 +V Z −1U)−1V Z −1 , (38) assuming the relevant inverses all exist. Here Z is n × n, W is m × m and U and V are both of size n × m; consequently if Z −1 is known, and a low rank (ie. m < n) perturbation are made to Z as in left hand side of eq. (38), considerable speedup can be achieved. References Corinna Cortes and Vladimir Vapnik. Support-vector network. Machine Learning, 20(3):273–297, 1995. Lehel Csat´ and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3): o 641–669, 2002. Sathiya Keerthi and Wei Chu. A Matching Pursuit approach to sparse Gaussian process regression. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, Cambridge, Massachussetts, 2006. The MIT Press. Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, pages 1679–1704, 2005. Joaquin Qui˜ onero-Candela. Learning with Uncertainty – Gaussian Processes and Relevance Vecn tor Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004. Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002. 1957 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN Carl Edward Rasmussen and Joaquin Qui˜ onero-Candela. Healing the relevance vector machine by n augmentation. In International Conference on Machine Learning, 2005. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT press, 2006. Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 953–960, Cambridge, Massachussetts, 2003. The MIT Press. Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003. Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Roy. Stat. Soc. B, 47(1):1–52, 1985. (with discussion). Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625, Cambridge, Massachussetts, 2001. The MIT Press. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems o 18, Cambridge, Massachussetts, 2006. The MIT Press. Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001. Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995. Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, and Barbara Klein. The biasvariance tradeoff and the randomized GACV. In Michael S. Kerns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 620–626, Cambridge, Massachussetts, 1999. The MIT Press. Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, Massachussetts, 1996. The MIT Press. Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nystr¨ m method for Gaussiam process prediction. Technical report, University o of Edinburgh, Edinburgh, Scotland, 2002. 1958 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Christopher K. I. Williams and Mathias Seeger. Using the Nystr¨ m method to speed up kernel o machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, Massachussetts, 2001. The MIT Press. David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16, Cambridge, Massachussetts, 2004. The MIT Press. 1959

2 0.18372932 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

Author: Malte Kuss, Carl Edward Rasmussen

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC

3 0.10219465 36 jmlr-2005-Gaussian Processes for Ordinal Regression

Author: Wei Chu, Zoubin Ghahramani

Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection

4 0.081858262 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

Author: Neil Lawrence

Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation

5 0.081249915 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

Author: Rie Kubota Ando, Tong Zhang

Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.

6 0.045680705 12 jmlr-2005-An MDP-Based Recommender System

7 0.038328528 32 jmlr-2005-Expectation Consistent Approximate Inference

8 0.037434466 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

9 0.036964178 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

10 0.034897409 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

11 0.033050381 54 jmlr-2005-Managing Diversity in Regression Ensembles

12 0.03287638 71 jmlr-2005-Variational Message Passing

13 0.031165896 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

14 0.030089146 39 jmlr-2005-Information Bottleneck for Gaussian Variables

15 0.027675696 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

16 0.026702372 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

17 0.025988841 61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers

18 0.024959084 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

19 0.02425169 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

20 0.024065914 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, -0.137), (2, -0.088), (3, -0.129), (4, 0.303), (5, 0.019), (6, 0.204), (7, 0.034), (8, 0.01), (9, 0.332), (10, 0.03), (11, 0.08), (12, 0.01), (13, -0.044), (14, -0.136), (15, -0.047), (16, 0.027), (17, -0.031), (18, 0.152), (19, -0.156), (20, -0.053), (21, 0.056), (22, 0.016), (23, 0.061), (24, -0.014), (25, -0.048), (26, 0.013), (27, 0.093), (28, -0.001), (29, 0.051), (30, 0.047), (31, 0.004), (32, 0.015), (33, -0.104), (34, 0.012), (35, -0.093), (36, 0.033), (37, 0.082), (38, -0.151), (39, -0.112), (40, 0.091), (41, 0.21), (42, 0.035), (43, 0.107), (44, -0.095), (45, -0.072), (46, 0.213), (47, -0.017), (48, 0.021), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95686388 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression

Author: Joaquin Quiñonero-Candela, Carl Edward Rasmussen

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints. Keywords: Gaussian process, probabilistic regression, sparse approximation, Bayesian committee machine Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and cube respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see Rasmussen and Williams, 2006, chapter 8) of how they relate to each other, and which can be expected to give rise to the best algorithms. In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as “exact inference with an approximated prior”, rather than the existing (ubiquitous) interpretation “approximate inference with the exact prior”. This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them. In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse c 2005 Joaquin Qui˜ onero-Candela and Carl Edward Rasmussen. n ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN algorithms in Sections 4-7. The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10 and conclusions are drawn at the end. 1. Gaussian Processes for Regression Probabilistic regression is usually formulated as follows: given a training set D = {(xi , yi ), i = 1, . . . , n} of n pairs of (vectorial) inputs xi and noisy (real, scalar) outputs yi , compute the predictive distribution of the function values f∗ (or noisy y∗ ) at test locations x∗ . In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f (x) and the observed noisy targets y are given by yi = f (xi ) + εi , where εi ∼ N (0, σ2 ) , noise (1) where σ2 is the variance of the noise. noise Definition 1 A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent1 joint Gaussian distributions. Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior2 over functions, i.e. assumes a priori that function values behave according to p(f|x1 , x2 , . . . , xn ) = N (0, K) , (2) where f = [ f1 , f2 , . . . , fn ] is a vector of latent function values, fi = f (xi ) and K is a covariance matrix, whose entries are given by the covariance function, Ki j = k(xi , x j ). Note that the GP treats the latent function values fi as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves. Remark 2 Note, that to adhere to a strict Bayesian formalism, the GP covariance function,3 which defines the prior, should not depend on the data (although it can depend on additional parameters). As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases. Definition 3 A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues. 1. By consistency is meant simply that the random variables obey the usual rules of marginalization, etc. 2. For notational simplicity we exclusively use zero-mean priors. 3. The covariance function itself shouldn’t depend on the data, though its value at a specific pair of inputs of course will. 1940 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. with squared exponential or RBF covariance function) do not. The prior for a finite m dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n m. Note however, that non-degeneracy on its own doesn’t guarantee the existence of the “right kind” of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example that of Rasmussen and Williams (2006). Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f∗ 4 , and combine it with the likelihood5 p(y|f) using Bayes rule, to obtain the joint posterior p(f, f∗ )p(y|f) . p(y) p(f, f∗ |y) = (3) The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables: p(f∗ |y) = Z 1 p(y) p(f, f∗ |y)df = Z p(y|f) p(f, f∗ ) df , (4) or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian p(f, f∗ ) = N 0, Kf,f K∗,f Kf,∗ K∗,∗ , and p(y|f) = N (f, σ2 I) , noise (5) where K is subscript by the variables between which the covariance is computed (and we use the asterisk ∗ as shorthand for f∗ ) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution p(f∗ |y) = N K∗,f (Kf,f + σ2 I)−1 y, K∗,∗ − K∗,f (Kf,f + σ2 I)−1 Kf,∗ , noise noise (6) see the relevant Gaussian identity in appendix A. The problem with the above expression is that it requires inversion of a matrix of size n × n which requires O (n3 ) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases. 2. A New Unifying View We now seek to modify the joint prior p(f∗ , f) from (5) in ways which will reduce the computational requirements from (6). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u1 , . . . , um ] , which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f∗ ), corresponding to a set of input locations Xu , which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. 4. We will mostly consider a vector of test cases f∗ (rather than a single f∗ ). 5. You may have been expecting the likelihood written as p(y|f∗ , f) but since the likelihood is conditionally independent of everything else given f, this makes no difference. 1941 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as “support points”, “active set” or “pseudo-inputs”. Particular sparse algorithms choose the inducing variables in various different ways; some algorithms chose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. For now consider any arbitrary inducing variables. Due to the consistency of Gaussian processes, we know that we can recover p(f∗ , f) by simply integrating (marginalizing) out u from the joint GP prior p(f∗ , f, u) p(f∗ , f) = Z p(f∗ , f, u) du = Z p(f∗ , f|u) p(u) du, where p(u) = N (0, Ku,u ) . (7) This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f∗ and f are conditionally independent given u, see Figure 1, such that p(f∗ , f) q(f∗ , f) = Z q(f∗ |u) q(f|u) p(u) du . (8) The name inducing variable is motivated by the fact that f and f∗ can only communicate though u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f∗ |u) of the integral in (8). It will be useful for future reference to specify here the exact expressions for the two conditionals training conditional: test conditional: −1 p(f|u) = N (Kf,u Ku,u u, Kf,f − Qf,f ) , −1 p(f∗ |u) = N (K∗,u Ku,u u, K∗,∗ − Q∗,∗ ) , (9a) (9b) −1 where we have introduced the shorthand notation6 Qa,b Ka,u Ku,u Ku,b . We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6) with u playing the role of (noise free) observations. Note that the (positive semi-definite) covariance matrices in (9) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f∗ ). We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior p(y|f) = N (f, σ2 I) , and p(u) = N (0, Ku,u ) . (10) noise 3. The Subset of Data (SoD) Approximation Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn’t fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O (m3 ), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the 6. Note, that Qa,b depends on u although this is not explicit in the notation. 1942 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗ Figure 1: Graphical model of the relation between the inducing variables u, the training latent functions values f = [ f1 , . . . , fn ] and the test function value f∗ . The thick horizontal line represents a set of fully connected nodes. The observations y1 , . . . , yn , y∗ (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (5). Left graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Right graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (8). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f∗ via the inducing variables u. uncertainties, when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations. In Figure 5 top, left we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this paper we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into actually practical algorithms. 4. The Subset of Regressors (SoR) Approximation The Subset of Regressors (SoR) algorithm was given by Silverman (1985), and mentioned again by Wahba et al. (1999). It was then adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression. SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x∗ , the corresponding function value f∗ is given by: f∗ = K∗,u wu , with −1 p(wu ) = N (0, Ku,u ) , (11) where there is one weight associated to each inducing input in Xu . Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, 1943 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN which is Gaussian with zero mean and covariance u = Ku,u wu ⇒ uu = Ku,u wu wu Ku,u = Ku,u . (12) −1 Using the effective prior on u and the fact that wu = Ku,u u we can redefine the SoR model in an equivalent, more intuitive way: −1 f∗ = K∗,u Ku,u u , with u ∼ N (0, Ku,u ) . (13) We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f∗ and u, the approximate conditional distributions in the integral in eq. (8) are given by: −1 qSoR (f|u) = N (Kf,u Ku,u u, 0) , and −1 qSoR (f∗ |u) = N (K∗,u Ku,u u, 0) , (14) with zero conditional covariance, compare to (9). The effective prior implied by the SoR approximation is easily obtained from (8), giving qSoR (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f Q∗,∗ , (15) −1 where we recall Qa,b Ka,u Ku,u Ku,b . A more descriptive name for this method, would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior. The m + 1-th one is a linear combination of the previous. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases. The degeneracy of the prior causes unreasonable predictive distributions. Indeed, the approximate prior over functions is so restrictive, that given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Qui˜ onero-Candela, 2005). Figure 5, top, right panel, illustrates the unreasonable n predictive uncertainties of the SoR approximation on a toy dataset.7 The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with: qSoR (f∗ |y) = N Q∗,f (Qf,f + σ2 I)−1 y, Q∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ , noise noise = N σ K∗,u Σ Ku,f y, K∗,u ΣKu,∗ , −2 (16a) (16b) where we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 . Equation (16a) is readily recognized as the regular prediction equation (6), except that the covariance K has everywhere been replaced by Q, which was already suggested by (15). This corresponds to replacing the covariance function k with −1 kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The new covariance function has rank (at most) m. Thus we have the following 7. Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate computationally costly alternative (more details are given by Qui˜ onero-Candela, 2004, Chapter 3). n 1944 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Remark 4 The SoR approximation is equivalent to exact inference in the degenerate Gaussian −1 process with covariance function kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The equivalent (16b) is computationally cheaper, and with (11) in mind, Σ is the covariance of the posterior on the weights wu . Note that as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O (nm2 ) initially, and O (m) and O (m2 ) per test case for the predictive mean and variance respectively. 5. The Deterministic Training Conditional (DTC) Approximation Taking up ideas already contained in the work of Csat´ and Opper (2002), Seeger et al. (2003) o recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on −1 the projection f = Kf,u Ku,u u: p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, σ2 I) . noise (17) The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. (9b) −1 qDTC (f|u) = N (Kf,u Ku,u u, 0), and qDTC (f∗ |u) = p(f∗ |u) . (18) This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (14). A systematic name for this approximation is the Deterministic Training Conditional (DTC). The fundamental difference with SoR is that DTC uses the exact test conditional (9b) instead of the deterministic relation between f∗ and u of SoR. The joint prior implied by DTC is given by: qDTC (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f K∗,∗ , (19) which is surprisingly similar to the effective prior implied by the SoR approximation (15). The fundamental difference is that under the DTC approximation f∗ has a prior variance of its own, given by K∗,∗ . This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. The predictive distribution is now given by: qDTC (f∗ |y) = N (Q∗,f (Qf,f + σ2 I)−1 y, K∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ noise noise = N σ K∗,u Σ Ku,f y, K∗,∗ − Q∗,∗ + K∗,u ΣK∗,u , −2 (20a) (20b) where again we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 as in (16). The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q∗,∗ from SoR with K∗,∗ (which is larger, since K∗,∗ − Q∗,∗ is positive definite). This added term is the predictive variance of the posterior of f∗ conditioned on u. It grows to the prior variance K∗,∗ as x∗ moves far from the inducing inputs in Xu . 1945 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r f1 f2 fn f∗ Figure 2: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u. Remark 5 The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR. Note, that since the covariances for training cases and test cases are computed differently, see (19), it follows that Remark 6 The DTC approximation does not correspond exactly to a Gaussian process, as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1. The computational complexity has the same order as for SoR. 6. The Fully Independent Training Conditional (FITC) Approximation Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Gaussian Processes using Pseudo-inputs (SGPP). While the DTC is based on the likelihood approximation given by (17), the SGPP proposes a more sophisticated likelihood approximation with a richer covariance p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, diag[Kf,f − Qf,f ] + σ2 I) , noise (21) where diag[A] is a diagonal matrix whose elements match the diagonal of A. As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC) based on the inducing conditionals: n qFITC (f|u) = ∏ p( fi |u) = N i=1 −1 Kf,u Ku,u u, diag[Kf,f −Qf,f ] , and qFITC ( f∗ |u) = p( f∗ |u) . (22) We see that as opposed to SoR and DTC, FITC does not impose a deterministic relation between f and u. Instead of ignoring the variance, FITC proposes an approximation to the training conditional distribution of f given u as a further independence assumption. In addition, the exact test conditional from (9b) is used in (22), although for reasons which will become clear towards the end of this 1946 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION section, we initially consider only a single test case, f∗ . The corresponding graphical model is given in Figure 2. The effective prior implied by the FITC is given by qFITC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ . (23) Note, that the sole difference between the DTC and FITC is that in the top left corner of the implied prior covariance, FITC replaces the approximate covariances of DTC by the exact ones on the diagonal. The predictive distribution is qFITC ( f∗ |y) = N Q∗,f (Qf,f + Λ)−1 y, K∗,∗ − Q∗,f (Qf,f + Λ)−1 Qf,∗ (24a) = N K∗,u ΣKu,f Λ−1 y, K∗,∗ − Q∗,∗ + K∗,u ΣKu,∗ , (24b) where we have defined Σ = (Ku,u + Ku,f Λ−1 Kf,u )−1 and Λ = diag[Kf,f − Qf,f + σ2 I ]. The compunoise tational complexity is identical to that of SoR and DTC. So far we have only considered a single test case. There are two options for joint predictions, either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don’t explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariance are computed differently, and thus this does not correspond to our strict definition of a GP model, but Remark 7 Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function kFIC (xi , x j ) = kSoR (xi , x j ) + δi, j [k(xi , x j ) − kSoR (xi , x j )], where δi, j is Kronecker’s delta. A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is: qFIC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f Q∗,∗ − diag[Q∗,∗ − K∗,∗ ] . (25) 7. The Partially Independent Training Conditional (PITC) Approximation In the previous section we saw how to improve the DTC approximation by approximating the training conditional with an independent distribution, i.e. one with a diagonal covariance matrix. In this section we will further improve the approximation (while remaining computationally attractive) by extending the training conditional to have a block diagonal covariance: −1 qPITC (f|u) = N Kf,u Ku,u u, blockdiag[Kf,f − Qf,f ] , and qPITC (f∗ |u) = p(f∗ |u) . (26) where blockdiag[A] is a block diagonal matrix (where the blocking structure is not explicitly stated). We represent graphically the PITC approximation in Figure 3. Developing this analogously to the FITC approximation from the previous section, we get the joint prior qPITC (f, f∗ ) = N 0, Qf,f − blockdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ 1947 , (27) ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r fI fI fI f∗ 1 2 k Figure 3: Graphical representation of the PITC approximation. The set of latent function values fIi indexed by the the set of indices Ii is fully connected. The PITC differs from FITC (see graph in Fig. 2) in that conditional independence is now between the k groups of training latent function values. This corresponds to the block diagonal approximation to the true training conditional given in (26). and the predictive distribution is identical to (24), except for the alternative definition of Λ = blockdiag[Kf,f − Qf,f + σ2 I ]. An identical expression was obtained by Schwaighofer and Tresp noise (2003, Sect. 3), developing from the original Bayesian committee machine (BCM) by Tresp (2000). The relationship to the FITC was pointed out by Lehel Csat´ . The BCM was originally proposed as o a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs Xu were chosen to be the test inputs. We discuss transduction in detail in the next section. It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second setting the inducing inputs to be the test inputs. These two ideas can be used independently and in Section 8 we propose using the first without the second. The computational complexity of the PITC approximation depends on the blocking structure imposed in (26). A reasonable choice, also recommended by Tresp (2000) may be to choose k = n/m blocks, each of size m × m. The computational complexity remains O (nm2 ). Since in the PITC model the covariance is computed differently for training and test cases Remark 8 The PITC approximation does not correspond exactly to a Gaussian process. This is because computing covariances requires knowing whether points are from the training- or test-set, (27). One can obtain a Gaussian process from the PITC by extending the partial conditional independence assumption to the test conditional, as we did in Remark 7 for the FITC. 8. Transduction and Augmentation The idea of transduction is that one should restrict the goal of learning to prediction on a prespecified set of test cases, rather than trying to learn an entire function (induction) and then evaluate it at the test inputs. There may be no universally agreed upon definition of transduction. In this paper we use Definition 9 Transduction occurs only if the predictive distribution depends on other test inputs. This operational definition excludes models for which there exist an equivalent inductive counterpart. According to this definition, it is irrelevant when the bulk of the computation takes place. 1948 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u f∗   e  ¡     ¡ e   e ¡     ¡     e e   ¡       e ¡   r r r fI fI fI 1 2 k Figure 4: Two views on Augmentation. One view is to see that the test latent function value f∗ is now part of the inducing variables u and therefore has access to the training latent function values. An equivalent view is to consider that we have dropped the assumption of conditional independence between f∗ and the training latent function values. Even if f∗ has now direct access to each of the training fi , these still need to go through u to talk to each other if they fall in conditionally independent blocks. We have in this figure decided to recycle the graph for PITC from Figure 3 to show that all approximations we have presented can be augmented, irrespective of what the approximation for the training conditional is. There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995), 2) the test inputs may reveal important information, which should be used during training. This motivation drives models in semi-supervised learning (studied mostly in the context of classification) and 3) for approximate algorithms one may be able to limit the discrepancies of the approximation at the test points. For exact GP models it seems that the first reason doesn’t really apply. If you make predictions at the test points that are consistent with a GP, then it is trivial inside the GP framework to extend these to any other input points, and in effect we have done induction. The second reason seems more interesting. However, in a standard GP setting, it is a consequence of the consistency property, see Remark 2, that predictions at one test input are independent of the location of any other test inputs. Therefore transduction can not be married with exact GPs: Remark 10 Transduction can not occur in exact Gaussian process models. Whereas this holds for the usual setting of GPs, it could be different in non-standard situations where e.g. the covariance function depends on the empirical input densities. Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. The BCM from the previous section, where Xu = X∗ (where X∗ are the test inputs) is an example of this. Since the inducing variables are connected to all other nodes (see Figure 3) we would expect the approximation to be good at u = f∗ , which is what we care about for predictions, relating to reason 3) above. While this reasoning is sound, it is not necessarily a sufficient consideration for getting a good model. The model has to be able to simultaneously explain the training targets as well and if the choice of u makes this difficult, the posterior at the points of interest may be distorted. Thus, the choice of u should be governed by the ability to model the conditional of the latents given the inputs, and not solely by the density of the (test) inputs. The main drawback of transduction is that by its nature it doesn’t provide a predictive model in the way inductive models do. In the usual GP model one can do the bulk of the computation 1949 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN involved in the predictive distributions (e.g. matrix inversion) before seeing the test cases, enabling fast computation of test predictions. It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. The quality of the BCM approximation depends then on the particular location of the test inputs, upon which one usually does not have any control. We now see that there may be a better method, eliminating the drawback of transduction, namely use the PITC approximation, but choose the u’s carefully (see Section 9), don’t just use the test set. 8.1 Augmentation An idea closely related to transduction, but not covered by our definition, is augmentation, which in contrast to transduction is done individually for each test case. Since in the previous sections, we haven’t assumed anything about u, we can simply augment the set of inducing variables by f∗ (i.e. have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. Let’s first investigate the consequences for the test conditional from (9b). Note, the interpretation of the covariance matrix K∗,∗ − Q∗,∗ was “the prior covariance minus the information which u provides about f∗ ”. It is clear that the augmented u (with f∗ ) provides all possible information about f∗ , and consequently Q∗,∗ = K∗,∗ . An equivalent view on augmentation is that the assumption of conditional independence between f∗ and f is dropped. This is seen trivially by adding edges between f∗ and the fi in the graphical model, Figure 4. Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Qui˜ onero-Candela (2004). Because the SoR is a finite linear model, and n the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. For example, when making predictions far away from the center of any basis function, all basis functions have insignificant magnitudes, and the prediction (averaged over the posterior) will be close to zero, with very small error-bars; this is the opposite of the desired behaviour, where we would expect the error-bars to grow as we move away from the training cases. Here augmentation makes a particularly big difference turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. Compare the nonaugmented to the augmented SoR in Figure 5. An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Qui˜ onero-Candela, 2005). n Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn’t make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere). Remark 11 A full non-degenerate Gaussian process cannot be augmented, since the corresponding f∗ would already be connected to all other variables in the graphical model. But augmentation does make sense for sparse approximations to GPs. The more general process view on augmentation has several advantages over the basis function view. It is not completely clear from the basis function view, which basis function should be used for augmentation. For example, Rasmussen and Qui˜ onero-Candela (2005) successfully apply augn mentation using basis functions that have a zero contribution at the test location! In the process view 1950 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION however, it seems clear that one would chose the additional inducing variable to be f∗ , to minimize the effects of the approximations. Let us compute the effective prior for the augmented SoR. Given that f∗ is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior: Z qASoR (f∗ , f) = qSoR (f| f∗ , u) p( f∗ , u) du . (28) It is interesting to notice that this is also the effective prior that would result from augmenting the DTC approximation, since qSoR (f| f∗ , u) = qDTC (f| f∗ , u). Remark 12 Augmented SoR (ASoR) is equivalent to augmented DTC (ADTC). Augmented DTC only differs from DTC in the additional presence of f∗ among the inducing variables in the training conditional. We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. Therefore Remark 13 DTC is a less accurate (but cheaper) approximation than augmented SoR. We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. The predictive distribution of the augmented SoR is obtained by adding f∗ to u in (20). Prediction with an augmented sparse model comes at a higher computational cost, since now f∗ directly interacts with all of f and not just with u. For each new test case, updating the augmented Σ in the predictive equation (for example (20b) for DTC) implies computing the vector matrix product K∗,f Kf,u with complexity O (nm). This is clearly higher than the O (m) for the mean, and O (m2 ) for the predictive distribution of all the non-augmented methods we have discussed. Augmentation seems to be only really necessary for methods that make a severe approximation to the test conditional, like the SoR. For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. However, one can see by giving f∗ access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) than standard DTC. Note however that in the figure we have purposely chosen a too short lengthscale to enhance visualization. Quantitatively, this superiority was experimentally assessed by Qui˜ onero-Candela (2004, Table 3.1). Augmentation hasn’t been n compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006). Transductive methods like the BCM can be seen as joint augmentation, and one could potentially use it for any of the methods presented. It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. Joint augmentation might bring some computational advantage, but won’t change the scaling: note that augmenting m times at a cost of O (nm) apiece implies the same O (nm2 ) total cost as the jointly augmented BCM. 1951 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN SoD SoR 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 DTC 0 5 10 15 5 10 15 5 10 15 ASoR/ADTC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 FITC 0 PITC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 0 Figure 5: Toy example with identical covariance function and hyperparameters. The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. Augmented DTC (ADTC) is equivalent to augmented SoR (ASoR), see Remark 12. 1952 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION 9. On the Choice of the Inducing Variables We have until now assumed that the inducing inputs Xu were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (Cortes and Vapnik, 1995). In sparse Gaussian processes it has also been suggested to select the inducing inputs Xu from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csat´ and o Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and probably more. As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And perhaps the less restrictive choice can lead to better performance in very sparse models. Which optimality criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment which would involve defining priors on Xu , one could maximize the marginal likelihood (also called the evidence) with respect to Xu , an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs q(y|Xu ) = ZZ p(y|f) q(f|u) p(u|Xu )du df = Z p(y|f) q(f|Xu )df , (29) which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs Xu . Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge σ2 I (from the likelihood) to the covariance of effective noise prior on f. Using the appropriate definitions of Λ, the log marginal likelihood becomes 1 log q(y|Xu ) = − 2 log |Qf,f + Λ| − 1 y (Qf,f + Λ)−1 y − n log(2π) , 2 2 (30) where ΛSoR = ΛDTC = σ2 I, ΛFITC = diag[Kf,f − Qf,f ] + σ2 I, and ΛPITC = blockdiag[Kf,f − noise noise Qf,f ] + σ2 I. The computational cost of the marginal likelihood is O (nm2 ) for all methods, that of noise its gradient with respect to one element of Xu is O (nm). This of course implies that the complexity of computing the gradient wrt. to the whole of Xu is O (dnm2 ), where d is the dimension of the input space. It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis see Qui˜ onero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2). n The marginal likelihood has traditionally been used to learn the hyperparameters of GPs in the non fully Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning Xu it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization, and there is no need to interleave optimization of u with learning of the hyperparameters as it has been proposed for example by Seeger et al. (2003). 1953 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN 10. Other Methods In this section we briefly mention two approximations which don’t fit in our unifying scheme, since one doesn’t correspond to a proper probabilistic model, and the other one uses a particular construction for the covariance function, rather than allowing any general covariance function. 10.1 The Nystr¨ m Approximation o The Nystr¨ m Approximation for speeding up GP regression was originally proposed by Williams o and Seeger (2001), and then questioned by Williams et al. (2002). Like SoR and DTC, the Nystr¨ m o Approximation for GP regression approximates the prior covariance of f by Qf,f . However, unlike these methods, the Nystr¨ m Approximation is not based on a generative probabilistic model. The o prior covariance between f∗ and f is taken to be exact, which is inconsistent with the prior covariance on f: Qf,f Kf,∗ . (31) q(f, f∗ ) = N 0, K∗,f K∗,∗ As a result we cannot derive this method from our unifying framework, nor represent it with a graphical model. Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. Notice that replacing Kf,∗ by Qf,∗ in (31) is enough to make the prior covariance positive definite, and one obtains the DTC approximation. Remark 14 The Nystr¨ m Approximation does not correspond to a well-formed probabilistic model. o Ignoring any quibbles about positive definiteness, the predictive distribution of the Nystr¨ m Apo proximation is given by: p( f∗ |y) = N Kf,∗ [Qf,f + σ2 I]−1 y, K∗,∗ − Kf,∗ [Qf,f + σ2 I]−1 Kf,∗ , noise noise (32) but the predictive variance is not guaranteed to be positive. The computational cost is O (nm2 ). 10.2 The Relevance Vector Machine The relevance vector machine, introduced by Tipping (2001), is a finite linear model with an independent Gaussian prior imposed on the weights. For any input x∗ , the corresponding function output is given by: f∗ = φ∗ w , with p(w|A) = N (0, A) , (33) where φ∗ = [φ1 (x), . . . , φm (x)] is the (row) vector of responses of the m basis functions, and A = diag(α1 , . . . , αm ) is the diagonal matrix of joint prior precisions (inverse variances) of the weights. The αi are learnt by maximizing the RVM evidence (obtained by also assuming Gaussian additive iid. noise, see (1)), and for the typical case of rich enough sets of basis functions many of the precisions go to infinity effectively pruning out the corresponding weights (for a very interesting analysis see Wipf et al., 2004). The RVM is thus a sparse method and the surviving basis functions are called relevance vectors. Note that since the RVM is a finite linear model with Gaussian priors on the weights, it can be seen as a Gaussian process: Remark 15 The RVM is equivalent to a degenerate Gaussian process with covariance function kRVM (xi , x j ) = φi A−1 φ j = ∑m α−1 φk (xi ) φk (x j ), k=1 k 1954 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION q(f∗ |u) q(f|u) GP exact exact SoR determ. determ. DTC exact determ. FITC (exact) fully indep. PITC exact partially indep. Method joint prior covariance Kf,f Kf,∗ K∗,f K∗,∗ Qf,f Qf,∗ Q∗,f Q∗,∗ Qf,f Qf,∗ Q∗,f K∗,∗ Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ Qf,f − blokdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ GP? √ √ √ ( ) Table 1: Summary of the way approximations are built. All these methods are detailed in the previous sections. The initial cost and that of the mean and variance per test case are respectively n2 , n and n2 for the exact GP, and nm2 , m and m2 for all other methods. The “GP?” column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7. as was also pointed out by Tipping (2001, eq. (59)). Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM does also suffer from unreasonable predictive variances. Rasmussen and Qui˜ onero-Candela (2005) show that the predictive distributions of RVMs can also be healed by n augmentation, see Section 8. Once the αi have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O (m) for mean and O (m2 ) for the variance. RVMs are often used with radial basis functions centered on the training inputs. One potentially interesting extension to the RVM would be to learn the locations of the centers of the basis functions, in the same way as proposed by Snelson and Ghahramani (2006) for the FITC approximation, see Section 6. This is a curious reminiscence of learning the centers in RBF Networks. 11. Conclusions We have provided a unifying framework for sparse approximations to Gaussian processes for regression. Our approach consists of two steps, first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. We recover all existing sparse methods by making further simplifications of the covariances of the training and test conditionals, see Table 1 for a summary. Previous methods were presented based on different approximation paradigms (e.g. likelihood approximations, projection methods, matrix approximations, minimization of Kullback-Leibler divergence, etc), making direct comparison difficult. Under our unifying view we deconstruct methods, making it clear which building blocks they are based upon. For example, the SGPP by Snelson 1955 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN and Ghahramani (2006) contains two ideas, 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. Similarly, the BCM by Tresp (2000) contains two independent ideas 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. Finally we note that although all three ideas of 1) transductively setting u = f∗ , 2) augmentation and 3) continuous optimization of Xu have been proposed in very specific settings, in fact they are completely general ideas, which can be applied to any of the approximation schemes considered. We have ranked the approximation according to how close they are to the corresponding full GP. However, the performance in practical situations may not always follow this theoretical ranking since the approximations might exhibit properties (not present in the full GP) which may be particularly suitable for specific datasets. This may make the interpretation of empirical comparisons challenging. A further complication arises when adding the necessary heuristics for turning the theoretical constructs into practical algorithms. We have not described full algorithms in this paper, but are currently working on a detailed empirical study (in preparation, see also Rasmussen and Williams, 2006, chapter 8). We note that the order of the computational complexity is identical for all the methods considered, O (nm2 ). This highlights that there is no computational excuse for using gross approximations, such as assuming deterministic relationships, in particular one should probably think twice before using SoR or even DTC. Although augmentation has attractive predictive properties, it is computationally expensive. It remains unclear whether augmentation could be beneficial on a fixed computational budget. We have only considered the simpler case of regression in this paper, but sparseness is also commonly sought in classification settings. It should not be difficult to cast probabilistic approximation methods such as Expectation Propagation (EP) or the Laplace method (for a comparison, see Kuss and Rasmussen, 2005) into our unifying framework. Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. This would correspond to a non-transductive version of the BCM. We would evade the necessity of knowing the test set before doing the bulk of the computation, and we could hope to supersede the superior performance reported by Snelson and Ghahramani (2006) for very sparse approximations. Acknowledgments Thanks to Neil Lawrence for arranging the 2005 Gaussian Process Round Table meeting in Sheffield, which provided much inspiration to this paper. Special thanks to Olivier Chapelle, Lehel Csat´ , o Zoubin Ghahramani, Matthias Seeger, Ed Snelson and Chris Williams for helpful discussions, and to three anonymous reviewers. Both authors were supported by the German Research Council (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. 1956 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Appendix A. Gaussian and Matrix Identities In this appendix we provide identities used to manipulate matrices and Gaussian distributions throughout the paper. Let x and y be jointly Gaussian x y µx µy ∼ N , A C C B , (34) then the marginal and the conditional are given by x ∼ N (µx , A) , and x|y ∼ N µx +C B−1 (y − µy ), A −C B−1C (35) Also, the product of a Gaussian in x with a Gaussian in a linear projection P x is again a Gaussian, although unnormalized N (x|a, A) N (P x|b, B) = zc N (x|c,C) , (36) where C = A−1 + P B−1 P −1 c = C A−1 a + P B−1 b . , The normalizing constant zc is gaussian in the means a and b of the two Gaussians: m 1 1 zc = (2 π)− 2 |B + P A P |− 2 exp − 2 (b − P a) B+PAP −1 (b − P a) . (37) The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula states that: (Z +UWV )−1 = Z −1 − Z −1U(W −1 +V Z −1U)−1V Z −1 , (38) assuming the relevant inverses all exist. Here Z is n × n, W is m × m and U and V are both of size n × m; consequently if Z −1 is known, and a low rank (ie. m < n) perturbation are made to Z as in left hand side of eq. (38), considerable speedup can be achieved. References Corinna Cortes and Vladimir Vapnik. Support-vector network. Machine Learning, 20(3):273–297, 1995. Lehel Csat´ and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3): o 641–669, 2002. Sathiya Keerthi and Wei Chu. A Matching Pursuit approach to sparse Gaussian process regression. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, Cambridge, Massachussetts, 2006. The MIT Press. Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, pages 1679–1704, 2005. Joaquin Qui˜ onero-Candela. Learning with Uncertainty – Gaussian Processes and Relevance Vecn tor Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004. Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002. 1957 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN Carl Edward Rasmussen and Joaquin Qui˜ onero-Candela. Healing the relevance vector machine by n augmentation. In International Conference on Machine Learning, 2005. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT press, 2006. Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 953–960, Cambridge, Massachussetts, 2003. The MIT Press. Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003. Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Roy. Stat. Soc. B, 47(1):1–52, 1985. (with discussion). Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625, Cambridge, Massachussetts, 2001. The MIT Press. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems o 18, Cambridge, Massachussetts, 2006. The MIT Press. Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001. Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995. Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, and Barbara Klein. The biasvariance tradeoff and the randomized GACV. In Michael S. Kerns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 620–626, Cambridge, Massachussetts, 1999. The MIT Press. Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, Massachussetts, 1996. The MIT Press. Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nystr¨ m method for Gaussiam process prediction. Technical report, University o of Edinburgh, Edinburgh, Scotland, 2002. 1958 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Christopher K. I. Williams and Mathias Seeger. Using the Nystr¨ m method to speed up kernel o machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, Massachussetts, 2001. The MIT Press. David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16, Cambridge, Massachussetts, 2004. The MIT Press. 1959

2 0.59954995 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

Author: Malte Kuss, Carl Edward Rasmussen

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC

3 0.33876225 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

Author: Neil Lawrence

Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation

4 0.33360046 36 jmlr-2005-Gaussian Processes for Ordinal Regression

Author: Wei Chu, Zoubin Ghahramani

Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection

5 0.31357071 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

Author: Rie Kubota Ando, Tong Zhang

Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.

6 0.20270398 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

7 0.19350028 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

8 0.17858304 39 jmlr-2005-Information Bottleneck for Gaussian Variables

9 0.17700216 54 jmlr-2005-Managing Diversity in Regression Ensembles

10 0.16260858 41 jmlr-2005-Kernel Methods for Measuring Independence

11 0.15067531 12 jmlr-2005-An MDP-Based Recommender System

12 0.14730786 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

13 0.14577699 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

14 0.1361652 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems

15 0.12713881 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

16 0.12340676 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

17 0.1156973 47 jmlr-2005-Learning from Examples as an Inverse Problem

18 0.11355811 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks

19 0.10859406 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

20 0.10569097 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.011), (17, 0.019), (19, 0.013), (36, 0.052), (37, 0.567), (43, 0.02), (47, 0.018), (52, 0.052), (59, 0.018), (70, 0.063), (88, 0.06), (90, 0.012), (94, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93718761 61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers

Author: David Wingate, Kevin D. Seppi

Abstract: The performance of value and policy iteration can be dramatically improved by eliminating redundant or useless backups, and by backing up states in the right order. We study several methods designed to accelerate these iterative solvers, including prioritization, partitioning, and variable reordering. We generate a family of algorithms by combining several of the methods discussed, and present extensive empirical evidence demonstrating that performance can improve by several orders of magnitude for many problems, while preserving accuracy and convergence guarantees. Keywords: Markov Decision Processes, value iteration, policy iteration, prioritized sweeping, dynamic programming

same-paper 2 0.92583501 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression

Author: Joaquin Quiñonero-Candela, Carl Edward Rasmussen

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints. Keywords: Gaussian process, probabilistic regression, sparse approximation, Bayesian committee machine Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and cube respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see Rasmussen and Williams, 2006, chapter 8) of how they relate to each other, and which can be expected to give rise to the best algorithms. In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as “exact inference with an approximated prior”, rather than the existing (ubiquitous) interpretation “approximate inference with the exact prior”. This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them. In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse c 2005 Joaquin Qui˜ onero-Candela and Carl Edward Rasmussen. n ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN algorithms in Sections 4-7. The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10 and conclusions are drawn at the end. 1. Gaussian Processes for Regression Probabilistic regression is usually formulated as follows: given a training set D = {(xi , yi ), i = 1, . . . , n} of n pairs of (vectorial) inputs xi and noisy (real, scalar) outputs yi , compute the predictive distribution of the function values f∗ (or noisy y∗ ) at test locations x∗ . In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f (x) and the observed noisy targets y are given by yi = f (xi ) + εi , where εi ∼ N (0, σ2 ) , noise (1) where σ2 is the variance of the noise. noise Definition 1 A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent1 joint Gaussian distributions. Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior2 over functions, i.e. assumes a priori that function values behave according to p(f|x1 , x2 , . . . , xn ) = N (0, K) , (2) where f = [ f1 , f2 , . . . , fn ] is a vector of latent function values, fi = f (xi ) and K is a covariance matrix, whose entries are given by the covariance function, Ki j = k(xi , x j ). Note that the GP treats the latent function values fi as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves. Remark 2 Note, that to adhere to a strict Bayesian formalism, the GP covariance function,3 which defines the prior, should not depend on the data (although it can depend on additional parameters). As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases. Definition 3 A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues. 1. By consistency is meant simply that the random variables obey the usual rules of marginalization, etc. 2. For notational simplicity we exclusively use zero-mean priors. 3. The covariance function itself shouldn’t depend on the data, though its value at a specific pair of inputs of course will. 1940 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. with squared exponential or RBF covariance function) do not. The prior for a finite m dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n m. Note however, that non-degeneracy on its own doesn’t guarantee the existence of the “right kind” of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example that of Rasmussen and Williams (2006). Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f∗ 4 , and combine it with the likelihood5 p(y|f) using Bayes rule, to obtain the joint posterior p(f, f∗ )p(y|f) . p(y) p(f, f∗ |y) = (3) The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables: p(f∗ |y) = Z 1 p(y) p(f, f∗ |y)df = Z p(y|f) p(f, f∗ ) df , (4) or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian p(f, f∗ ) = N 0, Kf,f K∗,f Kf,∗ K∗,∗ , and p(y|f) = N (f, σ2 I) , noise (5) where K is subscript by the variables between which the covariance is computed (and we use the asterisk ∗ as shorthand for f∗ ) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution p(f∗ |y) = N K∗,f (Kf,f + σ2 I)−1 y, K∗,∗ − K∗,f (Kf,f + σ2 I)−1 Kf,∗ , noise noise (6) see the relevant Gaussian identity in appendix A. The problem with the above expression is that it requires inversion of a matrix of size n × n which requires O (n3 ) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases. 2. A New Unifying View We now seek to modify the joint prior p(f∗ , f) from (5) in ways which will reduce the computational requirements from (6). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u1 , . . . , um ] , which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f∗ ), corresponding to a set of input locations Xu , which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. 4. We will mostly consider a vector of test cases f∗ (rather than a single f∗ ). 5. You may have been expecting the likelihood written as p(y|f∗ , f) but since the likelihood is conditionally independent of everything else given f, this makes no difference. 1941 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as “support points”, “active set” or “pseudo-inputs”. Particular sparse algorithms choose the inducing variables in various different ways; some algorithms chose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. For now consider any arbitrary inducing variables. Due to the consistency of Gaussian processes, we know that we can recover p(f∗ , f) by simply integrating (marginalizing) out u from the joint GP prior p(f∗ , f, u) p(f∗ , f) = Z p(f∗ , f, u) du = Z p(f∗ , f|u) p(u) du, where p(u) = N (0, Ku,u ) . (7) This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f∗ and f are conditionally independent given u, see Figure 1, such that p(f∗ , f) q(f∗ , f) = Z q(f∗ |u) q(f|u) p(u) du . (8) The name inducing variable is motivated by the fact that f and f∗ can only communicate though u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f∗ |u) of the integral in (8). It will be useful for future reference to specify here the exact expressions for the two conditionals training conditional: test conditional: −1 p(f|u) = N (Kf,u Ku,u u, Kf,f − Qf,f ) , −1 p(f∗ |u) = N (K∗,u Ku,u u, K∗,∗ − Q∗,∗ ) , (9a) (9b) −1 where we have introduced the shorthand notation6 Qa,b Ka,u Ku,u Ku,b . We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6) with u playing the role of (noise free) observations. Note that the (positive semi-definite) covariance matrices in (9) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f∗ ). We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior p(y|f) = N (f, σ2 I) , and p(u) = N (0, Ku,u ) . (10) noise 3. The Subset of Data (SoD) Approximation Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn’t fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O (m3 ), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the 6. Note, that Qa,b depends on u although this is not explicit in the notation. 1942 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r r r fn f1 f2 f∗ Figure 1: Graphical model of the relation between the inducing variables u, the training latent functions values f = [ f1 , . . . , fn ] and the test function value f∗ . The thick horizontal line represents a set of fully connected nodes. The observations y1 , . . . , yn , y∗ (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (5). Left graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Right graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (8). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f∗ via the inducing variables u. uncertainties, when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations. In Figure 5 top, left we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this paper we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into actually practical algorithms. 4. The Subset of Regressors (SoR) Approximation The Subset of Regressors (SoR) algorithm was given by Silverman (1985), and mentioned again by Wahba et al. (1999). It was then adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression. SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x∗ , the corresponding function value f∗ is given by: f∗ = K∗,u wu , with −1 p(wu ) = N (0, Ku,u ) , (11) where there is one weight associated to each inducing input in Xu . Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, 1943 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN which is Gaussian with zero mean and covariance u = Ku,u wu ⇒ uu = Ku,u wu wu Ku,u = Ku,u . (12) −1 Using the effective prior on u and the fact that wu = Ku,u u we can redefine the SoR model in an equivalent, more intuitive way: −1 f∗ = K∗,u Ku,u u , with u ∼ N (0, Ku,u ) . (13) We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f∗ and u, the approximate conditional distributions in the integral in eq. (8) are given by: −1 qSoR (f|u) = N (Kf,u Ku,u u, 0) , and −1 qSoR (f∗ |u) = N (K∗,u Ku,u u, 0) , (14) with zero conditional covariance, compare to (9). The effective prior implied by the SoR approximation is easily obtained from (8), giving qSoR (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f Q∗,∗ , (15) −1 where we recall Qa,b Ka,u Ku,u Ku,b . A more descriptive name for this method, would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior. The m + 1-th one is a linear combination of the previous. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases. The degeneracy of the prior causes unreasonable predictive distributions. Indeed, the approximate prior over functions is so restrictive, that given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Qui˜ onero-Candela, 2005). Figure 5, top, right panel, illustrates the unreasonable n predictive uncertainties of the SoR approximation on a toy dataset.7 The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with: qSoR (f∗ |y) = N Q∗,f (Qf,f + σ2 I)−1 y, Q∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ , noise noise = N σ K∗,u Σ Ku,f y, K∗,u ΣKu,∗ , −2 (16a) (16b) where we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 . Equation (16a) is readily recognized as the regular prediction equation (6), except that the covariance K has everywhere been replaced by Q, which was already suggested by (15). This corresponds to replacing the covariance function k with −1 kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The new covariance function has rank (at most) m. Thus we have the following 7. Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate computationally costly alternative (more details are given by Qui˜ onero-Candela, 2004, Chapter 3). n 1944 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Remark 4 The SoR approximation is equivalent to exact inference in the degenerate Gaussian −1 process with covariance function kSoR (xi , x j ) = k(xi , u)Ku,u k(u, x j ). The equivalent (16b) is computationally cheaper, and with (11) in mind, Σ is the covariance of the posterior on the weights wu . Note that as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O (nm2 ) initially, and O (m) and O (m2 ) per test case for the predictive mean and variance respectively. 5. The Deterministic Training Conditional (DTC) Approximation Taking up ideas already contained in the work of Csat´ and Opper (2002), Seeger et al. (2003) o recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on −1 the projection f = Kf,u Ku,u u: p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, σ2 I) . noise (17) The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. (9b) −1 qDTC (f|u) = N (Kf,u Ku,u u, 0), and qDTC (f∗ |u) = p(f∗ |u) . (18) This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (14). A systematic name for this approximation is the Deterministic Training Conditional (DTC). The fundamental difference with SoR is that DTC uses the exact test conditional (9b) instead of the deterministic relation between f∗ and u of SoR. The joint prior implied by DTC is given by: qDTC (f, f∗ ) = N 0, Qf,f Qf,∗ Q∗,f K∗,∗ , (19) which is surprisingly similar to the effective prior implied by the SoR approximation (15). The fundamental difference is that under the DTC approximation f∗ has a prior variance of its own, given by K∗,∗ . This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. The predictive distribution is now given by: qDTC (f∗ |y) = N (Q∗,f (Qf,f + σ2 I)−1 y, K∗,∗ − Q∗,f (Qf,f + σ2 I)−1 Qf,∗ noise noise = N σ K∗,u Σ Ku,f y, K∗,∗ − Q∗,∗ + K∗,u ΣK∗,u , −2 (20a) (20b) where again we have defined Σ = (σ−2 Ku,f Kf,u + Ku,u )−1 as in (16). The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q∗,∗ from SoR with K∗,∗ (which is larger, since K∗,∗ − Q∗,∗ is positive definite). This added term is the predictive variance of the posterior of f∗ conditioned on u. It grows to the prior variance K∗,∗ as x∗ moves far from the inducing inputs in Xu . 1945 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r f1 f2 fn f∗ Figure 2: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u. Remark 5 The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR. Note, that since the covariances for training cases and test cases are computed differently, see (19), it follows that Remark 6 The DTC approximation does not correspond exactly to a Gaussian process, as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1. The computational complexity has the same order as for SoR. 6. The Fully Independent Training Conditional (FITC) Approximation Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Gaussian Processes using Pseudo-inputs (SGPP). While the DTC is based on the likelihood approximation given by (17), the SGPP proposes a more sophisticated likelihood approximation with a richer covariance p(y|f) −1 q(y|u) = N (Kf,u Ku,u u, diag[Kf,f − Qf,f ] + σ2 I) , noise (21) where diag[A] is a diagonal matrix whose elements match the diagonal of A. As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC) based on the inducing conditionals: n qFITC (f|u) = ∏ p( fi |u) = N i=1 −1 Kf,u Ku,u u, diag[Kf,f −Qf,f ] , and qFITC ( f∗ |u) = p( f∗ |u) . (22) We see that as opposed to SoR and DTC, FITC does not impose a deterministic relation between f and u. Instead of ignoring the variance, FITC proposes an approximation to the training conditional distribution of f given u as a further independence assumption. In addition, the exact test conditional from (9b) is used in (22), although for reasons which will become clear towards the end of this 1946 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION section, we initially consider only a single test case, f∗ . The corresponding graphical model is given in Figure 2. The effective prior implied by the FITC is given by qFITC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ . (23) Note, that the sole difference between the DTC and FITC is that in the top left corner of the implied prior covariance, FITC replaces the approximate covariances of DTC by the exact ones on the diagonal. The predictive distribution is qFITC ( f∗ |y) = N Q∗,f (Qf,f + Λ)−1 y, K∗,∗ − Q∗,f (Qf,f + Λ)−1 Qf,∗ (24a) = N K∗,u ΣKu,f Λ−1 y, K∗,∗ − Q∗,∗ + K∗,u ΣKu,∗ , (24b) where we have defined Σ = (Ku,u + Ku,f Λ−1 Kf,u )−1 and Λ = diag[Kf,f − Qf,f + σ2 I ]. The compunoise tational complexity is identical to that of SoR and DTC. So far we have only considered a single test case. There are two options for joint predictions, either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don’t explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariance are computed differently, and thus this does not correspond to our strict definition of a GP model, but Remark 7 Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function kFIC (xi , x j ) = kSoR (xi , x j ) + δi, j [k(xi , x j ) − kSoR (xi , x j )], where δi, j is Kronecker’s delta. A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is: qFIC (f, f∗ ) = N 0, Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f Q∗,∗ − diag[Q∗,∗ − K∗,∗ ] . (25) 7. The Partially Independent Training Conditional (PITC) Approximation In the previous section we saw how to improve the DTC approximation by approximating the training conditional with an independent distribution, i.e. one with a diagonal covariance matrix. In this section we will further improve the approximation (while remaining computationally attractive) by extending the training conditional to have a block diagonal covariance: −1 qPITC (f|u) = N Kf,u Ku,u u, blockdiag[Kf,f − Qf,f ] , and qPITC (f∗ |u) = p(f∗ |u) . (26) where blockdiag[A] is a block diagonal matrix (where the blocking structure is not explicitly stated). We represent graphically the PITC approximation in Figure 3. Developing this analogously to the FITC approximation from the previous section, we get the joint prior qPITC (f, f∗ ) = N 0, Qf,f − blockdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ 1947 , (27) ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN u    e ¡   ¡ e  e    ¡  e   ¡  e   ¡   e ¡   r r r fI fI fI f∗ 1 2 k Figure 3: Graphical representation of the PITC approximation. The set of latent function values fIi indexed by the the set of indices Ii is fully connected. The PITC differs from FITC (see graph in Fig. 2) in that conditional independence is now between the k groups of training latent function values. This corresponds to the block diagonal approximation to the true training conditional given in (26). and the predictive distribution is identical to (24), except for the alternative definition of Λ = blockdiag[Kf,f − Qf,f + σ2 I ]. An identical expression was obtained by Schwaighofer and Tresp noise (2003, Sect. 3), developing from the original Bayesian committee machine (BCM) by Tresp (2000). The relationship to the FITC was pointed out by Lehel Csat´ . The BCM was originally proposed as o a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs Xu were chosen to be the test inputs. We discuss transduction in detail in the next section. It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second setting the inducing inputs to be the test inputs. These two ideas can be used independently and in Section 8 we propose using the first without the second. The computational complexity of the PITC approximation depends on the blocking structure imposed in (26). A reasonable choice, also recommended by Tresp (2000) may be to choose k = n/m blocks, each of size m × m. The computational complexity remains O (nm2 ). Since in the PITC model the covariance is computed differently for training and test cases Remark 8 The PITC approximation does not correspond exactly to a Gaussian process. This is because computing covariances requires knowing whether points are from the training- or test-set, (27). One can obtain a Gaussian process from the PITC by extending the partial conditional independence assumption to the test conditional, as we did in Remark 7 for the FITC. 8. Transduction and Augmentation The idea of transduction is that one should restrict the goal of learning to prediction on a prespecified set of test cases, rather than trying to learn an entire function (induction) and then evaluate it at the test inputs. There may be no universally agreed upon definition of transduction. In this paper we use Definition 9 Transduction occurs only if the predictive distribution depends on other test inputs. This operational definition excludes models for which there exist an equivalent inductive counterpart. According to this definition, it is irrelevant when the bulk of the computation takes place. 1948 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION u f∗   e  ¡     ¡ e   e ¡     ¡     e e   ¡       e ¡   r r r fI fI fI 1 2 k Figure 4: Two views on Augmentation. One view is to see that the test latent function value f∗ is now part of the inducing variables u and therefore has access to the training latent function values. An equivalent view is to consider that we have dropped the assumption of conditional independence between f∗ and the training latent function values. Even if f∗ has now direct access to each of the training fi , these still need to go through u to talk to each other if they fall in conditionally independent blocks. We have in this figure decided to recycle the graph for PITC from Figure 3 to show that all approximations we have presented can be augmented, irrespective of what the approximation for the training conditional is. There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995), 2) the test inputs may reveal important information, which should be used during training. This motivation drives models in semi-supervised learning (studied mostly in the context of classification) and 3) for approximate algorithms one may be able to limit the discrepancies of the approximation at the test points. For exact GP models it seems that the first reason doesn’t really apply. If you make predictions at the test points that are consistent with a GP, then it is trivial inside the GP framework to extend these to any other input points, and in effect we have done induction. The second reason seems more interesting. However, in a standard GP setting, it is a consequence of the consistency property, see Remark 2, that predictions at one test input are independent of the location of any other test inputs. Therefore transduction can not be married with exact GPs: Remark 10 Transduction can not occur in exact Gaussian process models. Whereas this holds for the usual setting of GPs, it could be different in non-standard situations where e.g. the covariance function depends on the empirical input densities. Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. The BCM from the previous section, where Xu = X∗ (where X∗ are the test inputs) is an example of this. Since the inducing variables are connected to all other nodes (see Figure 3) we would expect the approximation to be good at u = f∗ , which is what we care about for predictions, relating to reason 3) above. While this reasoning is sound, it is not necessarily a sufficient consideration for getting a good model. The model has to be able to simultaneously explain the training targets as well and if the choice of u makes this difficult, the posterior at the points of interest may be distorted. Thus, the choice of u should be governed by the ability to model the conditional of the latents given the inputs, and not solely by the density of the (test) inputs. The main drawback of transduction is that by its nature it doesn’t provide a predictive model in the way inductive models do. In the usual GP model one can do the bulk of the computation 1949 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN involved in the predictive distributions (e.g. matrix inversion) before seeing the test cases, enabling fast computation of test predictions. It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. The quality of the BCM approximation depends then on the particular location of the test inputs, upon which one usually does not have any control. We now see that there may be a better method, eliminating the drawback of transduction, namely use the PITC approximation, but choose the u’s carefully (see Section 9), don’t just use the test set. 8.1 Augmentation An idea closely related to transduction, but not covered by our definition, is augmentation, which in contrast to transduction is done individually for each test case. Since in the previous sections, we haven’t assumed anything about u, we can simply augment the set of inducing variables by f∗ (i.e. have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. Let’s first investigate the consequences for the test conditional from (9b). Note, the interpretation of the covariance matrix K∗,∗ − Q∗,∗ was “the prior covariance minus the information which u provides about f∗ ”. It is clear that the augmented u (with f∗ ) provides all possible information about f∗ , and consequently Q∗,∗ = K∗,∗ . An equivalent view on augmentation is that the assumption of conditional independence between f∗ and f is dropped. This is seen trivially by adding edges between f∗ and the fi in the graphical model, Figure 4. Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Qui˜ onero-Candela (2004). Because the SoR is a finite linear model, and n the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. For example, when making predictions far away from the center of any basis function, all basis functions have insignificant magnitudes, and the prediction (averaged over the posterior) will be close to zero, with very small error-bars; this is the opposite of the desired behaviour, where we would expect the error-bars to grow as we move away from the training cases. Here augmentation makes a particularly big difference turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. Compare the nonaugmented to the augmented SoR in Figure 5. An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Qui˜ onero-Candela, 2005). n Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn’t make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere). Remark 11 A full non-degenerate Gaussian process cannot be augmented, since the corresponding f∗ would already be connected to all other variables in the graphical model. But augmentation does make sense for sparse approximations to GPs. The more general process view on augmentation has several advantages over the basis function view. It is not completely clear from the basis function view, which basis function should be used for augmentation. For example, Rasmussen and Qui˜ onero-Candela (2005) successfully apply augn mentation using basis functions that have a zero contribution at the test location! In the process view 1950 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION however, it seems clear that one would chose the additional inducing variable to be f∗ , to minimize the effects of the approximations. Let us compute the effective prior for the augmented SoR. Given that f∗ is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior: Z qASoR (f∗ , f) = qSoR (f| f∗ , u) p( f∗ , u) du . (28) It is interesting to notice that this is also the effective prior that would result from augmenting the DTC approximation, since qSoR (f| f∗ , u) = qDTC (f| f∗ , u). Remark 12 Augmented SoR (ASoR) is equivalent to augmented DTC (ADTC). Augmented DTC only differs from DTC in the additional presence of f∗ among the inducing variables in the training conditional. We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. Therefore Remark 13 DTC is a less accurate (but cheaper) approximation than augmented SoR. We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. The predictive distribution of the augmented SoR is obtained by adding f∗ to u in (20). Prediction with an augmented sparse model comes at a higher computational cost, since now f∗ directly interacts with all of f and not just with u. For each new test case, updating the augmented Σ in the predictive equation (for example (20b) for DTC) implies computing the vector matrix product K∗,f Kf,u with complexity O (nm). This is clearly higher than the O (m) for the mean, and O (m2 ) for the predictive distribution of all the non-augmented methods we have discussed. Augmentation seems to be only really necessary for methods that make a severe approximation to the test conditional, like the SoR. For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. However, one can see by giving f∗ access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) than standard DTC. Note however that in the figure we have purposely chosen a too short lengthscale to enhance visualization. Quantitatively, this superiority was experimentally assessed by Qui˜ onero-Candela (2004, Table 3.1). Augmentation hasn’t been n compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006). Transductive methods like the BCM can be seen as joint augmentation, and one could potentially use it for any of the methods presented. It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. Joint augmentation might bring some computational advantage, but won’t change the scaling: note that augmenting m times at a cost of O (nm) apiece implies the same O (nm2 ) total cost as the jointly augmented BCM. 1951 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN SoD SoR 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 DTC 0 5 10 15 5 10 15 5 10 15 ASoR/ADTC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 FITC 0 PITC 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −15 −10 −5 0 5 10 15 −1.5 −15 −10 −5 0 Figure 5: Toy example with identical covariance function and hyperparameters. The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. Augmented DTC (ADTC) is equivalent to augmented SoR (ASoR), see Remark 12. 1952 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION 9. On the Choice of the Inducing Variables We have until now assumed that the inducing inputs Xu were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (Cortes and Vapnik, 1995). In sparse Gaussian processes it has also been suggested to select the inducing inputs Xu from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csat´ and o Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and probably more. As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And perhaps the less restrictive choice can lead to better performance in very sparse models. Which optimality criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment which would involve defining priors on Xu , one could maximize the marginal likelihood (also called the evidence) with respect to Xu , an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs q(y|Xu ) = ZZ p(y|f) q(f|u) p(u|Xu )du df = Z p(y|f) q(f|Xu )df , (29) which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs Xu . Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge σ2 I (from the likelihood) to the covariance of effective noise prior on f. Using the appropriate definitions of Λ, the log marginal likelihood becomes 1 log q(y|Xu ) = − 2 log |Qf,f + Λ| − 1 y (Qf,f + Λ)−1 y − n log(2π) , 2 2 (30) where ΛSoR = ΛDTC = σ2 I, ΛFITC = diag[Kf,f − Qf,f ] + σ2 I, and ΛPITC = blockdiag[Kf,f − noise noise Qf,f ] + σ2 I. The computational cost of the marginal likelihood is O (nm2 ) for all methods, that of noise its gradient with respect to one element of Xu is O (nm). This of course implies that the complexity of computing the gradient wrt. to the whole of Xu is O (dnm2 ), where d is the dimension of the input space. It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis see Qui˜ onero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2). n The marginal likelihood has traditionally been used to learn the hyperparameters of GPs in the non fully Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning Xu it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization, and there is no need to interleave optimization of u with learning of the hyperparameters as it has been proposed for example by Seeger et al. (2003). 1953 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN 10. Other Methods In this section we briefly mention two approximations which don’t fit in our unifying scheme, since one doesn’t correspond to a proper probabilistic model, and the other one uses a particular construction for the covariance function, rather than allowing any general covariance function. 10.1 The Nystr¨ m Approximation o The Nystr¨ m Approximation for speeding up GP regression was originally proposed by Williams o and Seeger (2001), and then questioned by Williams et al. (2002). Like SoR and DTC, the Nystr¨ m o Approximation for GP regression approximates the prior covariance of f by Qf,f . However, unlike these methods, the Nystr¨ m Approximation is not based on a generative probabilistic model. The o prior covariance between f∗ and f is taken to be exact, which is inconsistent with the prior covariance on f: Qf,f Kf,∗ . (31) q(f, f∗ ) = N 0, K∗,f K∗,∗ As a result we cannot derive this method from our unifying framework, nor represent it with a graphical model. Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. Notice that replacing Kf,∗ by Qf,∗ in (31) is enough to make the prior covariance positive definite, and one obtains the DTC approximation. Remark 14 The Nystr¨ m Approximation does not correspond to a well-formed probabilistic model. o Ignoring any quibbles about positive definiteness, the predictive distribution of the Nystr¨ m Apo proximation is given by: p( f∗ |y) = N Kf,∗ [Qf,f + σ2 I]−1 y, K∗,∗ − Kf,∗ [Qf,f + σ2 I]−1 Kf,∗ , noise noise (32) but the predictive variance is not guaranteed to be positive. The computational cost is O (nm2 ). 10.2 The Relevance Vector Machine The relevance vector machine, introduced by Tipping (2001), is a finite linear model with an independent Gaussian prior imposed on the weights. For any input x∗ , the corresponding function output is given by: f∗ = φ∗ w , with p(w|A) = N (0, A) , (33) where φ∗ = [φ1 (x), . . . , φm (x)] is the (row) vector of responses of the m basis functions, and A = diag(α1 , . . . , αm ) is the diagonal matrix of joint prior precisions (inverse variances) of the weights. The αi are learnt by maximizing the RVM evidence (obtained by also assuming Gaussian additive iid. noise, see (1)), and for the typical case of rich enough sets of basis functions many of the precisions go to infinity effectively pruning out the corresponding weights (for a very interesting analysis see Wipf et al., 2004). The RVM is thus a sparse method and the surviving basis functions are called relevance vectors. Note that since the RVM is a finite linear model with Gaussian priors on the weights, it can be seen as a Gaussian process: Remark 15 The RVM is equivalent to a degenerate Gaussian process with covariance function kRVM (xi , x j ) = φi A−1 φ j = ∑m α−1 φk (xi ) φk (x j ), k=1 k 1954 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION q(f∗ |u) q(f|u) GP exact exact SoR determ. determ. DTC exact determ. FITC (exact) fully indep. PITC exact partially indep. Method joint prior covariance Kf,f Kf,∗ K∗,f K∗,∗ Qf,f Qf,∗ Q∗,f Q∗,∗ Qf,f Qf,∗ Q∗,f K∗,∗ Qf,f − diag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ Qf,f − blokdiag[Qf,f − Kf,f ] Qf,∗ Q∗,f K∗,∗ GP? √ √ √ ( ) Table 1: Summary of the way approximations are built. All these methods are detailed in the previous sections. The initial cost and that of the mean and variance per test case are respectively n2 , n and n2 for the exact GP, and nm2 , m and m2 for all other methods. The “GP?” column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7. as was also pointed out by Tipping (2001, eq. (59)). Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM does also suffer from unreasonable predictive variances. Rasmussen and Qui˜ onero-Candela (2005) show that the predictive distributions of RVMs can also be healed by n augmentation, see Section 8. Once the αi have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O (m) for mean and O (m2 ) for the variance. RVMs are often used with radial basis functions centered on the training inputs. One potentially interesting extension to the RVM would be to learn the locations of the centers of the basis functions, in the same way as proposed by Snelson and Ghahramani (2006) for the FITC approximation, see Section 6. This is a curious reminiscence of learning the centers in RBF Networks. 11. Conclusions We have provided a unifying framework for sparse approximations to Gaussian processes for regression. Our approach consists of two steps, first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. We recover all existing sparse methods by making further simplifications of the covariances of the training and test conditionals, see Table 1 for a summary. Previous methods were presented based on different approximation paradigms (e.g. likelihood approximations, projection methods, matrix approximations, minimization of Kullback-Leibler divergence, etc), making direct comparison difficult. Under our unifying view we deconstruct methods, making it clear which building blocks they are based upon. For example, the SGPP by Snelson 1955 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN and Ghahramani (2006) contains two ideas, 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. Similarly, the BCM by Tresp (2000) contains two independent ideas 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. Finally we note that although all three ideas of 1) transductively setting u = f∗ , 2) augmentation and 3) continuous optimization of Xu have been proposed in very specific settings, in fact they are completely general ideas, which can be applied to any of the approximation schemes considered. We have ranked the approximation according to how close they are to the corresponding full GP. However, the performance in practical situations may not always follow this theoretical ranking since the approximations might exhibit properties (not present in the full GP) which may be particularly suitable for specific datasets. This may make the interpretation of empirical comparisons challenging. A further complication arises when adding the necessary heuristics for turning the theoretical constructs into practical algorithms. We have not described full algorithms in this paper, but are currently working on a detailed empirical study (in preparation, see also Rasmussen and Williams, 2006, chapter 8). We note that the order of the computational complexity is identical for all the methods considered, O (nm2 ). This highlights that there is no computational excuse for using gross approximations, such as assuming deterministic relationships, in particular one should probably think twice before using SoR or even DTC. Although augmentation has attractive predictive properties, it is computationally expensive. It remains unclear whether augmentation could be beneficial on a fixed computational budget. We have only considered the simpler case of regression in this paper, but sparseness is also commonly sought in classification settings. It should not be difficult to cast probabilistic approximation methods such as Expectation Propagation (EP) or the Laplace method (for a comparison, see Kuss and Rasmussen, 2005) into our unifying framework. Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. This would correspond to a non-transductive version of the BCM. We would evade the necessity of knowing the test set before doing the bulk of the computation, and we could hope to supersede the superior performance reported by Snelson and Ghahramani (2006) for very sparse approximations. Acknowledgments Thanks to Neil Lawrence for arranging the 2005 Gaussian Process Round Table meeting in Sheffield, which provided much inspiration to this paper. Special thanks to Olivier Chapelle, Lehel Csat´ , o Zoubin Ghahramani, Matthias Seeger, Ed Snelson and Chris Williams for helpful discussions, and to three anonymous reviewers. Both authors were supported by the German Research Council (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. 1956 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Appendix A. Gaussian and Matrix Identities In this appendix we provide identities used to manipulate matrices and Gaussian distributions throughout the paper. Let x and y be jointly Gaussian x y µx µy ∼ N , A C C B , (34) then the marginal and the conditional are given by x ∼ N (µx , A) , and x|y ∼ N µx +C B−1 (y − µy ), A −C B−1C (35) Also, the product of a Gaussian in x with a Gaussian in a linear projection P x is again a Gaussian, although unnormalized N (x|a, A) N (P x|b, B) = zc N (x|c,C) , (36) where C = A−1 + P B−1 P −1 c = C A−1 a + P B−1 b . , The normalizing constant zc is gaussian in the means a and b of the two Gaussians: m 1 1 zc = (2 π)− 2 |B + P A P |− 2 exp − 2 (b − P a) B+PAP −1 (b − P a) . (37) The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula states that: (Z +UWV )−1 = Z −1 − Z −1U(W −1 +V Z −1U)−1V Z −1 , (38) assuming the relevant inverses all exist. Here Z is n × n, W is m × m and U and V are both of size n × m; consequently if Z −1 is known, and a low rank (ie. m < n) perturbation are made to Z as in left hand side of eq. (38), considerable speedup can be achieved. References Corinna Cortes and Vladimir Vapnik. Support-vector network. Machine Learning, 20(3):273–297, 1995. Lehel Csat´ and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3): o 641–669, 2002. Sathiya Keerthi and Wei Chu. A Matching Pursuit approach to sparse Gaussian process regression. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing o Systems 18, Cambridge, Massachussetts, 2006. The MIT Press. Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, pages 1679–1704, 2005. Joaquin Qui˜ onero-Candela. Learning with Uncertainty – Gaussian Processes and Relevance Vecn tor Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004. Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002. 1957 ˜ Q UI NONERO -C ANDELA AND R ASMUSSEN Carl Edward Rasmussen and Joaquin Qui˜ onero-Candela. Healing the relevance vector machine by n augmentation. In International Conference on Machine Learning, 2005. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT press, 2006. Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 953–960, Cambridge, Massachussetts, 2003. The MIT Press. Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003. Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Roy. Stat. Soc. B, 47(1):1–52, 1985. (with discussion). Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625, Cambridge, Massachussetts, 2001. The MIT Press. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch¨ lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems o 18, Cambridge, Massachussetts, 2006. The MIT Press. Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001. Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995. Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, and Barbara Klein. The biasvariance tradeoff and the randomized GACV. In Michael S. Kerns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 620–626, Cambridge, Massachussetts, 1999. The MIT Press. Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, Massachussetts, 1996. The MIT Press. Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nystr¨ m method for Gaussiam process prediction. Technical report, University o of Edinburgh, Edinburgh, Scotland, 2002. 1958 S PARSE A PPROXIMATE G AUSSIAN P ROCESS R EGRESSION Christopher K. I. Williams and Mathias Seeger. Using the Nystr¨ m method to speed up kernel o machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, Massachussetts, 2001. The MIT Press. David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Sch¨ lkopf, editors, Advances in Neural Information o Processing Systems 16, Cambridge, Massachussetts, 2004. The MIT Press. 1959

3 0.51607621 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

Author: Damien Ernst, Pierre Geurts, Louis Wehenkel

Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt , ut , rt , xt+1 ) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees. Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control

4 0.3801921 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

Author: Malte Kuss, Carl Edward Rasmussen

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC

5 0.37399262 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata

Author: Josh Bongard, Hod Lipson

Abstract: This paper describes an active learning approach to the problem of grammatical inference, specifically the inference of deterministic finite automata (DFAs). We refer to the algorithm as the estimation-exploration algorithm (EEA). This approach differs from previous passive and active learning approaches to grammatical inference in that training data is actively proposed by the algorithm, rather than passively receiving training data from some external teacher. Here we show that this algorithm outperforms one version of the most powerful set of algorithms for grammatical inference, evidence driven state merging (EDSM), on randomly-generated DFAs. The performance increase is due to the fact that the EDSM algorithm only works well for DFAs with specific balances (percentage of positive labelings), while the EEA is more consistent over a wider range of balances. Based on this finding we propose a more general method for generating DFAs to be used in the development of future grammatical inference algorithms. Keywords: grammatical inference, evolutionary computation, deterministic finite automata, active learning, system identification

6 0.3731131 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

7 0.35082239 36 jmlr-2005-Gaussian Processes for Ordinal Regression

8 0.34060854 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

9 0.33913887 54 jmlr-2005-Managing Diversity in Regression Ensembles

10 0.33852136 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints

11 0.33591643 5 jmlr-2005-A Generalization Error for Q-Learning

12 0.3267858 71 jmlr-2005-Variational Message Passing

13 0.32350871 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

14 0.32174999 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

15 0.31752643 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

16 0.31422397 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

17 0.31320471 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

18 0.31150243 1 jmlr-2005-A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes

19 0.3096655 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

20 0.30966464 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application