nips nips2009 nips2009-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient from a computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Gaussian process regression with Student-t likelihood Pasi Jylänki Department of Biomedical Engineering and Computational Science Helsinki University of Technology Finland pasi. [sent-1, score-0.127]
2 fi Jarno Vanhatalo Department of Biomedical Engineering and Computational Science Helsinki University of Technology Finland jarno. [sent-3, score-0.159]
3 fi Aki Vehtari Department of Biomedical Engineering and Computational Science Helsinki University of Technology Finland aki. [sent-5, score-0.159]
4 fi Abstract. In Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient from a computational perspective. [sent-7, score-0.335]
5 A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. [sent-9, score-0.314]
6 In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. [sent-11, score-0.328]
7 We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. [sent-12, score-0.295]
8 1 Introduction. A commonly used observation model in Gaussian process (GP) regression is the normal distribution. [sent-13, score-0.176]
9 However, a known limitation with the Gaussian observation model is its non-robustness, and replacing the normal distribution with a heavy-tailed one, such as the Student-t distribution, can be useful in problems with outlying observations. [sent-15, score-0.222]
10 If both the prior and the likelihood are Gaussian, the posterior is Gaussian with mean between the prior mean and the observations. [sent-16, score-0.458]
11 Thus, outlying observations may significantly reduce the accuracy of the inference. [sent-18, score-0.156]
12 For example, a single corrupted observation may pull the posterior expectation of the unknown function value considerably far from the level described by the other observations (see Figure 1). [sent-19, score-0.381]
13 A robust, or outlier-prone, observation model would, however, down-weight outlying observations more the further away they are from the other observations and the prior mean. [sent-20, score-0.364]
14 The Student-t observation model with linear regression was already studied by West [4] and Geweke [5], and Neal [6] introduced it for GP regression. [sent-23, score-0.176]
15 Other robust observation models include, for example, mixtures of Gaussians and the Laplace distribution. (Figure 1(a): Gaussian observation model.) [sent-24, score-0.268]
16 A common approach has been to use the scale mixture representation of the Student-t distribution [5], which enables Gibbs sampling [5, 6], and a factorized variational approximation (VB) for the posterior inference [7, 11]. [sent-31, score-0.526]
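The scale mixture representation mentioned here can be sketched in a few lines. The following is a minimal illustration (our own, not the authors' code) of the identity it rests on: drawing Vi from an inverse-Gamma with parameters ν/2 and νσ²/2 and then yi ~ N(fi, Vi) gives a marginal Student-t with ν degrees of freedom and scale σ. All variable names are illustrative.

```python
# Minimal sketch of the scale-mixture representation of the Student-t
# distribution used by the Gibbs and VB schemes: draw a per-observation
# variance V_i from an inverse-Gamma, then y_i ~ N(f_i, V_i); marginally
# y_i follows a Student-t with nu degrees of freedom and scale sigma.
import numpy as np
from scipy import stats

def sample_student_t_scale_mixture(f, nu, sigma, n_samples, rng):
    """Draw Student-t noise around f via the scale-mixture construction."""
    # V_i ~ Inv-Gamma(nu/2, nu*sigma^2/2)  <=>  1/V_i ~ Gamma(nu/2, rate=nu*sigma^2/2)
    inv_v = rng.gamma(shape=nu / 2.0, scale=2.0 / (nu * sigma**2), size=n_samples)
    v = 1.0 / inv_v
    return f + rng.normal(scale=np.sqrt(v))

rng = np.random.default_rng(0)
y = sample_student_t_scale_mixture(f=0.0, nu=4.0, sigma=1.0, n_samples=200_000, rng=rng)
# Compare the empirical CDF with the direct Student-t CDF (should agree closely).
qs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.round([(y <= q).mean() for q in qs], 3))
print(np.round(stats.t.cdf(qs, df=4.0, scale=1.0), 3))
```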
17 Here, we discuss the properties of the GP regression with a Student-t likelihood and utilize the Laplace approximation for the approximate inference. [sent-32, score-0.328]
18 We discuss the known weaknesses of the approximation scheme and show that in practice it works very well and quickly. [sent-33, score-0.178]
19 We show that the predictive performances are similar and that Laplace's method approximates the posterior covariance somewhat better than VB. [sent-35, score-0.389]
20 2 Robust regression with Gaussian processes. Consider a regression problem where the data comprise observations yi = f(xi) + εi at input locations X = {xi}n, where the observation errors ε1, . . . [sent-37, score-0.347]
21 The object of inference is the latent function f , which is given a Gaussian process prior. [sent-41, score-0.143]
22 In particular, at the observed input locations X the latent variables have a distribution p(f |X) = N (f |µ, Kf,f ), (1) where Kf,f is the covariance matrix and µ the mean function. [sent-43, score-0.212]
23 Each element in the covariance matrix is a realization of the covariance function, [Kf,f]ij = k(xi, xj), which represents the prior assumptions about the smoothness of the latent function (for a detailed introduction to GP regression see [12]). [sent-45, score-0.372]
24 That is, the effect of a single conflicting observation on the posterior becomes asymptotically negligible as the observation approaches infinity. [sent-55, score-0.447]
25 This contrasts heavily with the Gaussian observation model where each observation influences the posterior no matter how far it is from the others. [sent-56, score-0.447]
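This robustness property can be made concrete with the score (the derivative of the log likelihood). The sketch below is our own illustration, with the Student-t score written out from its density rather than quoted from the paper.

```python
# Numeric illustration (not from the paper) of why the Student-t likelihood
# is outlier-prone in the robust sense: its score
#   d/df log p(y|f) = (nu+1) r / (nu sigma^2 + r^2),  r = y - f,
# tends to zero for large residuals, whereas the Gaussian score r / sigma^2
# grows without bound, so a single extreme observation keeps pulling the mean.
import numpy as np

def student_t_score(r, nu, sigma):
    return (nu + 1.0) * r / (nu * sigma**2 + r**2)

def gaussian_score(r, sigma):
    return r / sigma**2

residuals = np.array([1.0, 5.0, 20.0, 100.0])
print(np.round(student_t_score(residuals, nu=4.0, sigma=1.0), 4))  # decays towards 0
print(np.round(gaussian_score(residuals, sigma=1.0), 4))           # grows linearly
```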
26 Inference with the Laplace approximation. The conditional posterior of the latent variables. Our approach is motivated by the Laplace approximation in GP classification [14]. [sent-63, score-0.583]
27 A similar approximation has been considered by West [4] in the case of robust linear regression and by Rue et al. [sent-64, score-0.237]
28 The maximum a posteriori estimate of the hyperparameters. To find a maximum a posteriori (MAP) estimate for the hyperparameters, we write p(θ|y) ∝ p(y|θ)p(θ), where p(y|θ) = ∫ p(y|f) p(f|X, θ) df, (5) is the marginal likelihood. [sent-69, score-0.576]
29 To find an approximation, q(y|θ), for the marginal likelihood one can utilize the Laplace method a second time [12]. [sent-70, score-0.168]
30 A Taylor expansion of the logarithm of the integrand in (5) around f̂ gives a Gaussian integral over f multiplied by a constant, giving log q(y|θ) = log p(y|f̂) − ½ f̂ᵀ Kf,f⁻¹ f̂ − ½ log |Kf,f| − ½ log |Kf,f⁻¹ + W|. (6) [sent-71, score-0.275]
31 The hyperparameters can then be optimized by maximizing the approximate log marginal posterior, log q(θ|y) ∝ log q(y|θ) + log p(θ). [sent-72, score-0.370]
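A minimal sketch of how (6) could be evaluated is given below. It is not the authors' implementation, assumes a zero prior mean, and takes the mode f̂ and the diagonal likelihood Hessian W as given; loglik_at_mode is the Student-t log likelihood summed over observations.

```python
# Sketch of the approximate log marginal likelihood in (6), log q(y|theta),
# evaluated at a given mode f_hat with a diagonal likelihood Hessian W.
import numpy as np

def laplace_log_marginal(loglik_at_mode, f_hat, K, W_diag):
    n = K.shape[0]
    L = np.linalg.cholesky(K)                                  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_hat))    # K^{-1} f_hat
    # log|K| from the Cholesky factor; log|K^{-1} + W| = log|I + K W| - log|K|
    logdet_K = 2.0 * np.sum(np.log(np.diag(L)))
    B = np.eye(n) + K * W_diag                                 # I + K diag(W)
    sign, logdet_B = np.linalg.slogdet(B)
    return (loglik_at_mode
            - 0.5 * f_hat @ alpha
            - 0.5 * logdet_K
            - 0.5 * (logdet_B - logdet_K))
```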
32 The predictive distribution of a new observation is obtained by marginalizing over the posterior distribution of f∗, q(y∗|X, y, x∗) = ∫ p(y∗|f∗) q(f∗|X, y, x∗) df∗, (9) which can be evaluated, for example, with Gaussian quadrature integration. [sent-76, score-0.391]
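The quadrature evaluation of (9) can be sketched as follows. The function below is an assumed interface, not the authors' code, and uses Gauss-Hermite nodes for the Gaussian approximation q(f∗) = N(mu_star, var_star) together with the Student-t likelihood.

```python
# Sketch of evaluating the predictive density (9) with Gauss-Hermite quadrature.
import numpy as np
from scipy import stats

def predictive_density(y_star, mu_star, var_star, nu, sigma, n_points=32):
    x, w = np.polynomial.hermite_e.hermegauss(n_points)   # probabilists' Hermite rule
    f_star = mu_star + np.sqrt(var_star) * x               # quadrature nodes under q(f*)
    lik = stats.t.pdf(y_star, df=nu, loc=f_star, scale=sigma)
    return np.sum(w * lik) / np.sqrt(2.0 * np.pi)

# Example: density of y* = 0 under a unit-variance Gaussian approximation for f*.
print(predictive_density(0.0, mu_star=0.0, var_star=1.0, nu=4.0, sigma=0.1))
```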
33 Properties of the Laplace approximation. The Student-t distribution is not log-concave, and therefore the posterior distribution may be multimodal. [sent-78, score-0.350]
34 The immediate concern is that a unimodal Laplace approximation may give a poor estimate of the posterior. [sent-79, score-0.181]
35 This is, however, a problem for all unimodal approximations. [Figure 2: p(f), p(f|D), p(y|f) (prior, likelihood, real posterior, Laplace approximation, VB approximation) and Var(f|D), Var(f), plotted against the latent value f.] [sent-80, score-0.596]
36 The likelihood is centered at zero and the prior mean is altered. [sent-83, score-0.146]
37 The upper plots show the probability density functions and the lower plots the variance of the true posterior and its approximations as a function of the posterior mean. [sent-84, score-0.559]
38 Another concern is that the estimate of the posterior precision, Σ⁻¹ = −∇∇ log p(f|y, θ)|f=f̂, is essentially uncontrolled. [sent-86, score-0.289]
39 However, at a posterior mode f̂, the Hessian Σ⁻¹ is always positive definite and in practice approximates the truth rather well according to our experiments. [sent-87, score-0.361]
40 If the optimization for f ends up in a saddle point or the mode is very flat, Σ−1 may be close to singular, which leads to problems in the implementation. [sent-88, score-0.128]
41 Consider a single observation yi = 0 from a Student-t distribution with a Gaussian prior for its mean, fi . [sent-90, score-0.384]
42 The dotted lines represent the situation where the observation is a clear outlier, in which case the posterior is very close to the prior (cf. [sent-92, score-0.490]
43 The solid lines represent a situation where the prior and data agree, and the dashed lines represent a situation where the prior and data conflict moderately. [sent-94, score-0.198]
44 The posterior of the mean is unimodal if Σ(fi)⁻¹ = τi⁻² + W(fi) > 0 for all fi ∈ ℝ, where τi² is the prior variance and W(fi) is the Hessian of the negative log likelihood at fi (see equations (3) and (4)). [sent-95, score-0.925]
45 Therefore, the posterior distribution is unimodal if τi−2 > (ν + 1)/(8νσ 2 ), or in terms of variances if Var[yi |fi , ν, σ]/τi2 > (ν + 1)/(8(ν − 2)) (for ν > 2). [sent-97, score-0.285]
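The unimodality condition can be checked numerically. The W(f) below is our own derivation of the Student-t Hessian (consistent with the bound quoted above), not an equation copied from the paper.

```python
# Sketch of the single-observation unimodality check: W(f) is the Hessian of
# the negative Student-t log likelihood, its minimum is -(nu+1)/(8 nu sigma^2),
# so the posterior of f_i is unimodal whenever the prior precision exceeds it.
import numpy as np

def student_t_W(r, nu, sigma):
    """Hessian of -log p(y|f) at residual r = y - f."""
    return (nu + 1.0) * (nu * sigma**2 - r**2) / (nu * sigma**2 + r**2) ** 2

def posterior_is_unimodal(prior_var, nu, sigma):
    return 1.0 / prior_var > (nu + 1.0) / (8.0 * nu * sigma**2)

nu, sigma = 4.0, 1.0
r = np.linspace(-10, 10, 2001)
print(student_t_W(r, nu, sigma).min())                            # ~ -(nu+1)/(8*nu*sigma^2) = -0.15625
print(posterior_is_unimodal(prior_var=2.0, nu=nu, sigma=sigma))   # True: tight prior
print(posterior_is_unimodal(prior_var=20.0, nu=nu, sigma=sigma))  # False: wide prior
```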
46 It follows that the most problematic situation for the Laplace approximation is when the prior is much wider than the likelihood. [sent-98, score-0.270]
47 Then, in the case of a moderate conflict (|yi − f̂i| close to √(3ν)σ), the posterior may be multimodal (see Figure 2(a)), meaning that it is unclear whether the observation is an outlier or not. [sent-99, score-0.595]
48 The negative values of W relate to a decrease in the posterior precision compared to the prior precision. [sent-102, score-0.373]
49 As long as the total precision remains positive it approximates the behavior of the true posterior rather well. [sent-103, score-0.307]
50 The Student-t likelihood leads to a decrease in the variance from prior to posterior only if the prior mean and the observation are consistent with each other, as shown in Figure 2. [sent-104, score-0.602]
51 This behavior is not captured with the factorized VB approximation [7], where W in q(f|θ, y) is replaced with a strictly positive diagonal that always increases the precision, as illustrated in Figure 2. [sent-105, score-0.242]
52 On the implementation: posterior mode of the latent variables. The mode of the latent variables, f̂, can be found with general optimization methods such as scaled conjugate gradients. [sent-107, score-0.362]
53 The E-step of the algorithm consists of evaluating the expectation E[1/Vi | yi, fiold, ν, σ] = (ν + 1)/(νσ² + (yi − fiold)²), (12) after which the latent variables are updated in the M-step as f̂new = (Kf,f⁻¹ + V⁻¹)⁻¹ V⁻¹ y, (13) where V⁻¹ is a diagonal matrix of the expectations in (12). [sent-111, score-0.312]
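A minimal sketch of the EM iteration (12)-(13) is given below; it is not the authors' implementation and assumes a zero prior mean.

```python
# Sketch of the EM iteration for the posterior mode: the E-step computes the
# expected noise precision under the scale-mixture representation, eq. (12);
# the M-step solves the resulting Gaussian regression, eq. (13).
import numpy as np

def em_posterior_mode(y, K, nu, sigma, n_iter=100, tol=1e-8):
    n = y.shape[0]
    f = np.zeros(n)
    for _ in range(n_iter):
        # E-step, eq. (12): expected noise precision per observation
        e_inv_v = (nu + 1.0) / (nu * sigma**2 + (y - f) ** 2)
        # M-step, eq. (13): f_new = (K^{-1} + V^{-1})^{-1} V^{-1} y = K (K + V)^{-1} y
        V = np.diag(1.0 / e_inv_v)
        f_new = K @ np.linalg.solve(K + V, y)
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f
```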
54 Approximate marginal likelihood. Rasmussen and Williams [12] discuss a numerically stable formulation to evaluate the approximate marginal likelihood and its gradients with a classification model. [sent-117, score-0.307]
55 With the Student-t likelihood, we found the most stable formulation for (6) is log q(y|θ) = log p(y|f̂) − ½ f̂ᵀ a − ∑i log Rii + ∑i log Lii, (15) where R and L are the Cholesky decompositions of Kf,f and Σ = (Kf,f⁻¹ + W)⁻¹, and a is obtained from the EM algorithm. [sent-119, score-0.248]
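Our reading of (15) can be written out as follows. For clarity the sketch forms Σ directly rather than through the sequential Cholesky updates described next, and a is assumed to be Kf,f⁻¹f̂ as delivered by the EM iteration; this is not the authors' code.

```python
# Sketch of the stable evaluation (15) from Cholesky factors: R is the Cholesky
# factor of K_f,f, L is the Cholesky factor of Sigma = (K^{-1} + W)^{-1}.
import numpy as np

def log_marginal_eq15(loglik_at_mode, f_hat, a, K, W_diag):
    R = np.linalg.cholesky(K)
    Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(W_diag))  # PD at a proper mode
    L = np.linalg.cholesky(Sigma)
    return (loglik_at_mode
            - 0.5 * f_hat @ a
            - np.sum(np.log(np.diag(R)))
            + np.sum(np.log(np.diag(L))))
```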
56 We first write the posterior covariance as Σ = (Kf,f⁻¹ + W)⁻¹ = (Kf,f⁻¹ + e1e1ᵀ W11 + e2e2ᵀ W22 + · · · + enenᵀ Wnn)⁻¹. [sent-124, score-0.298]
57 in that the posterior covariance is updated using the result of the previous iteration as a prior. [sent-136, score-0.347]
58 Adding the largest W first, we reduce Σ so that negative values of W are less problematic (compare to the dashed black line in Figure 2(b)), and the updates are numerically more stable. [sent-138, score-0.202]
59 This ensures that the Cholesky update will remain positive definite and doubles the marginal variance instead. [sent-141, score-0.149]
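The sequential construction described in the last few sentences can be sketched with the matrix inversion lemma instead of explicit Cholesky updates and downdates; the skipping rule below (doubling the marginal variance when an update would break positive definiteness) is our reading of the text, not the authors' code.

```python
# Sketch of building Sigma = (K^{-1} + W)^{-1} by sequential rank-one updates,
# adding the largest W_ii first; an update whose denominator 1 + W_ii * Sigma_ii
# is not positive would lose positive definiteness, so it is skipped and the
# corresponding marginal variance is doubled instead.
import numpy as np

def sequential_posterior_cov(K, W_diag):
    Sigma = K.copy()                      # start from the prior covariance
    for i in np.argsort(-W_diag):         # largest (most positive) W first
        w, s_i = W_diag[i], Sigma[:, i].copy()
        denom = 1.0 + w * Sigma[i, i]
        if denom <= 0.0:                  # update would break positive definiteness
            Sigma[i, i] *= 2.0
            continue
        # matrix inversion lemma: (Sigma^{-1} + w e_i e_i^T)^{-1}
        Sigma -= (w / denom) * np.outer(s_i, s_i)
    return Sigma
```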
60 However, in practice we never encountered any warnings in our experiments if the hyperparameters were initialized sensibly so that the prior was tight compared to the likelihood. [sent-142, score-0.144]
61 However, the most similar approaches to the Laplace approximation are the VB approximation [7, 11] and the one in INLA [15]. [sent-144, score-0.246]
62 The Gaussian approximation for p(f | y, θ) in INLA is the same as the Laplace approximation here with the covariance function replaced by a precision matrix. [sent-147, score-0.388]
63 [15] derive the approximation for the log marginal posterior, log p(θ|y), from p(θ|y) ≈ q(θ|y) ∝ [p(y, f, θ)/q(f|θ, y)]|f=f̂ = [p(y|f)p(f|θ)p(θ)/q(f|θ, y)]|f=f̂. (18) [sent-149, score-0.305]
64 This is exactly the same as the approximation derived in Section 3. [sent-151, score-0.123]
65 Taking the logarithm of (18) we end up with log q(θ|y) ∝ log q(y|θ) + log p(θ), where log q(y|θ) is given in (6). [sent-153, score-0.275]
66 The approximate distributions and the hyperparameters θ are updated in turns, so that the approximate distributions are updated with the current estimate of θ, and after that θ is updated with the approximate distributions held fixed. [sent-155, score-0.211]
67 The variational approximation for the conditional posterior is p(f|y, θ, V̂) ≈ N(f|m, A). [sent-156, score-0.445]
68 Here, A = (Kf,f⁻¹ + V̂⁻¹)⁻¹, and the iterative search for the posterior parameters m and A is the same as the EM algorithm described in Section 4, except that the update of E[Vii⁻¹] in (12) is replaced with E[Vii⁻¹] = (ν + 1)/(σ² + Aiiold + (yi − miold)²). [sent-157, score-0.281]
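In terms of the EM sketch given after (12)-(13), the VB variant described here would only change the E-step. A sketch of that replacement, following our reading of the formula above (not the authors' code), is:

```python
# VB-style expected noise precision: compared with (12), it also accounts for
# the current marginal posterior variances A_ii (names follow the EM sketch).
import numpy as np

def vb_expected_noise_precision(y, m, A_diag, nu, sigma):
    return (nu + 1.0) / (sigma**2 + A_diag + (y - m) ** 2)
```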
69 Thus, the Laplace and the variational approximation are very similar. [sent-158, score-0.218]
70 In practice, the posterior mode m is very close to the mode f̂, and the main difference between the approximations is in the covariance and the hyperparameter estimates. [sent-159, score-0.438]
71 The other difference is that in the Laplace approximation the scale parameters V are marginalized out and it approximates directly p(f | y, θ). [sent-196, score-0.188]
72 The predictive performance is measured with the root mean squared error (RMSE) and the negative log predictive density (NLP). [sent-203, score-0.250]
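For concreteness, a minimal sketch of the two measures, assuming the per-point predictive densities are available (e.g. from the quadrature in (9)) and that the NLP is averaged over the test set:

```python
# Sketch of the evaluation measures: RMSE of the predictive mean and the
# negative log predictive density averaged over the test points.
import numpy as np

def rmse(y_true, pred_mean):
    return np.sqrt(np.mean((y_true - pred_mean) ** 2))

def nlp(pred_density_at_y_true):
    return -np.mean(np.log(pred_density_at_y_true))
```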
73 The Student-t model is inferred using the Laplace approximation (lapl), VB (vb) [7] and full MCMC (mcmc) [6]. [sent-207, score-0.123]
74 The Gaussian observation model, the Laplace approximation and VB are evaluated at θ̂, and in MCMC we sample θ. [sent-208, score-0.233]
75 The significance of the differences in performance is approximated using a Gaussian approximation for the distribution of the NLP and RMSE statistics [17]. [sent-211, score-0.123]
76 The inference time was the shortest with the Gaussian observation model and the longest with the Student-t model utilizing full MCMC. [sent-214, score-0.143]
77 The Laplace approximation for the Student-t likelihood took on average 50% more time than the Gaussian model, and VB was on average 8-10 times slower than the Laplace approximation. [sent-215, score-0.184]
78 Figure 3 shows the mean and the variance of p(f |θ, y) for MCMC versus the Laplace approximation and VB. [sent-218, score-0.219]
79 The means of the Laplace approximation and VB match the mean of the MCMC solution equally well, but VB underestimates the variance more than the Laplace approximation (see also Figure 2). [sent-219, score-0.433]
80 In the housing data, both approximations underestimate the variance remarkably for a few data points (40 of 506), which were located in clusters at places where the inputs x are truncated along one or more dimensions. [sent-220, score-0.217]
81 At these locations, the marginal posteriors were slightly skewed and their tails were rather heavy, and thus a Gaussian approximation presumably underestimates the variance. [sent-221, score-0.267]
82 The degrees of freedom of the Student-t likelihood were optimized only in Neal data and Boston housing data using the Laplace approximation. [sent-222, score-0.173]
83 Optimizing ν was more problematic for VB than for the Laplace approximation probably because the factorized approximation makes it harder to identify ν. [sent-224, score-0.355]
84 The MAP estimates θ̂ found by the Laplace approximation and VB were slightly different. [sent-225, score-0.123]
85 Figure 3 ((a) Neal data, (b) Friedman data, (c) Boston housing data, (d) Concrete data): Scatter plot of the posterior mean and variance of the latent variables. [sent-227, score-0.545]
86 In each figure, the left plot is MCMC (x-axis) vs. the Laplace approximation (y-axis) and the right plot is MCMC (x-axis) vs. VB (y-axis). [sent-229, score-0.123]
87 7 Discussion. In our experiments we found that the predictive performance of both the Laplace approximation and the factorial VB is similar to that of the full MCMC. [sent-231, score-0.177]
88 Compared to MCMC, the Laplace approximation and VB estimate the posterior mean E[f|θ̂, y] similarly, but VB underestimates the posterior variance Var[f|θ̂, y] more than the Laplace approximation. [sent-232, score-0.733]
89 Optimizing the hyperparameters is clearly faster with the Laplace approximation than with VB. [sent-233, score-0.187]
90 Both the Laplace and the VB approximation estimate the posterior precision as a sum of a prior precision and a diagonal matrix. [sent-234, score-0.49]
91 In VB the diagonal is strictly positive, whereas in the Laplace approximation the diagonal elements corresponding to outlying observations are negative. [sent-235, score-0.279]
92 The Laplace approximation is closer to the reality in that respect since the outlying observations have a negative effect on the (true) posterior precision. [sent-236, score-0.555]
93 Since a posteriori f and V are correlated, the marginal q(f ) underestimates the effect of marginalizing over the scale parameters. [sent-238, score-0.146]
94 The Laplace approximation, on the other hand, tries to estimate directly the posterior p(f ) of the latent variables. [sent-239, score-0.337]
95 Recently, Opper and Archambeau [19] discussed the relation between the Laplace approximation and VB, and proposed a variational approximation directly for the latent variables and tried it with a Cauchy likelihood (they did not perform extensive experiments though). [sent-240, score-0.512]
96 The advantage of VB is that the objective function (19) is a rigorous lower bound for p(y |θ), whereas the Laplace approximation (18) is not. [sent-243, score-0.123]
97 However, the marginal posteriors p(f | y, θ) in our experiments (inferred with MCMC) were so close to Gaussian that the Laplace approximation q(f |θ, y) should be very accurate and, thus, the approximation for p(θ| y) (18) should also be close to the truth (see also justifications in [15]). [sent-244, score-0.362]
98 However, the Student-t likelihood is problematic for EP since it is not log-concave, for which reason EP's estimate of the posterior covariance may become singular during the site updates [21]. [sent-246, score-0.538]
99 The reason for this is that the variance parameters of the site approximations may become negative. [sent-247, score-0.167]
100 Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. [sent-337, score-0.169]
wordName wordTfidf (topN-words)
[('laplace', 0.537), ('vb', 0.392), ('wii', 0.261), ('posterior', 0.227), ('fi', 0.159), ('cholesky', 0.145), ('inla', 0.139), ('approximation', 0.123), ('housing', 0.112), ('outlying', 0.112), ('observation', 0.11), ('latent', 0.11), ('neal', 0.1), ('gaussian', 0.096), ('variational', 0.095), ('mcmc', 0.095), ('gp', 0.073), ('covariance', 0.071), ('mode', 0.071), ('rmse', 0.07), ('outlier', 0.07), ('nlp', 0.07), ('finnish', 0.07), ('rue', 0.07), ('concrete', 0.068), ('christopher', 0.066), ('finland', 0.066), ('boston', 0.066), ('regression', 0.066), ('variance', 0.065), ('hyperparameters', 0.064), ('log', 0.062), ('yi', 0.061), ('problematic', 0.061), ('likelihood', 0.061), ('underestimates', 0.06), ('unimodal', 0.058), ('marginal', 0.058), ('var', 0.056), ('predictive', 0.054), ('prior', 0.054), ('biomedical', 0.054), ('ii', 0.054), ('vii', 0.052), ('rejection', 0.049), ('updated', 0.049), ('utilize', 0.049), ('negative', 0.049), ('robust', 0.048), ('factorized', 0.048), ('helsinki', 0.047), ('chol', 0.046), ('downdate', 0.046), ('fiold', 0.046), ('vehtari', 0.046), ('observations', 0.044), ('precision', 0.043), ('friedman', 0.043), ('williams', 0.043), ('ict', 0.042), ('aki', 0.041), ('approximations', 0.04), ('numerically', 0.04), ('technology', 0.038), ('rasmussen', 0.038), ('factorizing', 0.037), ('approx', 0.037), ('gmrf', 0.037), ('opper', 0.037), ('approximates', 0.037), ('bayesian', 0.037), ('outliers', 0.036), ('se', 0.036), ('inference', 0.033), ('ld', 0.033), ('foundation', 0.033), ('situation', 0.032), ('reason', 0.032), ('mean', 0.031), ('gelman', 0.03), ('site', 0.03), ('west', 0.03), ('singular', 0.03), ('vi', 0.029), ('yn', 0.029), ('discuss', 0.029), ('close', 0.029), ('scale', 0.028), ('graduate', 0.028), ('saddle', 0.028), ('replaced', 0.028), ('hessian', 0.027), ('logarithm', 0.027), ('eq', 0.027), ('presumably', 0.026), ('nested', 0.026), ('practice', 0.026), ('updates', 0.026), ('dashed', 0.026), ('update', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient from a computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
2 0.20123374 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
3 0.14429989 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
4 0.12216941 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
Author: Jaakko Luttinen, Alexander T. Ihler
Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.
5 0.11134259 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
Author: Peter Carbonetto, Matthew King, Firas Hamze
Abstract: We describe a new algorithmic framework for inference in probabilistic models, and apply it to inference for latent Dirichlet allocation (LDA). Our framework adopts the methodology of variational inference, but unlike existing variational methods such as mean field and expectation propagation it is not restricted to tractable classes of approximating distributions. Our approach can also be viewed as a “population-based” sequential Monte Carlo (SMC) method, but unlike existing SMC methods there is no need to design the artificial sequence of distributions. Significantly, our framework offers a principled means to exchange the variance of an importance sampling estimate for the bias incurred through variational approximation. We conduct experiments on a difficult inference problem in population genetics, a problem that is related to inference for LDA. The results of these experiments suggest that our method can offer improvements in stability and accuracy over existing methods, and at a comparable cost. 1
6 0.11110144 43 nips-2009-Bayesian estimation of orientation preference maps
7 0.096666351 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
8 0.093163088 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
9 0.088281676 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
10 0.081394292 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
11 0.079457909 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
12 0.078977801 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
13 0.076152161 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features
14 0.07150387 97 nips-2009-Free energy score space
15 0.070702553 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
16 0.070607767 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations
17 0.067068823 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes
18 0.065882921 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
19 0.064881563 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
20 0.061183065 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation
topicId topicWeight
[(0, -0.203), (1, -0.025), (2, 0.019), (3, -0.028), (4, 0.076), (5, -0.149), (6, 0.168), (7, -0.025), (8, -0.019), (9, -0.079), (10, -0.039), (11, 0.0), (12, -0.017), (13, -0.006), (14, -0.007), (15, 0.008), (16, -0.068), (17, 0.116), (18, -0.015), (19, -0.225), (20, -0.044), (21, -0.022), (22, -0.106), (23, -0.037), (24, 0.013), (25, -0.043), (26, -0.098), (27, 0.064), (28, 0.159), (29, 0.006), (30, -0.032), (31, -0.009), (32, -0.028), (33, 0.095), (34, -0.057), (35, -0.003), (36, -0.09), (37, 0.006), (38, 0.034), (39, -0.001), (40, 0.102), (41, 0.016), (42, 0.02), (43, 0.053), (44, 0.064), (45, 0.132), (46, 0.103), (47, 0.065), (48, -0.092), (49, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.96015346 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
2 0.71764982 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
Author: Jaakko Luttinen, Alexander T. Ihler
Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.
3 0.66417742 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
4 0.64258313 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features
Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla
Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency. 1 Introduction and previous work Along the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: They achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc. Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n3 ). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]). These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudoinputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance. In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows to remove the constraint that the pseudo-inputs must remain within the same domain as input data. This added flexibility results in an increased performance and allows to encode prior knowledge about other domains where data can be represented more compactly. 1 2 Review of GPs for regression We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review. Assume we are given a training set with n samples D ≡ {xj , yj }n , where each D-dimensional j=1 input xj is associated to a scalar output yj . The regression task goal is, given a new input x∗ , predict the corresponding output y∗ based on D. 
The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f (x)+ε, and then sets a zero-mean1 GP prior on f (x), with covariance k(x, x ), and a zero-mean Gaussian prior on ε, with variance σ 2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f (x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE): 2 k(x, x ) = σ0 exp − 1 2 D d=1 (xd − xd )2 2 d , (1) 2 with hyperparameters σ0 (the latent function power) and { d }D (the length-scales, defining how d=1 rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP are 2 θ = {σ 2 , σ0 , { d }D }. We will omit the dependence on θ for the sake of clarity. d=1 If we evaluate the latent function at X = {xj }n , we obtain a set of latent variables following a j=1 joint Gaussian distribution p(f |X) = N (f |0, Kff ), where [Kff ]ij = k(xi , xj ). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case pGP (y∗ |x∗ , D) = N (y∗ |kf ∗ (Kff + σ 2 In )−1 y, σ 2 + k∗∗ − kf ∗ (Kff + σ 2 In )−1 kf ∗ ), (2) where y = [y1 , . . . , yn ] , kf ∗ = [k(x1 , x∗ ), . . . , k(xn , x∗ )] , and k∗∗ = k(x∗ , x∗ ). In is used to denote the identity matrix of size n. The O(n3 ) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n2 ) time each. These costs make standard GPs impractical for large data sets. To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence. 3 Inter-domain GPs In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones. 3.1 Definition Consider a real-valued GP f (x) with x ∈ RD and some deterministic real function g(x, z), with z ∈ RH . We define the following transformation: u(z) = f (x)g(x, z)dx. (3) RD There are many examples of transformations that take on this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of GP f (x), 1 We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model. 2 it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z). IDGPs arise when we jointly consider f (x) and u(z) as a single, “extended” GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. 
If the distribution of the original GP is f (x) ∼ GP(m(x), k(x, x )), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is m(z) = E[u(z)] = E[f (x)]g(x, z)dx = m(x)g(x, z)dx. RD RD The inter-domain and transformed-domain instances of the covariance function are: k(x, z ) = E[f (x)u(z )] = E f (x) f (x )g(x , z )dx = RD k(z, z ) = E[u(z)u(z )] = E f (x)g(x, z)dx RD = k(x, x )g(x , z )dx f (x )g(x , z )dx RD k(x, x )g(x, z)g(x , z )dxdx . RD (4) RD (5) RD Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows to pre-specify the desired input-domain covariance. The transformation is also more general: Any g(x, z) can be used. We can sample an IDGP at n input-domain points f = [f1 , f2 , . . . , fn ] (with fj = f (xj )) and m transformed-domain points u = [u1 , u2 , . . . , um ] (with ui = u(zi )). With the usual assumption of f (x) being a zero mean GP and defining Z = {zi }m , the joint distribution of these samples is: i=1 Kff Kfu f f 0, X, Z = N p , (6) u u Kfu Kuu with [Kff ]pq = k(xp , xq ), [Kfu ]pq = k(xp , zq ), [Kuu ]pq = k(zp , zq ), which allows to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains. 3.2 Sparse regression using inducing features In the standard regression setting, we are asked to perform inference about the latent function f (x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference. Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: D = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m n and assume that f (x) is well-described by the pseudo data set D, so that any two samples (either from the training or test set) fp and fq with p = q will be independent given xp , xq and D. With this simplifying assumption2 , the prior over f can be factorised as a product of marginals: n p(f |X, Z, u) ≈ p(fj |xj , Z, u). (7) j=1 2 Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conn ditional p(f |X, Z, u) ≈ q(f |X, Z, u) = q (f |xj , Z, u) and then choosing the set of funcj=1 j j tions {qj (·)}n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior j=1 KL(p(f |X, Z, u)p(u|Z)||q(f |X, Z, u)p(u|Z)), as noted in [9], Section 2.3.6. 3 Marginals are in turn obtained from (6): p(fj |xj , Z, u) = N (fj |kj K−1 u, λj ), where kj is the j-th uu row of Kfu and λj is the j-th element of the diagonal of matrix Λf = diag(Kf f − Kfu K−1 Kuf ). uu Operator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix. 
Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f (x): n p(f |X, Z) = p(fj |xj , Z, u)p(u|Z)du = N (f |0, Kfu K−1 Kuf + Λf ) uu p(f , u|X, Z)du ≈ j=1 Using this approximate prior, the posterior distribution for a test case is: pIDGP (y∗ |x∗ , D, Z) = N (y∗ |ku∗ Q−1 Kfu Λ−1 y, σ 2 + k∗∗ + ku∗ (Q−1 − K−1 )ku∗ ), y uu (8) Kfu Λ−1 Kfu y where we have defined Q = Kuu + and Λy = Λf + σ 2 In . The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m2 ) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance and can be interpreted as a degenerate GP plus heteroscedastic white noise. The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ can be expressed as 1 log p(y|X, Z, θ) = − [y Λ−1 y−y Λ−1 Kfu Q−1 Kfu Λ−1 y+log(|Q||Λy |/|Kuu |)+n log(2π)] y y y 2 which is also computable in O(m2 n) time. Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m2 n) cost per step. 3.3 On the choice of g(x, z) The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f (x) on the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f (x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either using prior knowledge about data to select {g(x, zi )}m or using a very general family of functions and letting model i=1 selection automatically choose the appropriate set. Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of “basis functions”. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function. In this work we use the same functional form for every feature, i.e. our function set is {g(x, zi )}m , i=1 but it is also possible to use sets with different functional forms for each inducing feature, i.e. {gi (x, zi )}m where each zi may even have a different size (dimension). In the sections below i=1 we will discuss different possible choices for g(x, z). 
3.3.1 Relation with Sparse GPs using pseudo-inputs The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to Fully Independent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since 4 the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. IDGP innovation with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set gSPGP (x, z) ≡ δ(x − z) where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of IDGP play the role of SPGP’s pseudo-inputs. 3.3.2 Relation with Sparse Multiscale GPs Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs: D 1 gSMGP (x, z) ≡ D d=1 2π(c2 d D kSMGP (x, z ) = exp − d=1 D kSMGP (z, z ) = exp − d=1 − 2) d exp − d=1 (xd − µd )2 2cd2 D µ c with z = 2 d cd2 d=1 (µd − µd )2 2(c2 + cd2 − 2 ) d d (xd − µd )2 2(c2 − 2 ) d d (9) (10) D c2 + d d=1 2 d cd2 − 2. d (11) With this approximation, each basis function has its own centre µ = [µ1 , µ2 , . . . , µd ] and its own length-scales c = [c1 , c2 , . . . , cd ] , whereas global length-scales { d }D are shared by all d=1 inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining kSMGP (·, ·) converge if and only if c2 ≥ 2 , ∀d , which suggests that other values, d d even if permitted in [11], should be avoided for the model to remain well defined. 3.3.3 Frequency Inducing Features GP If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP). Directly applying the Fourier transform is not possible because the target function is not square 2 integrable (it has constant power σ0 everywhere, so (5) does not converge). We will workaround this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window3 . Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by: gFIF (x, z) ≡ D 1 D d=1 2πc2 d D kFIF (x, z ) = exp − d=1 D kFIF (z, z ) = exp − d=1 exp − d=1 x2 d cos ω0 + 2c2 d x2 + c2 ωd2 d d cos ω0 + 2(c2 + 2 ) d d D d=1 exp − d=1 D + exp 3 c4 (ωd + ωd )2 d − 2(2c2 + 2 ) d d d=1 x d ωd with z = ω D 2 d d=1 c2 + d 2 d (13) c4 (ωd − ωd )2 d cos(ω0 − ω0 ) 2(2c2 + 2 ) d d D cos(ω0 + ω0 ) d=1 2 d 2c2 + d 2. d A mixture of m Gaussians could also be used as window without increasing the complexity order. 
5 (12) d=1 c2 ωd xd d c2 + 2 d d D 2 c2 (ωd + ωd2 ) d 2(2c2 + 2 ) d d D (14) The inducing features are ω = [ω0 , ω1 , . . . , ωd ] , where ω0 is the phase and the remaining components are frequencies along each dimension. In this model, both global length-scales { d }D and d=1 window length-scales {cd }D are shared, thus cd = cd . Instances (13) and (14) are induced by (12) d=1 using (4) and (5). 3.3.4 Time-Frequency Inducing Features GP Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it Time-Frequency Inducing Features GP (TFIFGP). It is defined by gTFIF (x, z) ≡ gFIF (x − µ, ω), with z = [µ ω ] . The implied inter-domain and transformed-domain instances of the covariance function are: D kTFIF (x, z ) = kFIF (x − µ , ω ) , kTFIF (z, z ) = kFIF (z, z ) exp − d=1 (µd − µd )2 2(2c2 + 2 ) d d FIFGP is trivially obtained by setting every centre to zero {µi = 0}m , whereas SPGP is obtained i=1 by setting window length-scales c, frequencies and phases {ω i }m to zero. If the window lengthi=1 scales were individually adjusted, SMGP would be obtained. While FIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice due to it having roughly twice as many hyperparameters, thus making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here. 4 Experiments In this section we will compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP on some large data sets, for the same number of inducing features/inputs and therefore, roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically big amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1). We use four large data sets: Kin-40k, Pumadyn-32nm4 (describing the dynamics of a robot arm, used with SPGP in [1]), Elevators and Pole Telecomm5 (related to the control of the elevators of an F16 aircraft and a telecommunications problem, and used in [12, 13, 14]). Input dimensions that remained constant throughout the training set were removed. Input data was additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually take discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the corresponding quantization noise variance (102 /12) as lower bound for the noise hyperparameter6 . n 1 2 2 2 Hyperparameters are initialised as follows: σ0 = n j=1 yj , σ 2 = σ0 /4, { d }D to one half of d=1 the range spanned by training data along each dimension. For SPGP, pseudo-inputs are initialised to a random subset of the training data, for FIFGP window size c is initialised to the standard deviation of input data, frequencies are randomly chosen from a zero-mean −2 -variance Gaussian d distribution, and phases are obtained from a uniform distribution in [0 . . . 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation. 
Denoting the output average over the training set as y and the predictive mean and variance for test sample y∗l as µ∗l and σ∗l respectively, we define the following quality measures: Normalized Mean Square Error (NMSE) (y∗l − µ∗l )2 / (y∗l − y)2 and Mean Negative Log-Probability (MNLP) 1 2 2 2 2 (y∗l − µ∗l ) /σ∗l + log σ∗l + log 2π , where · averages over the test set. 4 Kin-40k: 8 input dimensions, 10000/30000 samples for train/test, Pumadyn-32nm: 32 input dimensions, 7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that their error measure is actually one half of the Normalized Mean Square Error defined here. 5 Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators: 17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from http://www.liaad.up.pt/∼ltorgo/Regression/datasets.html 6 If unconstrained, similar plots are obtained; in particular, no overfitting is observed. 6 For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out the 32 input dimensions are relevant to the regression task, so it can be used as an ARD capabilities test. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to properly find a good solution, FIFGP and especially TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic. Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1]. Mean Negative Log−Probability Normalized Mean Squared Error 2.5 0.5 0.1 0.05 0.01 0.005 SPGP FIFGP TFIFGP Full GP on 10000 data points 0.001 25 50 100 200 300 500 750 SPGP FIFGP TFIFGP Full GP on 10000 data points 2 1.5 1 0.5 0 −0.5 −1 −1.5 1250 25 50 100 200 300 500 750 1250 Inducing features / pseudo−inputs (b) Kin-40k MNLP (semilog plot) Inducing features / pseudo−inputs (a) Kin-40k NMSE (log-log plot) 0.05 0.04 10 25 50 75 Mean Negative Log−Probability Normalized Mean Squared Error 0.2 SPGP FIFGP TFIFGP Full GP on 7168 data points 0.1 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2 100 Inducing features / pseudo−inputs (c) Pumadyn-32nm NMSE (log-log plot) SPGP FIFGP TFIFGP Full GP on 7168 data points 0.15 10 25 50 75 100 Inducing features / pseudo−inputs (d) Pumadyn-32nm MNLP (semilog plot) Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. 5 Conclusions and extensions In this work we have introduced IDGPs, which are able combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. 
Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially for high sparsity regimes. 7 0.25 0.2 0.15 0.1 10 25 50 100 250 Mean Negative Log−Probability Normalized Mean Squared Error −3.8 SPGP FIFGP TFIFGP Full GP on 8752 data points SPGP FIFGP TFIFGP Full GP on 8752 data points −4 −4.2 −4.4 −4.6 −4.8 500 750 1000 10 Inducing features / pseudo−inputs (a) Elevators NMSE (log-log plot) 25 50 100 250 500 750 1000 Inducing features / pseudo−inputs (b) Elevators MNLP (semilog plot) 0.2 0.15 0.1 0.05 0.04 0.03 0.02 0.01 10 25 50 100 250 500 Mean Negative Log−Probability Normalized Mean Squared Error 5.5 SPGP FIFGP TFIFGP Full GP on 10000 data points 4.5 4 3.5 3 2.5 1000 SPGP FIFGP TFIFGP Full GP on 10000 data points 5 10 25 50 100 250 500 1000 Inducing features / pseudo−inputs (d) Pole Telecomm MNLP (semilog plot) Inducing features / pseudo−inputs (c) Pole Telecomm NMSE (log-log plot) Figure 2: Performance of the compared methods on Elevators and Pole Telecomm. Choosing a transformed space for the inducing features enables to use domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates as a detaching of the functional form of the input-domain covariance and the set of basis functions used to express the posterior mean. IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to data are given preference over models that might approximate the full GP more closely. This, though rarely, might lead to harmful overfitting. To more faithfully approximate the full GP and avoid overfitting altogether, our proposal can be combined with the variational approach from [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance. We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains or even imposing constraints on the target function. Acknowledgments We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008- 02473/TEC, and by the Madrid Community under grant S-505/TIC/0223. 8 References [1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, 2006. [2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001. [3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003. [4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000. [5] L. Csat´ and M. Opper. 
[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64, 2009.
[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.
[12] G. Potgietera and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513–1532, 2007.
[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426–436. Springer, 2000.
[14] G. Potgietera and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002.
[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009.
[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.
5 0.62262797 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
6 0.60964471 43 nips-2009-Bayesian estimation of orientation preference maps
7 0.53207678 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
8 0.52666223 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
9 0.52053785 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
10 0.51637793 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
11 0.49121979 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
12 0.48439184 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
13 0.41603076 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
14 0.40602863 97 nips-2009-Free energy score space
15 0.38219723 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
16 0.3795338 152 nips-2009-Measuring model complexity with the prior predictive
17 0.37522069 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing
18 0.36399838 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning
19 0.36363241 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
20 0.3468622 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations
topicId topicWeight
[(7, 0.013), (24, 0.047), (25, 0.056), (35, 0.091), (36, 0.123), (39, 0.033), (58, 0.116), (61, 0.031), (71, 0.062), (80, 0.231), (81, 0.022), (86, 0.075), (91, 0.015)]
simIndex simValue paperId paperTitle
1 0.91324371 233 nips-2009-Streaming Pointwise Mutual Information
Author: Benjamin V. Durme, Ashwin Lall
Abstract: Recent work has led to the ability to perform space efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data. 1
same-paper 2 0.84252506 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
3 0.69014853 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
Author: Marius Leordeanu, Martial Hebert, Rahul Sukthankar
Abstract: Graph matching and MAP inference are essential problems in computer vision and machine learning. We introduce a novel algorithm that can accommodate both problems and solve them efficiently. Recent graph matching algorithms are based on a general quadratic programming formulation, which takes into consideration both unary and second-order terms reflecting the similarities in local appearance as well as in the pairwise geometric relationships between the matched features. This problem is NP-hard, therefore most algorithms find approximate solutions by relaxing the original problem. They find the optimal continuous solution of the modified problem, ignoring during optimization the original discrete constraints. Then the continuous solution is quickly binarized at the end, but very little attention is put into this final discretization step. In this paper we argue that the stage in which a discrete solution is found is crucial for good performance. We propose an efficient algorithm, with climbing and convergence properties, that optimizes the quadratic score in the discrete domain, and it gives excellent results either by itself or by starting from the solution returned by any graph matching algorithm. In practice it outperforms state-of-the-art graph matching algorithms and it also significantly improves their performance if used in combination. When applied to MAP inference, the algorithm is a parallel extension of Iterated Conditional Modes (ICM) with climbing and convergence properties that make it a compelling alternative to the sequential ICM. In our experiments on MAP inference our algorithm proved its effectiveness by significantly outperforming [13], ICM and Max-Product Belief Propagation. 1
4 0.68564284 211 nips-2009-Segmenting Scenes by Matching Image Composites
Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.
5 0.68431801 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
Author: Jaakko Luttinen, Alexander T. Ihler
Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. The high computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform state-of-the-art reconstruction systems.
6 0.68035442 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
7 0.67937112 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
8 0.67858452 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
9 0.67762583 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
10 0.67733419 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
11 0.67732704 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
12 0.67658561 113 nips-2009-Improving Existing Fault Recovery Policies
13 0.67482001 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
14 0.67373425 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
15 0.67225307 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
16 0.67152107 97 nips-2009-Free energy score space
17 0.67137712 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
18 0.6710971 3 nips-2009-AUC optimization and the two-sample problem
19 0.67087144 104 nips-2009-Group Sparse Coding
20 0.67054468 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies