nips nips2004 nips2004-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. [sent-6, score-0.205]
2 In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. [sent-7, score-0.335]
3 This step is nonparametric, in that it does not require a parametric form of covariance function. [sent-8, score-0.582]
4 In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. [sent-9, score-0.937]
5 We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. [sent-10, score-0.334]
6 For example, in the user preference modelling problem we will consider later, learning a preference model would amount to fitting a model based on only 20 samples of a user’s preference data. [sent-12, score-0.572]
7 Fortunately, there are situations where individual data sets are small, but data from similar scenarios can be obtained. [sent-13, score-0.171]
8 Returning to the example of preference modelling, data for many different users are typically available. [sent-14, score-0.198]
9 Typically, such problems have been handled by either mixed effects models or hierarchical Bayesian modelling. [sent-16, score-0.188]
10 In this paper we present a novel approach to hierarchical Bayesian modelling in the context of Gaussian process regression, with an application to recommender systems. [sent-17, score-0.511]
11 Here, hierarchical Bayesian modelling essentially means to learn the mean and covariance function of the Gaussian process. [sent-18, score-0.922]
12 In a first step, a common collaborative kernel matrix is learned from the data via a simple and efficient EM algorithm. [sent-19, score-0.685]
13 This circumvents the problem of kernel design, as no parametric form of kernel function is required here. [sent-20, score-0.385]
14 Thus, this form of learning a covariance matrix is also suited for problems with complex covariance structure (e. [sent-21, score-1.128]
15 A portion of the learned covariance matrix can be explained by the input features and, thus, generalized to new objects via a content-based kernel smoother. [sent-24, score-1.078]
16 Thus, in a second step, we generalize the covariance matrix (learned by the EM-algorithm) to new items using a generalized Nyström method. [sent-25, score-0.796]
17 The result is a complex content-based kernel which itself is a weighted superposition of simple smoothing kernels. [sent-26, score-0.276]
18 This second part could also be applied to other situations where one needs to extrapolate a covariance matrix on a finite set (e. [sent-27, score-0.684]
19 2 casts Gaussian process regression in a hierarchical Bayesian framework, and shows the EM updates to learn the covariance matrix in the first step. [sent-32, score-0.865]
20 4, before applying the proposed methods as a recommender system for images in Sec. [sent-36, score-0.188]
21 1 Previous Work In statistics, modelling data from related scenarios is typically done via mixed effects models or hierarchical Bayesian (HB) modelling [6]. [sent-40, score-0.679]
22 In HB, parameters of models for individual scenarios (e. [sent-41, score-0.171]
23 users in recommender systems) are assumed to be drawn from a common (hyper)prior distribution, allowing the individual models to interact and regularize each other. [sent-43, score-0.306]
24 Based on sparse approximations of GPs, a more general GP multi-task learner with parametric covariance functions has been presented in [7]. [sent-47, score-0.551]
25 In contrast, the approach presented in this paper only considers covariance matrices (and is thus non-parametric) in the first step. [sent-48, score-0.54]
26 Only in a second extrapolation step, kernel smoothing leads to predictions based on a covariance function that is a data-driven combination of simple kernel functions. [sent-49, score-1.045]
27 In order to analyze this data in a hierarchical Bayesian way, we assume that the data for each scenario is a noisy sample of a Gaussian process (GP) with unknown mean and covariance function. [sent-59, score-0.924]
28 We assume that mean and covariance function are shared across different scenarios. [sent-60, score-0.613]
29 1 In the first modelling step presented in this section, we consider transductive learning (“labelling a partially labelled data set”), that is, we are interested in the model’s behavior only on points X, with X = X^1 ∪ · · · ∪ X^M and cardinality N = |X|. [sent-61, score-0.317]
30 This situation is relevant for most collaborative filtering applications. [sent-62, score-0.266]
31 This reduces the whole “infinite dimensional” Gaussian process to its finite dimensional projection on points X, which is an N -variate Gaussian distribution with covariance matrix K and mean vector m. [sent-64, score-0.798]
32 Coming back to the user modelling problem mentioned above, this means that at least some items have been rated by more than one user. [sent-66, score-0.476]
33 Thus, our first modelling step focuses on directly learning the covariance matrix K and the mean vector m. (Footnote 1: Alternative HB approaches for collaborative filtering, like that discussed in [5], assume that model weights are drawn from a shared Gaussian distribution.) [sent-67, score-1.143]
34 Following the hierarchical Bayesian assumption, the data observed in each scenario is thus a partial sample from N (y | m, K + σ 2 1), where 1 denotes the unit matrix. [sent-72, score-0.335]
35 We assume a Gaussian likelihood p(y i | f i ) with diagonal covariance matrix σ 2 1. [sent-74, score-0.622]
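To make the setup above concrete, the following Python/NumPy sketch fixes the data structures that the EM algorithm below operates on: a fixed set of N input points, M scenarios, an index set I(i) of observed points per scenario, and a partial observation vector y^i. All names and sizes here (n_points, n_scenarios, the random index sets, the placeholder ratings) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 50      # N: size of the fixed input set X
n_scenarios = 20   # M: number of scenarios (e.g. users)
sigma2 = 0.1       # assumed noise variance sigma^2

# Index sets I(i): which of the N points each scenario has observed/rated
I = [np.sort(rng.choice(n_points, size=15, replace=False))
     for _ in range(n_scenarios)]

# Partial observation vectors y^i (random placeholders for illustration)
y = [rng.standard_normal(len(idx)) for idx in I]

# Initial mean vector m and covariance matrix K (a crude identity start)
m = np.zeros(n_points)
K = np.eye(n_points)
```

The E-step and M-step sketches further below operate directly on these quantities (I, y, m, K, sigma2).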
36 1 EM Learning For the above hierarchical Bayesian model, Eq. [sent-76, score-0.161]
37 2 The conjugate prior for mean m and covariance K of a multivariate Gaussian is the so-called Normal-Wishart distribution [6], which decomposes into the product of an inverse Wishart distribution for K and a Normal distribution for m, p(m, K) = N(m | ν, η^{-1} K) Wi^{-1}(K | α, U). [sent-79, score-0.707]
38 (3) That is, the prior for the Gram matrix K is given by an inverse Wishart distribution with scalar parameter α > 1/2(N − 1) and U being a symmetric positive-definite matrix. [sent-80, score-0.188]
39 Given the covariance matrix K, m is Gaussian distributed with mean ν and covariance η −1 K, where η is a positive scalar. [sent-81, score-1.195]
40 The parameters can be interpreted in terms of an equivalent data set for the mean (this data set has size A, with A = η, and mean µ = ν) and a data set for the covariance that has size B, with α = (B + N)/2, and covariance S, U = (B/2)S. [sent-82, score-1.146]
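As a small illustration of this parameterization, the sketch below maps the equivalent-sample-size quantities described above to the Normal-Wishart parameters: A = η with mean ν for the mean, and B with covariance S for the covariance, giving α = (B + N)/2 and U = (B/2)S. The numeric values are placeholders, not values from the paper.

```python
import numpy as np

n_points = 50                 # N
A = 1.0                       # equivalent sample size for the mean (eta = A)
B = 5.0                       # equivalent sample size for the covariance
nu = np.zeros(n_points)       # prior mean vector (mu = nu)
S = np.eye(n_points)          # prior covariance of the equivalent data set

eta = A
alpha = (B + n_points) / 2.0  # inverse-Wishart shape parameter
U = (B / 2.0) * S             # inverse-Wishart scale matrix
```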
41 K_{I(i),I(i)} denotes the square submatrix of K that corresponds to points I(i), that is, the covariance matrix for points in the i-th scenario. [sent-89, score-0.78]
42 By K_{·,I(i)} we denote the covariance matrix of all N points versus those in the i-th scenario. [sent-91, score-0.687]
43 1 E-step: In the E-step, one first computes f̃^i, the expected value of the functional values on all N points for each scenario i. [sent-95, score-0.251]
44 The expected value is given by the standard equations for the predictive mean of Gaussian process models, where the covariance functions are replaced by corresponding sub-matrices of the current estimate for K: f̃^i = K_{·,I(i)} (K_{I(i),I(i)} + σ²1)^{-1} (y^i − m_{I(i)}) + m, i = 1, . . . , M. (4) [sent-96, score-0.672]
45 Also, covariances between all pairs of points are estimated, based on the predictive covariance for the GP models (ᵀ denotes matrix transpose): C̃^i = K − K_{·,I(i)} (K_{I(i),I(i)} + σ²1)^{-1} K_{·,I(i)}ᵀ, i = 1, . . . , M. [sent-100, score-0.77]
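A minimal NumPy sketch of this E-step is given below; it is a direct transcription of the two predictive equations above (predictive mean over all N points and predictive covariance), assuming the data structures from the earlier sketch (index sets idx, partial observations y_i, current estimates m, K, noise variance sigma2). It is an illustration, not the authors' code.

```python
import numpy as np

def e_step(K, m, sigma2, I, y):
    """For each scenario i, compute the predictive mean f_tilde[i] over all
    N points (Eq. 4) and the corresponding predictive covariance C_tilde[i]."""
    f_tilde, C_tilde = [], []
    for idx, y_i in zip(I, y):
        K_II = K[np.ix_(idx, idx)]            # K_{I(i),I(i)}
        K_aI = K[:, idx]                      # K_{.,I(i)}
        M_inv = np.linalg.inv(K_II + sigma2 * np.eye(len(idx)))
        f_tilde.append(K_aI @ (M_inv @ (y_i - m[idx])) + m)
        C_tilde.append(K - K_aI @ M_inv @ K_aI.T)
    return f_tilde, C_tilde
```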
46 2 M-step In the M-step, the vector of mean values m, the covariance matrix K and the noise variance σ 2 are being updated. [sent-107, score-0.689]
47 i=1 An intuitive explanation of the M-step is as follows: The new mean m is a weighted combination of the prior mean, weighted by the equivalent sample size, and the predictive mean. [sent-109, score-0.165]
48 The first term is typically irrelevant; it is a result of the coupling of the Gaussian and the inverse Wishart prior distributions via K. [sent-111, score-0.169]
49 The second term contains the prior covariance matrix, again weighted by the equivalent sample size. [sent-112, score-0.579]
50 Finally, the fourth term gives a correction to compensate for the fact that the functional values f^i are only estimates, so that the empirical covariance alone would be too small. [sent-114, score-0.606]
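The M-step equations themselves are not reproduced in this extraction, so the sketch below implements only a simplified, maximum-likelihood-style version of the update described in words above: the new mean is the average predictive mean, and the new covariance is the empirical covariance of the f̃^i plus the correction terms C̃^i. The prior contributions weighted by η, ν, α and U are deliberately omitted and would have to be added for the paper's full update.

```python
import numpy as np

def m_step_simplified(f_tilde, C_tilde):
    """Simplified M-step without the Normal-Wishart prior terms: average the
    predictive means, then take the empirical covariance of the f_tilde plus
    the correction C_tilde (compensating for f_tilde being only an estimate)."""
    M = len(f_tilde)
    F = np.stack(f_tilde)        # shape (M, N)
    m_new = F.mean(axis=0)
    D = F - m_new
    K_new = (D.T @ D + sum(C_tilde)) / M
    return m_new, K_new
```

Alternating e_step and m_step_simplified for a handful of iterations mirrors the behaviour reported later, where roughly 4-6 EM steps suffice for K to stabilize.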
51 1, one can easily and efficiently learn a covariance matrix K and mean vector m from data obtained in different related scenarios. [sent-117, score-0.727]
52 This would, for example, be of interest in a collaborative filtering application with a fixed set of items. [sent-121, score-0.266]
53 In this section we describe how the covariance can be generalized to new inputs z ∈ X. [sent-122, score-0.585]
54 In order to generalize the learned covariance matrix, we employ a kernel smoother with an auxiliary kernel function r(·, ·) that takes a pair of content features as input. [sent-124, score-1.086]
55 As a constraint, we need to guarantee that the derived kernel is positive definite, which is why straightforward interpolation schemes cannot readily be applied. [sent-125, score-0.17]
56 In contrast to Nyström, the extrapolating smoothing kernel is not known in our setting, and we employ a generic smoothing kernel r(·, ·) instead [12]. [sent-128, score-0.553]
57 Let K = U Λ Uᵀ be the eigendecomposition of the covariance matrix K, with a diagonal matrix of eigenvalues Λ and orthonormal eigenvectors U. [sent-129, score-0.773]
58 We now approximate the i-th scaled eigenvector v_i by a Gaussian process with covariance function r(·, ·) and obtain as an approximation of the scaled eigenfunction φ_i(w) = Σ_{j=1}^{N} r(w, x_j) b_{i,j} (6), with weights b_i = (b_{i,1}, . [sent-131, score-0.698]
59 R denotes the Gram matrix for the smoothing kernel on all N points. [sent-135, score-0.38]
60 Based on the approximate scaled eigenfunctions, the resulting kernel function is simply l(w, z) = Σ_i φ_i(w) φ_i(z) = r(w)ᵀ (R + λI)^{-1} K (R + λI)^{-1} r(z), where r(w) denotes the vector (r(w, x_1), . . . , r(w, x_N)). [sent-137, score-0.227]
61 R (resp. L) are the Gram matrices at the training data points X for kernel function r (resp. l). [sent-143, score-0.269]
62 λ is a tuning parameter that determines which proportion of K is explained by the content kernel. [sent-145, score-0.171]
63 With λ → ∞, l(w, z) → 0 and no portion of K is explained by the content kernel. [sent-147, score-0.172]
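A sketch of this generalized Nyström construction is given below: given the learned matrix K, the training inputs X (content features), an auxiliary smoothing kernel r, and the regularization parameter λ, it returns a callable content-based kernel l(w, z) implementing the expression l(w, z) = r(w)ᵀ(R + λI)⁻¹K(R + λI)⁻¹r(z) above. The auxiliary kernel r is an assumption of the sketch (any positive-definite content kernel, e.g. a squared exponential on image features, could be used).

```python
import numpy as np

def generalized_nystrom_kernel(K, X, r, lam):
    """Build l(w, z) = r(w)^T (R + lam*I)^{-1} K (R + lam*I)^{-1} r(z), where
    R is the Gram matrix of the auxiliary kernel r on the N training inputs X
    and r(w) = (r(w, x_1), ..., r(w, x_N))."""
    n = len(X)
    R = np.array([[r(a, b) for b in X] for a in X])
    A = R + lam * np.eye(n)
    B = np.linalg.solve(A, np.linalg.solve(A, K).T)  # (R+lam*I)^-1 K (R+lam*I)^-1

    def l(w, z):
        r_w = np.array([r(w, x) for x in X])
        r_z = np.array([r(z, x) for x in X])
        return r_w @ B @ r_z

    return l
```

Consistent with the limiting behaviour noted above, B (and hence l) vanishes as λ → ∞, while small λ lets the content kernel explain a larger portion of K.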
64 4 Similarly, one can build a kernel smoother to extrapolate from the mean vector m to an approximate mean function m(·). [sent-149, score-0.404]
65 The prediction for a new object v in scenario i thus becomes f̂^i(v) = m(v) + Σ_{j∈I(i)} l(v, x_j) β^i_j (8), with weights β^i given by β^i = (K_{I(i),I(i)} + σ²I)^{-1} (y^i − m_{I(i)}). [sent-150, score-0.18]
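Under the same assumptions, the prediction rule (8) can be sketched as follows: given the extrapolated kernel l (for example from the previous sketch), a mean function m_fn built with the analogous smoother for m, the learned matrix K, and the items rated by one user, it returns the predicted rating for a new item v. All argument names are illustrative.

```python
import numpy as np

def predict_new_item(v, l, m_fn, X, K, idx, y_i, m, sigma2):
    """Eq. (8): f_i(v) = m(v) + sum_{j in I(i)} l(v, x_j) * beta_j, with
    beta = (K_{I(i),I(i)} + sigma^2 I)^{-1} (y^i - m_{I(i)})."""
    K_II = K[np.ix_(idx, idx)]
    beta = np.linalg.solve(K_II + sigma2 * np.eye(len(idx)), y_i - m[idx])
    return m_fn(v) + sum(l(v, X[j]) * b for j, b in zip(idx, beta))
```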
66 It is important to note that l has a much richer structure than the auxiliary kernel r. [sent-151, score-0.208]
67 By expanding the expression for l, one can see that l amounts to a data-dependent covariance function that can be written as a superposition of kernels r: l(v, w) = Σ_{i=1}^{N} r(x_i, v) a^w_i (9), with input-dependent weights a^w = (R + λI)^{-1} K (R + λI)^{-1} r_w. [sent-152, score-0.624]
68 4 Experiments We first illustrate the process of covariance matrix learning on a small toy example: Data is generated by sampling from a Gaussian process with the nonstationary “neural network covariance function” [11]. [sent-153, score-1.382]
69 We consider M = 20 scenarios, where each scenario has observations on a random subset X i of X, with N i ≈ 0. [sent-156, score-0.177]
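The toy data generation can be sketched as follows, using the nonstationary “neural network” covariance function of [11] for scalar inputs, k(x, z) = (2/π) arcsin(2(σ₀ + σxz) / sqrt((1 + 2(σ₀ + σx²))(1 + 2(σ₀ + σz²)))), drawing one GP sample per scenario and keeping a noisy random subset of it. The hyperparameters, grid size, and subset size below are guesses for illustration, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def nn_cov(x, z, sigma0=1.0, sigma=10.0):
    """Nonstationary 'neural network' covariance function, scalar inputs."""
    num = 2.0 * (sigma0 + sigma * x * z)
    den = np.sqrt((1 + 2 * (sigma0 + sigma * x * x)) *
                  (1 + 2 * (sigma0 + sigma * z * z)))
    return (2.0 / np.pi) * np.arcsin(num / den)

X = np.linspace(-1.0, 1.0, 50)                      # fixed input set
K_true = np.array([[nn_cov(a, b) for b in X] for a in X])

sigma2, M = 0.01, 20
scenarios = []
for _ in range(M):
    # one GP draw per scenario (small jitter for numerical stability)
    f = rng.multivariate_normal(np.zeros(len(X)), K_true + 1e-9 * np.eye(len(X)))
    idx = np.sort(rng.choice(len(X), size=20, replace=False))
    y_i = f[idx] + np.sqrt(sigma2) * rng.standard_normal(len(idx))
    scenarios.append((idx, y_i))
```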
70 1(c) was used both as the initial value for K and for the prior covariance S in Eq. [sent-165, score-0.549]
71 The goal is to predict individual users’ preferences for a large collection of art images5, where 3 Note that, even if the true interpolating kernel were known, i. [sent-171, score-0.378]
72 , r = k, and with λ = 0, we obtain l(w, z) = k(w)ᵀ K^{-1} k(z), which is the approximate kernel obtained with Nyström. [sent-173, score-0.17]
73 4 A related form of kernel matrix extrapolation has been recently proposed by [10]. [sent-174, score-0.094]
74 Figure 1: Example to illustrate covariance matrix learning via EM; panels show (a) training data, (b) true covariance matrix, (c) initial covariance matrix, (d) covariance matrix learned via EM. [sent-180, score-2.21]
75 The data shown in (a) was drawn from a Gaussian process with a nonstationary “neural network” covariance function. [sent-181, score-0.654]
76 When initialized with the stationary matrix shown in (c), EM learning resulted in the covariance matrix shown in (d). [sent-182, score-0.738]
77 Comparing the learned matrix (d) with the true matrix (b) shows that the nonstationary structure is captured well. Each user rated a random subset out of a total of 642 paintings, with ratings “like” (+1), “dislike” (−1), or “not sure” (0). [sent-183, score-0.716]
78 In total, ratings from M = 190 users were collected, where each user had rated 89 paintings on average. [sent-184, score-0.542]
79 2(a) shows ROC curves for collaborative filtering when preferences of unrated items within the set of 642 images are predicted. [sent-187, score-0.524]
80 (4), “GP with EM covariance”) is compared with a collaborative approach using Pearson correlation [3] (“Collaborative Filtering”) and an alternative nonparametric hierarchical Bayesian approach [13] (“Hybrid Filter”). [sent-189, score-0.459]
81 All algorithms are evaluated in a 10-fold cross validation scheme (repeated 10 times), where we assume that ratings for 20 items are known for each test user. [sent-190, score-0.178]
82 The figure shows that our approach is considerably better than collaborative filtering with Pearson correlation and even gains a small advantage over the hybrid filtering technique. [sent-193, score-0.295]
83 Note that the EM algorithm converged6 very quickly, requiring about 4–6 EM steps to learn the covariance matrix K. [sent-194, score-0.66]
84 (a) Transductive methods (b) Inductive methods. Figure 2: ROC curves of different methods for predicting user preferences for art images. In the inductive setting, predictions for items outside the training set are to be made (sometimes referred to as the “new item problem”). [sent-198, score-0.33]
85 (8), “GP with Generalized Nyström”)7, and when predicting user preferences from image features via an SVM with squared exponential kernel (“SVM content-based filtering”). [sent-200, score-0.443]
86 It is apparent that the new approach with the learned kernel is superior to the standard SVM approach. [sent-201, score-0.236]
87 The low-level content features are only very poor indicators for the high level concept “liking an art image”, and inductive approaches in general need to rely on content-dependent collaborative filtering. [sent-203, score-0.529]
88 The purely content-independent collaborative effect, which is exploited in the transductive setting, cannot be generalized to new items. [sent-204, score-0.449]
89 The purely content-independent collaborative effect can be viewed as correlated noise in our model. [sent-205, score-0.299]
90 5 Summary and Conclusions This article introduced a novel method of learning Gaussian process covariance functions from multi-task learning problems, using a hierarchical Bayesian framework. [sent-206, score-0.711]
91 In the hierarchical framework, the GP models for individual scenarios borrow strength from each other via a common prior for mean and covariance. [sent-207, score-0.553]
92 The learning task was solved in two steps: First, an EM algorithm was used to learn the shared mean vector and covariance matrix on a fixed set of points. [sent-208, score-0.767]
93 In a second step, the learned covariance matrix was generalized to new points via a generalized form of the Nyström method. [sent-209, score-0.978]
94 Our initial experiments, where we use the method as a recommender system for art images, showed very promising results. [sent-210, score-0.248]
95 Also, in our approach, a clear distinction is made between content-dependent and content-independent collaborative filtering. [sent-211, score-0.266]
96 in recommender systems for textual items such as news articles), and allow an even better prediction of user preferences. [sent-214, score-0.388]
97 7 To obtain the kernel r, we fitted GP user preference models for a few randomly chosen users, with individual ARD weights for each input dimension in a squared exponential kernel. [sent-216, score-0.483]
98 Using the Nyström method to speed up kernel machines. [sent-292, score-0.17]
99 Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. [sent-308, score-0.494]
100 (2), we treat the functional values f i in each scenario i as the unknown variables. [sent-321, score-0.186]
wordName wordTfidf (topN-words)
[('covariance', 0.506), ('collaborative', 0.266), ('nystr', 0.198), ('em', 0.177), ('kernel', 0.17), ('gp', 0.166), ('hierarchical', 0.161), ('recommender', 0.156), ('modelling', 0.15), ('scenario', 0.146), ('user', 0.137), ('paintings', 0.125), ('scenarios', 0.124), ('matrix', 0.116), ('ltering', 0.105), ('nonstationary', 0.104), ('users', 0.103), ('bayesian', 0.1), ('content', 0.098), ('preference', 0.095), ('items', 0.095), ('extrapolation', 0.094), ('rated', 0.094), ('art', 0.092), ('hb', 0.087), ('gaussian', 0.085), ('ratings', 0.083), ('wishart', 0.081), ('extrapolating', 0.081), ('ard', 0.081), ('generalized', 0.079), ('inductive', 0.073), ('transductive', 0.071), ('preferences', 0.069), ('via', 0.067), ('mean', 0.067), ('learned', 0.066), ('smoothing', 0.066), ('tresp', 0.065), ('ki', 0.065), ('points', 0.065), ('anton', 0.062), ('extrapolate', 0.062), ('unrated', 0.062), ('roc', 0.06), ('scaled', 0.057), ('lp', 0.057), ('predictive', 0.055), ('schwaighofer', 0.054), ('nonstationarity', 0.054), ('unpublished', 0.05), ('multitask', 0.05), ('pearson', 0.05), ('gram', 0.049), ('individual', 0.047), ('parametric', 0.045), ('tted', 0.045), ('explained', 0.044), ('process', 0.044), ('borrow', 0.044), ('aw', 0.044), ('recommendation', 0.044), ('prior', 0.043), ('shared', 0.04), ('functional', 0.04), ('superposition', 0.04), ('predictions', 0.039), ('learn', 0.038), ('smoother', 0.038), ('auxiliary', 0.038), ('eigenfunctions', 0.038), ('yu', 0.037), ('engine', 0.037), ('eigenvectors', 0.035), ('toy', 0.034), ('matrices', 0.034), ('nite', 0.034), ('weights', 0.034), ('purely', 0.033), ('images', 0.032), ('nonparametric', 0.032), ('step', 0.031), ('observations', 0.031), ('multivariate', 0.031), ('conjugate', 0.031), ('term', 0.03), ('portion', 0.03), ('penalized', 0.029), ('williams', 0.029), ('hybrid', 0.029), ('inverse', 0.029), ('proportion', 0.029), ('illustrate', 0.028), ('denotes', 0.028), ('mixed', 0.027), ('kaufmann', 0.027), ('conjoint', 0.027), ('harchaoui', 0.027), ('vishwanathan', 0.027), ('gps', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.
2 0.20160221 50 nips-2004-Dependent Gaussian Processes
Author: Phillip Boyle, Marcus Frean
Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1
3 0.14584652 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
4 0.13550103 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
Author: Olivier Chapelle, Za\
Abstract: Choice-based conjoint analysis builds models of consumer preferences over products with answers gathered in questionnaires. Our main goal is to bring tools from the machine learning community to solve this problem more efficiently. Thus, we propose two algorithms to quickly and accurately estimate consumer preferences. 1
5 0.13302431 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.
6 0.12294733 113 nips-2004-Maximum-Margin Matrix Factorization
7 0.11886951 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
8 0.11884212 148 nips-2004-Probabilistic Computation in Spiking Populations
9 0.11842457 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
10 0.098818287 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
11 0.095349103 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
12 0.093971558 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
13 0.092136726 136 nips-2004-On Semi-Supervised Classification
14 0.091194764 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
15 0.083852448 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
16 0.081510417 166 nips-2004-Semi-supervised Learning via Gaussian Processes
17 0.079279214 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
18 0.075473502 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
19 0.074431457 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
20 0.073871031 42 nips-2004-Computing regularization paths for learning multiple kernels
topicId topicWeight
[(0, -0.258), (1, 0.05), (2, -0.079), (3, 0.028), (4, -0.114), (5, -0.074), (6, -0.126), (7, -0.064), (8, 0.135), (9, -0.039), (10, 0.068), (11, 0.066), (12, 0.123), (13, 0.026), (14, 0.112), (15, 0.015), (16, 0.047), (17, 0.014), (18, 0.223), (19, -0.138), (20, -0.021), (21, 0.065), (22, -0.068), (23, -0.021), (24, 0.086), (25, -0.078), (26, 0.205), (27, 0.136), (28, 0.1), (29, 0.079), (30, -0.003), (31, -0.048), (32, 0.122), (33, 0.003), (34, 0.04), (35, -0.008), (36, -0.233), (37, -0.087), (38, -0.037), (39, 0.009), (40, -0.045), (41, 0.107), (42, -0.055), (43, 0.015), (44, -0.006), (45, 0.052), (46, 0.081), (47, 0.002), (48, 0.079), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.97298336 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.
2 0.78182518 50 nips-2004-Dependent Gaussian Processes
Author: Phillip Boyle, Marcus Frean
Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1
3 0.71198374 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). 
Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. 
If instead of $\mathbb{R}^D$ we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction $k_p(x, x') = \sum_{n\in\mathbb{Z}} k(x - x' + 2n\pi)$. This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step-function approximation in Fourier space would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel).

We now show that the result of Proposition 1 is asymptotically exact for $\rho\to\infty$, and calculate the leading corrections for finite $\rho$. The scaling of the width of the EK as $1/s_c$ suggests writing $h_{SE}(r) = (2\pi s_c)^D g(2\pi s_c r)$. Then from equation 4 and using the definition of $s_c$,
$$g(z) = \frac{z}{s_c(2\pi s_c)^D}\int_0^\infty \left(\frac{2\pi s_c s}{z}\right)^{\nu+1}\frac{J_\nu(zs/s_c)}{1 + \exp[2\pi^2\ell^2(s^2 - s_c^2)]}\, ds = z\int_0^\infty \left(\frac{u}{2\pi z}\right)^{\nu+1}\frac{J_\nu(zu)}{1 + \exp[2\pi^2\ell^2 s_c^2(u^2 - 1)]}\, du, \qquad (5)$$
where we have rescaled $s = s_c u$ in the second step. The value of $s_c$, and hence $\rho$, now enters only in the exponential via $a = 2\pi^2\ell^2 s_c^2$. For $a\to\infty$, the exponential tends to zero for $u < 1$ and to infinity for $u > 1$. The factor $1/[1+\exp(\dots)]$ is therefore a step function $\Theta(1-u)$ in the limit and Proposition 1 becomes exact, with $g_\infty(z) \equiv \lim_{a\to\infty} g(z) = (2\pi z)^{-D/2}J_{D/2}(z)$.

To calculate corrections to this, one uses that for large but finite $a$ the difference $\Delta(u) = \{1+\exp[a(u^2-1)]\}^{-1} - \Theta(1-u)$ is non-negligible only in a range of order $1/a$ around $u = 1$. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give
$$g(z) = g_\infty(z) + z\sum_{k=0}^\infty \frac{I_k}{k!}\,\frac{d^k}{du^k}\!\left[\left(\frac{u}{2\pi z}\right)^{\nu+1} J_\nu(zu)\right]_{u=1}, \qquad I_k = \int_0^\infty \Delta(u)(u-1)^k\, du.$$
The problem is thus reduced to calculating the integrals $I_k$. Setting $u = 1 + v/a$ one has
$$a^{k+1}I_k = \int_{-a}^0\left[\frac{1}{1+\exp(v^2/a+2v)} - 1\right] v^k\, dv + \int_0^\infty \frac{v^k}{1+\exp(v^2/a+2v)}\, dv = \int_0^a \frac{(-1)^{k+1}v^k}{1+\exp(-v^2/a+2v)}\, dv + \int_0^\infty \frac{v^k}{1+\exp(v^2/a+2v)}\, dv.$$
In the first integral, extending the upper limit to $\infty$ gives an error that is exponentially small in $a$. Expanding the remaining $1/a$-dependence of the integrand one then gets, to leading order in $1/a$, $I_0 = c_0/a^2$, $I_1 = c_1/a^2$, while all $I_k$ with $k\geq 2$ are smaller by at least $1/a^2$. The numerical constants are $-c_0 = c_1 = \pi^2/24$. This gives, using that $(d/dz)[z^{\nu+1}J_\nu(z)] = z^\nu J_\nu(z) + z^{\nu+1}J_{\nu-1}(z) = (2\nu+1)z^\nu J_\nu(z) - z^{\nu+1}J_{\nu+1}(z)$:

Proposition 2. The equivalent kernel for the squared-exponential kernel is given for large $\rho$ by $h_{SE}(r) = (2\pi s_c)^D g(2\pi s_c r)$ with
$$g(z) = \frac{1}{(2\pi z)^{D/2}}\left\{J_{D/2}(z) + \frac{z}{a^2}\Big[(c_0 + c_1(D-1))J_{D/2-1}(z) - c_1 z J_{D/2}(z)\Big]\right\} + O\!\left(\frac{1}{a^4}\right).$$
For e.g. $D = 1$ this becomes $g(z) = \pi^{-1}\{\sin(z)/z - \pi^2/(24a^2)[\cos(z) + z\sin(z)]\}$. Here and in general, by comparing the second part of the $1/a^2$ correction with the leading order term, one estimates that the correction is of relative size $z^2/a^2$. It will therefore provide a useful improvement as long as $z = 2\pi s_c r < a$; for larger $z$ the expansion in powers of $1/a$ becomes a poor approximation because the correction terms (of all orders in $1/a$) are comparable to the leading order.

2.1 Accuracy of the approximation

To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in $\mathbb{R}^D$ with a sampling density $\rho_{grid}$. For making predictions at the grid points we obtain the smoother matrix $K(K + \sigma^2_{grid}I)^{-1}$, where $\sigma^2_{grid} = \sigma^2\rho_{grid}/\rho$ (see footnote 1), as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a $y$ vector which is zero at all points except one.
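The $D = 1$ case of Proposition 2 is easy to check numerically: specializing equation 5 to $D = 1$ (where $\nu = -1/2$ and the Bessel factor reduces to a cosine) gives $g(z) = \pi^{-1}\int_0^\infty \cos(zu)\,\{1+\exp[a(u^2-1)]\}^{-1}\, du$, which the sketch below compares against the Proposition 1 and Proposition 2 forms. This is an illustrative check only; the quadrature settings and the value $a = 5.67$ (quoted in the text for $\rho = 100$) are assumptions of the example.

```python
import numpy as np

def g_numeric(z, a, u_max=3.0, n_u=100000):
    """g(z) in D = 1 via quadrature of equation 5, where the integrand
    reduces to (1/pi) * cos(z u) / (1 + exp[a (u^2 - 1)])."""
    u = np.linspace(0.0, u_max, n_u)
    du = u[1] - u[0]
    w = 1.0 / (1.0 + np.exp(np.clip(a * (u ** 2 - 1.0), -700.0, 700.0)))
    return (np.cos(np.outer(z, u)) * w).sum(axis=1) * du / np.pi

def g_prop1(z):
    return np.sinc(z / np.pi) / np.pi            # g_inf(z) = sin(z) / (pi z)

def g_prop2(z, a):
    c = np.pi ** 2 / 24.0                        # -c0 = c1 = pi^2 / 24
    return (np.sinc(z / np.pi) - (c / a ** 2) * (np.cos(z) + z * np.sin(z))) / np.pi

a = 5.67                                         # value corresponding to rho = 100 in the text
z = np.linspace(0.1, 5.0, 50)                    # the correction is useful for z < a
g_ref = g_numeric(z, a)
print("Prop 1 max error:", np.max(np.abs(g_prop1(z) - g_ref)))
print("Prop 2 max error:", np.max(np.abs(g_prop2(z, a) - g_ref)))
```

On this range of $z$ the Proposition 2 error should come out noticeably smaller than the Proposition 1 error, in line with the insets of Figure 1 below.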
Note that in theory one should use a grid over the whole of $\mathbb{R}^D$, but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest, as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects.

Footnote 1: To understand this scaling of $\sigma^2_{grid}$, consider the case where $\rho_{grid} > \rho$, which means that the effective variance at each of the $\rho_{grid}$ points per unit $x$-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there are $\rho_{grid}/\rho$ independent Gaussian observations with variance $\sigma^2_{grid}$ at a single $x$-point; this would be equivalent to one Gaussian observation with variance $\sigma^2$. In effect the $\rho$ observations per unit $x$-space have been smoothed out uniformly.

Figure 1: Main figure: plot of the weight function corresponding to $\rho = 100$ training points/unit length, plus the numerically computed equivalent kernel at $x = 0.0$ and the sinc approximation from Proposition 1. Insets: numerically evaluated $g(z)$ together with sinc and Proposition 2 approximations for $\rho = 100$ (left) and $\rho = 10^4$ (right).

Figure 1 shows plots of the weight function for $\rho = 100$, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of $\ell^2 = 0.004$ and $\sigma^2 = 0.1$, with $\rho_{grid}/\rho = 5/3$. To reduce edge effects, the interval $[-3/2, 3/2]$ was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less "ringing" than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to $g(z)$ for $\rho = 100$ ($a = 5.67$, left) and $\rho = 10^4$ ($a = 9.67$, right). As expected, the addition of the $1/a^2$ correction gives better agreement with the numerical result for $z < a$.

Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like $1/\log(\rho)$. This is larger than the naïve estimate $(1/a^2)^2 \sim 1/(\log(\rho))^4$ based on the first correction term from Proposition 2, because the dominant part of the error comes from the region $z > a$ where the $1/a$ expansion breaks down.

2.2 Other kernels

Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum $S(s)$ depends on $s = |s|$ only. Then we can again define from equation 3 an effective cutoff $s_c$ on the range of $s$ in the EK via $\sigma^2/\rho = S(s_c)$, so that $\tilde{h}(s) = [1 + S(s_c)/S(s)]^{-1}$. The EK will then have the limiting form given in Proposition 1 if $\tilde{h}(s)$ approaches a step function $\Theta(s_c - s)$, i.e. if it becomes infinitely "steep" around the point $s = s_c$ for $s_c\to\infty$.
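The grid-based computation of section 2.1 is straightforward to reproduce; the sketch below (an illustration with an assumed grid size and the parameter values quoted above) builds the smoother matrix $K(K+\sigma^2_{grid}I)^{-1}$ and reads off its central row as a discretized EK. The final rescaling by $\rho_{grid}$, which converts per-point weights to the density scale of the continuous EK, is a bookkeeping choice of this example rather than something specified in the text.

```python
import numpy as np

ell2, sigma2 = 0.004, 0.1
rho, rho_grid = 100.0, 100.0 * 5.0 / 3.0       # rho_grid / rho = 5/3, as in the text
L = 3.0                                        # grid over [-3/2, 3/2] to reduce edge effects
n_grid = int(round(rho_grid * L))
xg = np.linspace(-L / 2, L / 2, n_grid)

d = xg[:, None] - xg[None, :]
K = np.exp(-0.5 * d ** 2 / ell2)               # SE kernel matrix on the grid
sigma2_grid = sigma2 * rho_grid / rho          # effective grid noise variance (footnote 1)

# Smoother matrix: row i holds the weights applied to y for the prediction at xg[i];
# since the matrix is symmetric it equals the response to a unit observation at xg[i],
# i.e. a discretized EK centred there.
S = K @ np.linalg.inv(K + sigma2_grid * np.eye(n_grid))
i0 = n_grid // 2                               # row closest to x = 0
ek = rho_grid * S[i0]                          # rescale weights to the density scale of h(x)
print(xg[i0], ek.max())
```

Comparing this row with the sinc approximation of Proposition 1 reproduces the kind of agreement shown in Figure 1.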
A quantitative criterion for this is that the slope $|\tilde{h}'(s_c)|$ should become much larger than $1/s_c$, the inverse of the range of the step function. Since $\tilde{h}'(s) = S'(s)S(s_c)S^{-2}(s)[1 + S(s_c)/S(s)]^{-2}$, this is equivalent to requiring that $-s_c S'(s_c)/4S(s_c) \propto -d\log S(s_c)/d\log s_c$ must diverge for $s_c\to\infty$. The result of Proposition 1 therefore applies to any kernel whose power spectrum $S(s)$ decays more rapidly than any positive power of $1/s$.

A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales $\ell^2$; the asymptotic behaviour of $s_c$ is then governed by the smallest $\ell$. A less obvious case is the "rational quadratic" $k(r) = [1 + (r/\ell)^2]^{-(D+1)/2}$, which has an exponentially decaying power spectrum $S(s)\propto \exp(-2\pi\ell s)$. (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel $\exp(-r/\ell)$.) Proposition 1 then applies, with the width of the EK now scaling as $1/s_c \propto 1/\log(\rho)$.

The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution $p(\ell)$ of lengthscales $\ell$, $k(r) = \int \exp(-r^2/2\ell^2)\,p(\ell)\, d\ell$. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension $D$; see [9, §2.10]. Such a kernel has power spectrum
$$S(s) = (2\pi)^{D/2}\int_0^\infty \ell^D \exp(-2\pi^2\ell^2 s^2)\,p(\ell)\, d\ell \qquad (6)$$
and one easily verifies that the rational quadratic kernel, which has $S(s)\propto \exp(-2\pi\ell_0 s)$, is obtained for $p(\ell)\propto \ell^{-D-2}\exp(-\ell_0^2/2\ell^2)$. More generally, because the exponential factor in equation 6 acts like a cutoff for $\ell > 1/s$, one estimates $S(s)\sim \int_0^{1/s}\ell^D p(\ell)\, d\ell$ for large $s$. This will decay more strongly than any power of $1/s$ for $s\to\infty$ if $p(\ell)$ itself decreases more strongly than any power of $\ell$ for $\ell\to 0$. Any such choice of $p(\ell)$ will therefore yield a kernel to which Proposition 1 applies.

3 Understanding GP Learning Using the Equivalent Kernel

We now turn to using EK analysis to get a handle on average-case learning curves for Gaussian processes. Here the setup is that a function $\eta$ is drawn from a Gaussian process, and we obtain $\rho$ noisy observations of $\eta$ per unit $x$-space at random $x$ locations. We are concerned with the mean squared error (MSE) between the GP prediction $f$ and $\eta$. Averaging over the noise process, the $x$-locations of the training data and the prior over $\eta$, we obtain the average MSE $\epsilon$ as a function of $\rho$. See e.g. [10] and [11] for an overview of earlier work on GP learning curves.

To understand the asymptotic behaviour of $\epsilon$ for large $\rho$, we now approximate the true GP predictions with the EK predictions from noisy data, given by $f_{EK}(x) = \int h(x - x')y(x')\,dx'$ in the continuum limit of "smoothed out" input locations. We assume as before that $y = \text{target} + \text{noise}$, i.e. $y(x) = \eta(x) + \nu(x)$ where $E[\nu(x)\nu(x')] = (\sigma_*^2/\rho)\,\delta(x - x')$. Here $\sigma_*^2$ denotes the true noise variance, as opposed to the noise variance assumed in the EK; the scaling of $\sigma_*^2$ with $\rho$ is explained in footnote 1. For a fixed target $\eta$, the MSE is $\epsilon = (\int dx)^{-1}\int[\eta(x) - f_{EK}(x)]^2\, dx$. Averaging over the noise process $\nu$ and target function $\eta$ gives, in Fourier space,
$$\epsilon = \int S_\eta(s)[1 - \tilde{h}(s)]^2 + (\sigma_*^2/\rho)\,\tilde{h}^2(s)\, ds = \frac{\sigma^2}{\rho}\int \frac{(\sigma^2/\rho)S_\eta(s)/S^2(s) + \sigma_*^2/\sigma^2}{[1 + \sigma^2/(\rho S(s))]^2}\, ds, \qquad (7)$$
where $S_\eta(s)$ is the power spectrum of the prior over target functions. In the case $S(s) = S_\eta(s)$ and $\sigma^2 = \sigma_*^2$, where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error $\epsilon_B$ and simplifies to $\epsilon_B = (\sigma^2/\rho)\int[1 + \sigma^2/(\rho S(s))]^{-1}\, ds$ (see also [5, eq. 14-16]).
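For the SE kernel in $D = 1$ the matched-case Bayes error $\epsilon_B$ can be evaluated by simple quadrature and compared with the hard-cutoff estimate in which the integrand is treated as 1 up to $s_c$ and 0 beyond; this is an illustrative check with assumed parameter values, not a computation reported in the paper.

```python
import numpy as np

ell2, sigma2 = 0.004, 0.1

def bayes_error(rho, n_s=100000, s_max_mult=6.0):
    """Matched-case Bayes error eps_B = (sigma^2/rho) int [1 + sigma^2/(rho S(s))]^{-1} ds in D = 1,
    for the SE kernel with power spectrum S(s) = sqrt(2 pi ell^2) exp(-2 pi^2 ell^2 s^2)."""
    b = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))
    s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell2))
    s = np.linspace(0.0, s_max_mult * s_c, n_s)
    ds = s[1] - s[0]
    h_tilde = 1.0 / (1.0 + b * np.exp(np.clip(2 * np.pi ** 2 * ell2 * s ** 2, 0.0, 700.0)))
    eps_exact = (sigma2 / rho) * 2.0 * h_tilde.sum() * ds   # factor 2: integral over all s in R
    eps_cutoff = (sigma2 / rho) * 2.0 * s_c                 # hard-cutoff approximation
    return eps_exact, eps_cutoff

for rho in (100, 1000, 10000):
    print(rho, bayes_error(rho))
```

The two numbers should agree increasingly well as $\rho$ grows, and both display the slow $(\log\rho)^{1/2}/\rho$ decay discussed next.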
Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size $n$ [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for $\epsilon_B$ again as a hard cutoff at $s = s_c$, which is justified for large $\rho$, one obtains for an SE target and learner $\epsilon \approx \sigma^2 s_c/\rho \propto (\log(\rho))^{D/2}/\rho$.

Figure 2: Left: plot of the asymptotic form of the EK $(s_c/r)J_1(2\pi s_c r)$ for $D = 2$ and $\rho = 1225$. Right: log-log plot of $\epsilon$ against $\log(\rho)$ for the OU and Matérn-class processes ($\alpha = 2, 4$ respectively). The dashed lines have gradients of $-1/2$ and $-3/2$ which are the predicted rates.

To get analogous predictions for the mismatched case, one can write equation 7 as
$$\epsilon = \frac{\sigma_*^2}{\rho}\int \frac{[1 + \sigma^2/(\rho S(s))] - \sigma^2/(\rho S(s))}{[1 + \sigma^2/(\rho S(s))]^2}\, ds + \int \frac{S_\eta(s)}{[S(s)\rho/\sigma^2 + 1]^2}\, ds.$$
The first integral is smaller than $(\sigma_*^2/\sigma^2)\epsilon_B$ and can be neglected as long as $\epsilon \gg \epsilon_B$. In the second integral we can again make the cutoff approximation (though now with $s$ having to be above $s_c$) to get the scaling $\epsilon \propto \int_{s_c}^\infty s^{D-1}S_\eta(s)\, ds$. For target functions with a power-law decay $S_\eta(s)\propto s^{-\alpha}$ of the power spectrum at large $s$ this predicts $\epsilon\propto s_c^{D-\alpha}\propto(\log(\rho))^{(D-\alpha)/2}$. So we generically get slow logarithmic learning, consistent with the observations in [12]. For $D = 1$ and an OU target ($\alpha = 2$) we obtain $\epsilon\sim(\log(\rho))^{-1/2}$, and for the Matérn-class covariance function $k(r) = (1 + r/\ell)\exp(-r/\ell)$ (which has power spectrum $\propto(3/\ell^2 + 4\pi^2 s^2)^{-2}$, so $\alpha = 4$) we get $\epsilon\sim(\log(\rho))^{-3/2}$.

These predictions were tested experimentally using a GP learner with SE covariance function ($\ell = 0.1$ and assumed noise level $\sigma^2 = 0.1$) against targets from the OU and Matérn-class priors (with $\ell = 0.05$) and with noise level $\sigma_*^2 = 0.01$, averaging over 100 replications for each value of $\rho$. To demonstrate the predicted power-law dependence of $\epsilon$ on $\log(\rho)$, in Figure 2 (right) we make a log-log plot of $\epsilon$ against $\log(\rho)$. The dashed lines show the gradients of $-1/2$ and $-3/2$ and we observe good agreement between experimental and theoretical results for large $\rho$.

3.1 Using the Equivalent Kernel in Kernel Regression

Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction $\rho^{-1}\sum_i h(x_* - x_i)y_i$ at $x_*$. Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting $\langle\cdot\rangle$ denote averaging over noise, training input points and the test point, and setting $f_\eta(x_*) = \int h(x, x_*)\eta(x)\,dx$, the average MSE of the EK predictor is
$$\epsilon_{pred} = \Big\langle\big[\eta(x) - \tfrac{1}{\rho}\textstyle\sum_i h(x, x_i)y_i\big]^2\Big\rangle = \big\langle[\eta(x) - f_\eta(x)]^2\big\rangle + \frac{\sigma_*^2}{\rho}\int h^2(x, x')\,dx' + \frac{1}{\rho}\int h^2(x, x')\eta^2(x')\,dx' - \frac{1}{\rho}\big\langle f_\eta^2(x)\big\rangle$$
$$= \frac{\sigma^2}{\rho}\int \frac{(\sigma^2/\rho)S_\eta(s)/S^2(s) + \sigma_*^2/\sigma^2}{[1 + \sigma^2/(\rho S(s))]^2}\, ds + \frac{\bar\eta^2}{\rho}\int \frac{ds}{[1 + \sigma^2/(\rho S(s))]^2}.$$
Here we have set $\bar\eta^2 = (\int dx)^{-1}\int\eta^2(x)\,dx = \int S_\eta(s)\,ds$ for the spatial average of the squared target amplitude.
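Before the analysis of these two terms below, the EK predictor of section 3.1 is simple to try out in $D = 1$; the sketch compares $\rho^{-1}\sum_i h(x_* - x_i)y_i$, with $h$ taken as the sinc approximation of Proposition 1, against the full GP predictor of equation 1 on a synthetic target. The target function, data range and parameter values are arbitrary choices made for this illustration.

```python
import numpy as np

ell2, sigma2, rho, L = 0.004, 0.1, 100.0, 1.0
rng = np.random.default_rng(1)
n = int(rho * L)
x = rng.uniform(-L / 2, L / 2, n)

eta = lambda t: np.sin(12 * t) + 0.5 * np.cos(25 * t)          # synthetic target
y = eta(x) + np.sqrt(sigma2) * rng.standard_normal(n)          # noisy observations

b = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))
s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell2))
x_test = np.linspace(-0.3, 0.3, 200)                           # stay away from the edges

# EK (sinc-approximation) predictor: f(x_*) = (1/rho) sum_i h(x_* - x_i) y_i
H = 2 * s_c * np.sinc(2 * s_c * (x_test[:, None] - x[None, :]))
f_ek = (H @ y) / rho

# Full GP predictor of equation 1 with the SE kernel
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell2)
k_star = np.exp(-0.5 * (x_test[:, None] - x[None, :]) ** 2 / ell2)
f_gp = k_star @ np.linalg.solve(K + sigma2 * np.eye(n), y)

mse = lambda f: np.mean((f - eta(x_test)) ** 2)
print("EK predictor MSE:", mse(f_ek), " GP predictor MSE:", mse(f_gp))
```

On runs of this kind the EK predictor tends to come out noticeably worse than the GP predictor, in line with the constant-factor gap derived next.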
Taking the matched case ($S_\eta(s) = S(s)$ and $\sigma_*^2 = \sigma^2$) as an example, the first term (which is the one we get for the prediction from "smoothed out" training inputs, see eq. 7) is of order $\sigma^2 s_c^D/\rho$, while the second one is $\sim \bar\eta^2 s_c^D/\rho$. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal-to-noise ratio $\bar\eta^2/\sigma^2$, which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function $h(x_*)$ and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets.

In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner's covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression, as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves.

Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References
[1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984.
[2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998.
[3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
[4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995.
[5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, third edition, 1991.
[6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996.
[7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985.
[8] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999.
[10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999.
[11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002.
[12] P. Sollich. In NIPS 14, pages 519–526, 2002.
4 0.65994406 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
Author: Olivier Chapelle, Zaïd Harchaoui
Abstract: Choice-based conjoint analysis builds models of consumer preferences over products with answers gathered in questionnaires. Our main goal is to bring tools from the machine learning community to solve this problem more efficiently. Thus, we propose two algorithms to quickly and accurately estimate consumer preferences. 1
5 0.54007238 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
6 0.53027147 168 nips-2004-Semigroup Kernels on Finite Sets
7 0.4748885 158 nips-2004-Sampling Methods for Unsupervised Learning
8 0.4649494 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
9 0.45913431 148 nips-2004-Probabilistic Computation in Spiking Populations
10 0.43718991 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
11 0.43416888 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
12 0.42601213 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM
13 0.42597023 94 nips-2004-Kernels for Multi--task Learning
14 0.41881594 166 nips-2004-Semi-supervised Learning via Gaussian Processes
15 0.40261394 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
16 0.39466491 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
17 0.38745794 68 nips-2004-Face Detection --- Efficient and Rank Deficient
18 0.38606387 136 nips-2004-On Semi-Supervised Classification
19 0.38250926 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
20 0.37785682 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
topicId topicWeight
[(13, 0.073), (15, 0.219), (25, 0.013), (26, 0.037), (31, 0.364), (33, 0.145), (35, 0.025), (39, 0.022), (50, 0.03)]
simIndex simValue paperId paperTitle
1 0.87851071 164 nips-2004-Semi-supervised Learning by Entropy Minimization
Author: Yves Grandvalet, Yoshua Bengio
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1
same-paper 2 0.87418509 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1
3 0.86823857 137 nips-2004-On the Adaptive Properties of Decision Trees
Author: Clayton Scott, Robert Nowak
Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1
4 0.86070561 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
5 0.68797755 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
6 0.68561977 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
7 0.66177613 69 nips-2004-Fast Rates to Bayes for Kernel Machines
8 0.66003573 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
9 0.65869129 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
10 0.65718579 183 nips-2004-Temporal-Difference Networks
11 0.64951831 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
12 0.64900547 33 nips-2004-Brain Inspired Reinforcement Learning
13 0.6488505 158 nips-2004-Sampling Methods for Unsupervised Learning
14 0.64744693 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
15 0.64742386 130 nips-2004-Newscast EM
16 0.64706922 177 nips-2004-Supervised Graph Inference
17 0.64641178 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
18 0.64592636 131 nips-2004-Non-Local Manifold Tangent Learning
19 0.64474624 168 nips-2004-Semigroup Kernels on Finite Sets
20 0.64334619 178 nips-2004-Support Vector Classification with Input Data Uncertainty