nips nips2012 nips2012-33 knowledge-graph by maker-knowledge-mining

33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature


Source: pdf

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. [sent-20, score-0.158]

2 We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. [sent-21, score-0.217]

3 Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. [sent-22, score-0.572]

4 This is a particular challenge in modelling big data, where evaluating the likelihood over the entire dataset is extremely computationally demanding. [sent-31, score-0.184]

5 There exist several standard randomised methods for computing model evidence, such as annealed importance sampling (AIS) [1], nested sampling [2] and bridge sampling. [sent-32, score-0.18]

6 We improve upon this existing work in three ways: Log-GP: [10] used a GP prior on the likelihood function; this is a poor model in this case, unable to express the non-negativity and high dynamic range of most likelihood functions. [sent-42, score-0.268]

7 We apply this method to estimate Z, and extend it to compute Z’s posterior variance and expected variance after adding a sample. [sent-44, score-0.328]

8 We use active sampling, selecting locations which minimise the expected uncertainty in Z. [sent-46, score-0.241]

9 Hyperparameter Marginalisation: Uncertainty in the hyperparameters of the model used for quadrature has previously been ignored, leading to overconfidence in the estimate of Z. [sent-47, score-0.435]

10 We introduce a tractable approximate marginalisation of input scale hyperparameters. [sent-48, score-0.133]

11 2 Bayesian Quadrature Bayesian quadrature [8, 10] is a means of performing Bayesian inference about the value of a potentially non-analytic integral, ⟨f⟩ := ∫ f(x) p(x) dx. [sent-53, score-0.287]

12 For clarity, we henceforth assume the domain of integration X = R, although all results generalise to R^n. [sent-54, score-0.133]

13 Quadrature involves evaluating f(x) at a vector of sample points x_s, giving f_s := f(x_s). [sent-56, score-0.162]

14 Often this evaluation is computationally expensive; the consequent sparsity of samples introduces uncertainty about the function f between them, and hence uncertainty about the integral ⟨f⟩. [sent-57, score-0.264]

15 These scales are typically fitted using type two maximum likelihood (MLII); we will later introduce an approximate means of marginalising them in Section 4. [sent-60, score-0.235]
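The ML-II fit referred to here maximises the GP log marginal likelihood over the input scale. A minimal sketch, assuming details the excerpt does not give (a zero-mean GP, squared-exponential kernel with unit output scale, a fixed small noise level, and toy data of our own choosing):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal_likelihood(log_w, xs, fs, noise=1e-4):
    """Negative GP log marginal likelihood as a function of the log input scale."""
    w = np.exp(log_w)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / w ** 2)
    K += noise * np.eye(len(xs))          # assumed fixed noise level
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, fs))
    # 0.5 f^T K^{-1} f + 0.5 log|K|, with constant terms dropped
    return 0.5 * fs @ alpha + np.log(np.diag(L)).sum()

xs = np.linspace(0.0, 4.0, 20)
fs = np.sin(2.0 * xs)                     # toy stand-in for function samples
res = minimize_scalar(lambda lw: neg_log_marginal_likelihood(lw, xs, fs),
                      bounds=(-3.0, 2.0), method="bounded")
w_mlii = np.exp(res.x)                    # the ML-II input scale
```

As the text notes, this point estimate ignores the posterior uncertainty in the input scale, which Section 4 addresses.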

16 Because integration is affine, we can hence use computed samples f_s to perform analytic Gaussian process inference about the value of integrals over f(x), such as ⟨f⟩. [sent-67, score-0.23]

17 The mean estimate for ⟨f⟩ given f_s is m(⟨f⟩|f_s) = ∫∫ ⟨f⟩ p(⟨f⟩|f) p(f|f_s) d⟨f⟩ df = ∫∫ ⟨f⟩ δ(⟨f⟩ − ∫ f(x) p(x) dx) N(f; m_{f|s}, C_{f|s}) d⟨f⟩ df = ∫ m_{f|s}(x) p(x) dx, (3) which is expressible in closed form due to standard Gaussian identities [10]. [sent-68, score-0.378]
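Equation (3) says the quadrature estimate is the integral of the GP posterior mean against p(x). A minimal numerical sketch under assumptions of ours that the excerpt does not fix (squared-exponential kernel with input scale w and unit output scale, standard-normal measure p(x)), a pairing for which z_i = ∫ k(x, x_i) p(x) dx is analytic:

```python
import numpy as np

def bq_mean(xs, fs, w=1.0, jitter=1e-8):
    """Posterior mean of <f> = ∫ f(x) N(x; 0, 1) dx given samples fs = f(xs)."""
    xs, fs = np.asarray(xs, float), np.asarray(fs, float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / w ** 2)
    K += jitter * np.eye(len(xs))         # jitter for numerical stability
    # z_i = ∫ exp(-(x - x_i)^2 / (2 w^2)) N(x; 0, 1) dx, analytic for this pair
    z = w / np.sqrt(w ** 2 + 1.0) * np.exp(-0.5 * xs ** 2 / (w ** 2 + 1.0))
    return z @ np.linalg.solve(K, fs)     # Eq. (3): ∫ m_{f|s}(x) p(x) dx

xs = np.linspace(-3.0, 3.0, 15)
estimate = bq_mean(xs, np.cos(xs))        # true value here is E[cos X] = exp(-1/2)
```

With only 15 samples the estimate is already close to exp(-1/2), illustrating the sample efficiency claimed in the abstract.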

18 The corresponding closed-form expression for the posterior variance of ⟨f⟩ lends itself as a natural convergence diagnostic. [sent-69, score-0.201]

19 For example, we can calculate the posterior mean m(⟨fg⟩|f_s, g_s) for an integral ∫ f(x) g(x) p(x) dx. [sent-71, score-0.241]

20 3 Modelling Likelihood Functions We wish to evaluate the evidence (1), an integral over non-negative likelihoods, ℓ(x). [sent-73, score-0.191]

21 Assigning a standard GP prior to ℓ(x) ignores prior information about the range and non-negativity of ℓ(x), leading to pathologies such as potentially negative evidences (as observed in [10]). [sent-74, score-0.164]

22 A much better prior would be a GP prior on log ℓ(x) (see Figure 2). [sent-75, score-0.313]

23 However, the resulting integral is intractable, m(Z|log ℓ_s) = ∫∫ exp(log ℓ(x)) p(x) dx N(log ℓ; m_{log ℓ|s}, C_{log ℓ|s}) d(log ℓ), (4) as (4) does not possess the affine property exploited in (3). [sent-76, score-0.776]

24 Specifically, we linearise the problematic exponential term around some point log ℓ_0(x), as exp(log ℓ(x)) ≃ exp(log ℓ_0(x)) + exp(log ℓ_0(x)) (log ℓ(x) − log ℓ_0(x)). (5) The integral (4) consists of the product of Z and a GP for log ℓ. [sent-78, score-1.548]

25 The former is ∼ exp(log ℓ); the latter is ∼ exp(−(log ℓ − m)²), effectively permitting only a small range of log ℓ functions. [sent-79, score-0.439]

26 Over this narrow region, it is reasonable to assume that Z does not vary too dramatically, and can be approximated as linear in log ℓ, as is assumed by (5). [sent-80, score-0.236]
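A toy check of how tight the linearisation (5) is over such a narrow region (the expansion point and the ±0.25 range are illustrative choices of ours, not values from the paper):

```python
import numpy as np

log_l0 = 2.0                                        # expansion point log l_0
log_l = np.linspace(log_l0 - 0.25, log_l0 + 0.25, 101)

exact = np.exp(log_l)                               # exp(log l)
linear = np.exp(log_l0) * (1.0 + (log_l - log_l0))  # right-hand side of Eq. (5)

# worst-case relative error over this narrow region is a few per cent
rel_err = np.max(np.abs(linear - exact) / exact)
```

Outside such a narrow band the first-order expansion of the exponential degrades quickly, which is why the argument above restricts attention to the region where the Gaussian over log ℓ has its mass.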

27 Using this approximation, and making the definition Δ_{log ℓ|s} := m_{log ℓ|s} − log ℓ_0, we arrive at m(Z|log ℓ_s) ≃ m(Z|log ℓ_0, log ℓ_s) := ∫ ℓ_0(x) p(x) dx + ∫ ℓ_0(x) Δ_{log ℓ|s}(x) p(x) dx. (6) [sent-81, score-0.708]

28 In practice, we use the transform log(ℓ(x) + 1), allowing us to assume the transformed quantity has zero mean. [sent-82, score-0.207]

29 For both GPs (over both log and non-log spaces), we take zero prior means and Gaussian covariances of the form (2). [sent-88, score-0.29]

30 If a quantity is dependent upon the GP prior for ℓ, it will be represented as conditional on ℓ_s; if dependent upon the GP prior over log ℓ, it will be conditional upon log ℓ_s. [sent-90, score-0.692]

31 We expect ∆log ℓ|s (x) to be small everywhere relative to the magnitude of log ℓ(x) (see Figure 3). [sent-91, score-0.231]

32 Hence log ℓ0 is close to the peaks of the Gaussian over log ℓ, rendering our linearisation appropriate. [sent-92, score-0.528]

33 Unfortunately, the second integral in (6) is non-analytic due to the log ℓ0 term within ∆log ℓ|s . [sent-94, score-0.306]

34 As such, we perform another stage of Bayesian quadrature by treating ∆log ℓ|s as an unknown function of x. [sent-95, score-0.257]

35 For tractability, we assume this prior is independent of the prior for log ℓ. [sent-96, score-0.313]

36 A zero prior mean here is reasonable: Δ_{log ℓ|s} is exactly zero at x_s, and tends to zero far away from x_s, where both m_{log ℓ|s} and log ℓ_0 are given by the compatible prior means for log ℓ and ℓ. [sent-98, score-1.006]

37 xc should firstly include xs , where we know that ∆log ℓ|s is equal to zero. [sent-100, score-0.21]

38 We select the remainder of xc at random on the hyper-ellipses (whose axes are defined by the input scales for ℓ) surrounding existing observations; we expect ∆log ℓ|s to be extremised at such xc . [sent-101, score-0.181]

39 Given these candidates, we can now marginalise (6) over Δ_{log ℓ|s} to give m(Z|log ℓ_s) ≃ m(Z|log ℓ_0, log ℓ_s, Δ_c) = m(Z|ℓ_s) + m(ℓ Δ_{log ℓ|s} | ℓ_s, Δ_c), (7) where both terms are analytic as per Section 2; m(Z|ℓ_s) is of the form (3). [sent-103, score-0.234]

40 This variance can be employed as a convergence diagnostic; it describes our uncertainty in the model evidence Z. [sent-106, score-0.246]

41 2 Note that separately modelling ℓ and log ℓ is not inconsistent: we use the posterior mean of the GP for ℓ only as a convenient parameterisation for ℓ0 ; we do not treat this GP as a full probabilistic model. [sent-107, score-0.391]

42 While this modelling choice may seem excessive, this approach provides significant advantages in the sampling efficiency of the overall algorithm by approximately capturing the non-negativity of our integrand and allowing active sampling. [sent-108, score-0.308]

43 b) An example showing the expected uncertainty in the evidence after observing the likelihood function at that location. [sent-111, score-0.262]

44 In summary, we have described a linearisation approach to exploiting a GP prior over log-likelihoods; this permitted the calculation of the analytic posterior mean (7) and variance (10) of Z. [sent-114, score-0.466]

45 4 Marginalising hyperparameters We now present a novel means of approximately marginalising the hyperparameters of the GP used to model the log-integrand, log ℓ. [sent-117, score-0.727]

46 In previous approaches to Bayesian Quadrature, hyperparameters were estimated using MLII, which approximates the likelihood as a delta function. [sent-118, score-0.261]

47 However, ignoring the uncertainty in the hyperparameters can lead to pathologies. [sent-119, score-0.235]

48 In particular, the reliability of the variance for Z depends crucially upon marginalising over all unknown quantities. [sent-120, score-0.243]

49 The hyperparameters of most interest are the input scales w for the GP over the log-likelihood; these hyperparameters can have a powerful influence on the fit to a function. [sent-121, score-0.381]

50 We use MLII to fit all hyperparameters other than w. [sent-122, score-0.178]

51 We make the following essential approximations: Flat prior: We assume that the prior for w is broad, so that our posterior is the normalised likelihood. [sent-124, score-0.157]

52 We represent the posterior mean for log ℓ conditioned on ŵ as m̂ := m_{log ℓ|s,ŵ}. [sent-126, score-0.503]

53 GP mean affine in w: Given the narrow width of the likelihood for w, p(log ℓ|log ℓ_s, w) is approximated as having a GP mean which is affine in w around the MLII values ŵ, and a constant covariance: m_{log ℓ|s,w} ≃ m̂ + (∂m̂/∂w)(w − ŵ) and C_{log ℓ|s,w} ≃ C_{log ℓ|s,ŵ}. [sent-127, score-0.342]

54 The implication of these approximations is that the marginal posterior mean over log ℓ is simply m̃_{log ℓ|s} := m_{log ℓ|s,ŵ}. [sent-128, score-0.693]

55 The marginal posterior variance is C̃_{log ℓ|s} := C_{log ℓ|s,ŵ} + (∂m̂/∂w) C_w (∂m̂/∂w)ᵀ. [sent-129, score-0.237]

56 Our approximations give the marginal posterior mean for Z: m̃(Z|log ℓ_0, log ℓ_s, Δ_c) := m(Z|log ℓ_0, log ℓ_s, Δ_c, ŵ), (11) of the form (7). [sent-131, score-0.592]

57 The marginal posterior variance Ṽ(Z|log ℓ_0, log ℓ_s, Δ_c) = ∫∫ m_{ℓ|s}(x) m_{ℓ|s}(x′) [C_{log ℓ|s}(x, x′) + (∂m̂(x)/∂w) C_w (∂m̂(x′)/∂w)ᵀ] dx dx′ (12) is possible, although laborious, to express analytically, as with (10). [sent-132, score-0.514]
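The second term inside the brackets of (12) is a delta-method correction: hyperparameter uncertainty C_w enters through the sensitivity ∂m̂/∂w. A toy one-hyperparameter sketch of that correction (the function m and the numbers are illustrative stand-ins of ours, not the actual GP posterior mean):

```python
import numpy as np

def extra_variance(m, w_hat, C_w, eps=1e-6):
    """Delta-method variance contributed by uncertainty C_w in a scalar w."""
    dm_dw = (m(w_hat + eps) - m(w_hat - eps)) / (2.0 * eps)  # central difference
    return dm_dw * C_w * dm_dw

# illustrative: m = tanh, MLII estimate w_hat = 1, posterior variance C_w = 0.1
v = extra_variance(np.tanh, w_hat=1.0, C_w=0.1)
```

The correction is always non-negative, which matches the text's point that ignoring hyperparameter uncertainty leads to overconfidence.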

58 5 Active Sampling One major benefit of model-based integration is that samples can be chosen by any method, in contrast to Monte Carlo methods, which typically must sample from a specific distribution. [sent-133, score-0.127]

59 In this section, we describe a scheme to select samples xs sequentially, by minimising the expected uncertainty in the evidence that remains after taking each additional sample. [sent-134, score-0.362]

60 We take the variance in the evidence as our loss function, and proceed according to Bayesian decision theory. [sent-135, score-0.189]

61 Surprisingly, the posterior variance of a GP model with fixed hyperparameters does not depend on the function values at sampled locations at all; only the location of those samples matters. [sent-136, score-0.43]

62 In traditional Bayesian quadrature, the evidence is an affine transformation of the sampled likelihood values, hence its estimate for the variance in the evidence is also independent of likelihood values. [sent-137, score-0.447]

63 As such, active learning with fixed hyperparameters is pointless, and the optimal sampling design can be found in advance [13]. [sent-138, score-0.327]
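That claim can be made concrete: with fixed hyperparameters, V(Z) is a function of the sample locations alone, so a design can be precomputed greedily without ever looking at function values. A sketch under illustrative assumptions of ours (squared-exponential kernel with input scale w, standard-normal measure p(x), for which both ∫∫ k(x,x′) p(x) p(x′) dx dx′ and z_i = ∫ k(x, x_i) p(x) dx are analytic):

```python
import numpy as np

def variance_of_Z(xs, w=1.0, jitter=1e-8):
    """V(Z) for fixed hyperparameters: a function of the locations xs only."""
    xs = np.asarray(xs, float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / w ** 2)
    K += jitter * np.eye(len(xs))
    z = w / np.sqrt(w ** 2 + 1.0) * np.exp(-0.5 * xs ** 2 / (w ** 2 + 1.0))
    # ∫∫ k(x, x') p(x) p(x') dx dx'  -  z^T K^{-1} z
    return w / np.sqrt(w ** 2 + 2.0) - z @ np.linalg.solve(K, z)

candidates = np.linspace(-4.0, 4.0, 81)
design = np.array([0.0])   # start at the mode of p(x)
for _ in range(5):         # precompute five further locations greedily
    best = min(candidates, key=lambda c: variance_of_Z(np.append(design, c)))
    design = np.append(design, best)
```

No function values appear anywhere in the loop, illustrating why, once linearisation and hyperparameter marginalisation make the variance value-dependent, active learning becomes worthwhile.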

64 As the affine transformation (5) itself depends on the function values (via the dependence of log ℓ_0 on them), the conclusions of the previous paragraph do not apply, and active learning is desirable. [sent-140, score-0.307]

65 The uncertainty over the hyperparameters of the GP further motivates active learning: without assuming a priori knowledge of the hyperparameters, we can’t evaluate the GP to precompute a sampling schedule. [sent-141, score-0.384]

66 The approximate marginalisation of hyperparameters permits an approach to active sampling that acknowledges the influence new samples may have on the posterior over hyperparameters. [sent-142, score-0.59]

67 Active sampling selects a new sample xa so as to minimise the expected variance in the evidence after adding the sample to the model of ℓ. [sent-143, score-0.433]

68 The first term in (13), the second moment, is independent of the selection of x_a and can hence be safely ignored for active sampling (true regardless of the model chosen for the likelihood). [sent-145, score-0.26]

69 The second term, the negative expected squared mean, can be resolved analytically for any trial x_a (we omit the laborious details). [sent-146, score-0.177]

70 That is, the GP posterior over log ℓa can be fully exploited when performing active sampling. [sent-148, score-0.411]

71 In order to minimise the expected variance, the objective in (13) encourages the maximisation of the expected squared mean of Z. [sent-149, score-0.152]

72 Note that the variance for log ℓa is increased by approximate integration over hyperparameters, encouraging exploration. [sent-152, score-0.38]

73 Define Z_i as the ground-truth evidence for the i-th experiment, m(Z_i) as its estimated mean and V(Z_i) as its predicted variance. [sent-155, score-0.166]

74 Firstly, we computed the average log error. (Footnote 3: We also expect such samples to be useful not just for estimating the evidence, but also for any other related expectations, such as would be required to perform prediction using the model.) [sent-156, score-0.258]

75 We assume that Δ_{log ℓ|s} does not depend on log ℓ_a, only on its location x_a: we know Δ(x_a) = 0 and assume Δ_{log ℓ|s} elsewhere remains unchanged. [sent-158, score-0.318]

76 Next we computed the negative log-density of the truth, assuming experiments are independent, −log p(Z) = −∑_{i=1}^{N} log N(Z_i; m(Z_i), V(Z_i)), which quantifies the accuracy of our variance estimates. [sent-160, score-0.511]
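The metric just defined can be written directly; the numbers below are illustrative choices of ours, picked to show that an overconfident variance estimate is penalised:

```python
import numpy as np

def neg_log_density(Z_true, means, variances):
    """-log p(Z) = -sum_i log N(Z_i; m(Z_i), V(Z_i)), experiments independent."""
    Z_true, means, variances = map(np.asarray, (Z_true, means, variances))
    return 0.5 * np.sum(np.log(2.0 * np.pi * variances)
                        + (Z_true - means) ** 2 / variances)

# same point estimates, honest vs overconfident variance claims
calibrated = neg_log_density([1.0, 2.0], [1.1, 1.9], [0.04, 0.04])
overconfident = neg_log_density([1.0, 2.0], [1.1, 1.9], [1e-4, 1e-4])
```

The overconfident score is far worse, which is why the text singles this metric out as a check on variance calibration rather than point accuracy.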

77 Note that a single AIS chain provides no ready means of determining the posterior variance for its estimate of Z. [sent-173, score-0.263]

78 Here samples were drawn from the AIS chain above, and a GP was fit to the likelihood samples. [sent-175, score-0.166]

79 For this and other methods, where not otherwise mentioned, GP hyperparameters were selected using MLII . [sent-176, score-0.178]

80 Firstly, Bayesian Quadrature (BQ), which employed the linearisation approach of Section 3 to model the log-transformed likelihood values. [sent-178, score-0.173]

81 BQ* is the same algorithm as BQ but with hyperparameters approximately marginalised, as per Section 4. [sent-180, score-0.215]

82 The method is so named for the fact that we use not only Bayesian inference (with a GP over the log-transformed likelihood) to compute the posterior for the evidence, but also Bayesian decision theory to select our samples actively, as described in Section 5. [sent-184, score-0.155]

83 Both algorithms demonstrate the influence of active sampling on our performance. [sent-186, score-0.149]

84 Problems: We used these methods to evaluate evidences given Gaussian priors and a variety of likelihood functions. [sent-187, score-0.141]

85 As in [10] and [11], we focus on low numbers of samples; we permitted the tested methods 150 samples on synthetic integrands, and 300 when using real data. [sent-188, score-0.174]

86 We are motivated by real-world, big-data problems where evaluating likelihood samples is expensive, making it desirable to determine the techniques for evidence estimation that can operate best when permitted only a small number of samples. [sent-189, score-0.313]

87 This model has five hyperparameters to be marginalised, to which we assign priors drawn from the large corpus of data obtained from the Sloan Digital Sky Survey (SDSS) [15]. [sent-198, score-0.178]

88 We tested over four datasets; the expense of evaluating a GP likelihood sample on the large datasets available from the SDSS (140TB of data have been released in total) motivates the small sample sizes considered. [sent-199, score-0.153]

89 On average, BBQ * provides an estimate 5 Because a single AIS chain gives no estimate of uncertainty, it has no likelihood or calibration scores. [sent-202, score-0.153]

90 Figure 5 (methods shown: SMC, BMC, BBQ*, and the true value): a) The posterior distribution over Z for several methods on a one-dimensional example as the number of samples increases. [sent-203, score-0.464]

91 b) The log density of the true evidence for different methods (colours identical to those in a), compared to the true Z (in black). [sent-206, score-0.299]

92 [Table 1: Combined Synthetic Results; Table 2: Combined Real Results. Each lists −log p(Z) for SMC, AIS, BMC, BQ, BQ*, BBQ and BBQ*; on the synthetic problems several baselines score −log p(Z) > 1000, and a single AIS chain has no score.] [sent-208, score-0.207] [sent-223, score-0.207]

94 Figure 5a shows a case where both SMC and BBQ* are relatively close to the true value; however, BBQ*’s posterior variance is much smaller. [sent-244, score-0.201]

95 Figure 5b demonstrates the typical behaviour of the active sampling of BBQ *, which quickly concentrates the posterior distribution at the true Z. [sent-245, score-0.253]

96 The negative likelihoods of BQ* are for every problem slightly lower than for BQ (−log p(Z) is on average 0.2 lower), indicating that the approximate marginalisation of hyperparameters grants a small improvement in the variance estimate. [sent-247, score-0.383]

98 Here BBQ is clearly the best performer; the additional exploration induced by the hyperparameter marginalisation of BBQ * may have led to local peaks being incompletely exploited. [sent-249, score-0.191]

99 These are: approximately imposing a positivity constraint, approximately marginalising hyperparameters, and using active sampling to select the location of function evaluations. [sent-252, score-0.32]

100 Of these contributions, the active learning approach yielded the most significant gains for integral estimation. [sent-253, score-0.199]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.453), ('bbq', 0.352), ('quadrature', 0.257), ('bq', 0.23), ('log', 0.207), ('clog', 0.198), ('hyperparameters', 0.178), ('mlog', 0.154), ('ais', 0.143), ('xs', 0.132), ('mlii', 0.132), ('smc', 0.128), ('xa', 0.111), ('marginalisation', 0.108), ('posterior', 0.104), ('active', 0.1), ('integral', 0.099), ('variance', 0.097), ('marginalising', 0.097), ('evidence', 0.092), ('linearisation', 0.09), ('bayesian', 0.085), ('likelihood', 0.083), ('integrand', 0.08), ('monte', 0.079), ('zi', 0.079), ('xc', 0.078), ('integrals', 0.076), ('integration', 0.076), ('bmc', 0.074), ('marginalised', 0.072), ('dx', 0.07), ('carlo', 0.07), ('cw', 0.067), ('integrands', 0.066), ('af', 0.06), ('mf', 0.059), ('garnett', 0.058), ('ale', 0.058), ('evidences', 0.058), ('permitted', 0.057), ('uncertainty', 0.057), ('minimise', 0.054), ('prior', 0.053), ('samples', 0.051), ('osborne', 0.051), ('sampling', 0.049), ('upon', 0.049), ('annealed', 0.046), ('dla', 0.044), ('dlas', 0.044), ('hydrogen', 0.044), ('quasar', 0.044), ('sdss', 0.044), ('modelling', 0.042), ('roberts', 0.04), ('tested', 0.04), ('dlog', 0.039), ('calibration', 0.038), ('mean', 0.038), ('approximately', 0.037), ('laborious', 0.036), ('overcon', 0.036), ('randomised', 0.036), ('truth', 0.036), ('marginal', 0.036), ('hyperparameter', 0.035), ('gaussian', 0.035), ('rasmussen', 0.034), ('likelihoods', 0.033), ('approx', 0.032), ('generalise', 0.032), ('expressible', 0.032), ('sky', 0.032), ('chain', 0.032), ('tted', 0.031), ('neutral', 0.031), ('evaluating', 0.03), ('means', 0.03), ('expected', 0.03), ('twenty', 0.03), ('big', 0.029), ('narrow', 0.029), ('analytic', 0.027), ('rstly', 0.027), ('shaded', 0.027), ('synthetic', 0.026), ('former', 0.025), ('scales', 0.025), ('df', 0.025), ('zoubin', 0.025), ('henceforth', 0.025), ('numerical', 0.025), ('scale', 0.025), ('exploration', 0.024), ('peaks', 0.024), ('cf', 0.024), ('dent', 0.024), ('everywhere', 0.024), ('covariance', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature


2 0.37204182 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1

3 0.22697696 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, nonMarkovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

4 0.18491201 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form intertask covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T . We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks. 1 Introduction and motivation Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9]. 
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. 1 We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions. 
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently. 2 GP regression and Bayes error We consider GP regression for T functions fτ (x), τ = 1, 2, . . . , T . These functions have to be learned from n training examples (x , τ , y ), = 1, . . . , n. Here x is the training input, τ ∈ {1, . . . , T } denotes which task the example relates to, and y is the corresponding training output. We assume that the latter is given by the target function value fτ (x ) corrupted by i.i.d. additive 2 2 Gaussian noise with zero mean and variance στ . This setup allows the noise level στ to depend on the task. In GP regression the prior over the functions fτ (x) is a Gaussian process. This means that for any set of inputs x and task labels τ , the function values {fτ (x )} have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances fτ (x)fτ (x ) = C(τ, x, τ , x ). For this covariance we take the flexible form from [5], fτ (x)fτ (x ) = Dτ τ C(x, x ). Here C(x, x ) determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix. 
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ , one would predict as output the mean of fτ (x) over the (Gaussian) posterior, which is y T K −1 kτ (x). Here K is the n × n Gram matrix with entries 2 K m = Dτ τm C(x , xm ) + στ δ m , while kτ (x) is a vector with the n entries kτ, = Dτ τ C(x , x). The error bar would be taken as the square root of the posterior variance of fτ (x), which is T Vτ (x) = Dτ τ C(x, x) − kτ (x)K −1 kτ (x) (1) The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say n1 for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs y are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error ˆτ is the Bayes error, and is given by the average posterior variance [1], i.e. ˆτ = Vτ (x) x . To obtain the learning curve this is averaged over the location of the training inputs x : τ = ˆτ . This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in Vτ (x). Note that the training outputs, on the other hand, do not appear in the posterior variance Vτ (x) and so do not need to be averaged over. We now want to write the Bayes error ˆτ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs x , and also the test input x, are drawn from the same distribution P (x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P (x), according to C(x, x ) = i λi φi (x)φi (x ). 
The eigenfunctions are defined by the condition ⟨C(x, x') φ_i(x')⟩_{x'} = λ_i φ_i(x) and can be chosen to be orthonormal with respect to P(x), ⟨φ_i(x) φ_j(x)⟩_x = δ_ij. The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot-product kernel C(x, x') = x · x'). To make the algebra below as simple as possible, we let the eigenvalues λ_i be arranged in decreasing order and truncate the sum to the finite range i = 1, …, M; M is then some large effective feature space dimension and can be taken to infinity at the end.

In terms of the above eigenfunction decomposition, the Gram matrix has elements

K_{ℓm} = Σ_i D_{τ_ℓ τ_m} λ_i φ_i(x_ℓ) φ_i(x_m) + σ²_{τ_ℓ} δ_{ℓm} = Σ_{i,τ,j,τ'} φ_i(x_ℓ) δ_{τ τ_ℓ} λ_i δ_ij D_{ττ'} φ_j(x_m) δ_{τ' τ_m} + σ²_{τ_ℓ} δ_{ℓm}

or in matrix form K = Ψ L Ψᵀ + Σ, where Σ is the diagonal matrix formed from the noise variances and

Ψ_{ℓ,iτ} = δ_{τ τ_ℓ} φ_i(x_ℓ),   L_{iτ,jτ'} = λ_i δ_ij D_{ττ'}    (2)

Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, L = D ⊗ Λ if we define Λ as the diagonal matrix with entries λ_i δ_ij. The Kronecker product is convenient for the simplifications below; we will use that, for generic square matrices, (A ⊗ B)(A' ⊗ B') = (AA') ⊗ (BB'), (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹, and tr(A ⊗ B) = (tr A)(tr B). In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ.

To calculate the Bayes error, we need to average the posterior variance V_τ(x) over the test input x. The first term in (1) then becomes D_{ττ} ⟨C(x, x)⟩_x = D_{ττ} tr Λ.
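The three Kronecker-product identities quoted above are easy to sanity-check numerically. The sketch below (our own illustration in Python/NumPy, not part of the paper) also confirms that `np.kron` follows the same block convention as L = D ⊗ Λ, i.e. T × T blocks each proportional to Λ:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
A2, B2 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))

# (A ⊗ B)(A' ⊗ B') = (AA') ⊗ (BB')
prod_ok = np.allclose(np.kron(A, B) @ np.kron(A2, B2), np.kron(A @ A2, B @ B2))

# (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
inv_ok = np.allclose(np.linalg.inv(np.kron(A, B)),
                     np.kron(np.linalg.inv(A), np.linalg.inv(B)))

# tr(A ⊗ B) = (tr A)(tr B)
tr_ok = np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))

# Block picture: np.kron(D, Lam) consists of T x T blocks D_{tt'} * Lam,
# exactly the block view of L = D ⊗ Λ used in the text
D = rng.standard_normal((2, 2))
Lam = np.diag(rng.uniform(0.1, 1.0, 3))
L = np.kron(D, Lam)
block_ok = np.allclose(L[:3, 3:], D[0, 1] * Lam)
```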
For the second term in (1), we need to average

⟨k_{τ,ℓ}(x) k_{τ,m}(x)⟩_x = D_{τ_ℓ τ} ⟨C(x_ℓ, x) C(x, x_m)⟩_x D_{τ_m τ}
  = Σ_{ij} D_{τ_ℓ τ} λ_i λ_j φ_i(x_ℓ) ⟨φ_i(x) φ_j(x)⟩_x φ_j(x_m) D_{τ_m τ}
  = Σ_{i,τ',j,τ''} Ψ_{ℓ,iτ'} D_{τ'τ} λ_i λ_j δ_ij D_{ττ''} Ψ_{m,jτ''}

In matrix form this is ⟨k_τ(x) k_τ(x)ᵀ⟩_x = Ψ[(D e_τ e_τᵀ D) ⊗ Λ²]Ψᵀ ≡ Ψ M_τ Ψᵀ. Here the last equality defines M_τ, and we have denoted by e_τ the T-dimensional vector with τ-th component equal to one and all others zero. Multiplying by the inverse Gram matrix K⁻¹ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task τ

ε̂_τ = ⟨V_τ(x)⟩_x = D_{ττ} tr Λ − tr[Ψ M_τ Ψᵀ (Ψ L Ψᵀ + Σ)⁻¹]

Applying the Woodbury identity and re-arranging yields

ε̂_τ = D_{ττ} tr Λ − tr[M_τ Ψᵀ Σ⁻¹ Ψ (I + L Ψᵀ Σ⁻¹ Ψ)⁻¹]
  = D_{ττ} tr Λ − tr{M_τ L⁻¹ [I − (I + L Ψᵀ Σ⁻¹ Ψ)⁻¹]}

But

tr M_τ L⁻¹ = tr{[(D e_τ e_τᵀ D) ⊗ Λ²][D ⊗ Λ]⁻¹} = tr{[D e_τ e_τᵀ] ⊗ Λ} = e_τᵀ D e_τ tr Λ = D_{ττ} tr Λ

so the first and second terms in the expression for ε̂_τ cancel and one has

ε̂_τ = tr[M_τ L⁻¹ (I + L Ψᵀ Σ⁻¹ Ψ)⁻¹] = tr[L⁻¹ M_τ L⁻¹ (L⁻¹ + Ψᵀ Σ⁻¹ Ψ)⁻¹]
  = tr{[D ⊗ Λ]⁻¹ [(D e_τ e_τᵀ D) ⊗ Λ²] [D ⊗ Λ]⁻¹ (L⁻¹ + Ψᵀ Σ⁻¹ Ψ)⁻¹}
  = tr{[e_τ e_τᵀ ⊗ I] (L⁻¹ + Ψᵀ Σ⁻¹ Ψ)⁻¹}

The matrix in square brackets in the last line is just a projector P_τ onto task τ; thought of as a matrix of T × T blocks (each of size M × M), this has an identity matrix in the (τ, τ) block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task τ,

ε̂_τ = tr P_τ (L⁻¹ + Ψᵀ Σ⁻¹ Ψ)⁻¹    (3)

Because Σ is diagonal and given the definition (2) of Ψ, the matrix Ψᵀ Σ⁻¹ Ψ is a sum of contributions from the individual training examples ℓ = 1, …, n. This will be important for deriving the learning curve approximation below. We note in passing that, because Σ_τ P_τ = I, the sum of the Bayes errors over all tasks is Σ_τ ε̂_τ = tr(L⁻¹ + Ψᵀ Σ⁻¹ Ψ)⁻¹, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve ε_τ = ⟨ε̂_τ⟩, we now need to carry out the average ⟨…⟩
over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix

G = (L⁻¹ + Ψᵀ Σ⁻¹ Ψ + Σ_τ v_τ P_τ)⁻¹

with auxiliary parameters v_τ that will be set back to zero at the end. One can then ask how ⟨G⟩, and hence ε_τ = tr P_τ ⟨G⟩, changes with the number n_τ of training points for task τ. Adding an example at position x for task τ increases Ψᵀ Σ⁻¹ Ψ by σ_τ⁻² φ_τ φ_τᵀ, where φ_τ has elements (φ_τ)_{iτ'} = φ_i(x) δ_{ττ'}. Evaluating the difference (G⁻¹ + σ_τ⁻² φ_τ φ_τᵀ)⁻¹ − G with the help of the Woodbury identity and approximating it with a derivative gives

∂G/∂n_τ = − G φ_τ φ_τᵀ G / (σ²_τ + φ_τᵀ G φ_τ)

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately, we get

∂⟨G⟩/∂n_τ = [1/(σ²_τ + tr P_τ ⟨G⟩)] ∂⟨G⟩/∂v_τ    (4)

Here we have exploited, for the average over x, that the matrix ⟨φ_τ φ_τᵀ⟩_x has (i, τ'), (j, τ'')-entry ⟨φ_i(x) φ_j(x)⟩_x δ_{ττ'} δ_{ττ''} = δ_ij δ_{ττ'} δ_{ττ''}, hence simply ⟨φ_τ φ_τᵀ⟩_x = P_τ. We have also used the auxiliary parameters to rewrite −⟨G P_τ G⟩ = ∂⟨G⟩/∂v_τ. Finally, multiplying (4) by P_τ and taking the trace gives the set of quasi-linear partial differential equations

∂ε_τ/∂n_τ = [1/(σ²_τ + ε_τ)] ∂ε_τ/∂v_τ    (5)

The remaining task is now to find the functions ε_τ(n_1, …, n_T, v_1, …, v_T) by solving these differential equations. We initially attempted to do this by tracking the ε_τ as examples are added one task at a time, but the derivation is laborious already for T = 2 and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a 2T-dimensional surface in the 3T-dimensional space (n_1, …, n_T, v_1, …, v_T, ε_1, …, ε_T), which is specified by the T functions ε_τ(…). A small change (δn_1, …, δn_T, δv_1, …, δv_T, δε_1, …
, δε_T) in all 3T coordinates is tangential to this surface if it obeys the T constraints (one for each τ)

δε_τ = (∂ε_τ/∂n_τ) δn_τ + (∂ε_τ/∂v_τ) δv_τ

From (5), one sees that this condition is satisfied whenever δε_τ = 0 and δn_τ = −δv_τ (σ²_τ + ε_τ). It follows that all the characteristic curves given by

ε_τ(t) = ε_{τ,0} = const.,   v_τ(t) = v_{τ,0}(1 − t),   n_τ(t) = v_{τ,0}(σ²_τ + ε_{τ,0}) t   for t ∈ [0, 1]

are tangential to the solution surface for all t, so they lie within this surface if the initial point at t = 0 does. Because at t = 0 there are no training examples (n_τ(0) = 0), this initial condition is satisfied by setting

ε_{τ,0} = tr P_τ (L⁻¹ + Σ_{τ'} v_{τ',0} P_{τ'})⁻¹

Because ε_τ(t) is constant along the characteristic curve, we get by equating the values at t = 0 and t = 1

ε_{τ,0} = tr P_τ (L⁻¹ + Σ_{τ'} v_{τ',0} P_{τ'})⁻¹ = ε_τ({n_{τ'} = v_{τ',0}(σ²_{τ'} + ε_{τ',0})}, {v_{τ'} = 0})

Expressing v_{τ',0} in terms of n_{τ'} then gives

ε_τ = tr P_τ (L⁻¹ + Σ_{τ'} [n_{τ'} / (σ²_{τ'} + ε_{τ'})] P_{τ'})⁻¹    (6)

This is our main result: a closed set of T self-consistency equations for the average Bayes errors ε_τ. Given L as defined by the eigenvalues λ_i of the covariance function, the noise levels σ²_τ and the number of examples n_τ for each task, it is straightforward to solve these equations numerically to find the average Bayes error ε_τ for each task. The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of M × M blocks of size T × T (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues λ_i. Explicitly, because L⁻¹ = D⁻¹ ⊗ Λ⁻¹, one has

ε_τ = Σ_i [ (λ_i⁻¹ D⁻¹ + diag({n_τ / (σ²_τ + ε_τ)}))⁻¹ ]_{ττ}    (7)

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where D is diagonal. Here one recovers T separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].
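As a concrete illustration (our own, not from the paper), the self-consistency equations (7) can be solved by damped fixed-point iteration starting from the n = 0 value ε_τ = D_ττ tr Λ; the function name and the geometric eigenvalue spectrum below are assumptions for the sketch:

```python
import numpy as np

def bayes_errors(lam, D, sigma2, n_tasks, tol=1e-10, max_iter=1000):
    """Solve (7): eps_t = sum_i [(lam_i^-1 D^-1 + diag(n_t/(sigma_t^2 + eps_t)))^-1]_tt."""
    T = D.shape[0]
    Dinv = np.linalg.inv(D)
    eps = np.diag(D) * lam.sum()            # n = 0 limit: eps_t = D_tt tr(Lambda)
    for _ in range(max_iter):
        N = np.diag(n_tasks / (sigma2 + eps))
        new = np.array([sum(np.linalg.inv(Dinv / li + N)[t, t] for li in lam)
                        for t in range(T)])
        if np.max(np.abs(new - eps)) < tol:
            break
        eps = 0.5 * (eps + new)             # damping keeps the iteration stable
    return eps

lam = 0.7 ** np.arange(30)                  # assumed geometrically decaying eigenvalues
sigma2 = np.array([0.05, 0.05])
n_tasks = np.array([20.0, 0.0])             # examples only for task 1

eps_ind = bayes_errors(lam, np.eye(2), sigma2, n_tasks)          # uncorrelated tasks
eps_cor = bayes_errors(lam, np.array([[1.0, 0.8], [0.8, 1.0]]),  # correlated tasks
                       sigma2, n_tasks)
```

With no examples for task 2, the uncorrelated prediction leaves its error at the prior value tr Λ, while correlation with task 1 reduces it; task 1 itself is unaffected, since n_2 = 0 makes its equation independent of ρ.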
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say τ = 1) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first T_0 tasks, n_1 = … = n_{T_0} = 0, but a large number of examples for the remaining T_1 = T − T_0 tasks. Denote E = D⁻¹ and write this in block form as

E = [ E_00  E_01 ; E_01ᵀ  E_11 ]

Now multiply by λ_i⁻¹ and add in the lower right block a diagonal matrix N = diag({n_τ / (σ²_τ + ε_τ)}_{τ = T_0+1, …, T}). The matrix inverse in (7) then has top left block λ_i [E_00⁻¹ + E_00⁻¹ E_01 (λ_i N + E_11 − E_01ᵀ E_00⁻¹ E_01)⁻¹ E_01ᵀ E_00⁻¹]. As the number of examples for the last T_1 tasks grows, so do all (diagonal) elements of N. In the limit only the term λ_i E_00⁻¹ survives, and summing over i gives ε_1 = tr Λ (E_00⁻¹)_{11} = ⟨C(x, x)⟩_x (E_00⁻¹)_{11}. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last T_1 tasks f_τ(x) (τ = T_0 + 1, …, T) have been learned perfectly, the posterior over the first T_0 functions is, by standard Gaussian conditioning, a GP with covariance C(x, x') E_00⁻¹. Averaging the posterior variance of f_1(x) then gives the Bayes error on task 1 as ε_1 = ⟨C(x, x)⟩_x (E_00⁻¹)_{11}, as found earlier.

This analysis can be extended to the case where there are some examples available also for the first T_0 tasks. One finds for the generalization errors on these tasks the prediction (7) with D⁻¹ replaced by E_00. This is again in line with the above form of the GP posterior after perfect learning of the remaining T_1 tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for T = 2 tasks.
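For two tasks with unit prior variances and correlation ρ, E_00 is the scalar (D⁻¹)_11 = 1/(1 − ρ²), so the transfer-learning floor is ε_1 = (1 − ρ²) tr Λ. A small numerical sketch (our own construction, with an assumed eigenvalue spectrum) confirms that the prediction (7) approaches, but never crosses, this floor as the number of task-2 examples grows:

```python
import numpy as np

rho, sig2 = 0.7, 0.05
D = np.array([[1.0, rho], [rho, 1.0]])
lam = 0.6 ** np.arange(40)                  # assumed eigenvalue spectrum
Dinv = np.linalg.inv(D)

def eps_task1(n2, iters=400):
    """Iterate (7) with n_1 = 0 and n2 examples for the helper task."""
    eps = np.diag(D) * lam.sum()            # start from the n = 0 value
    for _ in range(iters):
        N = np.diag([0.0, n2 / (sig2 + eps[1])])
        eps = np.array([sum(np.linalg.inv(Dinv / li + N)[t, t] for li in lam)
                        for t in range(2)])
    return eps[0]

floor = (1.0 - rho**2) * lam.sum()          # tr(Lambda) * (E_00^{-1})_11
err = eps_task1(1e6)                        # very many task-2 examples
```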
Here we have the work of Chai [21] as a baseline, and as there we choose

D = [ 1  ρ ; ρ  1 ]

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions f_1(x) and f_2(x) to unit variance; the degree of correlation of the tasks is controlled by ρ. We fix π_2 = n_2/n and plot learning curves against n. In numerical simulations we ensure integer values of n_1 and n_2 by setting n_2 = ⌊nπ_2⌋, n_1 = n − n_2; for evaluation of (7) we use directly n_2 = nπ_2, n_1 = n(1 − π_2). For simplicity we consider equal noise levels σ²_1 = σ²_2 = σ². As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel C(x, x') = exp[−(x − x')²/(2l²)] with lengthscale l, and one-dimensional inputs x with a Gaussian distribution N(0, 1/12).

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale l = 0.01, noise level σ² = 0.05 and a fraction π_2 = 0.75 of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation ρ² = 0, 0.25, 0.5, 0.75, 1 from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show the tendency of asymptotic uselessness, i.e. bunching of the ρ < 1 curves towards the one for ρ = 0; this effect is strongest for learning of smooth functions (left and middle).

The kernel eigenvalues λ_i are known explicitly from [22] and decay exponentially with i. Figure 1(left) compares numerically simulated learning curves with the predictions for ε_1, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for ρ² = 0, 0.25, 0.5, 0.75, 1.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored (ρ = 0) or effectively treated as being from task 1 (ρ = 1). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with ρ². The curves for the different ρ² values are fairly evenly spaced vertically for a small number of examples n, corresponding to a linear dependence on ρ². As n increases, however, the learning curves for ρ < 1 start to bunch together and separate from the one for the fully correlated case (ρ = 1). The approximation (7) correctly captures this behaviour, which is discussed in more detail below.

Figure 1(middle) has analogous results for the case of inputs x uniformly distributed on the interval [0, 1]; the λ_i here decay exponentially with i² [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance ⟨V_τ(x)⟩; for a uniform input distribution this variation will be weak except near the ends of the input range [0, 1]. Figure 1(right) displays similar results for an OU kernel C(x, x') = exp(−|x − x'|/l), showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples n is large, the learning curves seem to bunch towards the curve for ρ = 0, where task 2 examples are ignored, except when the two tasks are fully correlated (ρ = 1). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors ε_τ will become small and eventually be negligible compared to the noise variances σ²_τ in (7).
One then has an explicit prediction for each ε_τ, without solving T self-consistency equations. If we write, for T tasks, n_τ = nπ_τ with π_τ the fraction of examples for task τ, and set γ_τ = π_τ/σ²_τ, then for large n

ε_τ = Σ_i [λ_i⁻¹ D⁻¹ + nΓ]⁻¹_{ττ} = [Γ^{−1/2} Σ_i (λ_i⁻¹ (Γ^{1/2} D Γ^{1/2})⁻¹ + nI)⁻¹ Γ^{−1/2}]_{ττ}    (8)

where Γ = diag(γ_1, …, γ_T). Using an eigendecomposition of the symmetric matrix Γ^{1/2} D Γ^{1/2} = Σ_{a=1}^T δ_a v_a v_aᵀ, one then shows in a few lines that (8) can be written as

ε_τ ≈ γ_τ⁻¹ Σ_a (v_{a,τ})² δ_a g(nδ_a)    (9)

where g(h) = tr(Λ⁻¹ + hI)⁻¹ = Σ_i (λ_i⁻¹ + h)⁻¹ and v_{a,τ} is the τ-th component of the a-th eigenvector v_a. This is the general asymptotic form of our prediction for the average Bayes error for task τ.

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with n = 500) vs ρ². To focus on the error reduction with ρ, r = [ε_1(ρ) − ε_1(1)]/[ε_1(0) − ε_1(1)] is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger n, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning (T = 200, parameters otherwise as in Fig. 1(left) except ρ² = 0.8). Notice the bend around ε = 1 − ρ = 0.106. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order τ = 1, …, T rather than randomly); dashed line: predictions from (7). Inset: predictions for T = 1000, with asymptotic forms ε = 1 − ρ + ρε̃ and ε = (1 − ρ)ε̄ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order r.
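Before specialising to such smooth priors, note that the step from (8) to (9) is an exact identity for any positive-definite D and Γ, which can be confirmed numerically (a sketch of our own; the random D below stands in for an arbitrary inter-task matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 3
A = rng.standard_normal((T, T))
D = A @ A.T + T * np.eye(T)                 # positive-definite inter-task matrix
lam = 0.5 ** np.arange(30)                  # assumed eigenvalue spectrum
gamma = rng.uniform(0.5, 2.0, T)            # gamma_t = pi_t / sigma_t^2
Gamma = np.diag(gamma)
n = 100.0

# Direct evaluation of (8)
Dinv = np.linalg.inv(D)
eps_direct = np.array([sum(np.linalg.inv(Dinv / li + n * Gamma)[t, t] for li in lam)
                       for t in range(T)])

# Eigendecomposition form (9): Gamma^{1/2} D Gamma^{1/2} = sum_a delta_a v_a v_a^T
S = np.sqrt(Gamma) @ D @ np.sqrt(Gamma)
delta, V = np.linalg.eigh(S)                # V[:, a] is the a-th eigenvector
g = lambda h: np.sum(1.0 / (1.0 / lam + h))
eps_eig = np.array([sum(V[t, a]**2 * delta[a] * g(n * delta[a]) for a in range(T))
                    / gamma[t] for t in range(T)])
```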
The kernel eigenvalues λ_i then decay as¹ i^{−(2r+2)} for large i, and using arguments from [17] one deduces that g(h) ∼ h^{−α} for large h, with α = (2r + 1)/(2r + 2). In (9) we can then write, for large n, g(nδ_a) ≈ (δ_a/γ_τ)^{−α} g(nγ_τ) and hence

ε_τ ≈ g(nγ_τ) { Σ_a (v_{a,τ})² (δ_a/γ_τ)^{1−α} }    (10)

When there is only a single task, δ_1 = γ_1 and this expression reduces to ε_1 = g(nγ_1) = g(n_1/σ²_1). Thus g(nγ_τ) = g(n_τ/σ²_τ) is the error we would get by ignoring all examples from tasks other than τ, and the term in {…} in (10) gives the "multi-task gain", i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for n → ∞, along with the errors themselves.)

One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to r → ∞ and hence α → 1, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues δ_a of Γ^{1/2} D Γ^{1/2} will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against ρ² for the two-task setting of Fig. 1(left) with n = 500. Our predictions capture the nonlinear dependence on ρ² quite well, though the effect is somewhat weaker in the simulations. For larger n the predictions approach a curve that is constant for ρ < 1, signifying negligible improvement from multi-task learning except at ρ = 1. It is worth contrasting this with the lower bound from [21], which is linear in ρ². While this provides a very good approximation to the learning curves for moderate n [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in ρ²: see Fig. 2(left, inset) for the OU case, which has r = 0.
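The stated decay g(h) ∼ h^{−α} can be checked directly for a power-law spectrum; here a truncated λ_i = i^{−(2r+2)} with r = 0 (the rough, OU-like case, α = 1/2) serves as an assumed example:

```python
import numpy as np

def g(h, r, M=200_000):
    """g(h) = sum_i (lam_i^-1 + h)^-1 for lam_i = i^-(2r+2), truncated at M modes."""
    lam_inv = np.arange(1.0, M + 1) ** (2 * r + 2)
    return np.sum(1.0 / (lam_inv + h))

r = 0
alpha = (2 * r + 1) / (2 * r + 2)           # = 1/2 for r = 0
h1, h2 = 1e3, 4e3
# Estimate the local log-log slope of g between h1 and h2
slope = -np.log(g(h2, r) / g(h1, r)) / np.log(h2 / h1)
```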
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

¹ See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations D_{τ,τ'} with τ ≠ τ' are equal to ρ, while D_{τ,τ} = 1. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to √ρ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number n_τ = n/T of examples for each task, and that all noise levels are the same, σ²_τ = σ². Then all the Bayes errors ε_τ = ε will also be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

ε = g_T(n/(σ² + ε), ρ)    (11)

where g_T(h, ρ) is related to the single-task function g(h) from above by

g_T(h, ρ) = [(T − 1)/T] (1 − ρ) g(h(1 − ρ)/T) + [ρ + (1 − ρ)/T] g(h[ρ + (1 − ρ)/T])    (12)

Now consider the limit T → ∞ of many tasks. If n, and hence h = n/(σ² + ε), is kept fixed, g_T(h, ρ) → (1 − ρ) + ρ g(hρ); here we have taken g(0) = 1, which corresponds to tr Λ = ⟨C(x, x)⟩_x = 1 as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form ε = (1 − ρ) + ρε̃, where ε̃ decays from one to zero with increasing n as for a single task, but with an effective noise level σ̃² = (1 − ρ + σ²)/ρ. Remarkably, then, even though here n/T → 0 so that for most tasks no examples have been seen, the Bayes error for each task decreases by "collective learning" to a plateau of height 1 − ρ. The remaining decay of ε to zero happens only once n becomes of order T. Here one can show, by taking T → ∞ at fixed h/T in (12) and inserting into (11), that ε = (1 − ρ)ε̄, where ε̄ again decays as for a single task but with an effective number of examples n̄ = n/T and effective noise level σ̄² = σ²/(1 − ρ).
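The T → ∞ behaviour of (12) at fixed h, g_T(h, ρ) → (1 − ρ) + ρ g(hρ), and the resulting plateau at 1 − ρ can be illustrated numerically (our own sketch; the normalised geometric spectrum is an assumption standing in for tr Λ = 1):

```python
import numpy as np

lam = 0.7 ** np.arange(60)
lam = lam / lam.sum()                       # normalise so tr(Lambda) = 1, i.e. g(0) = 1

def g(h):
    return np.sum(1.0 / (1.0 / lam + h))

def g_T(h, rho, T):
    """Equation (12); D has eigenvalue 1 - rho (multiplicity T-1) and T*c, c = rho + (1-rho)/T."""
    c = rho + (1.0 - rho) / T
    return (T - 1) / T * (1.0 - rho) * g(h * (1.0 - rho) / T) + c * g(h * c)

rho, h = 0.8, 50.0
limit = (1.0 - rho) + rho * g(h * rho)      # T -> infinity value at fixed h
val = g_T(h, rho, 5000)                     # large but finite T
```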
This final stage of learning therefore happens only when each task has seen a considerable number of examples n/T.

Fig. 2(right) validates these predictions against simulations, for a number of tasks (T = 200) that is in the same ballpark as in the many-tasks application example of [24]. The inset for T = 1000 shows clearly how the two learning curve stages separate as T becomes larger. Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order r as before, the function ε̄ from above will decay as (n̄/σ̄²)^{−α}; since ε = (1 − ρ)ε̄ and σ̄² = σ²/(1 − ρ), the Bayes error ε is then proportional to (1 − ρ)^{1−α}. This multi-task gain again approaches unity for ρ < 1 for smooth functions (α = (2r + 1)/(2r + 2) → 1). Interestingly, for rough functions (α < 1), the multi-task gain decreases for small ρ as 1 − (1 − α)ρ, and so always lies below a linear dependence on ρ² initially. This shows that a linear-in-ρ² lower error bound cannot generally apply to T > 2 tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices D. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in ρ², the square of the inter-task correlation, is applicable only to the two-task setting T = 2.

In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations D, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations D would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C E Rasmussen and C K I Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M Álvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees.
Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.

5 0.17444576 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms. 1

6 0.16853622 55 nips-2012-Bayesian Warped Gaussian Processes

7 0.16667128 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

8 0.1590573 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

9 0.11819196 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

10 0.1003546 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

11 0.095413715 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

12 0.092908531 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

13 0.089442387 147 nips-2012-Graphical Models via Generalized Linear Models

14 0.084533669 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

15 0.084010638 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

16 0.083236821 32 nips-2012-Active Comparison of Prediction Models

17 0.082506314 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs

18 0.082109608 118 nips-2012-Entangled Monte Carlo

19 0.081553861 41 nips-2012-Ancestor Sampling for Particle Gibbs

20 0.080843091 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.193), (1, 0.035), (2, 0.048), (3, 0.058), (4, -0.204), (5, -0.036), (6, 0.013), (7, 0.06), (8, -0.032), (9, -0.335), (10, -0.281), (11, -0.034), (12, -0.091), (13, 0.067), (14, -0.135), (15, 0.118), (16, -0.095), (17, 0.006), (18, -0.141), (19, 0.014), (20, -0.073), (21, -0.045), (22, -0.029), (23, 0.09), (24, 0.065), (25, 0.095), (26, -0.067), (27, 0.043), (28, -0.031), (29, -0.013), (30, 0.087), (31, 0.014), (32, -0.021), (33, 0.071), (34, 0.039), (35, 0.008), (36, 0.068), (37, 0.044), (38, 0.011), (39, -0.019), (40, -0.027), (41, -0.019), (42, 0.018), (43, 0.04), (44, -0.01), (45, -0.054), (46, 0.042), (47, 0.025), (48, 0.03), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92566049 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a modelbased method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy. 1

2 0.91025299 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1

3 0.84118158 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-Gredilla

Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: Low data regime, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.
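For intuition, here is the warped-GP idea in its simplest fixed-warp form: map positive targets through a log warp, run standard GP regression in the warped space, and map the posterior mean back through the inverse warp. This is only a sketch of the concept; the paper's contribution is learning a non-parametric warp and integrating it out variationally, none of which is reproduced here.

```python
import numpy as np

def warped_gp_predict(x_train, y_train, x_test, ell=1.0, noise=1e-6):
    """GP regression with a fixed log warp for positive targets.

    Observations y > 0 are mapped to z = log(y); a zero-mean GP with
    squared-exponential kernel models z; since exp is monotonic,
    exp(posterior mean of z) is the predictive median in y-space,
    and predictions are positive by construction.
    """
    z = np.log(y_train)
    K = np.exp(-0.5 * (x_train[:, None] - x_train[None, :]) ** 2 / ell**2)
    K += noise * np.eye(len(x_train))
    Ks = np.exp(-0.5 * (x_test[:, None] - x_train[None, :]) ** 2 / ell**2)
    mu = Ks @ np.linalg.solve(K, z)
    return np.exp(mu)

x = np.linspace(0, 6, 25)
y = np.exp(np.sin(x))          # strictly positive, non-Gaussian targets
pred = warped_gp_predict(x, y, x)
```

An unwarped GP would happily predict negative values for these targets; the warp builds the positivity constraint into the model rather than into post-hoc clipping.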

4 0.77831095 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

5 0.62030059 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

Author: Emtiyaz Khan, Shakir Mohamed, Kevin P. Murphy

Abstract: We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance.

6 0.61753458 187 nips-2012-Learning curves for multi-task Gaussian process regression

7 0.61127216 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

8 0.51071715 11 nips-2012-A Marginalized Particle Gaussian Process Regression

9 0.47350198 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

10 0.44845733 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

11 0.41303232 37 nips-2012-Affine Independent Variational Inference

12 0.3995612 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

13 0.39159843 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

14 0.36732343 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

15 0.36537147 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

16 0.35718331 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

17 0.35031328 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

18 0.3441858 359 nips-2012-Variational Inference for Crowdsourcing

19 0.3352493 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation

20 0.33483815 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.02), (21, 0.019), (38, 0.065), (42, 0.017), (55, 0.012), (74, 0.024), (76, 0.628), (80, 0.088), (92, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99288911 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: We propose an efficient, generalized, nonparametric, statistical KolmogorovSmirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods. 1

same-paper 2 0.98940033 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

3 0.98138034 286 nips-2012-Random Utility Theory for Social Choice

Author: Hossein Azari Soufiani, David C. Parkes, Lirong Xia

Abstract: Random utility theory models an agent’s preferences on alternatives by drawing a real-valued score on each alternative (typically independently) from a parameterized distribution, and then ranking the alternatives according to scores. A special case that has received significant attention is the Plackett-Luce model, for which fast inference methods for maximum likelihood estimators are available. This paper develops conditions on general random utility models that enable fast inference within a Bayesian framework through MC-EM, providing concave log-likelihood functions and bounded sets of global maxima solutions. Results on both real-world and simulated data provide support for the scalability of the approach and capability for model selection among general random utility models including Plackett-Luce. 1
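The Plackett-Luce special case mentioned above has a simple closed-form ranking probability, which the following sketch computes (this illustrates the model only, not the paper's MC-EM inference): the top-ranked item is chosen from all items in proportion to exp(utility), the second from the remainder, and so on.

```python
import math

def plackett_luce_prob(ranking, utilities):
    """Probability of observing `ranking` (item indices, best to worst)
    under the Plackett-Luce model with real-valued item `utilities`.
    """
    remaining = list(ranking)
    prob = 1.0
    for item in ranking:
        # Item `item` is picked from the items not yet ranked,
        # with probability proportional to exp(utility).
        denom = sum(math.exp(utilities[j]) for j in remaining)
        prob *= math.exp(utilities[item]) / denom
        remaining.remove(item)
    return prob

# With equal utilities, every ranking of n items is equally likely: 1/n!.
p = plackett_luce_prob([2, 0, 1, 3], [0.0, 0.0, 0.0, 0.0])
```

For four items with equal utilities this returns 1/24, and raising one item's utility increases the probability of rankings that place it earlier.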

4 0.9801563 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

Author: Francisco Pereira, Matthew Botvinick

Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure.

5 0.97903824 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller

Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video.

6 0.97682929 205 nips-2012-MCMC for continuous-time discrete-state systems

7 0.97478747 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models

8 0.94144219 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

9 0.93781585 247 nips-2012-Nonparametric Reduced Rank Regression

10 0.93486577 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

11 0.88185984 338 nips-2012-The Perturbed Variation

12 0.87308776 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

13 0.86880159 41 nips-2012-Ancestor Sampling for Particle Gibbs

14 0.86713731 142 nips-2012-Generalization Bounds for Domain Adaptation

15 0.86298662 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

16 0.85668498 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

17 0.85568929 327 nips-2012-Structured Learning of Gaussian Graphical Models

18 0.8539865 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

19 0.85364813 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

20 0.85341698 291 nips-2012-Reducing statistical time-series problems to binary classification