55 nips-2012-Bayesian Warped Gaussian Processes

Source: pdf

Author: Miguel Lázaro-gredilla

Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: low data regime, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). [sent-4, score-0.171]

2 The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. [sent-5, score-0.036]

3 In order to learn its parameters, maximum likelihood was used. [sent-6, score-0.052]

4 In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. [sent-7, score-0.093]

5 The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: low data regime, data with censored values, classification, etc. [sent-8, score-0.114]

6 We demonstrate the superior performance of Bayesian warped GPs on several real data sets. [sent-9, score-0.16]

7 1 Introduction In a Bayesian setting, the Gaussian process (GP) is commonly used to define a prior probability distribution over functions. [sent-10, score-0.033]

8 In the regression setting, output data are often modelled directly as observations from a GP. [sent-13, score-0.099]

9 However, it is shown in [1] that for some data sets, better models can be built if the observed outputs are regarded as a nonlinear distortion (the so-called warping) of a GP instead. [sent-14, score-0.039]

10 For a warped GP (WGP), the warping function can take any parametric form, and in [1] the sum of a linear function and several tanh functions is used. [sent-15, score-0.726]

11 The parameters defining the transformation are then learned using maximum likelihood. [sent-16, score-0.036]

12 In this work we set out to show that it is possible to place another GP prior on the warping function and variationally integrate it out. [sent-18, score-0.552]

13 In Section 3, a variational lower bound on the exact evidence of the model is derived, which allows for approximate inference and hyperparameter learning. [sent-21, score-0.08]

14 We show the advantages of integrating out the warping function in Section 4, where we compare the performance of the maximum likelihood and the Bayesian versions of warped GPs. [sent-22, score-0.686]

15 Notice that by setting the prior mean on g(f) to f, we assume that the warping is “by default” the identity. [sent-27, score-0.493]

16 For f, any valid covariance function $k(x, x')$ can be used, whereas for the warping function g we use a squared exponential: $c(f, f') = \sigma_g^2 \exp(-(f - f')^2 / (2\ell^2))$. [sent-28, score-0.497]
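The squared exponential covariance used for the warping function can be sketched numerically as follows. This is a minimal illustration rather than the paper's implementation: the function name `warping_se_kernel` and the default values of `sigma_g` and `ell` are assumptions made for the example.

```python
import numpy as np

def warping_se_kernel(f, f_prime, sigma_g=1.0, ell=1.0):
    """Squared exponential covariance c(f, f') between latent values.

    sigma_g is the signal standard deviation and ell the length-scale;
    both defaults are illustrative assumptions, not values from the paper.
    """
    f = np.asarray(f, dtype=float).reshape(-1, 1)
    f_prime = np.asarray(f_prime, dtype=float).reshape(1, -1)
    # Broadcasting builds the full matrix of pairwise squared differences.
    return sigma_g**2 * np.exp(-((f - f_prime) ** 2) / (2.0 * ell**2))

# Covariance matrix for two latent values one length-scale apart.
C = warping_se_kernel([0.0, 1.0], [0.0, 1.0], sigma_g=2.0, ell=1.0)
```

The diagonal of `C` equals `sigma_g**2`, and off-diagonal entries decay with the squared distance between latent values.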

17 The mentioned hyperparameters as well as those included in $k(x, x')$ can be collected in $\theta \equiv \{\theta_k, \sigma_g, \ell, \sigma, \mu_0\}$. [sent-29, score-0.058]

18 It might seem that since f (x) is already an arbitrary nonlinear function, further distorting its output through g(f ) is an additional unnecessary complication. [sent-30, score-0.07]

19 However, even though g(f (x)) can model arbitrary functions just as f (x) is able to, the implied prior is very different since the composition of two GPs g(f (x)) is no longer a GP. [sent-31, score-0.034]

20 This is the same idea as with copulas, but here the warping function g(f ) is treated in non-parametric form. [sent-32, score-0.474]

21 BWGP has an additional noise term ε that can account for extra noise in the observations after warping. [sent-35, score-0.075]

22 BWGP places a prior on the warping function, instead of using a parametric definition, which allows for maximum flexibility while avoiding overfitting. [sent-37, score-0.534]

23 On the other hand, by choosing the number of tanh functions in their parametric warping function, WGP sets a trade-off between both. [sent-38, score-0.581]

24 Finally, the definition of the warping function is reversed between BWGP and WGP. [sent-39, score-0.474]

25 If no noise is present, our warping function y = g(f ) maps latent space f to output space y. [sent-40, score-0.562]

26 Because of this, the warping function in [1] is restricted to be monotonic, so that it is possible to unambiguously identify its inverse $y = w^{-1}(f) = g(f)$ and thus define a valid probability distribution in output space. [sent-42, score-0.497]

27 Since we already work with the direct warping function g(f ), we do not need to impose any constraint on it and thus can use a GP prior. [sent-43, score-0.474]

28 , $g'(f) = 0$) in the warping function (such as ordinal regression or classification), since the inverse $w(y) = g^{-1}(y)$ is not well defined. [sent-46, score-0.534]

29 None of these problems arise in BWGP, which can handle both continuous and discrete observations and model warping functions with flat regions. [sent-49, score-0.51]

30 2 Relationship with other Gaussian processes models For a given warping function g(f), BWGP can be seen as a standard GP model with likelihood $p(y_i | f(x_i)) = \mathcal{N}(y_i | g(f(x_i)), \sigma^2)$. [sent-51, score-0.535]
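For a fixed warping function, the Gaussian likelihood in the sentence above can be evaluated directly. The following is a minimal sketch under that assumption; `bwgp_log_likelihood` is a hypothetical helper name, and the warping `g` is passed in as an ordinary Python callable rather than inferred.

```python
import numpy as np

def bwgp_log_likelihood(y, f, g, sigma):
    """Sum over i of log N(y_i | g(f_i), sigma^2) for a fixed warping g.

    y: observed outputs, f: latent values, g: callable warping function,
    sigma: output noise standard deviation (all supplied by the caller).
    """
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    resid = y - g(f)
    n = y.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * np.sum(resid**2) / sigma**2)

# With the identity warping this reduces to the usual GP regression likelihood.
ll_identity = bwgp_log_likelihood([0.1, -0.3, 0.8], [0.1, -0.3, 0.8],
                                  g=lambda f: f, sigma=1.0)
```

Swapping in a different callable for `g` (for example `np.tanh`) changes the likelihood without touching the rest of the model, which is the sense in which a given warping corresponds to a particular likelihood.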

31 Using a noisy latent function f (x) as prior and a step function as likelihood is equivalent to using a noiseless latent function as prior and normal cdf sigmoid function as likelihood [4], so this model corresponds exactly with GP probit classification. [sent-54, score-0.2]

32 H(f) is the Heaviside step function and $b_k$ are parameters defining the widths and locations of the K bins in latent space. [sent-56, score-0.079]

33 • Maximum likelihood WGP [1]: Corresponds to setting $g(f) = w^{-1}(f)$ and $\sigma^2 = 0$. [sent-57, score-0.033]

34 Thus, to some extent, BWGP can be regarded as a likelihood learning tool. [sent-60, score-0.055]

35 Instead of resorting to expensive Monte Carlo methods, we will develop an efficient variational approximation of comparable computational cost to that of WGP. [sent-62, score-0.065]

36 We omit conditioning on inputs $\{x_i\}_{i=1}^n$ and hyperparameters θ. [sent-69, score-0.086]

37 fn ]⊤ is the latent function evaluated at the training inputs {x1 . [sent-73, score-0.089]

38 We use K to refer to the n × n covariance matrix of the latent function, with entries $[K]_{ij} = k(x_i, x_j)$, whereas similarly $[C_{ff}]_{ij} = c(f_i, f_j)$ is the n × n warping covariance matrix. [sent-80, score-0.558]

39 Now we proceed as in sparse GPs [10] and augment this model with a set of m inducing variables u = [u1 . [sent-82, score-0.053]

40 We can expand p(g|f ) by ﬁrst conditioning on u to obtain p(g|u, f ), and then including the prior p(u). [sent-89, score-0.04]

41 The inclusion of the inducing variables does not change the model, independently of their number m or their locations v1 . [sent-92, score-0.073]

42 Expressing the warping function as g(v) = u(v) + v, the inducing variables correspond to evaluating GP u(v) at locations v1 . [sent-97, score-0.547]

43 vm , which live in latent space (just as f does). [sent-100, score-0.068]

44 Observe that u provides a probabilistic description of the warping function. [sent-101, score-0.474]

45 In particular, as m grows and the sampling in latent space becomes more and more dense, the covariance $C_{ff} - C_{fv} C_{vv}^{-1} C_{fv}^\top$ gets closer to zero and $p(g|u, f)$ becomes a Dirac delta, thus making the warping function deterministic given u, g(f) = f + [c(f, v1) . [sent-102, score-0.717]
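The shrinking of the conditional covariance as the inducing locations become denser can be checked numerically. The sketch below is an illustration under assumed settings, not the paper's code: it uses a unit squared exponential covariance, evenly spaced inducing locations, and a small `jitter` term added purely for numerical stability.

```python
import numpy as np

def se(a, b, sigma_g=1.0, ell=1.0):
    """Squared exponential covariance between two sets of latent values."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return sigma_g**2 * np.exp(-((a - b) ** 2) / (2.0 * ell**2))

def conditional_cov(f, v, jitter=1e-8):
    """C_ff - C_fv C_vv^{-1} C_fv^T for latent values f and inducing locations v.

    The jitter on the diagonal of C_vv is an assumption added for stability.
    """
    Cff = se(f, f)
    Cfv = se(f, v)
    Cvv = se(v, v) + jitter * np.eye(len(v))
    return Cff - Cfv @ np.linalg.solve(Cvv, Cfv.T)

# The total residual variance drops as the number of inducing locations m grows.
f = np.linspace(-2.0, 2.0, 25)
for m in (3, 6, 12):
    v = np.linspace(-2.5, 2.5, m)
    print(m, np.trace(conditional_cov(f, v)))
```

Printing the trace gives a single scalar summary of how much uncertainty about g remains once u is known.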

46 2 Variational lower bound The exact posterior of BWGP model (3) is analytically intractable. [sent-107, score-0.078]

47 We can proceed by selecting, within a given family of distributions, the approximate posterior q(g, u, f ) that minimises the Kullback-Leibler (KL) divergence to the true posterior p(g, u, f |y). [sent-108, score-0.102]

48 where F is a variational lower bound on the evidence log p(y). [sent-111, score-0.281]

49 Since log p(y) is constant for any choice of q, it is obvious that maximising F wrt q yields the best approximation in the mentioned KL sense within the considered family of distributions. [sent-112, score-0.077]

50 We should choose a family that can model the posterior as well as possible while keeping the computation of F tractable. [sent-113, score-0.051]
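The relationship between the bound, the evidence and the KL divergence that this argument relies on is the standard variational identity, written out here for the variables of this model. The notation F(q) is shorthand introduced for this note.

```latex
\log p(y) = F(q) + \mathrm{KL}\big( q(g, u, f) \,\|\, p(g, u, f \mid y) \big),
\qquad
F(q) = \int q(g, u, f) \, \log \frac{p(y, g, u, f)}{q(g, u, f)} \, dg \, du \, df .
```

Because the KL term is non-negative and log p(y) does not depend on q, raising F(q) within the chosen family lowers the KL divergence by the same amount.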

51 The remaining terms which depend on u can be arranged as follows: $\int q(u) \log \frac{p(u) \exp\left(-\frac{1}{2\sigma^2} u^\top C_{vv}^{-1} \Psi_2 C_{vv}^{-1} u + \frac{1}{\sigma^2} (y^\top \Psi_1 - \psi_3^\top) C_{vv}^{-1} u\right)}{q(u)} \, du$. [sent-120, score-0.565]

52 Replacing one of the variational distributions within the bound by its optimal value is sometimes referred to as using a “marginalised variational bound” [11]. [sent-125, score-0.13]

53 By imposing $\partial F(\mu, \Sigma) / \partial \Sigma = 0$, we know that the posterior covariance can be expressed as $\Sigma = (K^{-1} + \Lambda)^{-1}$, for some diagonal matrix Λ. [sent-132, score-0.074]

54 With this definition, the bound F(µ, Λ) now depends only on 2n free variational parameters and can be computed in $O(n^3)$ time and $O(n^2)$ space, just as WGP. [sent-133, score-0.083]
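A posterior covariance of the form (K^{-1} + Λ)^{-1} with diagonal Λ can be formed without explicitly inverting K, using the algebraic identity (K^{-1} + Λ)^{-1} = (I + KΛ)^{-1} K. The sketch below illustrates that identity only; the helper name `posterior_cov` and the small test matrices are assumptions, and the paper does not state which numerical route its implementation takes.

```python
import numpy as np

def posterior_cov(K, lam):
    """Return (K^{-1} + diag(lam))^{-1} via the identity (I + K diag(lam))^{-1} K.

    K: n x n prior covariance matrix; lam: length-n vector holding the diagonal
    of Lambda. Only one n x n linear solve is needed, so the cost is O(n^3).
    """
    K = np.asarray(K, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = K.shape[0]
    # K * lam[None, :] scales column j of K by lam[j], i.e. it equals K @ diag(lam).
    return np.linalg.solve(np.eye(n) + K * lam[None, :], K)

K_demo = np.array([[2.0, 0.5], [0.5, 1.0]])
lam_demo = np.array([0.3, 1.2])
Sigma_demo = posterior_cov(K_demo, lam_demo)
```

Storing only the n entries of `lam` alongside the n entries of the mean is what gives the 2n free variational parameters mentioned above.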

55 The hyperparameters are the same as for a WGP that uses a single tanh function, so no overfitting is expected, while still enjoying a completely flexible warping function. [sent-136, score-0.573]

56 4 Approximate predictive density In order to use the proposed approximate posterior to make predictions for a new test output $y_*$ given input $x_*$ we need to compute $q(y_* | y) = \int p(y_* | g_*) \, p(g_* | f_*, u) \, q(u) \, p(f_* | f) \, q(f) \, dg_* \, df_* \, du \, df$. [sent-138, score-0.074]

57 However, the posterior mean and variance can be computed analytically. [sent-147, score-0.051]

58 In spite of this, the approximate posterior is not Gaussian in general. [sent-149, score-0.051]

59 In our experiments we will compare its performance with that of the original implementation of the maximum likelihood WGP model from [1]. [sent-151, score-0.052]

60 In order to show the effect of varying the complexity of the parametric warping function in WGP, we tested a 3 tanh model (the default, used in the experiments from [1]) and a 20 tanh model, denoted as WGP3 and WGP20, respectively. [sent-152, score-0.606]

61 The standard ARD SE covariance function [4] plus noise was used for the underlying GP in all models. [sent-156, score-0.05]

62 In order to generate a nonlinearly distorted signal, we round a sine function to the nearest integer and add Gaussian noise with variance σ 2 = 2. [sent-166, score-0.049]
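A toy data set of this kind can be generated along the following lines. This is a hedged sketch: the input range, sample size, sine amplitude and `noise_var` default are all assumptions chosen for illustration, since the exact settings are not recoverable from the sentence above.

```python
import numpy as np

def make_quantised_sine(n=200, noise_var=0.01, amplitude=1.0, seed=0):
    """Round a sine wave to the nearest integer and add Gaussian noise.

    All arguments are illustrative assumptions; noise_var is the variance
    of the additive Gaussian output noise.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.pi, np.pi, size=n)
    clean = np.round(amplitude * np.sin(x))      # quantised (distorted) signal
    y = clean + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return x, clean, y

x, clean, y = make_quantised_sine()
```

With the default amplitude the quantised signal only takes the values -1, 0 and 1, so the flat regions that make this problem hard for a monotonic warping are easy to see when `clean` is plotted against `x`.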

63 A dashed envelope encloses 90% posterior mass for WGP, whereas a shading is used to show 90% posterior mass for BWGP. [sent-184, score-0.155]

64 Middle: The dotted line shows the true posterior at x = 0 and x = 0. [sent-185, score-0.051]

65 Since WGP does not model output noise explicitly, these ﬂat zones transfer and magnify output noise to latent space, with the consequent degradation in performance. [sent-190, score-0.138]

66 Note the extra spread of the posterior mass in comparison with the actual training data, which is much better modelled by BWGP. [sent-191, score-0.121]

67 BWGP is able to deal properly with noisy quantised signals and it is able to learn the implicit quantisation function. [sent-194, score-0.054]

68 2 Regression data sets We now turn to the three real data sets originally used in [1] to assess WGP and for which it is specially suited. [sent-196, score-0.051]

69 These are: abalone [14] (4177 samples, 8 dimensions), creep [15, 16] (2066 samples, 30 dimensions), and ailerons [17] (7154 samples, 40 dimensions). [sent-197, score-0.414]

70 For each problem, we generated 60 splits by randomly partitioning data. [sent-199, score-0.033]

71 The warping functions inferred by BWGP are displayed in Fig. [sent-201, score-0.536]

72 3(a)-(c) and are almost identical to those displayed in [1] for WGP. [sent-202, score-0.033]

73 [Table: MSE and NLPD on abalone, creep and ailerons (ailerons MSE ×10^−8) for models GP, BWGP, MLWGP3 and MLWGP20] [sent-206, score-0.751]

74 In terms of NLPD, BWGP always outperforms the standard GP, but it is in turn outperformed by the maximum likelihood variants, which do not need to resort to any approximation to compute their posterior. [sent-254, score-0.052]

75 In terms of MSE, BWGP always performs better than WGP20 on these data sets, but only performs better than WGP3 on the creep data set, which, on the other hand, is the one that seems to benefit more from the use of a warping function. [sent-255, score-0.671]

76 It seems that the additional flexibility of the warping function in WGP20 is penalising its ability to generalise properly. [sent-256, score-0.474]

77 Upon seeing these results, we can conclude that WGP3 is already a good enough solution when abundant training data are available and a simple warping function is required. [sent-257, score-0.489]

78 This is reasonable: The additional number of hyperparameters is small (only 9) and inference can be performed analytically. [sent-258, score-0.044]

79 3(a)-(c) that the posterior over the warping functions is highly peaked, so a maximum likelihood approach makes sense. [sent-260, score-0.592]

80 However, performance might suffer when the warping function becomes even slightly complex, as in creep, or when the number of available data for training is very small (see the effect of the training set size on Fig. [sent-261, score-0.504]

81 In those cases, BWGP is a safer option, since it will not overfit independently of the amount of data while allowing for a highly flexible warping function. [sent-263, score-0.474]
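The two figures of merit used in this comparison, MSE and NLPD, can be computed as sketched below. The NLPD helper assumes a Gaussian predictive density described by a mean and a variance, which is a simplification made for this example; the function names are hypothetical, and a non-Gaussian predictive density (as the BWGP posterior generally is) would need its own log-density evaluated instead.

```python
import numpy as np

def mse(y_true, y_mean):
    """Mean squared error of the predictive means."""
    y_true = np.asarray(y_true, dtype=float)
    y_mean = np.asarray(y_mean, dtype=float)
    return np.mean((y_true - y_mean) ** 2)

def gaussian_nlpd(y_true, y_mean, y_var):
    """Average negative log predictive density under a Gaussian assumption."""
    y_true = np.asarray(y_true, dtype=float)
    y_mean = np.asarray(y_mean, dtype=float)
    y_var = np.asarray(y_var, dtype=float)
    return np.mean(0.5 * np.log(2.0 * np.pi * y_var)
                   + 0.5 * (y_true - y_mean) ** 2 / y_var)

example_mse = mse([1.0, 2.0], [1.0, 4.0])
example_nlpd = gaussian_nlpd([0.0], [0.0], [1.0])
```

Lower is better for both quantities; NLPD additionally penalises predictive variances that are too small or too large, which is why a model can rank differently under the two measures.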

82 3 Censored regression data sets We will now modify the previous data sets so that they become more challenging. [sent-267, score-0.059]

83 As discussed in [1], for this type of data, WGP tries to spread the samples in latent space by using a very sharp warping function and this causes the model problems. [sent-276, score-0.527]

84 Additionally, the computation of the NLPD becomes erroneous due to numerical problems, with some of the tanh functions becoming very close to sign functions. [sent-277, score-0.07]

85 Table 2: NMSE and NLPD figures for the compared methods on censored data sets. [sent-282, score-0.062]

86 [Table columns: MSE and NLPD on abalone, creep and ailerons (ailerons MSE ×10^−8); rows: GP, BWGP, WGP3 and WGP20] [sent-283, score-0.751]

87 4 Classification data sets Classification can be regarded as an extreme case of censoring or quantisation of a regression data set. [sent-320, score-0.096]

91 Figure 3: Inferred warping functions. Panels: (a) abalone (reg), (b) creep (reg), (c) ailerons (reg), (e) abalone (cens), (f) creep (cens), (g) ailerons (cens), (h) titanic (class); horizontal axis f. [sent-348, score-0.611]

92 Table 3: Error rates (in percentage) for the proposed model on the benchmark from Rätsch [18]. [sent-349, score-0.033]

93 So we decided to test the BWGP model on the 13 classification data sets from the Rätsch benchmark [18]. [sent-390, score-0.048]

94 Instead, we used a standard GP classifier (GPC) using a probit likelihood and expectation propagation for approximate inference. [sent-392, score-0.053]

95 The learned warping functions look similar for the different data sets. [sent-395, score-0.489]

96 Especially good results are obtained for german, ringnorm, and splice, though we are aware that even better results can be obtained by using an isotropic SE covariance on these data sets [19]. [sent-400, score-0.038]

97 5 Discussion and further work In this work we have shown how it is possible to variationally integrate out the warping function from warped GPs. [sent-401, score-0.693]

98 The experiments demonstrate the improved robustness of the BWGP model, which is able to operate properly in a much wider set of scenarios. [sent-403, score-0.04]

99 One example is ordinal regression [8], where the locations and widths of the bins can be integrated out instead of selected. [sent-408, score-0.101]

100 Variational learning of inducing variables in sparse Gaussian processes. [sent-475, score-0.053]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bwgp', 0.514), ('warping', 0.474), ('wgp', 0.333), ('cvv', 0.267), ('gp', 0.262), ('creep', 0.197), ('vv', 0.182), ('warped', 0.16), ('cf', 0.134), ('abalone', 0.11), ('ailerons', 0.107), ('nlpd', 0.106), ('variational', 0.065), ('mse', 0.062), ('censored', 0.062), ('gpc', 0.06), ('gps', 0.057), ('tanh', 0.055), ('inducing', 0.053), ('posterior', 0.051), ('cens', 0.045), ('wgps', 0.045), ('hyperparameters', 0.044), ('wrt', 0.044), ('gaussian', 0.044), ('variationally', 0.04), ('latent', 0.038), ('reg', 0.033), ('tsch', 0.033), ('likelihood', 0.033), ('splits', 0.033), ('displayed', 0.033), ('ordinal', 0.031), ('ail', 0.03), ('distorting', 0.03), ('quantisation', 0.03), ('titanic', 0.03), ('vm', 0.03), ('titsias', 0.029), ('regression', 0.029), ('german', 0.028), ('processes', 0.028), ('dg', 0.028), ('analytically', 0.027), ('noise', 0.027), ('du', 0.027), ('modelled', 0.026), ('df', 0.026), ('kl', 0.025), ('bayesian', 0.025), ('miguel', 0.025), ('shading', 0.025), ('properly', 0.024), ('monotonic', 0.023), ('output', 0.023), ('covariance', 0.023), ('regarded', 0.022), ('trace', 0.022), ('nmse', 0.022), ('sine', 0.022), ('copulas', 0.022), ('parametric', 0.022), ('inputs', 0.021), ('widths', 0.021), ('specially', 0.021), ('observations', 0.021), ('ii', 0.021), ('classi', 0.021), ('conditioning', 0.021), ('vj', 0.02), ('locations', 0.02), ('probit', 0.02), ('default', 0.019), ('log', 0.019), ('prior', 0.019), ('maximum', 0.019), ('integrate', 0.019), ('free', 0.018), ('augmented', 0.017), ('transformation', 0.017), ('dirac', 0.017), ('nonlinear', 0.017), ('spaced', 0.016), ('ij', 0.016), ('wider', 0.016), ('sets', 0.015), ('spread', 0.015), ('se', 0.015), ('delta', 0.015), ('evidence', 0.015), ('training', 0.015), ('jensen', 0.015), ('functions', 0.015), ('fn', 0.015), ('reasonable', 0.014), ('mass', 0.014), ('eq', 0.014), ('mentioned', 0.014), ('process', 0.014), ('inferred', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-gredilla

2 0.19839667 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.

3 0.16853622 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

4 0.13937275 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

5 0.12566884 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form inter-task covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T. We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks.
1 Introduction and motivation
Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9].
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size? For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions.
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently.
2 GP regression and Bayes error
We consider GP regression for T functions $f_\tau(x)$, $\tau = 1, 2, \ldots, T$. These functions have to be learned from n training examples $(x_\ell, \tau_\ell, y_\ell)$, $\ell = 1, \ldots, n$. Here $x_\ell$ is the training input, $\tau_\ell \in \{1, \ldots, T\}$ denotes which task the example relates to, and $y_\ell$ is the corresponding training output. We assume that the latter is given by the target function value $f_{\tau_\ell}(x_\ell)$ corrupted by i.i.d. additive Gaussian noise with zero mean and variance $\sigma_{\tau_\ell}^2$. This setup allows the noise level $\sigma_\tau^2$ to depend on the task. In GP regression the prior over the functions $f_\tau(x)$ is a Gaussian process. This means that for any set of inputs $x_\ell$ and task labels $\tau_\ell$, the function values $\{f_{\tau_\ell}(x_\ell)\}$ have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances $\langle f_\tau(x) f_{\tau'}(x') \rangle = C(\tau, x, \tau', x')$. For this covariance we take the flexible form from [5], $\langle f_\tau(x) f_{\tau'}(x') \rangle = D_{\tau\tau'} C(x, x')$. Here $C(x, x')$ determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix.
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ, one would predict as output the mean of $f_\tau(x)$ over the (Gaussian) posterior, which is $y^T K^{-1} k_\tau(x)$. Here K is the n × n Gram matrix with entries $K_{\ell m} = D_{\tau_\ell \tau_m} C(x_\ell, x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m}$, while $k_\tau(x)$ is a vector with the n entries $k_{\tau,\ell} = D_{\tau_\ell \tau} C(x_\ell, x)$. The error bar would be taken as the square root of the posterior variance of $f_\tau(x)$, which is
$V_\tau(x) = D_{\tau\tau} C(x, x) - k_\tau^T(x) K^{-1} k_\tau(x)$ (1)
The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say $n_1$ for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs $y_\ell$ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error $\hat\epsilon_\tau$ is the Bayes error, and is given by the average posterior variance [1], i.e. $\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x$. To obtain the learning curve this is averaged over the location of the training inputs $x_\ell$: $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$. This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in $V_\tau(x)$. Note that the training outputs, on the other hand, do not appear in the posterior variance $V_\tau(x)$ and so do not need to be averaged over. We now want to write the Bayes error $\hat\epsilon_\tau$ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs $x_\ell$, and also the test input x, are drawn from the same distribution P(x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P(x), according to $C(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$.
The eigenfunctions are defined by the condition $\langle C(x, x') \phi_i(x') \rangle_{x'} = \lambda_i \phi_i(x)$ and can be chosen to be orthonormal with respect to P(x), $\langle \phi_i(x) \phi_j(x) \rangle_x = \delta_{ij}$. The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel $C(x, x') = x \cdot x'$). To make the algebra below as simple as possible, we let the eigenvalues $\lambda_i$ be arranged in decreasing order and truncate the sum to the finite range $i = 1, \ldots, M$; M is then some large effective feature space dimension and can be taken to infinity at the end. In terms of the above eigenfunction decomposition, the Gram matrix has elements
$K_{\ell m} = D_{\tau_\ell \tau_m} \sum_i \lambda_i \phi_i(x_\ell) \phi_i(x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m} = \sum_{i,\tau,j,\tau'} \delta_{\tau_\ell,\tau} \phi_i(x_\ell) \lambda_i \delta_{ij} D_{\tau\tau'} \phi_j(x_m) \delta_{\tau',\tau_m} + \sigma_{\tau_\ell}^2 \delta_{\ell m}$
or in matrix form $K = \Psi L \Psi^T + \Sigma$ where Σ is the diagonal matrix from the noise variances and
$\Psi_{\ell,i\tau} = \delta_{\tau_\ell,\tau} \phi_i(x_\ell)$, $L_{i\tau,j\tau'} = \lambda_i \delta_{ij} D_{\tau\tau'}$ (2)
Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, $L = D \otimes \Lambda$ if we define Λ as the diagonal matrix with entries $\lambda_i \delta_{ij}$. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, and $\mathrm{tr}\,(A \otimes B) = (\mathrm{tr}\,A)(\mathrm{tr}\,B)$. In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ. To calculate the Bayes error, we need to average the posterior variance $V_\tau(x)$ over the test input x. The first term in (1) then becomes $D_{\tau\tau} \langle C(x, x) \rangle_x = D_{\tau\tau} \,\mathrm{tr}\,\Lambda$.
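In the second one, we need to average

$$\begin{aligned} \langle k_{\tau,\ell}(x)\, k_{\tau,m}(x) \rangle_x &= D_{\tau_\ell \tau} \langle C(x_\ell, x) C(x, x_m) \rangle_x D_{\tau_m \tau} \\ &= D_{\tau_\ell \tau} \sum_{ij} \lambda_i \lambda_j \phi_i(x_\ell) \langle \phi_i(x) \phi_j(x) \rangle_x \phi_j(x_m) D_{\tau_m \tau} \\ &= \sum_{i,\tau',j,\tau''} D_{\tau\tau'} \Psi_{\ell,i\tau'}\, \lambda_i \lambda_j \delta_{ij}\, \Psi_{m,j\tau''} D_{\tau''\tau} \end{aligned}$$

In matrix form this is $\langle k_\tau(x) k_\tau^T(x) \rangle_x = \Psi [(D e_\tau e_\tau^T D) \otimes \Lambda^2] \Psi^T = \Psi M_\tau \Psi^T$. Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the $T$-dimensional vector with $\tau$-th component equal to one and all others zero. Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task $\tau$

$$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^T (\Psi L \Psi^T + \Sigma)^{-1}$$

Applying the Woodbury identity and re-arranging yields

$$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau \Psi^T \Sigma^{-1} \Psi (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau L^{-1} [I - (I + L \Psi^T \Sigma^{-1} \Psi)^{-1}]$$

But

$$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\, \{[(D e_\tau e_\tau^T D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\, \{[D e_\tau e_\tau^T] \otimes \Lambda\} = e_\tau^T D e_\tau\, \mathrm{tr}\,\Lambda = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$$\begin{aligned} \hat\epsilon_\tau &= \mathrm{tr}\, M_\tau L^{-1} (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \\ &= \mathrm{tr}\, [D \otimes \Lambda]^{-1} [(D e_\tau e_\tau^T D) \otimes \Lambda^2] [D \otimes \Lambda]^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \\ &= \mathrm{tr}\, [e_\tau e_\tau^T \otimes I] (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \end{aligned}$$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task $\tau$; thought of as a matrix of $T \times T$ blocks (each of size $M \times M$), this has an identity matrix in the $(\tau, \tau)$ block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task $\tau$,

$$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \tag{3}$$

Because $\Sigma$ is diagonal and given the definition (2) of $\Psi$, the matrix $\Psi^T \Sigma^{-1} \Psi$ is a sum of contributions from the individual training examples $\ell = 1, \ldots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\, (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \ldots \rangle$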
over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $\mathcal{G} = (L^{-1} + \Psi^T \Sigma^{-1} \Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $G = \langle \mathcal{G} \rangle$ and hence $\epsilon_\tau = \mathrm{tr}\, P_\tau G$ changes with the number $n_\tau$ of training points for task $\tau$. Adding an example at position $x$ for task $\tau$ increases $\Psi^T \Sigma^{-1} \Psi$ by $\sigma_\tau^{-2} \phi_\tau \phi_\tau^T$, where $\phi_\tau$ has elements $(\phi_\tau)_{i\tau'} = \phi_i(x) \delta_{\tau\tau'}$. Evaluating the difference $(\mathcal{G}^{-1} + \sigma_\tau^{-2} \phi_\tau \phi_\tau^T)^{-1} - \mathcal{G}$ with the help of the Woodbury identity and approximating it with a derivative gives

$$\frac{\partial \mathcal{G}}{\partial n_\tau} = -\frac{\mathcal{G} \phi_\tau \phi_\tau^T \mathcal{G}}{\sigma_\tau^2 + \phi_\tau^T \mathcal{G} \phi_\tau}$$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$$\frac{\partial G}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \mathrm{tr}\, P_\tau G} \frac{\partial G}{\partial v_\tau} \tag{4}$$

Here we have exploited for the average over $x$ that the matrix $\langle \phi_\tau \phi_\tau^T \rangle_x$ has $(i, \tau'), (j, \tau'')$-entry $\langle \phi_i(x) \phi_j(x) \rangle_x \delta_{\tau\tau'} \delta_{\tau\tau''} = \delta_{ij} \delta_{\tau\tau'} \delta_{\tau\tau''}$, hence simply $\langle \phi_\tau \phi_\tau^T \rangle_x = P_\tau$. We have also used the auxiliary parameters to rewrite $-\langle \mathcal{G} P_\tau \mathcal{G} \rangle = \partial \langle \mathcal{G} \rangle / \partial v_\tau = \partial G / \partial v_\tau$. Finally, multiplying (4) by $P_{\tau'}$ and taking the trace gives the set of quasi-linear partial differential equations

$$\frac{\partial \epsilon_{\tau'}}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \epsilon_\tau} \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \tag{5}$$

The remaining task is now to find the functions $\epsilon_\tau(n_1, \ldots, n_T, v_1, \ldots, v_T)$ by solving these differential equations. We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for $T = 2$ and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a $2T$-dimensional surface in the $3T$-dimensional space $(n_1, \ldots, n_T, v_1, \ldots, v_T, \epsilon_1, \ldots, \epsilon_T)$, which is specified by the $T$ functions $\epsilon_\tau(\ldots)$. A small change $(\delta n_1, \ldots, \delta n_T, \delta v_1, \ldots, \delta v_T, \delta\epsilon_1, \ldots, \delta\epsilon_T)$ in all $3T$ coordinates is tangential to this surface if it obeys the $T$ constraints (one for each $\tau'$)

$$\delta\epsilon_{\tau'} = \sum_\tau \left( \frac{\partial \epsilon_{\tau'}}{\partial n_\tau} \delta n_\tau + \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \delta v_\tau \right)$$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_\tau = -\delta v_\tau (\sigma_\tau^2 + \epsilon_\tau)$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \text{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0}(\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all $t$, so lie within this surface if the initial point at $t = 0$ does. Because at $t = 0$ there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1}$$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at $t = 0$ and $t = 1$

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ gives then

$$\epsilon_\tau = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} P_{\tau'} \Big)^{-1} \tag{6}$$

This is our main result: a closed set of $T$ self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given $L$ as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task.

The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of $M \times M$ blocks of size $T \times T$ (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$$\epsilon_\tau = \sum_i \Big( \lambda_i^{-1} D^{-1} + \mathrm{diag}\Big(\Big\{ \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \Big\}\Big) \Big)^{-1}_{\tau\tau} \tag{7}$$

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where $D$ is diagonal. Here one recovers $T$ separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].
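The self-consistency equations (7) can be solved numerically in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes plain fixed-point iteration started from the prior variance, and the function names (`rhs_eq7`, `solve_eq7`) and the geometric eigenvalue spectrum in the usage part are hypothetical choices made for illustration only.

```python
import numpy as np


def rhs_eq7(eps, lam, D, n, sigma2):
    """Right-hand side of (7), evaluated for all tasks at once."""
    T = D.shape[0]
    a = n / (sigma2 + eps)  # n_tau / (sigma_tau^2 + eps_tau), one weight per task
    out = np.zeros(T)
    for l in lam:
        # (D^{-1}/l + diag(a))^{-1} = l D (I + l diag(a) D)^{-1}.
        # This form avoids inverting D, which is singular for fully correlated tasks.
        block = l * D @ np.linalg.inv(np.eye(T) + l * a[:, None] * D)
        out += np.diag(block)
    return out


def solve_eq7(lam, D, n, sigma2, tol=1e-12, max_iter=100_000):
    """Fixed-point iteration for the T self-consistency equations (7)."""
    lam = np.asarray(lam, dtype=float)
    D = np.asarray(D, dtype=float)
    n = np.asarray(n, dtype=float)
    eps = np.diag(D) * lam.sum()  # start from the prior variance (no data seen)
    for _ in range(max_iter):
        new = rhs_eq7(eps, lam, D, n, sigma2)
        if np.max(np.abs(new - eps)) < tol:
            return new
        eps = new
    return eps


if __name__ == "__main__":
    # Hypothetical spectrum: geometric decay, normalised so that tr(Lambda) = 1.
    lam = 0.5 ** np.arange(1, 31)
    lam /= lam.sum()
    n = np.array([25.0, 75.0])  # examples for task 1 and task 2
    for rho2 in (0.0, 0.25, 0.5, 0.75, 1.0):
        rho = np.sqrt(rho2)
        D = np.array([[1.0, rho], [rho, 1.0]])
        print(rho2, solve_eq7(lam, D, n, sigma2=0.05))
```

Starting from the prior variance is a design choice rather than a requirement: the right-hand side of (7) grows with each $\epsilon_{\tau'}$, so the iterates decrease monotonically from that starting point. Other root finders could be substituted.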
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say $\tau = 1$) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \ldots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^T & E_{11} \end{pmatrix}$$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{n_\tau / (\sigma_\tau^2 + \epsilon_\tau)\}_{\tau = T_0+1, \ldots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^T E_{00}^{-1} E_{01})^{-1} E_{01}^T E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of $N$. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over $i$ gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \ldots, T$) have been learned perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x')(E_{00})^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$, as found earlier.

This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for $T = 2$ tasks.
Here we have the work of Chai [21] as a baseline, and as there we choose

$$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by $\rho$. We fix $\pi_2 = n_2/n$ and plot learning curves against $n$. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n\pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n\pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2/(2l^2)]$ with lengthscale $l$, and one-dimensional inputs $x$ with a Gaussian distribution $N(0, 1/12)$. The kernel eigenvalues $\lambda_i$ are known explicitly from [22] and decay exponentially with $i$.

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale $l = 0.01$, noise level $\sigma^2 = 0.05$ and a fraction $\pi_2 = 0.75$ of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation $\rho^2 = 0, 0.25, 0.5, 0.75, 1$ from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the $\rho < 1$ curves towards the one for $\rho = 0$; this effect is strongest for learning of smooth functions (left and middle).

Figure 1(left) compares numerically simulated learning curves with the predictions for $\epsilon_1$, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for $\rho^2 = 0, 0.25, 0.5, 0.75, 1$.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored ($\rho = 0$) or effectively treated as being from task 1 ($\rho = 1$). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with $\rho^2$. The curves for the different $\rho^2$ values are fairly evenly spaced vertically for small number of examples, $n$, corresponding to a linear dependence on $\rho^2$. As $n$ increases, however, the learning curves for $\rho < 1$ start to bunch together and separate from the one for the fully correlated case ($\rho = 1$). The approximation (7) correctly captures this behaviour, which is discussed in more detail below.

Figure 1(middle) has analogous results for the case of inputs $x$ uniformly distributed on the interval $[0, 1]$; the $\lambda_i$ here decay exponentially with $i^2$ [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance $\langle V_\tau(x) \rangle$; but for a uniform input distribution this variation will be weak except near the ends of the input range $[0, 1]$. Figure 1(right) displays similar results for an OU kernel $C(x, x') = \exp(-|x - x'|/l)$, showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples $n$ is large, the learning curves seem to bunch towards the curve for $\rho = 0$, where task 2 examples are ignored, except when the two tasks are fully correlated ($\rho = 1$). We now study this effect.

When the number of examples for all tasks becomes large, the Bayes errors $\epsilon_\tau$ will become small and eventually be negligible compared to the noise variances $\sigma_\tau^2$ in (7).
One then has an explicit prediction for each $\epsilon_\tau$, without solving $T$ self-consistency equations. If we write, for $T$ tasks, $n_\tau = n\pi_\tau$ with $\pi_\tau$ the fraction of examples for task $\tau$, and set $\gamma_\tau = \pi_\tau/\sigma_\tau^2$, then for large $n$

$$\epsilon_\tau = \sum_i \big( \lambda_i^{-1} D^{-1} + n\Gamma \big)^{-1}_{\tau\tau} = \sum_i \big( \Gamma^{-1/2} [\lambda_i^{-1} (\Gamma^{1/2} D \Gamma^{1/2})^{-1} + nI]^{-1} \Gamma^{-1/2} \big)_{\tau\tau} \tag{8}$$

where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_T)$. Using an eigendecomposition of the symmetric matrix $\Gamma^{1/2} D \Gamma^{1/2} = \sum_{a=1}^T \delta_a v_a v_a^T$, one then shows in a few lines that (8) can be written as

$$\epsilon_\tau \approx \gamma_\tau^{-1} \sum_a (v_{a,\tau})^2\, \delta_a\, g(n\delta_a) \tag{9}$$

where $g(h) = \mathrm{tr}\, (\Lambda^{-1} + h)^{-1} = \sum_i (\lambda_i^{-1} + h)^{-1}$ and $v_{a,\tau}$ is the $\tau$-th component of the $a$-th eigenvector $v_a$. This is the general asymptotic form of our prediction for the average Bayes error for task $\tau$.

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with $n = 500$) vs $\rho^2$. To focus on the error reduction with $\rho$, $r = [\epsilon_1(\rho) - \epsilon_1(1)]/[\epsilon_1(0) - \epsilon_1(1)]$ is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger $n$, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning ($T = 200$, parameters otherwise as in Fig. 1(left) except $\rho^2 = 0.8$). Notice the bend around $\epsilon_1 = 1 - \rho = 0.106$. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order $\tau = 1, \ldots, T$ rather than randomly); dashed line: predictions from (7). Inset: Predictions for $T = 1000$, with asymptotic forms $\epsilon = 1 - \rho + \rho\tilde\epsilon$ and $\epsilon = (1 - \rho)\bar\epsilon$ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order $r$.
The kernel eigenvalues $\lambda_i$ then decay as $i^{-(2r+2)}$ for large $i$ (see footnote 1), and using arguments from [17] one deduces that $g(h) \sim h^{-\alpha}$ for large $h$, with $\alpha = (2r + 1)/(2r + 2)$. In (9) we can then write, for large $n$, $g(n\delta_a) \approx (\delta_a/\gamma_\tau)^{-\alpha} g(n\gamma_\tau)$ and hence

$$\epsilon_\tau \approx g(n\gamma_\tau) \Big\{ \sum_a (v_{a,\tau})^2 (\delta_a/\gamma_\tau)^{1-\alpha} \Big\} \tag{10}$$

When there is only a single task, $\delta_1 = \gamma_1$ and this expression reduces to $\epsilon_1 = g(n\gamma_1) = g(n_1/\sigma_1^2)$. Thus $g(n\gamma_\tau) = g(n_\tau/\sigma_\tau^2)$ is the error we would get by ignoring all examples from tasks other than $\tau$, and the term in $\{\ldots\}$ in (10) gives the "multi-task gain", i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for $n \to \infty$, along with the errors themselves.)

One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to $r \to \infty$ and hence $\alpha \to 1$, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues $\delta_a$ of $\Gamma^{1/2} D \Gamma^{1/2}$ will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against $\rho^2$ for the two-task setting of Fig. 1(left) with $n = 500$. Our predictions capture the nonlinear dependence on $\rho^2$ quite well, though the effect is somewhat weaker in the simulations. For larger $n$ the predictions approach a curve that is constant for $\rho < 1$, signifying negligible improvement from multi-task learning except at $\rho = 1$. It is worth contrasting this with the lower bound from [21], which is linear in $\rho^2$. While this provides a very good approximation to the learning curves for moderate $n$ [21], our results here show that asymptotically this bound can become very loose.

When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in $\rho^2$: see Fig. 2(left, inset) for the OU case, which has $r = 1$.
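The multi-task gain in (10) depends only on $D$, the example fractions $\pi_\tau$, the noise levels and the decay exponent $\alpha$, so it is easy to evaluate directly. The sketch below is illustrative only: the function name `multitask_gain` is hypothetical, $\alpha < 1$ is assumed (so that a zero eigenvalue $\delta_a$ contributes nothing), and the result is meaningful only in the large-$n$ regime where (10) applies.

```python
import numpy as np


def multitask_gain(D, pi, sigma2, alpha):
    """Term in braces in (10) for every task; assumes alpha < 1."""
    D = np.asarray(D, dtype=float)
    gamma = np.asarray(pi, dtype=float) / np.asarray(sigma2, dtype=float)
    root = np.sqrt(gamma)
    S = root[:, None] * D * root[None, :]    # Gamma^{1/2} D Gamma^{1/2}
    delta, V = np.linalg.eigh(S)             # column a of V is the eigenvector v_a
    delta = np.clip(delta, 0.0, None)        # remove tiny negative round-off
    ratio = delta[None, :] / gamma[:, None]  # delta_a / gamma_tau, indexed [tau, a]
    return np.sum(V**2 * ratio ** (1.0 - alpha), axis=1)


if __name__ == "__main__":
    # Two tasks with equal example fractions and equal noise, rough functions.
    for rho in (0.0, 0.5, 0.9, 1.0):
        D = np.array([[1.0, rho], [rho, 1.0]])
        print(rho, multitask_gain(D, pi=[0.5, 0.5], sigma2=0.05, alpha=0.5))
```

For two tasks with equal $\gamma_\tau$ the eigenvalues are $\gamma(1 \pm \rho)$ with equal weights, so the gain reduces to $\frac{1}{2}[(1+\rho)^{1-\alpha} + (1-\rho)^{1-\alpha}]$; this gives a quick check on the numerical output.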
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

Footnote 1: See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, $D_{\tau,\tau'}$ with $\tau \neq \tau'$, are equal to $\rho$, while $D_{\tau,\tau} = 1$. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to $\sqrt{\rho}$ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number $n_\tau = n/T$ of examples for each task, and that all noise levels are the same, $\sigma_\tau^2 = \sigma^2$. Then also all Bayes errors $\epsilon_\tau = \epsilon$ will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

$$\epsilon = g_T(n/(\sigma^2 + \epsilon), \rho) \tag{11}$$

where $g_T(h, \rho)$ is related to the single-task function $g(h)$ from above by

$$g_T(h, \rho) = \frac{T-1}{T} (1 - \rho)\, g(h(1 - \rho)/T) + \Big( \rho + \frac{1 - \rho}{T} \Big)\, g(h[\rho + (1 - \rho)/T]) \tag{12}$$

Now consider the limit $T \to \infty$ of many tasks. If $n$ and hence $h = n/(\sigma^2 + \epsilon)$ is kept fixed, $g_T(h, \rho) \to (1 - \rho) + \rho g(h\rho)$; here we have taken $g(0) = 1$ which corresponds to $\mathrm{tr}\,\Lambda = \langle C(x, x) \rangle_x = 1$ as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form $\epsilon = (1 - \rho) + \rho\tilde\epsilon$, where $\tilde\epsilon$ decays from one to zero with increasing $n$ as for a single task, but with an effective noise level $\tilde\sigma^2 = (1 - \rho + \sigma^2)/\rho$. Remarkably, then, even though here $n/T \to 0$ so that for most tasks no examples have been seen, the Bayes error for each task decreases by "collective learning" to a plateau of height $1 - \rho$. The remaining decay of $\epsilon$ to zero happens only once $n$ becomes of order $T$. Here one can show, by taking $T \to \infty$ at fixed $h/T$ in (12) and inserting into (11), that $\epsilon = (1 - \rho)\bar\epsilon$ where $\bar\epsilon$ again decays as for a single task but with an effective number of examples $\bar n = n/T$ and effective noise level $\bar\sigma^2 = \sigma^2/(1 - \rho)$.
This final stage of learning therefore happens only when each task has seen a considerable number of examples $n/T$. Fig. 2(right) validates these predictions against simulations, for a number of tasks ($T = 200$) that is in the same ballpark as in the many-tasks application example of [24]. The inset for $T = 1000$ shows clearly how the two learning curve stages separate as $T$ becomes larger.

Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order $r$ as before, the function $\bar\epsilon$ from above will decay as $(\bar n/\bar\sigma^2)^{-\alpha}$; since $\epsilon = (1 - \rho)\bar\epsilon$ and $\bar\sigma^2 = \sigma^2/(1 - \rho)$, the Bayes error $\epsilon$ is then proportional to $(1 - \rho)^{1-\alpha}$. This multi-task gain again approaches unity for $\rho < 1$ for smooth functions ($\alpha = (2r + 1)/(2r + 2) \to 1$). Interestingly, for rough functions ($\alpha < 1$), the multi-task gain decreases for small $\rho^2$ as $1 - (1 - \alpha)\sqrt{\rho^2}$ and so always lies below a linear dependence on $\rho^2$ initially. This shows that a linear-in-$\rho^2$ lower error bound cannot generally apply to $T > 2$ tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices $D$. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in $\rho^2$, the square of the inter-task correlation, is applicable only to the two-task setting $T = 2$.

In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations $D$, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations $D$ would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M Alvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.

6 0.10678103 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

7 0.10474052 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

8 0.091410518 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

9 0.065080546 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

10 0.059998792 206 nips-2012-Majorization for CRFs and Latent Likelihoods

11 0.052749157 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

12 0.05005984 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

13 0.04882111 37 nips-2012-Affine Independent Variational Inference

14 0.048624203 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

15 0.047287572 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

16 0.044511661 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

17 0.044165075 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

18 0.043129828 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

19 0.042778458 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

20 0.040857945 11 nips-2012-A Marginalized Particle Gaussian Process Regression

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.038), (2, 0.008), (3, 0.016), (4, -0.103), (5, -0.018), (6, 0.018), (7, 0.074), (8, -0.006), (9, -0.205), (10, -0.195), (11, -0.004), (12, -0.063), (13, 0.068), (14, -0.036), (15, 0.093), (16, -0.038), (17, 0.068), (18, -0.083), (19, -0.026), (20, -0.086), (21, -0.057), (22, 0.042), (23, 0.052), (24, -0.003), (25, 0.038), (26, -0.053), (27, 0.021), (28, -0.072), (29, 0.002), (30, 0.069), (31, -0.008), (32, -0.008), (33, 0.104), (34, -0.0), (35, 0.02), (36, 0.03), (37, 0.032), (38, -0.031), (39, 0.007), (40, -0.077), (41, 0.001), (42, -0.008), (43, 0.019), (44, 0.012), (45, 0.052), (46, -0.023), (47, 0.061), (48, -0.037), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89358103 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-gredilla

2 0.87855303 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

3 0.76953578 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

4 0.73543733 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

5 0.6356349 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

6 0.61069626 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

7 0.57621002 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

8 0.51202875 11 nips-2012-A Marginalized Particle Gaussian Process Regression

9 0.47891471 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

10 0.42954165 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

11 0.35330153 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

12 0.35093266 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

13 0.33440575 37 nips-2012-Affine Independent Variational Inference

14 0.32461879 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

15 0.32275072 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

16 0.32141277 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

17 0.31230652 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

18 0.30772692 211 nips-2012-Meta-Gaussian Information Bottleneck

19 0.30143082 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

20 0.29352793 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.049), (21, 0.014), (38, 0.097), (39, 0.02), (42, 0.032), (54, 0.014), (55, 0.026), (74, 0.03), (76, 0.157), (80, 0.075), (92, 0.031), (98, 0.335)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82966614 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

Author: Sean Gerrish, David M. Blei

Abstract: We develop a probabilistic model of legislative data that uses the text of the bills to uncover lawmakers' positions on specific political issues. Our model can be used to explore how a lawmaker's voting patterns deviate from what is expected and how that deviation depends on what is being voted on. We derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we demonstrate both improvement in held-out predictive performance and the model's utility in interpreting an inherently multi-dimensional space.

2 0.74459893 267 nips-2012-Perceptron Learning of SAT

Author: Alex Flint, Matthew Blaschko

Abstract: Boolean satisfiability (SAT) as a canonical NP-complete decision problem is one of the most important problems in computer science. In practice, real-world SAT sentences are drawn from a distribution that may result in efficient algorithms for their solution. Such SAT instances are likely to have shared characteristics and substructures. This work approaches the exploration of a family of SAT solvers as a learning problem. In particular, we relate polynomial time solvability of a SAT subset to a notion of margin between sentences mapped by a feature function into a Hilbert space. Provided this mapping is based on polynomial time computable statistics of a sentence, we show that the existence of a margin between these data points implies the existence of a polynomial time solver for that SAT subset based on the Davis-Putnam-Logemann-Loveland algorithm. Furthermore, we show that a simple perceptron-style learning rule will find an optimal SAT solver with a bounded number of training updates. We derive a linear time computable set of features and show analytically that margins exist for important polynomial special cases of SAT. Empirical results show an order of magnitude improvement over a state-of-the-art SAT solver on a hardware verification task.

same-paper 3 0.72681183 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-gredilla

4 0.70691466 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.

5 0.6337415 180 nips-2012-Learning Mixtures of Tree Graphical Models

Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of p-variate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.

6 0.62735486 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

7 0.53205538 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.53003854 294 nips-2012-Repulsive Mixtures

9 0.52927125 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

10 0.52899319 188 nips-2012-Learning from Distributions via Support Measure Machines

11 0.52874202 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

12 0.52806753 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

13 0.52798843 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

14 0.52754974 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

15 0.5275206 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

16 0.52693015 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

17 0.52666932 126 nips-2012-FastEx: Hash Clustering with Exponential Families

18 0.52642143 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

19 0.52605671 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.52524567 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification