
272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Source: pdf

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. [sent-14, score-0.153]

2 Another, more flexible take on this issue is to view the optimization of such parameters as a procedure to be automated. [sent-17, score-0.072]

3 Specifically, we could view such tuning as the optimization of an unknown black-box function and invoke algorithms developed for such problems. [sent-18, score-0.097]

4 A good choice is Bayesian optimization [1], which has been shown to outperform other state-of-the-art global optimization algorithms on a number of challenging optimization benchmark functions [2]. [sent-19, score-0.31]

5 To pick the hyperparameters of the next experiment, one can optimize the expected improvement (EI) [1] over the current best result or the Gaussian process upper confidence bound (UCB) [3]. [sent-21, score-0.336]

6 EI and UCB have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal black-box functions [4, 3]. [sent-22, score-0.228]

7 Machine learning algorithms, however, have certain characteristics that distinguish them from other black-box optimization problems. [sent-23, score-0.072]

8 In both situations, the standard sequential approach of GP optimization can be suboptimal. [sent-27, score-0.072]

9 In this work, we identify good practices for Bayesian optimization of machine learning algorithms. [sent-28, score-0.096]

10 We argue that a fully Bayesian treatment of the underlying GP kernel is preferred to the approach based on optimization of the GP hyperparameters, as previously proposed [5]. [sent-29, score-0.12]

11 [7] have developed sequential model-based optimization strategies for the configuration of satisfiability and mixed integer programming solvers using random forests. [sent-34, score-0.117]

12 The machine learning algorithms we consider, however, warrant a fully Bayesian treatment as their expensive nature necessitates minimizing the number of evaluations. [sent-35, score-0.1]

13 Bayesian optimization strategies have also been used to tune the parameters of Markov chain Monte Carlo algorithms [8]. [sent-36, score-0.117]

14 [5] have explored various strategies for optimizing the hyperparameters of machine learning algorithms. [sent-38, score-0.284]

15 They demonstrated that grid search strategies are inferior to random search [9], and suggested the use of Gaussian process Bayesian optimization, optimizing the hyperparameters of a squared-exponential covariance, and proposed the Tree Parzen Algorithm. [sent-39, score-0.506]

16 2 Bayesian Optimization with Gaussian Process Priors. As in other kinds of optimization, in Bayesian optimization we are interested in finding the minimum of a function $f(x)$ on some bounded set $\mathcal{X}$, which we will take to be a subset of $\mathbb{R}^D$. [sent-40, score-0.072]

17 What makes Bayesian optimization different from other procedures is that it constructs a probabilistic model for f (x) and then exploits this model to make decisions about where in X to next evaluate the function, while integrating out uncertainty. [sent-41, score-0.096]

18 The essential philosophy is to use all of the information available from previous evaluations of f (x) and not simply rely on local gradient and Hessian approximations. [sent-42, score-0.175]

19 When evaluations of f (x) are expensive to perform — as is the case when it requires training a machine learning algorithm — then it is easy to justify some extra computation to make better decisions. [sent-44, score-0.199]

20 For an overview of the Bayesian optimization formalism and a review of previous work, see, e. [sent-45, score-0.072]

21 In this section we briefly review the general Bayesian optimization approach, before discussing our novel contributions in Section 3. [sent-49, score-0.072]

22 Second, we must choose an acquisition function, which is used to construct a utility function from the model posterior, allowing us to determine the next point to evaluate. [sent-53, score-0.135]

23 The support and properties of the resulting distribution on functions are determined by a mean function $m : \mathcal{X} \to \mathbb{R}$ and a positive definite covariance function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. [sent-58, score-0.099]

24 We will discuss the impact of covariance functions in Section 3. [sent-59, score-0.099]

25 2 Acquisition Functions for Bayesian Optimization. We assume that the function $f(x)$ is drawn from a Gaussian process prior and that our observations are of the form $\{x_n, y_n\}_{n=1}^N$, where $y_n \sim \mathcal{N}(f(x_n), \nu)$ and $\nu$ is the variance of noise introduced into the function observations. [sent-63, score-0.257]

26 This prior and these data induce a posterior over functions; the acquisition function, which we denote by $a : \mathcal{X} \to \mathbb{R}^+$, determines what point in $\mathcal{X}$ should be evaluated next via a proxy optimization $x_{\text{next}} = \operatorname{argmax}_x a(x)$, where several different functions have been proposed. [sent-64, score-0.321]

27 In general, these acquisition functions depend on the previous observations, as well as the GP hyperparameters; we denote this dependence as $a(x ; \{x_n, y_n\}, \theta)$. [sent-65, score-0.283]

28 Under the Gaussian process prior, these functions depend on the model solely through its predictive mean function $\mu(x ; \{x_n, y_n\}, \theta)$ and predictive variance function $\sigma^2(x ; \{x_n, y_n\}, \theta)$. [sent-67, score-0.291]

29 Under the GP this can be computed analytically as $a_{\mathrm{PI}}(x ; \{x_n, y_n\}, \theta) = \Phi(\gamma(x))$, $\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x ; \{x_n, y_n\}, \theta)}{\sigma(x ; \{x_n, y_n\}, \theta)}$. (1) [sent-70, score-0.228]

30 Expected Improvement. Alternatively, one could choose to maximize the expected improvement (EI) over the current best. [sent-71, score-0.217]

31 These acquisition functions have the form $a_{\mathrm{LCB}}(x ; \{x_n, y_n\}, \theta) = \mu(x ; \{x_n, y_n\}, \theta) - \kappa\, \sigma(x ; \{x_n, y_n\}, \theta)$, (3) with a tunable $\kappa$ to balance exploitation against exploration. [sent-73, score-0.511]
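
As a hedged illustration of the three acquisition functions named above (probability of improvement, expected improvement, and the lower confidence bound), the following Python sketch evaluates them from a predictive mean and standard deviation at a single point. The function names are hypothetical and minimization is assumed. The closed form used for EI, sigma * (gamma * Phi(gamma) + N(gamma; 0, 1)), is the standard expression for a Gaussian predictive distribution; it is not quoted in the sentences listed here, so treat it as an assumption about the paper's Eq. (2).

```python
import math

def norm_pdf(z):
    """Standard normal density N(z; 0, 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gamma(mu, sigma, f_best):
    """gamma(x) = (f(x_best) - mu(x)) / sigma(x), as in Eq. (1)."""
    return (f_best - mu) / sigma

def probability_of_improvement(mu, sigma, f_best):
    """a_PI = Phi(gamma(x))."""
    return norm_cdf(gamma(mu, sigma, f_best))

def expected_improvement(mu, sigma, f_best):
    """Assumed standard closed form for EI when minimizing."""
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm_cdf(g) + norm_pdf(g))

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """a_LCB = mu - kappa * sigma, as in Eq. (3); kappa=2.0 is an arbitrary default."""
    return mu - kappa * sigma
```

A point whose predictive mean equals the current best has a probability of improvement of exactly one half, while its expected improvement grows with the predictive standard deviation, which is how EI rewards exploration.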

32 3 Practical Considerations for Bayesian Optimization of Hyperparameters. Although Bayesian optimization is an elegant framework for optimizing expensive functions, several limitations have prevented it from becoming a widely used technique for optimizing hyperparameters in machine learning problems. [sent-78, score-0.298]

33 Second, as the function evaluation itself may involve a time-consuming optimization procedure, problems may vary significantly in duration and this should be taken into account. [sent-80, score-0.113]

34 Third, optimization algorithms should take advantage of multi-core parallelism in order to map well onto modern computational environments. [sent-81, score-0.092]

35 1 Covariance Functions and Treatment of Covariance Hyperparameters The power of the Gaussian process to express a rich distribution on functions rests solely on the shoulders of the covariance function. [sent-84, score-0.128]

36 While non-degenerate covariance functions correspond to infinite bases, they nevertheless can correspond to strong assumptions regarding likely functions. [sent-85, score-0.099]

37 However, sample functions with this covariance function are unrealistically smooth for practical optimization problems. [sent-88, score-0.171]

38 (b) Three expected improvement acquisition functions, with the same data and hyperparameters. [sent-92, score-0.238]

39 Figure 2: Illustration of the acquisition with pending evaluations. [sent-95, score-0.233]

40 (a) Three data have been observed and three posterior functions are shown, with “fantasies” for three pending evaluations. [sent-96, score-0.157]

41 (b) Expected improvement, conditioned on each joint fantasy of the pending outcome. [sent-97, score-0.13]

42 (c) Expected improvement after integrating over the fantasy outcomes. [sent-98, score-0.127]

43 This covariance function results in sample functions which are twice-differentiable, an assumption that corresponds to those made by, e. [sent-99, score-0.099]

44 After choosing the form of the covariance, we must also manage the hyperparameters that govern its behavior. (Note that these “hyperparameters” are distinct from those being subjected to the overall Bayesian optimization.) [sent-102, score-0.204]

45 For our problems of interest, typically we would have $D + 3$ Gaussian process hyperparameters: $D$ length scales $\theta_{1:D}$, the covariance amplitude $\theta_0$, the observation noise $\nu$, and a constant mean $m$. [sent-104, score-0.094]

46 We can therefore blend acquisition functions arising from samples from the posterior over GP hyperparameters and have a Monte Carlo estimate of the integrated expected improvement. [sent-108, score-0.463]
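
A minimal sketch of the Monte Carlo blend described in sentence 46, assuming the GP hyperparameter samples have already been drawn (for example by MCMC) and that each sample yields a predictive mean and standard deviation at the candidate point. The function names and the `predictions` structure are hypothetical; the EI closed form is the standard one for minimization.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Standard closed-form EI for minimization under a Gaussian prediction."""
    g = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return sigma * (g * cdf + pdf)

def integrated_expected_improvement(predictions, f_best):
    """Average EI over hyperparameter samples.

    predictions: list of (mu, sigma) pairs, one pair per sampled
    hyperparameter setting theta, all evaluated at the same candidate x.
    """
    values = [expected_improvement(mu, sigma, f_best) for mu, sigma in predictions]
    return sum(values) / len(values)
```

Averaging the acquisition values, rather than averaging the predictions first, keeps the disagreement between hyperparameter samples visible in the final score.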

47 Figure 1 shows how the integrated expected improvement changes the acquisition function. [sent-111, score-0.136]

48 2 Modeling Costs. Ultimately, the objective of Bayesian optimization is to find a good setting of our hyperparameters as quickly as possible. [sent-113, score-0.276]

49 Greedy acquisition procedures such as expected improvement try to make the best progress possible in the next function evaluation. [sent-114, score-0.238]

50 From a practical point of view, however, we are not so concerned with function evaluations as with wallclock time. [sent-115, score-0.207]

51 To improve our performance in terms of wallclock time, we propose optimizing with the expected improvement per second, which prefers to acquire points that are not only likely to be good, but that are also likely to be evaluated quickly. [sent-117, score-0.222]

52 Under the independence assumption, we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x. [sent-124, score-0.228]
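
The computation in sentence 52 can be sketched as follows, under an assumption that is not stated in the sentences listed here: the logarithm of the duration at x is modeled with its own Gaussian predictive distribution (mean `log_mean`, variance `log_var`), independent of the objective. Under that assumption the expected inverse duration has the log-normal closed form exp(-mean + variance / 2). All names are hypothetical.

```python
import math

def expected_inverse_duration(log_mean, log_var):
    """E[1 / c] when ln c ~ N(log_mean, log_var) (assumed duration model)."""
    return math.exp(-log_mean + 0.5 * log_var)

def ei_per_second(ei, log_mean, log_var):
    """Expected improvement per second under the independence assumption.

    ei: expected improvement at x, computed separately from the objective GP.
    """
    return ei * expected_inverse_duration(log_mean, log_var)
```

For a point predicted to take exactly two seconds (zero predictive variance), the score is half the plain EI, so cheap candidates are preferred when their expected improvement is comparable.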

53 3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization With the advent of multi-core computing, it is natural to ask how we can parallelize our Bayesian optimization procedures. [sent-126, score-0.096]

54 Clearly, we cannot use the same acquisition function again, or we will repeat one of the pending experiments. [sent-128, score-0.233]

55 Ideally, we could perform a roll-out of our acquisition policy, to choose a point that appropriately balanced information gain and exploitation. [sent-129, score-0.135]

56 Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisition function under different possible results from pending function evaluations. [sent-131, score-0.127]

57 Consider the situation in which $N$ evaluations have completed, yielding data $\{x_n, y_n\}_{n=1}^N$, and in which $J$ evaluations are pending at locations $\{x_j\}_{j=1}^J$. [sent-132, score-0.562]

58 Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations: $\hat{a}(x ; \{x_n, y_n\}, \theta, \{x_j\}) = \int_{\mathbb{R}^J} a(x ; \{x_n, y_n\}, \theta, \{x_j, y_j\})\, p(\{y_j\}_{j=1}^J \mid \{x_j\}_{j=1}^J, \{x_n, y_n\}_{n=1}^N)\, dy_1 \cdots dy_J$. [sent-133, score-0.607]

59 As in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition and use this to select the next point. [sent-135, score-0.258]
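
The fantasy-based estimate in sentences 56 to 59 can be sketched in pure Python for one-dimensional inputs. Everything here is an illustrative assumption rather than the authors' implementation: the squared-exponential kernel with unit length scale (chosen only for brevity), the fixed hyperparameters, the small noise and jitter terms, the function names, and the choice to let the current best include the fantasized outcomes.

```python
import math
import random

def kernel(a, b):
    """Squared-exponential covariance with unit amplitude and length scale."""
    return math.exp(-0.5 * (a - b) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, query, noise=1e-6):
    """Joint posterior mean vector and covariance matrix at the query points."""
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_chol(L, ys)
    Ks = [[kernel(q, a) for a in xs] for q in query]
    mean = [sum(k * a for k, a in zip(row, alpha)) for row in Ks]
    V = [solve_chol(L, row) for row in Ks]  # V[j] = K^{-1} k(query_j)
    cov = [[kernel(q, r) - sum(k * v for k, v in zip(Ks[i], V[j]))
            for j, r in enumerate(query)] for i, q in enumerate(query)]
    return mean, cov

def expected_improvement(mu, sigma, f_best):
    """Standard closed-form EI for minimization."""
    g = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return sigma * (g * cdf + pdf)

def mc_expected_improvement(x, xs, ys, pending, n_fantasies=200, seed=0, jitter=1e-6):
    """Average EI at x over joint fantasies of the pending outcomes."""
    rng = random.Random(seed)
    mean_p, cov_p = gp_posterior(xs, ys, pending)
    for i in range(len(pending)):
        cov_p[i][i] += jitter  # keep the Cholesky factorization stable
    Lp = cholesky(cov_p)
    total = 0.0
    for _ in range(n_fantasies):
        z = [rng.gauss(0.0, 1.0) for _ in pending]
        # One joint sample of the pending outcomes from the GP posterior.
        y_f = [m + sum(Lp[i][k] * z[k] for k in range(i + 1))
               for i, m in enumerate(mean_p)]
        xs_aug, ys_aug = list(xs) + list(pending), list(ys) + y_f
        mu, cov = gp_posterior(xs_aug, ys_aug, [x])
        sigma = math.sqrt(max(cov[0][0], 1e-12))
        # Assumption: the incumbent includes the fantasized values.
        total += expected_improvement(mu[0], sigma, min(ys_aug))
    return total / n_fantasies
```

Because each fantasy pins down the function at the pending locations, the averaged EI collapses near those locations, which discourages repeating an experiment that is already running. With an empty `pending` list the estimate reduces to ordinary EI.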

60 We refer to our method of expected improvement while marginalizing GP hyperparameters as “GP EI MCMC”, optimizing hyperparameters as “GP EI Opt”, EI per second as “GP EI per Second”, and N times parallelized GP EI MCMC as “N x GP EI MCMC”. [sent-140, score-0.705]

61 Each results figure plots the progression of $\min_{x_n} f(x_n)$ over the number of function evaluations or time, averaged over multiple runs of each algorithm. [sent-141, score-0.175]
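
The quantity plotted in those figures is a running minimum. As a small sketch (the helper name `best_so_far` is hypothetical), it can be computed from the sequence of observed objective values like this:

```python
from itertools import accumulate

def best_so_far(values):
    """Running minimum of the observed objective values, one entry per evaluation."""
    return list(accumulate(values, min))

# Example trace of five evaluations.
trace = best_so_far([3.0, 1.0, 2.0, 0.5, 4.0])
```

Averaging such traces element-wise over repeated runs gives the curves described in sentence 61.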

62 If not specified otherwise, $x_{\text{next}} = \operatorname{argmax}_x a(x)$ is computed using gradient-based search with multiple restarts (see supplementary material for details). [sent-142, score-0.098]

63 Figure 3: Comparisons on the Branin-Hoo function (3a) and training logistic regression on MNIST (3b). [Panels (a), (b), (c); x-axes: function evaluations (a, b) and minutes (c).] [sent-169, score-0.2]

64 optimization techniques [2] that is defined over $x \in \mathbb{R}^2$ where $0 \le x_1 \le 15$ and $-5 \le x_2 \le 15$. [sent-172, score-0.072]
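
For readers who want to reproduce this benchmark, here is a sketch of the Branin-Hoo function in its commonly used parameterization. The formula and constants are not given in the sentences listed here, so they are an assumption taken from the standard definition of the benchmark; the function name is hypothetical.

```python
import math

def branin_hoo(x1, x2):
    """Branin-Hoo test function with the commonly used constants (assumed)."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s
```

With these constants the quadratic term vanishes at (pi, 2.275), leaving a value of 10 / (8 * pi), roughly 0.3979, which is the known global minimum value of this parameterization.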

65 On Branin-Hoo, integrating over hyperparameters is superior to using a point estimate and the GP EI significantly outperforms TPA, finding the minimum in less than half as many evaluations, in both cases. [sent-177, score-0.228]

66 For logistic regression, 3b and 3c show that although EI per second is less efficient in function evaluations it outperforms standard EI in time. [sent-178, score-0.252]

67 [17] relied on an exhaustive grid search of size 6 × 6 × 8, for a total of 288 hyperparameter configurations. [sent-186, score-0.176]

68 [17], we used a lower bound on the per word perplexity of the validation set documents as the performance measure. [sent-192, score-0.073]

69 One must also specify the number of topics and the hyperparameters η for the symmetric Dirichlet prior over the topic distributions and α for the symmetric Dirichlet prior over the per document topic mixing weights. [sent-193, score-0.256]

70 01 in our experiments in order to emulate their analysis and repeated exactly the grid search reported in the paper. [sent-196, score-0.198]

71 Each online LDA evaluation generally took between five and ten hours to converge, thus the grid search requires approximately 60 to 120 processor days to complete. [sent-197, score-0.197]
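
The cost figures in sentences 67 and 71 follow from simple arithmetic, reproduced here as a short check. The variable names are hypothetical; the grid size and the per-evaluation hours are the values quoted above.

```python
# Grid of size 6 x 6 x 8, as stated in sentence 67.
grid_shape = (6, 6, 8)
n_configs = 1
for size in grid_shape:
    n_configs *= size

# Each evaluation takes between five and ten hours (sentence 71).
hours_low, hours_high = 5, 10
days_low = n_configs * hours_low / 24
days_high = n_configs * hours_high / 24
```

This gives 288 configurations and a total of 60 to 120 processor days if the evaluations are run one after another.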

75 Figure 5: A comparison of various strategies for optimizing the hyperparameters of M3E models on the protein motif finding task in terms of walltime (5a), function evaluations (5b) and different covariance functions (5c). [Legend: GP EI MCMC, GP EI per Second, 3x GP EI MCMC, 3x GP EI per Second, Random Grid Search; y-axis: min function value.] [sent-221, score-0.783]

76 In Figures 4a and 4b we compare our various strategies of optimization over the same grid on this expensive problem. [sent-222, score-0.248]

77 That is, the algorithms were restricted to only the exact parameter settings as evaluated by the grid search. [sent-223, score-0.107]

78 Each optimization was then repeated 100 times (each time picking two different random experiments to initialize the optimization with) and the mean and standard error are reported. [sent-224, score-0.166]

79 Figure 4c also presents a 5 run average of optimization with 3 and 5 times parallelized GP EI MCMC, but without restricting the new parameter setting to be on the pre-specified grid (see supplementary material for details). [sent-225, score-0.234]

80 Clearly integrating over hyperparameters is superior to using a point estimate in this case. [sent-227, score-0.228]

81 Finally, in Figure 4c we see that the parallelized GP EI MCMC algorithms find a significantly better minimum value than was found in the grid search used by Hoffman et al. [sent-229, score-0.233]

82 Setting the hyperparameters, such as the regularisation term, C, of structured SVMs remains a challenge and these are typically set through a time consuming grid search procedure as is done in [18, 19]. [sent-236, score-0.197]

83 We ran a grid search over the 1400 possible combinations of these parameters, evaluating each over 5 random 50-50 training and test splits. [sent-247, score-0.15]

84 In Figures 5a and 5b, we compare the randomized grid search to GP EI MCMC, GP EI per Second and their 3x parallelized versions, all constrained to the same points on the grid. [sent-248, score-0.257]

85 We observe that the Bayesian optimization strategies are considerably more efficient than grid search, which is the status quo. [sent-250, score-0.267]

86 In this case, GP EI MCMC is superior to GP EI per Second in terms of function evaluations but GP EI per Second finds better parameters faster than GP EI MCMC as it learns to use a less strict convergence tolerance early on while exploring the other parameters. [sent-251, score-0.297]

87 Indeed, 3x GP EI per Second is the least efficient in terms of function evaluations but finds better parameters faster than all the other algorithms. [sent-252, score-0.227]

88 Figure 5c compares the use of various covariance functions in GP EI MCMC optimization on this problem, again repeating the optimization 100 times. [sent-253, score-0.243]

89 Figure 6: Validation error on the CIFAR-10 data for different optimization strategies. [Panel x-axes: function evaluations; time (hours).] [sent-265, score-0.247]

90 Multi-layer convolutional neural networks are an example of such a model for which a thorough exploration of architectures and hyperparameters is beneficial, as demonstrated in Saxe et al. [sent-270, score-0.296]

91 In this empirical analysis, we tune nine hyperparameters of a three-layer convolutional network [22] on the CIFAR-10 benchmark dataset using the code provided. [sent-274, score-0.276]

92 This model has been carefully tuned by a human expert [22] to achieve a highly competitive result of 18% test error on the unaugmented data, which matches the published state of the art result [23] on CIFAR-10. [sent-275, score-0.107]

93 The best hyperparameters found by the GP EI MCMC approach achieve an error on the test set of 14.98%, which is over 3% better than the expert and the state of the art on CIFAR-10. [sent-279, score-0.204]

95 5 Conclusion We presented methods for performing Bayesian optimization for hyperparameter selection of general machine learning algorithms. [sent-285, score-0.098]

96 The resulting Bayesian optimization finds better hyperparameters significantly faster than the approaches used by the authors and surpasses a human expert at selecting hyperparameters on the competitive CIFAR-10 dataset, beating the state of the art by over 3%. [sent-288, score-0.564]

97 A taxonomy of global optimization methods based on response surfaces. [sent-301, score-0.091]

98 Gaussian process optimization in the bandit setting: No regret and experimental design. [sent-304, score-0.12]

99 A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. [sent-331, score-0.096]

100 Dealing with asynchronicity in parallel Gaussian process based global optimization. [sent-361, score-0.071]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.645), ('ei', 0.498), ('hyperparameters', 0.204), ('mcmc', 0.196), ('evaluations', 0.175), ('acquisition', 0.135), ('yn', 0.114), ('grid', 0.107), ('pending', 0.098), ('xn', 0.091), ('optimization', 0.072), ('bayesian', 0.071), ('improvement', 0.071), ('hoffman', 0.07), ('opt', 0.066), ('covariance', 0.065), ('tpa', 0.064), ('ard', 0.055), ('parallelized', 0.055), ('per', 0.052), ('bergstra', 0.049), ('treatment', 0.048), ('fantasies', 0.048), ('lda', 0.047), ('strategies', 0.045), ('expert', 0.043), ('gaussian', 0.043), ('search', 0.043), ('art', 0.041), ('duration', 0.041), ('saxe', 0.036), ('optimizing', 0.035), ('parzen', 0.035), ('miller', 0.034), ('functions', 0.034), ('integrated', 0.033), ('ucb', 0.033), ('convolutional', 0.032), ('expected', 0.032), ('architechtures', 0.032), ('fantasy', 0.032), ('ginsbourger', 0.032), ('sqexp', 0.032), ('wallclock', 0.032), ('walltime', 0.032), ('xbest', 0.032), ('xnext', 0.032), ('svms', 0.029), ('process', 0.029), ('latent', 0.028), ('warrant', 0.028), ('pawan', 0.028), ('jasper', 0.028), ('hutter', 0.028), ('carlo', 0.028), ('cores', 0.028), ('et', 0.028), ('hours', 0.027), ('monte', 0.026), ('nando', 0.026), ('brochu', 0.026), ('matern', 0.026), ('minibatch', 0.026), ('regularisation', 0.026), ('emulate', 0.026), ('motif', 0.026), ('protein', 0.026), ('hyperparameter', 0.026), ('dirichlet', 0.026), ('tuning', 0.025), ('posterior', 0.025), ('logistic', 0.025), ('integrating', 0.024), ('advent', 0.024), ('practices', 0.024), ('expensive', 0.024), ('parallel', 0.023), ('argmaxx', 0.023), ('published', 0.023), ('packer', 0.022), ('yoshua', 0.022), ('repeated', 0.022), ('figures', 0.022), ('kumar', 0.021), ('nine', 0.021), ('validation', 0.021), ('structured', 0.021), ('nds', 0.02), ('online', 0.02), ('parallelism', 0.02), ('ryan', 0.02), ('regret', 0.019), ('global', 0.019), ('daphne', 0.019), ('amazon', 0.019), ('code', 0.019), ('dence', 0.018), ('calibration', 0.018), ('tolerance', 0.018), ('matthias', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

2 0.37204182 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

Abstract: Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

3 0.29977727 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

4 0.23227152 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form intertask covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T. We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks. 1 Introduction and motivation. Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9].
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions.
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently. 2 GP regression and Bayes error. We consider GP regression for $T$ functions $f_\tau(x)$, $\tau = 1, 2, \ldots, T$. These functions have to be learned from $n$ training examples $(x_\ell, \tau_\ell, y_\ell)$, $\ell = 1, \ldots, n$. Here $x_\ell$ is the training input, $\tau_\ell \in \{1, \ldots, T\}$ denotes which task the example relates to, and $y_\ell$ is the corresponding training output. We assume that the latter is given by the target function value $f_{\tau_\ell}(x_\ell)$ corrupted by i.i.d. additive Gaussian noise with zero mean and variance $\sigma_{\tau_\ell}^2$. This setup allows the noise level $\sigma_\tau^2$ to depend on the task. In GP regression the prior over the functions $f_\tau(x)$ is a Gaussian process. This means that for any set of inputs $x_\ell$ and task labels $\tau_\ell$, the function values $\{f_{\tau_\ell}(x_\ell)\}$ have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances $\langle f_\tau(x) f_{\tau'}(x') \rangle = C(\tau, x, \tau', x')$. For this covariance we take the flexible form from [5], $\langle f_\tau(x) f_{\tau'}(x') \rangle = D_{\tau\tau'} C(x, x')$. Here $C(x, x')$ determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix $D$ is a free-form inter-task covariance matrix.
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point $x$ for task $\tau$, one would predict as output the mean of $f_\tau(x)$ over the (Gaussian) posterior, which is $y^T K^{-1} k_\tau(x)$. Here $K$ is the $n \times n$ Gram matrix with entries $K_{\ell m} = D_{\tau_\ell \tau_m} C(x_\ell, x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m}$, while $k_\tau(x)$ is a vector with the $n$ entries $k_{\tau,\ell} = D_{\tau_\ell \tau} C(x_\ell, x)$. The error bar would be taken as the square root of the posterior variance of $f_\tau(x)$, which is $V_\tau(x) = D_{\tau\tau} C(x, x) - k_\tau^T(x) K^{-1} k_\tau(x)$ (1). The learning curve for task $\tau$ is defined as the mean-squared prediction error, averaged over the location of test input $x$ and over all data sets with a specified number of examples for each task, say $n_1$ for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs $y_\ell$ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error $\hat{\epsilon}_\tau$ is the Bayes error, and is given by the average posterior variance [1], i.e. $\hat{\epsilon}_\tau = \langle V_\tau(x) \rangle_x$. To obtain the learning curve this is averaged over the location of the training inputs $x_\ell$: $\epsilon_\tau = \langle \hat{\epsilon}_\tau \rangle$. This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in $V_\tau(x)$. Note that the training outputs, on the other hand, do not appear in the posterior variance $V_\tau(x)$ and so do not need to be averaged over. We now want to write the Bayes error $\hat{\epsilon}_\tau$ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs $x_\ell$, and also the test input $x$, are drawn from the same distribution $P(x)$. One can decompose the input-dependent part of the covariance function into eigenfunctions relative to $P(x)$, according to $C(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$.
The eigenfunctions are defined by the condition $\langle C(x, x') \phi_i(x') \rangle_{x'} = \lambda_i \phi_i(x)$ and can be chosen to be orthonormal with respect to $P(x)$, $\langle \phi_i(x) \phi_j(x) \rangle_x = \delta_{ij}$. The sum over $i$ here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel $C(x, x') = x \cdot x'$). To make the algebra below as simple as possible, we let the eigenvalues $\lambda_i$ be arranged in decreasing order and truncate the sum to the finite range $i = 1, \ldots, M$; $M$ is then some large effective feature space dimension and can be taken to infinity at the end.

In terms of the above eigenfunction decomposition, the Gram matrix has elements

$$K_{\ell m} = D_{\tau_\ell \tau_m} \sum_i \lambda_i \phi_i(x_\ell) \phi_i(x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m} = \sum_{i,\tau,j,\tau'} \delta_{\tau_\ell,\tau} \phi_i(x_\ell) \, \lambda_i \delta_{ij} D_{\tau\tau'} \, \phi_j(x_m) \delta_{\tau',\tau_m} + \sigma_{\tau_\ell}^2 \delta_{\ell m}$$

or in matrix form $K = \Psi L \Psi^T + \Sigma$ where $\Sigma$ is the diagonal matrix from the noise variances and

$$\Psi_{\ell,i\tau} = \delta_{\tau_\ell,\tau} \phi_i(x_\ell), \qquad L_{i\tau,j\tau'} = \lambda_i \delta_{ij} D_{\tau\tau'} \qquad (2)$$

Here $\Psi$ has its second index ranging over $M$ (number of kernel eigenvalues) times $T$ (number of tasks) values; $L$ is a square matrix of this size. In Kronecker (tensor) product notation, $L = D \otimes \Lambda$ if we define $\Lambda$ as the diagonal matrix with entries $\lambda_i \delta_{ij}$. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, and $\mathrm{tr}\,(A \otimes B) = (\mathrm{tr}\,A)(\mathrm{tr}\,B)$. In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, $L$ can then be viewed as consisting of $T \times T$ blocks, each of which is proportional to $\Lambda$.

To calculate the Bayes error, we need to average the posterior variance $V_\tau(x)$ over the test input $x$. The first term in (1) then becomes $D_{\tau\tau} \langle C(x, x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$.
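The Kronecker product rules quoted above, and the block picture of $L = D \otimes \Lambda$, can be checked numerically. The following is a minimal sketch using NumPy; the sizes `T` and `M`, the helper `random_spd` and the random test matrices are illustrative assumptions and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 3, 4  # illustrative sizes: number of tasks, number of eigenvalues kept


def random_spd(k):
    """Hypothetical helper: a random symmetric positive definite k x k matrix."""
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)


D, D2 = random_spd(T), random_spd(T)            # stand-ins for task covariances
Lam = np.diag(rng.uniform(0.1, 1.0, M))         # stand-in for diag(lambda_i)
Lam2 = np.diag(rng.uniform(0.1, 1.0, M))

L = np.kron(D, Lam)  # T x T grid of M x M blocks, block (a, b) equals D[a, b] * Lam

# (A x B)(A' x B') = (A A') x (B B')
mixed_product_ok = np.allclose(np.kron(D, Lam) @ np.kron(D2, Lam2),
                               np.kron(D @ D2, Lam @ Lam2))
# (A x B)^{-1} = A^{-1} x B^{-1}
inverse_ok = np.allclose(np.linalg.inv(L),
                         np.kron(np.linalg.inv(D), np.linalg.inv(Lam)))
# tr(A x B) = tr(A) tr(B)
trace_ok = np.isclose(np.trace(L), np.trace(D) * np.trace(Lam))
# block (0, 1) of L is proportional to Lam
block_ok = np.allclose(L[:M, M:2 * M], D[0, 1] * Lam)

print(mixed_product_ok, inverse_ok, trace_ok, block_ok)
```

With `np.kron(D, Lam)` the task index is the outer (block) index, which matches the picture of $T \times T$ blocks each proportional to $\Lambda$.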
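In the second one, we need to average

$$\langle k_{\tau,\ell}(x) k_{\tau,m}(x) \rangle_x = D_{\tau_\ell \tau} \langle C(x_\ell, x) C(x, x_m) \rangle_x D_{\tau_m \tau} = D_{\tau_\ell \tau} \sum_{ij} \lambda_i \lambda_j \phi_i(x_\ell) \langle \phi_i(x) \phi_j(x) \rangle_x \phi_j(x_m) D_{\tau_m \tau} = \sum_{i,\tau',j,\tau''} D_{\tau\tau'} \Psi_{\ell,i\tau'} \lambda_i \lambda_j \delta_{ij} \Psi_{m,j\tau''} D_{\tau''\tau}$$

In matrix form this is $\langle k_\tau(x) k_\tau^T(x) \rangle_x = \Psi [(D e_\tau e_\tau^T D) \otimes \Lambda^2] \Psi^T = \Psi M_\tau \Psi^T$. Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the $T$-dimensional vector with $\tau$-th component equal to one and all others zero. Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task $\tau$

$$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^T (\Psi L \Psi^T + \Sigma)^{-1}$$

Applying the Woodbury identity and re-arranging yields

$$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau \Psi^T \Sigma^{-1} \Psi (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau L^{-1} [I - (I + L \Psi^T \Sigma^{-1} \Psi)^{-1}]$$

But

$$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\, \{[(D e_\tau e_\tau^T D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\, \{[D e_\tau e_\tau^T] \otimes \Lambda\} = e_\tau^T D e_\tau\, \mathrm{tr}\,\Lambda = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$$\hat\epsilon_\tau = \mathrm{tr}\, M_\tau L^{-1} (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [D \otimes \Lambda]^{-1} [(D e_\tau e_\tau^T D) \otimes \Lambda^2] [D \otimes \Lambda]^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [e_\tau e_\tau^T \otimes I] (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task $\tau$; thought of as a matrix of $T \times T$ blocks (each of size $M \times M$), this has an identity matrix in the $(\tau, \tau)$ block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task $\tau$,

$$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \qquad (3)$$

Because $\Sigma$ is diagonal and given the definition (2) of $\Psi$, the matrix $\Psi^T \Sigma^{-1} \Psi$ is a sum of contributions from the individual training examples $\ell = 1, \ldots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\, (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \ldots \rangle$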
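over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $\mathcal{G} = (L^{-1} + \Psi^T \Sigma^{-1} \Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $G = \langle \mathcal{G} \rangle$ and hence $\epsilon_{\tau'} = \mathrm{tr}\, P_{\tau'} G$ changes with the number $n_\tau$ of training points for task $\tau$. Adding an example at position $x$ for task $\tau$ increases $\Psi^T \Sigma^{-1} \Psi$ by $\sigma_\tau^{-2} \phi_\tau \phi_\tau^T$, where $\phi_\tau$ has elements $(\phi_\tau)_{i\tau'} = \phi_i(x) \delta_{\tau\tau'}$. Evaluating the difference $(\mathcal{G}^{-1} + \sigma_\tau^{-2} \phi_\tau \phi_\tau^T)^{-1} - \mathcal{G}$ with the help of the Woodbury identity and approximating it with a derivative gives

$$\frac{\partial \mathcal{G}}{\partial n_\tau} = -\frac{\mathcal{G} \phi_\tau \phi_\tau^T \mathcal{G}}{\sigma_\tau^2 + \phi_\tau^T \mathcal{G} \phi_\tau}$$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$$\frac{\partial G}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \mathrm{tr}\, P_\tau G} \frac{\partial G}{\partial v_\tau} \qquad (4)$$

Here we have exploited for the average over $x$ that the matrix $\langle \phi_\tau \phi_\tau^T \rangle_x$ has $(i, \tau'), (j, \tau'')$-entry $\langle \phi_i(x) \phi_j(x) \rangle_x \delta_{\tau\tau'} \delta_{\tau\tau''} = \delta_{ij} \delta_{\tau\tau'} \delta_{\tau\tau''}$, hence simply $\langle \phi_\tau \phi_\tau^T \rangle_x = P_\tau$. We have also used the auxiliary parameters to rewrite $-\langle \mathcal{G} P_\tau \mathcal{G} \rangle = \partial \langle \mathcal{G} \rangle / \partial v_\tau = \partial G / \partial v_\tau$. Finally, multiplying (4) by $P_{\tau'}$ and taking the trace gives the set of quasi-linear partial differential equations

$$\frac{\partial \epsilon_{\tau'}}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \epsilon_\tau} \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \qquad (5)$$

The remaining task is now to find the functions $\epsilon_\tau(n_1, \ldots, n_T, v_1, \ldots, v_T)$ by solving these differential equations. We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for $T = 2$ and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a $2T$-dimensional surface in the $3T$-dimensional space $(n_1, \ldots, n_T, v_1, \ldots, v_T, \epsilon_1, \ldots, \epsilon_T)$, which is specified by the $T$ functions $\epsilon_\tau(\ldots)$. A small change $(\delta n_1, \ldots, \delta n_T, \delta v_1, \ldots, \delta v_T, \delta\epsilon_1, \ldots, \delta\epsilon_T)$ in all $3T$ coordinates is tangential to this surface if it obeys the $T$ constraints (one for each $\tau'$)

$$\delta\epsilon_{\tau'} = \sum_\tau \left( \frac{\partial \epsilon_{\tau'}}{\partial n_\tau} \delta n_\tau + \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \delta v_\tau \right)$$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_\tau = -\delta v_\tau (\sigma_\tau^2 + \epsilon_\tau)$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \mathrm{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0}(\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all $t$, so lie within this surface if the initial point at $t = 0$ does. Because at $t = 0$ there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1}$$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at $t = 0$ and $t = 1$

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ gives then

$$\epsilon_\tau = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} P_{\tau'} \right)^{-1} \qquad (6)$$

This is our main result: a closed set of $T$ self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given $L$ as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task. The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of $M \times M$ blocks of size $T \times T$ (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$$\epsilon_\tau = \sum_i \left[ \left( \lambda_i^{-1} D^{-1} + \mathrm{diag}\left( \left\{ \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \right\} \right) \right)^{-1} \right]_{\tau\tau} \qquad (7)$$

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where $D$ is diagonal. Here one recovers $T$ separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].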
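The text states that the self-consistency equations (7) are straightforward to solve numerically but does not say how. One possible approach is a damped fixed-point iteration started from the prior variance; the sketch below assumes this scheme, a truncated spectrum $\lambda_i \propto i^{-2}$ chosen purely for illustration, and an invertible $D$. The function name `solve_bayes_errors` and all parameter names are hypothetical.

```python
import numpy as np


def solve_bayes_errors(lam, D, n, sigma2, tol=1e-12, max_iter=10_000):
    """Solve eq. (7) for the average Bayes errors by damped fixed-point iteration.

    lam    : (M,) kernel eigenvalues (truncated spectrum, an assumption)
    D      : (T, T) inter-task covariance matrix, assumed invertible
    n      : (T,) number of examples per task (non-integer values allowed)
    sigma2 : (T,) noise variances, assumed positive
    """
    lam = np.asarray(lam, dtype=float)
    D = np.asarray(D, dtype=float)
    n = np.asarray(n, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    D_inv = np.linalg.inv(D)
    eps = np.diag(D) * lam.sum()  # value of (7) at n = 0: the prior variance
    for _ in range(max_iter):
        N = np.diag(n / (sigma2 + eps))
        # One T x T block per eigenvalue: (D^{-1} / lambda_i + N)^{-1}
        blocks = np.linalg.inv(D_inv[None, :, :] / lam[:, None, None] + N[None, :, :])
        new_eps = np.einsum("itt->t", blocks)  # sum over i of the diagonal entries
        if np.max(np.abs(new_eps - eps)) < tol:
            return new_eps
        eps = 0.5 * eps + 0.5 * new_eps  # damping factor 0.5 is a heuristic choice
    raise RuntimeError("fixed-point iteration did not converge")


# Illustrative two-task example (all numbers are assumptions, not paper values).
lam = 1.0 / np.arange(1, 101) ** 2
lam /= lam.sum()                      # normalise so that tr(Lambda) = 1
n = np.array([50.0, 150.0])
sigma2 = np.array([0.05, 0.05])
eps_uncorrelated = solve_bayes_errors(lam, np.eye(2), n, sigma2)
eps_correlated = solve_bayes_errors(lam, np.array([[1.0, 0.9], [0.9, 1.0]]), n, sigma2)
print(eps_uncorrelated, eps_correlated)
```

For diagonal $D$ the result satisfies the single-task equation for each task separately, as the text says it should. The sketch does not handle fully correlated tasks, where $D$ is singular; that case would need (7) rewritten without $D^{-1}$.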
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say $\tau = 1$) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \ldots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^T & E_{11} \end{pmatrix}$$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{n_\tau / (\sigma_\tau^2 + \epsilon_\tau)\}_{\tau = T_0+1, \ldots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^T E_{00}^{-1} E_{01})^{-1} E_{01}^T E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of $N$. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over $i$ gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \ldots, T$) have been learned perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x') (E_{00})^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$, as found earlier.

This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for $T = 2$ tasks.
Here we have the work of Chai [21] as a baseline, and as there we choose

$$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by $\rho$. We fix $\pi_2 = n_2/n$ and plot learning curves against $n$. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n \pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n \pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2/(2l^2)]$ with lengthscale $l$, and one-dimensional inputs $x$ with a Gaussian distribution $\mathcal{N}(0, 1/12)$. The kernel eigenvalues $\lambda_i$ are known explicitly from [22] and decay exponentially with $i$.

[Figure 1: three panels (left, middle, right) plotting $\epsilon_1$ against $n$ from 0 to 500, each with a log-log inset.]

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale $l = 0.01$, noise level $\sigma^2 = 0.05$ and a fraction $\pi_2 = 0.75$ of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation $\rho^2 = 0, 0.25, 0.5, 0.75, 1$ from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the $\rho < 1$ curves towards the one for $\rho = 0$; this effect is strongest for learning of smooth functions (left and middle).

Figure 1(left) compares numerically simulated learning curves with the predictions for $\epsilon_1$, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for $\rho^2 = 0, 0.25, 0.5, 0.75, 1$.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored ($\rho = 0$) or effectively treated as being from task 1 ($\rho = 1$). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with $\rho^2$. The curves for the different $\rho^2$ values are fairly evenly spaced vertically for small number of examples, $n$, corresponding to a linear dependence on $\rho^2$. As $n$ increases, however, the learning curves for $\rho < 1$ start to bunch together and separate from the one for the fully correlated case ($\rho = 1$). The approximation (7) correctly captures this behaviour, which is discussed in more detail below.

Figure 1(middle) has analogous results for the case of inputs $x$ uniformly distributed on the interval $[0, 1]$; the $\lambda_i$ here decay exponentially with $i^2$ [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance $\langle V_\tau(x) \rangle$; but for a uniform input distribution this variation will be weak except near the ends of the input range $[0, 1]$. Figure 1(right) displays similar results for an OU kernel $C(x, x') = \exp(-|x - x'|/l)$, showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples $n$ is large, the learning curves seem to bunch towards the curve for $\rho = 0$, where task 2 examples are ignored, except when the two tasks are fully correlated ($\rho = 1$). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors $\epsilon_\tau$ will become small and eventually be negligible compared to the noise variances $\sigma_\tau^2$ in (7).
One then has an explicit prediction for each $\epsilon_\tau$, without solving $T$ self-consistency equations. If we write, for $T$ tasks, $n_\tau = n \pi_\tau$ with $\pi_\tau$ the fraction of examples for task $\tau$, and set $\gamma_\tau = \pi_\tau / \sigma_\tau^2$, then for large $n$

$$\epsilon_\tau = \sum_i \left[ (\lambda_i^{-1} D^{-1} + n\Gamma)^{-1} \right]_{\tau\tau} = \sum_i \left( \Gamma^{-1/2} [\lambda_i^{-1} (\Gamma^{1/2} D \Gamma^{1/2})^{-1} + nI]^{-1} \Gamma^{-1/2} \right)_{\tau\tau} \qquad (8)$$

where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_T)$. Using an eigendecomposition of the symmetric matrix $\Gamma^{1/2} D \Gamma^{1/2} = \sum_{a=1}^T \delta_a v_a v_a^T$, one then shows in a few lines that (8) can be written as

$$\epsilon_\tau \approx \gamma_\tau^{-1} \sum_a (v_{a,\tau})^2\, \delta_a\, g(n \delta_a) \qquad (9)$$

where $g(h) = \mathrm{tr}\,(\Lambda^{-1} + h)^{-1} = \sum_i (\lambda_i^{-1} + h)^{-1}$ and $v_{a,\tau}$ is the $\tau$-th component of the $a$-th eigenvector $v_a$. This is the general asymptotic form of our prediction for the average Bayes error for task $\tau$.

[Figure 2: two panels. Left: $r$ against $\rho^2$, with an inset. Right: $\epsilon$ against $n$ on logarithmic axes, with an inset.]

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with $n = 500$) vs $\rho^2$. To focus on the error reduction with $\rho$, $r = [\epsilon_1(\rho) - \epsilon_1(1)]/[\epsilon_1(0) - \epsilon_1(1)]$ is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger $n$, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning ($T = 200$, parameters otherwise as in Fig. 1(left) except $\rho^2 = 0.8$). Notice the bend around $\epsilon_1 = 1 - \rho = 0.106$. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order $\tau = 1, \ldots, T$ rather than randomly); dashed line: predictions from (7). Inset: Predictions for $T = 1000$, with asymptotic forms $\epsilon = 1 - \rho + \rho\tilde\epsilon$ and $\epsilon = (1 - \rho)\bar\epsilon$ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order $r$.
The kernel eigenvalues $\lambda_i$ then decay as $i^{-(2r+2)}$ for large $i$ (see footnote 1), and using arguments from [17] one deduces that $g(h) \sim h^{-\alpha}$ for large $h$, with $\alpha = (2r + 1)/(2r + 2)$. In (9) we can then write, for large $n$, $g(n\delta_a) \approx (\delta_a/\gamma_\tau)^{-\alpha} g(n\gamma_\tau)$ and hence

$$\epsilon_\tau \approx g(n\gamma_\tau) \left\{ \sum_a (v_{a,\tau})^2 (\delta_a/\gamma_\tau)^{1-\alpha} \right\} \qquad (10)$$

When there is only a single task, $\delta_1 = \gamma_1$ and this expression reduces to $\epsilon_1 = g(n\gamma_1) = g(n_1/\sigma_1^2)$. Thus $g(n\gamma_\tau) = g(n_\tau/\sigma_\tau^2)$ is the error we would get by ignoring all examples from tasks other than $\tau$, and the term in $\{\ldots\}$ in (10) gives the “multi-task gain”, i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for $n \to \infty$, along with the errors themselves.)

One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to $r \to \infty$ and hence $\alpha \to 1$, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues $\delta_a$ of $\Gamma^{1/2} D \Gamma^{1/2}$ will then be zero.

Fig. 2(left) shows this effect in action, plotting Bayes error against $\rho^2$ for the two-task setting of Fig. 1(left) with $n = 500$. Our predictions capture the nonlinear dependence on $\rho^2$ quite well, though the effect is somewhat weaker in the simulations. For larger $n$ the predictions approach a curve that is constant for $\rho < 1$, signifying negligible improvement from multi-task learning except at $\rho = 1$. It is worth contrasting this with the lower bound from [21], which is linear in $\rho^2$. While this provides a very good approximation to the learning curves for moderate $n$ [21], our results here show that asymptotically this bound can become very loose.

When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in $\rho^2$: see Fig. 2(left, inset) for the OU case, which has $r = 1$.
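The multi-task gain, i.e. the factor in braces in (10), can be evaluated directly from an eigendecomposition of $\Gamma^{1/2} D \Gamma^{1/2}$. The sketch below does this with NumPy; the function name `multitask_gain`, its arguments and the example values are illustrative assumptions, with $\alpha$ passed in as a free parameter rather than derived from a kernel.

```python
import numpy as np


def multitask_gain(D, pi, sigma2, alpha, task=0):
    """Factor in braces in eq. (10): sum_a v_{a,task}^2 (delta_a / gamma_task)^(1 - alpha)."""
    D = np.asarray(D, dtype=float)
    gamma = np.asarray(pi, dtype=float) / np.asarray(sigma2, dtype=float)
    root = np.sqrt(gamma)
    A = root[:, None] * D * root[None, :]   # Gamma^{1/2} D Gamma^{1/2}
    delta, V = np.linalg.eigh(A)            # columns of V are the eigenvectors v_a
    delta = np.clip(delta, 0.0, None)       # remove tiny negative round-off values
    weights = V[task, :] ** 2
    return float(np.sum(weights * (delta / gamma[task]) ** (1.0 - alpha)))


def two_task_D(rho):
    return np.array([[1.0, rho], [rho, 1.0]])


pi = [0.25, 0.75]        # fractions of examples per task, as in Fig. 1
sigma2 = [0.05, 0.05]    # equal noise levels
for alpha in (0.5, 0.9, 0.99):
    gains = [multitask_gain(two_task_D(np.sqrt(r2)), pi, sigma2, alpha)
             for r2 in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(alpha, np.round(gains, 3))
```

As $\alpha$ approaches 1 the printed gains for $\rho < 1$ move towards unity, which is the asymptotic uselessness described above. Note that with `alpha` set to exactly 1 NumPy evaluates `0.0 ** 0.0` as 1, so the fully correlated case is not represented correctly at that single value.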
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

Footnote 1: See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, $D_{\tau,\tau'}$ with $\tau \neq \tau'$, are equal to $\rho$, while $D_{\tau,\tau} = 1$. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to $\sqrt{\rho}$ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number $n_\tau = n/T$ of examples for each task, and that all noise levels are the same, $\sigma_\tau^2 = \sigma^2$. Then also all Bayes errors $\epsilon_\tau = \epsilon$ will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

$$\epsilon = g_T(n/(\sigma^2 + \epsilon), \rho) \qquad (11)$$

where $g_T(h, \rho)$ is related to the single-task function $g(h)$ from above by

$$g_T(h, \rho) = \frac{T-1}{T} (1 - \rho)\, g(h(1 - \rho)/T) + \left( \rho + \frac{1-\rho}{T} \right) g(h[\rho + (1 - \rho)/T]) \qquad (12)$$

Now consider the limit $T \to \infty$ of many tasks. If $n$ and hence $h = n/(\sigma^2 + \epsilon)$ is kept fixed, $g_T(h, \rho) \to (1 - \rho) + \rho\, g(h\rho)$; here we have taken $g(0) = 1$ which corresponds to $\mathrm{tr}\,\Lambda = \langle C(x, x) \rangle_x = 1$ as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form $\epsilon = (1 - \rho) + \rho\tilde\epsilon$, where $\tilde\epsilon$ decays from one to zero with increasing $n$ as for a single task, but with an effective noise level $\tilde\sigma^2 = (1 - \rho + \sigma^2)/\rho$. Remarkably, then, even though here $n/T \to 0$ so that for most tasks no examples have been seen, the Bayes error for each task decreases by “collective learning” to a plateau of height $1 - \rho$. The remaining decay of $\epsilon$ to zero happens only once $n$ becomes of order $T$. Here one can show, by taking $T \to \infty$ at fixed $h/T$ in (12) and inserting into (11), that $\epsilon = (1 - \rho)\bar\epsilon$ where $\bar\epsilon$ again decays as for a single task but with an effective number of examples $\bar n = n/T$ and effective noise level $\bar\sigma^2 = \sigma^2/(1 - \rho)$.
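Equations (11) and (12) reduce the many-task prediction to a single scalar equation, which can be solved by bisection. The sketch below assumes a truncated spectrum $\lambda_i \propto i^{-2}$ normalised to $\mathrm{tr}\,\Lambda = 1$ and positive noise variance; the function names `g`, `g_T` and `many_task_error` and the example numbers are illustrative and not taken from the paper.

```python
import numpy as np


def g(h, lam):
    """Single-task function g(h) = sum_i 1 / (1/lambda_i + h)."""
    lam = np.asarray(lam, dtype=float)
    return float(np.sum(1.0 / (1.0 / lam + h)))


def g_T(h, rho, T, lam):
    """Eq. (12) for T tasks with equal inter-task correlation rho."""
    d_small = 1.0 - rho                 # eigenvalue of D with multiplicity T - 1
    d_mean = rho + (1.0 - rho) / T      # (largest eigenvalue of D) / T
    return ((T - 1) / T * d_small * g(h * d_small / T, lam)
            + d_mean * g(h * d_mean, lam))


def many_task_error(n, sigma2, rho, T, lam, tol=1e-12):
    """Solve eq. (11), eps = g_T(n / (sigma2 + eps), rho), by bisection (sigma2 > 0)."""
    lo, hi = 0.0, float(np.sum(lam))    # the error lies between 0 and tr(Lambda)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - g_T(n / (sigma2 + mid), rho, T, lam) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative spectrum and parameters (assumptions, not values from the paper).
lam = 1.0 / np.arange(1, 51) ** 2
lam /= lam.sum()
rho, sigma2, T = np.sqrt(0.8), 0.05, 200
for n in (10, 100, 1000, 10000, 100000):
    print(n, round(many_task_error(float(n), sigma2, rho, T, lam), 4))
```

With these assumed settings the printed errors can be compared against the plateau height $1 - \rho$ discussed above; the exact numbers depend on the assumed spectrum.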
This final stage of learning therefore happens only when each task has seen a considerable number of examples $n/T$. Fig. 2(right) validates these predictions against simulations, for a number of tasks ($T = 200$) that is in the same ballpark as in the many-tasks application example of [24]. The inset for $T = 1000$ shows clearly how the two learning curve stages separate as $T$ becomes larger.

Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order $r$ as before, the function $\bar\epsilon$ from above will decay as $(\bar n/\bar\sigma^2)^{-\alpha}$; since $\epsilon = (1 - \rho)\bar\epsilon$ and $\bar\sigma^2 = \sigma^2/(1 - \rho)$, the Bayes error $\epsilon$ is then proportional to $(1 - \rho)^{1-\alpha}$. This multi-task gain again approaches unity for $\rho < 1$ for smooth functions ($\alpha = (2r + 1)/(2r + 2) \to 1$). Interestingly, for rough functions ($\alpha < 1$), the multi-task gain decreases for small $\rho^2$ as $1 - (1 - \alpha)\sqrt{\rho^2}$ and so always lies below a linear dependence on $\rho^2$ initially. This shows that a linear-in-$\rho^2$ lower error bound cannot generally apply to $T > 2$ tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices $D$. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial “collective learning” is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in $\rho^2$, the square of the inter-task correlation, is applicable only to the two-task setting $T = 2$. In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations $D$, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations $D$ would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M Alvarez and N D Lawrence.
Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds.
Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML’98), pages 233–241. Morgan Kaufmann, 1998.

5 0.21045931 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

Author: Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, Jose M. Hernández-lobato

Abstract: We present a new model based on Gaussian processes (GPs) for learning pairwise preferences expressed by multiple users. Inference is simplified by using a preference kernel for GPs which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

6 0.19839667 55 nips-2012-Bayesian Warped Gaussian Processes

7 0.19008015 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

8 0.18388912 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

9 0.1094766 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

10 0.10260695 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

11 0.096670099 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

12 0.093337037 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

13 0.085398115 11 nips-2012-A Marginalized Particle Gaussian Process Regression

14 0.074058309 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

15 0.066786408 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

16 0.066370144 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning

17 0.060492266 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

18 0.059714902 275 nips-2012-Privacy Aware Learning

19 0.058056597 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

20 0.056944091 148 nips-2012-Hamming Distance Metric Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, 0.033), (2, 0.02), (3, 0.037), (4, -0.17), (5, -0.044), (6, 0.03), (7, 0.087), (8, -0.007), (9, -0.376), (10, -0.258), (11, -0.004), (12, -0.112), (13, 0.104), (14, -0.125), (15, 0.209), (16, -0.097), (17, 0.083), (18, -0.081), (19, -0.049), (20, -0.131), (21, -0.091), (22, 0.016), (23, 0.031), (24, 0.062), (25, 0.07), (26, -0.111), (27, 0.013), (28, -0.056), (29, -0.001), (30, 0.184), (31, 0.037), (32, -0.073), (33, 0.14), (34, 0.066), (35, -0.01), (36, 0.04), (37, 0.051), (38, 0.002), (39, -0.041), (40, -0.074), (41, -0.02), (42, -0.011), (43, 0.031), (44, -0.077), (45, -0.033), (46, 0.006), (47, 0.043), (48, -0.003), (49, 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95663118 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

2 0.81960982 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-gredilla

Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: Low data regime, data with censored values, classiﬁcation, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.

3 0.80774838 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

Author: Michael Osborne, Roman Garnett, Zoubin Ghahramani, David K. Duvenaud, Stephen J. Roberts, Carl E. Rasmussen

4 0.7840178 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

5 0.62437093 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

6 0.56689286 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

7 0.54761535 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

8 0.4904688 11 nips-2012-A Marginalized Particle Gaussian Process Regression

9 0.44027388 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

10 0.42018661 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

11 0.33669218 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

12 0.29479548 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

13 0.27978176 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

14 0.27588645 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

15 0.26891571 37 nips-2012-Affine Independent Variational Inference

16 0.26439229 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

17 0.26436549 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

18 0.2624391 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

19 0.25857794 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

20 0.25852457 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.072), (21, 0.019), (38, 0.121), (42, 0.018), (53, 0.012), (54, 0.028), (55, 0.041), (59, 0.176), (74, 0.049), (76, 0.206), (80, 0.102), (92, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94281787 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

Author: Qirong Ho, Junming Yin, Eric P. Xing

Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N^2) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is Θ(Σ_i D_i^2) (where D_i is the degree of vertex i), which is much smaller than N^2 for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280,000-node network, which is infeasible for network models with Ω(N^2) inference cost.
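
The motif definition in the abstract above (a vertex triple containing 2 or 3 edges) can be checked on a toy graph. The following is a minimal sketch, not the authors' implementation: the function name `count_triangular_motifs` and the edge-list input format are assumptions made for illustration. It counts motifs by walking over pairs of neighbours of each vertex, so the work done is proportional to the sum over vertices of C(D_i, 2), which is the degree-based quantity the abstract contrasts with N^2.

```python
from itertools import combinations


def count_triangular_motifs(edges):
    """Return (motifs, wedges) for an undirected simple graph.

    motifs: number of vertex triples containing exactly 2 or 3 edges.
    wedges: sum over vertices of C(D_i, 2), i.e. the neighbour pairs visited.
    Both names are hypothetical; the input is a list of (u, v) pairs.
    """
    # Build adjacency sets; self-loops and duplicate edges are ignored.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    wedges = 0  # neighbour pairs around a centre vertex
    closed = 0  # wedges whose two endpoints are also connected
    for nbrs in adj.values():
        for a, b in combinations(sorted(nbrs), 2):
            wedges += 1
            if b in adj[a]:
                closed += 1

    triangles = closed // 3        # each triangle is seen once per corner
    open_wedges = wedges - closed  # triples with exactly 2 edges
    return open_wedges + triangles, wedges


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
print(count_triangular_motifs([(0, 1), (1, 2), (2, 0), (2, 3)]))
```

On this four-vertex graph the loop visits 5 neighbour pairs and finds 3 motifs (one triangle and two open wedges), while an edge representation would have to account for every vertex pair, present or missing.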

2 0.89488858 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this "independence" approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several large-scale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner.

same-paper 3 0.89319295 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

4 0.88657475 294 nips-2012-Repulsive Mixtures

Author: Francesca Petralia, Vinayak Rao, David B. Dunson

Abstract: Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set. Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture.

5 0.88430959 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

Author: Daniel Zoran, Yair Weiss

Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of number of mixture components including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models.

1 GMMs and natural image statistics models

Many models for the statistics of natural image patches have been suggested in recent years. Finding good models for natural images is important to many different research areas - computer vision, biological vision and neuroscience among others. Recently, there has been a growing interest in comparing different aspects of models for natural images such as log-likelihood and multi-information reduction performance, and much progress has been achieved [1, 2, 3, 4, 5, 6]. Out of these results there is one which is particularly interesting: simple, unconstrained Gaussian Mixture Models (GMMs) with a relatively small number of mixture components learned from image patches are extraordinarily good in modeling image statistics [6, 4]. This is a surprising result due to the simplicity of GMMs and their ubiquity. Another surprising aspect of this result is that many of the current models may be thought of as GMMs with an exponential or infinite number of components, having different constraints on the covariance structure of the mixture components.

In this work we study the nature of GMMs learned from natural image patches. We start with a thorough comparison to some popular and cutting edge image models. We show that indeed, GMMs are excellent performers in modeling natural image patches. We then analyze what properties of natural images these GMMs capture, their dependence on the number of components in the mixture and their relation to the structure of the world around us. Finally, we show that the learned GMM suggests a strong connection between natural image statistics and a simple variant of the dead leaves model [7, 8], explicitly modeling occlusions and explaining some of the success of GMMs in modeling natural images.

6 0.87890267 338 nips-2012-The Perturbed Variation

7 0.86079526 298 nips-2012-Scalable Inference of Overlapping Communities

8 0.83261442 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

9 0.83045697 188 nips-2012-Learning from Distributions via Support Measure Machines

10 0.82828265 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.82793838 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

12 0.82774556 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

13 0.82690746 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

14 0.82502192 126 nips-2012-FastEx: Hash Clustering with Exponential Families

15 0.82386506 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

16 0.82298601 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

17 0.82289392 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

18 0.82279676 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

19 0.82245654 197 nips-2012-Learning with Recursive Perceptual Representations

20 0.82233411 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models