
181 nips-2012-Learning Multiple Tasks using Shared Hypotheses


Source: pdf

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypotheses, and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. [sent-6, score-0.43]

2 Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. [sent-7, score-0.775]

3 Each task is then mapped to a single hypothesis in the pool (hard association). [sent-8, score-0.243]

4 We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypotheses, and the VC dimension of the hypotheses class. [sent-9, score-0.634]

5 1 Introduction Consider a sentiment analysis task for a set of reviews of different products. [sent-11, score-0.363]

6 Furthermore, reviewers may use different amounts and levels of superlatives to describe the same sentiment level, or feel different sentiment levels yet describe the product with the same text. [sent-13, score-0.393]

7 Should one build individual sentiment predictors, one per product, based on a small amount of data, or build a single sentiment predictor for all products, based on mixed input with potentially heterogeneous linguistic usage? [sent-15, score-0.586]

8 In addition, the different tasks might differ in both domain (the text used) and predictions (the sentiment associated with given text), which may raise the dilemma between clustering related tasks or related domains. [sent-19, score-0.601]

9 Rather than clustering the different tasks before learning, we perform the clustering as part of the learning task. [sent-21, score-0.246]

10 The idea is that we can control the complexity of the learning process by deciding on the size of the pool of shared classifiers. [sent-24, score-0.302]

11 We start by computing upper and lower bounds on the VC dimension, showing that the VC dimension is at most O(T log k + kd log(T kd)), where T is the number of domains, k the number of shared hypotheses, and d the VC dimension of the basic hypothesis class. [sent-27, score-0.931]

12 This shows that the dependency on the number of tasks (T) and the number of shared hypotheses (k) is very different; namely, increasing the number of shared hypotheses increases the VC dimension only logarithmically. [sent-29, score-1.036]

13 This will imply that if we have N examples per task, the generalization error is only $\tilde{O}\left(\sqrt{\frac{\log k}{N} + \frac{dk}{TN}}\right)$, compared to $O\left(\sqrt{\frac{d}{N}}\right)$ when learning each task individually. [sent-30, score-0.397]
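
To make this comparison concrete, the following small Python sketch plugs illustrative numbers into the two rates, ignoring constants and logarithmic factors; the specific values of N, d, k, and T below are hypothetical.

```python
import math

def shared_pool_rate(N, d, k, T):
    """Illustrative shared-pool rate (constants and log factors ignored):
    sqrt(log(k)/N + d*k/(T*N))."""
    return math.sqrt(math.log(k) / N + d * k / (T * N))

def individual_rate(N, d):
    """Illustrative per-task rate when learning each task alone: sqrt(d/N)."""
    return math.sqrt(d / N)

# Hypothetical regime the paper targets: many tasks, few examples per task.
N, d, k, T = 18, 50, 10, 200
print(round(shared_pool_rate(N, d, k, T), 2))  # ~0.52
print(round(individual_rate(N, d), 2))         # ~1.67 (vacuous, since d >> N)
```

With few examples per task the individual bound is vacuous, while the shared-pool bound remains informative.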

14 We also derived a K-means-like algorithm to learn such classifiers, alternating between fitting the models and associating tasks with models. [sent-32, score-0.376]

15 We conducted experiments with both synthetic problems and sentiment prediction, with the number of tasks ranging between 30 and 370, some with as few as 18 examples in the training set. [sent-34, score-0.592]

16 Our experimental results strongly support the benefits of the approach we propose here, which attains a lower test error than learning individual models per task or a single model for all tasks. [sent-35, score-0.355]

17 In domain adaptation we often assume that the tasks to be performed are very similar to each other, yet the data comes from different distributions, and often there is only unlabeled data from the domain (or task) of interest. [sent-37, score-0.468]

18 [20] proposed to learn one model per task, yet to find a small set of shared features using mixed-norm regularization. [sent-47, score-0.362]

19 [4] took a similar approach, yet with the added flexibility that the feature space can also be rotated before choosing this small shared set. [sent-49, score-0.295]

20 [1] learn by first finding a linear transformation shared by all tasks, and then individual models per task. [sent-51, score-0.506]

21 Evgeniou [13] and Daume [15] proposed to combine two models, one individual per task and the other shared across all tasks, and combine them at test time, while later Evgeniou et al. [sent-53, score-0.534]

22 Finally, there exists a large body of work on multi-task learning in the Bayesian setting, where a shared prior is used to connect or relate the various tasks [5, 22, 16], while other works [17, 21, 9] use Gaussian process predictors. [sent-55, score-0.496]

23 They assumed that the relative error (or a bound on it) is known, and proved a generalization bound for that task; their bounds suggest using some of the datasets, but not all, when building a model for the main task. [sent-58, score-0.231]

24 We do not assume this knowledge, and we learn the tasks simultaneously. [sent-60, score-0.275]

25 2 Model There is a set T of T tasks, and with each task t there is an associated distribution Dt over pairs (x, y), where x ∈ R^r and y ∈ Y. [sent-61, score-0.349]
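
For concreteness, one way the true and empirical errors used in the bounds below could be formalized is sketched here in LaTeX; the n_t-weighted averaging over tasks is an assumption rather than something stated above.

```latex
% A possible formalization (the n_t-weighted averaging is an assumption):
% a hard k-shared classifier f = (H_k, g) predicts h_{g(t)}(x) on task t.
\[
  e(f) = \sum_{t \in \mathcal{T}} \frac{n_t}{m}
         \Pr_{(x,y) \sim D_t}\!\left[ h_{g(t)}(x) \neq y \right],
  \qquad
  \hat{e}(f) = \frac{1}{m} \sum_{t \in \mathcal{T}} \sum_{i=1}^{n_t}
         \mathbb{I}\!\left[ h_{g(t)}(x_{t,i}) \neq y_{t,i} \right],
\]
% where n_t is the number of examples for task t and m = \sum_t n_t.
```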

26 Hk = {h1, . . . , hk} ⊂ H is a set of k hypotheses from a class of functions H = {h : R^r → Y}. [sent-70, score-0.341]

27 The function g maps each task t ∈ T to the hypotheses pool Hk , where the mapping is to a single hypothesis (hard association). [sent-71, score-0.323]

28 In the hard k-shared task classifier, g maps each task t ∈ T to one hypothesis h_i ∈ Hk; i.e., all of task t's predictions are made by the single pooled hypothesis h_{g(t)}. [sent-76, score-0.343]

29 The class of hard k-shared task classifiers using hypotheses class H includes all such (Hk, g) classifiers, i.e., all choices of the pool and of the assignment. [sent-82, score-0.246]
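
A minimal code sketch of such a classifier may help fix ideas; the class and field names below are hypothetical, and hypotheses are assumed to map batches of feature vectors to ±1 labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import numpy as np

Hypothesis = Callable[[np.ndarray], np.ndarray]  # h : (n, r) features -> (n,) labels

@dataclass
class HardKSharedClassifier:
    """Sketch of a hard k-shared task classifier (Hk, g): a pool of k
    hypotheses plus a hard assignment of each task to one of them."""
    pool: List[Hypothesis]      # Hk = {h_1, ..., h_k}
    assignment: Dict[int, int]  # g : task id -> index into the pool

    def predict(self, task: int, X: np.ndarray) -> np.ndarray:
        # All examples of task t are labeled by the single hypothesis h_{g(t)}.
        return self.pool[self.assignment[task]](X)
```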

30 Our main goal in this section is to bound the VC dimension of the resulting hypothesis class FH,k , assuming the VC dimension of H is d. [sent-91, score-0.298]

31 We show that the VC dimension of FH,k is at most O(T log k + kd log(T kd)) and at least Ω(T log k + dk). [sent-92, score-0.388]

32 For any hypothesis class H of VC-dimension d, the class of hard k-shared task classifiers FH,k has VC dimension at most the minimum between dT and O(T log k + kd log(T kd)). [sent-94, score-0.614]

33 Proof: Our main goal is to derive an upper bound on the number of possible labelings using hard k-shared task classifiers FH,k. [sent-95, score-0.244]

34 Let $\Phi_d(m) = \sum_{j=0}^{d} \binom{m}{j}$ be an upper bound on the number of labelings of m examples using a hypothesis class of VC dimension d. [sent-97, score-0.376]
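
As a quick sanity check, Φ_d(m) can be computed directly; for m ≤ d it equals 2^m, and beyond that it grows only polynomially in m.

```python
from math import comb

def phi(d: int, m: int) -> int:
    """Sauer-Shelah bound: Phi_d(m) = sum_{j=0}^{d} C(m, j), an upper bound
    on the number of labelings of m points by a class of VC dimension d."""
    return sum(comb(m, j) for j in range(min(d, m) + 1))

assert phi(3, 3) == 2 ** 3       # below the VC dimension all labelings occur
print(phi(3, 100))               # 166751, far below 2**100
```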

35 We consider all mappings g of the T tasks to Hk; there are k^T such mappings. [sent-99, score-0.246]

36 Fix a particular mapping g in which hypothesis h_j has tasks S^j ⊂ T assigned to it. [sent-100, score-0.436]

37 (At this point h_j ∈ H is not fixed yet; we are only fixing g and the tasks that are mapped to the j-th hypothesis in Hk.) [sent-101, score-0.436]

38 There are $m_j = \sum_{t \in S^j} n_t$ examples for the tasks in S^j, and therefore at most $\Phi_d(m_j)$ labelings. [sent-102, score-0.419]

39 We can upper bound the number of labelings of any hypothesis pool Hk by $\prod_{j=1}^{k} \Phi_d(m_j)$. [sent-104, score-0.292]

40 Since $m = \sum_j m_j$, this bound is maximized when m_j = m/k, which implies that the number of labelings is upper bounded by $k^T (em/dk)^{dk}$. [sent-105, score-0.22]

41 We need to find the largest m for which $m \le kd \log(em/dk) + T \log k = T \log k + kd \log(e/dk) + kd \log m \le T \log k + kd \log m$, where the last inequality holds for dk ≥ 3. [sent-109, score-1.24]

42 This implies that whenever $2^m \le k^T (em/dk)^{dk}$ we have $m \le T \log k + 16 kd \log(T dk \log k) = O(T \log k + kd \log(T kd))$, which gives an upper bound on the number of points that can be shattered, and hence on the VC dimension. [sent-111, score-0.492]

43 To show the upper bound of dT , we simply let each domain select a separate hypothesis from H. [sent-112, score-0.24]

44 Figure 1: The SHAMO algorithm for learning shared models. [sent-133, score-0.487]

45 For any hypothesis class H of VC-dimension d, for any hard k-shared task classifier f = (Hk, g) we have that with probability 1 − δ, $|e(f) - \hat{e}(f)| = O\left(\sqrt{\frac{(T \log k + kd \log(T kd)) \log(m/T) + \log(1/\delta)}{m}}\right)$. [sent-136, score-0.582]

46 For any hypothesis class H of VC-dimension d, for any k, for any hard k-shared task classifier f = (Hk, g) we have that with probability 1 − δ, $|e(f) - \hat{e}(f)| = O\left(\sqrt{\frac{(T \log k + kd \log(T kd)) \log(m/T) + \log(k/\delta)}{m}}\right)$. [sent-139, score-0.528]

47 The last two bounds state that the empirical error is close to the true error under two conditions: first, that T log k is small compared with $m = \sum_t n_t$. [sent-140, score-0.346]

48 The main point is that if the VC dimension is large and the average number of examples m/T is low, it is possible to compensate if the number of models k is small relative to the number of tasks T . [sent-144, score-0.435]

49 Hence, we expect to improve performance over individual models if there are many tasks, yet we predict with relatively few models. [sent-145, score-0.234]

50 There is a hypothesis class H of VC-dimension d such that the class of hard k-shared task classifiers FH,k has VC dimension at least max{kd, T min{d, log k}}. [sent-148, score-0.396]

51 Proof: To show the lower bound of kd, consider d points that H shatters, x1, . . . , xd. [sent-149, score-0.254]

52 For any labeling of S, we can select for each domain j a different hypothesis from H that agrees with the labeling. [sent-154, score-0.24]

53 There exist h1, . . . , hk ∈ H such that for any labeling of the xi's there is a hypothesis hj which is consistent with it. [sent-164, score-0.49]

54 (Since k hypotheses can shatter at most log k points, we get the dependency on log k.) [sent-165, score-0.225]

55 The middle (experiment I) and right (experiment II) panels show the average error vs k for the three algorithms, and the “effective” number of models vs k (right axis). [sent-173, score-0.346]

56 Here we simply set g(t) to be the model which attains the lowest error evaluated on the training set, that is, $g(t) = \arg\min_{j} \frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{I}\left[h_j(x_{t,i}) \neq y_{t,i}\right]$. [sent-185, score-0.328]
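
The association step can be sketched in a few lines of numpy; the data layout (one (X, y) pair per task, models as callables returning ±1 labels) is an assumption.

```python
import numpy as np

def associate(task_X, task_y, models):
    """Hard association: map each task t to the pooled model with the lowest
    empirical error on that task's examples, g(t) = argmin_j error_t(h_j)."""
    g = {}
    for t, (X, y) in enumerate(zip(task_X, task_y)):
        errors = [np.mean(h(X) != y) for h in models]
        g[t] = int(np.argmin(errors))
    return g
```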

57 Clearly, each stage reduces the training error of (1), but how far the resulting hypotheses are from the optimal ones is not clear. [sent-188, score-0.261]

58 We found in practice that this leads to over-fitting; that is, in the second stage suboptimal hypotheses are assigned to g if evaluated on the test set (which clearly is not known at training time). [sent-190, score-0.227]

59 We call this algorithm SHAMO for learning with shared models. [sent-195, score-0.25]
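
Putting the pieces together, here is a hedged sketch of the alternating procedure; the base learner, the initialization, and the handling of empty clusters are all assumptions, while the split into build and association examples follows the η-split discussed above (each task is assumed to have at least two examples). Each iteration refits the pooled models on the union of their tasks' build examples and then re-associates tasks on held-out data, mirroring the K-means alternation.

```python
import numpy as np

def fit_linear(X, y):
    """Fit a linear classifier by regularized least squares on +/-1 labels
    (a stand-in base learner; the text does not prescribe one)."""
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)
    return lambda Z, w=w: np.sign(Z @ w)

def shamo(task_X, task_y, k, iters=20, eta=0.5, seed=0):
    """K-means-like alternation: (a) refit each pooled model on the union of
    its tasks' build examples, (b) re-associate tasks on held-out association
    examples to avoid the over-fitting noted above."""
    rng = np.random.default_rng(seed)
    T = len(task_X)
    build, assoc = [], []
    for X, y in zip(task_X, task_y):
        n_build = max(1, int(eta * len(y)))
        build.append((X[:n_build], y[:n_build]))
        assoc.append((X[n_build:], y[n_build:]))
    g = rng.integers(k, size=T)                  # random initial association
    for _ in range(iters):
        models = []
        for j in range(k):
            idx = [t for t in range(T) if g[t] == j]
            if not idx:                          # keep empty clusters alive
                idx = [int(rng.integers(T))]
            Xj = np.vstack([build[t][0] for t in idx])
            yj = np.concatenate([build[t][1] for t in idx])
            models.append(fit_linear(Xj, yj))
        g_new = np.array([int(np.argmin([np.mean(h(X) != y) for h in models]))
                          for X, y in assoc])
        if np.array_equal(g_new, g):
            break
        g = g_new
    return models, g
```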

60 5 Empirical Study We evaluated our algorithm on both synthetic and real-world sentiment classification tasks. [sent-198, score-0.233]

61 Three methods are evaluated: learning one model per task (called Individual below), learning one model for all tasks (called Shared below), and learning k models using our algorithm, SHAMO. [sent-200, score-0.339]

62 The first two inputs of task t were drawn with a covariance specific to that task. [sent-207, score-0.272]

63 The label of each example (x1, . . . , x20) was set to sign(x2 · st), where st ∈ {−1, +1} with probability one half. [sent-212, score-0.278]

64 We generated T = 200 such tasks, each with 6 training examples (with at least one example from each class), and ran our algorithm for various values of k. [sent-213, score-0.355]
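
A sketch of this data generator is below; the exact per-task covariance of the first two inputs is not specified above, so the one used here is improvised, and the label rule sign(x2 · st) follows the text.

```python
import numpy as np

def make_tasks(T=200, n=6, r=20, seed=0):
    """Sketch of the synthetic setup: the covariance of the first two inputs
    is improvised (not specified above); labels are sign(x2 * s_t)."""
    rng = np.random.default_rng(seed)
    task_X, task_y, signs = [], [], []
    for _ in range(T):
        A = rng.normal(size=(2, 2))
        cov = A @ A.T + 0.1 * np.eye(2)          # hypothetical task covariance
        s_t = int(rng.choice([-1, 1]))
        while True:                              # at least one example per class
            X = np.empty((n, r))
            X[:, :2] = rng.multivariate_normal(np.zeros(2), cov, size=n)
            X[:, 2:] = rng.normal(size=(n, r - 2))
            y = np.sign(X[:, 1] * s_t)           # x2 is the second coordinate
            if np.unique(y).size == 2:
                break
        task_X.append(X); task_y.append(y); signs.append(s_t)
    return task_X, task_y, signs
```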

65 Clearly, two models are enough to classify all tasks correctly (depending on the value of st above); furthermore, applying the wrong model yields a test error of 100%. [sent-217, score-0.568]

66 All 6 examples were used both to build models and to associate models with tasks. [sent-218, score-0.229]

67 Results are summarized in Fig. 2, in which we plot the mean error of the three algorithms vs the number of models k, with error bars for 95% confidence intervals. [sent-222, score-0.325]

68 It is clear that Shared performs worst with an average error of 50% (highest line), which is explained by the fact that the test error of half of the models over the other half of the data-sets is about 100%. [sent-224, score-0.352]

69 The dotted black line indicates the number of “effective” models per value of k, namely the smallest number of models such that at least 90 tasks are associated with (exactly) one of them. [sent-227, score-0.394]
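
The definition of “effective” models above is slightly ambiguous, so the following code fixes one reading and flags it as an assumption.

```python
import numpy as np

def effective_models(g, k, coverage=90):
    """One reading of the 'effective' number of models: the smallest number
    of models that together have at least `coverage` tasks associated with
    them (the threshold of 90 tasks is taken from the text)."""
    counts = np.sort(np.bincount(np.asarray(g), minlength=k))[::-1]
    covered = np.cumsum(counts)
    hits = np.nonzero(covered >= coverage)[0]
    return int(hits[0]) + 1 if hits.size else k
```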

70 The first ten inputs of task t were drawn with a covariance specific to that task. [sent-235, score-0.272]

71 In this experiment ten models are enough to classify all tasks correctly, yet applying the wrong model yields a test error of only 50%. [sent-247, score-0.474]

72 Out of the 25 examples available for each task, 7 were used to build models, and the remaining 18 were used to associate models with tasks (η = 7/25). [sent-248, score-0.384]

73 Results are summarized in Fig. 2, in which we plot the mean error of the three algorithms vs the number of models k, with error bars for 95% confidence intervals. [sent-252, score-0.325]

74 As before, Shared performs worst, and Individual performs second, with a test error of about 15% obtained with 25 training examples. [sent-254, score-0.247]

75 We downloaded 2,000 reviews from each of 31 categories, such as books, DVDs, and so on; a total of 62,000 reviews altogether. [sent-262, score-0.224]

76 The reviews we used were originally labeled with 1, 2, 4, or 5 stars; reviews with 3 stars were excluded, as they were very hard to predict, even with a very large amount of data. [sent-264, score-0.345]

77 Figure 4 (panels (g) Data D, 186 tasks, and (h) Data E, 248 tasks; x-axis: No. Models (K)): Top: test error of Individual and Shared algorithms vs test error of SHAMO for k = 14, for all datasets with 2 thresholds. [sent-269, score-0.363]

78 Bottom: average error vs k for the three algorithms, and the “effective” number of models vs k (right axis). [sent-270, score-0.316]

79 Since we focus on the case of many tasks with a small amount of data each, we used about 1/9 of the data for training and the remainder for evaluation. [sent-274, score-0.305]

80 The outcome is 62 tasks, each with 108 training examples and 892 test examples. [sent-282, score-0.39]

81 For the third dataset we partitioned the reviews into three sets, using one of the three goals above (is the number of stars above or below 1, is it above or below 3, and is it above or below 5), ending up with 93 tasks with 72 training examples and 592 test examples each. [sent-285, score-0.579]
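
The construction of binary tasks from categories and thresholds can be sketched as follows; the input format and the exact threshold values are assumptions.

```python
from collections import defaultdict

def build_tasks(reviews, thresholds=(1, 3, 5)):
    """Hypothetical sketch of turning the review data into binary tasks:
    one task per (category, threshold) pair, labeling a review by whether
    its star rating is above the threshold. `reviews` is assumed to be an
    iterable of (category, stars, text) triples."""
    tasks = defaultdict(list)
    for category, stars, text in reviews:
        for thr in thresholds:
            label = +1 if stars > thr else -1
            tasks[(category, thr)].append((text, label))
    return tasks  # e.g., 31 categories x 3 thresholds = 93 tasks
```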

82 We set η = 0.5, and used one half of the examples to build models (to set the weights of the prediction functions) and the remaining half to evaluate each of the k models on the T tasks and associate models with tasks. [sent-293, score-0.33]

83 The top panel shows the error of Individual and Shared vs SHAMO for k = 14. [sent-297, score-0.222]

84 Indeed, all the red squares (corresponding to Individual) are above the blue circles (corresponding to Shared), indicating that the shared model outperforms individual models. [sent-300, score-0.384]

85 Figure 5 (panels include Data H, 279 tasks, and (d) Data I, 372 tasks; x-axis: No. Models (K)): Average error vs k for the three algorithms, and the “effective” number of models vs k (right axis). [sent-305, score-0.808]

86 The top panels show the test error of the Individual and Shared algorithms vs the test error of SHAMO for k = 14, with the number of tasks increasing from left to right. [sent-314, score-0.616]

87 This gap gets smaller as the number of tasks grows (the clouds overlap more as we go from the left panel to the right). [sent-316, score-0.328]

88 Intuitively, Shared introduces (label) bias, as the two thresholds are treated as one, while Individual introduces variance, as smaller and smaller training sets are used; as we go from the left panel to the right one, the gap between bias and variance shrinks as the variance increases. [sent-317, score-0.22]

89 As Shared is affected by neither k nor T (the total number of training examples remains the same), its test error of 36% is fixed across panels. [sent-322, score-0.259]

90 As we have more tasks, and fewer training examples per task, the test error of Individual increases from about 25%. [sent-323, score-0.275]

91 First, the gap between Individual and Shared is much smaller: in some tasks one is better, and in other tasks the other is better. [sent-335, score-0.529]

92 Additionally, for the smallest number of tasks (left) Individual is better with a gap of ∼1.5%, while for the largest number of tasks Individual is worse, with a gap ranging between 1% and 4%. [sent-336, score-0.283] [sent-337, score-0.311]

94 Summary We described a theoretical framework for multitask learning using a small number of shared models. [sent-342, score-0.278]

95 Our theory suggests that many tasks can be used to compensate for a small number of training examples per task, if one can partition the tasks into a few sets with a similar labeling function per set. [sent-343, score-0.516]

96 Our experimental results on both hand-crafted problems and a real-world sentiment classification problem strongly support the benefits of the approach, even with very few examples per task. [sent-345, score-0.262]

97 We also plan to derive theory and algorithms for soft association of tasks to classifiers. [sent-347, score-0.292]

98 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-353, score-0.246]

99 Generalizing from several related classification tasks to a new unlabeled sample. [sent-369, score-0.246]

100 Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. [sent-372, score-0.225]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('shamo', 0.588), ('shared', 0.25), ('tasks', 0.246), ('hk', 0.237), ('kd', 0.218), ('vc', 0.214), ('avearge', 0.202), ('sentiment', 0.174), ('fhk', 0.147), ('st', 0.139), ('individual', 0.134), ('hypothesis', 0.114), ('reviews', 0.112), ('error', 0.093), ('vs', 0.084), ('stars', 0.08), ('hypotheses', 0.08), ('thresholds', 0.079), ('task', 0.077), ('mansour', 0.077), ('hj', 0.076), ('nt', 0.076), ('koby', 0.07), ('effective', 0.065), ('labeling', 0.063), ('domain', 0.063), ('dimension', 0.062), ('training', 0.059), ('crammer', 0.056), ('models', 0.055), ('log', 0.054), ('evgeniou', 0.052), ('pool', 0.052), ('adaptation', 0.051), ('examples', 0.05), ('dk', 0.049), ('yishay', 0.049), ('mj', 0.047), ('association', 0.046), ('panel', 0.045), ('yet', 0.045), ('classi', 0.045), ('ers', 0.042), ('theodoros', 0.042), ('hard', 0.041), ('massimiliano', 0.04), ('rr', 0.04), ('jennifer', 0.038), ('per', 0.038), ('gap', 0.037), ('blitzer', 0.037), ('afshin', 0.037), ('aviv', 0.037), ('shatter', 0.037), ('shattered', 0.037), ('tel', 0.037), ('tting', 0.037), ('generalization', 0.036), ('bound', 0.036), ('associating', 0.036), ('test', 0.035), ('synthetic', 0.035), ('hi', 0.034), ('dt', 0.033), ('corollary', 0.033), ('build', 0.033), ('blanchard', 0.032), ('ando', 0.03), ('mehryar', 0.03), ('jt', 0.03), ('bounds', 0.03), ('panels', 0.03), ('performs', 0.03), ('stage', 0.029), ('ij', 0.029), ('learn', 0.029), ('ranging', 0.028), ('hg', 0.028), ('volker', 0.028), ('multitask', 0.028), ('upper', 0.027), ('dataset', 0.027), ('inputs', 0.026), ('tresp', 0.026), ('fernando', 0.026), ('moderately', 0.025), ('calling', 0.025), ('er', 0.025), ('class', 0.024), ('evaluated', 0.024), ('summarized', 0.024), ('kai', 0.024), ('datasets', 0.023), ('amit', 0.023), ('axis', 0.023), ('half', 0.023), ('compensate', 0.022), ('hal', 0.022), ('affected', 0.022), ('daum', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypotheses, and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1

2 0.13010657 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form intertask covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T. We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks. 1

3 0.12874649 291 nips-2012-Reducing statistical time-series problems to binary classification

Author: Daniil Ryabko, Jeremie Mary

Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1

4 0.11533426 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

5 0.11347025 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over R^d. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which deal only with estimation error.
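A minimal sketch of the One vs. All reduction analyzed above: one binary halfspace per class, prediction by the largest real-valued score. The base learner (a linear SVM here) and the scoring rule are illustrative choices.

import numpy as np
from sklearn.svm import LinearSVC

class OneVsAll:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One binary problem per class: this class vs. the rest.
        self.models_ = [LinearSVC().fit(X, (y == c).astype(int))
                        for c in self.classes_]
        return self

    def predict(self, X):
        # Each model scores its own class; predict the highest-scoring one.
        scores = np.column_stack([m.decision_function(X)
                                  for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]

The other reductions the abstract lists differ mainly in which binary problems are formed (all pairs, tree splits, random ECOC columns) and in how the binary outputs are decoded.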

6 0.10484522 142 nips-2012-Generalization Bounds for Domain Adaptation

7 0.098868564 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

8 0.084566712 361 nips-2012-Volume Regularization for Binary Classification

9 0.082369126 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

10 0.079550914 200 nips-2012-Local Supervised Learning through Space Partitioning

11 0.071580671 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

12 0.066641226 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

13 0.064085864 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

14 0.063981138 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

15 0.062171295 282 nips-2012-Proximal Newton-type methods for convex optimization

16 0.061621133 32 nips-2012-Active Comparison of Prediction Models

17 0.06027583 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

18 0.059610702 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

19 0.056812555 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

20 0.056332089 221 nips-2012-Multi-Stage Multi-Task Feature Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.009), (2, 0.011), (3, -0.036), (4, 0.049), (5, 0.003), (6, 0.003), (7, 0.088), (8, -0.036), (9, -0.058), (10, -0.007), (11, 0.016), (12, 0.025), (13, -0.031), (14, 0.026), (15, -0.048), (16, -0.054), (17, 0.081), (18, -0.061), (19, 0.041), (20, -0.07), (21, -0.014), (22, -0.03), (23, 0.023), (24, 0.022), (25, -0.027), (26, 0.096), (27, -0.056), (28, -0.013), (29, -0.134), (30, 0.007), (31, 0.134), (32, -0.042), (33, -0.141), (34, -0.05), (35, -0.004), (36, -0.12), (37, 0.003), (38, -0.015), (39, -0.106), (40, 0.003), (41, -0.022), (42, 0.007), (43, 0.039), (44, 0.011), (45, -0.038), (46, -0.03), (47, -0.021), (48, -0.13), (49, -0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93619293 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach.
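The abstract does not spell out a training procedure, so the following is only a plausible k-means-style alternation for fitting a pool of k shared hypotheses with hard task-to-hypothesis association; the paper's actual algorithm may differ.

import numpy as np
from sklearn.linear_model import LogisticRegression

def base_fit(X, y):
    # One hypothesis from the underlying class (illustrative choice).
    return LogisticRegression(max_iter=1000).fit(X, y)

def base_err(h, X, y):
    return 1.0 - h.score(X, y)

def fit_shared_pool(tasks, k, iters=10, seed=0):
    # tasks is a list of (X, y) pairs, one per task.
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(tasks))  # random initial association
    pool = [None] * k
    for _ in range(iters):
        # Refit each pooled hypothesis on all tasks assigned to it.
        for j in range(k):
            idx = np.flatnonzero(assign == j)
            if len(idx) == 0:
                continue
            Xj = np.vstack([tasks[t][0] for t in idx])
            yj = np.concatenate([tasks[t][1] for t in idx])
            pool[j] = base_fit(Xj, yj)
        # Re-associate each task with its lowest-error pooled hypothesis.
        for t, (X, y) in enumerate(tasks):
            errs = [base_err(h, X, y) if h is not None else np.inf
                    for h in pool]
            assign[t] = int(np.argmin(errs))
    return pool, assign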

2 0.69062018 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over R^d. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which deal only with estimation error.

3 0.6837042 291 nips-2012-Reducing statistical time-series problems to binary classification

Author: Daniil Ryabko, Jeremie Mary

Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under the most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data.

4 0.62645781 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.
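In symbols (notation ours, but directly following the abstract): with T tasks and empirical risks \hat{L}_t, classical MTL solves

\[
\min_{h_1, \dots, h_T} \; \frac{1}{T} \sum_{t=1}^{T} \hat{L}_t(h_t),
\]

while minimax MTL solves

\[
\min_{h_1, \dots, h_T} \; \max_{1 \le t \le T} \hat{L}_t(h_t),
\]

and the loss-compositional paradigm replaces the outer mean or max with other functions of the risk vector (\hat{L}_1, \dots, \hat{L}_T).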

5 0.61759514 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independence is that it can explain the excellent performance of SVM, whose feature spaces are often of high or infinite dimension. In this paper we address the question of whether such dimensionality independence is intrinsic to margin bounds. We prove a dimensionality-dependent PAC-Bayes margin bound. The bound is monotonically increasing with respect to the dimension when all other factors are held fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension, and that the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality-independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

6 0.61340517 187 nips-2012-Learning curves for multi-task Gaussian process regression

7 0.57705736 361 nips-2012-Volume Regularization for Binary Classification

8 0.57195204 142 nips-2012-Generalization Bounds for Domain Adaptation

9 0.57109708 271 nips-2012-Pointwise Tracking the Optimal Regression Function

10 0.52362561 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button

11 0.52341592 32 nips-2012-Active Comparison of Prediction Models

12 0.51190525 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.49191311 221 nips-2012-Multi-Stage Multi-Task Feature Learning

14 0.48792645 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data

15 0.48697677 225 nips-2012-Multi-task Vector Field Learning

16 0.48049578 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

17 0.47923699 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

18 0.47448811 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

19 0.45415887 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

20 0.44931373 222 nips-2012-Multi-Task Averaging


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (21, 0.014), (38, 0.564), (42, 0.021), (54, 0.012), (55, 0.011), (74, 0.037), (76, 0.139), (80, 0.058), (92, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99667609 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L_p$ Loss

Author: Zhuo Wang, Alan Stocker, Daniel Lee

Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general L_p norm. We generalize the Cramér-Rao lower bound and show how the L_p loss can be written as a functional of the Fisher information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing L_p loss in the limit as p goes to zero.
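The classical bound that the abstract generalizes, as standard background (notation ours): for an unbiased estimator \hat{\theta} of the stimulus \theta,

\[
\mathrm{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}, \qquad
I(\theta) = \mathbb{E}\!\left[ \left( \frac{\partial}{\partial \theta} \log p(r \mid \theta) \right)^{\!2} \right],
\]

and for a Poisson neuron with tuning curve f(\theta) this Fisher information reduces to I(\theta) = f'(\theta)^2 / f(\theta). The paper's contribution is extending such Fisher-information arguments from squared error to general L_p losses in the asymptotic limit.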

2 0.99089849 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization, exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning, for which we provide the first efficient weak oracle.
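For orientation, the standard weak oracle for trace-norm (matrix-norm) problems that boosting steps of this kind call (background; whether it matches the paper's multiview oracle is not claimed here): the linear subproblem over the trace-norm ball,

\[
\arg\max_{\|W\|_* \le \rho} \, \langle -\nabla f(W_t),\, W \rangle \;=\; \rho \, u_1 v_1^\top,
\]

is solved by the top singular vector pair (u_1, v_1) of -\nabla f(W_t), so each iteration adds a single rank-one atom to the solution.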

3 0.99012762 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message-passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual-to-primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state-of-the-art approaches that yield global optima.
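The smoothing step referred to above is typically done with a soft-max (background; the paper's precise smoothed dual is not reproduced here): for a temperature \tau > 0,

\[
\max_{1 \le i \le N} a_i \;\le\; \frac{1}{\tau} \log \sum_{i=1}^{N} e^{\tau a_i} \;\le\; \max_{1 \le i \le N} a_i + \frac{\log N}{\tau},
\]

so replacing each max in the dual LP with this smooth upper bound yields a differentiable objective amenable to coordinate descent, at an approximation cost of O((\log N)/\tau).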

4 0.9853763 165 nips-2012-Iterative ranking from pair-wise comparisons

Author: Sahand Negahban, Sewoong Oh, Devavrat Shah

Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR's TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding 'scores' for each object (e.g. a player's rating) is of interest for understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model, in which each object has an associated score that determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, shows that the number of samples our algorithm requires to learn the scores well is order-optimal. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1].
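A minimal sketch of the random-walk construction described above: probability mass moves from object i toward objects that beat i, and the stationary distribution of the (lazy) chain serves as the score vector. The exact transition normalization below is an illustrative assumption in the spirit of the paper.

import numpy as np

def rank_scores(wins, n_iter=1000):
    # wins[i, j] = number of comparisons in which i beat j.
    n = wins.shape[0]
    total = wins + wins.T
    frac = np.divide(wins, total, out=np.zeros((n, n)),
                     where=total > 0)            # empirical P(i beats j)
    P = frac.T / n                               # step from i toward winners j
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))     # stay put with leftover mass
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pi = pi @ P                              # power iteration to stationarity
    return pi

wins = np.array([[0., 8., 9.],
                 [2., 0., 7.],
                 [1., 3., 0.]])
print(rank_scores(wins))  # highest score for the object that wins most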

same-paper 5 0.9683851 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach.

6 0.96542048 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

7 0.96308464 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.95668077 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

9 0.94788975 208 nips-2012-Matrix reconstruction with the local max norm

10 0.94285995 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

11 0.90866739 285 nips-2012-Query Complexity of Derivative-Free Optimization

12 0.89936823 86 nips-2012-Convex Multi-view Subspace Learning

13 0.89131725 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

14 0.88713539 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

15 0.88504958 30 nips-2012-Accuracy at the Top

16 0.88445073 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

17 0.87784523 324 nips-2012-Stochastic Gradient Descent with Only One Projection

18 0.87700504 187 nips-2012-Learning curves for multi-task Gaussian process regression

19 0.87679279 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

20 0.87547308 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization