nips nips2007 nips2007-135 knowledge-graph by maker-knowledge-mining

135 nips-2007-Multi-task Gaussian Process Prediction


Source: pdf

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. [sent-20, score-0.507]

2 We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. [sent-22, score-0.91]

3 We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. [sent-23, score-0.557]

4 A common setup is that there are multiple related tasks for which we want to avoid tabula rasa learning by sharing information across the different tasks. [sent-26, score-0.189]

5 The hope is that by learning these tasks simultaneously one can improve performance over the “no transfer” case (i.e., the case where each task is learnt separately). [sent-27, score-0.189]

6 However, as pointed out in [1] and supported empirically by [2], assuming relatedness in a set of tasks and simply learning them together can be detrimental. [sent-30, score-0.189]

7 It is therefore important to have models that will generally benefit related tasks and will not hurt performance when these tasks are unrelated. [sent-31, score-0.378]

8 We propose a model that attempts to learn inter-task dependencies based solely on the task identities and the observed data for each task. [sent-33, score-0.115]

9 This contrasts with approaches in [3, 4] where task-descriptor features t were used in a parametric covariance function over different tasks—such a function may be too constrained by both its parametric form and the task descriptors to model task similarities effectively. [sent-34, score-0.917]

10 Hence we propose a model that learns a “free-form” task-similarity matrix, which is used in conjunction with a parameterized covariance function over the input features x. [sent-36, score-0.258]

11 In our model, this is achieved by having a common covariance function over the features x of the input observations. [sent-38, score-0.258]

12 This contrasts with the semiparametric latent factor model [5] where, with the same set of input observations, one has to estimate the parameters of several covariance functions belonging to different latent processes. [sent-39, score-0.437]

13 For our model we can show the interesting theoretical property that there is a cancellation of inter-task transfer in the specific case of noise-free observations and a block design. [sent-40, score-0.781]

14 Finally, we make use of GP approximations and properties of our model in order to scale our approach to large multi-task data sets, and evaluate the benefits of our model on two practical multi-task applications: a compiler performance prediction problem and an exam score prediction task. [sent-42, score-0.623]

15 Given inputs x1, . . . , xN we define the complete set of responses for the M tasks as y = (y11, . . . , yNM)^T. [sent-49, score-0.189]

16 Here yil is the response for the lth task on the ith input xi. [sent-64, score-0.231]

17 Given a set of observations yo , which is a subset of y, we want to predict some of the unobserved response-values yu at some input locations for certain tasks. [sent-66, score-0.382]

18 We approach this problem by placing a GP prior over the latent functions {fl } so that we directly induce correlations between tasks. [sent-67, score-0.114]

19 Assuming that the GPs have zero mean, we set ⟨fl(x) fk(x′)⟩ = K^f_lk k^x(x, x′) and yil ∼ N(fl(xi), σl^2), (1) where K^f is a positive semi-definite (PSD) matrix that specifies the inter-task similarities, k^x is a covariance function over inputs, and σl^2 is the noise variance for the lth task. [sent-68, score-0.477]
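As a concrete illustration of the model in equation (1) — not code from the paper; the squared-exponential correlation function, the randomly generated K^f, and all variable names are my own assumptions — the sketch below assembles the joint covariance over all M tasks and N shared inputs as K^f ⊗ K^x plus task-specific noise D ⊗ I_N:

```python
import numpy as np

def kx_correlation(X, length_scale=1.0):
    """Squared-exponential correlation function over inputs (unit variance)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def joint_covariance(Kf, Kx, noise_vars):
    """Sigma = K^f kron K^x + D kron I_N, with D = diag(sigma_1^2, ..., sigma_M^2)."""
    N = Kx.shape[0]
    return np.kron(Kf, Kx) + np.kron(np.diag(noise_vars), np.eye(N))

rng = np.random.default_rng(0)
M, N = 3, 5                               # tasks, shared input locations
X = rng.standard_normal((N, 2))
A = rng.standard_normal((M, M))
Kf = A @ A.T                              # an illustrative "free-form" PSD task-similarity matrix
Sigma = joint_covariance(Kf, kx_correlation(X), noise_vars=0.1 * np.ones(M))
print(Sigma.shape)                        # (M*N, M*N); rows/columns ordered task-by-task
```

The task-by-task ordering of the rows matches a task-by-task stacking of the targets y.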

20 Below we focus on stationary covariance functions k^x; hence, to avoid redundancy in the parametrization, we further let k^x be only a correlation function (i.e., constrained to have unit variance). [sent-69, score-0.177]

21 The important property of this model is that the joint Gaussian distribution over y is not block-diagonal with respect to tasks, so that observations of one task can affect the predictions on another task. [sent-72, score-0.332]

22 In [4, 3] this property also holds, but instead of specifying a general PSD matrix K^f, these authors set K^f_lk = k^f(tl, tk), where k^f(·, ·) is a covariance function over the task-descriptor features t. [sent-73, score-0.364]

23 One popular setup for multi-task learning is to assume that tasks can be clustered, and that there are inter-task correlations between tasks in the same cluster. [sent-74, score-0.428]

24 This can be easily modelled with a general task-similarity matrix K^f: if we assume that the tasks are ordered with respect to the clusters, then K^f will have a block-diagonal structure. [sent-75, score-0.336]

25 Of course, as we are learning a “free form” K f the ordering of the tasks is irrelevant in practice (and is only useful for explanatory purposes). [sent-76, score-0.189]

26 Inference in our model can be done by using the standard GP formulae for the mean and variance of the predictive distribution with the covariance function given in equation (1). [sent-78, score-0.177]
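For concreteness, a minimal sketch of those standard formulae under the covariance of equation (1) might look as follows (function and variable names are illustrative assumptions, and Sigma is the joint training covariance built as in the previous sketch):

```python
import numpy as np

def predict_task(Kf, kx_star, kx_star_star, Sigma, y, task):
    """GP predictive mean and variance for f_task at a single test input x_*.

    kx_star: length-N vector of k^x(x_*, x_i); kx_star_star: scalar k^x(x_*, x_*);
    Sigma: (M*N, M*N) training covariance; y: stacked targets, task-by-task.
    """
    cross = np.kron(Kf[task], kx_star)          # cov(f_task(x_*), y) under equation (1)
    mean = cross @ np.linalg.solve(Sigma, y)
    var = Kf[task, task] * kx_star_star - cross @ np.linalg.solve(Sigma, cross)
    return mean, var
```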

27 Given the set of observations yo, we wish to learn the parameters θ^x of k^x and the matrix K^f so as to maximize the marginal likelihood p(yo | X, θ^x, K^f). [sent-85, score-0.399]
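A hedged sketch of this objective (the index set `obs` of visible entries and all names are my own; only the sub-block of the Gram matrix at the observed locations is used, as the next sentence notes):

```python
import numpy as np

def neg_log_marginal_likelihood(Sigma_full, y_full, obs):
    """-log p(y_o | X, theta_x, K^f) for the observed subset indexed by `obs`."""
    S = Sigma_full[np.ix_(obs, obs)]               # Gram matrix at the visible locations only
    y = y_full[obs]
    L = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * (y @ alpha) + np.log(np.diag(L)).sum() + 0.5 * len(obs) * np.log(2 * np.pi)
```

Gradients with respect to θ^x and the entries of K^f can then be obtained by standard marginal-likelihood differentiation or automatic differentiation.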

28 Note, however, that one only needs to actually compute the Gram matrix and its inverse at the visible locations corresponding to yo . [sent-91, score-0.302]

29 Alternatively, it is possible to exploit the Kronecker product structure of the full covariance matrix as in [6], where an EM algorithm is proposed such that learning of θ x and K f in the M-step is decoupled. [sent-92, score-0.249]

30 For clarity, let us consider the case where yo = y, i.e., where all response values are observed. [sent-100, score-0.196]

31 One particularly interesting case to consider is that of noise-free observations at the same locations for all tasks (i.e., a block design). [sent-108, score-0.952]

32 In this case, maximizing the marginal likelihood p(y|X) with respect to the parameters θ^x of k^x reduces to maximizing −M log|K^x| − N log|Y^T (K^x)^{-1} Y|, an expression that does not depend on K^f. [sent-111, score-0.108]

33 Then K f is simply the sample covariance of the de-correlated Y . [sent-115, score-0.177]
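Under my reading, that sample-covariance estimate is K^f = Y^T (K^x)^{-1} Y / N; a small illustrative sketch (Y holds the noise-free targets, one column per task):

```python
import numpy as np

def sample_task_covariance(Kx, Y):
    """Sample covariance of the de-correlated targets; Y has shape (N, M)."""
    Lx = np.linalg.cholesky(Kx)
    Y_white = np.linalg.solve(Lx, Y)            # de-correlate the rows of Y over inputs
    return Y_white.T @ Y_white / Y.shape[0]     # equals Y^T (K^x)^{-1} Y / N
```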

34 Unfortunately, in this case there is effectively no transfer between the tasks (given the kernels). [sent-116, score-0.564]

35 We have (using the mixed-product property of Kronecker products) that f̄(x∗) = (K^f ⊗ k^x_∗)^T (K^f ⊗ K^x)^{-1} y (6) = [(K^f)^T (K^f)^{-1}] ⊗ [(k^x_∗)^T (K^x)^{-1}] y (7) = I_M ⊗ [(k^x_∗)^T (K^x)^{-1}] y, whose l-th element is (k^x_∗)^T (K^x)^{-1} y·l (8). [sent-118, score-0.2]

36 Thus, in the noiseless case with a block design, the predictions for task l depend only on the targets y·l . [sent-122, score-0.338]

37 In other words, there is a cancellation of transfer. [sent-123, score-0.196]
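This cancellation is easy to check numerically. The throwaway sketch below (all names and the random toy data are illustrative) compares the multi-task prediction for one task against a single-task GP that sees only that task's targets:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 6
X = rng.standard_normal((N, 1))
Kx = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
A = rng.standard_normal((M, M))
Kf = A @ A.T + 1e-6 * np.eye(M)                 # invertible "free-form" task covariance
Y = rng.standard_normal((N, M))                 # noise-free block design: every task at every input
y = Y.T.reshape(-1)                             # stack targets task-by-task

x_star = rng.standard_normal((1, 1))
kx_star = np.exp(-0.5 * np.sum((x_star - X) ** 2, axis=-1))

Sigma = np.kron(Kf, Kx)                         # noiseless joint covariance
task = 0
cross = np.kron(Kf[task], kx_star)
multi_task_mean = cross @ np.linalg.solve(Sigma, y)
single_task_mean = kx_star @ np.linalg.solve(Kx, Y[:, task])
print(np.allclose(multi_task_mean, single_task_mean))   # True: transfer cancels
```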

38 One can in fact generalize this result to show that the cancellation of transfer for task l still holds even if the observations are only sparsely observed at locations X = (x1, . . . , xN). [sent-124, score-0.799]

39 After having derived this result we learned that it is known as autokrigeability in the geostatistics literature [7], and is also related to the symmetric Markov property of covariance functions that is discussed in [8]. [sent-128, score-0.284]

40 We emphasize that if the observations are noisy, or if there is not a block design, then this result on cancellation of transfer will not hold. [sent-129, score-0.747]

41 This result can also be generalized to multidimensional tensor product covariance functions and grids [9]. [sent-130, score-0.177]

42 Here, we use the Nyström approximation of K^x in the marginal likelihood, so that K^x ≈ K̃^x := K^x_{·I} (K^x_{II})^{-1} K^x_{I·}, where I indexes Q rows/columns of K^x. [sent-134, score-0.132]
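A sketch of this approximation (choosing the index set I by random subsampling is an illustrative assumption; Q controls the cost):

```python
import numpy as np

def nystrom(Kx, Q, rng):
    """K^x ~= K_{.I} (K_{II})^{-1} K_{I.} using Q randomly chosen rows/columns."""
    N = Kx.shape[0]
    I = rng.choice(N, size=Q, replace=False)
    K_nI = Kx[:, I]                              # K_{.I}
    K_II = Kx[np.ix_(I, I)] + 1e-8 * np.eye(Q)   # small jitter for numerical stability
    return K_nI @ np.linalg.solve(K_II, K_nI.T)
```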

43 Specifying a full-rank K^f requires M(M + 1)/2 parameters, and for large M this would be a lot of parameters to estimate. [sent-137, score-0.13]

44 Applying both approximations we get Σ ≈ Σ̃ := K̃^f ⊗ K̃^x + D ⊗ I_N, and, after using the Woodbury identity, Σ̃^{-1} = Δ^{-1} − Δ^{-1} B (I ⊗ K^x_{II} + B^T Δ^{-1} B)^{-1} B^T Δ^{-1}, where B := L̃ ⊗ K^x_{·I} and Δ := D ⊗ I_N is a diagonal matrix. [sent-143, score-0.356]
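Under my reading of that expression, the sketch below applies Σ̃^{-1} to a vector without ever forming the full MN × MN inverse (L_tilde is the low-rank factor of K̃^f, K_nI and K_II are the Nyström blocks from the previous sketch; names are assumptions):

```python
import numpy as np

def apply_sigma_tilde_inv(L_tilde, K_nI, K_II, noise_vars, v):
    """Compute Sigma_tilde^{-1} v via the Woodbury identity.

    Sigma_tilde = (L_tilde L_tilde^T) kron (K_nI K_II^{-1} K_nI^T) + D kron I_N,
    with B = L_tilde kron K_nI and Delta = D kron I_N (D = diag(noise_vars)).
    """
    P = L_tilde.shape[1]
    N = K_nI.shape[0]
    delta = np.repeat(noise_vars, N)                      # diagonal of Delta, task-by-task
    B = np.kron(L_tilde, K_nI)                            # shape (M*N, P*Q)
    inner = np.kron(np.eye(P), K_II) + B.T @ (B / delta[:, None])
    rhs = B.T @ (v / delta)
    return v / delta - (B @ np.linalg.solve(inner, rhs)) / delta
```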

45 For the EM algorithm, the approximation of K^x poses a problem in (4) because, for the rank-deficient matrix K̃^x, the log-determinant is negative infinity and the matrix inverse is undefined. [sent-145, score-0.144]

46 Within the GP literature, [14, 15, 16, 17, 18] give models where the covariance matrix of the full (noiseless) system is block diagonal, and each of the M blocks is induced from the same kernel function. [sent-152, score-0.378]

47 In contrast, in our model and in [5, 3, 4] the covariance is not block diagonal. [sent-154, score-0.274]

48 The semiparametric latent factor model (SLFM) of Teh et al. [5] involves having P latent processes (where P ≤ M), and each of these latent processes has its own covariance function. [sent-155, score-0.554]

49 The noiseless outputs are obtained by linear mixing of these processes with an M × P matrix Φ. [sent-156, score-0.213]

50 The covariance matrix of the system under this model has rank at most P N , so that when P < M the system corresponds to a degenerate GP. [sent-157, score-0.317]

51 Our model is similar to [5] but simpler, in that all of the P latent processes share the same covariance function; this reduces the number of free parameters to be fitted and should help to minimize overfitting. [sent-158, score-0.385]

52 With a common covariance function k^x, it turns out that K^f is equal to ΦΦ^T, so a K^f that is strictly positive definite corresponds to using P = M latent processes. [sent-159, score-0.241]

53 A sum of such processes is known as the linear coregionalization model (LCM) [7] for which [6] gives an EM-based algorithm for parameter estimation. [sent-164, score-0.119]

54 To see this, let Epp be a P × P diagonal matrix with 1 at (p, p) and zero elsewhere. [sent-167, score-0.122]

55 Then we can write the covariance in SLFM as (Φ ⊗ I)(Σ_{p=1}^P Epp ⊗ K^x_p)(Φ ⊗ I)^T = Σ_{p=1}^P (Φ Epp Φ^T) ⊗ K^x_p, where Φ Epp Φ^T is of rank 1. [sent-168, score-0.112]
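The rank-1 decomposition can be verified numerically; a throwaway sketch with random Φ and per-process covariances K^x_p (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, P, N = 4, 2, 5
Phi = rng.standard_normal((M, P))
Kps = [a @ a.T for a in rng.standard_normal((P, N, N))]   # one covariance K^x_p per latent process
PhiI = np.kron(Phi, np.eye(N))

lhs = np.zeros((M * N, M * N))
rhs = np.zeros((M * N, M * N))
for p, Kp in enumerate(Kps):
    Epp = np.zeros((P, P))
    Epp[p, p] = 1.0
    lhs += PhiI @ np.kron(Epp, Kp) @ PhiI.T
    rhs += np.kron(np.outer(Phi[:, p], Phi[:, p]), Kp)    # (Phi Epp Phi^T) kron K^x_p, rank-1 task part
print(np.allclose(lhs, rhs))                              # True
```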

56 [19] consider methods for inducing correlations between tasks based on a correlated prior over linear regression parameters. [sent-170, score-0.271]

57 In fact this corresponds to a GP prior using the kernel k(x, x′) = x^T A x′ for some positive definite matrix A. [sent-171, score-0.104]

58 The first application is a compiler performance prediction problem where the goal is to predict the speed-up obtained in a given program (task) when applying a sequence of code transformations x. [sent-178, score-0.399]

59 The second application is an exam score prediction problem where the goal is to predict the exam score obtained by a student x belonging to a specific school (task). [sent-179, score-0.758]

60 In the sequel, we will refer to the data related to the first problem as the compiler data and the data related to the second problem as the school data. [sent-180, score-0.437]

61 We are interested in assessing the benefits of our approach not only with respect to the no-transfer case but also with respect to the case when a parametric GP is used on the joint input-dependent and task-dependent space as in [3]. [sent-181, score-0.198]

62 To train the parametric model, note that the parameters of the covariance function over task descriptors k^f(t, t′) can be tuned by maximizing the marginal likelihood, as in [3]. [sent-182, score-0.542]

63 For both applications we have used a squared-exponential (or Gaussian) covariance function k x and a non-parametric form for K f . [sent-185, score-0.177]

64 Where relevant the parametric covariance function k f was also taken to be of squared-exponential form. [sent-186, score-0.375]

65 All the length scales in k^x and k^f were initialized to 1, and all σl^2 were constrained to be equal for all tasks and initialized to 0. [sent-190, score-0.189]

66 Each task is to predict the speed-up on a given program when applying a specific transformation sequence. [sent-195, score-0.215]

67 The speed-up after applying a transformation sequence on a given program is defined as the ratio of the execution time of the original program (baseline) over the execution time of the transformed program. [sent-196, score-0.248]

68 In [3] the task-descriptor features (for each program) are based on the speed-ups obtained on a pre-selected set of 8 transformation sequences, so-called “canonical responses”. [sent-198, score-0.123]

69 For comparison with [20, 19] we evaluate our model following the setup described above; similarly, we have created dummy variables for those features that are categorical, forming a total of 19 student-dependent features and 8 school-dependent features. [sent-213, score-0.196]

70 However, we note that school-descriptor features such as the percentage of students eligible for free school meals and the percentage of students in VR band 1 actually depend on the year the particular sample was taken. [sent-214, score-0.822]

71 However, as we have described throughout this paper, our approach learns task similarity directly without the need for task-dependent features. [sent-216, score-0.115]

72 For the compiler data we have M = 11 tasks, and we have used a Cholesky decomposition K^f = L L^T. [sent-218, score-0.432]

73 For the school data we have M = 139 tasks, and we have preferred a reduced-rank parameterization K^f ≈ K̃^f = L̃ L̃^T, with ranks 1, 2, 3 and 5. [sent-219, score-0.451]
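A small sketch of these two parameterizations, written so that the entries can be optimized without any explicit PSD constraint (helper names are my own):

```python
import numpy as np

def full_rank_Kf(theta, M):
    """Cholesky-style parameterization K^f = L L^T from M(M+1)/2 unconstrained parameters."""
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = theta
    return L @ L.T

def reduced_rank_Kf(L_tilde):
    """Reduced-rank parameterization K^f ~= L_tilde L_tilde^T, with L_tilde of shape (M, rank)."""
    return L_tilde @ L_tilde.T
```

Either way K^f is PSD by construction, which is what makes gradient-based learning of the task similarities straightforward.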

74 Compiler Data: For this particular application, in a real-life scenario it is critical to achieve good performance with a low number of training data-points per task given that a training data-point requires the compilation and execution of a (potentially) different version of a program. [sent-223, score-0.157]

75 Therefore, although there are a total of 88214 training points per program, we have followed a similar setup to [3] by considering N = 16, 32, 64 and 128 transformation sequences per program for training. [sent-224, score-0.164]

76 Figure 1 shows the mean absolute errors obtained on the compiler data for some of the tasks (top row and bottom left) and on average for all the tasks (bottom right). [sent-228, score-0.655]

77 Sample task 1 (histogram) is an example where learning the tasks simultaneously brings major benefits over the no transfer case. [sent-229, score-0.712]

78 Additionally, it is consistently (although only marginally) superior to the parametric approach. [sent-231, score-0.198]

79 For sample task 2 (fir), our approach not only significantly outperforms the no transfer case but also provides greater benefits over the parametric method (which for N = 64 and 128 is worse than no transfer). [sent-232, score-0.688]

80 Sample task 3 (adpcm) is the only case out of all 11 tasks where our approach degrades performance, although it should be noted that all the methods perform similarly. [sent-233, score-0.304]

81 Further analysis of the data indicates that learning on this task is hard as there is a lot of variability that cannot be explained by the 1-out-of-13 encoding used for the input features. [sent-234, score-0.22]

82 Finally, for all tasks on average (bottom right) our approach brings significant improvements over single task learning and consistently outperforms the parametric method. [sent-235, score-0.535]

83 For all tasks except one, our model provides performance that is better than or roughly equal to that of the no-transfer case and the parametric model. [sent-236, score-0.387]

84 Given that there can be multiple observations of a target value for a given task at a specific input x, we have taken the mean of these observations and corrected the noise variances by dividing them by the corresponding number of observations. [sent-239, score-0.273]
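A small sketch of that preprocessing step (the column names and the assumption that each input location has a hashable identifier `x` are mine):

```python
import pandas as pd

def collapse_repeats(df):
    """Average repeated targets per (task, input) and record how many observations were merged."""
    grouped = df.groupby(["task", "x"])["y"]
    out = grouped.mean().rename("y_mean").reset_index()
    out["n_obs"] = grouped.count().values   # divide sigma_l^2 by this count for the averaged target
    return out
```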

85 As in [19], the percentage explained variance is used as the measure of performance. [sent-240, score-0.111]

86 Figure 1: Panels (a), (b) and (c) show the average mean absolute error on the compiler data as a function of the number of training points for specific tasks. [sent-267, score-0.277]

87 no transfer stands for the use of a single GP for each task separately; transfer parametric is the use of a GP with a joint parametric (SE) covariance function as in [3]; and transfer free-form is the multi-task GP with a “free form” covariance matrix over tasks. [sent-268, score-2.062]

88 The parametric result given in the table was obtained from the school-descriptor features; in the cases where these features varied for a given school over the years, an average was taken. [sent-272, score-0.473]

89 The columns of Table 1 are: no transfer, parametric, rank 1, rank 2, rank 3 and rank 5. [sent-280, score-0.068]

90 Table 1: Percentage variance explained on the school dataset for various situations. [sent-292, score-0.237]

91 On the school data the parametric approach for K f slightly outperforms the non-parametric method, probably due to the large size of this matrix relative to the amount of data. [sent-294, score-0.464]

92 One can also run the parametric approach creating a task for every unique school-features descriptor; this gives rise to 288 tasks rather than 139 schools, and a performance of 33. [sent-295, score-0.502]

93 They combine both student and school features into x, and then introduce inter-task correlations as described in section 4. [sent-300, score-0.371]

94 This approach uses the same information as our 288 task case, and gives similar performance of around 34% (as shown in Figure 3 of [19]). [sent-301, score-0.115]

95 As noted above, the school features can vary over different years. [sent-303, score-0.275]

96 In this paper we have described a method for multi-task learning based on a GP prior which has inter-task correlations specified by the task-similarity matrix K^f. [sent-304, score-0.237]

97 We have shown that in a noise-free block design there is actually a cancellation of transfer in this model, but not in general. [sent-305, score-0.668]

98 We have successfully applied the method to the compiler and school problems. [sent-306, score-0.437]

99 However, such features might be beneficial if we consider a setup where there are only a few data points for a new task, and where the task-descriptor features convey useful information about the tasks. [sent-310, score-0.162]

100 A note on noise-free Gaussian process prediction with separable covariance functions and grid designs. [sent-355, score-0.227]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transfer', 0.375), ('compiler', 0.243), ('gp', 0.232), ('parametric', 0.198), ('cancellation', 0.196), ('yo', 0.196), ('school', 0.194), ('tasks', 0.189), ('covariance', 0.177), ('exam', 0.168), ('kx', 0.166), ('task', 0.115), ('epp', 0.112), ('fl', 0.112), ('kf', 0.104), ('mae', 0.104), ('llt', 0.097), ('block', 0.097), ('kai', 0.089), ('klk', 0.084), ('lcm', 0.084), ('slfm', 0.084), ('kii', 0.083), ('students', 0.083), ('free', 0.081), ('features', 0.081), ('def', 0.08), ('observations', 0.079), ('noiseless', 0.078), ('evgeniou', 0.078), ('volker', 0.078), ('geostatistics', 0.073), ('yu', 0.073), ('matrix', 0.072), ('rank', 0.068), ('percentage', 0.068), ('lth', 0.067), ('approximations', 0.066), ('tresp', 0.066), ('program', 0.064), ('latent', 0.064), ('processes', 0.063), ('vr', 0.062), ('lot', 0.062), ('christopher', 0.06), ('parametrization', 0.059), ('band', 0.059), ('edinburgh', 0.059), ('semiparametric', 0.059), ('chai', 0.056), ('coregionalization', 0.056), ('edwin', 0.056), ('meals', 0.056), ('wrt', 0.056), ('cholesky', 0.053), ('marginal', 0.052), ('correlations', 0.05), ('kronecker', 0.05), ('diagonal', 0.05), ('prediction', 0.05), ('eligible', 0.049), ('shipeng', 0.049), ('anton', 0.049), ('bonilla', 0.049), ('yil', 0.049), ('predictions', 0.048), ('em', 0.046), ('student', 0.046), ('score', 0.046), ('bene', 0.045), ('bakker', 0.044), ('schools', 0.044), ('ppca', 0.044), ('kp', 0.044), ('rasmussen', 0.044), ('explained', 0.043), ('yn', 0.042), ('execution', 0.042), ('transformations', 0.042), ('carl', 0.041), ('psd', 0.041), ('gender', 0.041), ('belonging', 0.04), ('design', 0.04), ('ts', 0.04), ('multitask', 0.039), ('hyperparameters', 0.037), ('ki', 0.036), ('transformation', 0.036), ('locations', 0.034), ('categorical', 0.034), ('absolute', 0.034), ('property', 0.034), ('brings', 0.033), ('contrasts', 0.033), ('kernel', 0.032), ('speed', 0.032), ('inducing', 0.032), ('records', 0.032), ('learnt', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

2 0.28735092 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

Author: M. M. Mahmud, Sylvian Ray

Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1

3 0.27936146 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

4 0.16811332 170 nips-2007-Robust Regression with Twinned Gaussian Processes

Author: Andrew Naish-guzman, Sean Holden

Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].

5 0.15219705 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

6 0.14357729 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

7 0.14318359 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

8 0.12713923 195 nips-2007-The Generalized FITC Approximation

9 0.12222077 32 nips-2007-Bayesian Co-Training

10 0.11367887 134 nips-2007-Multi-Task Learning via Conic Programming

11 0.11102989 156 nips-2007-Predictive Matrix-Variate t Models

12 0.089931972 97 nips-2007-Hidden Common Cause Relations in Relational Learning

13 0.080513604 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

14 0.078786612 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

15 0.072217077 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior

16 0.07097061 161 nips-2007-Random Projections for Manifold Learning

17 0.068908289 175 nips-2007-Semi-Supervised Multitask Learning

18 0.068810321 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

19 0.067813247 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

20 0.065165319 213 nips-2007-Variational Inference for Diffusion Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.251), (1, 0.103), (2, -0.038), (3, 0.083), (4, -0.049), (5, 0.035), (6, -0.288), (7, -0.19), (8, -0.084), (9, -0.039), (10, -0.167), (11, -0.115), (12, -0.028), (13, 0.085), (14, -0.129), (15, -0.113), (16, -0.042), (17, -0.104), (18, 0.097), (19, -0.075), (20, -0.153), (21, 0.179), (22, -0.004), (23, -0.018), (24, 0.041), (25, 0.095), (26, 0.078), (27, -0.012), (28, -0.043), (29, 0.02), (30, -0.133), (31, 0.027), (32, -0.114), (33, -0.147), (34, -0.027), (35, 0.125), (36, 0.002), (37, -0.161), (38, 0.021), (39, 0.005), (40, 0.042), (41, -0.075), (42, -0.077), (43, -0.091), (44, -0.006), (45, 0.039), (46, 0.034), (47, 0.07), (48, -0.061), (49, -0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95455658 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

2 0.81408399 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

Author: M. M. Mahmud, Sylvian Ray

Abstract: In transfer learning we aim to solve new problems using fewer examples using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However it is not yet clear how to define relatedness between tasks. This is considered as a major problem as it is conceptually troubling and it makes it unclear how much information to transfer and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the ‘right’ amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other reasonable transfer method can do much better than our Kolmogorov Complexity theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository. 1

3 0.67056286 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

4 0.55958933 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

Author: Andreas Argyriou, Massimiliano Pontil, Yiming Ying, Charles A. Micchelli

Abstract: Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with Lp matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance. 1

5 0.5521155 195 nips-2007-The Generalized FITC Approximation

Author: Andrew Naish-guzman, Sean Holden

Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(N M 2 ) training complexity—asymptotically the same as related sparse methods such as the informative vector machine [2], but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.

6 0.50280184 170 nips-2007-Robust Regression with Twinned Gaussian Processes

7 0.48725125 156 nips-2007-Predictive Matrix-Variate t Models

8 0.45210207 134 nips-2007-Multi-Task Learning via Conic Programming

9 0.45072225 97 nips-2007-Hidden Common Cause Relations in Relational Learning

10 0.43965915 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

11 0.40911058 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

12 0.37783846 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

13 0.37131357 32 nips-2007-Bayesian Co-Training

14 0.35356984 96 nips-2007-Heterogeneous Component Analysis

15 0.3492299 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

16 0.33285856 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

17 0.33135507 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

18 0.32671222 175 nips-2007-Semi-Supervised Multitask Learning

19 0.31957614 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

20 0.30976531 18 nips-2007-A probabilistic model for generating realistic lip movements from speech


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.028), (13, 0.019), (16, 0.058), (18, 0.012), (21, 0.074), (31, 0.014), (34, 0.015), (35, 0.047), (47, 0.068), (49, 0.013), (83, 0.082), (85, 0.435), (87, 0.012), (90, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86235315 197 nips-2007-The Infinite Markov Model

Author: Daichi Mochihashi, Eiichiro Sumita

Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has a comparative performance with an exponentially large full-order model, while computationally much efficient in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1

2 0.79997134 77 nips-2007-Efficient Inference for Distributions on Permutations

Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas

Abstract: Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting. 1

same-paper 3 0.79956436 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

4 0.72344685 112 nips-2007-Learning Monotonic Transformations for Classification

Author: Andrew Howard, Tony Jebara

Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1

5 0.49671859 141 nips-2007-New Outer Bounds on the Marginal Polytope

Author: David Sontag, Tommi S. Jaakkola

Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1

6 0.44653261 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation

7 0.4410181 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

8 0.42882484 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

9 0.42648664 97 nips-2007-Hidden Common Cause Relations in Relational Learning

10 0.42579868 134 nips-2007-Multi-Task Learning via Conic Programming

11 0.42128077 63 nips-2007-Convex Relaxations of Latent Variable Training

12 0.41824111 195 nips-2007-The Generalized FITC Approximation

13 0.41624704 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

14 0.41607681 156 nips-2007-Predictive Matrix-Variate t Models

15 0.41459036 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

16 0.41383567 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

17 0.40967366 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

18 0.40814358 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

19 0.40599632 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling

20 0.40572312 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models