jmlr jmlr2012 jmlr2012-1 knowledge-graph by maker-knowledge-mining

1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach


Source: pdf

Author: Grigorios Skolidis, Guido Sanguinetti

Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 On the other hand, in the complex world that we live in, we are usually faced with unseen but similar problems, situations which human intelligence handles by adaptively making decisions on the new tasks using knowledge from similar tasks. [sent-14, score-0.463]

2 For example, the situation where labels are available for all tasks is tackled by multi-task learning, which synergistically solves the learning problem in all tasks simultaneously (Caruana, 1997; Bakker and Heskes, 2003; Ando and Zhang, 2005). [sent-19, score-0.808]

3 In particular, it is generally assumed that at least the input data for the target task will be available during the learning, so that a measure of similarity between the training and target tasks can be estimated. [sent-33, score-0.694]

4 The question that we wish to raise in this work is whether the notion of generalisation can be extended to the level of tasks as a form of meta-generalisation. [sent-34, score-0.415]

5 Meta-generalisation is a concept introduced in Baxter (2000), where the author asks whether a transfer learning algorithm can generalise well on totally unseen tasks after seeing sufficiently many source (or training) tasks. [sent-35, score-0.622]

6 In his work Baxter (2000) derives bounds on the generalisation error of this problem in terms of a generalised VC-dimension parameter, and notes that the number of source tasks and of examples per task has to be sufficiently large to ensure good performance on novel tasks. [sent-38, score-1.015]

7 One way to approach meta-generalising is through domain adaptation, by training a model on the data of the source and the target set of tasks (Ben-David et al. [sent-41, score-0.593]

8 This type of approach, as well as the model proposed in Baxter (2000), is essentially trained in a transductive way, as the algorithm is able to make predictions only on the test tasks that it is trained on, or needs to be retrained in case a new task arrives. [sent-43, score-0.553]

9 The problem of sampling the space of tasks to make predictions on totally unseen tasks in the inductive setting, which is the exact analog of generalising at the level of tasks, has to the best of our knowledge not been specifically addressed. [sent-47, score-0.975]

10 Hence, multi-task learning can be seen as an Inductive TL algorithm since input data and labels are available for all the tasks on which we wish to make predictions. [sent-49, score-0.418]

11 On this basis, meta-generalising can be considered as a form of Unsupervised TL, since the learning algorithm does not have any exploitable information about the target tasks during training. [sent-52, score-0.503]

12 The model uses a multi-class Gaussian process for assigning probabilistically unseen tasks to source tasks (determining task responsibilities), and then uses a multi-task Gaussian process (Bonilla et al. [sent-56, score-1.063]

13 In a meta-generalising scenario the learner is provided with a set of source tasks TS = {T1^s, ..., TM^s}, [sent-65, score-0.520]

14 which are used for training the model; testing is then performed on a set of target tasks TT = {T1^t, ..., TH^t}. [sent-68, score-0.503]

15 Each of the M source tasks will contain a training set of input/output pairs (x, y), while data from any of the H target tasks are hidden. [sent-72, score-0.983]

16 For later convenience, we will define the whole training set across tasks as a set of triples T^s = {x_i^s, y_i^st, y_i^sx}_{i=1}^{N^s}, where x_i^s ∈ R^d is the input feature vector, y_i^sx ∈ {−1, +1} are the class labels, and y_i^st ∈ {1, ..., M} [sent-73, score-0.678]

17 is the source task label indicating to which task the input/output pair pertains; N^s = ∑_{j=1}^{M} n_j^s is the total number of training pairs, where n_j^s is the number of data points from the jth source task. [sent-76, score-0.497]

18 Moreover, we will write X_j^s = {x_ij^s}_{i=1}^{n_j^s} to denote the total item set of the jth source task, while y_j^sx = {y_ij^sx}_{i=1}^{n_j^s} and y_j^st = {y_ij^st}_{i=1}^{n_j^s} will be used to denote all class and task labels from the jth source task. [sent-77, score-0.521]

19 Each of the H target tasks T_j^t will consist of a set X_j^t = {x_ij^t}_{i=1}^{n_j^t} of input points, where n_j^t is the number of data points from the jth target task and both types of labels are missing. [sent-79, score-0.758]

20 For reasons that will become clear later on, it is further assumed that for each target task data point x_j^t there is information that it comes from the jth target task, but there is no knowledge of which of the source tasks it is most similar to. [sent-81, score-0.777]
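
To make the data layout above concrete, here is a minimal sketch (not the authors' code) of how the source-task training set of triples (input, class label, task label) could be stacked; the helper name and the toy numbers are illustrative assumptions only.

```python
# Hedged sketch of the source-task data layout: each source task j contributes
# inputs X_j^s with class labels y^sx in {-1,+1} and a task label y^st = j;
# target tasks would provide only inputs, with both label types hidden.
import numpy as np

def stack_source_tasks(source_tasks):
    """source_tasks: list of (X_j, y_class_j) pairs, X_j of shape (n_j, d)."""
    X = np.vstack([Xj for Xj, _ in source_tasks])              # all inputs
    y_class = np.concatenate([yj for _, yj in source_tasks])   # y^sx in {-1,+1}
    y_task = np.concatenate([np.full(len(yj), j)               # y^st in {0,...,M-1}
                             for j, (_, yj) in enumerate(source_tasks)])
    return X, y_class, y_task

# Example: two toy source tasks with 3 points each in R^2.
rng = np.random.default_rng(0)
tasks = [(rng.normal(size=(3, 2)), np.array([-1, 1, 1])) for _ in range(2)]
X, y_class, y_task = stack_source_tasks(tasks)
print(X.shape, y_class, y_task)   # (6, 2) [-1  1  1 -1  1  1] [0 0 0 1 1 1]
```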

21 (2010), we will define the low-error joint prediction between a source and a target task as the error λe between their predictive functions fs and ft respectively, evaluated at the union of the source and the target sets X = X^s ∪ X^t, with N = N^s + N^t. [sent-88, score-0.472]

22 Intuitively, if the error λe is large then there is a disagreement between the labels of the source and target task distributions. [sent-90, score-0.579]
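
As an illustration of this assumption, the following hedged sketch measures the disagreement λe between two stand-in predictive functions on the union of the source and target inputs; the functions and data are invented for the example and are not taken from the paper.

```python
# Sketch of the "low-error joint prediction" check: evaluate the source and
# target predictive functions on the union of both input sets and measure the
# fraction of points on which their predicted labels disagree.
import numpy as np

def joint_prediction_error(fs, ft, X_union):
    return float(np.mean(np.sign(fs(X_union)) != np.sign(ft(X_union))))  # lambda_e

X_union = np.linspace(-3.0, 3.0, 50)[:, None]   # stand-in for X^s union X^t
fs = lambda X: X[:, 0] - 0.1                    # toy source predictive function
ft = lambda X: X[:, 0] + 0.1                    # toy target predictive function
print(joint_prediction_error(fs, ft, X_union))  # small => label distributions agree
```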

23 Definition 1 Given a set of source tasks TS and a set of target tasks TT, meta-generalising is an inductive inference method that aims at making predictions on the set of target tasks by sampling the space of source tasks. [sent-98, score-1.925]

24 We further define two possible scenarios: in the fully observed tasks case, we assume that the similarity of the distribution assumption is perfectly met, so that the data generating distribution of the target task is the same as that of one of the source tasks (but we do not know which one). [sent-99, score-1.113]

25 This assumption is relaxed in the partially observed tasks scenario, where we still assume similarity of the distribution but we do not necessarily have identity. [sent-100, score-0.442]

26 The data of each task are on the base level and the distribution of the tasks is on the meta level. [sent-102, score-0.51]

27 Model the distribution of the data of each task, and the distribution of the source tasks (correlation between tasks). [sent-104, score-0.48]

28 In this work, we are interested in the general case where no reliable task descriptor features are available; we will then learn similarities between tasks through a distribution matching pursuit. [sent-114, score-0.51]

29 In addition, we employ a classifier over the tasks to learn the task labels (which task each data point comes from). [sent-127, score-0.658]

30 This configuration implies that the matrix Kt models the correlations between the vectors f_j, that is, the tasks in the multi-task view, and Kx models the correlations between the elements of each vector f_j. [sent-161, score-0.496]

31 For example, if Kt was fixed to the identity matrix, then all tasks would be independent but they would still share the same hyperparameters of the covariance function. [sent-167, score-0.512]
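
The following short sketch illustrates the Kronecker-structured prior covariance Kt ⊗ Kx discussed above, including the special case where Kt is the identity (independent tasks that still share the Kx hyperparameters); the RBF kernel, its lengthscale, and the toy sizes are assumptions for illustration, not the paper's settings.

```python
# Hedged sketch of the multi-task prior covariance Cov(f) = Kt (x) Kx.
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-0.5 * d2 / lengthscale ** 2)

M, n, d = 3, 5, 2                       # tasks, points, input dimension (toy sizes)
X = np.random.default_rng(1).normal(size=(n, d))
Kx = rbf_kernel(X)                      # data covariance, shared across tasks
Kt = np.eye(M)                          # identity => independent tasks that still
                                        # share the hyperparameters of Kx
K_full = np.kron(Kt, Kx)                # covariance of f = vec([f_1 ... f_M])
print(K_full.shape)                     # (15, 15)
# A learned, non-diagonal Kt would instead couple the vectors f_j,
# transferring information between correlated tasks.
```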

32 The objectives of the model are first to model the dependencies between the tasks, and second to assign unseen tasks to source tasks by finding task similarities. [sent-175, score-1.063]
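
To illustrate how the two objectives combine at prediction time, here is a hedged mixture-of-experts style sketch: task responsibilities from the gating classifier weight the per-task class probabilities from the multi-task classifier. The probability values are made up purely to show the mechanics.

```python
# Illustrative combination of task responsibilities and per-task class predictions.
import numpy as np

responsibilities = np.array([0.7, 0.2, 0.1])        # p(task m | x*), sums to 1
p_class_pos_per_task = np.array([0.9, 0.4, 0.5])    # p(y = +1 | x*, source task m)
p_pos = float(responsibilities @ p_class_pos_per_task)
print(p_pos)                                        # 0.76, the mixed prediction
```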

33 In this subsection we use x, yt, and yx to refer to x^s, y^st, and y^sx to keep the notation light, since in the learning phase only source tasks are involved. [sent-178, score-1.016]

34 , M} and yx ∈ {−1, +1} as the task and class labels respectively. [sent-184, score-0.449]

35 Variables f and g are the two sets of GPs for the multi-task and multi-class classifiers respectively, whereas variables hx and ht denote the auxiliary variables of the two classifiers; (a) graphical representation of the training phase, (b) graphical representation of Meta-generalising. [sent-187, score-0.352]

36 and Chib (1993), we define two sets of auxiliary variables ht = vec(Ht) and hx = vec(Hx), which, as shown later on, enable the multinomial and the binary probit models respectively. [sent-188, score-0.359]

37 The first one is responsible for the classification over the tasks g|X, θx ∼ GP (0, I ⊗ Kx ), where g = vec(G), G = [g1 , . [sent-193, score-0.39]

38 Standard results show that the distributions that maximize the lower bound are given by Q(Θi) = exp(E_{Q(Θ\Θi)}{log p(yt, yx, Θ|X, θt, θx)}) / ∫ exp(E_{Q(Θ\Θi)}{log p(yt, yx, Θ|X, θt, θx)}) dΘi, where Q(Θ \ Θi) denotes the factorized distribution with the ith component removed. [sent-223, score-0.602]
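
As a toy illustration of this mean-field update, the sketch below normalises exp(E[log p]) for a discrete factor (a softmax over the expected log joint); the numbers are invented, and the actual factors in the paper (f, g, ht, hx) have Gaussian and truncated-Gaussian forms rather than discrete ones.

```python
# Toy mechanics of Q(Theta_i) ∝ exp(E_{Q(Theta \ Theta_i)}[log p(...)]).
import numpy as np

expected_log_joint = np.array([-1.2, -0.3, -2.0])   # E[log p] for each value of Theta_i
q_i = np.exp(expected_log_joint - expected_log_joint.max())   # stable exponentiation
q_i /= q_i.sum()                                    # normalise into a distribution
print(q_i)
```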

39 3 Prediction on Novel Tasks While in the previous section we described how to train the model on training data from the source tasks, we now describe how to perform predictions on unseen target tasks. [sent-245, score-0.319]

40 In many cases, a target task consists of a batch of input points, and the simple fact that they all come from the same task contains valuable information about the correlations between the associated outputs. [sent-252, score-0.443]

41 In many multi-task problems it is a usual phenomenon to observe groups of highly correlated tasks (e. [sent-254, score-0.39]

42 b), while other times tasks are correlated but in a more random fashion (e. [sent-257, score-0.39]

43 ..., x_nt^t*}, in the first scenario we treat each data point from the target task individually to infer its task responsibilities, which we will refer to as Point to Point Gating (P2PGat). [sent-267, score-0.394]

44 In the second scenario we wish to combine the information from all nt test points to infer the overall task responsibilities for the target task, which we will refer to as Batch predictions. [sent-269, score-0.358]

45 Additionally, k_j^t and k_jj^t are used to denote the jth column and the jjth element of Kt respectively, k^x_{x,x*} is used to denote the covariance vector between X and x*, and Φ is the probit function. [sent-275, score-0.871]

46 Equations (12) and (13) indicate that inferring the task responsibilities on a set of points depends not only on the correlations between the test points and the training points but also on the correlations between the test points themselves. [sent-284, score-0.58]

47 from the unknown data generating distribution, and approximate it by: p(y^t* = k | X*, X, yt) ≈ [∏_{i=1}^{n*} p(y_i^t* = k | x_i*, X, yt)] / [∑_{m=1}^{M} ∏_{j=1}^{n*} p(y_j^t* = m | x_j*, X, yt)] (14), where p(y_i^t* = k | x_i*, X, yt) are the task responsibilities computed individually for each test point. [sent-298, score-0.652]
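
A numerically stable sketch of the Batch rule in Equation (14) is given below: per-point task responsibilities are multiplied (summed in log space) and renormalised over tasks. The per-point probabilities are placeholders; in the model they would come from the multi-class GP gating network.

```python
# Batch task responsibilities: combine per-point responsibilities over a test batch.
import numpy as np

# p_point[i, m] = p(y^t_i* = m | x_i*, X, yt) for n* = 3 test points and M = 3 tasks
p_point = np.array([[0.6, 0.3, 0.1],
                    [0.7, 0.2, 0.1],
                    [0.5, 0.4, 0.1]])
log_batch = np.log(p_point).sum(axis=0)   # log of the product over test points
log_batch -= log_batch.max()              # avoid numerical underflow
batch_resp = np.exp(log_batch)
batch_resp /= batch_resp.sum()            # renormalise over the M tasks
print(batch_resp)                         # sharper than any single per-point row
```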

48 The fully observed tasks case, considered in Section 4. [sent-311, score-0.442]

49 In this case all available tasks are used in the training phase, but in the testing phase the model has no information about which of the source tasks the target task comes from. [sent-313, score-0.833]

50 In this case the data generating distribution of the target task does not match the distribution of one of the source tasks, so that the set of source tasks is strictly a subset of the set of all tasks. [sent-316, score-0.761]

51 2 gives more insight into the connections between the correlation structure of the tasks and the task prediction mechanism on totally unseen tasks. [sent-319, score-0.615]

52 Intuitively, the success of the model depends strongly on whether the model will be able to infer correctly which of the source tasks the target task actually comes from. [sent-347, score-0.671]

53 Figure 2: Toy data set I distribution; (a) scatter plot and density for the first cluster of tasks (1-3), (b) scatter plot and density for the second cluster of tasks (4-6). [sent-361, score-0.974]

54 Data for the first three tasks are generated from a mixture of two partially overlapping Gaussian distributions, and similarly for the remaining three tasks. [sent-367, score-0.417]

55 Hence, the six tasks cluster in two groups; for each task 600 data points were generated, which were equally divided between the two classes. [sent-368, score-0.578]

56 2 TOY DATA SET II: The second toy data set consists of four tasks which group into two clusters. [sent-395, score-0.484]

57 While the densities are peaked in different locations, without class labels the tasks are almost identical, meaning that the multi-class classifier cannot learn to discriminate between the two tasks. [sent-400, score-0.418]

58 b shows the Hinton diagram of the task covariance matrix, which indicates a more random structure between the tasks, but finds that some tasks are more correlated than others, for example ‘a/g’ with ‘a/o’, and ‘i/j’ with ‘f/t’. [sent-485, score-0.617]

59 It should be noted though, that in this data set the “low-error joint prediction” assumption is partially violated since there is label disagreement between tasks ‘a/g’ and ‘g/y’, where the ‘g’ letter belongs to class “+1” in task ‘a/g’ and to “-1” in task ‘g/y’. [sent-486, score-0.657]

60 Interestingly, the MAP approach is consistently worse than other methods, a situation that will be reversed in the partially observed tasks scenario. [sent-511, score-0.442]

61 b demonstrates that there are correlations between the tasks, but in a more random way. [sent-513, score-0.443]

62 In the fully observed tasks scenario, the space of tasks has been sampled sufficiently (by definition). [sent-525, score-0.832]

63 If the model assumptions are met, the correlation structure of the tasks does not have a strong influence on the predictions, since the Batch mode outperformed the P2PGat gating and MAP estimate in all experiments. [sent-534, score-0.439]

64 As we will see, this will be a crucial difference between the fully and partially observed tasks scenario. [sent-535, score-0.469]

65 These clusters may be evident from the experimental design of the problem (as in the case of the landmine data set discussed below), or may become evident from the training phase on the source tasks, if the learned task covariance matrix exhibits a strong block structure. [sent-540, score-0.48]

66 1 consisting of two clusters of tasks; in this section, training tasks are selected by randomly choosing an equal number of tasks from each cluster. [sent-547, score-0.875]

67 Experimental results are presented for two and four training tasks in Figures 8. [sent-550, score-0.432]

68 Figure 8: Average AP on the unseen tasks of Toy data set I; (a) training on 2 tasks, generalising on 4, (b) training on 4 tasks, generalising on 2. [sent-567, score-1.421]

69 Tasks 1-10 correspond to regions that are relatively highly foliated while tasks 11-19 correspond to regions that are bare earth or desert. [sent-575, score-0.39]

70 The experimental setup suggests the presence of two clusters of tasks corresponding to the two groups of regions. Figure 9: Landmine detection data distribution (number of landmine (+1) and clutter (−1) data points per task). [sent-577, score-0.443]

71 Thus, in this data set training tasks are set by randomly selecting equal number of tasks from the first cluster, tasks 1-10, and from the second cluster, tasks 11-19. [sent-581, score-1.602]

72 a shows the mean AP on the 17, 15, and 11 unseen target tasks for each partition respectively. [sent-587, score-0.534]

73 Figure 10: AP on the 17 unseen tasks of Landmine data set; training on 2 tasks, generalising on 17; (a) AP over 17 tasks, (b) AP over 9 tasks of the first cluster, (c) AP over 8 tasks of the second cluster. [sent-626, score-1.332]

74 Figure 11: Average AP on the 15 unseen tasks of Landmine data set; training on 4 tasks, generalising on 15; (a) Overall AP over 15 tasks, (b) Average AP over 8 tasks of the first cluster, (c) Average AP over 7 tasks of the second cluster. [sent-653, score-1.332]

75 Figure 12: Average AP on the 11 unseen tasks of Landmine data set; training on 8 tasks, generalising on 11; (a) Overall AP over 11 tasks, (b) Average AP over 6 tasks of the first cluster, (c) Average AP over 5 tasks of the second cluster. [sent-683, score-1.332]

76 The results from the fully observed tasks scenario indicate an unclear pattern of correlations between the tasks, as summarised in the task covariance matrix (Figure 7). [sent-689, score-0.732]

77 This allows us to assume that the classes between the tasks will not be anti-correlated, so that at least the low-error joint prediction assumption should approximately hold. [sent-693, score-0.39]

78 Since there are no obvious clusters among tasks, in this set of experiments the training tasks are chosen by randomly selecting some for training and keeping the rest as test tasks. [sent-694, score-0.527]

79 Figure 13 presents the results on the unseen tasks that were obtained by training the CMCMT model with 4 and 5 tasks. [sent-695, score-0.505]

80 Secondly, we observe that the performance in this set of experiments exhibits some interesting patterns as the number of training tasks increases. [sent-698, score-0.432]

81 Specifically, for four training tasks the performance of all methods does not significantly improve as we increase the number of data points per task, and in some cases it even deteriorates, a phenomenon that was also observed for 2 and 3 training tasks but results are omitted for brevity. [sent-699, score-0.889]

82 This indicates that if the space of tasks has not been sampled sufficiently, the model cannot yield good generalisation performance on new tasks, even if the number of training data increases. [sent-700, score-0.457]

83 In contrast, for five training tasks the MAP and P2PGat methods yield a significant improvement of performance as the number of data points increases (levelling off after 200 DPETs). [sent-701, score-0.432]

84 Figure 13: Average AP on the unseen tasks of Arrhythmia data set for different numbers of training tasks; (a) training on 4 tasks, generalising on 3, (b) training on 5 tasks, generalising on 2. [sent-721, score-0.683]

85 4 CHARACTER CLASSIFICATION: For reasons of completeness, we present an analysis of the character classification problem in the partially observed tasks scenario. [sent-730, score-0.477]

86 The fully observed tasks analysis of the character classification problem did not reveal any clusters of tasks. [sent-732, score-0.53]

87 This is borne out by experimental evidence: simulation results with 4, 5, and 6 training tasks, which are omitted for brevity, indicated that increasing the number of tasks and the number of training points per task does not improve the performance in any of the methods. [sent-735, score-0.594]

88 5 OBSERVATIONS: Meta-generalising in a partially observed tasks scenario is an extremely hard problem; nevertheless, we believe there are some interesting points that can be made from the previous experimental analysis. [sent-739, score-0.482]

89 In situations where there are clusters of tasks, even though the model hasn't seen all tasks, the Batch method can still make accurate predictions that reach the performance of the fully observed tasks case. [sent-742, score-0.538]

90 Pragmatically, one could consider whether the training phase of the model has revealed clusters of tasks when deciding which prediction method to apply. [sent-743, score-0.485]

91 While we have not tested our model for very large numbers of training tasks, the results suggest that often a significant improvement in performance can be achieved when the number of training tasks crosses a critical number, indicating a sufficient coverage of the task space. [sent-749, score-0.594]

92 This phenomenon was observed in the Arrhythmia classification problem for 2 and 3 training tasks where the performance of the models remained the same as the number of training samples per task increased. [sent-750, score-0.619]

93 In essence, more training data lead to stronger biases for meta-generalisation in target tasks that are not correlated with any of the training tasks. [sent-751, score-0.545]

94 Conclusions In this paper we presented an investigation on the use of Gaussian Processes for meta-generalisation, that is, predicting on unseen learning tasks by leveraging the information of several, related tasks. [sent-756, score-0.463]

95 Our model attacks the meta-generalisation problem by coupling two GPs, a multi-class classifier that learns task responsibilities, and a multi-task classifier that learns prediction models on individual tasks as well as learning the global correlation structure between training tasks. [sent-757, score-0.552]

96 It is important to remark that our method crucially relies on the ability to learn the covariance matrix of a GP: the fundamental ingredient in this work is the task correlation matrix which captures the correlations between source tasks. [sent-762, score-0.34]

97 This not only has a significant impact on the prediction results, but can reveal the presence of clusters of tasks within the data, hence guiding the choice of the appropriate prediction method (Batch or P2PGat). [sent-763, score-0.443]

98 2 Q(f): Q(f) ∝ exp( E_{Q(hx)} [ ∑_{i=1}^{NM} log p(h_i^x | f_i) + log p(f|X) ] ) ∝ exp( E_{Q(hx)} [ −(1/2) hx^T hx + f^T hx − (1/2) f^T f − (1/2) f^T (Kt ⊗ Kx)^{−1} f ] ) ∝ exp( −(1/2) f^T (I + (Kt ⊗ Kx)^{−1}) f + f^T h̃ + const ). [sent-783, score-0.414]
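
Completing the square in the last line above gives a Gaussian factor with covariance A^{−1} and mean A^{−1} h̃, where A = I + (Kt ⊗ Kx)^{−1} and h̃ is the expectation of the auxiliary variables hx under their factor. The sketch below shows this final step with toy matrices; the sizes and values are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of the Gaussian factor Q(f) implied by the derivation above.
import numpy as np

M, n = 2, 4
rng = np.random.default_rng(2)
Kx = np.eye(n) + 0.5 * np.ones((n, n))          # toy data covariance (positive definite)
Kt = np.array([[1.0, 0.6], [0.6, 1.0]])         # toy task covariance
K = np.kron(Kt, Kx)                             # Kt (x) Kx
A = np.eye(M * n) + np.linalg.inv(K)            # precision of Q(f)
h_bar = rng.normal(size=M * n)                  # stand-in for E[h^x] from the other factor
cov_f = np.linalg.inv(A)                        # posterior covariance of f
mean_f = cov_f @ h_bar                          # posterior mean of f
print(mean_f.shape, cov_f.shape)
```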

99 Figure 14: AUC on the Landmine detection problem; (a) AUC over 17 tasks by training on 2 tasks, (b) AUC over 15 tasks by training on 4 tasks, (c) AUC over 11 tasks by training on 8 tasks. [sent-828, score-1.296]

100 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-839, score-0.39]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kx', 0.395), ('tasks', 0.39), ('yx', 0.301), ('ap', 0.262), ('hx', 0.207), ('dpet', 0.189), ('batchmcappr', 0.172), ('kt', 0.14), ('cmtmc', 0.139), ('anguinetti', 0.123), ('eneralising', 0.123), ('kolidis', 0.123), ('task', 0.12), ('yt', 0.112), ('mtl', 0.107), ('tudy', 0.105), ('ht', 0.103), ('landmine', 0.098), ('eta', 0.095), ('pproach', 0.095), ('toy', 0.094), ('source', 0.09), ('pool', 0.089), ('rocesses', 0.087), ('responsibilities', 0.084), ('batch', 0.079), ('covariance', 0.077), ('gp', 0.074), ('unseen', 0.073), ('ase', 0.073), ('target', 0.071), ('arrhythmia', 0.07), ('aussian', 0.069), ('cluster', 0.068), ('ysx', 0.066), ('girolami', 0.063), ('auc', 0.062), ('skolidis', 0.057), ('yst', 0.057), ('gps', 0.057), ('bonilla', 0.056), ('nm', 0.054), ('tl', 0.053), ('correlations', 0.053), ('clusters', 0.053), ('eq', 0.051), ('map', 0.05), ('gating', 0.049), ('probit', 0.049), ('posterior', 0.047), ('generalising', 0.047), ('hyperparameters', 0.045), ('nt', 0.043), ('predictions', 0.043), ('training', 0.042), ('adaptation', 0.042), ('daum', 0.042), ('rogers', 0.042), ('vancouver', 0.042), ('beats', 0.041), ('scenario', 0.04), ('classi', 0.039), ('heart', 0.037), ('transfer', 0.037), ('generalizing', 0.036), ('character', 0.035), ('pooling', 0.035), ('vec', 0.035), ('jth', 0.035), ('experts', 0.034), ('hinton', 0.033), ('arnold', 0.033), ('arrhythmic', 0.033), ('trace', 0.033), ('xue', 0.033), ('totally', 0.032), ('baxter', 0.032), ('canada', 0.031), ('diagram', 0.03), ('ft', 0.03), ('met', 0.029), ('scatter', 0.029), ('processes', 0.029), ('rasmussen', 0.029), ('labels', 0.028), ('beat', 0.028), ('er', 0.027), ('fully', 0.027), ('partially', 0.027), ('average', 0.026), ('variational', 0.026), ('generalisation', 0.025), ('gm', 0.025), ('observed', 0.025), ('dht', 0.025), ('ktj', 0.025), ('nhi', 0.025), ('nht', 0.025), ('pvc', 0.025), ('sanguinetti', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

Author: Grigorios Skolidis, Guido Sanguinetti

Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts

2 0.094708525 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

Author: Kian Ming A. Chai

Abstract: Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide dataindependent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation

3 0.091963008 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features

Author: George Konidaris, Ilya Scheidwasser, Andrew Barto

Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills

4 0.082664102 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

Author: Philipp Hennig, Christian J. Schuler

Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation

5 0.068244167 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan

Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing

6 0.066843912 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

7 0.061857205 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

8 0.056655314 59 jmlr-2012-Linear Regression With Random Projections

9 0.05473939 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

10 0.051686957 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

11 0.049712248 73 jmlr-2012-Multi-task Regression using Minimal Penalties

12 0.04876795 97 jmlr-2012-Regularization Techniques for Learning with Matrices

13 0.048766226 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

14 0.048061702 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

15 0.047621015 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

16 0.043079648 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

17 0.037830316 84 jmlr-2012-Online Submodular Minimization

18 0.036422726 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

19 0.03576554 91 jmlr-2012-Plug-in Approach to Active Learning

20 0.035727467 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.184), (1, -0.022), (2, 0.104), (3, -0.081), (4, 0.109), (5, -0.02), (6, 0.033), (7, 0.086), (8, -0.191), (9, 0.078), (10, -0.093), (11, 0.049), (12, -0.124), (13, 0.025), (14, -0.109), (15, 0.041), (16, -0.064), (17, 0.084), (18, -0.029), (19, 0.028), (20, -0.024), (21, 0.03), (22, 0.031), (23, -0.048), (24, 0.092), (25, 0.07), (26, -0.15), (27, 0.143), (28, 0.053), (29, 0.317), (30, -0.099), (31, -0.05), (32, -0.007), (33, 0.076), (34, 0.04), (35, 0.166), (36, -0.024), (37, -0.223), (38, -0.056), (39, 0.141), (40, 0.015), (41, 0.037), (42, 0.033), (43, 0.045), (44, -0.042), (45, -0.099), (46, -0.011), (47, -0.233), (48, 0.015), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96161717 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

Author: Grigorios Skolidis, Guido Sanguinetti

Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts

2 0.62291342 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features

Author: George Konidaris, Ilya Scheidwasser, Andrew Barto

Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills

3 0.41404971 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

Author: Carl Brunner, Andreas Fischer, Klaus Luig, Thorsten Thies

Abstract: Pairwise classification is the task to predict whether the examples a, b of a pair (a, b) belong to the same class or to different classes. In particular, interclass generalization problems can be treated in this way. In pairwise classification, the order of the two input examples should not affect the classification result. To achieve this, particular kernels as well as the use of symmetric training sets in the framework of support vector machines were suggested. The paper discusses both approaches in a general way and establishes a strong connection between them. In addition, an efficient implementation is discussed which allows the training of several millions of pairs. The value of these contributions is confirmed by excellent results on the labeled faces in the wild benchmark. Keywords: pairwise support vector machines, interclass generalization, pairwise kernels, large scale problems

4 0.38361889 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

Author: Kian Ming A. Chai

Abstract: Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide dataindependent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment. Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation

5 0.37788564 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

Author: Philipp Hennig, Christian J. Schuler

Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation

6 0.30889338 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

7 0.29390046 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

8 0.2712146 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

9 0.26560032 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

10 0.24515472 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

11 0.24360622 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

12 0.24334282 59 jmlr-2012-Linear Regression With Random Projections

13 0.23435123 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

14 0.23425174 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

15 0.233127 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss

16 0.2304143 73 jmlr-2012-Multi-task Regression using Minimal Penalties

17 0.21869363 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

18 0.21696898 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

19 0.21179959 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

20 0.20527388 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.39), (7, 0.016), (21, 0.027), (26, 0.032), (29, 0.042), (35, 0.027), (49, 0.037), (56, 0.012), (57, 0.017), (69, 0.013), (75, 0.076), (77, 0.023), (79, 0.02), (81, 0.019), (92, 0.073), (96, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.71320111 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa

Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the F IRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression ∗. Also in Microtask, Tampere, Finland. †. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia. ‡. Also in the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, Ljubljana, Slovenia and the Joˇ ef Stefan International Postgraduate School, Ljubljana, Slovenia. z ˇ c 2012 Timo Aho, Bernard Zenko, Saˇo Dˇ eroski and Tapio Elomaa. s z ˇ ˇ

same-paper 2 0.68228632 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

Author: Grigorios Skolidis, Guido Sanguinetti

Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts

3 0.36321759 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan

Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing

4 0.35815328 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao

Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization

5 0.35548282 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir

Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines

6 0.35496417 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

7 0.35488531 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

8 0.35466307 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

9 0.354554 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

10 0.35329533 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

11 0.35292977 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

12 0.35266834 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

13 0.34880117 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

14 0.34700608 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

15 0.34680507 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

16 0.34657598 13 jmlr-2012-Active Learning via Perfect Selective Classification

17 0.34653878 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

18 0.34621423 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

19 0.34597254 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

20 0.34435785 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection