jmlr jmlr2005 jmlr2005-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, University College London, Gower Street, London WC1E, UK. Editor: John Shawe-Taylor. Abstract We study the problem of learning many related tasks simultaneously using kernel methods and regularization. [sent-10, score-0.443]
2 The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. [sent-11, score-0.177]
3 Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. [sent-12, score-0.495]
4 These kernels model relations among the tasks and are derived from a novel form of regularizers. [sent-13, score-0.397]
5 In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. [sent-15, score-0.733]
6 Introduction Past empirical work has shown that, when there are multiple related learning tasks, it is beneficial to learn them simultaneously instead of independently, as is typically done in practice (Bakker and Heskes, 2003; Caruana, 1997; Heskes, 2000; Thrun and Pratt, 1997). [sent-17, score-0.384]
7 Our analysis establishes that the problem of estimating many task functions with regularization can be linked to a single task learning problem provided a family of multi-task kernel functions we define is used. [sent-20, score-0.495]
8 Breiman and Friedman (1998) propose the curds & whey method, where the relations between the various tasks are modeled in a post-processing fashion. [sent-40, score-0.35]
9 In (Baxter, 2000) the notion of the “extended VC dimension” (for a family of hypothesis spaces) is defined and used to derive generalization bounds on the average error of the $T$ tasks learned, which is shown to decrease at best as $\frac{1}{T}$. [sent-44, score-0.339]
10 In (Ben-David and Schuller, 2003) the extended VC dimension was used to derive tighter bounds that hold for each task (not just the average error among tasks as considered in (Baxter, 2000)) in the case that the learning tasks are related in a particular, formally defined way. [sent-46, score-0.779]
11 More recent work considers learning multiple tasks in a semi-supervised setting (Ando and Zhang, 2004) and the problem of feature selection with SVM across the tasks (Jebara, 2004). [sent-47, score-0.649]
12 Finally, a number of approaches for learning multiple tasks are Bayesian, where a probability model capturing the relations between the different tasks is estimated simultaneously with the models’ parameters for each of the individual tasks. [sent-48, score-0.728]
13 In this model, relatedness between the tasks is captured by a Gaussian distribution: the smaller the variance of the Gaussian, the more related the tasks are. [sent-52, score-0.656]
14 Background and Notation In this section, we briefly review the basic setup for single-task learning using regularization in a reproducing kernel Hilbert space (RKHS) $H_K$ with kernel $K$. [sent-65, score-0.297]
15 To this end, a common approach within SLT and regularization theory is to learn $f$ as the minimizer in $H_K$ of the functional $\frac{1}{m}\sum_{j\in\mathbb{N}_m} L(y_j, f(x_j)) + \gamma\|f\|_K^2$ (1), where $\|f\|_K$ is the norm of $f$ in $H_K$. [sent-78, score-0.184]
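To make this setup concrete, the sketch below (an illustration under our own assumptions, not the authors' code) minimizes functional (1) with the square loss: by the representer theorem the minimizer is $f(x) = \sum_j c_j K(x_j, x)$, and for the square loss the coefficients solve $(K + \gamma m I)c = y$. The Gaussian kernel and all constants are arbitrary choices for the example.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Squared Euclidean distances between all pairs of rows.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_single_task(X, y, gamma=0.1, sigma=1.0):
    """Square-loss regularization network: minimizes
    (1/m) sum_j (y_j - f(x_j))^2 + gamma * ||f||_K^2."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    # Representer theorem: f = sum_j c_j K(x_j, .), and the square loss
    # gives the linear system (K + gamma * m * I) c = y.
    return np.linalg.solve(K + gamma * m * np.eye(m), y)

def predict(X_train, c, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_train, sigma) @ c

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
c = fit_single_task(X, y)
print(predict(X, c, X[:5]))
```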
16 2 Multi-Task Learning: Notation For multi-task learning we have $n$ tasks, and corresponding to the $\ell$-th task there are available $m$ examples $\{(x_{i\ell}, y_{i\ell}) : i \in \mathbb{N}_m\}$ sampled from a distribution $P_\ell$ on $X_\ell \times Y$. [sent-95, score-0.469]
17 In this paper we mainly discuss the case that the tasks have a common input space, that is $X_\ell = X$ for all $\ell$, and briefly comment on the more general case in Section 5. [sent-98, score-0.31]
18 Each of these tasks can be learned using images of different size (or different representations). [sent-109, score-0.339]
19 We now turn to the extension of the theory and methods for single-task learning using the regularization based kernel methods briefly reviewed above to the case of multi-task learning. [sent-110, score-0.177]
20 For a certain choice of J (or, equivalently, matrix E), the regularization function (6) learns the tasks independently using the regularization method (1). [sent-117, score-0.552]
21 In particular, for $J(u) = \sum_{\ell\in\mathbb{N}_n}\|u_\ell\|^2$ the function (6) decouples, that is, $R(u) = \frac{1}{n}\sum_{\ell\in\mathbb{N}_n} r_\ell(u_\ell)$ where $r_\ell(u_\ell) = \frac{1}{m}\sum_{j\in\mathbb{N}_m} L(y_{j\ell}, u_\ell\cdot x_{j\ell}) + \gamma\|u_\ell\|^2$, meaning that the task parameters are learned independently. [sent-118, score-0.188]
22 On the other hand, if we choose $J(u) = \sum_{\ell,q\in\mathbb{N}_n}\|u_\ell - u_q\|^2$, we can “force” the task parameters to be close to each other: the task parameters $u_\ell$ are learned jointly by minimizing (6). [sent-119, score-0.419]
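A toy illustration of the two regularizers just mentioned (a sketch only, with made-up parameter vectors): the first choice scores each task on its own, while the second penalizes task parameters that are spread apart and is therefore nearly zero when the tasks are similar.

```python
import numpy as np

def J_independent(U):
    """J(u) = sum_l ||u_l||^2: each task is regularized separately."""
    return sum(float(u @ u) for u in U)

def J_coupled(U):
    """J(u) = sum_{l,q} ||u_l - u_q||^2: penalizes task parameters
    that are far from each other."""
    return sum(float((ul - uq) @ (ul - uq)) for ul in U for uq in U)

rng = np.random.default_rng(0)
U_similar = [np.ones(4) + 0.01 * rng.normal(size=4) for _ in range(3)]
U_diverse = [rng.normal(size=4) for _ in range(3)]
# The coupling regularizer is tiny for similar tasks and large otherwise.
print(J_coupled(U_similar), J_coupled(U_diverse))
print(J_independent(U_similar), J_independent(U_diverse))
```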
23 Note that function (6) depends on dn parameters whose number can be very large if the number of tasks n is large. [sent-120, score-0.387]
24 As we shall see, the input space of this kernel is the original $d$-dimensional space of the data plus an additional dimension which records the task the data point belongs to. [sent-122, score-0.25]
25 We also define the $p \times dn$ feature matrix $B := [B_\ell : \ell \in \mathbb{N}_n]$ formed by concatenating the $n$ matrices $B_\ell$, $\ell \in \mathbb{N}_n$. [sent-125, score-0.176]
26 Moreover, we assume that the feature matrix B is of full rank dn as well. [sent-127, score-0.176]
27 For example, if we choose $B_\ell = B_0$ for every $\ell \in \mathbb{N}_n$, where $B_0$ is a prescribed $p \times d$ matrix, Equation (8) tells us that all tasks are the same task, that is, $f_1 = f_2 = \cdots = f_n$. [sent-129, score-0.346]
28 In particular if p = d and B0 = Id the function (11) (see below) implements a single-task learning problem, as in Equation (2) with all the mn data from the n tasks as if they all come from the same task. [sent-130, score-0.31]
29 (10) We call this kernel a linear multi-task kernel since it is bilinear in $x$ and $t$ for fixed $\ell$ and $q$. [sent-133, score-0.182]
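A minimal sketch of this linear multi-task kernel in code, with randomly generated per-task feature matrices $B_\ell$ standing in for whatever task structure one wants to encode; choosing all of them equal to a common $B_0$ would make every task share the same function, as noted above.

```python
import numpy as np

def linear_multitask_kernel(x, l, t, q, B):
    """K((x, l), (t, q)) = <B_l x, B_q t> = x^T B_l^T B_q t,
    where B is a list of p x d feature matrices, one per task."""
    return float((B[l] @ x) @ (B[q] @ t))

# Example: n = 3 tasks, d = 4 input dimensions, p = 6 features.
rng = np.random.default_rng(0)
n, d, p = 3, 4, 6
B = [rng.normal(size=(p, d)) for _ in range(n)]
x, t = rng.normal(size=d), rng.normal(size=d)
print(linear_multitask_kernel(x, 0, t, 2, B))
```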
30 Using this linear feature map representation, we wish to convert the regularization function (6) to a function of the form (2), namely, $S(w) := \frac{1}{nm}\sum_{\ell\in\mathbb{N}_n}\sum_{j\in\mathbb{N}_m} L(y_{j\ell}, w^\top B_\ell x_{j\ell}) + \gamma\, w^\top w$, $w \in \mathbb{R}^p$. [sent-134, score-0.344]
31 (11) This transformation relates the matrix $E$ defining the homogeneous quadratic function $J(u)$ used in (6) to the feature matrix $B$. [sent-135, score-0.2]
32 Since Equation (9) requires that the feature vector $w$ is common to all vectors $u_\ell$ and those are arbitrary, the feature matrix $B$ must be of full rank $dn$ and, so, the matrix $E$ above is well defined. [sent-141, score-0.275]
33 For example, with p = dn we can find a dn × dn matrix T by using the eigenvalues and eigenvectors of E. [sent-147, score-0.338]
34 As an example, consider the case that $B_\ell$ is a $dn \times d$ matrix all of whose $d \times d$ blocks are zero except for the $\ell$-th block, which is equal to $I_d$. [sent-152, score-0.18]
35 This choice means that we are learning all tasks independently, that is, $J(u) = \sum_{\ell\in\mathbb{N}_n}\|u_\ell\|^2$, and Proposition 1 confirms that $E = I_{dn}$. [sent-153, score-0.342]
36 This observation would also extend to the circumstance where there are arbitrary linear relations amongst the task functions. [sent-156, score-0.199]
37 Indeed, we can impose such linear relations on the features directly to achieve this relation amongst the task functions. [sent-157, score-0.199]
38 Since functional (11) is like a single-task regularization functional (2), by the representer theorem—see Equation (5)—its minimizer has the form $w^* = \sum_{j\in\mathbb{N}_m}\sum_{\ell\in\mathbb{N}_n} c_{j\ell}\, B_\ell x_{j\ell}$. [sent-164, score-0.38]
39 This implies that the optimal task functions are $f^*_q(x) = \sum_{j\in\mathbb{N}_m}\sum_{\ell\in\mathbb{N}_n} c_{j\ell}\, K((x_{j\ell}, \ell), (x, q))$, $x \in \mathbb{R}^d$, $q \in \mathbb{N}_n$ (14), where the kernel is defined in Equation (10). [sent-165, score-0.25]
40 Having defined the kernel in (10), we can now use standard single-task learning methods to learn multiple tasks simultaneously (we only need to define the appropriate kernel for the input data $(x, \ell)$). [sent-167, score-0.566]
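In code, this reduction amounts to building the Gram matrix over the augmented inputs $(x, \ell)$ and handing it to any off-the-shelf kernel method. The sketch below is one possible realization, not the authors' implementation: it uses the simple kernel of Equation (22) with $Q = I$ and scikit-learn's precomputed-kernel SVM, and it glosses over how the paper's regularization parameter $\gamma$ maps onto the SVM's C parameter.

```python
import numpy as np
from sklearn.svm import SVC

def multitask_gram(X, tasks, lam, n_tasks):
    """Gram matrix for K((x, l), (t, q)) = (1 - lam + lam * n * [l == q]) x.t"""
    lin = X @ X.T
    same = (tasks[:, None] == tasks[None, :]).astype(float)
    return (1.0 - lam + lam * n_tasks * same) * lin

# Synthetic data: related binary tasks sharing a common direction w0.
rng = np.random.default_rng(0)
n_tasks, m, d = 5, 20, 10
w0 = rng.normal(size=d)
X, y, tasks = [], [], []
for l in range(n_tasks):
    wl = w0 + 0.3 * rng.normal(size=d)      # task-specific perturbation
    Xl = rng.normal(size=(m, d))
    X.append(Xl)
    y.append(np.where(Xl @ wl >= 0, 1, -1))
    tasks.append(np.full(m, l))
X, y, tasks = np.vstack(X), np.concatenate(y), np.concatenate(tasks)

K = multitask_gram(X, tasks, lam=0.5, n_tasks=n_tasks)
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy:", clf.score(K, y))
```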
41 In this case the parameters $c_{j\ell}$ in Equation (14) are obtained by solving the system of linear equations $\sum_{q\in\mathbb{N}_n}\sum_{j\in\mathbb{N}_m} K((x_{jq}, q), (x_{i\ell}, \ell))\, c_{jq} = y_{i\ell}$, $i \in \mathbb{N}_m$, $\ell \in \mathbb{N}_n$. [sent-170, score-0.24]
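For the square loss this is a single dense linear solve over the multi-task Gram matrix. The sketch below reuses the Gram-matrix helper from the previous snippet and assumes the standard regularization-network form $(K + \gamma N I)c = y$ with $N = nm$, mirroring the single-task case; the exact scaling of the regularization term in the paper may differ.

```python
import numpy as np

def fit_multitask_rn(K, y, gamma):
    """Regularization network over all N = nm multi-task examples.
    Assumes the square-loss solution (K + gamma * N * I) c = y."""
    N = K.shape[0]
    return np.linalg.solve(K + gamma * N * np.eye(N), y)

def predict_point(x, q, X, tasks, c, lam, n_tasks):
    """Prediction for a new point x of task q, as in Equation (14),
    using the same linear multi-task kernel with Q = I."""
    k = (1.0 - lam + lam * n_tasks * (tasks == q)) * (X @ x)
    return float(k @ c)

# Usage sketch, continuing from the previous snippet:
# c = fit_multitask_rn(K, y, gamma=0.1)
# print(predict_point(X[0], int(tasks[0]), X, tasks, c, 0.5, n_tasks))
```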
42 The coefficients solve the quadratic program $\max_{c}\ \sum_{\ell\in\mathbb{N}_n}\sum_{i\in\mathbb{N}_m} c_{i\ell} - \frac{1}{2}\sum_{\ell,q\in\mathbb{N}_n}\sum_{i,j\in\mathbb{N}_m} c_{i\ell}\, y_{i\ell}\, c_{jq}\, y_{jq}\, K((x_{i\ell}, \ell), (x_{jq}, q))$ subject, for all $i \in \mathbb{N}_m$ and $\ell \in \mathbb{N}_n$, to the constraints that each $c_{i\ell}$ lies between $0$ and an upper bound determined by $\gamma$, $n$ and $m$. [sent-176, score-0.45]
43 These cases arise from different choices of matrices B that we used above to model task relatedness or, equivalently, by directly choosing the function J in Equation (6). [sent-180, score-0.195]
44 (20) Indeed, $J$ can be written as $(u, Eu)$ where $E$ is the $n \times n$ block matrix whose $\ell, q$ block is the $d \times d$ matrix $G_{\ell q} I_d$, and the result follows. [sent-183, score-0.206]
45 If λ is small the task parameters are related (close to their average), whereas if λ = 1 the tasks are learned independently. [sent-200, score-0.498]
46 Moreover, if we replace the identity matrix $I_d$ in Equation (21) by any $d \times d$ matrix $A$ we obtain the kernel $K((x, \ell), (t, q)) = (1 - \lambda + \lambda n \delta_{\ell q})\, x^\top Q\, t$, $\ell, q \in \mathbb{N}_n$, $x, t \in \mathbb{R}^d$, where $Q = A^\top A$. [sent-206, score-0.275]
47 2 TASK CLUSTERING REGULARIZATION The regularizer in Equation (24) implements the idea that the task parameters $u_\ell$ are all related to each other in the sense that each $u_\ell$ is close to an “average parameter” $u_0$. [sent-210, score-0.255]
48 In particular, if $\rho_{h\ell} = \delta_{h k(\ell)}$ with $k(\ell)$ the cluster task $\ell$ belongs to, the matrix $G$ is invertible and takes the simple form $G^{-1}_{\ell q} = \delta_{\ell q} + \frac{1}{\rho}\,\theta_{\ell q}$ (27), where $\theta_{\ell q} = 1$ if tasks $\ell$ and $q$ belong to the same cluster and zero otherwise. [sent-217, score-0.539]
49 In particular, if $c = 1$ and we set $\rho = \frac{1-\lambda}{\lambda n}$, the kernel $K((x, \ell), (t, q)) = (\delta_{\ell q} + \rho)\, x^\top t$ is the same (modulo a constant) as the kernel in Equation (22). [sent-218, score-0.182]
50 3 GRAPH REGULARIZATION In our third example we choose an $n \times n$ symmetric matrix $A$ all of whose entries are in the unit interval, and consider the regularizer $J(u) := \frac{1}{2}\sum_{\ell,q\in\mathbb{N}_n} \|u_\ell - u_q\|^2 A_{\ell q} = \sum_{\ell,q\in\mathbb{N}_n} u_\ell^\top u_q\, L_{\ell q}$ (28), where $L = D - A$ with $D_{\ell q} = \delta_{\ell q}\sum_{h\in\mathbb{N}_n} A_{\ell h}$. [sent-221, score-0.31]
51 The equation $A_{\ell q} = 0$ means that tasks $\ell$ and $q$ are not related, whereas $A_{\ell q} = 1$ means a strong relation. [sent-223, score-0.352]
52 Thus, we have that $L_{\ell q} = \sum_{k\in\mathbb{N}_n} \sigma_k\, v_{k\ell}\, v_{kq}$ (29), where the matrix $V = (v_{k\ell})$ is orthogonal, $\sigma_1 = \cdots = \sigma_r < \sigma_{r+1} \le \cdots \le \sigma_n$ are the eigenvalues of $L$, and $r \ge 1$ is the multiplicity of the zero eigenvalue. [sent-226, score-0.173]
53 We can use this observation to assert that on the space S the regularization function (6) corresponding to the Laplacian has a unique minimum and it is given in the form of a representer theorem for kernel (30). [sent-234, score-0.213]
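A quick numerical illustration (a toy sketch with a made-up adjacency matrix) of the identity in Equation (28) and of the eigenstructure of the Laplacian discussed above: the two forms of the regularizer agree, and the multiplicity of the zero eigenvalue equals the number of connected groups of tasks.

```python
import numpy as np

def graph_regularizer(U, A):
    """J(u) = (1/2) * sum_{l,q} A_{lq} ||u_l - u_q||^2."""
    n = len(U)
    return 0.5 * sum(A[l, q] * float((U[l] - U[q]) @ (U[l] - U[q]))
                     for l in range(n) for q in range(n))

def laplacian_quadratic(U, A):
    """Equivalent form sum_{l,q} L_{lq} <u_l, u_q> with L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    G = np.array([[float(ul @ uq) for uq in U] for ul in U])
    return float((L * G).sum())

# Toy graph over 4 tasks: tasks {0,1} and {2,3} are related, the two pairs are not.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
U = [rng.normal(size=3) for _ in range(4)]
print(graph_regularizer(U, A), laplacian_quadratic(U, A))  # the two agree

# The zero eigenvalue of L has multiplicity r = number of connected components (2 here).
L = np.diag(A.sum(axis=1)) - A
print(np.round(np.linalg.eigvalsh(L), 8))
```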
54 We tested two multi-task versions of SVM: a) we considered the simple case that the matrix Q in Equation (25) is the identity matrix, that is, we use the multi-task kernel (22), and b) we estimate the matrix Q in (25) by running PCA on the previously learned task parameters. [sent-237, score-0.487]
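A rough sketch of variant (b) as we read it; the exact way Q is assembled from the principal components is an assumption for illustration, not the paper's specification. The idea: learn each task's linear parameters independently, run PCA on the stacked parameter vectors, and plug the resulting d x d matrix in as Q in the kernel of Equation (25).

```python
import numpy as np

def independent_ridge(X_per_task, y_per_task, gamma=0.1):
    """Step 1: learn each task separately with ridge regression."""
    U = []
    for X, y in zip(X_per_task, y_per_task):
        d = X.shape[1]
        U.append(np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y))
    return np.array(U)                        # shape (n_tasks, d)

def pca_Q(U, n_components=None):
    """Step 2: PCA on the task parameters; build a d x d positive
    semidefinite Q from the leading principal directions (one plausible
    construction, not necessarily the authors' exact one)."""
    Uc = U - U.mean(axis=0)
    _, s, Vt = np.linalg.svd(Uc, full_matrices=False)
    if n_components is not None:
        s, Vt = s[:n_components], Vt[:n_components]
    return Vt.T @ np.diag(s ** 2) @ Vt

def multitask_kernel_Q(x, l, t, q, lam, n_tasks, Q):
    """Step 3: K((x, l), (t, q)) = (1 - lam + lam * n * [l == q]) x^T Q t."""
    return (1.0 - lam + lam * n_tasks * (l == q)) * float(x @ Q @ t)

# Usage sketch:
# U = independent_ridge(X_per_task, y_per_task)
# Q = pca_Q(U)
```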
55 One of the key questions we considered is: how does multi-task learning perform relative to single-task learning as the number of data per task and the number of tasks change? [sent-252, score-0.54]
56 This is often the case, for example, in analyzing customer data for marketing, where we may have data about many customers (tens of thousands) but only a few samples per customer (only tens) (Allenby and Rossi, 1999; Arora, Allenby, and Ginter, 1998). [sent-254, score-0.22]
57 Therefore we have 200 classification tasks and 120 20-dimensional data points for each task – for a total of 24000 data points. [sent-273, score-0.469]
58 To test how multi-task learning compares to single-task learning as the number of data per task and/or the number of tasks changes, we ran experiments with varying numbers of data per task and varying numbers of tasks. [sent-275, score-0.929]
59 In particular, we considered 50, 100, and 200 tasks, splitting the 200 tasks into 4 groups of 50 or 2 groups of 100 (or one group of 200), and then taking the average performance among the 4 groups, the 2 groups (and the 1 group). [sent-276, score-0.31]
60 On the other hand, the multi-task learning regularization parameter γ and the parameter λ in (22) were chosen using a validation set consisting of one (training) data point per task, which we then added back to the training data for the final training after the parameter selection. [sent-280, score-0.316]
61 The parameters λ and γ used when we estimated matrix Q through PCA were the same as when we used the identity matrix as Q. [sent-281, score-0.184]
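A skeleton of that selection protocol (the callables train_fn, error_fn and split_fn, and the candidate grids, are placeholders introduced here for illustration; they are not part of the paper):

```python
from itertools import product

def select_and_train(full_data, train_fn, error_fn, split_fn,
                     lambdas=(0.2, 0.4, 0.6, 0.8), gammas=(0.01, 0.1, 1.0)):
    """Hold out one training point per task as a validation set, pick
    (lambda, gamma) on it, then retrain on all the training data."""
    inner, val = split_fn(full_data)          # one held-out point per task
    best, best_err = None, float("inf")
    for lam, gamma in product(lambdas, gammas):
        err = error_fn(train_fn(inner, lam, gamma), val)
        if err < best_err:
            best, best_err = (lam, gamma), err
    lam, gamma = best
    return train_fn(full_data, lam, gamma)    # validation points included back
```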
62 Notice that the performance of the single-task SVM does not change as the number of tasks increases – as expected. [sent-291, score-0.31]
63 Table 1: Comparison of Methods as the number of data per task and the number of tasks changes. [sent-346, score-0.54]
64 “One SVM” stands for training one SVM with all the data from all the tasks, “Indiv SVM” stands for training each task independently, “Identity” stands for the multi-task SVM with the identity matrix, and “PCA” is the multi-task SVM using the PCA approach. [sent-347, score-0.278]
65 • When there are few data per task (20, 30, or 60), both multi-task SVMs significantly outperform the single-task SVM. [sent-350, score-0.23]
66 • As the number of tasks increases the advantage of multi-task learning increases – for example, for 20 data per task the improvement in performance relative to the single-task SVM grows as we move from 50 to 100 to 200 tasks. [sent-351, score-0.381] [sent-354, score-0.31]
68 • When we have many data per task (90), the simple multi-task SVM does not provide any advantage relative to the single-task SVM. [sent-355, score-0.23]
69 To explore the second point further, we show in Figure 1 the change in performance for the identity-matrix-based multi-task SVM relative to the single-task SVM in the case of 20 data per task. [sent-361, score-0.185]
70 We notice the following: • When there are only a few tasks (for example, less than 20 in this case), multi-task can hurt the performance relative to single-task. [sent-364, score-0.31]
71 Hence our experimental findings indicate that for few tasks one should use either a single-task SVM or a multi-task one with parameter λ selected near 1. [sent-367, score-0.31]
72 • As the number of tasks increases, performance improves – surpassing the performance of the single-task SVM after 20 tasks in this case. [sent-368, score-0.62]
73 As discussed in (Baxter, 1997, 2000; Ben-David, Gehrke, and Schuller, 2002; Ben-David and Schuller, 2003), an important theoretical question is to study the effects of adding additional tasks on the generalization performance (Ben-David, Gehrke, and Schuller, 2002; Ben-David and Schuller, 2003). [sent-369, score-0.31]
74 What our experiments show is that, for few tasks it may be inappropriate to follow a multitask approach if a small λ is used, but as the number of tasks increases performance relative to single-task learning improves. [sent-370, score-0.644]
75 In Figure 2 we plot the test error for the simple multi-task learning method using the identity matrix (kernel (22)) for the case of 20 data per task when there are 200 tasks (third row in Table 4.1), or 10 tasks (for which the single-task SVM outperforms the multi-task SVM for small values of λ). [sent-373, score-0.31] [sent-374, score-0.31]
77 Notice that for the 200 tasks the error drops and then increases, having a flat minimum at intermediate values of λ. [sent-377, score-0.31]
78 Hence, for a few tasks multi-task learning can still help if a large enough λ is used. [sent-384, score-0.31]
79 Figure 1: The horizontal axis is the number of tasks used. [sent-390, score-0.343]
80 We also show the performance of a single-task SVM (dashed line) which, of course, is not changing as the number of tasks increases. [sent-393, score-0.31]
81 Figure 2: The horizontal axis is the parameter λ for the simple multi-task method with the identity matrix kernel (22). [sent-407, score-0.238]
82 There are 200 tasks with 20 training points and 100 test points per task. [sent-409, score-0.381]
83 The goal is to predict the exam scores of the students based on the following inputs: year of the exam, gender, VR band, ethnic group, percentage of students eligible for free school meals in the school, percentage of students in VR band one in the school, gender of the school (i. [sent-432, score-0.42]
84 We set the regularization parameter γ to 1 and used a linear kernel for simplicity. [sent-443, score-0.177]
85 The results show again the advantage of learning all tasks (for all schools) simultaneously instead of learning them one by one. [sent-447, score-0.352]
86 Note, however, that for this data set one SVM for all tasks performs the best, which is also similar to using a small enough λ in the multi-task kernel. [sent-450, score-0.31]
87 This result also indicates that when the tasks are the same task, using the proposed multi-task learning method does not hurt as long as a small enough λ is used. [sent-453, score-0.31]
88 The vector $w$ is computed by minimizing the single-task functional $S(w) := \frac{1}{nm}\sum_{\ell\in\mathbb{N}_n}\sum_{j\in\mathbb{N}_m} L(y_{j\ell}, \langle w, \Phi_\ell(x_{j\ell})\rangle) + \gamma\langle w, w\rangle$, $w \in \mathcal{W}$. [sent-466, score-0.262]
89 (31) By the representer theorem, the minimizer of the functional $S$ has the form in Equation (14), where the multi-task kernel is given by the formula $K((x, \ell), (t, q)) = \langle\Phi_\ell(x), \Phi_q(t)\rangle$, $x, t \in X$, $\ell, q \in \mathbb{N}_n$. [sent-467, score-0.193]
90 We note that for every $\{c_{i\ell} : i \in \mathbb{N}_m, \ell \in \mathbb{N}_n\} \subset \mathbb{R}$ and $\{x_{i\ell} : i \in \mathbb{N}_m, \ell \in \mathbb{N}_n\} \subset X$ we have $\sum_{i,j\in\mathbb{N}_m}\sum_{\ell,q\in\mathbb{N}_n} c_{i\ell}\, c_{jq}\, G(z_\ell(x_{i\ell}), z_q(x_{jq})) = \sum_{i\ell,\, jq} c_{i\ell}\, c_{jq}\, G(\tilde z_{i\ell}, \tilde z_{jq}) \ge 0$, where we have defined $\tilde z_{i\ell} = z_\ell(x_{i\ell})$ and the last inequality follows by the hypothesis that $G$ is a kernel. [sent-476, score-0.66]
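A sketch of this construction with toy nonlinear per-task feature maps (the specific maps are arbitrary choices for illustration); the numerical check at the end mirrors the argument just given: the Gram matrix over any finite set of (input, task) pairs is positive semidefinite because the kernel is an inner product of feature vectors.

```python
import numpy as np

def make_feature_maps(n_tasks, d, p, seed=0):
    """Toy per-task feature maps Phi_l(x) = tanh(B_l x) (placeholders)."""
    rng = np.random.default_rng(seed)
    Bs = [rng.normal(size=(p, d)) for _ in range(n_tasks)]
    return [lambda x, B=B: np.tanh(B @ x) for B in Bs]

def multitask_kernel(phis):
    """K((x, l), (t, q)) = <Phi_l(x), Phi_q(t)>."""
    return lambda x, l, t, q: float(phis[l](x) @ phis[q](t))

phis = make_feature_maps(n_tasks=3, d=4, p=5)
K = multitask_kernel(phis)
rng = np.random.default_rng(1)
pts = [(rng.normal(size=4), int(rng.integers(3))) for _ in range(8)]
G = np.array([[K(x, l, t, q) for (t, q) in pts] for (x, l) in pts])
print(np.linalg.eigvalsh(G).min() >= -1e-9)   # PSD up to round-off
```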
91 Conclusion and Future Work We developed a framework for multi-task learning in the context of regularization in reproducing kernel Hilbert spaces. [sent-491, score-0.206]
92 The framework makes it possible to model relationships between the tasks and to learn the task parameters simultaneously. [sent-493, score-0.501]
93 The experimental results show that the proposed multi-task learning methods can lead to significant performance improvements relative to the single-task learning methods, especially when many tasks with few data each are learned. [sent-499, score-0.31]
94 This kernel is a convex combination of two kernels: the first corresponds to learning the tasks independently, and the second is a rank-one kernel which corresponds to learning all tasks as the same task. [sent-504, score-0.802]
95 A drawback of our proposed multi-task kernel method is that its time complexity is O(p(mn)), which is worse than the complexity of solving n independent kernel methods, namely nO(p(m)). [sent-518, score-0.182]
96 For example, if we use the kernel (22) and the tasks share the same input examples it is possible to show that the linear system of mn Equations (15) can be reduced to solving n + 1 systems of m equations, which is essentially the same as solving n independent ridge regression problems. [sent-523, score-0.459]
97 Other feature selection formulations where the tasks may share only some of their features should also be possible. [sent-526, score-0.339]
98 An interesting problem deserving of investigation is the question of how to learn a set of tasks online where at each instance of time a set of examples for a new task is sampled. [sent-529, score-0.501]
99 A Bayesian/information theoretic model of learning to learn via multiple task sampling. [sent-575, score-0.191]
100 Clustering learning tasks and the selective cross–task transfer of knowledge. [sent-780, score-0.31]
wordName wordTfidf (topN-words)
[('nn', 0.543), ('tasks', 0.31), ('nm', 0.229), ('bakker', 0.215), ('heskes', 0.184), ('task', 0.159), ('pontil', 0.149), ('allenby', 0.143), ('svm', 0.138), ('evgeniou', 0.134), ('schuller', 0.132), ('icchelli', 0.132), ('vgeniou', 0.132), ('micchelli', 0.128), ('jq', 0.12), ('ultiple', 0.117), ('ontil', 0.117), ('regularizer', 0.096), ('ethods', 0.092), ('kernel', 0.091), ('regularization', 0.086), ('ernel', 0.086), ('gehrke', 0.086), ('baxter', 0.081), ('dn', 0.077), ('uq', 0.072), ('arora', 0.072), ('per', 0.071), ('matrix', 0.07), ('hk', 0.07), ('rd', 0.066), ('multi', 0.064), ('students', 0.063), ('customer', 0.06), ('id', 0.059), ('marketing', 0.058), ('ridge', 0.058), ('pca', 0.058), ('earning', 0.058), ('ginter', 0.057), ('rossi', 0.057), ('school', 0.054), ('preferences', 0.048), ('kernels', 0.047), ('identity', 0.044), ('beb', 0.043), ('bqt', 0.043), ('greene', 0.043), ('slt', 0.043), ('pratt', 0.043), ('equation', 0.042), ('simultaneously', 0.042), ('hilbert', 0.04), ('relations', 0.04), ('poggio', 0.038), ('eigenvalues', 0.037), ('vk', 0.037), ('relatedness', 0.036), ('exam', 0.036), ('theodoros', 0.036), ('prescribed', 0.036), ('lanckriet', 0.036), ('representer', 0.036), ('semide', 0.035), ('functional', 0.033), ('axis', 0.033), ('block', 0.033), ('minimizer', 0.033), ('proposition', 0.032), ('econometrics', 0.032), ('nc', 0.032), ('learn', 0.032), ('homogeneous', 0.031), ('th', 0.03), ('ci', 0.03), ('caruana', 0.029), ('seemingly', 0.029), ('feature', 0.029), ('reproducing', 0.029), ('percentage', 0.029), ('band', 0.029), ('boussios', 0.029), ('customers', 0.029), ('elementwise', 0.029), ('indiv', 0.029), ('nid', 0.029), ('vkq', 0.029), ('learned', 0.029), ('trade', 0.027), ('individual', 0.026), ('experimentally', 0.025), ('stands', 0.025), ('unrelated', 0.025), ('laplacian', 0.025), ('thrun', 0.025), ('tested', 0.024), ('ando', 0.024), ('modulo', 0.024), ('multitask', 0.024), ('pseudoinverse', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000017 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
2 0.080052644 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.078908183 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.072858684 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
6 0.066004492 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
7 0.057408981 64 jmlr-2005-Semigroup Kernels on Measures
8 0.056844223 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
9 0.056315556 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
10 0.055889171 11 jmlr-2005-Algorithmic Stability and Meta-Learning
11 0.053126551 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
12 0.047707155 47 jmlr-2005-Learning from Examples as an Inverse Problem
13 0.045728937 67 jmlr-2005-Stability of Randomized Learning Algorithms
14 0.043871768 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
15 0.043864366 36 jmlr-2005-Gaussian Processes for Ordinal Regression
16 0.043760147 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
17 0.04163852 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
18 0.039700232 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
19 0.039578501 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
20 0.036828939 41 jmlr-2005-Kernel Methods for Measuring Independence
topicId topicWeight
[(0, 0.227), (1, 0.114), (2, 0.178), (3, 0.123), (4, 0.159), (5, 0.093), (6, -0.032), (7, -0.021), (8, -0.066), (9, 0.007), (10, 0.056), (11, 0.036), (12, 0.027), (13, 0.068), (14, -0.238), (15, -0.018), (16, -0.048), (17, 0.037), (18, -0.17), (19, 0.263), (20, -0.025), (21, 0.083), (22, 0.17), (23, 0.113), (24, -0.057), (25, -0.216), (26, -0.03), (27, 0.016), (28, -0.084), (29, 0.046), (30, -0.171), (31, -0.077), (32, 0.089), (33, 0.124), (34, -0.078), (35, -0.048), (36, 0.08), (37, 0.0), (38, 0.016), (39, 0.116), (40, -0.056), (41, -0.077), (42, -0.195), (43, 0.028), (44, 0.135), (45, 0.134), (46, 0.359), (47, -0.086), (48, 0.019), (49, 0.117)]
simIndex simValue paperId paperTitle
same-paper 1 0.96318531 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
2 0.32039571 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
3 0.30654228 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
4 0.30002296 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: Standard SVM training has O(m3) time and O(m2) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
5 0.29880124 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
6 0.27515912 41 jmlr-2005-Kernel Methods for Measuring Independence
7 0.26341656 64 jmlr-2005-Semigroup Kernels on Measures
8 0.26179397 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
9 0.23571557 11 jmlr-2005-Algorithmic Stability and Meta-Learning
10 0.22850962 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
11 0.22209179 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
12 0.21762064 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.1997146 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
14 0.19795384 67 jmlr-2005-Stability of Randomized Learning Algorithms
15 0.19645008 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
16 0.18067399 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
17 0.17282121 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
18 0.17173691 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
19 0.16378035 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
20 0.14768589 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
topicId topicWeight
[(13, 0.019), (17, 0.018), (19, 0.028), (21, 0.399), (36, 0.037), (37, 0.044), (42, 0.018), (43, 0.057), (44, 0.019), (47, 0.021), (52, 0.092), (59, 0.026), (70, 0.018), (80, 0.011), (88, 0.09), (90, 0.024), (94, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.7264865 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
2 0.38874319 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.3642858 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.3480328 67 jmlr-2005-Stability of Randomized Learning Algorithms
Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
5 0.34790257 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
6 0.34350818 3 jmlr-2005-A Classification Framework for Anomaly Detection
7 0.34291634 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
8 0.34030616 71 jmlr-2005-Variational Message Passing
9 0.33916402 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.33815318 36 jmlr-2005-Gaussian Processes for Ordinal Regression
11 0.33775267 64 jmlr-2005-Semigroup Kernels on Measures
12 0.3366099 44 jmlr-2005-Learning Module Networks
13 0.33494458 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.33461675 11 jmlr-2005-Algorithmic Stability and Meta-Learning
15 0.33334041 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
16 0.33236861 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
17 0.33200815 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.32891503 20 jmlr-2005-Clustering with Bregman Divergences
19 0.3239066 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
20 0.32310772 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets