jmlr jmlr2008 jmlr2008-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
Reference: text
sentIndex sentText sentNum sentScore
1 The reproducing kernel K associated to HK has its values as operators on Y . [sent-13, score-0.37]
2 We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. [sent-16, score-0.459]
3 Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces 1. [sent-18, score-0.576]
4 , 2006; Devinatz, 1960; Lowitzsh, 2005; Reisert and Burkhardt, 2007; Vazquez and Walter, 2003, and references therein for more information). A multi-task kernel K is the reproducing kernel of a Hilbert space of functions from an input space X which take values in a Hilbert space Y . [sent-37, score-0.661]
5 Specifically, the RKHS is formed by taking the closure of the linear span of kernel sections {K(·, x)y, x ∈ X , y ∈ Y }, relative to the RKHS norm. [sent-42, score-0.353]
6 Here, we are concerned with conditions on the kernel K which ensure that all continuous functions from X to Y can be uniformly approximated on any compact subset of X by the linear span of kernel sections. [sent-44, score-0.702]
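To make this notion of uniform approximation by kernel sections concrete, here is a minimal NumPy sketch (not from the paper): it fits a continuous R²-valued target on [0,1] by sections K(·, x_j)c_j of the matrix-valued kernel K(x,t) = e^{-(x-t)²} I₂. The kernel choice, target function, and centers are illustrative assumptions.

```python
import numpy as np

# Matrix-valued kernel K(x, t) = exp(-(x - t)^2) * I_2 on X = [0, 1], Y = R^2 (assumed choice).
def K(x, t):
    return np.exp(-(x - t) ** 2) * np.eye(2)

# A continuous target function f: [0, 1] -> R^2 (assumed choice).
def f(x):
    return np.array([np.sin(3 * x), np.abs(x - 0.5)])

# Approximant: g(x) = sum_j K(x, x_j) c_j with centers x_j and coefficients c_j in R^2.
centers = np.linspace(0.0, 1.0, 25)
grid = np.linspace(0.0, 1.0, 400)

# Stack the block system: rows indexed by (grid point, output coordinate),
# columns by (center, output coordinate); solve for the coefficients by least squares.
A = np.zeros((2 * len(grid), 2 * len(centers)))
b = np.zeros(2 * len(grid))
for i, x in enumerate(grid):
    b[2 * i:2 * i + 2] = f(x)
    for j, t in enumerate(centers):
        A[2 * i:2 * i + 2, 2 * j:2 * j + 2] = K(x, t)

c, *_ = np.linalg.lstsq(A, b, rcond=None)
error = np.abs(A @ c - b).max()
print(f"max error of the kernel-section approximant on the grid: {error:.4f}")
```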
7 Steinwart uses the expression universal kernel and we follow that terminology here. [sent-46, score-0.513]
8 The problem of identifying universal kernels was also discussed by Poggio et al. [sent-47, score-0.459]
9 The question of identifying universal kernels has a practical basis. [sent-53, score-0.459]
10 Sometimes, such a kernel is called an operator-valued or a matrix-valued kernel if Y is infinite or finite dimensional, respectively. [sent-65, score-0.484]
11 proved universal consistency of this algorithm assuming that the kernel is universal and fulfills the additional condition that the operators K(x, x) have finite trace. [sent-67, score-0.849]
12 The results in these papers imply universal consistency of kernel-based learning algorithms when the considered kernel is universal. [sent-68, score-0.513]
13 One more interesting application of universal kernels is described in Gretton et al. [sent-69, score-0.459]
14 In Section 2, we review the basic definition and properties of multi-task kernels, define the notion of universal kernel and describe some examples. [sent-72, score-0.544]
15 In Section 3, we introduce the notion of feature map associated to a multi-task kernel and show its relevance to the question of universality. [sent-73, score-0.382]
16 The importance of this result is that universality of a kernel can be established directly by considering its features. [sent-75, score-0.435]
17 [Notation table: HK, the reproducing kernel Hilbert space of K; Φ, a mapping from X to L (Y , W ); C (Z , Y ), the space of continuous Y -valued functions on Z ; ι, the isometric mapping from C (Z , Y ) to C (Z × B1 ); CK (Z , Y ), the subset of C (Z , Y ) generated by K, see Eq. (2).] [sent-79, score-0.463]
18 RKHS of Vector-Valued Functions In this section, we review the theory of reproducing kernels for Hilbert spaces of vector-valued functions as in Micchelli and Pontil (2005) and introduce the notion of universal kernels. [sent-83, score-0.553]
19 Throughout this paper, we assume that the kernel K is continuous relative to the operator norm on L (Y ). [sent-98, score-0.404]
20 We also define, for every multi-task kernel K, the subspace of C (Z , Y ) given by CK (Z , Y ) := closure(span{Kx y : x ∈ Z , y ∈ Y }), (2) where the closure is relative to the norm in the space C (Z , Y ). [sent-101, score-0.403]
21 Definition 2 We say that a multi-task kernel K is a universal kernel if, for any compact subset Z of X , CK (Z , Y ) = C (Z , Y ). [sent-102, score-0.874]
22 In order to describe some of the examples of multi-task kernels below, it is useful to first present the following generalization of the Schur product of scalar kernels to matrix-valued kernels. [sent-105, score-0.618]
23 Then, the element-wise product kernel K ◦ G : X × X → R^{n×n} defined, for any x,t ∈ X and p, q ∈ Nn , by (K ◦ G)(x,t)_{pq} := K(x,t)_{pq} G(x,t)_{pq} is an n × n multi-task kernel. [sent-113, score-0.674]
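A small numerical illustration of the element-wise product construction (the particular kernels and matrices below are assumed, not from the paper): two matrix-valued kernels of the form scalar kernel times a PSD matrix are multiplied entrywise, and the resulting block Gram matrix is checked for positive semi-definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 6                       # output dimension n, number of sample points m
X = rng.uniform(size=m)           # sample inputs in R

def psd(dim):
    A = rng.normal(size=(dim, dim))
    return A @ A.T                # random positive semi-definite matrix

B1, B2 = psd(n), psd(n)

# Two matrix-valued kernels of the form scalar kernel times a PSD matrix (assumed examples).
K = lambda x, t: np.exp(-(x - t) ** 2) * B1
G = lambda x, t: (1.0 + x * t) * B2

# Element-wise (Schur) product kernel: (K o G)(x,t)_{pq} = K(x,t)_{pq} * G(x,t)_{pq}.
KG = lambda x, t: K(x, t) * G(x, t)

# Assemble the (m*n) x (m*n) block Gram matrix and check positive semi-definiteness.
gram = np.block([[KG(X[i], X[j]) for j in range(m)] for i in range(m)])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative up to round-off
```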
24 Example 1 If, for every j ∈ Nm, the function G j : X × X → R is a scalar kernel and B j ∈ L+ (Y ), then the function K(x,t) = ∑_{j∈Nm} G j (x,t)B j , ∀x,t ∈ X (5) is a multi-task kernel. [sent-121, score-0.454]
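A hedged NumPy sketch of Example 1 with two terms: the block Gram matrix of K(x,t) = G₁(x,t)B₁ + G₂(x,t)B₂ is a sum of Kronecker products gram(G_j) ⊗ B_j, hence positive semi-definite. The scalar kernels and matrices below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
X = rng.uniform(size=m)

# Two scalar kernels G_1, G_2 and two PSD matrices B_1, B_2 (hypothetical choices).
G1 = lambda x, t: np.exp(-(x - t) ** 2)
G2 = lambda x, t: np.minimum(x, t)            # Brownian-motion kernel on [0, 1]
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
B1, B2 = A1 @ A1.T, A2 @ A2.T

gram1 = np.array([[G1(x, t) for t in X] for x in X])
gram2 = np.array([[G2(x, t) for t in X] for x in X])

# Block Gram matrix of K(x,t) = G1(x,t) B1 + G2(x,t) B2 is a sum of Kronecker products,
# hence positive semi-definite.
gram_K = np.kron(gram1, B1) + np.kron(gram2, B2)
print("min eigenvalue:", np.linalg.eigvalsh(gram_K).min())
```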
25 Thus, the above kernel combines two heterogeneous kernels to form a more flexible one. [sent-138, score-0.43]
26 Example 2 If X0 is a compact Hausdorff space, for p ∈ Nn , Tp is a map from X to X0 (not necessarily linear) and G : X0 × X0 → R is a scalar kernel, then K(x,t) := (G(Tp x, Tq t))_{p,q=1}^n , ∀x,t ∈ X is a matrix-valued kernel on X . [sent-145, score-0.645]
27 It is interesting to note, in passing, that, although one would expect the function K(x,t) := (e^{−σ_{pq} ‖x−t‖²})_{p,q=1}^n , ∀x,t ∈ X (6) to be a kernel over X = Rd , we will show later in Section 5 that this is not true, unless all entries of the matrix σ are the same. [sent-154, score-0.376]
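This failure can be observed numerically. The sketch below (an illustration, not the paper's proof) builds the double-indexed matrix of Equation (6) for two points in R and a symmetric σ with unequal entries; one eigenvalue comes out strictly negative, so the block matrix is not positive semi-definite.

```python
import numpy as np

# sigma matrix with unequal entries (here sigma_11 = sigma_22 = 1, sigma_12 = sigma_21 = 2)
sigma = np.array([[1.0, 2.0],
                  [2.0, 1.0]])
x = np.array([0.0, 1.0])          # two points in R are already enough
n, m = 2, len(x)

# Double-indexed matrix G((i,p),(j,q)) = exp(-sigma_pq * (x_i - x_j)^2)
M = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        for p in range(n):
            for q in range(n):
                M[i * n + p, j * n + q] = np.exp(-sigma[p, q] * (x[i] - x[j]) ** 2)

print("eigenvalues:", np.linalg.eigvalsh(M))   # one eigenvalue is strictly negative
```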
28 We also postpone to that section the proof of the claims in Examples 1-4 as well as the discussion about the universality of the kernels therein. [sent-189, score-0.424]
29 It is well known that universality of kernels is a main hypothesis in the proof of the consistency of kernel-based learning algorithms. [sent-191, score-0.424]
30 Universal Kernels by Features In this section, we prove that a multi-task kernel is universal if and only if its feature representation is universal. [sent-200, score-0.596]
31 A feature representation associated with a multi-task kernel K is a continuous function Φ : X → L (Y , W ) such that, for every x,t ∈ X K(x,t) = Φ∗ (x)Φ(t), (7) where, we recall, for each x ∈ X , Φ∗ (x) is the adjoint of Φ(x) and, therefore, it is in L (W , Y ). [sent-203, score-0.416]
32 Therefore, according to (7) the matrix representation of the multi-task kernel K is, for each x,t ∈ X , (K(x,t))_{pq} = ∑_{r∈Nk} Φ_{rp}(x)Φ_{rq}(t), p, q ∈ Nn . [sent-209, score-0.422]
33 Returning to the general case, we emphasize that we assume that the kernel K has the representation in Equation (7), although if it corresponds to a compact integral operator, such a representation will follow from the spectral theorem and Mercer Theorem (see, e. [sent-210, score-0.561]
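A finite-dimensional sketch of this feature representation (assumed setting W = R^k, Y = R^n, with a hypothetical feature map): Φ(x) is a k × n matrix, K(x,t) = Φ*(x)Φ(t) has entries (K(x,t))_{pq} = ∑_r Φ_{rp}(x)Φ_{rq}(t), and the resulting block Gram matrix is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 2                         # feature space W = R^k, output space Y = R^n (illustrative)

# A hypothetical continuous feature map Phi: R -> R^{k x n}.
def Phi(x):
    base = np.array([1.0, x, np.sin(x), np.cos(x)])          # k scalar features
    mix = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, -1.0], [0.2, 0.3]])
    return base[:, None] * mix                                # shape (k, n)

# Induced matrix-valued kernel K(x,t) = Phi(x)^* Phi(t), entrywise
# K(x,t)_{pq} = sum_r Phi_{rp}(x) Phi_{rq}(t).
K = lambda x, t: Phi(x).T @ Phi(t)

X = rng.uniform(-1, 1, size=5)
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("symmetric:", np.allclose(gram, gram.T))
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())     # nonnegative up to round-off
```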
34 The theorem below demonstrates, as we mentioned above, that the kernel K is universal if and only if its feature representation is universal. [sent-217, score-0.656]
35 As we know, the feature representation of a given kernel is not unique; therefore, we conclude by Theorem 4 that if some feature representation of a multi-task kernel is universal, then every feature representation is universal. [sent-220, score-1.059]
36 Similarly, to any multi-task kernel K on Z we associate a scalar kernel J on Z × B1 defined, for every (x, y), (t, u) ∈ Z × B1 , as J((x, y), (t, u)) := (K(x,t)u, y). [sent-245, score-0.696]
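This reduction to a scalar kernel can be checked numerically. In the sketch below (assumed choices: a kernel of the form G(x,t)B with B positive definite, points x in [0,1] and y in the unit ball of R²), the Gram matrix of J((x,y),(t,u)) = ⟨K(x,t)u, y⟩ is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(3)

# A matrix-valued kernel on R with values in 2x2 matrices (Example 1 type, assumed).
B = np.array([[2.0, 1.0], [1.0, 2.0]])                 # positive definite
K = lambda x, t: np.exp(-(x - t) ** 2) * B

# Scalar kernel on Z x B1 induced by K: J((x, y), (t, u)) = <K(x, t) u, y>.
J = lambda xy, tu: xy[1] @ (K(xy[0], tu[0]) @ tu[1])

# Sample points (x, y) with x in [0, 1] and y in the unit ball of R^2.
pts = []
for _ in range(8):
    y = rng.normal(size=2)
    y /= max(1.0, np.linalg.norm(y))
    pts.append((rng.uniform(), y))

gram = np.array([[J(a, b) for b in pts] for a in pts])
print("min eigenvalue of J's Gram matrix:", np.linalg.eigvalsh(gram).min())
```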
37 Lemma 5 If Z is a compact subset of X and K is a continuous multi-task kernel then ι(C K (Z , Y )) = CJ (Z × B1 ). [sent-250, score-0.418]
38 Lemma 6 If Z is a compact subset of X and K is a continuous multi-task kernel with feature representation Φ then ι(CΦ (Z , Y )) = CΨ (Z × B1 ). [sent-257, score-0.501]
39 Proposition 7 For any compact set Z ⊆ X , and any continuous multi-task kernel K with feature representation Φ, we have that CΨ (Z × B1 ) = CJ (Z × B1 ). [sent-277, score-0.501]
40 (2006) to the scalar kernel J on the set Z × B1 with the feature representation given by Equation (12). [sent-280, score-0.537]
41 Further Perspectives for Universality In this section, we provide an alternate proof of Theorem 4 using the notion of vector measure and also highlight the notion of the annihilator of a set, a useful tool for our examples of multi-task kernels in Section 5. [sent-296, score-0.468]
42 At first glance, the reduction of the question of when a multi-task kernel is universal to the scalar case, as explained in Section 3, seems compelling. [sent-297, score-0.725]
43 Finally, as we demonstrated in Section 3 universality of multi-task kernels concerns density in the subspace CJ (Z × B1 ), not the full space C (Z × B1 ), an issue heretofore not considered. [sent-301, score-0.419]
44 3) that µ is a vector measure if and only if the corresponding variation |µ| is a scalar measure as explained in Section 3. [sent-318, score-0.359]
45 For any scalar measure ν ∈ M (Z × B1 ), we define a Y -valued function µ on B(Z ) by the equation µ(E) := ∫_{E×B1} y dν(x, y), ∀E ∈ B(Z ). [sent-322, score-0.365]
46 Specifically, for any y ∈ Y , we associate to any µ ∈ M (Z , Y ), a scalar measure µy defined, for any E ∈ B(Z ), by the equation µy (E) := (y, µ(E)). [sent-329, score-0.365]
47 It is interesting to remark that, for any µ ∈ M (Z , Y ), we have established in the proof of Lemma 10 that there exists a regular scalar measure ν on Z × B1 such that Lµ f = ∫_{Z×B1} ( f (x), y) dν(x, y). [sent-341, score-0.394]
48 Since we established the isometry between C ∗ (Z , Y ) and M (Z , Y ), it follows that, for every regular vector measure there corresponds a scalar measure on Z × B 1 for which Equation (15) holds true. [sent-342, score-0.383]
49 1 Product of Scalar Kernels and Operators Our first example is produced by coupling a scalar kernel with an operator in L + (Y ). [sent-385, score-0.505]
50 Given a scalar kernel G on X and an operator B ∈ L+ (Y ), we define the function K : X × X → L (Y ) by K(x,t) = G(x,t)B, ∀x,t ∈ X . [sent-386, score-0.505]
51 Our goal below is to use the feature representation of the scalar kernel G to introduce the corresponding one for kernel K. [sent-391, score-0.779]
52 To this end, we first let W be a Hilbert space and φ : X → W a feature map of the scalar kernel G, so that G(x,t) = (φ(x), φ(t))W , ∀x,t ∈ X . [sent-392, score-0.601]
53 The above tensor product suggests that we define the map Φ : X → L (Y , W ⊗ Y ) of kernel K by Φ(x)y := φ(x) ⊗ √B y, ∀x ∈ X , y ∈ Y , and it follows that Φ∗ : X → L (W ⊗ Y , Y ) is given by Φ∗ (x)(w ⊗ y) := (φ(x), w)W √B y, ∀x ∈ X , w ∈ W , and y ∈ Y . [sent-400, score-0.372]
54 Therefore, we conclude that Φ is a feature map for the multi-task kernel K. [sent-402, score-0.406]
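The identity Φ*(x)Φ(t) = G(x,t)B can be verified directly in finite dimensions. In the sketch below (illustrative φ and B, with W = R³ and Y = R²), Φ(x) is represented by the Kronecker product of φ(x) with √B.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 2, 3

# PSD operator B and its square root via eigendecomposition.
A = rng.normal(size=(n, n)); B = A @ A.T
w, V = np.linalg.eigh(B)
sqrtB = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

# Finite-dimensional feature map phi of a scalar kernel G(x,t) = <phi(x), phi(t)> (assumed).
phi = lambda x: np.array([1.0, x, x ** 2])
G = lambda x, t: phi(x) @ phi(t)

# Feature map of K(x,t) = G(x,t) B:  Phi(x) y = phi(x) (tensor) sqrt(B) y,
# represented as the (k*n) x n matrix kron(phi(x), sqrt(B)).
Phi = lambda x: np.kron(phi(x).reshape(-1, 1), sqrtB)

x, t = 0.3, -1.2
lhs = Phi(x).T @ Phi(t)          # Phi*(x) Phi(t)
rhs = G(x, t) * B
print("Phi*(x)Phi(t) == G(x,t) B :", np.allclose(lhs, rhs))
```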
55 We are now ready to present the result on universality of kernel K. [sent-404, score-0.435]
56 Then, K is a multi-task universal kernel if and only if the scalar kernel G is universal and the operator B is positive definite. [sent-406, score-1.317]
57 Proof By Theorem 11 and the feature representation (20), we only need to show that Φ(Z )^⊥ = {0} if and only if G is universal and the operator B is positive definite. [sent-407, score-0.433]
58 To prove the necessity, suppose first that G is not universal and hence, we know that, for some compact subset Z of X , there exists a nonzero scalar measure ν ∈ M (Z ) such that ν ∈ φ(Z )^⊥ , that is, ∫_Z (φ(x), w) dν(x) = 0 for any w ∈ W . [sent-417, score-0.736]
59 This suggests to us to choose a nonzero vector measure µ = y1 ν with some nonzero scalar measure ν. [sent-423, score-0.428]
60 Then, Theorem 12 tells us that the matrix-valued kernel K(x,t) := G(x,t)B is universal if and only if G is universal and the matrix B is of full rank. [sent-427, score-0.784]
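A quick illustration of why full rank matters (assumed setup, not from the paper): when B is rank-deficient, every kernel section K(·, x)y = G(·, x)By takes values in the range of B, so continuous functions with values outside that subspace cannot be uniformly approximated.

```python
import numpy as np

# Rank-deficient B: its range is span{(1, 1)}, so every kernel section
# K(., x) y = G(., x) B y takes values in that one-dimensional subspace,
# and a target such as f(t) = (sin t, -sin t) cannot be uniformly approximated.
B = np.array([[1.0, 1.0],
              [1.0, 1.0]])
G = lambda x, t: np.exp(-(x - t) ** 2)

x, y = 0.7, np.array([2.0, -0.5])
ts = np.linspace(0, 1, 5)
section_values = np.array([G(t, x) * (B @ y) for t in ts])
print(section_values)            # every row is proportional to (1, 1)
```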
61 We now proceed further and consider kernels produced by a finite combination of scalar kernels and operators. [sent-428, score-0.588]
62 Specifically, we consider, for any j ∈ Nm , that G j : X × X → R is a scalar kernel and B j ∈ L+ (Y ). [sent-429, score-0.454]
63 j∈Nm Suppose also, for each scalar kernel G j , that there exists a Hilbert feature space W j and a feature map φ j : X → W j . [sent-431, score-0.664]
64 To explain the associated feature map of kernel K, we need to define its feature space. [sent-432, score-0.388]
65 j∈Nm This observation suggests to us to define the feature space of kernel K by the direct sum Hilbert space W := ⊕ j∈Nm (W j ⊗ Y ), and its the map Φ : X → L (Y , W ), for any x ∈ X and y ∈ Y , by √ √ Φ(x)y := (φ1 (x) ⊗ B1 y, . [sent-450, score-0.427]
66 We are now in a position to state the result about the universality of the kernel K. [sent-463, score-0.435]
67 Theorem 13 Suppose that G j : X × X → R is a continuous scalar universal kernel, and B j ∈ L+ (Y ) for j ∈ Nm . [sent-464, score-0.54]
68 Hence, choosing a nonzero vector measure µ := y0 ν, with ν a nonzero scalar measure, implies that Equation (26) holds true and, thus, the kernel K is not universal. [sent-484, score-0.612]
69 Then this is an RKHS with universal multi-task kernel given, for every x,t ∈ X , by (K(x,t)y)(r) = e^{−π|x−t|} ∫_{Rd} e^{−π‖r−s‖} y(s) ds, ∀y ∈ Y , r ∈ Rd . [sent-519, score-0.513]
70 To prove the universality of this kernel, let Z be any prescribed compact subset of X , and define the Laplace kernel, for any x,t ∈ R, by G(x,t) := e^{−|x−t|} and the operator B : L2 (Rd ) → L2 (Rd ) by Bg(r) := ∫_{Rd} e^{−‖r−s‖} g(s) ds, ∀ r ∈ Rd . [sent-532, score-0.363]
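As a sanity check on the operator B (a discretized sketch with d = 1, not part of the paper's argument), the matrix of the exponential convolution kernel on a grid is positive definite:

```python
import numpy as np

# Discretization of (B g)(r) = int exp(-|r - s|) g(s) ds on R (d = 1 for simplicity):
# B is represented by the matrix exp(-|r_i - r_j|) * dr on a uniform grid.
r = np.linspace(-5, 5, 200)
dr = r[1] - r[0]
Bmat = np.exp(-np.abs(r[:, None] - r[None, :])) * dr
print("min eigenvalue:", np.linalg.eigvalsh(Bmat).min())   # positive: the discretized B is positive definite
```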
71 We investigate this kernel in the case that, for any ω ∈ Ω, G(ω) is a scalar kernel with a feature representation given, for any x,t ∈ X , by the formula G(ω)(x,t) = (φω (x), φω (t))W . [sent-543, score-0.779]
72 By an argument similar to that used just before Theorem 13, we conclude that K is a multi-task kernel and has the feature map Φ with feature space W . [sent-546, score-0.481]
73 Theorem 15 Let p be a measure on Ω and for every ω in the support of p, let G(ω) be a continuous universal kernel and B(ω) a positive definite operator. [sent-548, score-0.656]
74 Then the kernel K(x,t) = ∫_0^∞ e^{−ω‖x−t‖²} B(ω) dp(ω) is a multi-task universal kernel. [sent-559, score-0.513]
75 With this choice of A, the universal kernel in Example 5 becomes (K(x,t))_{ij} = 1/(‖x−t‖² + A_{ij}), ∀i, j ∈ Nn . [sent-570, score-0.513]
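A numeric check of this closed form, under the hedged assumption that B(ω)_{ij} = e^{-ωA_{ij}} and dp(ω) = dω, so that ∫₀^∞ e^{-ω(‖x-t‖²+A_{ij})} dω = 1/(‖x-t‖²+A_{ij}):

```python
import numpy as np

# Hypothetical choice: B(omega)_{ij} = exp(-omega * A_{ij}) with dp(omega) = d omega.
A_ij = 1.5          # a positive entry of A (illustrative value)
d2 = 0.8            # ||x - t||^2 (illustrative value)
omega = np.linspace(0.0, 60.0, 600001)       # truncate the improper integral
integrand = np.exp(-omega * (d2 + A_ij))
step = omega[1] - omega[0]
numeric = np.sum(0.5 * (integrand[1:] + integrand[:-1])) * step   # trapezoid rule
closed_form = 1.0 / (d2 + A_ij)
print(numeric, closed_form)                  # agree to several decimal places
```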
76 2 Transformation Kernels In this subsection we explore matrix-valued kernels produced by transforming scalar kernels. [sent-572, score-0.4]
77 Then, given a continuous scalar kernel G : X × X → R, we consider the matrix-valued kernel on X defined by K(x,t) := (G(Tp x, Tq t))_{p,q=1}^n , ∀x,t ∈ X . [sent-574, score-0.753]
78 (33) Proposition 16 Let G be a scalar kernel and K be defined by (33). [sent-575, score-0.454]
79 Since G is a scalar reproducing kernel on Z , the last term of the above equality is nonnegative, and hence K is a positive semi-definite matrix-valued kernel. [sent-578, score-0.545]
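A numerical check of this positive semi-definiteness under assumed choices (Gaussian scalar kernel G and three nonlinear maps T_p): the block Gram matrix of K(x,t) = (G(T_p x, T_q t))_{p,q} is positive semi-definite because it is the Gram matrix of G at the transformed points T_p(x_i).

```python
import numpy as np

rng = np.random.default_rng(5)

G = lambda x, t: np.exp(-(x - t) ** 2)          # scalar kernel on X = R (assumed)
Ts = [lambda x: x, lambda x: x ** 2, np.sin]    # three (nonlinear) maps T_p: X -> X (assumed)

def K(x, t):
    return np.array([[G(Tp(x), Tq(t)) for Tq in Ts] for Tp in Ts])

X = rng.uniform(-1, 1, size=6)
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative up to round-off
```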
80 To this end, we assume that the scalar kernel G has a feature map φ : X → W and define the mapping Φ(x) : Rn → W , for any y = (y1 , . . . , yn ) ∈ Rn , by Φ(x)y := ∑_{p∈Nn} yp φ(Tp x). [sent-581, score-0.563]
81 Then, for any x,t ∈ X , the kernel K(x,t) = Φ∗ (x)Φ(t) and thus, we conclude that W is the feature space of K and Φ is its feature map. [sent-589, score-0.409]
82 Finally, for any scalar Borel measure ν on X and a continuous map T from X to X , we introduce the image measure ν ◦ T −1 on X defined, for any E ∈ B(X ), by (ν ◦ T −1 )(E) := ν({x ∈ X : T x ∈ E}). [sent-593, score-0.457]
83 We are ready to state the result about universality of the kernel K in Equation (33). [sent-594, score-0.435]
84 Proposition 17 Let G be a scalar universal kernel, Tp : X → X be continuous for each p ∈ Nn and define the kernel K by Equation (33). [sent-595, score-0.782]
85 Before doing so, we recall that, by Lemma 10 and the remark which followed it, for any vector measure µ ∈ M (Z , Rn ), there exists a scalar regular measure ν ∈ M (Z × B1 ) such that ∫_Z dµ(t) = ( ∫_{Z×B1} y1 dν(t, y), . . . , ∫_{Z×B1} yn dν(t, y) ). [sent-598, score-0.409]
86 (2006) that the scalar kernel G is universal on Z if and only if its feature map φ is universal on Z . [sent-610, score-1.105]
87 Then, the matrix-valued function defined by K(x,t) := (e^{−σ_{pq} ‖x−t‖²})_{p,q=1}^n , ∀x,t ∈ X is a multi-task kernel if and only if, for some constant σ, σ_{pq} = σ for any p, q ∈ Nn . [sent-638, score-0.51]
88 Conversely, suppose K is a multi-task kernel which means, for any m ∈ N and xi ∈ X with i ∈ Nm , that the double-indexed nm × nm matrix G((i, p), ( j, q)) = e^{−σ_{pq} ‖xi −xj‖²} , (i, p), ( j, q) ∈ Nm × Nn (35) is positive semi-definite. [sent-640, score-1.18]
89 3 Hessian of Gaussian Kernels In this subsection we consider the universality of the Hessian of scalar Gaussian kernels (Example 3 in Section 2). [sent-650, score-0.671]
90 Then, K is a matrix-valued kernel which is universal if and only if n = 1. [sent-655, score-0.513]
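One common way to realize a Hessian-of-Gaussian construction (an assumption about the exact form used here, not taken from the paper) is K(x,t)_{pq} = ∂²/∂x_p∂t_q e^{-σ‖x-t‖²}. The sketch below builds this matrix-valued kernel for n = 2 and confirms numerically that its block Gram matrix is positive semi-definite, consistent with K being a matrix-valued kernel.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n = 1.0, 2

# Assumed form: K(x,t)_{pq} = d^2/(dx_p dt_q) exp(-sigma ||x - t||^2)
#             = (2 sigma delta_{pq} - 4 sigma^2 (x_p - t_p)(x_q - t_q)) exp(-sigma ||x - t||^2)
def K(x, t):
    d = x - t
    e = np.exp(-sigma * (d @ d))
    return (2 * sigma * np.eye(n) - 4 * sigma ** 2 * np.outer(d, d)) * e

X = rng.normal(size=(5, n))
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative: K is a matrix-valued kernel
```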
91 Hence, by Theorem 11, the kernel K is universal when n = 1. [sent-664, score-0.513]
92 Hence, the kernel K is not universal when n ≥ 2. [sent-667, score-0.513]
93 4 Projection Kernels In the final subsection we introduce a class of multi-task kernels associated with projection operators of scalar kernels. [sent-669, score-0.465]
94 We also need a continuous scalar kernel G : (X × Ω) × (X × Ω) → R with a feature representation given, for any x, x′ ∈ X and t, s ∈ Ω, by G((x,t), (x′ , s)) = (φ(x,t), φ(x′ , s))W . [sent-672, score-0.594]
95 If G is a universal scalar kernel then K is a universal multi-task kernel. [sent-680, score-0.996]
96 By assumption, G is universal on X × Ω and φ is its feature map, and thus we conclude that the scalar measure dµ(x, s) is the zero measure. [sent-687, score-0.633]
97 The first space, CK (Z , Y ), is the closure of the linear span of kernel sections; the second space, CΦ (Z , Y ), is the closure of the linear span of the features associated with the kernel. [sent-693, score-0.464]
98 This result is important in that it allows one to verify the universality of a kernel directly by considering its features. [sent-695, score-0.435]
99 (2006) and the observation that a multi-task kernel can be reduced to a standard scalar-valued kernel on the cross product space Z × Y . [sent-698, score-0.552]
100 Moreover, recalling that Z × B1 is compact if B1 is equipped with the weak topology, by the Riesz representation theorem, for any L̃, there exists a scalar measure ν on Z × B1 such that L̃(F) = ∫_{Z×B1} F(x, y) dν(x, y), ∀F ∈ C (Z × B1 ). [sent-739, score-0.461]
wordName wordTfidf (topN-words)
[('nm', 0.362), ('nn', 0.289), ('universal', 0.271), ('kernel', 0.242), ('micchelli', 0.239), ('scalar', 0.212), ('universality', 0.193), ('kernels', 0.188), ('aponnetto', 0.177), ('icchelli', 0.177), ('ontil', 0.177), ('tq', 0.176), ('niversal', 0.166), ('pq', 0.134), ('compact', 0.119), ('hilbert', 0.112), ('multi', 0.107), ('tp', 0.095), ('equation', 0.095), ('riesz', 0.088), ('borel', 0.085), ('cj', 0.084), ('vanishes', 0.079), ('folland', 0.077), ('uhl', 0.077), ('rkhs', 0.076), ('ing', 0.072), ('map', 0.072), ('rn', 0.07), ('pontil', 0.07), ('closure', 0.069), ('annihilator', 0.066), ('diestel', 0.066), ('operators', 0.065), ('rd', 0.065), ('reproducing', 0.063), ('banach', 0.06), ('theorem', 0.06), ('hausdorff', 0.059), ('caponnetto', 0.059), ('measure', 0.058), ('continuous', 0.057), ('conclude', 0.055), ('regular', 0.055), ('norm', 0.054), ('alternate', 0.051), ('operator', 0.051), ('ck', 0.05), ('nonzero', 0.05), ('topology', 0.05), ('integral', 0.048), ('ds', 0.047), ('plancherel', 0.047), ('vito', 0.047), ('sobolev', 0.046), ('rm', 0.046), ('representation', 0.046), ('proof', 0.043), ('span', 0.042), ('nishes', 0.038), ('signed', 0.038), ('space', 0.038), ('feature', 0.037), ('lemma', 0.036), ('sup', 0.035), ('functional', 0.035), ('hm', 0.034), ('adjoint', 0.034), ('hong', 0.034), ('kong', 0.034), ('bg', 0.033), ('carmeli', 0.033), ('tqt', 0.033), ('vazquez', 0.033), ('yq', 0.033), ('evgeniou', 0.032), ('task', 0.032), ('variation', 0.031), ('notion', 0.031), ('stein', 0.031), ('product', 0.03), ('consequently', 0.029), ('proposition', 0.029), ('kx', 0.029), ('ym', 0.029), ('hessian', 0.028), ('positive', 0.028), ('inner', 0.028), ('cucker', 0.028), ('tensor', 0.028), ('pairwise', 0.028), ('disjoint', 0.027), ('nite', 0.026), ('exists', 0.026), ('xi', 0.026), ('cd', 0.025), ('isometric', 0.025), ('aronszajn', 0.025), ('rp', 0.025), ('lt', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
2 0.16241233 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
3 0.15603915 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
4 0.097476304 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
5 0.087204695 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
6 0.085335955 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
7 0.083812334 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.080480918 86 jmlr-2008-SimpleMKL
9 0.059708212 56 jmlr-2008-Magic Moments for Structured Output Prediction
10 0.054870471 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.050021209 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
12 0.049803484 22 jmlr-2008-Closed Sets for Labeled Data
13 0.04672635 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
14 0.046630394 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.046031576 58 jmlr-2008-Max-margin Classification of Data with Absent Features
16 0.041038547 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
17 0.04042843 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
18 0.039303992 52 jmlr-2008-Learning from Multiple Sources
19 0.038887717 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
20 0.038753986 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
topicId topicWeight
[(0, 0.231), (1, -0.143), (2, 0.085), (3, -0.058), (4, 0.22), (5, -0.176), (6, 0.112), (7, 0.168), (8, 0.081), (9, 0.094), (10, -0.052), (11, -0.068), (12, 0.076), (13, 0.003), (14, 0.004), (15, 0.002), (16, -0.016), (17, -0.04), (18, -0.08), (19, -0.149), (20, 0.057), (21, -0.044), (22, -0.024), (23, -0.077), (24, 0.125), (25, -0.113), (26, -0.033), (27, -0.11), (28, -0.116), (29, 0.071), (30, -0.119), (31, -0.217), (32, -0.016), (33, -0.059), (34, 0.086), (35, 0.201), (36, 0.131), (37, 0.098), (38, 0.028), (39, 0.044), (40, 0.003), (41, 0.196), (42, -0.043), (43, 0.026), (44, 0.02), (45, 0.124), (46, 0.008), (47, 0.129), (48, 0.185), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.96579295 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
2 0.64559323 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
3 0.57488251 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
4 0.56156754 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
5 0.455708 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
6 0.42677915 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
7 0.32871062 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.31520265 86 jmlr-2008-SimpleMKL
9 0.29780868 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.2752952 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
11 0.23667361 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
12 0.23153178 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.20265542 58 jmlr-2008-Max-margin Classification of Data with Absent Features
14 0.20172153 22 jmlr-2008-Closed Sets for Labeled Data
15 0.19881007 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.18804134 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
17 0.18800546 57 jmlr-2008-Manifold Learning: The Price of Normalization
18 0.18667141 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
19 0.18411607 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
20 0.18386169 26 jmlr-2008-Consistency of Trace Norm Minimization
topicId topicWeight
[(0, 0.057), (40, 0.048), (54, 0.033), (58, 0.029), (66, 0.07), (76, 0.018), (78, 0.502), (88, 0.06), (92, 0.029), (94, 0.034), (99, 0.023)]
simIndex simValue paperId paperTitle
1 0.89142019 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
Author: Christian Igel, Verena Heidrich-Meisner, Tobias Glasmachers
Abstract: SHARK is an object-oriented library for the design of adaptive systems. It comprises methods for single- and multi-objective optimization (e.g., evolutionary and gradient-based algorithms) as well as kernel-based methods, neural networks, and other machine learning techniques. Keywords: machine learning software, neural networks, kernel-methods, evolutionary algorithms, optimization, multi-objective-optimization 1. Overview SHARK is a modular C++ library for the design and optimization of adaptive systems. It serves as a toolbox for real world applications and basic research in computational intelligence and machine learning. The library provides methods for single- and multi-objective optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning methods, neural networks, and many other machine learning techniques. Its main design criteria are flexibility and speed. Here we restrict the description of SHARK to its core components, albeit the library contains plenty of additional functionality. Further information can be obtained from the HTML documentation and tutorials. More than 60 illustrative example programs serve as starting points for using SHARK. 2. Basic Tools—Rng, Array, and LinAlg The library provides general auxiliary functions and data structures for the development of machine learning algorithms. The Rng module generates reproducible and platform independent sequences of pseudo random numbers, which can be drawn from 14 predefined discrete and continuous parametric distributions. The Array class provides dynamical array templates of arbitrary type and dimension as well as basic operations acting on these templates. LinAlg implements linear algebra algorithms such as matrix inversion and singular value decomposition. 3. ReClaM—Regression and Classification Methods The goal of the ReClaM module is to provide machine learning algorithms for supervised classification and regression in a unified, modular framework. It is built like a construction kit, where the main building blocks are adaptive data processing models, error functions, and optimization algorithms.
same-paper 2 0.88726467 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
4 0.32327086 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
5 0.31374234 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
6 0.30880696 9 jmlr-2008-Active Learning by Spherical Subdivision
7 0.29912579 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.29170927 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
9 0.29147294 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.28893301 86 jmlr-2008-SimpleMKL
11 0.28639737 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
12 0.28297794 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.27308062 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
14 0.26841679 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.2658557 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.26264423 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
18 0.2585001 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.25789243 56 jmlr-2008-Magic Moments for Structured Output Prediction
20 0.2559967 26 jmlr-2008-Consistency of Trace Norm Minimization