jmlr jmlr2008 jmlr2008-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
Reference: text
sentIndex sentText sentNum sentScore
1 The reproducing kernel K associated to HK has its values as operators on Y . [sent-13, score-0.37]
2 We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. [sent-16, score-0.459]
3 Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces 1. [sent-18, score-0.576]
4 , 2006; Devinatz, 1960; Lowitzsh, 2005; Reisert and Burkhardt, 2007; Vazquez and Walter, 2003, and references therein for more information). A multi-task kernel K is the reproducing kernel of a Hilbert space of functions from an input space X which take values in a Hilbert space Y . [sent-37, score-0.661]
5 Specifically, the RKHS is formed by taking the closure of the linear span of kernel sections {K(·, x)y, x ∈ X , y ∈ Y }, relative to the RKHS norm. [sent-42, score-0.353]
6 Here, we are concerned with conditions on the kernel K which ensure that all continuous functions from X to Y can be uniformly approximated on any compact subset of X by the linear span of kernel sections. [sent-44, score-0.702]
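To make this notion of uniform approximation by kernel sections concrete, here is a minimal NumPy sketch (not from the paper): it fits a continuous R²-valued target on [0,1] by sections K(·, x_j)c_j of the matrix-valued kernel K(x,t) = e^{-(x-t)²} I₂. The kernel choice, target function, and centers are illustrative assumptions.

```python
import numpy as np

# Matrix-valued kernel K(x, t) = exp(-(x - t)^2) * I_2 on X = [0, 1], Y = R^2 (assumed choice).
def K(x, t):
    return np.exp(-(x - t) ** 2) * np.eye(2)

# A continuous target function f: [0, 1] -> R^2 (assumed choice).
def f(x):
    return np.array([np.sin(3 * x), np.abs(x - 0.5)])

# Approximant: g(x) = sum_j K(x, x_j) c_j with centers x_j and coefficients c_j in R^2.
centers = np.linspace(0.0, 1.0, 25)
grid = np.linspace(0.0, 1.0, 400)

# Stack the block system: rows indexed by (grid point, output coordinate),
# columns by (center, output coordinate); solve for the coefficients by least squares.
A = np.zeros((2 * len(grid), 2 * len(centers)))
b = np.zeros(2 * len(grid))
for i, x in enumerate(grid):
    b[2 * i:2 * i + 2] = f(x)
    for j, t in enumerate(centers):
        A[2 * i:2 * i + 2, 2 * j:2 * j + 2] = K(x, t)

c, *_ = np.linalg.lstsq(A, b, rcond=None)
error = np.abs(A @ c - b).max()
print(f"max error of the kernel-section approximant on the grid: {error:.4f}")
```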
7 Steinwart uses the expression universal kernel and we follow that terminology here. [sent-46, score-0.513]
8 The problem of identifying universal kernels was also discussed by Poggio et al. [sent-47, score-0.459]
9 The question of identifying universal kernels has a practical basis. [sent-53, score-0.459]
10 Sometimes, such a kernel is called an operator-valued or a matrix-valued kernel if Y is infinite or finite dimensional, respectively. [sent-65, score-0.484]
11 proved universal consistency of this algorithm assuming that the kernel is universal and fulfills the additional condition that the operators K(x, x) have finite trace. [sent-67, score-0.849]
12 The results in these papers imply universal consistency of kernel-based learning algorithms when the considered kernel is universal. [sent-68, score-0.513]
13 One more interesting application of universal kernels is described in Gretton et al. [sent-69, score-0.459]
14 In Section 2, we review the basic definition and properties of multi-task kernels, define the notion of universal kernel and describe some examples. [sent-72, score-0.544]
15 In Section 3, we introduce the notion of feature map associated to a multi-task kernel and show its relevance to the question of universality. [sent-73, score-0.382]
16 The importance of this result is that universality of a kernel can be established directly by considering its features. [sent-75, score-0.435]
17 [Notation table: HK, the reproducing kernel Hilbert space of K; Φ, a mapping from X to L (Y , W ); C (Z , Y ), the space of continuous Y -valued functions on Z ; ι, the isometric mapping from C (Z , Y ) to C (Z × B1 ); CK (Z , Y ), the subset of C (Z , Y ) generated by K, see Eq. (2).] [sent-79, score-0.463]
18 RKHS of Vector-Valued Functions In this section, we review the theory of reproducing kernels for Hilbert spaces of vector-valued functions as in Micchelli and Pontil (2005) and introduce the notion of universal kernels. [sent-83, score-0.553]
19 Throughout this paper, we assume that the kernel K is continuous relative to the operator norm on L (Y ). [sent-98, score-0.404]
20 We also define, for every multi-task kernel K, the subspace of C (Z , Y ) given by CK (Z , Y ) := closure(span{Kx y : x ∈ Z , y ∈ Y }), (2) where the closure is relative to the norm in the space C (Z , Y ). [sent-101, score-0.403]
21 Definition 2 We say that a multi-task kernel K is a universal kernel if, for any compact subset Z of X , CK (Z , Y ) = C (Z , Y ). [sent-102, score-0.874]
22 In order to describe some of the examples of multi-task kernels below, it is useful to first present the following generalization of the Schur product of scalar kernels to matrix-valued kernels. [sent-105, score-0.618]
23 Then, the element-wise product kernel K ◦ G : X × X → R^{n×n} defined, for any x,t ∈ X and p, q ∈ Nn , by (K ◦ G)(x,t)_{pq} := K(x,t)_{pq} G(x,t)_{pq} is an n × n multi-task kernel. [sent-113, score-0.674]
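A small numerical illustration of the element-wise product construction (the particular kernels and matrices below are assumed, not from the paper): two matrix-valued kernels of the form scalar kernel times a PSD matrix are multiplied entrywise, and the resulting block Gram matrix is checked for positive semi-definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 6                       # output dimension n, number of sample points m
X = rng.uniform(size=m)           # sample inputs in R

def psd(dim):
    A = rng.normal(size=(dim, dim))
    return A @ A.T                # random positive semi-definite matrix

B1, B2 = psd(n), psd(n)

# Two matrix-valued kernels of the form scalar kernel times a PSD matrix (assumed examples).
K = lambda x, t: np.exp(-(x - t) ** 2) * B1
G = lambda x, t: (1.0 + x * t) * B2

# Element-wise (Schur) product kernel: (K o G)(x,t)_{pq} = K(x,t)_{pq} * G(x,t)_{pq}.
KG = lambda x, t: K(x, t) * G(x, t)

# Assemble the (m*n) x (m*n) block Gram matrix and check positive semi-definiteness.
gram = np.block([[KG(X[i], X[j]) for j in range(m)] for i in range(m)])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative up to round-off
```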
24 Example 1 If, for every j ∈ Nm, the function G j : X × X → R is a scalar kernel and B j ∈ L+ (Y ), then the function K(x,t) = ∑_{j∈Nm} G j (x,t)B j , ∀x,t ∈ X (5) is a multi-task kernel. [sent-121, score-0.454]
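A hedged NumPy sketch of Example 1 with two terms: the block Gram matrix of K(x,t) = G₁(x,t)B₁ + G₂(x,t)B₂ is a sum of Kronecker products gram(G_j) ⊗ B_j, hence positive semi-definite. The scalar kernels and matrices below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
X = rng.uniform(size=m)

# Two scalar kernels G_1, G_2 and two PSD matrices B_1, B_2 (hypothetical choices).
G1 = lambda x, t: np.exp(-(x - t) ** 2)
G2 = lambda x, t: np.minimum(x, t)            # Brownian-motion kernel on [0, 1]
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
B1, B2 = A1 @ A1.T, A2 @ A2.T

gram1 = np.array([[G1(x, t) for t in X] for x in X])
gram2 = np.array([[G2(x, t) for t in X] for x in X])

# Block Gram matrix of K(x,t) = G1(x,t) B1 + G2(x,t) B2 is a sum of Kronecker products,
# hence positive semi-definite.
gram_K = np.kron(gram1, B1) + np.kron(gram2, B2)
print("min eigenvalue:", np.linalg.eigvalsh(gram_K).min())
```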
25 Thus, the above kernel combines two heterogeneous kernels to form a more flexible one. [sent-138, score-0.43]
26 Example 2 If X0 is a compact Hausdorff space, for p ∈ Nn , Tp is a map from X to X0 (not necessarily linear) and G : X0 × X0 → R is a scalar kernel, then K(x,t) := (G(Tp x, Tq t))_{p,q=1}^n , ∀x,t ∈ X is a matrix-valued kernel on X . [sent-145, score-0.645]
27 It is interesting to note, in passing, that, although one would expect the function K(x,t) := (e^{−σ_{pq} ‖x−t‖²})_{p,q=1}^n , ∀x,t ∈ X (6) to be a kernel over X = Rd , we will show later in Section 5 that this is not true, unless all entries of the matrix σ are the same. [sent-154, score-0.376]
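This failure can be observed numerically. The sketch below (an illustration, not the paper's proof) builds the double-indexed matrix of Equation (6) for two points in R and a symmetric σ with unequal entries; one eigenvalue comes out strictly negative, so the block matrix is not positive semi-definite.

```python
import numpy as np

# sigma matrix with unequal entries (here sigma_11 = sigma_22 = 1, sigma_12 = sigma_21 = 2)
sigma = np.array([[1.0, 2.0],
                  [2.0, 1.0]])
x = np.array([0.0, 1.0])          # two points in R are already enough
n, m = 2, len(x)

# Double-indexed matrix G((i,p),(j,q)) = exp(-sigma_pq * (x_i - x_j)^2)
M = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        for p in range(n):
            for q in range(n):
                M[i * n + p, j * n + q] = np.exp(-sigma[p, q] * (x[i] - x[j]) ** 2)

print("eigenvalues:", np.linalg.eigvalsh(M))   # one eigenvalue is strictly negative
```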
28 We also postpone to that section the proof of the claims in Examples 1-4 as well as the discussion about the universality of the kernels therein. [sent-189, score-0.424]
29 It is well known that universality of kernels is a main hypothesis in the proof of the consistency of kernel-based learning algorithms. [sent-191, score-0.424]
30 Universal Kernels by Features In this section, we prove that a multi-task kernel is universal if and only if its feature representation is universal. [sent-200, score-0.596]
31 A feature representation associated with a multi-task kernel K is a continuous function Φ : X → L (Y , W ) such that, for every x,t ∈ X K(x,t) = Φ∗ (x)Φ(t), (7) where, we recall, for each x ∈ X , Φ∗ (x) is the adjoint of Φ(x) and, therefore, it is in L (W , Y ). [sent-203, score-0.416]
32 Therefore, according to (7) the matrix representation of the multi-task kernel K is, for each x,t ∈ X , (K(x,t))_{pq} = ∑_{r∈Nk} Φ_{rp}(x)Φ_{rq}(t), p, q ∈ Nn . [sent-209, score-0.422]
33 Returning to the general case, we emphasize that we assume that the kernel K has the representation in Equation (7), although if it corresponds to a compact integral operator, such a representation will follow from the spectral theorem and Mercer Theorem (see, e. [sent-210, score-0.561]
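A finite-dimensional sketch of this feature representation (assumed setting W = R^k, Y = R^n, with a hypothetical feature map): Φ(x) is a k × n matrix, K(x,t) = Φ*(x)Φ(t) has entries (K(x,t))_{pq} = ∑_r Φ_{rp}(x)Φ_{rq}(t), and the resulting block Gram matrix is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 2                         # feature space W = R^k, output space Y = R^n (illustrative)

# A hypothetical continuous feature map Phi: R -> R^{k x n}.
def Phi(x):
    base = np.array([1.0, x, np.sin(x), np.cos(x)])          # k scalar features
    mix = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, -1.0], [0.2, 0.3]])
    return base[:, None] * mix                                # shape (k, n)

# Induced matrix-valued kernel K(x,t) = Phi(x)^* Phi(t), entrywise
# K(x,t)_{pq} = sum_r Phi_{rp}(x) Phi_{rq}(t).
K = lambda x, t: Phi(x).T @ Phi(t)

X = rng.uniform(-1, 1, size=5)
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("symmetric:", np.allclose(gram, gram.T))
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())     # nonnegative up to round-off
```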
34 The theorem below demonstrates, as we mentioned above, that the kernel K is universal if and only if its feature representation is universal. [sent-217, score-0.656]
35 As we know, the feature representation of a given kernel is not unique; therefore, we conclude by Theorem 4 that if some feature representation of a multi-task kernel is universal, then every feature representation is universal. [sent-220, score-1.059]
36 Similarly, to any multi-task kernel K on Z we associate a scalar kernel J on Z × B1 defined, for every (x, y), (t, u) ∈ Z × B1 , as J((x, y), (t, u)) := (K(x,t)u, y). [sent-245, score-0.696]
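This reduction to a scalar kernel can be checked numerically. In the sketch below (assumed choices: a kernel of the form G(x,t)B with B positive definite, points x in [0,1] and y in the unit ball of R²), the Gram matrix of J((x,y),(t,u)) = ⟨K(x,t)u, y⟩ is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(3)

# A matrix-valued kernel on R with values in 2x2 matrices (Example 1 type, assumed).
B = np.array([[2.0, 1.0], [1.0, 2.0]])                 # positive definite
K = lambda x, t: np.exp(-(x - t) ** 2) * B

# Scalar kernel on Z x B1 induced by K: J((x, y), (t, u)) = <K(x, t) u, y>.
J = lambda xy, tu: xy[1] @ (K(xy[0], tu[0]) @ tu[1])

# Sample points (x, y) with x in [0, 1] and y in the unit ball of R^2.
pts = []
for _ in range(8):
    y = rng.normal(size=2)
    y /= max(1.0, np.linalg.norm(y))
    pts.append((rng.uniform(), y))

gram = np.array([[J(a, b) for b in pts] for a in pts])
print("min eigenvalue of J's Gram matrix:", np.linalg.eigvalsh(gram).min())
```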
37 Lemma 5 If Z is a compact subset of X and K is a continuous multi-task kernel then ι(C K (Z , Y )) = CJ (Z × B1 ). [sent-250, score-0.418]
38 Lemma 6 If Z is a compact subset of X and K is a continuous multi-task kernel with feature representation Φ then ι(CΦ (Z , Y )) = CΨ (Z × B1 ). [sent-257, score-0.501]
39 Proposition 7 For any compact set Z ⊆ X , and any continuous multi-task kernel K with feature representation Φ, we have that CΨ (Z × B1 ) = CJ (Z × B1 ). [sent-277, score-0.501]
40 (2006) to the scalar kernel J on the set Z × B1 with the feature representation given by Equation (12). [sent-280, score-0.537]
41 Further Perspectives for Universality In this section, we provide an alternate proof of Theorem 4 using the notion of vector measure and also highlight the notion of the annihilator of a set, a useful tool for our examples of multi-task kernels in Section 5. [sent-296, score-0.468]
42 At first glance, the reduction of the question of when a multi-task kernel is universal to the scalar case, as explained in Section 3, seems compelling. [sent-297, score-0.725]
43 Finally, as we demonstrated in Section 3 universality of multi-task kernels concerns density in the subspace CJ (Z × B1 ), not the full space C (Z × B1 ), an issue heretofore not considered. [sent-301, score-0.419]
44 3) that µ is a vector measure if and only if the corresponding variation |µ| is a scalar measure as explained in Section 3. [sent-318, score-0.359]
45 For any scalar measure ν ∈ M (Z × B1 ), we define a Y -valued function µ on B(Z ) by the equation µ(E) := ∫_{E×B1} y dν(x, y), ∀E ∈ B(Z ). [sent-322, score-0.365]
46 Specifically, for any y ∈ Y , we associate to any µ ∈ M (Z , Y ), a scalar measure µy defined, for any E ∈ B(Z ), by the equation µy (E) := (y, µ(E)). [sent-329, score-0.365]
47 It is interesting to remark that, for any µ ∈ M (Z , Y ), we have established in the proof of Lemma 10 that there exists a regular scalar measure ν on Z × B1 such that Lµ f = ∫_{Z×B1} ( f (x), y) dν(x, y). [sent-341, score-0.394]
48 Since we established the isometry between C ∗ (Z , Y ) and M (Z , Y ), it follows that, for every regular vector measure there corresponds a scalar measure on Z × B 1 for which Equation (15) holds true. [sent-342, score-0.383]
49 1 Product of Scalar Kernels and Operators Our first example is produced by coupling a scalar kernel with an operator in L + (Y ). [sent-385, score-0.505]
50 Given a scalar kernel G on X and an operator B ∈ L+ (Y ), we define the function K : X × X → L (Y ) by K(x,t) = G(x,t)B, ∀x,t ∈ X . [sent-386, score-0.505]
51 Our goal below is to use the feature representation of the scalar kernel G to introduce the corresponding one for kernel K. [sent-391, score-0.779]
52 To this end, we first let W be a Hilbert space and φ : X → W a feature map of the scalar kernel G, so that G(x,t) = (φ(x), φ(t))W , ∀x,t ∈ X . [sent-392, score-0.601]
53 The above tensor product suggests that we define the map Φ : X → L (Y , W ⊗ Y ) of kernel K by Φ(x)y := φ(x) ⊗ √B y, ∀x ∈ X , y ∈ Y , and it follows that Φ∗ : X → L (W ⊗ Y , Y ) is given by Φ∗ (x)(w ⊗ y) := (φ(x), w)W √B y, ∀x ∈ X , w ∈ W , and y ∈ Y . [sent-400, score-0.372]
54 Therefore, we conclude that Φ is a feature map for the multi-task kernel K. [sent-402, score-0.406]
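The identity Φ*(x)Φ(t) = G(x,t)B can be verified directly in finite dimensions. In the sketch below (illustrative φ and B, with W = R³ and Y = R²), Φ(x) is represented by the Kronecker product of φ(x) with √B.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 2, 3

# PSD operator B and its square root via eigendecomposition.
A = rng.normal(size=(n, n)); B = A @ A.T
w, V = np.linalg.eigh(B)
sqrtB = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

# Finite-dimensional feature map phi of a scalar kernel G(x,t) = <phi(x), phi(t)> (assumed).
phi = lambda x: np.array([1.0, x, x ** 2])
G = lambda x, t: phi(x) @ phi(t)

# Feature map of K(x,t) = G(x,t) B:  Phi(x) y = phi(x) (tensor) sqrt(B) y,
# represented as the (k*n) x n matrix kron(phi(x), sqrt(B)).
Phi = lambda x: np.kron(phi(x).reshape(-1, 1), sqrtB)

x, t = 0.3, -1.2
lhs = Phi(x).T @ Phi(t)          # Phi*(x) Phi(t)
rhs = G(x, t) * B
print("Phi*(x)Phi(t) == G(x,t) B :", np.allclose(lhs, rhs))
```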
55 We are now ready to present the result on universality of kernel K. [sent-404, score-0.435]
56 Then, K is a multi-task universal kernel if and only if the scalar kernel G is universal and the operator B is positive definite. [sent-406, score-1.317]
57 Proof By Theorem 11 and the feature representation (20), we only need to show that Φ(Z )^⊥ = {0} if and only if G is universal and the operator B is positive definite. [sent-407, score-0.433]
58 To prove the necessity, suppose first that G is not universal and hence, we know that, for some compact subset Z of X , there exists a nonzero scalar measure ν ∈ M (Z ) such that ν ∈ φ(Z )^⊥ , that is, ∫_Z (φ(x), w) dν(x) = 0 for any w ∈ W . [sent-417, score-0.736]
59 This suggests to us to choose a nonzero vector measure µ = y1 ν with some nonzero scalar measure ν. [sent-423, score-0.428]
60 Then, Theorem 12 tells us that the matrix-valued kernel K(x,t) := G(x,t)B is universal if and only if G is universal and the matrix B is of full rank. [sent-427, score-0.784]
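A quick illustration of why full rank matters (assumed setup, not from the paper): when B is rank-deficient, every kernel section K(·, x)y = G(·, x)By takes values in the range of B, so continuous functions with values outside that subspace cannot be uniformly approximated.

```python
import numpy as np

# Rank-deficient B: its range is span{(1, 1)}, so every kernel section
# K(., x) y = G(., x) B y takes values in that one-dimensional subspace,
# and a target such as f(t) = (sin t, -sin t) cannot be uniformly approximated.
B = np.array([[1.0, 1.0],
              [1.0, 1.0]])
G = lambda x, t: np.exp(-(x - t) ** 2)

x, y = 0.7, np.array([2.0, -0.5])
ts = np.linspace(0, 1, 5)
section_values = np.array([G(t, x) * (B @ y) for t in ts])
print(section_values)            # every row is proportional to (1, 1)
```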
61 We now proceed further and consider kernels produced by a finite combination of scalar kernels and operators. [sent-428, score-0.588]
62 Specifically, we consider, for any j ∈ Nm , that G j : X × X → R is a scalar kernel and B j ∈ L+ (Y ). [sent-429, score-0.454]
63 j∈Nm Suppose also, for each scalar kernel G j , that there exists a Hilbert feature space W j and a feature map φ j : X → W j . [sent-431, score-0.664]
64 To explain the associated feature map of kernel K, we need to define its feature space. [sent-432, score-0.388]
65 j∈Nm This observation suggests to us to define the feature space of kernel K by the direct sum Hilbert space W := ⊕ j∈Nm (W j ⊗ Y ), and its the map Φ : X → L (Y , W ), for any x ∈ X and y ∈ Y , by √ √ Φ(x)y := (φ1 (x) ⊗ B1 y, . [sent-450, score-0.427]
66 We are now in a position to state the result about the universality of the kernel K. [sent-463, score-0.435]
67 Theorem 13 Suppose that G j : X × X → R is a continuous scalar universal kernel, and B j ∈ L+ (Y ) for j ∈ Nm . [sent-464, score-0.54]
68 Hence, choosing a nonzero vector measure µ := y0 ν, with ν a nonzero scalar measure, implies that Equation (26) holds true and, thus, the kernel K is not universal. [sent-484, score-0.612]
69 Then this is an RKHS with universal multi-task kernel given, for every x,t ∈ X , by (K(x,t)y)(r) = e^{−π|x−t|} ∫_{Rd} e^{−π‖r−s‖} y(s) ds, ∀y ∈ Y , r ∈ Rd . [sent-519, score-0.513]
70 To prove the universality of this kernel, let Z be any prescribed compact subset of X , and define the Laplace kernel, for any x,t ∈ R, by G(x,t) := e^{−|x−t|} and the operator B : L2 (Rd ) → L2 (Rd ) by Bg(r) := ∫_{Rd} e^{−‖r−s‖} g(s) ds, ∀ r ∈ Rd . [sent-532, score-0.363]
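As a sanity check on the operator B (a discretized sketch with d = 1, not part of the paper's argument), the matrix of the exponential convolution kernel on a grid is positive definite:

```python
import numpy as np

# Discretization of (B g)(r) = int exp(-|r - s|) g(s) ds on R (d = 1 for simplicity):
# B is represented by the matrix exp(-|r_i - r_j|) * dr on a uniform grid.
r = np.linspace(-5, 5, 200)
dr = r[1] - r[0]
Bmat = np.exp(-np.abs(r[:, None] - r[None, :])) * dr
print("min eigenvalue:", np.linalg.eigvalsh(Bmat).min())   # positive: the discretized B is positive definite
```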
71 We investigate this kernel in the case that, for any ω ∈ Ω, G(ω) is a scalar kernel with a feature representation given, for any x,t ∈ X , by the formula G(ω)(x,t) = (φω (x), φω (t))W . [sent-543, score-0.779]
72 By an argument similar to that used just before Theorem 13, we conclude that K is a multi-task kernel and has the feature map Φ with feature space W . [sent-546, score-0.481]
73 Theorem 15 Let p be a measure on Ω and for every ω in the support of p, let G(ω) be a continuous universal kernel and B(ω) a positive definite operator. [sent-548, score-0.656]
74 Then the kernel K(x,t) = ∫_0^∞ e^{−ω‖x−t‖²} B(ω) dp(ω) is a multi-task universal kernel. [sent-559, score-0.513]
75 With this choice of A, the universal kernel in Example 5 becomes (K(x,t))_{ij} = 1/(‖x−t‖² + A_{ij}), ∀i, j ∈ Nn . [sent-570, score-0.513]
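A numeric check of this closed form, under the hedged assumption that B(ω)_{ij} = e^{-ωA_{ij}} and dp(ω) = dω, so that ∫₀^∞ e^{-ω(‖x-t‖²+A_{ij})} dω = 1/(‖x-t‖²+A_{ij}):

```python
import numpy as np

# Hypothetical choice: B(omega)_{ij} = exp(-omega * A_{ij}) with dp(omega) = d omega.
A_ij = 1.5          # a positive entry of A (illustrative value)
d2 = 0.8            # ||x - t||^2 (illustrative value)
omega = np.linspace(0.0, 60.0, 600001)       # truncate the improper integral
integrand = np.exp(-omega * (d2 + A_ij))
step = omega[1] - omega[0]
numeric = np.sum(0.5 * (integrand[1:] + integrand[:-1])) * step   # trapezoid rule
closed_form = 1.0 / (d2 + A_ij)
print(numeric, closed_form)                  # agree to several decimal places
```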
76 2 Transformation Kernels In this subsection we explore matrix-valued kernels produced by transforming scalar kernels. [sent-572, score-0.4]
77 Then, given a continuous scalar kernel G : X × X → R, we consider the matrix-valued kernel on X defined by K(x,t) := (G(Tp x, Tq t))_{p,q=1}^n , ∀x,t ∈ X . [sent-574, score-0.753]
78 (33) Proposition 16 Let G be a scalar kernel and K be defined by (33). [sent-575, score-0.454]
79 Since G is a scalar reproducing kernel on Z , the last term of the above equality is nonnegative, and hence K is a positive semi-definite matrix-valued kernel. [sent-578, score-0.545]
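A numerical check of this positive semi-definiteness under assumed choices (Gaussian scalar kernel G and three nonlinear maps T_p): the block Gram matrix of K(x,t) = (G(T_p x, T_q t))_{p,q} is positive semi-definite because it is the Gram matrix of G at the transformed points T_p(x_i).

```python
import numpy as np

rng = np.random.default_rng(5)

G = lambda x, t: np.exp(-(x - t) ** 2)          # scalar kernel on X = R (assumed)
Ts = [lambda x: x, lambda x: x ** 2, np.sin]    # three (nonlinear) maps T_p: X -> X (assumed)

def K(x, t):
    return np.array([[G(Tp(x), Tq(t)) for Tq in Ts] for Tp in Ts])

X = rng.uniform(-1, 1, size=6)
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative up to round-off
```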
80 To this end, we assume that the scalar kernel G has a feature map φ : X → W and define the mapping Φ(x) : Rn → W , for any y = (y1 , . . . , yn ) ∈ Rn , by Φ(x)y := ∑_{p∈Nn} yp φ(Tp x). [sent-581, score-0.563]
81 Then, for any x,t ∈ X , the kernel K(x,t) = Φ∗ (x)Φ(t) and thus, we conclude that W is the feature space of K and Φ is its feature map. [sent-589, score-0.409]
82 Finally, for any scalar Borel measure ν on X and a continuous map T from X to X , we introduce the image measure ν ◦ T −1 on X defined, for any E ∈ B(X ), by (ν ◦ T −1 )(E) := ν({x ∈ X : T x ∈ E}). [sent-593, score-0.457]
83 We are ready to state the result about universality of the kernel K in Equation (33). [sent-594, score-0.435]
84 Proposition 17 Let G be a scalar universal kernel, Tp : X → X be continuous for each p ∈ Nn and define the kernel K by Equation (33). [sent-595, score-0.782]
85 Before doing so, we recall that, by Lemma 10 and the remark which followed it, for any vector measure µ ∈ M (Z , Rn ), there exists a scalar regular measure ν ∈ M (Z × B1 ) such that ∫_Z dµ(t) = ( ∫_{Z×B1} y1 dν(t, y), . . . , ∫_{Z×B1} yn dν(t, y) ). [sent-598, score-0.409]
86 (2006) that the scalar kernel G is universal on Z if and only if its feature map φ is universal on Z . [sent-610, score-1.105]
87 Then, the matrix-valued function defined by K(x,t) := (e^{−σ_{pq} ‖x−t‖²})_{p,q=1}^n , ∀x,t ∈ X is a multi-task kernel if and only if, for some constant σ, σ_{pq} = σ for any p, q ∈ Nn . [sent-638, score-0.51]
88 Conversely, suppose K is a multi-task kernel which means, for any m ∈ N and xi ∈ X with i ∈ Nm , that the double-indexed nm × nm matrix G((i, p), ( j, q)) = e^{−σ_{pq} ‖xi −xj‖²} , (i, p), ( j, q) ∈ Nm × Nn (35) is positive semi-definite. [sent-640, score-1.18]
89 3 Hessian of Gaussian Kernels In this subsection we consider the universality of the Hessian of scalar Gaussian kernels (Example 3 in Section 2). [sent-650, score-0.671]
90 Then, K is a matrix-valued kernel which is universal if and only if n = 1. [sent-655, score-0.513]
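One common way to realize a Hessian-of-Gaussian construction (an assumption about the exact form used here, not taken from the paper) is K(x,t)_{pq} = ∂²/∂x_p∂t_q e^{-σ‖x-t‖²}. The sketch below builds this matrix-valued kernel for n = 2 and confirms numerically that its block Gram matrix is positive semi-definite, consistent with K being a matrix-valued kernel.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n = 1.0, 2

# Assumed form: K(x,t)_{pq} = d^2/(dx_p dt_q) exp(-sigma ||x - t||^2)
#             = (2 sigma delta_{pq} - 4 sigma^2 (x_p - t_p)(x_q - t_q)) exp(-sigma ||x - t||^2)
def K(x, t):
    d = x - t
    e = np.exp(-sigma * (d @ d))
    return (2 * sigma * np.eye(n) - 4 * sigma ** 2 * np.outer(d, d)) * e

X = rng.normal(size=(5, n))
gram = np.block([[K(xi, xj) for xj in X] for xi in X])
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())   # nonnegative: K is a matrix-valued kernel
```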
91 Hence, by Theorem 11, the kernel K is universal when n = 1. [sent-664, score-0.513]
92 Hence, the kernel K is not universal when n ≥ 2. [sent-667, score-0.513]
93 4 Projection Kernels In the final subsection we introduce a class of multi-task kernels associated with projection operators of scalar kernels. [sent-669, score-0.465]
94 We also need a continuous scalar kernel G : (X × Ω) × (X × Ω) → R with a feature representation given, for any x, x′ ∈ X and t, s ∈ Ω, by G((x,t), (x′ , s)) = (φ(x,t), φ(x′ , s))W . [sent-672, score-0.594]
95 If G is a universal scalar kernel then K is a universal multi-task kernel. [sent-680, score-0.996]
96 By assumption, G is universal on X × Ω and φ is its feature map, and thus we conclude that the scalar measure dµ(x, s) is the zero measure. [sent-687, score-0.633]
97 The first space, CK (Z , Y ), is the closure of the linear span of kernel sections; the second space, CΦ (Z , Y ), is the closure of the linear span of the features associated with the kernel. [sent-693, score-0.464]
98 This result is important in that it allows one to verify the universality of a kernel directly by considering its features. [sent-695, score-0.435]
99 (2006) and the observation that a multi-task kernel can be reduced to a standard scalar-valued kernel on the cross product space Z × Y . [sent-698, score-0.552]
100 Moreover, recalling that Z × B1 is compact if B1 is equipped with the weak topology, by the Riesz representation theorem, for any L̃, there exists a scalar measure ν on Z × B1 such that L̃(F) = ∫_{Z×B1} F(x, y) dν(x, y), ∀F ∈ C (Z × B1 ). [sent-739, score-0.461]
wordName wordTfidf (topN-words)
[('nm', 0.362), ('nn', 0.289), ('universal', 0.271), ('kernel', 0.242), ('micchelli', 0.239), ('scalar', 0.212), ('universality', 0.193), ('kernels', 0.188), ('aponnetto', 0.177), ('icchelli', 0.177), ('ontil', 0.177), ('tq', 0.176), ('niversal', 0.166), ('pq', 0.134), ('compact', 0.119), ('hilbert', 0.112), ('multi', 0.107), ('tp', 0.095), ('equation', 0.095), ('riesz', 0.088), ('borel', 0.085), ('cj', 0.084), ('vanishes', 0.079), ('folland', 0.077), ('uhl', 0.077), ('rkhs', 0.076), ('ing', 0.072), ('map', 0.072), ('rn', 0.07), ('pontil', 0.07), ('closure', 0.069), ('annihilator', 0.066), ('diestel', 0.066), ('operators', 0.065), ('rd', 0.065), ('reproducing', 0.063), ('banach', 0.06), ('theorem', 0.06), ('hausdorff', 0.059), ('caponnetto', 0.059), ('measure', 0.058), ('continuous', 0.057), ('conclude', 0.055), ('regular', 0.055), ('norm', 0.054), ('alternate', 0.051), ('operator', 0.051), ('ck', 0.05), ('nonzero', 0.05), ('topology', 0.05), ('integral', 0.048), ('ds', 0.047), ('plancherel', 0.047), ('vito', 0.047), ('sobolev', 0.046), ('rm', 0.046), ('representation', 0.046), ('proof', 0.043), ('span', 0.042), ('nishes', 0.038), ('signed', 0.038), ('space', 0.038), ('feature', 0.037), ('lemma', 0.036), ('sup', 0.035), ('functional', 0.035), ('hm', 0.034), ('adjoint', 0.034), ('hong', 0.034), ('kong', 0.034), ('bg', 0.033), ('carmeli', 0.033), ('tqt', 0.033), ('vazquez', 0.033), ('yq', 0.033), ('evgeniou', 0.032), ('task', 0.032), ('variation', 0.031), ('notion', 0.031), ('stein', 0.031), ('product', 0.03), ('consequently', 0.029), ('proposition', 0.029), ('kx', 0.029), ('ym', 0.029), ('hessian', 0.028), ('positive', 0.028), ('inner', 0.028), ('cucker', 0.028), ('tensor', 0.028), ('pairwise', 0.028), ('disjoint', 0.027), ('nite', 0.026), ('exists', 0.026), ('xi', 0.026), ('cd', 0.025), ('isometric', 0.025), ('aronszajn', 0.025), ('rp', 0.025), ('lt', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
2 0.16241233 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
3 0.15603915 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
4 0.097476304 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
5 0.087204695 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
6 0.085335955 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
7 0.083812334 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.080480918 86 jmlr-2008-SimpleMKL
9 0.059708212 56 jmlr-2008-Magic Moments for Structured Output Prediction
10 0.054870471 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.050021209 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
12 0.049803484 22 jmlr-2008-Closed Sets for Labeled Data
13 0.04672635 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
14 0.046630394 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.046031576 58 jmlr-2008-Max-margin Classification of Data with Absent Features
16 0.041038547 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
17 0.04042843 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
18 0.039303992 52 jmlr-2008-Learning from Multiple Sources
19 0.038887717 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
20 0.038753986 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
topicId topicWeight
[(0, 0.231), (1, -0.143), (2, 0.085), (3, -0.058), (4, 0.22), (5, -0.176), (6, 0.112), (7, 0.168), (8, 0.081), (9, 0.094), (10, -0.052), (11, -0.068), (12, 0.076), (13, 0.003), (14, 0.004), (15, 0.002), (16, -0.016), (17, -0.04), (18, -0.08), (19, -0.149), (20, 0.057), (21, -0.044), (22, -0.024), (23, -0.077), (24, 0.125), (25, -0.113), (26, -0.033), (27, -0.11), (28, -0.116), (29, 0.071), (30, -0.119), (31, -0.217), (32, -0.016), (33, -0.059), (34, 0.086), (35, 0.201), (36, 0.131), (37, 0.098), (38, 0.028), (39, 0.044), (40, 0.003), (41, 0.196), (42, -0.043), (43, 0.026), (44, 0.02), (45, 0.124), (46, 0.008), (47, 0.129), (48, 0.185), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.96579295 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
2 0.64559323 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
3 0.57488251 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
4 0.56156754 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
5 0.455708 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
Author: Matthias W. Seeger
Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization
6 0.42677915 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
7 0.32871062 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.31520265 86 jmlr-2008-SimpleMKL
9 0.29780868 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.2752952 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
11 0.23667361 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
12 0.23153178 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.20265542 58 jmlr-2008-Max-margin Classification of Data with Absent Features
14 0.20172153 22 jmlr-2008-Closed Sets for Labeled Data
15 0.19881007 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.18804134 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
17 0.18800546 57 jmlr-2008-Manifold Learning: The Price of Normalization
18 0.18667141 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
19 0.18411607 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
20 0.18386169 26 jmlr-2008-Consistency of Trace Norm Minimization
topicId topicWeight
[(0, 0.057), (40, 0.048), (54, 0.033), (58, 0.029), (66, 0.07), (76, 0.018), (78, 0.502), (88, 0.06), (92, 0.029), (94, 0.034), (99, 0.023)]
simIndex simValue paperId paperTitle
1 0.89142019 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
Author: Christian Igel, Verena Heidrich-Meisner, Tobias Glasmachers
Abstract: SHARK is an object-oriented library for the design of adaptive systems. It comprises methods for single- and multi-objective optimization (e.g., evolutionary and gradient-based algorithms) as well as kernel-based methods, neural networks, and other machine learning techniques. Keywords: machine learning software, neural networks, kernel-methods, evolutionary algorithms, optimization, multi-objective-optimization 1. Overview SHARK is a modular C++ library for the design and optimization of adaptive systems. It serves as a toolbox for real world applications and basic research in computational intelligence and machine learning. The library provides methods for single- and multi-objective optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning methods, neural networks, and many other machine learning techniques. Its main design criteria are flexibility and speed. Here we restrict the description of SHARK to its core components, albeit the library contains plenty of additional functionality. Further information can be obtained from the HTML documentation and tutorials. More than 60 illustrative example programs serve as starting points for using SHARK. 2. Basic Tools—Rng, Array, and LinAlg The library provides general auxiliary functions and data structures for the development of machine learning algorithms. The Rng module generates reproducible and platform independent sequences of pseudo random numbers, which can be drawn from 14 predefined discrete and continuous parametric distributions. The Array class provides dynamical array templates of arbitrary type and dimension as well as basic operations acting on these templates. LinAlg implements linear algebra algorithms such as matrix inversion and singular value decomposition. 3. ReClaM—Regression and Classification Methods The goal of the ReClaM module is to provide machine learning algorithms for supervised classification and regression in a unified, modular framework. It is built like a construction kit, where the main building blocks are adaptive data processing models, error functions, and optimization algorithms.
same-paper 2 0.88726467 92 jmlr-2008-Universal Multi-Task Kernels
Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying
Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
4 0.32327086 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
5 0.31374234 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
6 0.30880696 9 jmlr-2008-Active Learning by Spherical Subdivision
7 0.29912579 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.29170927 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
9 0.29147294 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.28893301 86 jmlr-2008-SimpleMKL
11 0.28639737 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
12 0.28297794 58 jmlr-2008-Max-margin Classification of Data with Absent Features
13 0.27308062 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
14 0.26841679 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.2658557 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.26264423 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
18 0.2585001 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.25789243 56 jmlr-2008-Magic Moments for Structured Output Prediction
20 0.2559967 26 jmlr-2008-Consistency of Trace Norm Minimization