jmlr jmlr2005 jmlr2005-48 knowledge-graph by maker-knowledge-mining

48 jmlr-2005-Learning the Kernel Function via Regularization


Source: pdf

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Science, University College London, Gower Street, London WC1E, UK. Editor: Peter Bartlett. Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. [sent-8, score-0.267]

2 We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. [sent-9, score-0.19]

3 We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. [sent-10, score-0.191]

4 Introduction A widely used approach to estimate a function from empirical data consists in minimizing a regularization functional in a Hilbert space H of real–valued functions f : X → IR, where X is a set. [sent-13, score-0.092]

5 Specifically, regularization estimates f as a minimizer of the functional Q(Ix ( f )) + µΩ( f ) where µ is a positive parameter, Ix ( f ) = ( f (x j ) : j ∈ INm ) is the vector of values of f on the set x = {x j : j ∈ INm } and INm = {1, . . . , m}. [sent-14, score-0.112]

6 This functional trades off empirical error, measured by the function Q : IRm → IR+ , with smoothness of the solution, measured by the functional Ω : H → IR+ . [sent-18, score-0.094]

7 In this paper we assume that H is a reproducing kernel Hilbert space (RKHS) HK with kernel K and choose Ω( f ) = ⟨ f , f ⟩, where ⟨·, ·⟩ is the inner product in HK , although some of the ideas we develop may be relevant in other circumstances. [sent-21, score-0.098]

8 Using these equations, the regularization functional in (1) can be rewritten as a functional of w. [sent-31, score-0.139]

9 When the kernel is fixed, an immediate concern with problem (1) is the choice of the regularization parameter µ. [sent-44, score-0.094]

10 A similar circumstance occurs for translation invariant kernels modeled by Gaussian mixtures. [sent-53, score-0.09]

11 In this paper, we propose a method for finding a kernel function which belongs to a compact and convex set K . [sent-58, score-0.22]

12 In contrast to the point of view of these papers, our setting applies to convex combinations of kernels parameterized by a compact set, a circumstance which is relevant for applications. [sent-65, score-0.274]

13 The simplest case of our setup is a set of convex combinations of finitely many kernels {K j : j ∈ INn }. [sent-68, score-0.182]

14 In all of these cases our method will seek the optimal convex combination of these kernels. [sent-70, score-0.129]

15 Another example included in our framework is learning the optimal radial kernel or the optimal polynomial kernel in which case the space K is the convex hull of a prescribed set of kernels parameterized by a locally compact set. [sent-71, score-0.441]

16 By a kernel we mean a symmetric function K : X × X → IR such that for every finite set of inputs x = {x j : j ∈ INm } ⊆ X and every m ∈ IN, the m × m matrix Kx := (K(xi , x j ) : i, j ∈ INm ) is positive semi-definite. [sent-77, score-0.099]

17 Also, we use A (X ) for the set of all kernels on the set X and A+ (X ) for the subset of kernels K such that, for each input x, Kx ∈ L+ (IRm ). [sent-79, score-0.136]

18 We prescribe a nonnegative function Q : IRm → IR+ and introduce the regularization functional Qµ ( f , K) := Q(Ix ( f )) + µ‖ f ‖²_K (4), where ‖ f ‖²_K := ⟨ f , f ⟩, µ is a positive constant and Q depends on y but we suppress it in our notation as it is fixed throughout our discussion. [sent-85, score-0.092]

19 A noteworthy special case of Qµ is the square loss regularization functional given by Sµ ( f , K) := ‖y − Ix ( f )‖² + µ‖ f ‖²_K (5), where ‖·‖ is the standard Euclidean norm on IRm . [sent-86, score-0.137]

20 Associated with the functional Qµ and the kernel K is the variational problem Qµ (K) := inf{Qµ ( f , K) : f ∈ HK } (6) which defines a function Qµ : A (X ) → IR+ . [sent-88, score-0.127]
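
For the square loss (5), the variational problem (6) has a finite-dimensional closed form: a minimizer has the representer form f = ∑_j c_j K(x_j , ·), so Sµ(K) = min_c ‖y − Kx c‖² + µ(c, Kx c), whose value works out to µ(y, (Kx + µI)−1 y). The sketch below is an illustration under a generic numpy setup, not code from the paper; the toy Gaussian Gram matrix is our own choice.

```python
import numpy as np

def square_loss_regularization_value(Kx, y, mu):
    """Value of S_mu(K) = min_c ||y - Kx c||^2 + mu * c' Kx c.

    For a positive semi-definite Gram matrix Kx this equals
    mu * y' (Kx + mu*I)^{-1} y (standard identity, checked below
    against a direct minimization over c).
    """
    m = len(y)
    return mu * y @ np.linalg.solve(Kx + mu * np.eye(m), y)

# toy check against the direct quadratic minimization over c
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
Kx = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Gaussian Gram matrix
y, mu = rng.normal(size=6), 0.1
c = np.linalg.solve(Kx + mu * np.eye(6), y)                   # minimizer of the quadratic in c
direct = np.sum((y - Kx @ c) ** 2) + mu * c @ Kx @ c
assert np.isclose(direct, square_loss_regularization_value(Kx, y, mu))
```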

21 Moreover, when Q is convex this function is unique. [sent-92, score-0.114]

22 The uniqueness of the solution relies on the fact that when Q is convex Qµ is strictly convex because µ is positive. [sent-95, score-0.228]

23 The point of view of this paper is that the functional (6) can be used as a design criterion to select the kernel K. [sent-96, score-0.096]

24 To this end, we specify an arbitrary convex subset K of A (X ) and focus on the problem Qµ (K ) = inf{Qµ (K) : K ∈ K }. [sent-97, score-0.114]

25 Also, we say G is a compact convex set of kernels whenever for each x ⊆ X , G (x) is a compact convex set of matrices in S (IRm ). [sent-110, score-0.426]

26 Lemma 2 If K is a compact and convex subset of A+ (X ) and Q : IRm → IR is continuous then the minimum of (7) exists. [sent-112, score-0.171]

27 Fix x ⊆ X , choose a minimizing sequence of kernels {K^n : n ∈ IN}, that is, lim_{n→∞} Qµ (K^n ) = Qµ (K ), and a sequence of vectors {c^n : n ∈ IN} such that Qµ (K^n ) = Q(K^n_x c^n ) + µ(c^n , K^n_x c^n ). [sent-114, score-0.104]

28 The proof of this lemma requires that all kernels in K are in A+ (X ). [sent-122, score-0.103]

29 If we wish to use kernels K only in A (X ) we may always modify them by adding any positive multiple of the delta function kernel ∆ defined, for x, t ∈ X , as ∆(x, t) = 1 if x = t and ∆(x, t) = 0 if x ≠ t (8), that is, replace K by K + a∆ where a is a positive constant. [sent-123, score-0.117]

30 There are two useful cases of a set K of kernels which are compact and convex. [sent-124, score-0.125]

31 The first is formed by the convex hull of a finite number of kernels in A+ (X ). [sent-125, score-0.217]

32 We let M (Ω) be the set of all probability measures on Ω and observe that K (G) := { ∫_Ω G(ω) dp(ω) : p ∈ M (Ω) } (9) is a compact and convex set of kernels in A+ (X ). [sent-129, score-0.239]
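
To make (9) concrete, a discrete measure p with finitely many atoms simply yields a finite convex combination of the kernels G(ω). The snippet below is an illustrative sketch (not from the paper) for a Gaussian family G(ω)(x, t) = exp(−ω‖x − t‖²) with parameters ω and weights chosen arbitrarily.

```python
import numpy as np

def gaussian_gram(X, omega):
    """Gram matrix of G(omega)(x, t) = exp(-omega * ||x - t||^2) on the inputs X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-omega * sq)

def mixture_gram(X, omegas, weights):
    """Gram matrix for the discrete measure p = sum_i weights[i] * delta_{omegas[i]}."""
    weights = np.asarray(weights, dtype=float)
    # p must be a probability measure: nonnegative weights summing to one
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * gaussian_gram(X, om) for w, om in zip(weights, omegas))

X = np.random.default_rng(1).normal(size=(5, 3))
Kx = mixture_gram(X, omegas=[0.1, 1.0, 10.0], weights=[0.2, 0.5, 0.3])
```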

33 By the definition of convex hull, we obtain, for some sequence of probability measures {p_ℓ : ℓ ∈ IN}, that Kx = lim_{ℓ→∞} ∫_Ω Gx (ω) dp_ℓ (ω), where each p_ℓ is a finite sum of point measures. [sent-139, score-0.13]

34 Observe that the set G = {G(ω) : ω ∈ Ω} in the above lemma is compact since G is continuous and Ω compact. [sent-146, score-0.092]

35 In general, we wish to point out a useful fact about the kernels in coG whenever G is a compact set of kernels. [sent-147, score-0.125]

36 Theorem 4 If A is a subset of IRn then every a ∈ coA is a convex combination of at most n + 1 elements of A. [sent-150, score-0.148]

37 Lemma 5 If A is a compact subset of IRn then coA is compact and every element in it is a convex combination of at most n + 1 elements of A. [sent-152, score-0.262]

38 Corollary 6 If G is a compact set of kernels on X × X then coG is a compact set of kernels. [sent-154, score-0.182]

39 Moreover, for each input set x a matrix C ∈ coG (x) if and only if there exists a kernel T which is a convex combination of at most m(m+1)/2 + 1 kernels in G and Tx = C. [sent-155, score-0.258]

40 Our next result shows whenever K is the closed convex hull of a compact set of kernels G that the optimal kernel lies in the convex hull of some finite subset of G . [sent-156, score-0.472]

41 Theorem 7 If G ⊆ A+ (X ) is a compact set of basic kernels, K = coG , Q : IRm → IR+ is continuous and µ is a positive number then there exists T ⊆ G containing at most m + 2 basic kernels such that Qµ admits a minimizer K̃ ∈ coT and Qµ (T ) = Qµ (K ). [sent-157, score-0.145]

42 By Lemma 5 the vector (K̂x ĉ, (ĉ, K̂x ĉ)) can be written as a convex combination of at most m + 2 vectors in V , that is (K̂x ĉ, (ĉ, K̂x ĉ)) = (K̃x ĉ, (ĉ, K̃x ĉ)), where K̃ is the convex combination of at most m + 2 kernels in G . [sent-162, score-0.326]

43 Note that Theorem 7 asserts the existence of a q which is at most m + 2, that is, an optimal kernel is expressed by a convex combination of at most m + 2 kernels. [sent-164, score-0.178]

44 We address this issue in the case that K is the convex hull of a finite set of kernels. [sent-167, score-0.149]

45 Lemma 8 If Kn = {K j : j ∈ INn } is a family of kernels on X × X and f ∈ ∑_{j∈INn} HK_j then inf{ ‖f‖_K : K ∈ coKn } = min{ ∑_{j∈INn} ‖f_j‖_{K_j} : f = ∑_{j∈INn} f_j , f_j ∈ HK_j , j ∈ INn }. [sent-169, score-0.105]

46 This lemma suggests a reformulation of our extremal problem (7) for kernels of the form (9) where G is expressed in terms of a feature map. [sent-173, score-0.103]

47 Next, we establish that the variational problem (7) is a convex optimization problem. [sent-175, score-0.145]

48 Specifically, we shall show that if the function Q : IRm → IR is convex then the functional Qµ : A+ (X ) → IR+ is convex as well. [sent-176, score-0.275]

49 Also, for each fixed t > 0, φr (t) is a non-increasing function of r and, for each r > 0, φr is continuously differentiable, decreasing and convex on IR+ . [sent-184, score-0.114]

50 Lemma 9 If K ∈ A (X ), x a set of m distinct points of X such that Kx ∈ L+ (IRm ) and Q : IRm → IR a convex function, then there exists r0 > 0 such that for all r > r0 there holds the formula Qµ (K) = sup {φr ((v, Kx v)) − Q∗ (v) : v ∈ IRm } . [sent-185, score-0.141]

51 Lemma 10 If K ∈ A (X ), x a set of m distinct points of X such that Kx ∈ L+ (IRm ) and Q : IRm → IR a convex function, then there holds the formula Qµ (K) = sup{ −(1/(4µ))(v, Kx v) − Q∗ (v) : v ∈ IRm }. [sent-202, score-0.141]
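
As a numerical sanity check of this dual representation for the square loss (a sketch only; it uses the standard Fenchel conjugate Q∗(v) = (v, y) + ‖v‖²/4 of Q(z) = ‖y − z‖², which is an assumption of this illustration rather than a formula quoted from the paper), one can maximize the right-hand side and compare it with the primal value µ(y, (Kx + µI)−1 y):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))
Kx = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # toy Gaussian Gram matrix
y, mu = rng.normal(size=5), 0.5

primal = mu * y @ np.linalg.solve(Kx + mu * np.eye(5), y)     # S_mu(K) in closed form

def negative_dual(v):
    # negative of -(1/(4*mu)) v'Kx v - Q*(v), with Q*(v) = v'y + ||v||^2 / 4
    return (v @ Kx @ v) / (4 * mu) + v @ y + v @ v / 4

dual = -minimize(negative_dual, np.zeros(5)).fun              # sup over v in IR^m
print(primal, dual)  # the two values should agree up to solver tolerance
```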

52 Definition 12 A function φ : B → IR is said to be convex on B ⊆ A (X ) if, for every A, B ∈ B and λ ∈ [0, 1] there holds the inequality φ(λA + (1 − λ)B) ≤ λφ(A) + (1 − λ)φ(B). [sent-211, score-0.133]

53 Proposition 13 If Q : IRm → IR+ is convex then for every µ > 0 the function Qµ : A+ (X ) → IR+ is convex and non-increasing. [sent-213, score-0.247]

54 Specifically, equation (11) expresses Qµ as the supremum of a family of functions which are convex and non-increasing on A (X ). [sent-216, score-0.114]

55 (2004) for the hinge loss and stated in (Ong, Smola and Williamson, 2003) for differentiable convex loss functions. [sent-218, score-0.142]

56 Square Regularization In this section we exclusively study the case of the square loss regularization functional Sµ in equation (5) and provide improvements and simplifications of our previous results. [sent-220, score-0.123]

57 Recall, for every kernel K on X × X , that the minimal norm interpolation to the data D is the solution to the variational problem ρ(K) := min{ ‖f‖²_K : f ∈ HK , f (x j ) = y j , j ∈ INm }. [sent-232, score-0.133]
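
When Kx is invertible, the representer theorem reduces this problem to a linear solve: writing f = ∑_j c_j K(x_j , ·), the constraints become Kx c = y and ‖f‖²_K = (c, Kx c), so ρ(K) = (y, Kx−1 y). A minimal sketch (illustration, not the paper's code):

```python
import numpy as np

def min_norm_interpolation_value(Kx, y):
    """rho(K) = min{ ||f||_K^2 : f(x_j) = y_j } = y' Kx^{-1} y for invertible Kx."""
    c = np.linalg.solve(Kx, y)   # coefficients of the interpolant f = sum_j c_j K(x_j, .)
    return y @ c                 # equals c' Kx c, the squared RKHS norm of f
```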

58 Lemma 17 The function γ is convex and the function γ−1 is concave. [sent-242, score-0.114]

59 Similarly, using this equation we have that γ(A) = max{ (c, Ac)−1 : c ∈ IRm , (c, y) = 1 }, thereby expressing γ as a maximum of a family of convex functions. [sent-248, score-0.129]

60 Lemmas 16 and 17 established that the function φ : L+ (IRm ) → IR defined, for some d ∈ IRm and all A ∈ L+ (IRm ), as φ(A) = (d, A−1 d) is non-increasing and convex (see also the work of Marshall and Olkin, 1979). [sent-251, score-0.114]

61 Lemma 18 If K is a compact and convex set of kernels in A+ (X ) then the minimum of (20) exists. [sent-255, score-0.239]

62 Let M be the convex hull of the set of vectors N := {K_{x, j} c̃ : j ∈ J ∗ } ⊂ IRm . [sent-274, score-0.149]

63 Every vector in M can be expressed as a convex combination of at most q := min(m + 1, |J ∗ |) ≤ min(m + 1, n) elements of N . [sent-276, score-0.148]

64 Theorem 20 If Ω is a compact Hausdorff topological space and G : Ω → A+ (X ) is continuous then there exists a kernel K̂ = ∫_Ω G(ω) dp̂(ω) ∈ K (G) such that p̂ is a discrete probability measure in M (Ω) with at most m + 1 atoms. [sent-291, score-0.12]

65 We define the convex function ϕ : IRm → IR by setting for each c ∈ IRm , ϕ(c) := max{(c, Gx (ω)c) : ω ∈ Ω} and note that by Lemma 24 the directional derivative of ϕ along the “direction” d ∈ IRm , denoted by ϕ+ (c; d), is given by ϕ+ (c; d) = 2 max{(d, Gx (ω)c) : ω ∈ Ω∗ }. [sent-296, score-0.114]

66 Let M be the convex hull of the set of vectors N := {Gx (ω)c̃ : ω ∈ Ω∗ } ⊂ IRm . [sent-298, score-0.149]

67 Since M ⊆ IRm , by the Caratheodory theorem every vector in M can be expressed as a convex combination of at most m + 1 elements of N . [sent-299, score-0.161]

68 We note that, in view of equations (13) and (17), Theorem 19 and Theorem 20 apply directly, up to an unimportant constant µ, to the square loss functional by merely adding the kernel µ∆ to the class of kernels considered in these theorems. [sent-313, score-0.195]

69 That is, we minimize the quantity in equation (17) over the compact convex set of kernels K̃ = {K̃ : K̃ = K + µ∆, K ∈ K }, where the kernel ∆ is defined in equation (8). [sent-314, score-0.288]

70 Note that the set IR+ is not compact and the kernel N(0) is not in A+ (IRd ). [sent-323, score-0.106]

71 Therefore, on both accounts Theorem 20 does not apply in this circumstance unless, of course, we impose a positive lower bound and a finite upper bound on the variance of the Gaussian kernels N(ω). [sent-324, score-0.09]

72 We may overcome this difficulty by a limiting process which can handle kernel maps on locally compact Hausdorff spaces. [sent-325, score-0.106]

73 Therefore, we can extract a convergent subsequence {p_{n_ℓ} : ℓ ∈ IN} of probability measures and kernels {K_{n_ℓ} : ℓ ∈ IN} such that lim_{ℓ→∞} p_{n_ℓ} = p̂, lim_{ℓ→∞} K_{n_ℓ} = K̂, and K̂ = ∫_{IR+} N(ω) dp̂(ω), with the provision that p̂ may have atoms at either zero or infinity. [sent-335, score-0.147]

74 For example, they maximize the margin of a binary support vector machine (SVM) trained with the kernel K, which is the square root of the reciprocal of the quantity defined by the equation ρhard (K) = min{ ‖f‖²_K : y_j f (x_j ) ≥ 1, j ∈ INm }. [sent-346, score-0.097]

75 In particular, if the K j are positive semi-definite, K could be the set of convex combinations of such matrices. [sent-355, score-0.129]

76 Our observations in Section 2 confirm that the margin and the soft margin are convex functions of the kernel. [sent-357, score-0.114]

77 Ong, Smola and Williamson (2003) consider learning a kernel function rather than a kernel matrix. [sent-359, score-0.098]

78 This is a kernel H : X 2 × X 2 → IR, where X 2 = X × X , with the property that, for every (x,t) ∈ X 2 , H((x,t), (·, ·)) is a kernel on X × X . [sent-361, score-0.117]

79 This construction includes convex combinations of a possibly infinite number of kernels provided they are pointwise nonnegative. [sent-362, score-0.182]

80 For example, Gaussian kernels or polynomial kernels with even degree satisfy this assumption, although those with odd degree, such as linear kernels, or other radial kernels do not. [sent-363, score-0.292]

81 Numerical Simulations. In this section we discuss two numerical simulations we carried out to compute a convex combination of a finite set of kernels {K_ℓ : ℓ ∈ INn } which minimizes the square loss regularization functional. [Table residue: results for Methods 1, 2 and 3 over several values of µ omitted.]
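
A simple way to reproduce a Method 1–style experiment (a sketch under generic assumptions, not the authors' implementation; the optimizer and function names are our own) is to minimize λ ↦ Sµ(∑_ℓ λ_ℓ K_ℓ) = µ(y, (∑_ℓ λ_ℓ K_{x,ℓ} + µI)−1 y) over the probability simplex:

```python
import numpy as np
from scipy.optimize import minimize

def best_convex_combination(grams, y, mu):
    """Minimize S_mu over convex combinations of the given Gram matrices.

    grams: list of n positive semi-definite m x m matrices K_{x,l}
    Returns the optimal simplex weights lambda.
    """
    n, m = len(grams), len(y)

    def objective(lam):
        Kx = sum(l * G for l, G in zip(lam, grams))          # Gram matrix of the mixture
        return mu * y @ np.linalg.solve(Kx + mu * np.eye(m), y)

    constraints = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.full(n, 1.0 / n), bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x
```

Since Sµ(K) is convex and non-increasing in K (Proposition 13), this is a convex problem over the simplex, so a local solver of this kind should not get trapped in spurious local minima.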

82 Method 1 is our proposed approach, method 2 is the average of the kernels, that is we use the kernel K = (1/n) ∑_ℓ K_ℓ , and method 3 is the kernel K = K2 + K5 , the “ideal” kernel, that is, the kernel used to generate the target function. [sent-490, score-0.147]

83 Method 1 is our proposed approach, method 2 is the average of the kernels and method 3 is the ideal kernel given by K(x, t) = (1/3) sin(x) sin(t) + (2/3) sin(3x) sin(3t). [sent-495, score-0.117]

84 Since Qµ (K, y) is convex in K for each y ∈ IRm so is Qav (K). [sent-520, score-0.114]

85 Minimizing the quantity (33) over a convex class K may be valuable in image reconstruction and compression where we are provided with a collection of images and we wish to find a good average representation for them. [sent-523, score-0.114]

86 In particular, for square loss regularization and the Euclidean norm on IRm we obtain max{Sµ (K, y) : ‖y‖ ≤ 1} = max{µ(y, (Kx + µI)−1 y) : ‖y‖ ≤ 1} = µ/(λmin (Kx ) + µ), where λmin (Kx ) is the smallest eigenvalue of the matrix Kx . [sent-529, score-0.102]
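
This identity is easy to confirm numerically; the quick sketch below (same generic numpy setup as above, not code from the paper) compares the closed form with the value attained at the eigenvector of Kx for its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
Kx = A @ A.T                          # a positive semi-definite Gram matrix
mu = 0.25
lam_min = np.linalg.eigvalsh(Kx).min()
print(mu / (lam_min + mu))            # closed-form maximum over the unit ball
v = np.linalg.eigh(Kx)[1][:, 0]       # unit eigenvector for the smallest eigenvalue
print(mu * v @ np.linalg.solve(Kx + mu * np.eye(6), v))  # value of S_mu(K, v)
```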

87 We may modify the functional Qµ with a penalty term which depends on the kernel matrix Kx to enforce uniqueness of the optimal kernel in K . [sent-539, score-0.157]

88 Therefore, we consider the variational problem min{Qµ (K) + R(Kx )} (34) where R is a strictly convex function on L (IRm ). [sent-540, score-0.145]

89 In this case, the method of proof of Theorem 7 shows that the optimal kernel can be found as a convex combination of at most m(m + 1)/2 kernels. [sent-541, score-0.178]

90 For example, to each A ∈ L (IRd ) we define the linear kernel KA (x,t) = (x, At), x,t ∈ IRd and so our results apply to any convex compact subset of L (IRd ) for this kernel map. [sent-551, score-0.269]

91 For compact convex sets of covariances our results say that Gaussian mixture models give optimal kernels. [sent-553, score-0.171]

92 Conclusion The intent of this paper is to enlarge the theoretical understanding of the study of optimal kernels via the minimization of a regularization functional. [sent-555, score-0.113]

93 In contrast to the point of view of these papers, our setting applies to convex combinations of kernels parameterized by a compact set. [sent-558, score-0.252]

94 Our analysis establishes that the regularization functional Qµ is convex in K and that any optimizing kernel can be expressed as the convex combination of at most m + 2 basic kernels. [sent-559, score-0.384]

95 Theorem 22 Let f : A × B → IR where A is a compact convex subset of a Hausdorff topological vector space X and B is a convex subset of a vector space Y . [sent-569, score-0.299]

96 Lemma 24 Let T be a compact set and G(t, x) a real-valued function on T × X such that, for every x ∈ X , G(·, x) is continuous on T and, for every t ∈ T , G(t, ·) is convex on X . [sent-578, score-0.209]

97 We define the real-valued convex function g on X by the formula g(x) := max{G(t, x) : t ∈ T }, x ∈ X , and the set M(x) := {t : t ∈ T , G(t, x) = g(x)}. [sent-579, score-0.114]

98 To prove the reverse inequality we use the fact that if f is convex on [0, ∞) and f (0) = 0 then f (λ)/λ is a nondecreasing function of λ > 0. [sent-583, score-0.114]

99 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-959, score-0.114]

100 On the dual formulation of regularized linear systems with convex risks. [sent-965, score-0.114]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kx', 0.587), ('irm', 0.564), ('inn', 0.314), ('ir', 0.25), ('inm', 0.2), ('convex', 0.114), ('gx', 0.088), ('ird', 0.076), ('cog', 0.071), ('icchelli', 0.07), ('kernels', 0.068), ('micchelli', 0.067), ('cokn', 0.065), ('hk', 0.063), ('bv', 0.063), ('egularization', 0.063), ('ontil', 0.063), ('unction', 0.063), ('compact', 0.057), ('kernel', 0.049), ('bc', 0.048), ('functional', 0.047), ('pontil', 0.046), ('ernel', 0.046), ('regularization', 0.045), ('irn', 0.043), ('hausdorff', 0.038), ('kn', 0.038), ('prescribed', 0.036), ('lanckriet', 0.036), ('hull', 0.035), ('lemma', 0.035), ('ix', 0.033), ('atoms', 0.032), ('royden', 0.032), ('earning', 0.031), ('variational', 0.031), ('minimax', 0.028), ('sup', 0.027), ('sin', 0.027), ('argyriou', 0.026), ('herbster', 0.026), ('inf', 0.023), ('ong', 0.022), ('rockafellar', 0.022), ('circumstance', 0.022), ('hilbert', 0.021), ('interpolation', 0.02), ('minimizer', 0.02), ('radial', 0.02), ('borwein', 0.019), ('caratheodory', 0.019), ('fj', 0.019), ('marshall', 0.019), ('qav', 0.019), ('zhou', 0.019), ('ying', 0.019), ('rkhs', 0.019), ('aronszajn', 0.019), ('every', 0.019), ('cn', 0.018), ('min', 0.018), ('zhang', 0.017), ('cristianini', 0.017), ('square', 0.017), ('vito', 0.016), ('wahba', 0.016), ('lim', 0.016), ('matrices', 0.016), ('combination', 0.015), ('subsequence', 0.015), ('rj', 0.015), ('bach', 0.015), ('max', 0.015), ('preprint', 0.014), ('smale', 0.014), ('norm', 0.014), ('topological', 0.014), ('loss', 0.014), ('theorem', 0.013), ('parameterized', 0.013), ('kimeldorf', 0.013), ('neumann', 0.013), ('bousquet', 0.013), ('atom', 0.013), ('coa', 0.013), ('intersects', 0.013), ('olkin', 0.013), ('reciprocal', 0.013), ('thibaux', 0.013), ('uncountable', 0.013), ('ac', 0.012), ('conjugate', 0.012), ('matrix', 0.012), ('ucl', 0.012), ('von', 0.012), ('alternate', 0.012), ('simulations', 0.012), ('hastie', 0.011), ('compactness', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 48 jmlr-2005-Learning the Kernel Function via Regularization

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.

2 0.2284632 47 jmlr-2005-Learning from Examples as an Inverse Problem

Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone

Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE

3 0.085482143 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

Author: Simone Fiori

Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient

4 0.080052644 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms

5 0.062636189 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

6 0.05021745 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

7 0.046186969 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems

8 0.037352309 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

9 0.037141461 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

10 0.031948678 64 jmlr-2005-Semigroup Kernels on Measures

11 0.025358045 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

12 0.02353804 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

13 0.023367589 41 jmlr-2005-Kernel Methods for Measuring Independence

14 0.020846058 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

15 0.019952422 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

16 0.019767635 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

17 0.019469734 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

18 0.017106498 36 jmlr-2005-Gaussian Processes for Ordinal Regression

19 0.01669273 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

20 0.016472246 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, 0.135), (2, 0.118), (3, 0.38), (4, 0.063), (5, 0.438), (6, 0.005), (7, 0.042), (8, -0.068), (9, 0.099), (10, 0.026), (11, 0.0), (12, -0.18), (13, -0.117), (14, 0.106), (15, 0.007), (16, -0.095), (17, -0.179), (18, 0.037), (19, 0.029), (20, 0.12), (21, -0.001), (22, 0.123), (23, 0.046), (24, 0.092), (25, -0.095), (26, 0.043), (27, -0.258), (28, -0.08), (29, 0.091), (30, -0.037), (31, -0.153), (32, -0.036), (33, -0.05), (34, 0.008), (35, 0.038), (36, 0.003), (37, -0.063), (38, -0.04), (39, -0.106), (40, -0.029), (41, 0.051), (42, 0.096), (43, 0.006), (44, -0.004), (45, -0.017), (46, 0.016), (47, 0.011), (48, -0.049), (49, -0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95769012 48 jmlr-2005-Learning the Kernel Function via Regularization

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.

2 0.69340324 47 jmlr-2005-Learning from Examples as an Inverse Problem

Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone

Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE

3 0.25198939 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

Author: Simone Fiori

Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient

4 0.20992333 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms

5 0.17582025 71 jmlr-2005-Variational Message Passing

Author: John Winn, Christopher M. Bishop

Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing

6 0.14854158 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems

7 0.13234019 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

8 0.12461937 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

9 0.12144527 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

10 0.10769928 64 jmlr-2005-Semigroup Kernels on Measures

11 0.088486932 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

12 0.08764077 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

13 0.086345561 12 jmlr-2005-An MDP-Based Recommender System

14 0.084317446 41 jmlr-2005-Kernel Methods for Measuring Independence

15 0.081261151 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

16 0.080151297 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

17 0.078648046 29 jmlr-2005-Efficient Margin Maximizing with Boosting

18 0.075918734 20 jmlr-2005-Clustering with Bregman Divergences

19 0.07465373 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

20 0.069310457 39 jmlr-2005-Information Bottleneck for Gaussian Variables


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.355), (13, 0.029), (17, 0.023), (19, 0.026), (21, 0.05), (36, 0.032), (37, 0.024), (43, 0.043), (44, 0.029), (47, 0.016), (52, 0.05), (59, 0.014), (70, 0.015), (76, 0.022), (88, 0.109), (90, 0.03), (94, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69808954 48 jmlr-2005-Learning the Kernel Function via Regularization

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.

2 0.36937225 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms

3 0.366528 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

4 0.35081604 47 jmlr-2005-Learning from Examples as an Inverse Problem

Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone

Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE

5 0.34466237 64 jmlr-2005-Semigroup Kernels on Measures

Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert

Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space

6 0.34407115 20 jmlr-2005-Clustering with Bregman Divergences

7 0.34277901 11 jmlr-2005-Algorithmic Stability and Meta-Learning

8 0.33823469 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

9 0.33685821 39 jmlr-2005-Information Bottleneck for Gaussian Variables

10 0.33671066 71 jmlr-2005-Variational Message Passing

11 0.33644733 3 jmlr-2005-A Classification Framework for Anomaly Detection

12 0.33407706 44 jmlr-2005-Learning Module Networks

13 0.33386284 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

14 0.33372524 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

15 0.33014339 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

16 0.32916245 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

17 0.32803583 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

18 0.32579127 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

19 0.32239616 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

20 0.32227561 67 jmlr-2005-Stability of Randomized Learning Algorithms