jmlr jmlr2006 jmlr2006-93 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang
Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels
Reference: text
sentIndex sentText sentNum sentScore
1 A number of concrete examples are given of kernels with this universal approximating property. [sent-7, score-0.366]
2 Keywords: density, translation invariant kernels, radial kernels 1. [sent-8, score-0.347]
3 We shall call a function K from X × X to C a kernel on X provided that for any finite sequence of inputs x := {x j : j ∈ Nn } ⊆ X the matrix Kx := (K(x j , xk ) : j, k ∈ Nn ) (1) is Hermitian and positive semi-definite. [sent-13, score-0.258]
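As a minimal numerical illustration of Definition (1) (the Gaussian kernel, the sample points and the tolerance below are my own example choices, not taken from the paper), the following sketch builds the matrix Kx for a finite input sequence and confirms that it is Hermitian and positive semi-definite by inspecting its eigenvalues.

```python
import numpy as np

# Minimal sketch: build the matrix Kx = (K(x_j, x_k))_{j,k} of Equation (1) for a
# finite input sequence and check that it is Hermitian and positive semi-definite.
# The Gaussian kernel is only an example choice, not the paper's prescription.
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(6, 3))                      # x_1, ..., x_6 in R^3

Kx = np.array([[gaussian_kernel(xj, xk) for xk in inputs] for xj in inputs])

assert np.allclose(Kx, Kx.conj().T)                   # Hermitian (here real symmetric)
assert np.linalg.eigvalsh(Kx).min() > -1e-10          # PSD up to rounding error
```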
4 Besides their superior performance on a wide spectrum of learning tasks from data, they have a substantial theoretical basis, as they are reproducing kernels of Hilbert spaces of functions on X for which point evaluation is always continuous (Aronszajn, 1950). [sent-16, score-0.245]
5 We shall refer to the functions in the sum on the right hand side of (2) as sections of the kernel K. [sent-23, score-0.258]
6 Certainly, the choice of the kernel in (2) affects the performance of kernel-based learning algorithms and so is important. [sent-26, score-0.350]
7 Indeed, it is well known that if we use the norm of the RKHS whose kernel is K, then all members of this Hilbert space can be approximated arbitrarily well by functions of the type appearing in (2). [sent-38, score-0.311]
8 Actually, this is the way the Hilbert space associated with a kernel is constructed from the kernel itself (Aronszajn, 1950). [sent-39, score-0.394]
9 To this end, we assume that the input space X is a Hausdorff topological space and that all kernels to be considered are continuous on X × X . [sent-41, score-0.268]
10 To begin to address the problem which interests us here, we let Z be a fixed but arbitrary compact subset of X and, as usual, let C(Z) be the space of all continuous complex-valued functions from Z to C equipped with the maximum norm ‖·‖_Z. [sent-42, score-0.323]
11 Given a kernel K we form the space of kernel sections K(Z ) := span {Ky : y ∈ Z }, where Ky : X → C is the function defined at every x ∈ X by the equation Ky (x) := K(x, y). [sent-45, score-0.664]
12 We want to identify kernels with the following universal approximating property: given any prescribed compact subset Z of X, any positive number ε and any function f ∈ C(Z) there is a function g ∈ K(Z) such that ‖f − g‖_Z ≤ ε. [sent-47, score-0.603]
13 That is, for any choice of compact subset Z of the input space X , the set K(Z ) is dense in C(Z ) in the maximum norm. [sent-48, score-0.276]
14 When a kernel has this property we call it a universal kernel. [sent-49, score-0.381]
15 In other words, a universal kernel K has the property that K(Z ) = C(Z ). [sent-50, score-0.381]
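The universal approximating property can be probed numerically. The sketch below is a rough illustration only, not the paper's construction: it takes the Gaussian kernel as a candidate universal kernel, the compact set Z = [0, 1], a hand-picked bandwidth and grid, and fits a finite linear combination g = Σ_j c_j K_{y_j} of kernel sections to a continuous target f by least squares, reporting the maximum-norm error ‖f − g‖_Z.

```python
import numpy as np

# Rough numerical probe of the universal approximating property on Z = [0, 1]:
# fit g = sum_j c_j K_{y_j} to a continuous target f by least squares and report
# the sup-norm error on a fine grid standing in for Z. The Gaussian kernel and
# all tuning choices below are ad hoc examples, not the paper's construction.
def k(x, y, s=0.1):
    return np.exp(-(x - y) ** 2 / (2 * s ** 2))

f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.abs(x - 0.5)    # continuous target

centers = np.linspace(0.0, 1.0, 30)     # points y_j in Z defining sections K_{y_j}
grid = np.linspace(0.0, 1.0, 2000)      # fine grid standing in for Z

A = k(grid[:, None], centers[None, :])  # A[i, j] = K_{y_j}(grid[i])
coeffs, *_ = np.linalg.lstsq(A, f(grid), rcond=None)

sup_error = np.max(np.abs(A @ coeffs - f(grid)))
print(f"max-norm error with {len(centers)} kernel sections: {sup_error:.2e}")
```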
16 It is this question of characterizing universal kernels that we address here. [sent-51, score-0.338]
17 We shall demonstrate that it has a satisfactory resolution in terms of any feature map representation of the kernel K. [sent-52, score-0.378]
18 Indeed, we provide a necessary and sufficient condition for K to have the universal approximating property in terms of its features, thereby completing preliminary remarks made about this problem in Micchelli and Pontil (2004) and Micchelli et al. [sent-53, score-0.234]
19 In Section 4, we stress translation invariant kernels on Rd and give several useful sufficient conditions for K to be a universal translation invariant kernel. [sent-56, score-0.706]
20 Kernels Defined by Feature Maps We start from a Hilbert space W over C and a continuous kernel K on X × X . [sent-60, score-0.267]
21 A feature map for the kernel K is any continuous function Φ : X → W such that for each (x, y) ∈ X × X K(x, y) = (Φ(x), Φ(y))W (3) where (·, ·)W is the inner product on W . [sent-61, score-0.308]
22 Every kernel has such a representation and conversely whenever it does then it is a kernel. [sent-62, score-0.239]
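For a concrete finite-dimensional instance of Equation (3) (a toy example of my own; the space W in the paper may well be infinite-dimensional), take W = C^3 and the feature map Φ(x) = (1, √2 x, x²); the induced kernel is then the polynomial kernel (1 + xy)².

```python
import numpy as np

# Toy finite-dimensional feature map Phi : R -> W = C^3 and the kernel it induces
# via K(x, y) = (Phi(x), Phi(y))_W, illustrating Equation (3). Here the induced
# kernel is the polynomial kernel (1 + x*y)^2.
def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2], dtype=complex)

def kernel_from_features(x, y):
    # np.vdot conjugates its first argument, giving sum_j phi_j(x) * conj(phi_j(y))
    return np.vdot(phi(y), phi(x))

x, y = 0.7, -1.3
assert np.isclose(kernel_from_features(x, y), (1.0 + x * y) ** 2)
```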
23 To establish the converse, we first construct the Hilbert space H associated with the continuous kernel K and then observe for all x, y ∈ X , by the reproducing kernel property, that K(x, y) := (Kx , Ky )H . [sent-67, score-0.476]
24 This feature space representation is continuous because for all x, y ∈ X we have that ‖Φ(x) − Φ(y)‖²_W = K(x, x) + K(y, y) − K(x, y) − K(y, x), where ‖·‖_W denotes the norm on W. [sent-69, score-0.231]
25 There is an alternative way to construct a feature space representation for a continuous kernel K, which has the advantage that the Hilbert space W can be chosen to be separable. [sent-70, score-0.392]
26 To construct such a representation we must choose a compact subset Z of X and a finite Borel measure µ on Z with supp (µ) = Z (see (15) for the definition of the support of a Borel measure). [sent-71, score-0.761]
27 Next, we use the feature space representation of the kernel in (3) on the right hand side of the equation to obtain the equation ‖U(ν)‖²_W = ∫_Z ∫_Z K(x, y) dν̄(y) dν(x) (11), where ν̄ is the conjugate of the complex-valued measure ν, defined on each Borel set S ⊆ Z as the complex conjugate of ν(S). [sent-93, score-0.459]
28 We remark here that since the kernel K is continuous and Z is compact, Fubini’s theorem ensures that we can interchange the order of integration on the right hand side of the above equation (see, for example, Royden, 1988). [sent-94, score-0.361]
29 Furthermore, the closed linear span of a subset V of C(Z), denoted by span V, has the same annihilator as the set V itself. [sent-101, score-0.558]
30 Moreover, two subsets V1 and V2 of C(Z ) have the same annihilator if and only if span V1 = span V2 . [sent-102, score-0.508]
31 Proof By the Hahn-Banach theorem, the linear span of a subset V is dense in C(Z), that is, span V = C(Z), if and only if V⊥ = {0} (Lax, 2002; Royden, 1988; Rudin, 1991). [sent-108, score-0.539]
32 To this end, we introduce a subspace of W defined as Φ(Z ) := span {Φ(x) : x ∈ Z }. [sent-115, score-0.252]
33 From these definitions it follows that Q( span S ) = span Q(S ). [sent-118, score-0.43]
34 We recall that the linear span of a subset S is dense in W , that is, span S = W , if and only if the only u ∈ W with (u, v) = 0 for all v ∈ S is u = 0 (We already use a similar fact for the space C(Z )). [sent-130, score-0.583]
35 It follows directly from Equation (14), for any subset S of W such that span S is dense in W , that V (S )⊥ = N (U) and so with this remark and Proposition 1 we conclude that K(Z ) = spanV (S ). [sent-131, score-0.324]
36 Corresponding to each element y in an orthonormal basis Y of W we define the function Fy ∈ C(Z ) at x ∈ Z by the equation Fy (x) = (Φ(x), y)W and introduce the corresponding subspace of C(Z ) Φ(Y ) := span {Fy : y ∈ Y }. [sent-136, score-0.369]
37 Combining the above remarks we obtain the following equivalence between density of kernel representation and feature function density in C(Z ). [sent-139, score-0.328]
38 Theorem 4 If Z is a compact subset of the input space X, K a kernel with feature space representation (3) and Y an orthonormal basis for W, then K(Z) = Φ(Y). [sent-140, score-0.517]
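A rough finite-dimensional illustration of Theorem 4 (my own toy setup, not from the paper): for the kernel K(x, y) = (1 + xy)² on Z = [0, 1], whose features with respect to the standard basis of W = R³ are F_1(x) = 1, F_2(x) = √2 x, F_3(x) = x², sampled kernel sections and sampled features span the same subspace of function values on a grid.

```python
import numpy as np

# Finite illustration of Theorem 4 for the toy kernel K(x, y) = (1 + x*y)^2 on
# Z = [0, 1]: sampled kernel sections K_{y_j} and sampled features F_1(x) = 1,
# F_2(x) = sqrt(2) x, F_3(x) = x^2 span the same 3-dimensional space of function
# values on a grid, so their closed spans in C(Z) coincide in this toy case.
grid = np.linspace(0.0, 1.0, 50)
centers = np.linspace(0.0, 1.0, 10)

sections = (1.0 + grid[:, None] * centers[None, :]) ** 2        # columns K_{y_j}(grid)
features = np.stack([np.ones_like(grid), np.sqrt(2.0) * grid, grid ** 2], axis=1)

assert np.linalg.matrix_rank(sections, tol=1e-8) == 3
# Each feature column is (numerically) in the span of the kernel sections.
residual = features - sections @ np.linalg.lstsq(sections, features, rcond=None)[0]
assert np.max(np.abs(residual)) < 1e-8
```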
39 Parallel to our notion that a kernel is universal, we say a feature map Φ is universal provided that given any compact subset Z of the input space X, any positive number ε and any function f ∈ C(Z) there is a function g ∈ Φ(Y) such that ‖f − g‖_Z ≤ ε. [sent-141, score-0.683]
40 That is, for any choice of compact subset Z of the input space X , the set {Fy : y ∈ Y } is dense in C(Z ). [sent-142, score-0.276]
41 Therefore, with this terminology we can succinctly summarize our conclusion in Theorem 4 by saying that a kernel K expressed in feature space form (3) is universal if and only if its features are universal! [sent-144, score-0.471]
42 We now consider an alternative way to express the universality of a kernel K in terms of the operator T and the corresponding measure µ defined in Equation (4) which determines it. [sent-145, score-0.366]
43 To this end, we recall that the support of a Borel measure ν on Z is defined to be the closed set supp(ν) := ⋂ {S ⊆ Z : S is closed, ν(S^c) = 0}. (15) [sent-146, score-0.553]
44 Consequently, if ∫_Z f(x) dν(x) = 0, ν is a Borel measure with supp(ν) = Z and f is a nonnegative and continuous function on Z, then f = 0. [sent-147, score-0.641]
45 Since supp (µ) = Z , we obtain that g = 0, that is, ν ∈ K(Z )⊥ . [sent-157, score-0.504]
46 We begin with a set {φ j : j ∈ I } of continuous complex-valued functions on X where I is a countable set of indices and define the kernel K by K(x, y) := ∑ φ j (x)φ j (y), j∈I (x, y) ∈ X × X , (16) where we assume that the series converges uniformly on Z × Z for every compact subset Z of X . [sent-163, score-0.396]
47 Theorem 7 The kernel K defined by Equation (16) is universal if and only if the set of features {φ j : j ∈ I } is universal. [sent-165, score-0.381]
48 We shall now apply Theorem 7 to dot product kernels on various domains of R d and Cd . [sent-166, score-0.277]
49 The function G induces the dot product kernel K defined at x, y ∈ Cd by the equation K(x, y) := G((x, y)) (18), where we shall always use (x, y) for the standard inner product between the vectors x and y. [sent-168, score-0.375]
50 Corollary 8 The dot product kernel defined in Equation (18) is universal on C d and Rd . [sent-171, score-0.443]
51 Using the multinomial expansion we conclude that the dot product kernel defined in Equation (18) can be expressed in the form K(x, y) := Σ_{α ∈ Z^d_+} φ_α(x) φ_α(y), x, y ∈ Cd, where the features are defined for α ∈ Z^d_+ at x ∈ Cd as φ_α(x) := (a_{|α|} |α|!/α!)^{1/2} x^α. [sent-173, score-0.237]
52 As is well-known, for example as a special case of the Stone-Weierstrass approximation theorem (Rudin, 1991, page 122), these features are universal on C d and Rd so the result follows from Theorem 7. [sent-174, score-0.254]
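As a concrete instance of Corollary 8 (the choice G(t) = e^t is my own example), the exponential series has strictly positive coefficients a_k = 1/k!, so K(x, y) = exp((x, y)) is a dot product kernel of the form (18). The sketch below only sanity-checks positive semi-definiteness of a small Gram matrix; universality itself is the content of the corollary, not something the code verifies.

```python
import numpy as np

# G(t) = exp(t) has Taylor coefficients a_k = 1/k! > 0, so K(x, y) = exp((x, y))
# is a dot product kernel of the form (18) covered by Corollary 8. Below is only
# a numerical sanity check that a sample Gram matrix is positive semi-definite.
rng = np.random.default_rng(1)
points = rng.normal(size=(8, 4))                 # x_1, ..., x_8 in R^4

gram = np.exp(points @ points.T)                 # entries exp((x_j, x_k))
assert np.linalg.eigvalsh(gram).min() > -1e-8
```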
53 (21) Theorem 10 The kernel given by Equation (21) is universal on S d if and only if for all k ∈ Z+ , ak is positive. [sent-187, score-0.434]
54 Now, if all the a_k, k ∈ Z+, are positive we conclude that span{φ_j : j ∈ I} is the linear space of all polynomials and, in particular, is universal. [sent-196, score-0.346]
55 Translation Invariant Kernels on Rd In the remaining part of this paper we shall focus on translation invariant kernels on R d which have the form K(x, y) = k(x − y), x, y ∈ Rd for some function k which is continuous on Rd . [sent-199, score-0.447]
56 Recall that Bochner (1959) proved that K is a kernel if and only if there is a unique finite Borel measure µ on Rd such that k at any x ∈ Rd has the form k(x) := ∫_{Rd} e^{i(x,y)} dµ(y). [sent-200, score-0.224]
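Bochner's representation (23) can be illustrated numerically (the choice of µ as the standard Gaussian measure and the sample size below are my own): averaging e^{i(x,y)} over draws y from µ recovers, up to Monte Carlo error, the Gaussian translation kernel k(x) = e^{−‖x‖²/2}.

```python
import numpy as np

# Monte Carlo check of Bochner's representation k(x) = int exp(i(x, y)) dmu(y),
# with mu chosen (as an example) to be the standard Gaussian measure on R^d,
# whose characteristic function is the Gaussian kernel exp(-||x||^2 / 2).
rng = np.random.default_rng(2)
d, n_samples = 3, 200_000
y = rng.normal(size=(n_samples, d))              # samples from mu

x = np.array([0.4, -0.2, 0.9])
k_mc = np.mean(np.exp(1j * (y @ x)))             # empirical int exp(i(x, y)) dmu(y)
k_exact = np.exp(-np.dot(x, x) / 2.0)

assert abs(k_mc - k_exact) < 2e-2                # agreement up to Monte Carlo error
```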
57 (23) We shall study the question of the universality of the kernel K in terms of the properties of its corresponding finite Borel measure µ. [sent-201, score-0.385]
58 We start by identifying the input space as X := Rd and then introduce our Hilbert space W of all complex-valued functions on supp(µ) with inner product (f, g)_W := ∫_{supp(µ)} f(x) \overline{g(x)} dµ(x). [sent-202, score-1.096]
59 Note that in this case the Hilbert space W is the usual L2 ( supp (µ), µ) space of all square integrable complex-valued functions relative to the measure µ on supp (µ). [sent-204, score-1.145]
60 Since µ is a finite Borel measure every bounded continuous function on supp (µ) is contained in W . [sent-205, score-0.601]
61 Next, we introduce the set of exponentials E (µ) := {Φ(x) : x ∈ supp (µ)}. [sent-206, score-0.504]
62 We say E (µ) is universal provided that E (µ) is dense in C(Z ) for every compact subset Z of R d . [sent-207, score-0.438]
63 Lemma 11 For each compact subset Z of Rd , K(Z ) = C (Z ) if and only if span E (µ) = C(Z ). [sent-208, score-0.388]
64 Proof We observe by Fubini’s theorem that the map U : B(Z) → W corresponding to Φ in Equation (24) is identified by formula (7) for each ν ∈ B(Z) and y ∈ supp(µ) to be U(ν)(y) := ∫_Z e^{i(x,y)} dν(x). [sent-209, score-0.634]
65 Hence, we see that N (U) = E (µ)⊥ and so U is injective if and only if span E (µ) is dense in C(Z ). [sent-210, score-0.307]
66 Theorem 12 The translation kernel K is universal if and only if the set of exponential features E (µ) is universal. [sent-213, score-0.53]
67 An interesting feature of this result is that whenever a translation kernel K is universal with corresponding measure µ then any kernel corresponding to any other measure ρ with the same support as µ is also universal! [sent-214, score-0.849]
68 Specifically, let Z be a compact subset of Rd, ν a finite Borel measure on Z such that supp(ν) = Z, and T the integral operator defined by (4); then Theorem 6 and Lemma 11 yield the following result. [sent-216, score-0.825]
69 Corollary 13 If K is a translation kernel on Rd then R (T ) = C(Z ) if and only if span E (µ) = C(Z ). [sent-217, score-0.539]
70 Proposition 14 If supp(µ) is a uniqueness subset of Cd then the translation kernel K is universal. [sent-223, score-0.907]
71 Proof By Theorem 12, it suffices to show that for each compact set Z of Rd there does not exist nontrivial ν ∈ B(Z) satisfying for each y ∈ supp(µ) that ∫_Z e^{i(x,y)} dν(x) = 0. [sent-224, score-0.627]
72 (25) Suppose there exists ν ∈ B(Z ) that satisfies (25) for all y ∈ supp (µ). [sent-225, score-0.504]
73 Then the entire function F defined for each z ∈ Cd as F(z) := ∫_Z e^{i(z,x)} dν(x) vanishes on supp(µ). [sent-226, score-0.504]
74 We note here that the proof above adapts to show that for each finite Borel measure ω on Z and p ∈ [1, ∞), K(Z ) = L p (Z , ω) if and only if span E (µ) = L p (Z , ω). [sent-228, score-0.264]
75 Proposition 15 If supp (µ) has positive Lebesgue measure on R d then the translation kernel K is universal. [sent-230, score-0.877]
76 Proposition 16 If the continuous part of µ in its Lebesgue decomposition is nonzero then the translation kernel K is universal. [sent-235, score-0.372]
77 Proof We only need to show that supp (µ) has positive Lebesgue measure if the continuous part µ c of µ in its Lebesgue decomposition is nonzero. [sent-236, score-0.601]
78 Since supp (µc ) = supp (g) ⊆ supp (µ) it follows that supp (µ) has positive Lebesgue measure. [sent-239, score-2.016]
79 A continuous function g : R+ → R determines a radial kernel on Rd × Rd by the formula K(x, y) := g(‖x − y‖²), x, y ∈ Rd (27), where ‖x‖ := (x, x)^{1/2} is the usual Euclidean norm of x ∈ Rd. [sent-241, score-0.355]
80 It was proved in Schoenberg (1938) that K is a kernel on Rd × Rd for all d ∈ N if and only if there exists a finite Borel measure µ on R+ such that for all t ∈ R+, g(t) := ∫_{R+} e^{−tσ} dµ(σ). [sent-242, score-0.224]
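As an example of Schoenberg's representation (the discrete measure below is an arbitrary choice of mine), taking µ = 0.5 δ_{σ=1} + 0.5 δ_{σ=5} gives g(t) = 0.5 e^{−t} + 0.5 e^{−5t}, so K(x, y) = g(‖x − y‖²) is a mixture of two Gaussians. Since this µ is not concentrated at zero, Theorem 17 below asserts that K is universal; the sketch only checks positive semi-definiteness of a sample Gram matrix.

```python
import numpy as np

# Schoenberg's representation with the discrete example measure
# mu = 0.5*delta_{sigma=1} + 0.5*delta_{sigma=5}: g(t) = 0.5 e^{-t} + 0.5 e^{-5t}
# and K(x, y) = g(||x - y||^2) is a mixture of two Gaussians. Since mu is not
# concentrated at zero, Theorem 17 asserts universality; here we only check that
# a sample Gram matrix is positive semi-definite.
sigmas, weights = np.array([1.0, 5.0]), np.array([0.5, 0.5])

def g(t):
    t = np.asarray(t, dtype=float).ravel()
    return np.sum(weights * np.exp(-np.outer(t, sigmas)), axis=1)

rng = np.random.default_rng(3)
pts = rng.normal(size=(7, 2))
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)

gram = g(sq_dists).reshape(sq_dists.shape)
assert np.linalg.eigvalsh(gram).min() > -1e-10
```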
81 Theorem 17 If the measure µ in Equation (28) is not concentrated at zero then the radial kernel K in (27) is universal. [sent-246, score-0.255]
82 Using Fubini’s theorem, we express the function k in (23) for the kernel K in (27) at x ∈ Rd as k(x) := ∫_{Rd} e^{i(x,y)} f(y) dy, where the function f is defined at y ∈ Rd by the equation f(y) := (2π)^{−d} ∫_0^∞ (π/σ)^{d/2} e^{−‖y‖²/(4σ)} dµ(σ). [sent-249, score-0.504]
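The displayed formula for f rests on the Gaussian Fourier transform ∫_{R^d} e^{i(x,y)} e^{−σ‖x‖²} dx = (π/σ)^{d/2} e^{−‖y‖²/(4σ)}. A quick numerical check of this identity in d = 1 (the truncation interval, grid and tolerance are ad hoc choices of mine):

```python
import numpy as np

# Numerical check, in d = 1 and on a truncated interval, of the Gaussian Fourier
# transform behind the formula for f above:
#   int exp(i*x*y) * exp(-sigma*x^2) dx = sqrt(pi/sigma) * exp(-y^2/(4*sigma)).
sigma, y = 0.7, 1.3
x = np.linspace(-30.0, 30.0, 200_001)
integrand = np.exp(1j * x * y - sigma * x ** 2)

lhs = np.sum(integrand) * (x[1] - x[0])          # simple Riemann sum
rhs = np.sqrt(np.pi / sigma) * np.exp(-y ** 2 / (4 * sigma))
assert abs(lhs - rhs) < 1e-6
```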
83 ∫_{supp(µ)} σ^{|α|} x^α e^{−σ‖x‖²} y^α e^{−σ‖y‖²} dµ(σ). [sent-253, score-0.504]
84 This suggests that we introduce the Hilbert space W of real-valued functions on the set M := Z^d_+ × supp(µ) with inner product (F, G)_W := Σ_{α∈Z^d_+} (2^{|α|}/α!) ∫_{supp(µ)} F(α, σ) G(α, σ) dµ(σ) [sent-254, score-0.548]
85 and a feature map Φ : Rd → W defined at (α, σ) ∈ M and x ∈ Rd as Φ(x)(α, σ) := σ^{|α|/2} x^α e^{−σ‖x‖²}. [sent-255, score-0.589]
86 Hence we have the feature space representation for the kernel K given for each x, y ∈ R d as K(x, y) = (Φ(x), Φ(y))W . [sent-256, score-0.3]
87 Now, we let Z be some prescribed compact subset of Rd . [sent-257, score-0.237]
88 Therefore, if there is a positive ρ ∈ supp(µ) and ν ∈ N(U) then for any α ∈ Z^d_+ we have that ∫_Z x^α e^{−ρ‖x‖²} dν(x) = 0. [sent-259, score-0.504]
89 Next, we give a quite different condition on the support of the measure µ so that the corresponding translation kernel is universal. [sent-263, score-0.373]
90 Proposition 18 If supp(µ) is a subgroup of Rd such that for each x ∈ Rd \ {0} the set {(x, y) : y ∈ supp(µ)} is not contained in 2πZ, then K is universal. [sent-265, score-0.504]
91 Proof By Theorem 12, it suffices to show that span E (µ) is dense in C(Z ). [sent-266, score-0.274]
92 Since supp (µ) is a subgroup of Rd we see that 1 ∈ span E (µ) and that for all f , g ∈ span E (µ), both f g and f¯ belong to span E (µ). [sent-268, score-1.199]
93 That is, for each y ∈ supp(µ), e^{i(x1 − x2, y)} = 1 or, in other words, (x1 − x2, y)/2π ∈ Z. [sent-270, score-0.504]
94 Corollary 19 If d = 1, supp(µ) is a subgroup of R and there exist y1, y2 ∈ supp(µ) \ {0} such that y1/y2 is an irrational number, then K is universal. [sent-272, score-1.058]
95 Conclusion We have provided a variety of conditions for a kernel to be universal in terms of properties of its features. [sent-276, score-0.381]
96 Several examples of universal dot product kernels are given. [sent-277, score-0.4]
97 In the case of translation kernels we showed that universality depends on the density of a set of complex exponentials. [sent-278, score-0.395]
98 With this available information a complete characterization of univariate translation kernels follows. [sent-281, score-0.281]
99 Our study indicates that there is an intimate relationship between uniformly approximating a prescribed target function by a kernel and approximating it by its features. [sent-283, score-0.295]
100 Given a prescribed error ε > 0 and a prescribed target function f, what is the relationship between the number of features needed to represent f with error ε and the number of kernel sections needed for the same purpose? [sent-285, score-0.303]
wordName wordTfidf (topN-words)
[('supp', 0.504), ('micchelli', 0.237), ('rd', 0.228), ('span', 0.215), ('universal', 0.206), ('borel', 0.177), ('kernel', 0.175), ('icchelli', 0.156), ('niversal', 0.156), ('schoenberg', 0.156), ('translation', 0.149), ('lax', 0.136), ('hang', 0.132), ('kernels', 0.132), ('compact', 0.123), ('ernels', 0.109), ('cd', 0.106), ('sd', 0.104), ('rudin', 0.104), ('royden', 0.097), ('nn', 0.095), ('hilbert', 0.091), ('zd', 0.089), ('lebesgue', 0.087), ('proposition', 0.085), ('shall', 0.083), ('annihilator', 0.078), ('nhk', 0.078), ('universality', 0.078), ('pk', 0.065), ('prescribed', 0.064), ('operator', 0.064), ('dot', 0.062), ('orthonormal', 0.062), ('hk', 0.061), ('fubini', 0.059), ('ky', 0.059), ('dense', 0.059), ('argyriou', 0.058), ('stein', 0.058), ('yuesheng', 0.058), ('norm', 0.058), ('proceeding', 0.056), ('ei', 0.056), ('equation', 0.055), ('fy', 0.054), ('ak', 0.053), ('subgroup', 0.05), ('subset', 0.05), ('measure', 0.049), ('theorem', 0.048), ('continuous', 0.048), ('pontil', 0.047), ('feature', 0.046), ('ucl', 0.044), ('space', 0.044), ('formula', 0.043), ('jk', 0.042), ('nonnegative', 0.04), ('map', 0.039), ('albany', 0.039), ('beurling', 0.039), ('forthcoming', 0.039), ('haizhang', 0.039), ('syr', 0.039), ('syracuse', 0.039), ('szeg', 0.039), ('kx', 0.038), ('subspace', 0.037), ('density', 0.036), ('representation', 0.035), ('invariant', 0.035), ('integral', 0.035), ('appearing', 0.034), ('reproducing', 0.034), ('lanckriet', 0.034), ('polynomials', 0.034), ('hausdorff', 0.033), ('hermitian', 0.033), ('injective', 0.033), ('fitzgerald', 0.033), ('mercer', 0.033), ('consequence', 0.032), ('corollary', 0.031), ('spaces', 0.031), ('radial', 0.031), ('xu', 0.03), ('tv', 0.03), ('riesz', 0.03), ('china', 0.03), ('sonnenburg', 0.03), ('neumann', 0.03), ('ray', 0.03), ('chebyshev', 0.03), ('uniqueness', 0.029), ('conversely', 0.029), ('rkhs', 0.029), ('approximating', 0.028), ('edition', 0.028), ('eigenfunctions', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 93 jmlr-2006-Universal Kernels
Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang
Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels
2 0.12616417 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
3 0.084963568 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
4 0.076662555 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
5 0.067099437 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
Author: Ross A. Lippert, Ryan M. Rifkin
Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines
6 0.065413617 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)
7 0.063707925 45 jmlr-2006-Learning Coordinate Covariances via Gradients
8 0.062138811 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
9 0.056342103 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
10 0.055756748 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
11 0.044688907 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections (Special Topic on Machine Learning and Optimization)
12 0.04326316 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
13 0.039844345 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
15 0.036342319 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
16 0.035653934 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
17 0.035469081 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
19 0.034337398 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
topicId topicWeight
[(0, 0.191), (1, -0.101), (2, 0.051), (3, -0.172), (4, 0.046), (5, 0.043), (6, -0.04), (7, 0.003), (8, -0.101), (9, -0.044), (10, 0.146), (11, -0.124), (12, -0.097), (13, 0.098), (14, 0.032), (15, -0.025), (16, 0.093), (17, -0.138), (18, -0.228), (19, -0.076), (20, 0.09), (21, -0.134), (22, 0.047), (23, -0.047), (24, 0.094), (25, 0.191), (26, 0.253), (27, -0.092), (28, -0.162), (29, -0.172), (30, 0.047), (31, -0.012), (32, 0.102), (33, -0.097), (34, -0.03), (35, 0.14), (36, 0.012), (37, -0.068), (38, -0.008), (39, 0.029), (40, 0.045), (41, -0.042), (42, -0.137), (43, -0.078), (44, 0.027), (45, -0.106), (46, -0.114), (47, -0.009), (48, 0.086), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.96894228 93 jmlr-2006-Universal Kernels
Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang
Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels
2 0.69330209 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
Author: Andrea Passerini, Paolo Frasconi, Luc De Raedt
Abstract: We develop kernels for measuring the similarity between relational instances using background knowledge expressed in first-order logic. The method allows us to bridge the gap between traditional inductive logic programming (ILP) representations and statistical approaches to supervised learning. Logic programs are first used to generate proofs of given visitor programs that use predicates declared in the available background knowledge. A kernel is then defined over pairs of proof trees. The method can be used for supervised learning tasks and is suitable for classification as well as regression. We report positive empirical results on Bongard-like and M-of-N problems that are difficult or impossible to solve with traditional ILP techniques, as well as on real bioinformatics and chemoinformatics data sets. Keywords: kernel methods, inductive logic programming, Prolog, learning from program traces
4 0.43123156 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
Author: Ross A. Lippert, Ryan M. Rifkin
Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines
5 0.40008175 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
Author: Bernhard Moser
Abstract: Kernels are two-placed functions that can be interpreted as inner products in some Hilbert space. It is this property which makes kernels predestinated to carry linear models of learning, optimization or classification strategies over to non-linear variants. Following this idea, various kernel-based methods like support vector machines or kernel principal component analysis have been conceived which prove to be successful for machine learning, data mining and computer vision applications. When applying a kernel-based method a central question is the choice and the design of the kernel function. This paper provides a novel view on kernels based on fuzzy-logical concepts which allows to incorporate prior knowledge in the design process. It is demonstrated that kernels mapping to the unit interval with constant one in its diagonal can be represented by a commonly used fuzzylogical formula for representing fuzzy rule bases. This means that a great class of kernels can be represented by fuzzy-logical concepts. Apart from this result, which only guarantees the existence of such a representation, constructive examples are presented and the relation to unlabeled learning is pointed out. Keywords: kernel, triangular norm, T -transitivity, fuzzy relation, residuum 1. Motivation Positive-definiteness plays a prominent role especially in optimization and machine learning due to the fact that two-place functions with this property, so-called kernels, can be represented as inner products in some Hilbert space. Thereby, optimization techniques conceived on the basis of linear models can be extended to non-linear algorithms. For a survey of applications see, for example, ¨ Jolliffe (1986), Sch¨ lkopf and Smola (2002) and Scholkopf et al. (1998). o Recently in Moser (2006) it was shown that kernels with values from the unit interval can be interpreted as fuzzy equivalence relations motivated by the idea that kernels express a kind of similarity. This means that the concept of fuzzy equivalence relations, or synonymously fuzzy similarity relations, is more general than that of kernels, provided only values in the unit interval are considered. Fuzzy equivalence relations distinguish from Boolean equivalence relations by a many-valued extension of transitivity which can be interpreted as many-valued logical model of the statement “IF x is similar to y AND y is similar to z THEN x is similar to z”. In contrast to the Boolean case, in many-valued logics the set of truth values is extended such that also assertions, for example, whether two elements x and y are similar, can be treated as a matter of degree. The standard model for the set of (quasi) truth values of fuzzy logic and other many-valued logical systems is the unit interval. If E(x, y) represents the (quasi) truth value of the statement that x is c 2006 Bernhard Moser. M OSER similar to y, then the many-valued version of transitivity is modeled by T (E(x, y), E(y, z)) ≤ E(x, z) where T is a so-called triangular norm which is an extension of the Boolean conjunction. This many-valued concept for transitivity is called T -transitivity. For a survey on triangular norms see, for example, Dubois and Prade (1985), Gottwald (1986), Gottwald (1993) and Klement et al. (2000), ¨ and for fuzzy equivalence relations and T -transitivity see, for example, Bodenhofer (2003), H ohle (1993), H¨ hle (1999), Klement et al. (2000), and Zadeh (1971). 
o Based on the semantics of fuzzy logic, this approach allows to incorporate knowledge-based models for the design of kernels. From this perspective, the most interesting mathematical question is how positive-semidefinite fuzzy equivalence relations can be characterized or at least constructed under some circumstances. At least for some special cases, proofs are provided in Section 4, which motivate further research aiming at establishing a more general theory on the positive-definiteness of fuzzy equivalence relations. These cases are based on the most prominent representatives of triangular norms, that is the Minimum, the Product and the Łukasiewicz t-norm. The paper is structured as follows. First of all, in Section 2, some basic prerequisites concerning kernels and fuzzy relations are outlined. In Section 3, a result about the T -transitivity of kernels from Moser (2006) is cited and interpreted as existence statement that guarantees a representation of kernels mapping to the unit interval with constant 1 in its diagonal by a certain, commonly used, fuzzy-logical construction of a fuzzy equivalence relation. Finally, in contrast to the pure existence theorem of Section 3, in Section 4 constructive examples of fuzzy equivalence relations are provided which are proven to be kernels. In a concluding remark, the relationship to the problem of labeled and unlabeled learning is pointed out. 2. Prerequisites This section summarizes definitions and facts from the theory of kernels as well as from fuzzy set theory which are needed later on. 2.1 Kernels and Positive-Semidefiniteness Preserving Functions There is an extensive literature concerning kernels and kernel-based methods like support vector machines or kernel principal component analysis especially in the machine learning, data mining ¨ and computer vision communities. For an overview and introduction, see, for example, Sch olkopf and Smola (2002). Here we present only what is needed later on. For completeness let us recall the basic definition for kernels and positive-semidefiniteness. Definition 1 Let X be a non-empty set. A real-valued function k : X × X → R is said to be a kernel iff it is symmetric, that is, k(x, y) = k(y, x) for all x, y ∈ X , and positive-semidefinite, that is, ∑n j=1 ci c j k(xi , x j ) ≥ 0 for any n ∈ N, any choice of x1 , . . . , xn ∈ X and any choice of c1 , . . . , cn ∈ R. i, One way to generate new kernels from known kernels is to apply operations which preserve the positive-semidefiniteness property. A characterization of such operations is provided by C. H. FitzGerald (1995). Theorem 2 (Closeness Properties of Kernels) Let f : Rn → R, n ∈ N, then k : X × X → R given by k(x, y) := f (k1 (x, y), . . . , kn (x, y)) 2604 G ENERATING K ERNELS BY F UZZY R ELATIONS is a kernel for any choice of kernels k1 , . . . , kn on X × X iff f is the real restriction of an entire function on Cn of the form f (x1 , . . . , xn ) = ∑ r1 ≥0,...,rn ≥0 r r cr1 ,...,rn x11 · · · xnn (1) where cr1 ,...,rn ≥ 0 for all nonnegative indices r1 , . . . , rn . 2.2 Triangular Norms Triangular norms have been originally studied within the framework of probabilistic metric spaces, see Schweizer and Sklar (1961) and Schweizer and Sklar (1983). In this context, t-norms proved to be an appropriate concept when dealing with triangle inequalities. 
Later on, t-norms and their dual version, t-conorms, have been used to model conjunction and disjunction for many-valued logic, see Dubois and Prade (1985), Gottwald (1986), Gottwald (1993) and Klement et al. (2000). Definition 3 A function T : [0, 1]2 → [0, 1] is called t-norm (triangular norm), if it satisfies the following conditions: (i) (ii) (iii) (iv) ∀x, y ∈ [0, 1] : ∀x, y, z ∈ [0, 1] : ∀x, y, z ∈ [0, 1] : ∀x, y ∈ [0, 1] : T (x, y) = T (y, x) T (x, T (y, z)) = T (T (x, y), z) y ≤ z =⇒ T (x, y) ≤ T (x, z) T (x, 1) = x ∧ T (1, y) = y (commutativity) (associativity) (monotonicity) (boundary condition) Further, a t-norm is called Archimedean if it is continuous and satisfies x ∈ (0, 1) ⇒ T (x, x) < x. Due to its associativity, many-placed extensions Tn : [0, 1]n → [0, 1], n ∈ N, of a t-norm T are uniquely determined by Tn (x1 , . . . , xn ) = T (x1 , Tn−1 (x2 , . . . , xn )). Archimedean t-norms are characterized by the following representation theorem due to Ling (1965): Theorem 4 Let T : [0, 1]2 → [0, 1] be a t-norm. Then T is Archimedean if, and only if, there is a continuous, strictly decreasing function f : [0, 1] → [0, ∞] with f (1) = 0 such that for x, y ∈ [0, 1], T (x, y) = f −1 (min( f (x) + f (y), f (0))). By setting g(x) = exp (− f (x)), Ling’s characterization yields an alternative representation with a multiplicative generator function T (x, y) = g−1 (max(g(x) g(y), g(0))). For g(x) = x we get the product TP (x, y) = x y. The setting f (x) = 1 − x yields the so-called Łukasiewcz t-norm TL (x, y) = max(x + y − 1, 0). Due to Ling’s theorem 4 an Archimedean t-norm T is isomorphic either to TL or TP , depending on whether the additive generator takes a finite value at 0 or not. In the former case, the Archimedean t-norm is called non-strict, in the latter it is called strict. 2605 M OSER A many-valued model of an implication is provided by the so-called residuum given by → T (a, b) = sup{c ∈ [0, 1]|T (a, c) ≤ b} (2) where T is a left-continuous t-norm. Equation (2) is uniquely determined by the so-called adjunction property → ∀a, b, c ∈ [0, 1] : T (a, b) ≤ c ⇔ a ≤ T (b, c). Consequently, the operator ↔ → → T (a, b) = min T (a, b), T (b, a) (3) (4) models a biimplication. For details, for example, see Gottwald (1986) and Klement et al. (2000). → Tables 1 and 2 list examples of t-norms with their induced residuum T . For further examples see, for example, Klement et al. (2000). √ √ Tcos (a, b) = max(ab − 1 − a2 1 − b2 , 0) TL (a, b) = max(a + b − 1, 0) TP (a, b) = ab TM (a, b) = min(a, b) Table 1: Examples of t-norms → T cos (a, b) = → T L (a, b) = → = T P (a, b) → T M (a, b) = cos(arccos(b) − arccos(a)) if a > b, 1 else min(b − a + 1, 1) b if a > b, a 1 else b if a > b, 1 else Table 2: Examples of residuums 2.3 T -Equivalences If we want to classify based on a notion of similarity or indistinguishability, we face the problem of transitivity. For instance, let us consider two real numbers to be indistinguishable if and only if they differ by at most a certain bound ε > 0, this is modeled by the relation ∼ ε given by x ∼ε y :⇔ |x−y| < ε, ε > 0, x, y ∈ R. Note that the relation ∼ε is not transitive and, therefore, not an equivalence relation. The transitivity requirement turns out to be too strong for this example. The problem of identification and transitivity in the context of similarity of physical objects was early pointed out and discussed philosophically by Poincar´ (1902) and Poincar´ (1904). 
In the framework of fuzzy e e logic, the way to overcome this problem is to model similarity by fuzzy relations based on a many¨ valued concept of transitivity, see Bodenhofer (2003), H ohle (1993), H¨ hle (1999), Klement et al. o (2000) and Zadeh (1971). 2606 G ENERATING K ERNELS BY F UZZY R ELATIONS Definition 5 A function E : X 2 −→ [0, 1] is called a fuzzy equivalence relation, or synonymously, T -equivalence with respect to the t-norm T if it satisfies the following conditions: (i) ∀x ∈ X : E(x, x) = 1 (reflexivity) (ii) ∀x, y ∈ X : E(x, y) = E(y, x) (symmetry) (iii) ∀x, y, z ∈ X : T (E(x, y), E(y, z)) ≤ E(x, z) (T-transitivity). The value E(x, y) can be also looked at as the (quasi) truth value of the statement “x is equal to y”. Following this semantics, T-transitivity can be seen as a many-valued model of the proposition, “If x is equal to y and y is equal to z, then x is equal to z”. T -equivalences for Archimedean t-norms are closely related to metrics and pseudo-metrics as shown by Klement et al. (2000) and Moser (1995). Theorem 6 Let T be an Archimedean t-norm given by ∀a, b ∈ [0, 1] : T (a, b) = f −1 (min( f (a) + f (b), f (0))), where f : [0, 1] → [0, ∞] is a strictly decreasing, continuous function with f (1) = 0. (i) If d : X 2 → [0, ∞[ is a pseudo-metric, then the function Ed : X 2 → [0, 1] defined by Ed (x, y) = f −1 (min(d(x, y), f (0))) is a T -equivalence with respect to the t-norm T . (ii) If E : X 2 → [0, 1] is a T -equivalence relation, then the function dE : X 2 → [0, ∞] defined by dE (x, y) = f (E(x, y)) is a pseudo-metric. → Another way to construct T -equivalences is to employ T -operators. The proof of the following assertion can be found in Trillas and Valverde (1984), Kruse et al. (1993) and Kruse et al. (1994). ↔ Theorem 7 Let T be a left-continuous t-norm, T its induced biimplication, µi : X → [0, 1], i ∈ I, I non-empty; then E : X × X → [0, 1] given by ↔ E(x, y) = inf T (µi (x), µi (y)) i∈I (5) is a T -equivalence relation. ¨ For further details on T -equivalences see also Boixader and Jacas (1999), H oppner et al. (2002), Jacas (1988), Trillas et al. (1999) and Valverde (1985). 3. Representing Kernels by T -Equivalences It is interesting that the concept of kernels, which is motivated by geometric reasoning in terms of inner products and mappings to Hilbert spaces and which is inherently formulated by algebraic terms, is closely related to the concept of fuzzy equivalence relations as demonstrated and discussed in more detail in Moser (2006). In this section, we start with the result that any kernel k : X × X → [0, 1] with k(x, x) = 1 for all x ∈ X is T -transitive and, therefore, a fuzzy equivalence relation. The proof can be found in Moser (2006), see also Appendix A.1. 2607 M OSER Theorem 8 Any kernel k : X × X → [0, 1] with k(x, x) = 1 is (at least) Tcos -transitive, where 1 − a2 Tcos (a, b) = max{a b − 1 − b2 , 0}. (6) The nomenclature is motivated by the fact that the triangular norm defined by Equation (6) is an Archimedean t-norm which is generated by the arcosine function as its additive generator. From this result, the following existence theorem can be derived, which guarantees that any kernel under consideration can be represented by the fuzzy-logical formula given by (5). In fuzzy systems, this formula is commonly used for modeling rule bases (see, for example, Kruse et al., 1993, 1994). 
Theorem 9 Let X be a non-empty universe of discourse, k : X × X → [0, 1] a kernel in the sense of Definition 1 and k(x, x) = 1 for all x ∈ X ; then there is a family of membership functions µ i : X → [0, 1], i ∈ I, I non-empty and a t-norm T , such that ↔ ∀x, y ∈ X : k(x, y) = inf T (µi (x), µi (y)). i∈I (7) Proof. Let us set I := X , µx0 (x) = k(x, x0 ) and let us choose Tcos as t-norm. For convenience let us denote ↔ h(x, y) = inf T cos (µx0 (x), µx0 (y)), x0 ∈X which is equivalent to ↔ h(x, y) = inf T cos (k(x0 , x), k(x0 , y)). x0 ∈X According to Theorem 8, k is Tcos -transitive, that is, ↔ ∀x0 , x, y ∈ X : T cos (k(x0 , x), k(x0 , y)) ≤ k(x, y). This implies that h(x, y) ≤ k(x, y) for all x, y ∈ X . Now let us consider the other inequality. Due to the adjunction property (3), we obtain → Tcos (k(x, y), k(x0 , y)) ≤ k(x, x0 ) ⇔ k(x, y) ≤ T cos (k(x0 , y), k(x, x0 )) and → Tcos (k(x, y), k(x0 , x)) ≤ k(y, x0 ) ⇔ k(x, y) ≤ T cos (k(x0 , x), k(y, x0 )), from which it follows that → → ∀x, y, x0 ∈ X : k(x, y) ≤ min{ T cos (k(x0 , y), k(x, x0 )), T cos (k(x0 , x), k(y, x0 ))}. Hence by Definition 4, ∀x, y ∈ X : k(x, y) ≤ h(x, y) which ends the proof. For an arbitrary choice of fuzzy membership functions, there is no necessity that the resulting relation (7) implies positive-semidefiniteness and, therefore, a kernel. For an example of a Tcos equivalence which is not a kernel see Appendix A.4. Theorem 9 guarantees only the existence of a representation of the form (5) but it does not tell us how to construct the membership functions µ i . In the following section, we provide examples of fuzzy equivalence relations which yield kernels for any choice of membership functions. 2608 G ENERATING K ERNELS BY F UZZY R ELATIONS 4. Constructing Kernels by Fuzzy Equivalence Relations In the Boolean case, positive-definiteness and equivalence are synonymous, that is, a Boolean relation R : X × X → {0, 1} is positive-definite if and only if R is the indicator function of an equivalence relation ∼ that is, R(x, y) = 1 if x ∼ y and R(x, y) = 0 if x ∼ y. For a proof, see Appendix A.2. This = = =, relationship can be used to obtain an extension to fuzzy relations as given by the next theorem whose proof can be found in the Appendix A.3. Theorem 10 Let X be a non-empty universe of discourse, µ i : X → [0, 1], i ∈ I, I non-empty; then the fuzzy equivalence relation EM : X × X → [0, 1] given by ↔ EM (x, y) = inf T M (µi (x), µi (y)) i∈I is positive-semidefinite. In the following, the most prominent representatives of Archimedean t-norms, the Product TP and the Łukasiewicz t-norm TL , are used to construct positive-semidefinite fuzzy similarity relations. Though the first part can also be derived from a result due to Yaglom (1957) that characterizes isotropic stationary kernels by its spectral representation, here we prefer to present a direct, elementary proof. Compare also Bochner (1955) and Genton (2001). Theorem 11 Let X be a non-empty universe of discourse, ν : X → [0, 1] and let h : [0, 1] → [0, 1] be an isomorphism of the unit interval that can be expanded in the manner of Equation (1), that is h(x) = ∑k ck xk with ck ≥ 0; then the fuzzy equivalence relations EL,h , EP,h : X × X → [0, 1] given by ↔ EL,h (x, y) = h T L h−1 (ν(x)) , h−1 (ν(y)) and ↔ EP,h (x, y) = h T P h−1 (ν(x)) , h−1 (ν(y)) (8) (9) are positive-semidefinite. Proof. 
To prove the positive-definiteness of the two-placed functions E L,h and EP,h given by equations (8) and (9) respectively, we have to show that n n ∑ i, j=1 EL,h (xi , xi ) ci c j ≥ 0, ∑ i, j=1 EP,h (xi , x j ) ci c j ≥ 0 for any n ∈ N and any choice of x1 , . . . , xn ∈ X , respectively. According to an elementary result from Linear Algebra this is equivalent to the assertion that the determinants (1 ≤ m ≤ n) Dm = det (E(xi , x j ))i, j∈{1,...,m} of the minors of the matrix (E(xi , x j ))i, j satisfy ∀m ∈ {1, . . . , n} : Dm ≥ 0, where E denotes either EL,h or EP,h . Recall that the determinant of a matrix is invariant with respect to renaming the indices, that is, if σ : {1, . . . , n} → {1, . . . , n} is a permutation then det [(ai j )i, j ] = det (aσ(i)σ( j) )i, j . 2609 M OSER For convenience, let denote µi = h−1 (ν(xi )). Then, without loss of generality, we may assume that the values µi are ordered monotonically decreasing, that is, µi ≥ µ j for i < j. ↔ → (10) → Case TL : Note that T L (a, b) = min{ T L (a, b), T L (b, a)} = 1 − |a − b|. Then we have to show that for all dimensions n ∈ N, the determinant of E (n) = (1 − |µi − µ j |)i, j∈{1,...,n} is non-negative, that is Due to the assumption (10), we have det[E (n) ] ≥ 0. 1 − |µi − µ j | = 1 − (µi − µ j ) if i ≤ j, 1 − (µ j − µi ) else which yields . . . 1 − (µ1 − µn−1 ) 1 − (µ1 − µn ) . . . 1 − (µ2 − µn−1 ) 1 − (µ2 − µn ) . . . 1 − (µ3 − µn−1 ) 1 − (µ3 − µn ) (n) E = . . . .. . . . . . 1 − (µ1 − µn−1 ) 1 − (µ2 − µn−1 ) . . . 1 1 − (µn−1 − µn ) 1 − (µ1 − µn ) 1 − (µ2 − µn ) . . . 1 − (µn−1 − µn ) 1 1 − (µ1 − µ2 ) 1 1 − (µ2 − µ3 ) . . . 1 1 − (µ1 − µ2 ) 1 − (µ1 − µ3 ) . . . Now let us apply determinant-invariant elementary column operations to simplify this matrix by subtracting the column with index i − 1 from the column with index i, i ≥ 2. This yields 1 µ2 − µ1 ... µn−1 − µn−2 µn − µn−1 1 − (µ1 − µ2 ) −(µ2 − µ1 ) . . . µn−1 − µn−2 µn − µn−1 1 − (µ1 − µ3 ) −(µ2 − µ1 ) . . . µn−1 − µn−2 µn − µn−1 ˜ E (n) = . . . . . .. . . . . . . . . . 1 − (µ1 − µn−1 ) −(µ2 − µ1 ) . . . −(µn−2 − µn−1 ) µn − µn−1 1 − (µ1 − µn ) −(µ2 − µ1 ) . . . −(µn−2 − µn−1 ) −(µn−1 − µn ) Therefore, α = n ∏(µi−1 − µi ) ≥ 0 (11) i=2 ˜ ˆ det[E (n) ] = det[E (n) ] = α det[En ], where . . . −1 −1 . . . −1 −1 . . . −1 −1 (n) ˆ E = . . . .. . . . . . 1 − (µ1 − µn−1 ) +1 . . . +1 −1 1 − (µ1 − µn ) +1 . . . +1 +1 1 1 − (µ1 − µ2 ) 1 − (µ1 − µ3 ) . . . 2610 −1 +1 +1 . . . (12) G ENERATING K ERNELS BY F UZZY R ELATIONS Let us apply Laplacian determinant expansion by minors to the first column of matrix (12), that is n det[A] = ∑ (−1)i+ j ai j det[Ai j ] i=1 where A = (ai j ) is an n × n-matrix, j arbitrarily chosen from {1, . . . , n} and Ai j is the matrix corresponding to the cofactor ai j obtained by canceling out the i-th row and the j-th column from A (see, ˆ for example, Muir, 1960). For n = 1, we get the trivial case det[ E (1) ] = 1. Note that the first and (n) ˆ the last rows of the matrices Ei,1 for 1 < i < n only differ by their signum, consequently the minors ˆ (n) det[Ei,1 ] for 1 < i < n, n ≥ 2, are vanishing, that is, det[Ai,1 ] = 0, for 1 < i < n. Therefore, according to the Laplacian expansion, we get (n) (n) ˆ ˆ ˆ det[E (n) ] = 1 · det[E1,1 ] + (−1)n (1 − (µ1 − µn )) · det[E1,n ]. (13) Observe that (n) ˆ det[E1,1 ] = 2n−2 (n) ˆ det[E1,n ] = (−1)n−1 2n−2 . Consequently, Equation (13) simplifies to ˆ det[E (n) ] = 2n−2 1 + (−1)n (−1)n−1 2n−2 (1 − (µ1 − µn )) = 2n−2 (1 − (1 − (µ1 − µn ))) = 2n−2 (µ1 − µn ) ≥ 0 which together with (11) proves the first case. 
↔ → → Case TP : First of all, let us compute T P (a, b) = min{ T P (a, b), T L (b, a)}. Hence, min{ b , a } if a, b > 0, a b 0 ↔ if a = 0 and b > 0 , T P (a, b) = 0 if b = 0 and a > 0 , 1 if a = 0 and b = 0 . Again, without loss of generality, let us suppose that the values µ i , i ∈ {1, . . . , n} are ordered monotonically decreasing, that is µ1 ≥ µ2 ≥ . . . ≥ µn . Before checking the general case, let us consider the special case of vanishing µ-values. For this, let us assume for the moment that µi = > 0 if i < i0 , 0 else ↔ ↔ which implies that T P (µi , µ j ) = 0 for i < i0 and j ≥ i0 and T P (µi , µ j ) = 1 for i ≥ i0 and j ≥ i0 . This leads to a decomposition of the matrix ↔ E (n) = T P (µi , µ j ) 2611 ij M OSER such that det[E (n) ] = det[E (i0 −1) ] · det[In−i0 −1 ] where Ik denotes the k × k-matrix with constant entries 1, hence det[In−i0 −1 ] ∈ {0, 1}. Therefore, we may assume that µ1 ≥ µ2 ≥ . . . ≥ µn > 0. Then we have to show that for all dimensions n ∈ N, the determinant of µi µ j , µ j µi E (n) = min i, j∈{1,...,n} is non-negative, that is det[E (n) ] ≥ 0. Consider 1 µ2 µ1 µ3 µ (n) E = .1 . . µn−1 µ1 µn µ1 µ2 µ1 1 µ3 µ2 . . . µn−1 µ2 µn µ2 ... ... ... .. . ... ... Now, multiply the i-th column by −µi+1 /µi and add 1 ≤ i < n, then we get 1 0 ... 2 µ2 ∗ 1 − ... µ1 ∗ ∗ ... . ˜ .. E (n) = . . . . . . ∗ ∗ ... 1− ∗ ∗ ... µn−1 µ1 µn−1 µ2 µn−1 µ3 µn µ1 µn µ2 µn µ3 . . . . µn µn−1 1 . . . 1 µn µn−1 (14) it to the (i + 1)-th column of matrix (14), 0 0 0 0 0 . . . 0 . . . µn−1 µn−2 2 ∗ 0 1− µn µn−1 2 (15) where ∗ is a placeholder for any real value. By this, the determinant of the matrix in Equation (15) readily turns out to be n−1 µi+1 ˜ det[E (n) ] = det[E (n) ] = ∏ 1 − µi i=1 2 ≥0 which together with Theorem (2) ends the proof. Note that relations (8) and (9) are T -transitive with respect to the corresponding isomorphic Archimedean t-norms, TL,h (x, y) = h(TL (h−1 (x), h−1 (x))) and TP,h (x, y) = h(TP (h−1 (x), h−1 (x))), respectively. 2612 G ENERATING K ERNELS BY F UZZY R ELATIONS Corollary 12 Let X be a non-empty universe of discourse, µ i : X → [0, 1], λi ∈ ]0, 1] with ∑i λi = 1 ˜ ˜ where i ∈ {1, . . . , n}, n ∈ N, then the fuzzy equivalence relations EL , EP : X × X → [0, 1] given by n ↔ ˜ EL (x, y) = ∑ λi T L (µi (x), µi (y)) (16) i=1 and n ↔ ˜ EP (x, y) = ∏ T P (µi (x), µi (y)) λi (17) i=1 are TL - and TP -equivalences, respectively, and kernels. Proof. First of all, let us check the TL -transitivity of formula (16). This can readily be shown by ↔ means of the definition of TL and the TL -transitivity of T L due to the following inequalities: n TL n ↔ i=1 n n ↔ ↔ n ↔ ↔ ∑ λi T L (µi (x), µi (y)) + ∑ λi T L (µi (y), µi (z)) − 1 , 0 i=1 i=1 n max = i=1 i=1 n = i=1 ∑ λi T L (µi (x), µi (y)) + ∑ λi T L (µi (y), µi (z)) − 1, 0 max max ↔ ∑ λi T L (µi (x), µi (y)), ∑ λi T L (µi (y), µi (yz) n ↔ ↔ ∑ λi TL T L (µi (x), µi (y)), ∑ λi T L (µi (y), µi (z)) , 0 i=1 i=1 n max ↔ ∑ λi T L (µi (x), µi (z)), 0 ≤ ≤ = i=1 ↔ λi T L (µi (x), µi (z)). ↔ This, together with the TP -transitivity of T P , proves that the formulas given by (16) and (17) are TL and TP -equivalences, respectively. Expanding the factors of formula (17) yields 1 if µi (x) = µi (y) = 0, λi ↔ λi λi (18) T P (µi (x), µi (y)) = min(µiλi(x),µiλi(y)) else max(µi (x),µi (y)) which by comparing case TP of the proof of Theorem 11 shows that the left-hand side of Equation (18) is positive-semidefinite. 
As the convex combination and the product are special cases of positive-semidefiniteness preserving functions according to Theorem 1, the functions defined by equations (16) and (17) prove to be again positive-semidefinite and, therefore, kernels. It is interesting to observe that both formulas (16) and (17) can be expressed in the form, f ( τ(x) − τ(y) 1 ), where f : I → [0, 1], I some interval, is a strictly decreasing function, τ : X → I n , I some interval, τ(x) = (τ1 (x), . . . , τn (x)) and τ(x) 1 = ∑n |τi (x)|. Indeed, for Equation (16) let us define i=1 fL : [0, 1] → [0, 1], fL (a) = 1 − a τL : X → [0, 1] , τL (x) = (λ1 µ1 (x), . . . , λn µn (x)) n 2613 M OSER and for Equation (17) and positive membership functions µ i , µi (x) > 0 for all x ∈ X , let us define fP : [0, ∞[→ [0, 1], fP (a) = e−a τP : X → ] − ∞, 1]n , τP (x) = (λ1 ln(µ1 (x)), . . . , λn ln(µn (x))) Therefore, we get ˜ EL (x, y) = 1 − τL (x) − τL (y) ˜ EP (x, y) = e− τP (x)−τP (y) 1 . 1 (19) (20) While formulas (19) and (20) provide a geometrical interpretation by means of the norm . 1 , the corresponding formulas (16) and (17) yield a semantical model of the assertion “IF x is equal to y with respect to feature µ1 AND . . . AND x is equal to y with respect to feature µn THEN x is equal to y” as aggregation of biimplications in terms of fuzzy logic. While in the former case, the aggregation has some compensatory effect, the latter is just a conjunction in terms of the Product triangular norm. For details on aggregation operators see, for example, Saminger et al. (2002) and Calvo et al. (2002). The formulas (16) and (17) coincide for the following special case. If the membership functions µi are indicator functions of sets Ai ⊆ X which form a partition of X , then the kernels (16) and (17) reduce to the indicator function characterizing the Boolean equivalence relation induced by this partition {A1 , . . . , An }. The formulas (16) and (17) for general membership functions therefore provide kernels which can be interpreted to be induced by a family of fuzzy sets and, in particular, by fuzzy partitions, that is, families of fuzzy sets fulfilling some criteria which extend the axioms for a Boolean partition in a many-valued logical sense. For definitions and further details on fuzzy partitions see, for ¨ example, De Baets and Mesiar (1998), Demirci (2003) and H oppner and Klawonn (2003). It is a frequently used paradigm that the decision boundaries for a classification problem lie between clusters rather than intersecting them. Due to this cluster hypothesis, the problem of designing kernels based on fuzzy partitions is closely related to the problem of learning kernels from unlabeled data. For further details on semi-supervised learning see, for example, Seeger (2002), Chapelle et al. (2003) and T. M. Huang (2006). It is left to future research to explore this relationship to the problem of learning from labeled and unlabeled data and related concepts like covariance kernels. 5. Conclusion In this paper, we have presented a novel view on kernels from a fuzzy logical point of view. Particularly, the similarity-measure aspect of a kernel is addressed and investigated by means of the so-called T -transitivity which is characteristic for fuzzy equivalence relations. As a consequence, we derived that a large class of kernels can be represented in a way that is commonly used for representing fuzzy rule bases. In addition to this proof for the existence of such a representation, constructive examples are presented. 
It is the idea of this research to look for a combination of knowledge-based strategies with kernel-based methods in order to facilitate a more flexible design process for kernels, one that also allows prior knowledge to be incorporated. Further research aims at analyzing the behavior of kernels constructed in this way when applied in the various kernel methods such as support vector machines, kernel principal component analysis and others. In particular, it is intended to focus on the problem of learning kernels from unlabeled data where the fuzzy partitions are induced by appropriate clustering principles.

Acknowledgments

Bernhard Moser gratefully acknowledges partial support by the Austrian Government, the State of Upper Austria, and the Johannes Kepler University Linz in the framework of the Kplus Competence Center Program. Furthermore, special thanks go to the anonymous reviewers who gave helpful suggestions and to Felix Kossak for careful proof-reading.

Appendix A.

For the sake of completeness, the following sections provide proofs regarding Theorem 8, the characterization of kernels in the Boolean case, and the construction of kernels by means of the minimum t-norm T_M. Furthermore, in Section A.4 an example of a non-positive-semidefinite T_cos-equivalence is given.

A.1 Proof of Theorem 8

Let us start with the analysis of 3-dimensional matrices.

Lemma 13 Let M = (m_ij)_{ij} ∈ [0, 1]^{3×3} be a 3 × 3 symmetric matrix with m_ii = 1, i = 1, 2, 3; then M is positive-semidefinite iff for all i, j, k ∈ {1, 2, 3} there holds

m_ij m_jk − √(1 − m_ij^2) √(1 − m_jk^2) ≤ m_ik.

Proof. For simplicity, let a = m_12, b = m_13 and c = m_23. Then the determinant of M, det(M), is a function of the variables a, b, c given by

D(a, b, c) = 1 + 2abc − a^2 − b^2 − c^2.

For any choice of a, b, the quadratic equation D(a, b, c) = 0 can be solved for c, yielding two solutions c_1 = c_1(a, b) and c_2 = c_2(a, b) as functions of a and b,

c_1(a, b) = ab − √(1 − a^2) √(1 − b^2),
c_2(a, b) = ab + √(1 − a^2) √(1 − b^2).

Obviously, for all |a| ≤ 1 and |b| ≤ 1, the values c_1(a, b) and c_2(a, b) are real. By substituting a = cos(α) and b = cos(β) with α, β ∈ [0, π/2], it becomes readily clear that

c_1(a, b) = cos(α) cos(β) − sin(α) sin(β) = cos(α + β) ∈ [−1, 1]

and, analogously,

c_2(a, b) = cos(α) cos(β) + sin(α) sin(β) = cos(α − β) ∈ [−1, 1].

As for all a, b ∈ [−1, 1] the determinant function D_{a,b}(c) := D(a, b, c) is quadratic in c with negative coefficient for c^2, there is a uniquely determined maximum at c_0(a, b) = ab. Note that for all a, b ∈ [−1, 1] we have c_1(a, b) ≤ c_0(a, b) ≤ c_2(a, b) and

D(a, b, c_0(a, b)) = 1 + 2ab(ab) − a^2 − b^2 − (ab)^2 = (1 − a^2)(1 − b^2) ≥ 0.

Therefore, D(a, b, c) ≥ 0 if and only if c ∈ [c_1(a, b), c_2(a, b)]. Recall from linear algebra that renaming the indices does not change the determinant. Therefore, without loss of generality, we may assume that a ≥ b ≥ c. For convenience, let Q = {(x, y, z) ∈ [0, 1]^3 | x ≥ y ≥ z}. Then, obviously, for any choice of a, b ∈ [0, 1] there holds (a, b, c_1(a, b)) ∈ Q. Elementary algebra shows that (a, b, c_2(a, b)) ∈ Q is only the case for a = b = 1. As for a = b = 1 the two solutions c_1, c_2 coincide, that is, c_1(1, 1) = c_2(1, 1) = 1, it follows that for any choice of (a, b, c) ∈ Q there holds D(a, b, c) ≥ 0 if and only if

c_1(a, b) = ab − √(1 − a^2) √(1 − b^2) ≤ c.                                          (21)

If (a, b, c) ∉ Q, then the corresponding inequality of the lemma is trivially satisfied, which together with (21) proves the lemma.

Now Theorem 8 immediately follows from Definition 1, Lemma 13 and the characterizing inequality (21).
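A quick numerical cross-check of Lemma 13, added here purely for illustration (the helper name lemma13_criterion and the tolerances are choices made for this sketch): for random symmetric 3 × 3 matrices with unit diagonal and entries in [0, 1], the criterion of Lemma 13 is compared against positive-semidefiniteness determined from the eigenvalues.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)

def lemma13_criterion(M):
    """Lemma 13: m_ij*m_jk - sqrt(1 - m_ij^2)*sqrt(1 - m_jk^2) <= m_ik for all i, j, k.
    Triples with repeated indices hold trivially, so distinct permutations suffice."""
    return all(
        M[i, j] * M[j, k] - np.sqrt(1 - M[i, j] ** 2) * np.sqrt(1 - M[j, k] ** 2)
        <= M[i, k] + 1e-9
        for i, j, k in permutations(range(3), 3)
    )

trials, agreements = 20000, 0
for _ in range(trials):
    a, b, c = rng.uniform(0.0, 1.0, size=3)
    M = np.array([[1.0, a, b],
                  [a, 1.0, c],
                  [b, c, 1.0]])
    psd = np.linalg.eigvalsh(M).min() >= -1e-9
    agreements += (psd == lemma13_criterion(M))

print(agreements / trials)   # expected 1.0, up to numerically borderline cases
```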
A.2 Characterization of Kernels in the Boolean Case

The following lemma and proposition can also be found as an exercise in Schölkopf and Smola (2002).

Lemma 14 Let ∼ be an equivalence relation on X and let k : X × X → {0, 1} be induced by ∼ via k(x, y) = 1 if and only if x ∼ y; then k is a kernel.

Proof. By the definition of positive-definiteness, let us consider an arbitrary sequence of elements x_1, . . . , x_n. Then there are at most n equivalence classes Q_1, . . . , Q_m on the set of indices {1, . . . , n}, m ≤ n, where ∪_{p=1,...,m} Q_p = {1, . . . , n} and Q_p ∩ Q_q = ∅ for p ≠ q. Note that k(x_i, x_j) = 0 if the indices i, j belong to different equivalence classes. Then, for any choice of reals c_1, . . . , c_n, we obtain

∑_{i,j} c_i c_j k(x_i, x_j) = ∑_{p=1}^{m} ∑_{i,j ∈ Q_p} c_i c_j k(x_i, x_j) = ∑_{p=1}^{m} ∑_{i,j ∈ Q_p} c_i c_j · 1 = ∑_{p=1}^{m} ( ∑_{i ∈ Q_p} c_i )^2 ≥ 0.

Proposition 15 k : X × X → {0, 1} with k(x, x) = 1 for all x ∈ X is a kernel if and only if it is induced by an equivalence relation.

Proof. It only remains to be shown that if k is a kernel, then it is the indicator function of an equivalence relation, that is, it is induced by an equivalence relation. If k is a kernel, according to Lemma 13, for all x, y, z ∈ X it has to satisfy T_cos(k(x, y), k(y, z)) ≤ k(x, z), which implies

k(x, y) = 1, k(y, z) = 1  ⟹  k(x, z) = 1.

Obviously, we have k(x, x) = 1 and k(x, y) = k(y, x) due to the reflexivity and symmetry assumption on k, respectively.

A.3 Constructing Kernels by T_M

For convenience, let us recall the basic notion of an α-cut from fuzzy set theory:

Definition 16 Let X be a non-empty set and µ : X → [0, 1]; then [µ]_α = {x ∈ X | µ(x) ≥ α} is called the α-cut of the membership function µ.

Lemma 17 k : X × X → [0, 1] is a T_M-equivalence if and only if all α-cuts of k are Boolean equivalence relations.

Proof. (i) Let us assume that k is a T_M-equivalence. Let α ∈ [0, 1]; then by definition [k]_α = {(x, y) ∈ X × X | k(x, y) ≥ α}. In order to show that [k]_α is a Boolean equivalence, the axioms of reflexivity, symmetry and transitivity have to be checked. Reflexivity and symmetry are trivially satisfied, as for all x, y ∈ X there holds by assumption k(x, x) = 1 and k(x, y) = k(y, x). In order to show transitivity, let us consider (x, y), (y, z) ∈ [k]_α, that is, k(x, y) ≥ α and k(y, z) ≥ α; then by the T_M-transitivity assumption it follows that α ≤ min(k(x, y), k(y, z)) ≤ k(x, z), hence (x, z) ∈ [k]_α.

(ii) Suppose now that all α-cuts of k are Boolean equivalence relations. Then, in particular, [k]_α with α = 1 is reflexive, hence k(x, x) = 1 for all x ∈ X. The symmetry of k follows from the fact that for all α ∈ [0, 1] and pairs (x, y) ∈ [k]_α we have, by assumption, (y, x) ∈ [k]_α. In order to show the T_M-transitivity property, let us consider arbitrarily chosen elements x, y, z ∈ X. Let α = min(k(x, y), k(y, z)); then by the transitivity assumption on [k]_α it follows that (x, z) ∈ [k]_α, consequently k(x, z) ≥ α = min(k(x, y), k(y, z)).
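The following toy example, added for illustration only (the cluster labels and level values 0.2, 0.6, 1.0 are arbitrary choices for this sketch), constructs a T_M-equivalence from two nested partitions, checks that its α-cuts are Boolean equivalence relations (Lemma 17), and verifies the positive-semidefiniteness of the cut indicators (Lemma 14) and of the kernel itself, anticipating Proposition 18 below.

```python
import numpy as np

# Cluster labels of 8 points at two nested granularities; the fine clusters
# refine the coarse ones, so the induced similarity is min-transitive.
coarse = np.array([0, 0, 0, 0, 1, 1, 1, 1])
fine   = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# k(x, y) = 1.0 if same fine cluster, 0.6 if only same coarse cluster, 0.2 otherwise.
levels = [(np.ones_like(coarse), 0.2), (coarse, 0.6), (fine, 1.0)]
n = len(coarse)
K = np.zeros((n, n))
for labels, alpha in levels:
    K = np.maximum(K, alpha * (labels[:, None] == labels[None, :]))

def is_boolean_equivalence(R):
    """Reflexive, symmetric, transitive 0/1 relation."""
    return (np.all(np.diag(R) == 1) and np.array_equal(R, R.T)
            and np.all((R @ R > 0) <= (R > 0)))

# Lemma 17: every alpha-cut of a T_M-equivalence is a Boolean equivalence relation.
print(all(is_boolean_equivalence((K >= alpha).astype(int)) for alpha in (0.2, 0.6, 1.0)))

# Lemma 14 and Proposition 18: each cut indicator and K itself are positive-semidefinite.
print(np.linalg.eigvalsh((K >= 0.6).astype(float)).min() >= -1e-12)
print(np.linalg.eigvalsh(K).min() >= -1e-12)
```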
Proposition 18 If k : X × X → [0, 1] is a T_M-equivalence, then it is positive-semidefinite.

Proof. Choose arbitrary elements x_1, . . . , x_n ∈ X, consider the set of values taken by all combinations k(x_i, x_j), i, j ∈ {1, . . . , n}, and order them increasingly, that is,

{ k(x_i, x_j) | i, j ∈ {1, . . . , n} } = { α_1, . . . , α_m },   where 0 ≤ α_1 ≤ · · · ≤ α_m ≤ 1.

Observe that for all pairs (x_i, x_j), i, j ∈ {1, . . . , n}, there holds

k(x_i, x_j) = ∑_{v=2}^{m} ( α_v − α_{v−1} ) 1_{[k]_{α_v}}(x_i, x_j) + α_1 1_{[k]_{α_1}}(x_i, x_j),

showing that on the set {x_1, . . . , x_n} × {x_1, . . . , x_n} the function k is a linear combination of indicator functions of Boolean equivalences (which are positive-semidefinite by Proposition 15) with non-negative coefficients and, consequently, it has to be positive-semidefinite.

A.4 Example of a Non-Positive-Semidefinite T_cos-Equivalence

For dimensions n > 3, the T_cos-transitivity is no longer sufficient to guarantee positive semidefiniteness. Consider, for example, A_n = ( a_ij^(n) )_{ij} where

a_ij^(n) =
    λ   if min(i, j) = 1 and max(i, j) > 1,
    1   if i = j,
    0   else.                                                                        (22)

Choose λ = 1/√2; then T_cos(λ, λ) = 0, hence we have T_cos(a_ij^(n), a_jk^(n)) ≤ a_ik^(n) for all indices i, j, k ∈ {1, . . . , n}. As det(A_n) < 0 for n > 3, the matrix A_n cannot be positive-semidefinite, even though the T_cos-transitivity conditions are satisfied.
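A minimal numerical sketch of this counterexample follows; it is added for illustration and assumes the usual definition T_cos(a, b) = max(ab − √(1 − a^2)√(1 − b^2), 0), with helper names (t_cos, arrow_matrix) chosen only for this example. For n = 3 the determinant vanishes, while for n > 3 the matrix A_n of Equation (22) has a negative eigenvalue although all T_cos-transitivity conditions hold.

```python
import numpy as np
from itertools import product

def t_cos(a, b):
    """T_cos(a, b) = max(a*b - sqrt(1 - a^2)*sqrt(1 - b^2), 0)."""
    return max(a * b - np.sqrt(1.0 - a ** 2) * np.sqrt(1.0 - b ** 2), 0.0)

def arrow_matrix(n, lam):
    """A_n from Equation (22): lam in the first row/column, 1 on the diagonal, 0 elsewhere."""
    A = np.zeros((n, n))
    A[0, 1:] = lam
    A[1:, 0] = lam
    np.fill_diagonal(A, 1.0)
    return A

lam = 1.0 / np.sqrt(2.0)
for n in (3, 4, 5):
    A = arrow_matrix(n, lam)
    transitive = all(t_cos(A[i, j], A[j, k]) <= A[i, k] + 1e-12
                     for i, j, k in product(range(n), repeat=3))
    print(n, transitive, np.linalg.det(A), np.linalg.eigvalsh(A).min())
```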
References

S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Los Angeles, California, 1955.

U. Bodenhofer. A note on approximate equality versus the Poincaré paradox. Fuzzy Sets and Systems, 133(2):155–160, 2003.

D. Boixader and J. Jacas. T-indistinguishability operators and approximate reasoning via CRI. In D. Dubois, E. P. Klement, and H. Prade, editors, Fuzzy Sets, Logics and Reasoning about Knowledge, volume 15 of Applied Logic Series, pages 255–268. Kluwer Academic Publishers, Dordrecht, 1999.

C. H. FitzGerald, C. A. Micchelli, and A. Pinkus. Functions that preserve families of positive semidefinite matrices. Linear Alg. and Appl., 221:83–102, 1995.

T. Calvo, G. Mayor, and R. Mesiar, editors. Aggregation Operators, volume 97 of Studies in Fuzziness and Soft Computing. Physica-Verlag, Heidelberg, 2002.

O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, volume 15, 2003.

B. De Baets and R. Mesiar. T-partitions. Fuzzy Sets and Systems, 97:211–223, 1998.

M. Demirci. On many-valued partitions and many-valued equivalence relations. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 11(2):235–253, 2003.

D. Dubois and H. Prade. A review of fuzzy set aggregation connectives. Inform. Sci., 36:85–121, 1985.

M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.

S. Gottwald. Fuzzy set theory with t-norms and Φ-operators. In A. Di Nola and A. G. S. Ventre, editors, The Mathematics of Fuzzy Systems, volume 88 of Interdisciplinary Systems Research, pages 143–195. Verlag TÜV Rheinland, Köln, 1986.

S. Gottwald. Fuzzy Sets and Fuzzy Logic. Vieweg, Braunschweig, 1993.

U. Höhle. Fuzzy equalities and indistinguishability. In Proc. 1st European Congress on Fuzzy and Intelligent Technologies, volume 1, pages 358–363, Aachen, 1993.

U. Höhle. The Poincaré paradox and non-classical logics. In D. Dubois, E. P. Klement, and H. Prade, editors, Fuzzy Sets, Logics and Reasoning about Knowledge, volume 15 of Applied Logic Series, pages 7–16. Kluwer Academic Publishers, Dordrecht, 1999.

F. Höppner and F. Klawonn. Improved fuzzy partitions for fuzzy regression models. Internat. J. Approx. Reason., 32:85–102, 2003.

F. Höppner, F. Klawonn, and P. Eklund. Learning indistinguishability from data. Soft Computing, 6(1):6–13, 2002.

J. Jacas. On the generators of T-indistinguishability operators. Stochastica, 12:49–63, 1988.

I. T. Jolliffe. Principal Component Analysis. Springer Verlag, New York, 1986.

E. P. Klement, R. Mesiar, and E. Pap. Triangular Norms, volume 8 of Trends in Logic. Kluwer Academic Publishers, Dordrecht, 2000.

R. Kruse, J. Gebhardt, and F. Klawonn. Fuzzy-Systeme. B. G. Teubner, Stuttgart, 1993.

R. Kruse, J. Gebhardt, and F. Klawonn. Foundations of Fuzzy Systems. John Wiley & Sons, New York, 1994.

C. H. Ling. Representation of associative functions. Publ. Math. Debrecen, 12:189–212, 1965.

B. Moser. On the t-transitivity of kernels. Fuzzy Sets and Systems, 157:1787–1796, 2006.

B. Moser. A New Approach for Representing Control Surfaces by Fuzzy Rule Bases. PhD thesis, Johannes Kepler Universität Linz, October 1995.

T. Muir. A Treatise on the Theory of Determinants. Dover, New York, 1960.

H. Poincaré. La Science et l'Hypothèse. Flammarion, Paris, 1902.

H. Poincaré. La Valeur de la Science. Flammarion, Paris, 1904.

S. Saminger, R. Mesiar, and U. Bodenhofer. Domination of aggregation operators and preservation of transitivity. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 10(Suppl.):11–35, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.

B. Schölkopf, A. J. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

B. Schweizer and A. Sklar. Associative functions and statistical triangle inequalities. Publ. Math. Debrecen, 8:169–186, 1961.

B. Schweizer and A. Sklar. Probabilistic Metric Spaces. North-Holland, Amsterdam, 1983.

M. Seeger. Covariance kernels from Bayesian generative models. Neural Information Processing Systems, 14:905–912, 2002.

T. M. Huang, V. Kecman, and I. Kopriva. Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-supervised, and Unsupervised Learning. Springer-Verlag, Berlin, 2006.

E. Trillas and L. Valverde. An inquiry into indistinguishability operators. In H. J. Skala, S. Termini, and E. Trillas, editors, Aspects of Vagueness, pages 231–256. Reidel, Dordrecht, 1984.

E. Trillas, S. Cubillo, and E. Castiñeira. Menger and Ovchinnikov on indistinguishabilities revisited. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 7(3):213–218, 1999.

L. Valverde. On the structure of F-indistinguishability operators. Fuzzy Sets and Systems, 17(3):313–328, 1985.

A. M. Yaglom. Some classes of random fields in n-dimensional space, related to stationary random processes. Theory of Probability and its Applications, 2:273–320, 1957.

L. A. Zadeh. Similarity relations and fuzzy orderings. Inform. Sci., 3:177–200, 1971.
6 0.37572756 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
7 0.30809537 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
8 0.28832108 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
9 0.24748622 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
10 0.24131529 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
11 0.23471349 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
13 0.22909774 45 jmlr-2006-Learning Coordinate Covariances via Gradients
14 0.22294611 48 jmlr-2006-Learning Minimum Volume Sets
16 0.21358019 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
18 0.20189907 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
19 0.19514906 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
topicId topicWeight
[(8, 0.011), (36, 0.036), (46, 0.012), (50, 0.083), (63, 0.021), (67, 0.011), (78, 0.01), (81, 0.024), (84, 0.014), (90, 0.634), (91, 0.017), (96, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.95349544 93 jmlr-2006-Universal Kernels
Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang
Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels
2 0.86607099 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola
Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces
3 0.57222021 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
4 0.39016968 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
5 0.38938749 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
6 0.38379893 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
7 0.38324773 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
8 0.38004532 45 jmlr-2006-Learning Coordinate Covariances via Gradients
9 0.37804371 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
10 0.3722522 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
11 0.35889754 75 jmlr-2006-Policy Gradient in Continuous Time
12 0.33835343 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
13 0.32139379 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
14 0.31463251 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
15 0.31096438 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
16 0.30576345 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
17 0.3041625 16 jmlr-2006-Bounds for Linear Multi-Task Learning
18 0.30248839 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
19 0.30187336 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting