jmlr jmlr2006 jmlr2006-23 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
Reference: text
sentIndex sentText sentNum sentScore
1 These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. [sent-7, score-0.21]
2 Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation 1. [sent-8, score-0.324]
3 Given a sample $(X_1,Y_1),\dots,(X_n,Y_n)$ of a random variable $(X,Y) \in \mathbb{R}^d \times \{-1, 1\}$, we study in this paper the limit and consistency of learning algorithms that solve the following problem: $\arg\min_{f \in \mathcal{H}_\sigma} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i f(X_i)\right) + \lambda \|f\|_{\mathcal{H}_\sigma}^2 \right\}$. [sent-15, score-0.115]
4 Recent years have witnessed important theoretical advances aimed at understanding the behavior of such regularized algorithms when n tends to infinity and λ decreases to 0. [sent-19, score-0.124]
5 In particular, the consistency and convergence rates of the two-class SVM have been studied (see, e.g., ...). [sent-20, score-0.113]
6 The case of more general convex loss functions has also attracted a lot of attention recently (Zhang, 2004; Lugosi and Vayatis, 2004; Bartlett et al., 2006). [sent-23, score-0.108]
7 All results published so far, however, study the case where λ decreases as the number of points tends to infinity (or, equivalently, where $\lambda\sigma^{-d}$ converges to 0 if one uses the classical non-normalized version of the Gaussian kernel instead of (2)). [sent-25, score-0.099]
8 The main contribution of this paper is to show that Bayes consistency for the classification error can be obtained for algorithms that solve (1) without decreasing λ, if instead the bandwidth σ of the Gaussian kernel decreases at a suitable rate. [sent-33, score-0.223]
9 We prove upper bounds on the convergence rate of the classification error towards the Bayes risk for a variety of functions φ and of distributions P, in particular for SVM (Theorem 6). [sent-34, score-0.126]
10 Section 4 is devoted to the proof of the main theorem that describes the speed of convergence of the regularized φ-risk of its empirical minimizer towards its minimum. [sent-43, score-0.329]
11 Finally the proof of the consistency of the one-class SVM as a density level set estimator is postponed to Section 8. [sent-47, score-0.348]
12 We assume throughout this paper that the marginal distribution of X has a density ρ : Rd → R with respect to the Lebesgue measure, and that its support is included in a compact set X ⊂ Rd . [sent-50, score-0.11]
13 The normalized Gaussian radial basis function (RBF) kernel kσ with bandwidth parameter σ > 0 is defined for any (x, x′ ) ∈ Rd × Rd by $k_\sigma(x, x') := \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^d} \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$; the corresponding reproducing kernel Hilbert space (RKHS) is denoted by H σ , with associated norm $\|\cdot\|_{\mathcal{H}_\sigma}$. [sent-53, score-0.227]
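The normalized kernel above is straightforward to compute; the following numpy sketch (function name and toy data are my own, not from the paper) evaluates $k_\sigma$ on a small sample and checks symmetry of the resulting Gram matrix.

```python
import numpy as np

def gaussian_kernel_matrix(X, Xp, sigma):
    """Normalized Gaussian RBF kernel: k_sigma(x, x') = (sqrt(2*pi)*sigma)**(-d) * exp(-||x - x'||^2 / (2*sigma**2))."""
    d = X.shape[1]
    sq_dists = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    norm_const = (np.sqrt(2.0 * np.pi) * sigma) ** (-d)
    return norm_const * np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # five toy points in R^2
K = gaussian_kernel_matrix(X, X, sigma=0.5)
print(K.shape, np.allclose(K, K.T))              # (5, 5) True; the diagonal equals the normalizing constant
```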
14 Denoting by M the set of measurable real-valued functions on Rd , we define several risks for functions f ∈ M : • The classification error rate, usually referred to as the risk, $R(f) := P\left(\operatorname{sign}(f(X)) \neq Y\right)$, and the minimum achievable classification error rate over M is called the Bayes risk: $R^* := \inf_{f \in \mathcal{M}} R(f)$. [sent-57, score-0.209]
15 • For a scalar λ > 0 fixed throughout this paper and a convex function φ : R → R, the φ-risk regularized by the RKHS norm is defined, for any σ > 0 and f ∈ H σ , by $R_{\phi,\sigma}(f) := \mathbb{E}_P\left[\phi\left(Y f(X)\right)\right] + \lambda \|f\|_{\mathcal{H}_\sigma}^2$, and the minimum achievable Rφ,σ -risk over H σ is denoted by $R^*_{\phi,\sigma} := \inf_{f \in \mathcal{H}_\sigma} R_{\phi,\sigma}(f)$. [sent-58, score-0.311]
16 For example, for the hinge loss φ(u) = max(0, 1 − u) one can take L(r) = 1, and for the squared hinge loss φ(u) = max(0, 1 − u)² one can take L(r) = 2(r + 1). [sent-63, score-0.124]
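These local Lipschitz constants can be checked numerically; a small illustrative sketch (grid size and interval are arbitrary):

```python
import numpy as np

hinge = lambda u: np.maximum(0.0, 1.0 - u)
sq_hinge = lambda u: np.maximum(0.0, 1.0 - u) ** 2

def empirical_lipschitz(phi, r, n=20001):
    # largest slope |phi(a) - phi(b)| / |a - b| observed on a fine grid of [-r, r]
    u = np.linspace(-r, r, n)
    v = phi(u)
    return np.max(np.abs(np.diff(v)) / np.diff(u))

r = 3.0
print(empirical_lipschitz(hinge, r))     # close to L(r) = 1
print(empirical_lipschitz(sq_hinge, r))  # just below L(r) = 2 * (r + 1) = 8
```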
17 In particular, the following empirical version of Rφ,σ will be used: $\forall \sigma > 0,\ f \in \mathcal{H}_\sigma,\quad \hat{R}_{\phi,\sigma}(f) := \frac{1}{n} \sum_{i=1}^{n} \phi\left(Y_i f(X_i)\right) + \lambda \|f\|_{\mathcal{H}_\sigma}^2$. Furthermore, fˆφ,σ denotes the minimizer of $\hat{R}_{\phi,\sigma}$ over H σ (see Steinwart, 2005a, for a proof of existence and uniqueness of fˆφ,σ ). [sent-72, score-0.11]
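One way to approximate this empirical minimizer numerically is to search over kernel expansions f(·) = Σ_j α_j k_σ(x_j, ·), for which ‖f‖²_{Hσ} = αᵀKα, and run subgradient descent on the coefficients. The sketch below (hinge loss, fixed step size, made-up data) is only an illustration of the minimization problem, not the algorithm or the rates studied in the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    d = X.shape[1]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return (np.sqrt(2.0 * np.pi) * sigma) ** (-d) * np.exp(-sq / (2.0 * sigma ** 2))

def fit_regularized_hinge(K, y, lam, n_iter=5000, step=0.05):
    """Subgradient descent on (1/n) * sum_i max(0, 1 - y_i f(x_i)) + lam * a' K a, with f(x_i) = (K a)_i."""
    n = len(y)
    a = np.zeros(n)
    for _ in range(n_iter):
        margin = y * (K @ a)
        active = (margin < 1.0).astype(float)                 # points where the hinge is active
        grad = -(K @ (active * y)) / n + 2.0 * lam * (K @ a)
        a -= step * grad
    return a

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])   # toy two-class sample
y = np.hstack([-np.ones(20), np.ones(20)])
K = gaussian_kernel_matrix(X, sigma=0.7)
alpha = fit_regularized_hinge(K, y, lam=0.1)
print("training error:", np.mean(np.sign(K @ alpha) != y))    # should be small on this separable toy data
```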
18 The main focus of this paper is the analysis of learning algorithms that minimize the empirical φ-risk regularized by the RKHS norm Rφ,σ , and their limit as the number of points tends to infinity and the kernel width σ decreases to 0 at a suitable rate when n tends to ∞, λ being kept fixed. [sent-73, score-0.237]
19 This stems from the fact that the empirical average term in the definition of Rφ,σ converges to its corresponding expectation, while the norm in H σ of a function f decreases to its L2 norm when σ decreases to zero. [sent-75, score-0.12]
20 First, we observe that the minimizer of Rφ,0 is indeed well-defined and can often be explicitly computed (the following lemma is part of Theorem 26 and is proved in Section 6). [sent-77, score-0.183]
21 Then fφ,0 is measurable and satisfies $R_{\phi,0}\left(f_{\phi,0}\right) = \inf_{f \in \mathcal{M}} R_{\phi,0}(f)$. Second, let us recall the notion of modulus of continuity (DeVore and Lorentz, 1993). [sent-79, score-0.245]
22 Then its modulus of continuity in the L1 -norm is defined for any δ ≥ 0 as follows: $\omega(f, \delta) := \sup_{0 \le \|t\| \le \delta} \left\| f(\cdot + t) - f(\cdot) \right\|_{L^1}$. [sent-81, score-0.096]
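A rough one-dimensional numerical illustration of this modulus of continuity on a grid (test function and grid bounds are arbitrary):

```python
import numpy as np

def l1_modulus_of_continuity(f, delta, lo=-10.0, hi=10.0, n=4001):
    """Approximate omega(f, delta) = sup_{0 <= |t| <= delta} int |f(x + t) - f(x)| dx on a regular grid."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    best = 0.0
    for t in np.arange(0.0, delta + 1e-12, dx):
        best = max(best, np.sum(np.abs(f(x + t) - f(x))) * dx)
    return best

f = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)    # standard Gaussian density as test function
for delta in (0.5, 0.1, 0.02):
    print(delta, l1_modulus_of_continuity(f, delta))       # shrinks as delta decreases
```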
23 The main result of this paper, whose proof is postponed to Section 4, can now be stated as follows: Theorem 3 (Main Result) Let σ1 > σ > 0, 0 < p < 2, δ > 0, and let fˆφ,σ denote the minimizer of the Rφ,σ risk over H σ , where φ is assumed to be convex. [sent-84, score-0.192]
24 Assume that the marginal density ρ is bounded, and let M := supx∈Rd ρ(x). [sent-85, score-0.11]
25 From (5), we can deduce the Rφ,0 -consistency of fˆφ,σ for Lipschitz loss functions, as soon as fφ,0 is integrable, $\sigma = o\left(n^{-1/(d+\varepsilon)}\right)$ for some ε > 0, and σ1 → 0 with σ/σ1 → 0. [sent-105, score-0.121]
26 In order to state a refined version of it in the particular case of the support vector machine algorithm, we first need to introduce the notion of low density exponent: Definition 5 We say that a distribution P with marginal density ρ w.r.t. the [sent-118, score-0.22]
27 Lebesgue measure has a low density exponent γ ≥ 0 if there exists $(c_2, \varepsilon_0) \in (0, \infty)^2$ such that $\forall \varepsilon \in [0, \varepsilon_0],\quad P\left(\left\{ x \in \mathbb{R}^d : \rho(x) \le \varepsilon \right\}\right) \le c_2\, \varepsilon^{\gamma}$. [sent-121, score-0.145]
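A small Monte Carlo illustration of this definition for the standard two-dimensional Gaussian, where the low-density probability can be computed exactly (P(ρ(X) ≤ ε) = 2πε, so γ = 1 with c₂ = 2π); the sample size and the ε grid below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200000, 2))                        # sample from P = N(0, I_2)
rho = np.exp(-(X ** 2).sum(axis=1) / 2) / (2 * np.pi)   # true density evaluated at the sample

eps_grid = np.array([1e-3, 3e-3, 1e-2, 3e-2])
probs = np.array([(rho <= eps).mean() for eps in eps_grid])
for eps, p in zip(eps_grid, probs):
    print(f"eps={eps:.0e}  P(rho(X) <= eps) = {p:.4f}  (exact value: {2 * np.pi * eps:.4f})")

# slope of log P against log eps gives a rough estimate of the exponent gamma (close to 1 here)
print("estimated gamma ~", np.polyfit(np.log(eps_grid), np.log(probs), 1)[0])
```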
28 In combination with Theorem 3, it states the consistency of SVM, and gives upper bounds on the convergence rates, for the first time in a situation where the effect of regularization does not vanish asymptotically. [sent-127, score-0.192]
29 In fact, Theorem 6 is a particular case of a more general result (Theorem 29) valid for a large class of convex loss functions. [sent-128, score-0.108]
30 Another consequence of the Rφ,0 -consistency of an algorithm is the L2 convergence of the function output by the algorithm to the minimizer of the Rφ,0 -risk: Lemma 7 (Relating Rφ,0 -Consistency with L2 -Consistency) For any f ∈ M , the following holds: $\|f - f_{\phi,0}\|_{L^2}^2 \le \frac{1}{\lambda}\left( R_{\phi,0}(f) - R^*_{\phi,0} \right)$. [sent-131, score-0.094]
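A hedged sketch of the kind of argument that yields such a bound: writing $R_{\phi,0}(f) = \mathbb{E}_P[\phi(Yf(X))] + \lambda\|f\|_{L^2}^2$, the quadratic term makes the functional λ-strongly convex, and strong convexity at the minimizer gives the inequality. The derivation below is a reconstruction under that reading, not a quotation of the paper's proof.

```latex
% Let A(f) := E_P[ phi(Y f(X)) ] (convex) and R_{phi,0}(f) = A(f) + lambda ||f||_{L^2}^2,
% with minimizer f_{phi,0}. Put Delta := f - f_{phi,0} and g(t) := R_{phi,0}(f_{phi,0} + t Delta).
\begin{align*}
g'(t) &\ge g'(0^+) + 2\lambda t\,\|\Delta\|_{L^2}^2
  \quad\text{(the derivative of the convex part is nondecreasing; the quadratic part is explicit)}\\
R_{\phi,0}(f) - R_{\phi,0}(f_{\phi,0}) &= \int_0^1 g'(t)\,dt
  \;\ge\; g'(0^+) + \lambda\,\|\Delta\|_{L^2}^2 \;\ge\; \lambda\,\|f - f_{\phi,0}\|_{L^2}^2 ,
\end{align*}
% using g'(0^+) >= 0 at the minimizer; rearranging gives
% || f - f_{phi,0} ||_{L^2}^2 <= (1/lambda) ( R_{phi,0}(f) - R*_{phi,0} ).
```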
31 Then, under the general conditions of Theorem 3, and assuming that $\lim_{h \to 0} \omega(\rho_\lambda, h) = 0$, $\lim_{n \to +\infty} \| \hat{f}_\sigma - \rho_\lambda \|_{L^2} = 0$ in probability, for a well-calibrated bandwidth σ (here fˆσ denotes the minimizer of the one-class empirical regularized risk over H σ , as in Theorem 8). [sent-139, score-0.099]
32 A very interesting by-product of this theorem is the consistency of the one-class SVM algorithm for density level set estimation, which to the best of our knowledge has not been stated so far (the proof being postponed to Section 8). [sent-141, score-0.383]
33 Then, under the general assumptions of Theorem 3, and assuming that $\lim_{h \to 0} \omega(\rho_\lambda, h) = 0$, we have $\lim_{n \to +\infty} \left( H_P(C_\mu) - H_P(\hat{C}_\mu) \right) = 0$ in probability, for a well-calibrated bandwidth σ. [sent-144, score-0.099]
34 The excess-mass functional was first introduced by Hartigan (1987) to assess the quality of density level set estimators. [sent-145, score-0.156]
35 It is maximized by the true density level set Cµ and acts as a risk functional in the one-class framework. [sent-146, score-0.212]
36 The proof of Theorem 9 is based on the following general result: if ρ̂ is a density estimator converging to the true density ρ in the L2 sense, then for any fixed $0 < \mu < \sup_{\mathbb{R}^d} \rho$, the excess mass of the level set of ρ̂ at level µ converges to the excess mass of Cµ . [sent-147, score-0.748]
37 In other words, as is the case in the classification framework, plug-in rules built on L2 consistent density estimators are consistent with respect to the excess mass. [sent-148, score-0.252]
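As a concrete illustration of such a plug-in rule, one can threshold any reasonable density estimate; the sketch below uses scipy's Gaussian kernel density estimator and made-up data, and is meant only to show the plug-in construction, not the estimator analysed in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 2))       # i.i.d. sample from the (here known) density
kde = gaussian_kde(sample.T)             # density estimate rho_hat

mu = 0.05                                # level of interest
def in_level_set(points):
    """Plug-in level set estimate {x : rho_hat(x) >= mu}."""
    return kde(points.T) >= mu

xs = np.linspace(-3, 3, 61)
grid = np.array([(a, b) for a in xs for b in xs])
print("fraction of grid points in the estimated level set:", in_level_set(grid).mean())
```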
38 In particular, it shows that the family (H σ )σ>0 forms a nested collection of models, and that for any fixed function, the RKHS norm decreases to the L2 -norm as the kernel bandwidth decreases to 0. [sent-169, score-0.193]
39 Convolution with the Gaussian Kernel. Besides its positive definiteness, the Gaussian kernel is commonly used as a kernel for function approximation through convolution. [sent-190, score-0.106]
40 The convolution with a Gaussian RBF kernel is a technically convenient tool to map any square integrable function to a Gaussian RKHS. [sent-196, score-0.12]
41 Convolving a function with a Gaussian kernel with decreasing bandwidth is known to provide an approximation of the original function under general conditions. [sent-204, score-0.11]
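A quick one-dimensional check of this approximation property: the L1 distance between a (non-smooth) test function and its convolution with the normalized Gaussian kernel shrinks as the bandwidth decreases. Grid sizes and the test function are arbitrary.

```python
import numpy as np

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
f = ((x > -1) & (x < 1)).astype(float)            # indicator of (-1, 1)

for sigma in (0.5, 0.1, 0.02):
    g = np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)   # k_sigma in d = 1
    smoothed = np.convolve(f, g, mode="same") * dx                          # (k_sigma * f) on the grid
    print(f"sigma={sigma:<5} L1 distance to f = {np.sum(np.abs(smoothed - f)) * dx:.4f}")
```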
42 We provide below a more quantitative estimate for the rate of this convergence under some assumption on the modulus of continuity of f (see Definition 2), using methods from DeVore and Lorentz (1993). [sent-206, score-0.119]
43 where ω( f , ·) denotes the modulus of continuity of f in the L1 norm. [sent-212, score-0.096]
44 Proof of Theorem 3. The proof of Theorem 3 is based on the following decomposition of the excess Rφ,0 -risk for the minimizer of the Rφ,σ -risk: Lemma 14 For any $0 < \sigma < \sqrt{2}\,\sigma_1$ and any sample $(X_i, Y_i)_{i=1,\dots,n}$, ... [sent-224, score-0.252]
45 For σ > 0, let F σ be a convex subset of H σ and let φ be a convex loss function. [sent-267, score-0.177]
46 In order to apply this theorem we now provide uniform upper bounds over G σ for the variance of g and its uniform norm, as well as an upper bound on the covering number of G σ . [sent-275, score-0.165]
47 Using first the definition of the restricted RKHS (26), then (37) and (27), we obtain $R^X_{\phi,\sigma}\left(f^X_{\phi,\sigma}\right) = \inf_{f^X \in \mathcal{H}^X_\sigma} \mathbb{E}_{P|X}\left[\phi\left(Y f^X(X)\right)\right] + \lambda \|f^X\|_{\mathcal{H}^X_\sigma}^2 = \inf_{f \in \mathcal{H}_\sigma} \mathbb{E}_{P|X}\left[\phi\left(Y f_{|X}(X)\right)\right] + \lambda \|f_{|X}\|_{\mathcal{H}^X_\sigma}^2 = \inf_{f \in \mathcal{H}_\sigma} \mathbb{E}_P\left[\phi\left(Y f(X)\right)\right] + \lambda \|f\|_{\mathcal{H}_\sigma}^2 \;(38)\; = R_{\phi,\sigma}\left(f_{\phi,\sigma}\right)$, which proves the first statement. [sent-283, score-0.33]
48 As a result, we get $\hat{R}^X_{\phi,\sigma}\left(\hat{f}_{\phi,\sigma|X}\right) \le \hat{R}_{\phi,\sigma}\left(\hat{f}_{\phi,\sigma}\right) = \hat{R}^X_{\phi,\sigma}\left(\hat{f}^X_{\phi,\sigma}\right)$, from which we deduce that $\hat{f}_{\phi,\sigma|X}$ and $\hat{f}^X_{\phi,\sigma}$ both minimize the strictly convex functional $\hat{R}^X_{\phi,\sigma}$ on $\mathcal{H}^X_\sigma$, proving (40). [sent-293, score-0.151]
49 On the other hand, we deduce from the convexity of φ that for any (x, y) ∈ X × {−1, +1} and any f ∈ F σ : $\frac{\phi(y f(x)) + \lambda \|f\|_{\mathcal{H}_\sigma}^2 + \phi(y f_{\phi,\sigma}(x)) + \lambda \|f_{\phi,\sigma}\|_{\mathcal{H}_\sigma}^2}{2} \ge \phi\left( y\, \frac{f(x) + f_{\phi,\sigma}(x)}{2} \right) + \lambda\, \frac{\|f\|_{\mathcal{H}_\sigma}^2 + \|f_{\phi,\sigma}\|_{\mathcal{H}_\sigma}^2}{2} = \phi\left( y\, \frac{f + f_{\phi,\sigma}}{2}(x) \right) + \lambda \left\| \frac{f + f_{\phi,\sigma}}{2} \right\|_{\mathcal{H}_\sigma}^2 + \lambda \left\| \frac{f - f_{\phi,\sigma}}{2} \right\|_{\mathcal{H}_\sigma}^2$. [sent-302, score-0.111]
50 The second term of (45) is easily upper bounded by $\log N\left(\left[0, B^{-1}\phi(0)\right], \varepsilon, L^2(T)\right) \le \log \frac{\phi(0)}{B\varepsilon}$, and we finally get $\log N\left(B^{-1} \mathcal{G}^X_\sigma, 2\varepsilon, L^2(T)\right) \le \log N\left(\mathcal{B}_\sigma, \frac{\sqrt{2}\,\varepsilon}{\kappa_\sigma}, L^2(T)\right) + \log \frac{\phi(0)}{B\varepsilon}$. [sent-320, score-0.17]
51 Some Properties of the L2 -Norm-Regularized φ-Risk. In this section we investigate the conditions on the loss function φ under which the Bayes consistency of the minimization of the regularized φ-risk holds. [sent-330, score-0.207]
52 Following Bartlett et al. (2006), we introduce a notion of classification-calibration for regularized loss functions φ, and upper bound the excess risk of any classifier f in terms of its excess of regularized φ-risk. [sent-332, score-0.56]
53 We also upper-bound the L2 -distance between f and fφ,0 in terms of the excess of regularized φ-risk of f , which is useful to prove Bayes consistency in the one-class setting. [sent-333, score-0.31]
54 The loss function φ is said to be classification-calibrated if, for any η ∈ [0, 1]\{1/2}: $\inf_{\alpha \in \mathbb{R}:\, \alpha(2\eta - 1) \le 0} C_\eta(\alpha) > \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha)$. The importance here is in the strict inequality, which implies in particular that if the global infimum of Cη is reached at some point α, then α > 0 (resp. α < 0) according to the sign of 2η − 1. [sent-337, score-0.269]
55 This condition, that generalizes the requirement that the minimizer of Cη (α) has the correct sign, is a minimal condition that can be viewed as a pointwise form of Fisher consistency for classification. [sent-340, score-0.161]
56 However, as soon as ρ < 2λ, the global minimum of Cη,ρ (α) is reached for α = 0 and therefore $\inf_{\alpha \in \mathbb{R}:\, \alpha(2\eta - 1) \le 0} C_{\eta,\rho}(\alpha) = \inf_{\alpha \in \mathbb{R}} C_{\eta,\rho}(\alpha) = 2$, which shows that φ is not R-classification-calibrated in this case. [sent-353, score-0.255]
57 Because of (50), it is clear that such functions satisfy $\inf_{\alpha \le 0} C_\eta(\alpha) = \inf_{\alpha \ge 0} C_\eta(\alpha) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) = 0$, which shows that they are not classification-calibrated. [sent-357, score-0.306]
58 φ being nonnegative, the corresponding generic conditional risk $C_\eta(\alpha) = \eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha) + t\alpha^2$ satisfies $\lim_{\alpha \to -\infty} C_\eta(\alpha) = \lim_{\alpha \to +\infty} C_\eta(\alpha) = +\infty$. [sent-359, score-0.162]
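The calibration requirement can be explored numerically: for the hinge loss plus the quadratic term tα², the global minimizer of this conditional risk should have the sign of 2η − 1. The brute-force grid search below is purely illustrative (the grid and the value of t are arbitrary).

```python
import numpy as np

hinge = lambda u: np.maximum(0.0, 1.0 - u)

def C_eta(alpha, eta, t, phi=hinge):
    return eta * phi(alpha) + (1.0 - eta) * phi(-alpha) + t * alpha ** 2

alphas = np.linspace(-3, 3, 60001)
for eta in (0.2, 0.4, 0.6, 0.9):
    a_star = alphas[np.argmin(C_eta(alphas, eta, t=0.5))]
    print(f"eta={eta}: argmin alpha = {a_star:+.3f}, sign(2*eta - 1) = {np.sign(2 * eta - 1):+.0f}")
```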
59 As a result, $\inf_{\alpha \in \mathbb{R}:\, \alpha(2\eta - 1) \le 0} \tilde{C}_{\eta,\rho}(\alpha) = \tilde{C}_\eta(\alpha_\eta) > \tilde{C}_\eta(-\alpha_\eta) \ge \inf_{\alpha \in \mathbb{R}} C_{\eta,\rho}(\alpha)$, showing that φ(x) + tx² is classification-calibrated, and therefore that φ is R-classification-calibrated. [sent-364, score-0.204]
60 Classification-Calibration of Convex Loss Functions. The following lemma states the equivalence between classification calibration and R-classification calibration for convex loss functions, and it gives a simple characterization of this property. [sent-366, score-0.25]
61 From this and lemma 23, we deduce that φ is R-classification-calibrated iff φ(x) + tx2 is classification-calibrated for any t > 0, iff φ(x) + tx2 is differentiable at 0 with negative derivative (for any t > 0), iff φ(x) is differentiable at 0 with negative derivative. [sent-373, score-0.327]
62 Proof. For any (η, ρ) ∈ [0, 1] × [0, +∞), the function Gη,ρ (α) is the sum of the convex function ρCη (α) and of the strictly convex function λα². [sent-378, score-0.138]
63 From this result we obtain the following characterization and properties of the minimizer of the Rφ,0 -risk: Theorem 26 If φ : R → [0, +∞) is a convex function, then the function fφ,0 : Rd → R defined for any x ∈ Rd by fφ,0 (x) := α(η(x), ρ(x)) satisfies: 1. [sent-385, score-0.14]
64 fφ,0 minimizes the Rφ,0 -risk: $R_{\phi,0}\left(f_{\phi,0}\right) = \inf_{f \in \mathcal{M}} R_{\phi,0}(f)$. [sent-388, score-0.102]
65 By convexity of Gη,ρ , this implies that its minimizer, namely α (η, ρ), is in the interval [α0 − ε, α0 + ε], as soon as (η, ρ) ∈ Bε , which concludes the proof of the continuity of (η, ρ) → α (η, ρ), and therefore of the measurability of fφ,0 . [sent-400, score-0.111]
66 Relating the Rφ,0 -Risk with the Classification Error Rate. In the “classical” setting (with a regularization parameter converging to 0), the idea of relating the convexified risk to the true risk (more simply called risk) has recently gained a lot of interest. [sent-404, score-0.162]
67 Zhang (2004) and Lugosi and Vayatis (2004) upper bound the excess-risk by some function of the excess φ-risk to prove consistency of various algorithms (and obtain upper bounds for the rates of convergence of the risk to the Bayes risk). [sent-405, score-0.361]
68 Finally, by Lemma 24, φ is R-classification-calibrated iff φρ is classification-calibrated (because both properties are equivalent to saying that φ is differentiable at 0 and φ′ (0) < 0), iff ψρ (θ) > 0 for θ ∈ (0, 1] by Bartlett et al. (2006). [sent-422, score-0.17]
69 We are now in position to state a first result to relate the excess Rφ,0 -risk to the excess-risk. [sent-424, score-0.142]
70 The dependence on ρ(x) generates difficulties compared with the “classical” setting, which forces us to separate the low density regions from the rest in the analysis. [sent-425, score-0.11]
71 For any f ∈ M the following holds: $R(f) - R^* \le \inf_{\varepsilon > 0} \left\{ P(A_\varepsilon) + \psi_\varepsilon^{-1}\left( R_{\phi,0}(f) - R^*_{\phi,0} \right) \right\}$ (57). Proof. First note that for convex classification-calibrated functions, ψε is strictly increasing on [0, 1]. [sent-427, score-0.171]
72 This important result shows that any consistency result for the regularized φ-risk implies consistency for the true risk, that is, convergence to the Bayes risk. [sent-441, score-0.281]
73 Besides, convergence rates for the regularized φ-risk towards its minimum translate into convergence rates for the risk towards the Bayes risk thanks to (57), as we will show in the next subsection. [sent-442, score-0.305]
74 Refinements under a Low Noise Assumption. When the distribution P satisfies a low noise assumption as defined in Section 2, we have the following result: Theorem 29 Let φ be a convex loss function such that there exist $(\kappa, \beta, \nu) \in (0, +\infty)^3$ satisfying $\psi_\varepsilon^{-1}(u) \le \kappa\, u^{\beta}\, \varepsilon^{-\nu}$ for all $(\varepsilon, u) \in (0, +\infty) \times \mathbb{R}$. [sent-444, score-0.108]
75 Then for any distribution P with low density exponent γ, there exist constants $(K, r) \in (0, +\infty)^2$ such that for any f ∈ M with an excess regularized φ-risk upper bounded by r the following holds: $R(f) - R^* \le K \left( R_{\phi,0}(f) - R^*_{\phi,0} \right)^{\frac{\beta\gamma}{\gamma + \nu}}$. [sent-445, score-0.39]
76 Consistency of SVMs. In this section we illustrate the results obtained throughout Section 6 for a general loss function φ (in particular the control of the excess R-risk by the excess Rφ,0 -risk of Theorem 29) on the specific cases of the loss functions used in the 1- and 2-SVM. [sent-456, score-0.362]
77 Remark 30 The minimum of Cη,ρ being reached on (−1, 1) for any (η, ρ) ∈ [0, 1] × (0, +∞), the result would be identical for any convex loss function φ0 that is equal to (1 − α)² on (−∞, 1). [sent-473, score-0.159]
78 Indeed, the corresponding regularized generic conditional φ0 -risk would coincide with Cη,ρ on (−1, 1) and would be no smaller than Cη,ρ outside of this interval; it would therefore have the same minimal value reached at the same point, and consequently the same function M and ψ. [sent-474, score-0.126]
79 Consistency of One-class SVMs for Density Level Set Estimation. In this section we focus on the one-class case: η is identically equal to 1, and P is just considered as a distribution on Rd , with density ρ with respect to the Lebesgue measure. [sent-487, score-0.11]
80 The first subsection is devoted to the proof of Theorem 8, and the second subsection to the proof of Theorem 9. [sent-488, score-0.102]
81 First, it follows from (64) that, in the one-class case where η = 1 on its domain, the asymptotic function fφ,0 equals the truncated density ρλ . [sent-491, score-0.138]
82 Finally, under the assumption $\lim_{\delta \to 0} \omega(\rho_\lambda, \delta) = 0$, using Theorem 3 with, for instance, p = 1, δ = 1, $\sigma = (1/n)^{1/(4d)}$, $\sigma_1 = (1/n)^{1/(2d)}$, and x = log(n), we deduce that for any ε > 0, $P\left( \| \hat{f}_\sigma - \rho_\lambda \|_{L^2} > \varepsilon \right) \to 0$ as n → ∞. [sent-493, score-0.111]
83 Before going to this point, let us recall some specific notation in the context of density level set estimation. [sent-497, score-0.156]
84 The aim is to estimate a density level set of level µ, for some µ > 0: $C_\mu := \left\{ x \in \mathbb{R}^d : \rho(x) \ge \mu \right\}$ (68). [sent-498, score-0.202]
85 The estimator that is considered here is the plug-in density level set estimator associated with fˆσ , denoted by Ĉµ : $\hat{C}_\mu := \left\{ x \in \mathbb{R}^d : 2\lambda \hat{f}_\sigma(x) \ge \mu \right\}$ (69). [sent-499, score-0.23]
86 Recall that the asymptotic behaviour of fˆσ in the one-class case is given in Theorem 8: fˆσ converges to ρλ , which is proportional to the density ρ truncated at level 2λ. [sent-500, score-0.184]
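For a hands-on (and only loosely related) illustration, an off-the-shelf one-class SVM can be fitted and its score thresholded to produce a level-set estimate. Note that scikit-learn's OneClassSVM uses the ν-parameterization with an offset and an unnormalized RBF kernel, so this is a stand-in for, not an implementation of, the estimator fˆσ analysed here; the bandwidth, ν and the threshold below are arbitrary.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                          # sample from the unknown density

sigma = 0.5
ocsvm = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=0.5)  # kernel exp(-||x-x'||^2 / (2*sigma^2))
ocsvm.fit(X)

scores = ocsvm.score_samples(X)                         # unshifted kernel-expansion scores at the sample
threshold = np.quantile(scores, 0.2)                    # keep roughly 80% of the training mass
estimated_set = lambda points: ocsvm.score_samples(points) >= threshold

print("fraction of the sample inside the estimated set:", estimated_set(X).mean())
```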
87 The density ρ is still assumed to have a compact support S ⊂ X . [sent-502, score-0.11]
88 To assess the quality of Ĉµ , we use the so-called excess mass functional, first introduced by Hartigan (1987), which is defined for any measurable subset C of Rd as follows: $H_P(C) := P(C) - \mu\, \mathrm{Leb}(C)$ (70), where Leb is the Lebesgue measure. [sent-503, score-0.227]
89 Hence, the quality of an estimate C depends here on how close its excess mass is to that of Cµ . [sent-505, score-0.18]
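The excess mass of any candidate set can be estimated by simple Monte Carlo: P(C) from the fraction of an i.i.d. sample falling in C, and Leb(C) from uniform points in a bounding box. A rough sketch with made-up constants; for the standard two-dimensional Gaussian and µ = 0.05 the printed values peak near the radius of the true level set (about 1.5), illustrating that Cµ maximizes the functional.

```python
import numpy as np

def excess_mass(indicator, sample, box_lo, box_hi, mu, n_uniform=200000, rng=None):
    """H_P(C) ~ P(C) - mu * Leb(C), with C described by a boolean-valued `indicator` on points."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_c = indicator(sample).mean()                                # Monte Carlo estimate of P(C)
    box_lo, box_hi = np.asarray(box_lo, float), np.asarray(box_hi, float)
    U = rng.uniform(box_lo, box_hi, size=(n_uniform, sample.shape[1]))
    leb_c = np.prod(box_hi - box_lo) * indicator(U).mean()        # Monte Carlo estimate of Leb(C)
    return p_c - mu * leb_c

rng = np.random.default_rng(1)
sample = rng.normal(size=(100000, 2))                             # sample from P = N(0, I_2)
mu = 0.05
for r in (0.5, 1.0, 1.5, 2.0, 3.0):
    disk = lambda pts, r=r: (pts ** 2).sum(axis=1) <= r ** 2      # candidate set: centred disk of radius r
    print(f"r={r}: estimated excess mass = {excess_mass(disk, sample, [-4, -4], [4, 4], mu):.4f}")
```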
90 The following lemma relates the L2 convergence of a density estimator to the consistency of the associated plug-in density level set estimator, with respect to the excess mass criterion: Lemma 32 Let P be a probability distribution on Rd with compact support S ⊂ X . [sent-506, score-0.682]
91 Assume that P is absolutely continuous with respect to the Lebesgue measure, and let ρ denote its associated density function. [sent-507, score-0.11]
92 Consider a non-negative density estimate ρ̂ defined on Rd . [sent-509, score-0.11]
93 Proof To prove the lemma, it is convenient to first build an artificial classification problem using the density function ρ and the desired level µ, then to relate the excess-risk involved in this classification problem to the excess-mass involved in the original one-class problem. [sent-511, score-0.156]
94 The mixture coefficient m is determined by the initially desired density level µ. [sent-516, score-0.156]
95 We could just directly apply this lemma to fˆσ , ρλ and the distribution Pλ defined through ρλ , but this would prove the consistency of fˆσ with respect to the excess mass HPλ , which is different from the criterion HP of interest. [sent-525, score-0.356]
96 The following lemma implies that the plug-in density level set estimator at level 0 < µ < 2λ based on the one-class SVM estimator is indeed consistent with respect to the excess mass HP . [sent-526, score-0.542]
97 Estimation of a convex density contour in two dimensions. [sent-625, score-0.179]
98 On the estimation of a probability density function by the maximum penalized likelihood method. [sent-678, score-0.11]
99 Consistency of support vector machines and other regularized kernel classifiers. [sent-697, score-0.131]
100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-743, score-0.215]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Régis Vert, Jean-Philippe Vert
Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation
2 0.12616417 93 jmlr-2006-Universal Kernels
Author: Charles A. Micchelli, Yuesheng Xu, Haizhang Zhang
Abstract: In this paper we investigate conditions on the features of a continuous kernel so that it may approximate an arbitrary continuous target function uniformly on any compact subset of the input space. A number of concrete examples are given of kernels with this universal approximating property. Keywords: density, translation invariant kernels, radial kernels
3 0.10766488 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
Author: Ross A. Lippert, Ryan M. Rifkin
Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines
4 0.096086137 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
Author: Guillaume Lecué
Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence
5 0.075329408 48 jmlr-2006-Learning Minimum Volume Sets
Author: Clayton D. Scott, Robert D. Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency
6 0.072596923 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
7 0.066747993 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
8 0.061479144 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
9 0.059162721 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
10 0.058732349 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
11 0.057191618 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.050806887 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
13 0.049523585 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
15 0.048032533 65 jmlr-2006-Nonparametric Quantile Estimation
16 0.04799078 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
17 0.047300499 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
18 0.04725901 45 jmlr-2006-Learning Coordinate Covariances via Gradients
19 0.04682596 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
simIndex simValue paperId paperTitle
1 0.98864698 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
Author: Bernhard Moser
Abstract: Kernels are two-placed functions that can be interpreted as inner products in some Hilbert space. It is this property which makes kernels predestinated to carry linear models of learning, optimization or classification strategies over to non-linear variants. Following this idea, various kernel-based methods like support vector machines or kernel principal component analysis have been conceived which prove to be successful for machine learning, data mining and computer vision applications. When applying a kernel-based method a central question is the choice and the design of the kernel function. This paper provides a novel view on kernels based on fuzzy-logical concepts which allows to incorporate prior knowledge in the design process. It is demonstrated that kernels mapping to the unit interval with constant one in its diagonal can be represented by a commonly used fuzzylogical formula for representing fuzzy rule bases. This means that a great class of kernels can be represented by fuzzy-logical concepts. Apart from this result, which only guarantees the existence of such a representation, constructive examples are presented and the relation to unlabeled learning is pointed out. Keywords: kernel, triangular norm, T -transitivity, fuzzy relation, residuum 1. Motivation Positive-definiteness plays a prominent role especially in optimization and machine learning due to the fact that two-place functions with this property, so-called kernels, can be represented as inner products in some Hilbert space. Thereby, optimization techniques conceived on the basis of linear models can be extended to non-linear algorithms. For a survey of applications see, for example, ¨ Jolliffe (1986), Sch¨ lkopf and Smola (2002) and Scholkopf et al. (1998). o Recently in Moser (2006) it was shown that kernels with values from the unit interval can be interpreted as fuzzy equivalence relations motivated by the idea that kernels express a kind of similarity. This means that the concept of fuzzy equivalence relations, or synonymously fuzzy similarity relations, is more general than that of kernels, provided only values in the unit interval are considered. Fuzzy equivalence relations distinguish from Boolean equivalence relations by a many-valued extension of transitivity which can be interpreted as many-valued logical model of the statement “IF x is similar to y AND y is similar to z THEN x is similar to z”. In contrast to the Boolean case, in many-valued logics the set of truth values is extended such that also assertions, for example, whether two elements x and y are similar, can be treated as a matter of degree. The standard model for the set of (quasi) truth values of fuzzy logic and other many-valued logical systems is the unit interval. If E(x, y) represents the (quasi) truth value of the statement that x is c 2006 Bernhard Moser. M OSER similar to y, then the many-valued version of transitivity is modeled by T (E(x, y), E(y, z)) ≤ E(x, z) where T is a so-called triangular norm which is an extension of the Boolean conjunction. This many-valued concept for transitivity is called T -transitivity. For a survey on triangular norms see, for example, Dubois and Prade (1985), Gottwald (1986), Gottwald (1993) and Klement et al. (2000), ¨ and for fuzzy equivalence relations and T -transitivity see, for example, Bodenhofer (2003), H ohle (1993), H¨ hle (1999), Klement et al. (2000), and Zadeh (1971). 
o Based on the semantics of fuzzy logic, this approach allows to incorporate knowledge-based models for the design of kernels. From this perspective, the most interesting mathematical question is how positive-semidefinite fuzzy equivalence relations can be characterized or at least constructed under some circumstances. At least for some special cases, proofs are provided in Section 4, which motivate further research aiming at establishing a more general theory on the positive-definiteness of fuzzy equivalence relations. These cases are based on the most prominent representatives of triangular norms, that is the Minimum, the Product and the Łukasiewicz t-norm. The paper is structured as follows. First of all, in Section 2, some basic prerequisites concerning kernels and fuzzy relations are outlined. In Section 3, a result about the T -transitivity of kernels from Moser (2006) is cited and interpreted as existence statement that guarantees a representation of kernels mapping to the unit interval with constant 1 in its diagonal by a certain, commonly used, fuzzy-logical construction of a fuzzy equivalence relation. Finally, in contrast to the pure existence theorem of Section 3, in Section 4 constructive examples of fuzzy equivalence relations are provided which are proven to be kernels. In a concluding remark, the relationship to the problem of labeled and unlabeled learning is pointed out. 2. Prerequisites This section summarizes definitions and facts from the theory of kernels as well as from fuzzy set theory which are needed later on. 2.1 Kernels and Positive-Semidefiniteness Preserving Functions There is an extensive literature concerning kernels and kernel-based methods like support vector machines or kernel principal component analysis especially in the machine learning, data mining ¨ and computer vision communities. For an overview and introduction, see, for example, Sch olkopf and Smola (2002). Here we present only what is needed later on. For completeness let us recall the basic definition for kernels and positive-semidefiniteness. Definition 1 Let X be a non-empty set. A real-valued function k : X × X → R is said to be a kernel iff it is symmetric, that is, k(x, y) = k(y, x) for all x, y ∈ X , and positive-semidefinite, that is, ∑n j=1 ci c j k(xi , x j ) ≥ 0 for any n ∈ N, any choice of x1 , . . . , xn ∈ X and any choice of c1 , . . . , cn ∈ R. i, One way to generate new kernels from known kernels is to apply operations which preserve the positive-semidefiniteness property. A characterization of such operations is provided by C. H. FitzGerald (1995). Theorem 2 (Closeness Properties of Kernels) Let f : Rn → R, n ∈ N, then k : X × X → R given by k(x, y) := f (k1 (x, y), . . . , kn (x, y)) 2604 G ENERATING K ERNELS BY F UZZY R ELATIONS is a kernel for any choice of kernels k1 , . . . , kn on X × X iff f is the real restriction of an entire function on Cn of the form f (x1 , . . . , xn ) = ∑ r1 ≥0,...,rn ≥0 r r cr1 ,...,rn x11 · · · xnn (1) where cr1 ,...,rn ≥ 0 for all nonnegative indices r1 , . . . , rn . 2.2 Triangular Norms Triangular norms have been originally studied within the framework of probabilistic metric spaces, see Schweizer and Sklar (1961) and Schweizer and Sklar (1983). In this context, t-norms proved to be an appropriate concept when dealing with triangle inequalities. 
Later on, t-norms and their dual version, t-conorms, have been used to model conjunction and disjunction for many-valued logic, see Dubois and Prade (1985), Gottwald (1986), Gottwald (1993) and Klement et al. (2000). Definition 3 A function T : [0, 1]2 → [0, 1] is called t-norm (triangular norm), if it satisfies the following conditions: (i) (ii) (iii) (iv) ∀x, y ∈ [0, 1] : ∀x, y, z ∈ [0, 1] : ∀x, y, z ∈ [0, 1] : ∀x, y ∈ [0, 1] : T (x, y) = T (y, x) T (x, T (y, z)) = T (T (x, y), z) y ≤ z =⇒ T (x, y) ≤ T (x, z) T (x, 1) = x ∧ T (1, y) = y (commutativity) (associativity) (monotonicity) (boundary condition) Further, a t-norm is called Archimedean if it is continuous and satisfies x ∈ (0, 1) ⇒ T (x, x) < x. Due to its associativity, many-placed extensions Tn : [0, 1]n → [0, 1], n ∈ N, of a t-norm T are uniquely determined by Tn (x1 , . . . , xn ) = T (x1 , Tn−1 (x2 , . . . , xn )). Archimedean t-norms are characterized by the following representation theorem due to Ling (1965): Theorem 4 Let T : [0, 1]2 → [0, 1] be a t-norm. Then T is Archimedean if, and only if, there is a continuous, strictly decreasing function f : [0, 1] → [0, ∞] with f (1) = 0 such that for x, y ∈ [0, 1], T (x, y) = f −1 (min( f (x) + f (y), f (0))). By setting g(x) = exp (− f (x)), Ling’s characterization yields an alternative representation with a multiplicative generator function T (x, y) = g−1 (max(g(x) g(y), g(0))). For g(x) = x we get the product TP (x, y) = x y. The setting f (x) = 1 − x yields the so-called Łukasiewcz t-norm TL (x, y) = max(x + y − 1, 0). Due to Ling’s theorem 4 an Archimedean t-norm T is isomorphic either to TL or TP , depending on whether the additive generator takes a finite value at 0 or not. In the former case, the Archimedean t-norm is called non-strict, in the latter it is called strict. 2605 M OSER A many-valued model of an implication is provided by the so-called residuum given by → T (a, b) = sup{c ∈ [0, 1]|T (a, c) ≤ b} (2) where T is a left-continuous t-norm. Equation (2) is uniquely determined by the so-called adjunction property → ∀a, b, c ∈ [0, 1] : T (a, b) ≤ c ⇔ a ≤ T (b, c). Consequently, the operator ↔ → → T (a, b) = min T (a, b), T (b, a) (3) (4) models a biimplication. For details, for example, see Gottwald (1986) and Klement et al. (2000). → Tables 1 and 2 list examples of t-norms with their induced residuum T . For further examples see, for example, Klement et al. (2000). √ √ Tcos (a, b) = max(ab − 1 − a2 1 − b2 , 0) TL (a, b) = max(a + b − 1, 0) TP (a, b) = ab TM (a, b) = min(a, b) Table 1: Examples of t-norms → T cos (a, b) = → T L (a, b) = → = T P (a, b) → T M (a, b) = cos(arccos(b) − arccos(a)) if a > b, 1 else min(b − a + 1, 1) b if a > b, a 1 else b if a > b, 1 else Table 2: Examples of residuums 2.3 T -Equivalences If we want to classify based on a notion of similarity or indistinguishability, we face the problem of transitivity. For instance, let us consider two real numbers to be indistinguishable if and only if they differ by at most a certain bound ε > 0, this is modeled by the relation ∼ ε given by x ∼ε y :⇔ |x−y| < ε, ε > 0, x, y ∈ R. Note that the relation ∼ε is not transitive and, therefore, not an equivalence relation. The transitivity requirement turns out to be too strong for this example. The problem of identification and transitivity in the context of similarity of physical objects was early pointed out and discussed philosophically by Poincar´ (1902) and Poincar´ (1904). 
In the framework of fuzzy e e logic, the way to overcome this problem is to model similarity by fuzzy relations based on a many¨ valued concept of transitivity, see Bodenhofer (2003), H ohle (1993), H¨ hle (1999), Klement et al. o (2000) and Zadeh (1971). 2606 G ENERATING K ERNELS BY F UZZY R ELATIONS Definition 5 A function E : X 2 −→ [0, 1] is called a fuzzy equivalence relation, or synonymously, T -equivalence with respect to the t-norm T if it satisfies the following conditions: (i) ∀x ∈ X : E(x, x) = 1 (reflexivity) (ii) ∀x, y ∈ X : E(x, y) = E(y, x) (symmetry) (iii) ∀x, y, z ∈ X : T (E(x, y), E(y, z)) ≤ E(x, z) (T-transitivity). The value E(x, y) can be also looked at as the (quasi) truth value of the statement “x is equal to y”. Following this semantics, T-transitivity can be seen as a many-valued model of the proposition, “If x is equal to y and y is equal to z, then x is equal to z”. T -equivalences for Archimedean t-norms are closely related to metrics and pseudo-metrics as shown by Klement et al. (2000) and Moser (1995). Theorem 6 Let T be an Archimedean t-norm given by ∀a, b ∈ [0, 1] : T (a, b) = f −1 (min( f (a) + f (b), f (0))), where f : [0, 1] → [0, ∞] is a strictly decreasing, continuous function with f (1) = 0. (i) If d : X 2 → [0, ∞[ is a pseudo-metric, then the function Ed : X 2 → [0, 1] defined by Ed (x, y) = f −1 (min(d(x, y), f (0))) is a T -equivalence with respect to the t-norm T . (ii) If E : X 2 → [0, 1] is a T -equivalence relation, then the function dE : X 2 → [0, ∞] defined by dE (x, y) = f (E(x, y)) is a pseudo-metric. → Another way to construct T -equivalences is to employ T -operators. The proof of the following assertion can be found in Trillas and Valverde (1984), Kruse et al. (1993) and Kruse et al. (1994). ↔ Theorem 7 Let T be a left-continuous t-norm, T its induced biimplication, µi : X → [0, 1], i ∈ I, I non-empty; then E : X × X → [0, 1] given by ↔ E(x, y) = inf T (µi (x), µi (y)) i∈I (5) is a T -equivalence relation. ¨ For further details on T -equivalences see also Boixader and Jacas (1999), H oppner et al. (2002), Jacas (1988), Trillas et al. (1999) and Valverde (1985). 3. Representing Kernels by T -Equivalences It is interesting that the concept of kernels, which is motivated by geometric reasoning in terms of inner products and mappings to Hilbert spaces and which is inherently formulated by algebraic terms, is closely related to the concept of fuzzy equivalence relations as demonstrated and discussed in more detail in Moser (2006). In this section, we start with the result that any kernel k : X × X → [0, 1] with k(x, x) = 1 for all x ∈ X is T -transitive and, therefore, a fuzzy equivalence relation. The proof can be found in Moser (2006), see also Appendix A.1. 2607 M OSER Theorem 8 Any kernel k : X × X → [0, 1] with k(x, x) = 1 is (at least) Tcos -transitive, where 1 − a2 Tcos (a, b) = max{a b − 1 − b2 , 0}. (6) The nomenclature is motivated by the fact that the triangular norm defined by Equation (6) is an Archimedean t-norm which is generated by the arcosine function as its additive generator. From this result, the following existence theorem can be derived, which guarantees that any kernel under consideration can be represented by the fuzzy-logical formula given by (5). In fuzzy systems, this formula is commonly used for modeling rule bases (see, for example, Kruse et al., 1993, 1994). 
Theorem 9 Let X be a non-empty universe of discourse, k : X × X → [0, 1] a kernel in the sense of Definition 1 and k(x, x) = 1 for all x ∈ X ; then there is a family of membership functions µ i : X → [0, 1], i ∈ I, I non-empty and a t-norm T , such that ↔ ∀x, y ∈ X : k(x, y) = inf T (µi (x), µi (y)). i∈I (7) Proof. Let us set I := X , µx0 (x) = k(x, x0 ) and let us choose Tcos as t-norm. For convenience let us denote ↔ h(x, y) = inf T cos (µx0 (x), µx0 (y)), x0 ∈X which is equivalent to ↔ h(x, y) = inf T cos (k(x0 , x), k(x0 , y)). x0 ∈X According to Theorem 8, k is Tcos -transitive, that is, ↔ ∀x0 , x, y ∈ X : T cos (k(x0 , x), k(x0 , y)) ≤ k(x, y). This implies that h(x, y) ≤ k(x, y) for all x, y ∈ X . Now let us consider the other inequality. Due to the adjunction property (3), we obtain → Tcos (k(x, y), k(x0 , y)) ≤ k(x, x0 ) ⇔ k(x, y) ≤ T cos (k(x0 , y), k(x, x0 )) and → Tcos (k(x, y), k(x0 , x)) ≤ k(y, x0 ) ⇔ k(x, y) ≤ T cos (k(x0 , x), k(y, x0 )), from which it follows that → → ∀x, y, x0 ∈ X : k(x, y) ≤ min{ T cos (k(x0 , y), k(x, x0 )), T cos (k(x0 , x), k(y, x0 ))}. Hence by Definition 4, ∀x, y ∈ X : k(x, y) ≤ h(x, y) which ends the proof. For an arbitrary choice of fuzzy membership functions, there is no necessity that the resulting relation (7) implies positive-semidefiniteness and, therefore, a kernel. For an example of a Tcos equivalence which is not a kernel see Appendix A.4. Theorem 9 guarantees only the existence of a representation of the form (5) but it does not tell us how to construct the membership functions µ i . In the following section, we provide examples of fuzzy equivalence relations which yield kernels for any choice of membership functions. 2608 G ENERATING K ERNELS BY F UZZY R ELATIONS 4. Constructing Kernels by Fuzzy Equivalence Relations In the Boolean case, positive-definiteness and equivalence are synonymous, that is, a Boolean relation R : X × X → {0, 1} is positive-definite if and only if R is the indicator function of an equivalence relation ∼ that is, R(x, y) = 1 if x ∼ y and R(x, y) = 0 if x ∼ y. For a proof, see Appendix A.2. This = = =, relationship can be used to obtain an extension to fuzzy relations as given by the next theorem whose proof can be found in the Appendix A.3. Theorem 10 Let X be a non-empty universe of discourse, µ i : X → [0, 1], i ∈ I, I non-empty; then the fuzzy equivalence relation EM : X × X → [0, 1] given by ↔ EM (x, y) = inf T M (µi (x), µi (y)) i∈I is positive-semidefinite. In the following, the most prominent representatives of Archimedean t-norms, the Product TP and the Łukasiewicz t-norm TL , are used to construct positive-semidefinite fuzzy similarity relations. Though the first part can also be derived from a result due to Yaglom (1957) that characterizes isotropic stationary kernels by its spectral representation, here we prefer to present a direct, elementary proof. Compare also Bochner (1955) and Genton (2001). Theorem 11 Let X be a non-empty universe of discourse, ν : X → [0, 1] and let h : [0, 1] → [0, 1] be an isomorphism of the unit interval that can be expanded in the manner of Equation (1), that is h(x) = ∑k ck xk with ck ≥ 0; then the fuzzy equivalence relations EL,h , EP,h : X × X → [0, 1] given by ↔ EL,h (x, y) = h T L h−1 (ν(x)) , h−1 (ν(y)) and ↔ EP,h (x, y) = h T P h−1 (ν(x)) , h−1 (ν(y)) (8) (9) are positive-semidefinite. Proof. 
To prove the positive-definiteness of the two-placed functions E L,h and EP,h given by equations (8) and (9) respectively, we have to show that n n ∑ i, j=1 EL,h (xi , xi ) ci c j ≥ 0, ∑ i, j=1 EP,h (xi , x j ) ci c j ≥ 0 for any n ∈ N and any choice of x1 , . . . , xn ∈ X , respectively. According to an elementary result from Linear Algebra this is equivalent to the assertion that the determinants (1 ≤ m ≤ n) Dm = det (E(xi , x j ))i, j∈{1,...,m} of the minors of the matrix (E(xi , x j ))i, j satisfy ∀m ∈ {1, . . . , n} : Dm ≥ 0, where E denotes either EL,h or EP,h . Recall that the determinant of a matrix is invariant with respect to renaming the indices, that is, if σ : {1, . . . , n} → {1, . . . , n} is a permutation then det [(ai j )i, j ] = det (aσ(i)σ( j) )i, j . 2609 M OSER For convenience, let denote µi = h−1 (ν(xi )). Then, without loss of generality, we may assume that the values µi are ordered monotonically decreasing, that is, µi ≥ µ j for i < j. ↔ → (10) → Case TL : Note that T L (a, b) = min{ T L (a, b), T L (b, a)} = 1 − |a − b|. Then we have to show that for all dimensions n ∈ N, the determinant of E (n) = (1 − |µi − µ j |)i, j∈{1,...,n} is non-negative, that is Due to the assumption (10), we have det[E (n) ] ≥ 0. 1 − |µi − µ j | = 1 − (µi − µ j ) if i ≤ j, 1 − (µ j − µi ) else which yields . . . 1 − (µ1 − µn−1 ) 1 − (µ1 − µn ) . . . 1 − (µ2 − µn−1 ) 1 − (µ2 − µn ) . . . 1 − (µ3 − µn−1 ) 1 − (µ3 − µn ) (n) E = . . . .. . . . . . 1 − (µ1 − µn−1 ) 1 − (µ2 − µn−1 ) . . . 1 1 − (µn−1 − µn ) 1 − (µ1 − µn ) 1 − (µ2 − µn ) . . . 1 − (µn−1 − µn ) 1 1 − (µ1 − µ2 ) 1 1 − (µ2 − µ3 ) . . . 1 1 − (µ1 − µ2 ) 1 − (µ1 − µ3 ) . . . Now let us apply determinant-invariant elementary column operations to simplify this matrix by subtracting the column with index i − 1 from the column with index i, i ≥ 2. This yields 1 µ2 − µ1 ... µn−1 − µn−2 µn − µn−1 1 − (µ1 − µ2 ) −(µ2 − µ1 ) . . . µn−1 − µn−2 µn − µn−1 1 − (µ1 − µ3 ) −(µ2 − µ1 ) . . . µn−1 − µn−2 µn − µn−1 ˜ E (n) = . . . . . .. . . . . . . . . . 1 − (µ1 − µn−1 ) −(µ2 − µ1 ) . . . −(µn−2 − µn−1 ) µn − µn−1 1 − (µ1 − µn ) −(µ2 − µ1 ) . . . −(µn−2 − µn−1 ) −(µn−1 − µn ) Therefore, α = n ∏(µi−1 − µi ) ≥ 0 (11) i=2 ˜ ˆ det[E (n) ] = det[E (n) ] = α det[En ], where . . . −1 −1 . . . −1 −1 . . . −1 −1 (n) ˆ E = . . . .. . . . . . 1 − (µ1 − µn−1 ) +1 . . . +1 −1 1 − (µ1 − µn ) +1 . . . +1 +1 1 1 − (µ1 − µ2 ) 1 − (µ1 − µ3 ) . . . 2610 −1 +1 +1 . . . (12) G ENERATING K ERNELS BY F UZZY R ELATIONS Let us apply Laplacian determinant expansion by minors to the first column of matrix (12), that is n det[A] = ∑ (−1)i+ j ai j det[Ai j ] i=1 where A = (ai j ) is an n × n-matrix, j arbitrarily chosen from {1, . . . , n} and Ai j is the matrix corresponding to the cofactor ai j obtained by canceling out the i-th row and the j-th column from A (see, ˆ for example, Muir, 1960). For n = 1, we get the trivial case det[ E (1) ] = 1. Note that the first and (n) ˆ the last rows of the matrices Ei,1 for 1 < i < n only differ by their signum, consequently the minors ˆ (n) det[Ei,1 ] for 1 < i < n, n ≥ 2, are vanishing, that is, det[Ai,1 ] = 0, for 1 < i < n. Therefore, according to the Laplacian expansion, we get (n) (n) ˆ ˆ ˆ det[E (n) ] = 1 · det[E1,1 ] + (−1)n (1 − (µ1 − µn )) · det[E1,n ]. (13) Observe that (n) ˆ det[E1,1 ] = 2n−2 (n) ˆ det[E1,n ] = (−1)n−1 2n−2 . Consequently, Equation (13) simplifies to ˆ det[E (n) ] = 2n−2 1 + (−1)n (−1)n−1 2n−2 (1 − (µ1 − µn )) = 2n−2 (1 − (1 − (µ1 − µn ))) = 2n−2 (µ1 − µn ) ≥ 0 which together with (11) proves the first case. 
Case $T_P$: First of all, let us compute $\overleftrightarrow{T}_P(a,b) = \min\{\vec{T}_P(a,b), \vec{T}_P(b,a)\}$. Hence,

$$\overleftrightarrow{T}_P(a,b) = \begin{cases} \min\{\tfrac{b}{a}, \tfrac{a}{b}\} & \text{if } a, b > 0,\\ 0 & \text{if } a = 0 \text{ and } b > 0,\\ 0 & \text{if } b = 0 \text{ and } a > 0,\\ 1 & \text{if } a = 0 \text{ and } b = 0. \end{cases}$$

Again, without loss of generality, let us suppose that the values $\mu_i$, $i \in \{1,\dots,n\}$, are ordered monotonically decreasingly, that is, $\mu_1 \ge \mu_2 \ge \dots \ge \mu_n$. Before checking the general case, let us consider the special case of vanishing $\mu$-values. For this, assume for the moment that $\mu_i > 0$ if $i < i_0$ and $\mu_i = 0$ else, which implies that $\overleftrightarrow{T}_P(\mu_i,\mu_j) = 0$ for $i < i_0$ and $j \ge i_0$, and $\overleftrightarrow{T}_P(\mu_i,\mu_j) = 1$ for $i \ge i_0$ and $j \ge i_0$. This leads to a block decomposition of the matrix $E^{(n)} = \big(\overleftrightarrow{T}_P(\mu_i,\mu_j)\big)_{ij}$ such that

$$\det[E^{(n)}] = \det[E^{(i_0-1)}]\cdot\det[\mathbf{1}_{n-i_0+1}],$$

where $\mathbf{1}_k$ denotes the $k\times k$ matrix with constant entries 1, hence $\det[\mathbf{1}_{n-i_0+1}] \in \{0,1\}$. Therefore, we may assume that $\mu_1 \ge \mu_2 \ge \dots \ge \mu_n > 0$. Then we have to show that for all dimensions $n \in \mathbb{N}$ the determinant of

$$E^{(n)} = \Big(\min\Big\{\frac{\mu_i}{\mu_j}, \frac{\mu_j}{\mu_i}\Big\}\Big)_{i,j\in\{1,\dots,n\}}$$

is non-negative, that is, $\det[E^{(n)}] \ge 0$. Consider

$$E^{(n)} = \begin{pmatrix}
1 & \frac{\mu_2}{\mu_1} & \frac{\mu_3}{\mu_1} & \cdots & \frac{\mu_{n-1}}{\mu_1} & \frac{\mu_n}{\mu_1}\\
\frac{\mu_2}{\mu_1} & 1 & \frac{\mu_3}{\mu_2} & \cdots & \frac{\mu_{n-1}}{\mu_2} & \frac{\mu_n}{\mu_2}\\
\frac{\mu_3}{\mu_1} & \frac{\mu_3}{\mu_2} & 1 & \cdots & \frac{\mu_{n-1}}{\mu_3} & \frac{\mu_n}{\mu_3}\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
\frac{\mu_n}{\mu_1} & \frac{\mu_n}{\mu_2} & \frac{\mu_n}{\mu_3} & \cdots & \frac{\mu_n}{\mu_{n-1}} & 1
\end{pmatrix}. \tag{14}$$

Now, for $1 \le i < n$, multiply the $i$-th column of matrix (14) by $-\mu_{i+1}/\mu_i$ and add it to the $(i+1)$-th column. Then we get

$$\tilde{E}^{(n)} = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
* & 1-\big(\frac{\mu_2}{\mu_1}\big)^2 & 0 & \cdots & 0 & 0\\
* & * & 1-\big(\frac{\mu_3}{\mu_2}\big)^2 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
* & * & * & \cdots & 1-\big(\frac{\mu_{n-1}}{\mu_{n-2}}\big)^2 & 0\\
* & * & * & \cdots & * & 1-\big(\frac{\mu_n}{\mu_{n-1}}\big)^2
\end{pmatrix}, \tag{15}$$

where $*$ is a placeholder for an arbitrary real value. As the matrix in Equation (15) is lower triangular, its determinant readily turns out to be

$$\det[E^{(n)}] = \det[\tilde{E}^{(n)}] = \prod_{i=1}^{n-1}\Big(1-\Big(\frac{\mu_{i+1}}{\mu_i}\Big)^2\Big) \ge 0,$$

which together with Theorem 2 ends the proof.

Note that the relations (8) and (9) are $T$-transitive with respect to the corresponding isomorphic Archimedean t-norms

$$T_{L,h}(x,y) = h\big(T_L(h^{-1}(x), h^{-1}(y))\big) \quad\text{and}\quad T_{P,h}(x,y) = h\big(T_P(h^{-1}(x), h^{-1}(y))\big),$$

respectively.

Corollary 12 Let $X$ be a non-empty universe of discourse, $\mu_i: X \to [0,1]$ membership functions and $\lambda_i \in\, ]0,1]$ weights with $\sum_i \lambda_i = 1$, $i \in \{1,\dots,n\}$, $n \in \mathbb{N}$. Then the fuzzy equivalence relations $\tilde{E}_L, \tilde{E}_P: X \times X \to [0,1]$ given by

$$\tilde{E}_L(x,y) = \sum_{i=1}^{n} \lambda_i\,\overleftrightarrow{T}_L\big(\mu_i(x),\mu_i(y)\big) \tag{16}$$

and

$$\tilde{E}_P(x,y) = \prod_{i=1}^{n} \overleftrightarrow{T}_P\big(\mu_i(x),\mu_i(y)\big)^{\lambda_i} \tag{17}$$

are $T_L$- and $T_P$-equivalences, respectively, and kernels.

Proof. First of all, let us check the $T_L$-transitivity of formula (16). This readily follows from the definition of $T_L$ and the $T_L$-transitivity of $\overleftrightarrow{T}_L$ by means of the following inequalities:

$$\begin{aligned}
T_L\big(\tilde{E}_L(x,y), \tilde{E}_L(y,z)\big) &= \max\Big\{\sum_{i=1}^{n} \lambda_i \overleftrightarrow{T}_L(\mu_i(x),\mu_i(y)) + \sum_{i=1}^{n} \lambda_i \overleftrightarrow{T}_L(\mu_i(y),\mu_i(z)) - 1,\ 0\Big\}\\
&= \max\Big\{\sum_{i=1}^{n} \lambda_i\big(\overleftrightarrow{T}_L(\mu_i(x),\mu_i(y)) + \overleftrightarrow{T}_L(\mu_i(y),\mu_i(z)) - 1\big),\ 0\Big\}\\
&\le \sum_{i=1}^{n} \lambda_i \max\big\{\overleftrightarrow{T}_L(\mu_i(x),\mu_i(y)) + \overleftrightarrow{T}_L(\mu_i(y),\mu_i(z)) - 1,\ 0\big\}\\
&= \sum_{i=1}^{n} \lambda_i\, T_L\big(\overleftrightarrow{T}_L(\mu_i(x),\mu_i(y)),\ \overleftrightarrow{T}_L(\mu_i(y),\mu_i(z))\big)\\
&\le \sum_{i=1}^{n} \lambda_i\, \overleftrightarrow{T}_L(\mu_i(x),\mu_i(z)) = \tilde{E}_L(x,z),
\end{aligned}$$

where the second equality uses $\sum_i \lambda_i = 1$. This, together with the $T_P$-transitivity of $\overleftrightarrow{T}_P$ (the product of the $\lambda_i$-th powers factorizes analogously), proves that the formulas given by (16) and (17) are $T_L$- and $T_P$-equivalences, respectively. Expanding the factors of formula (17) yields

$$\overleftrightarrow{T}_P\big(\mu_i(x),\mu_i(y)\big)^{\lambda_i} = \begin{cases} 1 & \text{if } \mu_i(x) = \mu_i(y) = 0,\\[4pt] \dfrac{\min\big(\mu_i(x)^{\lambda_i}, \mu_i(y)^{\lambda_i}\big)}{\max\big(\mu_i(x)^{\lambda_i}, \mu_i(y)^{\lambda_i}\big)} & \text{else,} \end{cases} \tag{18}$$

which, by comparison with case $T_P$ of the proof of Theorem 11, shows that the left-hand side of Equation (18) is positive-semidefinite. As the convex combination and the product are special cases of positive-semidefiniteness-preserving functions according to Theorem 1, the functions defined by Equations (16) and (17) are again positive-semidefinite and, therefore, kernels.
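The following sketch illustrates Corollary 12 numerically (assuming NumPy; the sample size, the number of membership functions and the random weights are arbitrary illustrative choices, not part of the paper). It evaluates the Gram matrices of $\tilde{E}_L$ and $\tilde{E}_P$ from (16) and (17) on a random sample and checks that both are positive-semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical toy data: m sample points, n membership functions evaluated on them
m, n = 25, 3
M = rng.uniform(0.05, 1.0, size=(m, n))     # M[k, i] = mu_i(x_k) > 0
lam = rng.dirichlet(np.ones(n))             # lambda_i in ]0, 1] with sum 1

def biimpl_L(a, b):
    # Lukasiewicz biimplication: 1 - |a - b|
    return 1.0 - np.abs(a - b)

def biimpl_P(a, b):
    # product biimplication for positive arguments: min(a, b) / max(a, b)
    return np.minimum(a, b) / np.maximum(a, b)

# Gram matrices of E_L (Equation 16) and E_P (Equation 17) on the sample
EL = np.zeros((m, m))
EP = np.ones((m, m))
for i in range(n):
    col = M[:, i]
    EL += lam[i] * biimpl_L(col[:, None], col[None, :])
    EP *= biimpl_P(col[:, None], col[None, :]) ** lam[i]

# both aggregated relations are positive-semidefinite, hence kernels
for G in (EL, EP):
    assert np.allclose(G, G.T)
    assert np.linalg.eigvalsh(G).min() >= -1e-10
```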
It is interesting to observe that both formulas (16) and (17) can be expressed in the form

$$f\big(\|\tau(x)-\tau(y)\|_1\big),$$

where $f: I \to [0,1]$, $I$ some interval, is a strictly decreasing function, $\tau: X \to I^n$ with $\tau(x) = (\tau_1(x),\dots,\tau_n(x))$, and $\|\tau(x)\|_1 = \sum_{i=1}^{n}|\tau_i(x)|$. Indeed, for Equation (16) let us define

$$f_L: [0,1] \to [0,1],\quad f_L(a) = 1-a, \qquad \tau_L: X \to [0,1]^n,\quad \tau_L(x) = \big(\lambda_1\mu_1(x),\dots,\lambda_n\mu_n(x)\big),$$

and for Equation (17) and positive membership functions $\mu_i$, $\mu_i(x) > 0$ for all $x \in X$, let us define

$$f_P: [0,\infty[\ \to [0,1],\quad f_P(a) = e^{-a}, \qquad \tau_P: X \to\ ]-\infty,0]^n,\quad \tau_P(x) = \big(\lambda_1\ln(\mu_1(x)),\dots,\lambda_n\ln(\mu_n(x))\big).$$

Therefore, we get

$$\tilde{E}_L(x,y) = 1 - \|\tau_L(x)-\tau_L(y)\|_1 \tag{19}$$

and

$$\tilde{E}_P(x,y) = e^{-\|\tau_P(x)-\tau_P(y)\|_1}. \tag{20}$$

While formulas (19) and (20) provide a geometrical interpretation by means of the norm $\|\cdot\|_1$, the corresponding formulas (16) and (17) yield a semantical model of the assertion "IF x is equal to y with respect to feature $\mu_1$ AND ... AND x is equal to y with respect to feature $\mu_n$ THEN x is equal to y" as an aggregation of biimplications in terms of fuzzy logic. While in the former case the aggregation has a compensatory effect, the latter is just a conjunction in terms of the product triangular norm. For details on aggregation operators see, for example, Saminger et al. (2002) and Calvo et al. (2002).
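A minimal sanity check of the $\ell_1$-norm representations (19) and (20) is sketched here (assuming NumPy; the membership values and weights are arbitrary positive choices, since positivity is required for the $\tilde{E}_P$ case):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 4
lam = rng.dirichlet(np.ones(n))            # weights lambda_i with sum 1
mu_x = rng.uniform(0.05, 1.0, size=n)      # mu_i(x) > 0
mu_y = rng.uniform(0.05, 1.0, size=n)      # mu_i(y) > 0

# aggregated biimplications, Equations (16) and (17)
E_L = np.sum(lam * (1.0 - np.abs(mu_x - mu_y)))
E_P = np.prod((np.minimum(mu_x, mu_y) / np.maximum(mu_x, mu_y)) ** lam)

# feature maps tau_L, tau_P and the l1-norm representations (19) and (20)
tau_L_x, tau_L_y = lam * mu_x, lam * mu_y
tau_P_x, tau_P_y = lam * np.log(mu_x), lam * np.log(mu_y)

assert np.isclose(E_L, 1.0 - np.abs(tau_L_x - tau_L_y).sum())
assert np.isclose(E_P, np.exp(-np.abs(tau_P_x - tau_P_y).sum()))
```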
The formulas (16) and (17) coincide in the following special case: if the membership functions $\mu_i$ are indicator functions of sets $A_i \subseteq X$ which form a partition of $X$, then the kernels (16) and (17) reduce to the indicator function characterizing the Boolean equivalence relation induced by the partition $\{A_1,\dots,A_n\}$. For general membership functions, formulas (16) and (17) therefore provide kernels which can be interpreted as being induced by a family of fuzzy sets and, in particular, by fuzzy partitions, that is, families of fuzzy sets fulfilling criteria which extend the axioms of a Boolean partition in a many-valued logical sense. For definitions and further details on fuzzy partitions see, for example, De Baets and Mesiar (1998), Demirci (2003) and Höppner and Klawonn (2003).

It is a frequently used paradigm that the decision boundaries of a classification problem lie between clusters rather than intersecting them. Due to this cluster hypothesis, the problem of designing kernels based on fuzzy partitions is closely related to the problem of learning kernels from unlabeled data. For further details on semi-supervised learning see, for example, Seeger (2002), Chapelle et al. (2003) and Huang et al. (2006). It is left to future research to explore this relationship to the problem of learning from labeled and unlabeled data and to related concepts such as covariance kernels.

5. Conclusion

In this paper, we have presented a novel view on kernels from a fuzzy logical point of view. In particular, the similarity-measure aspect of a kernel is addressed and investigated by means of the so-called T-transitivity, which is characteristic for fuzzy equivalence relations. As a consequence, we derived that a large class of kernels can be represented in a way that is commonly used for representing fuzzy rule bases. In addition to this proof of the existence of such a representation, constructive examples have been presented. It is the idea of this research to look for a combination of knowledge-based strategies with kernel-based methods in order to facilitate a more flexible design process for kernels, one that also allows prior knowledge to be incorporated. Further research aims at analyzing the behavior of kernels constructed in this way when applied in the various kernel methods, such as support vector machines, kernel principal component analysis and others. In particular, it is intended to focus on the problem of learning kernels from unlabeled data where the fuzzy partitions are induced by appropriate clustering principles.

Acknowledgments

Bernhard Moser gratefully acknowledges partial support by the Austrian Government, the State of Upper Austria, and the Johannes Kepler University Linz in the framework of the Kplus Competence Center Program. Furthermore, special thanks go to the anonymous reviewers who gave helpful suggestions and to Felix Kossak for careful proof-reading.

Appendix A.

For the sake of completeness, the following sections provide proofs regarding Theorem 8, the characterization of kernels in the Boolean case, and the construction of kernels by means of the minimum t-norm $T_M$. Furthermore, in Section A.4 an example of a non-positive-semidefinite $T_{\cos}$-equivalence is given.

A.1 Proof of Theorem 8

Let us start with the analysis of 3-dimensional matrices.

Lemma 13 Let $M = (m_{ij})_{ij} \in [0,1]^{3\times 3}$ be a symmetric $3\times 3$ matrix with $m_{ii} = 1$, $i = 1,2,3$. Then $M$ is positive-semidefinite if and only if for all $i,j,k \in \{1,2,3\}$ there holds

$$m_{ij}\, m_{jk} - \sqrt{1-m_{ij}^2}\,\sqrt{1-m_{jk}^2} \le m_{ik}.$$

Proof. For simplicity, let $a = m_{12}$, $b = m_{13}$ and $c = m_{23}$. Then the determinant of $M$ is a function of the variables $a, b, c$ given by

$$D(a,b,c) = 1 + 2abc - a^2 - b^2 - c^2.$$

For any choice of $a, b$, the quadratic equation $D(a,b,c) = 0$ can be solved for $c$, yielding the two solutions

$$c_1(a,b) = ab - \sqrt{1-a^2}\,\sqrt{1-b^2}, \qquad c_2(a,b) = ab + \sqrt{1-a^2}\,\sqrt{1-b^2}$$

as functions of $a$ and $b$. Obviously, for all $|a| \le 1$ and $|b| \le 1$, the values $c_1(a,b)$ and $c_2(a,b)$ are real. By substituting $a = \cos(\alpha)$ and $b = \cos(\beta)$ with $\alpha, \beta \in [0, \pi/2]$, it becomes readily clear that

$$c_1(a,b) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta) = \cos(\alpha+\beta) \in [-1,1]$$

and, analogously,

$$c_2(a,b) = \cos(\alpha)\cos(\beta) + \sin(\alpha)\sin(\beta) = \cos(\alpha-\beta) \in [-1,1].$$

As for all $a, b \in [-1,1]$ the determinant function $D_{a,b}(c) := D(a,b,c)$ is quadratic in $c$ with negative coefficient for $c^2$, there is a uniquely determined maximum at $c_0(a,b) = ab$. Note that for all $a, b \in [-1,1]$ we have $c_1(a,b) \le c_0(a,b) \le c_2(a,b)$ and

$$D(a,b,c_0(a,b)) = 1 + 2ab(ab) - a^2 - b^2 - (ab)^2 = (1-a^2)(1-b^2) \ge 0.$$

Therefore, $D(a,b,c) \ge 0$ if and only if $c \in [c_1(a,b), c_2(a,b)]$. Recall from linear algebra that the determinant does not change under a simultaneous renaming of the indices. Therefore, without loss of generality, we may assume that $a \ge b \ge c$. For convenience, let $Q = \{(x,y,z) \in [0,1]^3 \mid x \ge y \ge z\}$. Then, obviously, for any choice of $a, b \in [0,1]$ there holds $(a, b, c_1(a,b)) \in Q$. Elementary algebra shows that $(a, b, c_2(a,b)) \in Q$ is only the case for $a = b = 1$. As for $a = b = 1$ the two solutions coincide, $c_1(1,1) = c_2(1,1) = 1$, it follows that for any choice of $(a,b,c) \in Q$ there holds $D(a,b,c) \ge 0$ if and only if

$$c_1(a,b) = ab - \sqrt{1-a^2}\,\sqrt{1-b^2} \le c. \tag{21}$$

For $(a, b, c) \in Q$, the remaining inequalities of the lemma are trivially satisfied, which together with (21) proves the lemma. Now Theorem 8 immediately follows from Definition 1, Lemma 13 and the characterizing inequality (21).
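The criterion of Lemma 13 can be probed numerically with the following sketch (assuming NumPy; the number of random trials and the tolerances are arbitrary choices). For random symmetric $3\times 3$ matrices with unit diagonal and off-diagonal entries in $[0,1]$, positive-semidefiniteness, decided via the smallest eigenvalue, coincides with the inequality of the lemma evaluated over all ordered triples of distinct indices:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)

def lemma13_criterion(M):
    # m_ij * m_jk - sqrt(1 - m_ij^2) * sqrt(1 - m_jk^2) <= m_ik for all distinct i, j, k
    return all(
        M[i, j] * M[j, k] - np.sqrt(1 - M[i, j] ** 2) * np.sqrt(1 - M[j, k] ** 2)
        <= M[i, k] + 1e-12
        for i, j, k in permutations(range(3), 3)
    )

for _ in range(10000):
    a, b, c = rng.uniform(0.0, 1.0, size=3)
    M = np.array([[1.0, a, b], [a, 1.0, c], [b, c, 1.0]])
    if abs(np.linalg.det(M)) < 1e-6:     # skip numerically degenerate borderline cases
        continue
    is_psd = np.linalg.eigvalsh(M).min() > 0
    assert is_psd == lemma13_criterion(M)
```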
A.2 Characterization of Kernels in the Boolean Case

The following lemma and proposition can also be found as an exercise in Schölkopf and Smola (2002).

Lemma 14 Let $\sim$ be an equivalence relation on $X$ and let $k: X \times X \to \{0,1\}$ be induced by $\sim$ via $k(x,y) = 1$ if and only if $x \sim y$; then $k$ is a kernel.

Proof. Following the definition of positive-definiteness, let us consider an arbitrary sequence of elements $x_1,\dots,x_n$. Then there are at most $n$ equivalence classes $Q_1,\dots,Q_m$ on the set of indices $\{1,\dots,n\}$, $m \le n$, where $\bigcup_{p=1,\dots,m} Q_p = \{1,\dots,n\}$ and $Q_p \cap Q_q = \emptyset$ for $p \ne q$. Note that $k(x_i,x_j) = 0$ if the indices $i, j$ belong to different equivalence classes. Then, for any choice of reals $c_1,\dots,c_n$, we obtain

$$\sum_{i,j} c_i c_j k(x_i,x_j) = \sum_{p=1}^{m} \sum_{i,j\in Q_p} c_i c_j k(x_i,x_j) = \sum_{p=1}^{m} \sum_{i,j\in Q_p} c_i c_j = \sum_{p=1}^{m} \Big(\sum_{i\in Q_p} c_i\Big)^2 \ge 0.$$

Proposition 15 $k: X \times X \to \{0,1\}$ with $k(x,x) = 1$ for all $x \in X$ is a kernel if and only if it is induced by an equivalence relation.

Proof. It only remains to be shown that if $k$ is a kernel, then it is the indicator function of an equivalence relation, that is, it is induced by an equivalence relation. If $k$ is a kernel, then according to Lemma 13, for all $x, y, z \in X$ it has to satisfy $T_{\cos}(k(x,y), k(y,z)) \le k(x,z)$, which implies

$$k(x,y) = 1,\ k(y,z) = 1 \implies k(x,z) = 1.$$

Obviously, we also have $k(x,x) = 1$ and $k(x,y) = k(y,x)$ due to the reflexivity and symmetry assumptions on $k$, respectively.

A.3 Constructing Kernels by $T_M$

For convenience, let us recall the basic notion of an $\alpha$-cut from fuzzy set theory.

Definition 16 Let $X$ be a non-empty set and $\mu: X \to [0,1]$; then $[\mu]_\alpha = \{x \in X \mid \mu(x) \ge \alpha\}$ is called the $\alpha$-cut of the membership function $\mu$.

Lemma 17 $k: X \times X \to [0,1]$ is a $T_M$-equivalence if and only if all $\alpha$-cuts of $k$ are Boolean equivalence relations.

Proof. (i) Assume that $k$ is a $T_M$-equivalence, and let $\alpha \in [0,1]$. By definition, $[k]_\alpha = \{(x,y) \in X \times X \mid k(x,y) \ge \alpha\}$. In order to show that $[k]_\alpha$ is a Boolean equivalence, the axioms of reflexivity, symmetry and transitivity have to be checked. Reflexivity and symmetry are trivially satisfied, as for all $x, y \in X$ there holds $k(x,x) = 1$ and $k(x,y) = k(y,x)$ by assumption. In order to show transitivity, consider $(x,y), (y,z) \in [k]_\alpha$, that is, $k(x,y) \ge \alpha$ and $k(y,z) \ge \alpha$; then by the $T_M$-transitivity assumption it follows that $\alpha \le \min(k(x,y), k(y,z)) \le k(x,z)$, hence $(x,z) \in [k]_\alpha$.

(ii) Suppose now that all $\alpha$-cuts of $k$ are Boolean equivalence relations. Then, in particular, $[k]_\alpha$ with $\alpha = 1$ is reflexive, hence $k(x,x) = 1$ for all $x \in X$. The symmetry of $k$ follows from the fact that for all $\alpha \in [0,1]$ and all pairs $(x,y) \in [k]_\alpha$ we have, by assumption, $(y,x) \in [k]_\alpha$. In order to show the $T_M$-transitivity property, consider arbitrarily chosen elements $x, y, z \in X$ and let $\alpha = \min(k(x,y), k(y,z))$; then by the transitivity assumption on $[k]_\alpha$ it follows that $(x,z) \in [k]_\alpha$, consequently $k(x,z) \ge \alpha = \min(k(x,y), k(y,z))$.
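The following sketch illustrates Lemma 17 (assuming NumPy; the six-point universe and the nested partitions with attached similarity levels are a hypothetical toy example, not taken from the paper). It builds a $T_M$-equivalence from nested partitions, checks its $T_M$-transitivity, and verifies that every $\alpha$-cut is a Boolean equivalence relation:

```python
import numpy as np
from itertools import product

# hypothetical toy universe of six points and nested partitions, coarse to fine,
# each carrying a similarity level; the induced k is a T_M-equivalence
points = range(6)
levels = [
    (0.2, [{0, 1, 2, 3, 4, 5}]),
    (0.6, [{0, 1, 2}, {3, 4, 5}]),
    (0.9, [{0, 1}, {2}, {3}, {4, 5}]),
]

def k(x, y):
    if x == y:
        return 1.0
    shared = [alpha for alpha, blocks in levels
              if any(x in B and y in B for B in blocks)]
    return max(shared, default=0.0)

K = np.array([[k(x, y) for y in points] for x in points])

# T_M-transitivity: min(k(x,y), k(y,z)) <= k(x,z) for all x, y, z
assert all(min(K[x, y], K[y, z]) <= K[x, z]
           for x, y, z in product(points, repeat=3))

# Lemma 17: every alpha-cut is reflexive, symmetric and transitive
for alpha in (0.2, 0.6, 0.9, 1.0):
    R = K >= alpha
    assert R.diagonal().all() and (R == R.T).all()
    assert all(not (R[x, y] and R[y, z]) or R[x, z]
               for x, y, z in product(points, repeat=3))
```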
Proposition 18 If $k: X \times X \to [0,1]$ is a $T_M$-equivalence, then it is positive-semidefinite.

Proof. Choose arbitrary elements $x_1,\dots,x_n \in X$, consider the set of values taken by all combinations $k(x_i,x_j)$, $i,j \in \{1,\dots,n\}$, and order them increasingly, that is,

$$\{k(x_i,x_j) \mid i,j \in \{1,\dots,n\}\} = \{\alpha_1,\dots,\alpha_m\}, \qquad 0 \le \alpha_1 \le \dots \le \alpha_m \le 1.$$

Observe that for all pairs $(x_i,x_j)$, $i,j \in \{1,\dots,n\}$, there holds

$$k(x_i,x_j) = \sum_{v=2}^{m} (\alpha_v - \alpha_{v-1})\,\mathbf{1}_{[k]_{\alpha_v}}(x_i,x_j) + \alpha_1\,\mathbf{1}_{[k]_{\alpha_1}}(x_i,x_j),$$

showing that, on the set $\{x_1,\dots,x_n\} \times \{x_1,\dots,x_n\}$, the function $k$ is a linear combination of indicator functions of Boolean equivalences (which are positive-semidefinite by Proposition 15) with non-negative coefficients and, consequently, has to be positive-semidefinite.

A.4 Example of a Non-Positive-Semidefinite $T_{\cos}$-Equivalence

For dimensions $n > 3$, $T_{\cos}$-transitivity is no longer sufficient to guarantee positive semidefiniteness. Consider, for example, $A_n = (a^{(n)}_{ij})_{ij}$ with

$$a^{(n)}_{ij} = \begin{cases} \lambda & \text{if } \min(i,j) = 1,\ \max(i,j) > 1,\\ 1 & \text{if } i = j,\\ 0 & \text{else.} \end{cases} \tag{22}$$

Choose $\lambda = 1/\sqrt{2}$; then $T_{\cos}(\lambda,\lambda) = 0$, hence $T_{\cos}(a^{(n)}_{ij}, a^{(n)}_{jk}) \le a^{(n)}_{ik}$ for all indices $i,j,k \in \{1,\dots,n\}$. As $\det(A_n) < 0$ for $n > 3$, the matrix $A_n$ cannot be positive-semidefinite even though the $T_{\cos}$-transitivity conditions are satisfied.
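The counterexample (22) can be verified directly with the following sketch (assuming NumPy; $n = 4$ is the smallest dimension for which the determinant becomes negative):

```python
import numpy as np
from itertools import product

n = 4
lam = 1.0 / np.sqrt(2.0)
A = np.eye(n)
A[0, 1:] = lam          # first row and column carry the value lambda, Equation (22)
A[1:, 0] = lam

def t_cos(a, b):
    return max(a * b - np.sqrt(1.0 - a ** 2) * np.sqrt(1.0 - b ** 2), 0.0)

# T_cos-transitivity holds for every triple of indices ...
assert all(t_cos(A[i, j], A[j, k]) <= A[i, k] + 1e-12
           for i, j, k in product(range(n), repeat=3))

# ... yet A_4 is not positive-semidefinite: det(A_4) = 1 - 3/2 < 0
assert np.linalg.det(A) < 0
assert np.linalg.eigvalsh(A).min() < 0
```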
References

S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Los Angeles, California, 1955.

U. Bodenhofer. A note on approximate equality versus the Poincaré paradox. Fuzzy Sets and Systems, 133(2):155–160, 2003.

D. Boixader and J. Jacas. T-indistinguishability operators and approximate reasoning via CRI. In D. Dubois, E. P. Klement, and H. Prade, editors, Fuzzy Sets, Logics and Reasoning about Knowledge, volume 15 of Applied Logic Series, pages 255–268. Kluwer Academic Publishers, Dordrecht, 1999.

T. Calvo, G. Mayor, and R. Mesiar, editors. Aggregation Operators, volume 97 of Studies in Fuzziness and Soft Computing. Physica-Verlag, Heidelberg, 2002.

O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems, volume 15, 2003.

B. De Baets and R. Mesiar. T-partitions. Fuzzy Sets and Systems, 97:211–223, 1998.

M. Demirci. On many-valued partitions and many-valued equivalence relations. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 11(2):235–253, 2003.

D. Dubois and H. Prade. A review of fuzzy set aggregation connectives. Inform. Sci., 36:85–121, 1985.

C. H. FitzGerald, C. A. Micchelli, and A. Pinkus. Functions that preserve families of positive semidefinite matrices. Linear Alg. and Appl., 221:83–102, 1995.

M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.

S. Gottwald. Fuzzy set theory with t-norms and Φ-operators. In A. Di Nola and A. G. S. Ventre, editors, The Mathematics of Fuzzy Systems, volume 88 of Interdisciplinary Systems Research, pages 143–195. Verlag TÜV Rheinland, Köln, 1986.

S. Gottwald. Fuzzy Sets and Fuzzy Logic. Vieweg, Braunschweig, 1993.

U. Höhle. Fuzzy equalities and indistinguishability. In Proc. 1st European Congress on Fuzzy and Intelligent Technologies, volume 1, pages 358–363, Aachen, 1993.

U. Höhle. The Poincaré paradox and non-classical logics. In D. Dubois, E. P. Klement, and H. Prade, editors, Fuzzy Sets, Logics and Reasoning about Knowledge, volume 15 of Applied Logic Series, pages 7–16. Kluwer Academic Publishers, Dordrecht, 1999.

F. Höppner and F. Klawonn. Improved fuzzy partitions for fuzzy regression models. Internat. J. Approx. Reason., 32:85–102, 2003.

F. Höppner, F. Klawonn, and P. Eklund. Learning indistinguishability from data. Soft Computing, 6(1):6–13, 2002.

T. M. Huang, V. Kecman, and I. Kopriva. Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-supervised, and Unsupervised Learning. Springer-Verlag, Berlin, 2006.

J. Jacas. On the generators of T-indistinguishability operators. Stochastica, 12:49–63, 1988.

I. T. Jolliffe. Principal Component Analysis. Springer Verlag, New York, 1986.

E. P. Klement, R. Mesiar, and E. Pap. Triangular Norms, volume 8 of Trends in Logic. Kluwer Academic Publishers, Dordrecht, 2000.

R. Kruse, J. Gebhardt, and F. Klawonn. Fuzzy-Systeme. B. G. Teubner, Stuttgart, 1993.

R. Kruse, J. Gebhardt, and F. Klawonn. Foundations of Fuzzy Systems. John Wiley & Sons, New York, 1994.

C. H. Ling. Representation of associative functions. Publ. Math. Debrecen, 12:189–212, 1965.

B. Moser. A New Approach for Representing Control Surfaces by Fuzzy Rule Bases. PhD thesis, Johannes Kepler Universität Linz, October 1995.

B. Moser. On the t-transitivity of kernels. Fuzzy Sets and Systems, 157:1787–1796, 2006.

T. Muir. A Treatise on the Theory of Determinants. Dover, New York, 1960.

H. Poincaré. La Science et l'Hypothèse. Flammarion, Paris, 1902.

H. Poincaré. La Valeur de la Science. Flammarion, Paris, 1904.

S. Saminger, R. Mesiar, and U. Bodenhofer. Domination of aggregation operators and preservation of transitivity. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 10(Suppl.):11–35, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.

B. Schölkopf, A. J. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

B. Schweizer and A. Sklar. Associative functions and statistical triangle inequalities. Publ. Math. Debrecen, 8:169–186, 1961.

B. Schweizer and A. Sklar. Probabilistic Metric Spaces. North-Holland, Amsterdam, 1983.

M. Seeger. Covariance kernels from Bayesian generative models. In Advances in Neural Information Processing Systems, volume 14, pages 905–912, 2002.

E. Trillas and L. Valverde. An inquiry into indistinguishability operators. In H. J. Skala, S. Termini, and E. Trillas, editors, Aspects of Vagueness, pages 231–256. Reidel, Dordrecht, 1984.

E. Trillas, S. Cubillo, and E. Castiñeira. Menger and Ovchinnikov on indistinguishabilities revisited. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 7(3):213–218, 1999.

L. Valverde. On the structure of F-indistinguishability operators. Fuzzy Sets and Systems, 17(3):313–328, 1985.

A. M. Yaglom. Some classes of random fields in n-dimensional space, related to stationary random processes. Theory of Probability and its Applications, 2:273–320, 1957.

L. A. Zadeh. Similarity relations and fuzzy orderings. Inform. Sci., 3:177–200, 1971.
same-paper 2 0.98595858 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
4 0.79798013 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
Author: Guillaume Lecué
Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence
5 0.76047224 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
6 0.67680502 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
7 0.67456126 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
8 0.65626955 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
9 0.63238376 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
10 0.63109046 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
11 0.62853348 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
12 0.62005341 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs
13 0.61317629 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
14 0.5955587 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
15 0.58877349 16 jmlr-2006-Bounds for Linear Multi-Task Learning
16 0.58857727 66 jmlr-2006-On Model Selection Consistency of Lasso
17 0.55519515 48 jmlr-2006-Learning Minimum Volume Sets
18 0.5430674 45 jmlr-2006-Learning Coordinate Covariances via Gradients
19 0.53899324 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
20 0.53229493 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data