jmlr jmlr2010 jmlr2010-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
Reference: text
sentIndex sentText sentNum sentScore
1 First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. [sent-18, score-0.363]
2 , on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. [sent-24, score-0.589]
3 Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . [sent-25, score-0.468]
4 Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology. [sent-33, score-0.411]
5 Keywords: probability metrics, homogeneity tests, independence tests, kernel methods, universal kernels, characteristic kernels, Hilbertian metric, weak topology 1. [sent-34, score-0.605]
6 That γFc is a metric on P follows from the uniqueness theorem for characteristic functions (Dudley, 2002, Theorem 9. [sent-83, score-0.43]
7 • Structured domains: Since γk is dependent only on the kernel (see Theorem 1) and kernels can be defined on arbitrary domains M (Aronszajn, 1950), choosing F = Fk provides the flexibility of measuring the distance between probability measures defined on structured domains (Borgwardt et al. [sent-109, score-0.413]
8 (2008) introduced the concept of a characteristic kernel, that is, a reproducing kernel for which γk (P, Q) = 0 ⇔ P = Q, P, Q ∈ P, that is, γk is a metric on P. [sent-132, score-0.449]
9 (2007b) showed that H is characteristic if k is universal in the sense of Steinwart (2001, Definition 4), that is, H is dense in the Banach space of bounded continuous functions with respect to the supremum norm. [sent-136, score-0.378]
10 Second, universality is in any case an overly restrictive condition because universal kernels assume M to be compact, that is, they induce a metric only on the space of probability measures that are supported on compact M. [sent-144, score-0.493]
11 1, we present the simple characterization that integrally strictly positive definite (pd) kernels (see Section 1. [sent-146, score-0.579]
12 Examples of integrally strictly pd kernels on Rd include the Gaussian, Laplacian, inverse multiquadratics, Matérn kernel family, B2n+1 -splines, etc. [sent-150, score-0.788]
13 Although the above characterization of integrally strictly pd kernels being characteristic is simple to understand, it is only a sufficient condition and does not provide an answer for kernels that are not integrally strictly pd (see Footnote 2), for example, a Dirichlet kernel. [sent-151, score-1.518]
14 We present a complete characterization of characteristic kernels when the kernel is translation invariant on Rd . [sent-154, score-0.768]
15 We show that a bounded continuous translation invariant kernel on Rd is characteristic if and only if the support of the Fourier transform of the kernel is the entire Rd . [sent-155, score-0.706]
16 We also show that all compactly supported translation invariant kernels on Rd are characteristic. [sent-159, score-0.406]
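To make the two preceding statements concrete, the short sketch below evaluates closed-form spectra of three translation invariant kernels on R and reports where they are strictly positive; the kernel parameters, the grid, and the closed-form transforms are standard illustrative choices, not taken from the paper.

```python
import numpy as np

# Closed-form Fourier transforms (up to positive constants) of three translation
# invariant kernels psi on R, evaluated on a grid.  Illustrative parameters only.
omega = np.linspace(-10.0, 10.0, 2001)
sigma = 1.0

spectra = {
    # Gaussian psi(x) = exp(-x^2 / (2 sigma^2)): spectrum positive everywhere,
    # so its support is all of R and the kernel is characteristic.
    "gaussian": np.exp(-(sigma ** 2) * omega ** 2 / 2.0),
    # Sinc kernel psi(x) = sin(sigma x) / x: spectrum is an indicator of [-sigma, sigma],
    # so the support is compact and the kernel is not characteristic.
    "sinc": (np.abs(omega) <= sigma).astype(float),
    # B1-spline (triangle) kernel: spectrum sinc^2(omega / 2), which vanishes only at the
    # isolated points 2*pi*Z \ {0}; the support is still all of R, so it is characteristic.
    # (np.sinc is the normalized sinc, hence the division by 2*pi.)
    "b1_spline": np.sinc(omega / (2.0 * np.pi)) ** 2,
}

for name, s in spectra.items():
    print(f"{name:10s} spectrum > 0 on {np.mean(s > 0.0):.0%} of the grid")
```

The B1-spline kernel here is itself compactly supported, in line with the preceding statement, and its spectrum has isolated zeros but full support.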
17 3, we show that a translation invariant kernel on Td is characteristic if and only if the Fourier series coefficients of the kernel are positive, that is, the support of the Fourier spectrum is the entire Zd . [sent-168, score-0.601]
18 , the universal kernels and the strictly pd kernels), and show how they relate in turn to characteristic kernels. [sent-176, score-0.77]
19 2 Dissimilar Distributions with Small γk. As we have seen, the characteristic property of a kernel is critical in distinguishing between distinct probability measures. [sent-180, score-0.363]
20 Suppose, however, that for a given characteristic kernel k and for any ε > 0, there exist P and Q, P = Q, such that γk (P, Q) < ε. [sent-181, score-0.363]
21 It can be shown that integrally strictly pd kernels are strictly pd (see Footnote 4). [sent-183, score-0.89]
22 Therefore, examples of kernels that are not integrally strictly pd include those kernels that are not strictly pd. [sent-184, score-1.006]
23 To this end, in Section 5, we show that universal kernels on compact (M, ρ) metrize the weak topology on P. [sent-213, score-0.519]
24 For the non-compact setting, we assume M = Rd and provide sufficient conditions on the kernel such that γk metrizes the weak topology on P. [sent-214, score-0.338]
25 A measurable and bounded kernel, k, is said to be integrally strictly positive definite if ∬_M k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures µ defined on M. [sent-255, score-0.447]
26 The above definition is a generalization of integrally strictly positive definite functions on Rd (Stewart, 1976, Section 6): ∬_{Rd} k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L2 (Rd ), which is the strict positive definiteness of the integral operator given by the kernel. [sent-256, score-0.466]
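As a hedged numerical reading of this definition: for a discrete signed measure µ = ∑_i wi δ_{xi}, the double integral reduces to the quadratic form ∑_{i,j} wi wj k(xi, xj). The sketch below, with kernels and points chosen purely for illustration, shows the Gaussian kernel giving a strictly positive value while the cosine kernel gives zero, so the latter fails to be integrally strictly pd.

```python
import numpy as np

def double_integral(k, points, weights):
    # For mu = sum_i w_i * delta_{x_i}, the integral of k(x, y) dmu(x) dmu(y)
    # is the quadratic form w^T K w with K_{ij} = k(x_i, x_j).
    K = np.array([[k(x, y) for y in points] for x in points])
    w = np.asarray(weights, dtype=float)
    return float(w @ K @ w)

sigma = 1.0
gaussian = lambda x, y: np.exp(-sigma * (x - y) ** 2 / 2.0)
cosine = lambda x, y: np.cos(sigma * (x - y))

# mu = delta_0 - delta_{2 pi / sigma}: a finite non-zero signed measure.
points, weights = [0.0, 2.0 * np.pi / sigma], [1.0, -1.0]

print("Gaussian kernel:", double_integral(gaussian, points, weights))  # strictly positive
print("Cosine kernel:  ", double_integral(cosine, points, weights))    # 0 up to rounding
```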
27 Note that the above definition is not equivalent to the definition of strictly pd kernels: if k is integrally strictly pd, then it is strictly pd, while the converse is not true. [sent-257, score-0.621]
28 Therefore, if k is integrally strictly pd, then it is strictly pd. [sent-268, score-0.384]
29 (ii) Consider (9) with k(x, y) = ψt (x − y), so that γk²(P, Q) = ∬_{Rd} ψt (x − y) p(x) p(y) dx dy + ∬_{Rd} ψt (x − y) q(x) q(y) dx dy − 2 ∬_{Rd} ψt (x − y) p(x) q(y) dx dy = ∫_{Rd} (ψt ∗ p)(x) p(x) dx + ∫_{Rd} (ψt ∗ q)(x) q(x) dx − 2 ∫_{Rd} (ψt ∗ q)(x) p(x) dx. [sent-371, score-0.41]
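A minimal sketch of estimating γk from samples is given below, assuming the standard biased (V-statistic) plug-in estimator with a Gaussian kernel; the function names, sample sizes and parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), an integrally strictly pd kernel on R^d.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # Plug-in estimate of gamma_k^2(P, Q) from X ~ P and Y ~ Q:
    # mean k(x, x') + mean k(y, y') - 2 mean k(x, y).
    kxx = gaussian_kernel(X, X, sigma)
    kyy = gaussian_kernel(Y, Y, sigma)
    kxy = gaussian_kernel(X, Y, sigma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # samples from P = N(0, 1)
Y = rng.normal(0.5, 1.0, size=(500, 1))   # samples from Q = N(0.5, 1)
Z = rng.normal(0.0, 1.0, size=(500, 1))   # a second sample from P

print("estimate of gamma_k^2, P vs Q:", mmd2_biased(X, Y))
print("estimate of gamma_k^2, P vs P:", mmd2_biased(X, Z))
```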
30 To this end, we start with the definition of characteristic kernels and provide some examples where k is such that γk is not a metric on P. [sent-434, score-0.589]
31 1, we provide the characterization that if k is integrally strictly pd, then γk is a metric on P. [sent-439, score-0.419]
32 Definition 6 (Characteristic kernel) A bounded measurable positive definite kernel k is characteristic to a set Q ⊂ P of probability measures defined on (M, A) if for P, Q ∈ Q, γk (P, Q) = 0 ⇔ P = Q. [sent-446, score-0.476]
33 k is simply said to be characteristic if it is characteristic to P. [sent-447, score-0.542]
34 Therefore, when M = Rd , the embedding of a distribution to a characteristic RKHS can be seen as a generalization of the characteristic function, φP = ∫_{Rd} e^{i⟨·,x⟩} dP(x). [sent-451, score-0.563]
35 This is because, by the uniqueness theorem for characteristic functions (Dudley, 2002, Theorem 9. [sent-452, score-0.344]
36 So, in this context, intuitively e^{i⟨y,x⟩} can be treated as the characteristic kernel, k, although, formally, this is not true as e^{i⟨y,x⟩} is not a pd kernel. [sent-455, score-0.387]
37 Before we get to the characterization of characteristic kernels, the following examples show that there exist bounded measurable kernels that are not characteristic. [sent-456, score-0.616]
38 , 2008, 2009a), the following result provides a more natural and easily understandable characterization for characteristic kernels, namely that integrally strictly pd kernels are characteristic to P. [sent-483, score-1.223]
39 Theorem 7 (Integrally strictly pd kernels are characteristic) Let k be an integrally strictly positive definite kernel on a topological space M. [sent-484, score-0.866]
40 We show that choosing k to be integrally strictly pd violates the conditions in Lemma 8, and k is therefore characteristic to P. [sent-487, score-0.693]
41 Proof (of Theorem 7) Since k is integrally strictly pd on M, we have ∬_M k(x, y) dη(x) dη(y) > 0 for any finite non-zero signed Borel measure η. [sent-503, score-0.436]
42 A translation variant integrally strictly pd kernel, k̃, can be obtained from a translation invariant integrally strictly pd kernel, k, as k̃(x, y) = f(x) k(x, y) f(y), where f : M → R is a bounded continuous function. [sent-507, score-1.148]
43 A simple example of a translation variant integrally strictly pd kernel on Rd is k̃(x, y) = exp(σ xᵀy), σ > 0, where we have chosen f(·) = exp(σ‖·‖₂²/2) and k(x, y) = exp(−σ‖x − y‖₂²/2), σ > 0. [sent-508, score-0.624]
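The identity behind this construction can be checked numerically; the sketch below uses arbitrary test points and an illustrative σ to verify that f(x) k(x, y) f(y), with the choices stated above, reduces to exp(σ xᵀy).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.7
x, y = rng.normal(size=3), rng.normal(size=3)   # arbitrary test points in R^3

f = lambda z: np.exp(sigma * np.dot(z, z) / 2.0)              # f(z) = exp(sigma ||z||^2 / 2)
k = lambda a, b: np.exp(-sigma * np.dot(a - b, a - b) / 2.0)  # Gaussian kernel

lhs = f(x) * k(x, y) * f(y)
rhs = np.exp(sigma * np.dot(x, y))
print(lhs, rhs, np.isclose(lhs, rhs))   # the two expressions agree
```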
44 Clearly, this kernel is characteristic on compact subsets of Rd . [sent-509, score-0.4]
45 The same result can also be obtained from the fact that k is universal on compact subsets of Rd (Steinwart, 2001, Section 3, Example 1), recalling that universal kernels are characteristic (Gretton et al. [sent-510, score-0.658]
46 In the following section, we assume M = Rd and k to be translation invariant and present a complete characterization for characteristic k which is simple to check. [sent-513, score-0.416]
47 The following theorem characterizes all translation invariant kernels in Rd that are characteristic. [sent-521, score-0.465]
48 An important point to be noted with the B2n+1 -spline kernel is that its Fourier transform ψ̂ has vanishing points at ω = 2πα, α ∈ Z\{0}, unlike the Gaussian and Laplacian kernels, which do not have vanishing points in their Fourier spectrum. [sent-534, score-0.352]
49 [Flattened rows from Table 2: translation invariant kernels on R defined by ψ, their spectra ψ̂ and the supports supp(ψ̂); the entries spilled here correspond to the Dirichlet, Fejér, Sinc (supp(ψ̂) = [−σ, σ]) and cosine (supp(ψ̂) = {−σ, σ}) kernel rows.] [sent-542, score-0.467]
50 By combining Theorem 7 with Theorem 9, it can be shown that the Sinc, Poisson, Dirichlet, Fejér and cosine kernels are not integrally strictly pd. [sent-558, score-0.583]
51 Therefore, for translation invariant kernels on Rd , the integral strict positive definiteness of the kernel (or the lack of it) can be tested using Theorems 7 and 9. [sent-559, score-0.484]
52 Of all the kernels shown in Table 2, only the Gaussian, Laplacian and B2n+1 -spline kernels are integrable, and their corresponding ψ̂ are computed using (4). [sent-560, score-0.492]
53 The whole family of compactly supported translation invariant continuous bounded kernels on Rd is characteristic, as shown by the following corollary to Theorem 9. [sent-568, score-0.501]
54 As a corollary to Theorem 9, the following result provides a method to construct new characteristic kernels from a given one. [sent-578, score-0.536]
55 Therefore, one can generate all sorts of kernels that are characteristic by starting with a characteristic kernel, k. [sent-593, score-0.76]
56 We showed in Theorem 9 that kernels with supp(Λ) ⊊ Rd are not characteristic to P. [sent-595, score-0.503]
57 Now, we can question whether such kernels can be characteristic to some proper subset Q of P. [sent-596, score-0.503]
58 On the other hand, the following result is of theoretical interest: along with Theorem 9, it completes the characterization of characteristic kernels that are translation invariant on Rd . [sent-599, score-0.662]
59 Although, by Theorem 9, the kernels with supp(Λ) ⊊ Rd are not characteristic to P, Theorem 12 shows that there exists a subset of P to which a subset of these kernels are characteristic. [sent-609, score-0.749]
60 So far, we have characterized the characteristic property of kernels that satisfy (a) supp(Λ) = Rd or (b) supp(Λ) ⊊ Rd with int(supp(Λ)) ≠ ∅. [sent-616, score-0.503]
61 In the following section, we investigate kernels that have supp(Λ) ⊊ Rd with int(supp(Λ)) = ∅, examples of which include periodic kernels on Rd . [sent-617, score-0.492]
62 We now state the result that defines characteristic kernels on Td . [sent-633, score-0.503]
63 Based on the above result, one can generate characteristic kernels by constructing an infinite sequence of positive numbers that are summable and then using them in (20). [sent-638, score-0.503]
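A small sketch of this recipe is given below, assuming the coefficients A(n) = r^{|n|} with 0 < r < 1, an illustrative choice whose sum is the Poisson kernel; the truncation level and the grid are arbitrary, and the closed form used for the check is standard.

```python
import numpy as np

# Build a translation invariant kernel on T from strictly positive, summable
# Fourier coefficients A(n) = r^{|n|}, 0 < r < 1, and compare the truncated sum
# with the closed-form Poisson kernel.
r, N = 0.5, 200
theta = np.linspace(-np.pi, np.pi, 9)

n = np.arange(-N, N + 1)
A = r ** np.abs(n)                                        # positive and summable
psi_series = (A[None, :] * np.cos(np.outer(theta, n))).sum(axis=1)

psi_poisson = (1 - r**2) / (1 - 2 * r * np.cos(theta) + r**2)   # closed form
print(np.allclose(psi_series, psi_poisson, atol=1e-10))         # True
```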
64 It can be seen from Table 2 that the Poisson kernel on T is characteristic while the Dirichlet, Fejér and cosine kernels are not. [sent-639, score-0.64]
65 Some examples of characteristic kernels on T are: (1) k(x, y) = e^{α cos(x−y)} cos(α sin(x − y)), 0 < α ≤ 1 ↔ Aψ (0) = 1, Aψ (n) = α^{|n|}/(2|n|!) [sent-640, score-0.503]
66 The following result relates characteristic kernels and universal kernels defined on Td . [sent-646, score-0.808]
67 Corollary 15 Let k be a characteristic kernel satisfying Assumption 2 with Aψ (0) > 0. [sent-647, score-0.363]
68 Since k being universal implies that it is characteristic, the above result shows that the converse is not true (though almost true except that Aψ (0) can be zero for characteristic kernels). [sent-651, score-0.345]
69 (2009b) shows that a bounded continuous translation invariant kernel on a locally compact Abelian group G is characteristic to the set of all probability measures on G if and only if the support of the Fourier transform of the translation invariant kernel is the dual group of G. [sent-656, score-0.902]
70 (2009b), these results are also extended to translation invariant kernels on non-Abelian compact groups and the semigroup Rd . [sent-659, score-0.415]
71 4 Overview of Relations Between Families of Kernels So far, we have presented various characterizations of characteristic kernels, which are easily checkable compared with characterizations proposed in the earlier literature (Gretton et al. [sent-670, score-0.345]
72 We now provide an overview of various useful conditions one can impose on kernels (to be universal, strictly pd, integrally strictly pd, or characteristic), and the implications that relate some of these conditions. [sent-673, score-0.63]
73 Integrally strictly pd kernels: It is clear from Theorem 7 that integrally strictly pd kernels on a topological space M are characteristic, whereas the converse remains undetermined. [sent-676, score-0.919]
74 Strictly pd kernels: The relation between integrally strictly pd and strictly pd kernels shown in Figure 1 is straightforward, as one direction follows from Footnote 4, while the other direction is not true, which follows from Steinwart and Christmann (2008, Proposition 4. [sent-681, score-1.02]
75 However, if M is a finite set, then k being strictly pd also implies it is integrally strictly pd. [sent-684, score-0.514]
76 Strictly pd kernels: Since integrally strictly pd kernels are characteristic and are also strictly pd, a natural question to ask is, “What is the relation between characteristic and strictly pd kernels? [sent-686, score-1.612]
77 It can be seen that strictly pd kernels need not be characteristic because the sinc-squared kernel, k(x, y) = sin²(σ(x−y))/(x−y)² on R, which has supp(Λ) = [−σ, σ] ⊊ R, is strictly pd (Wendland, 2005, Theorem 6. [sent-687, score-0.989]
78 However, for any general M, it is not clear whether k being characteristic implies that it is strictly pd. [sent-689, score-0.335]
79 As a special case, if M = Rd or M = Td , then by Theorems 9 and 12, it follows that a translation invariant k being characteristic also implies that it is strictly pd. [sent-690, score-0.467]
80 (2006, Proposition 15) have provided a characterization of universality for translation invariant kernels on Rd : k is universal if λ(supp(Λ)) > 0, where λ is the Lebesgue measure and Λ is defined as in (11). [sent-704, score-0.502]
81 , sincsquared kernel is not characteristic as supp(Λ) = [−σ, σ] R but universal in the sense of Micchelli as λ(supp(Λ)) = 2σ > 0). [sent-708, score-0.422]
82 On the other hand, if a kernel is strictly pd, then it need not be universal, which follows from the results due to Dahmen and Micchelli (1987) and Pinkus (2004) for Taylor kernels (Steinwart and Christmann, 2008, Lemma 4. [sent-720, score-0.43]
83 (2010a,b) carried out a thorough study relating characteristic kernels to various notions of universality, addressing some open questions mentioned in the above discussion and Figure 1. [sent-727, score-0.503]
84 This is done by relating universality to the injective embedding of regular Borel measures into an RKHS, which can therefore be seen as a generalization of the notion of characteristic kernels, as the latter deal with the injective RKHS embedding of probability measures. [sent-728, score-0.476]
85 Since φQ ∈ L1 , by the inversion theorem for characteristic functions (Dudley, 2002, Theorem 9. [sent-740, score-0.344]
86 However, in this section, we show that characteristic kernels, while guaranteeing γk to be a metric on P, may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples. [sent-913, score-0.343]
87 The characteristic function of P is given as φP (ω) = φQ (ω) − (iα/2)[φQ (ω − νπ) − φQ (ω + νπ)], ω ∈ R, where φQ is the characteristic function associated with Q. [sent-924, score-0.514]
88 This means that, although the B1 -spline kernel is characteristic to P, in practice, it becomes harder to distinguish between P and Q with finite samples, when P is constructed as in (22) with ν = ω0 . [sent-994, score-0.363]
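A hedged simulation sketch of this effect (not the paper's experiment): samples are drawn from Q = N(0, 1) and from a perturbed density p(x) = q(x)(1 + α sin(νπx)) by rejection sampling, and the plug-in estimate of γk² with a Gaussian kernel is compared for a low and a high perturbation frequency; all names and parameter values are illustrative, and the sign convention may differ from (22).

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd2(X, Y, sigma=1.0):
    # Biased plug-in estimate of gamma_k^2 with a Gaussian kernel, as sketched earlier.
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

def sample_perturbed(n, alpha, nu):
    # Rejection sampling from p(x) = q(x) * (1 + alpha * sin(nu * pi * x)), q = N(0, 1) density.
    out = np.empty(0)
    while out.size < n:
        x = rng.normal(size=n)
        u = rng.uniform(0.0, 1.0 + alpha, size=n)
        out = np.concatenate([out, x[u < 1.0 + alpha * np.sin(nu * np.pi * x)]])
    return out[:n, None]

n, alpha = 5000, 0.9
Y = rng.normal(size=(n, 1))                  # samples from Q
for nu in (0.5, 8.0):                        # low vs. high perturbation frequency
    X = sample_perturbed(n, alpha, nu)       # samples from the perturbed P
    print(f"nu = {nu}: estimated gamma_k^2 = {mmd2(X, Y):.5f}")
```

The high-frequency perturbation typically yields a much smaller estimate, in line with the discussion above.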
89 Metrization of the Weak Topology So far, we have shown that a characteristic kernel k induces a metric γk on P. [sent-1038, score-0.449]
90 A metric γ on P is said to metrize the weak topology if the topology induced by γ coincides with the weak topology, which is defined as follows: if, for P, P1 , P2 , . [sent-1051, score-0.443]
91 Since the Dudley metric is related to the Prohorov metric as (1/2) β(P, Q) ≤ ς(P, Q) ≤ 2 √β(P, Q), (28) it also metrizes the weak topology on P (Dudley, 2002, Theorem 11. [sent-1060, score-0.404]
92 The first result states that when (M, ρ) is compact, γk induced by universal kernels metrizes the weak topology. [sent-1130, score-0.443]
93 The entire Matérn class of kernels in (18) satisfies the conditions of Theorem 24 and, therefore, the corresponding γk metrizes the weak topology on P. [sent-1145, score-0.478]
94 We now discuss two topics related to γk , concerning the choice of kernel parameter and kernels defined on P. [sent-1216, score-0.352]
95 {kσ : σ ∈ R+ } is the family of Gaussian kernels and {γkσ : σ ∈ R+ } is the associated family of distance measures indexed by the kernel parameter, σ. [sent-1219, score-0.413]
96 Note that kσ is characteristic for any σ ∈ R++ and, therefore, γkσ is a metric on P for any σ ∈ R++ . [sent-1220, score-0.343]
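To see how the resulting distance depends on σ, the sketch below evaluates γ_{kσ}² in closed form between two one-dimensional Gaussians for several bandwidths; the closed-form Gaussian expectation used here is standard, and the distributions and the σ grid are illustrative choices, not from the paper.

```python
import numpy as np

def expected_k(m, v, sigma):
    # E[exp(-Z^2 / (2 sigma^2))] for Z ~ N(m, v): sigma / sqrt(sigma^2 + v) * exp(-m^2 / (2 (sigma^2 + v))).
    return sigma / np.sqrt(sigma**2 + v) * np.exp(-m**2 / (2.0 * (sigma**2 + v)))

def gamma2(m1, m2, s, sigma):
    # gamma_{k_sigma}^2 between P = N(m1, s^2) and Q = N(m2, s^2) under the Gaussian kernel.
    v = 2.0 * s**2   # variance of the difference of two independent draws
    return 2.0 * expected_k(0.0, v, sigma) - 2.0 * expected_k(m1 - m2, v, sigma)

for sigma in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"sigma = {sigma:4.1f}: gamma^2 = {gamma2(0.0, 0.5, 1.0, sigma):.6f}")
```

Every value is strictly positive, as it must be for a characteristic kernel, but its magnitude varies strongly with σ, which is exactly the kernel-choice issue raised here.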
97 Klin := {kλ = ∑_{j=1}^l λj kj | kλ is pd, ∑_{j=1}^l λj = 1}, which is the linear combination of pd kernels {kj}_{j=1}^l . [sent-1242, score-0.376]
98 Kcon := {kλ = ∑_{j=1}^l λj kj | λj ≥ 0, ∑_{j=1}^l λj = 1}, which is the convex combination of pd kernels {kj}_{j=1}^l . [sent-1244, score-0.376]
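A quick numerical check, with arbitrary illustrative weights, points and base kernels, that such convex combinations remain pd: the combined Gram matrix has no negative eigenvalues beyond rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

k_gauss, k_lap = np.exp(-d**2 / 2.0), np.exp(-d)   # two pd base kernels
lam = (0.3, 0.7)                                   # lambda_j >= 0, summing to 1
K = lam[0] * k_gauss + lam[1] * k_lap

print("min eigenvalue of the combined Gram matrix:", np.linalg.eigvalsh(K).min())
```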
99 Further work on Hilbertian metrics and positive definite kernels on probability measures has been carried out by Hein and Bousquet (2005) and Fuglede and Topsøe (2003). [sent-1262, score-0.346]
100 On the relation between universality, characteristic kernels and RKHS embedding of measures. [sent-1689, score-0.552]
wordName wordTfidf (topN-words)
[('rd', 0.403), ('supp', 0.285), ('characteristic', 0.257), ('kernels', 0.246), ('integrally', 0.228), ('zz', 0.171), ('dp', 0.156), ('gretton', 0.143), ('haracteristic', 0.142), ('cb', 0.14), ('anckriet', 0.135), ('ilbert', 0.135), ('mbedding', 0.135), ('olkopf', 0.135), ('riperumbudur', 0.135), ('ukumizu', 0.135), ('pd', 0.13), ('fourier', 0.128), ('retton', 0.116), ('tv', 0.114), ('kernel', 0.106), ('td', 0.1), ('topology', 0.094), ('dudley', 0.09), ('fukumizu', 0.09), ('ernels', 0.09), ('sriperumbudur', 0.089), ('borel', 0.088), ('theorem', 0.087), ('metric', 0.086), ('sd', 0.085), ('translation', 0.082), ('dx', 0.082), ('pace', 0.08), ('metrizes', 0.08), ('strictly', 0.078), ('dq', 0.077), ('ch', 0.073), ('metrics', 0.073), ('pn', 0.071), ('sin', 0.07), ('dd', 0.069), ('hilbertian', 0.062), ('lebesgue', 0.061), ('zd', 0.061), ('rkhs', 0.059), ('universal', 0.059), ('weak', 0.058), ('steinwart', 0.057), ('folland', 0.055), ('measurable', 0.051), ('invariant', 0.05), ('pseudometric', 0.049), ('embedding', 0.049), ('characterizations', 0.044), ('transform', 0.043), ('shorack', 0.043), ('wasserstein', 0.043), ('sinc', 0.042), ('tp', 0.039), ('chapter', 0.039), ('rudin', 0.038), ('universality', 0.038), ('compact', 0.037), ('bounded', 0.035), ('lemma', 0.035), ('distance', 0.034), ('lj', 0.033), ('corollary', 0.033), ('laplacian', 0.032), ('christmann', 0.032), ('sup', 0.031), ('pk', 0.031), ('eix', 0.031), ('footnote', 0.031), ('holomorphic', 0.031), ('jer', 0.031), ('homogeneity', 0.031), ('iv', 0.029), ('converse', 0.029), ('compactly', 0.028), ('injective', 0.028), ('bochner', 0.028), ('tl', 0.028), ('said', 0.028), ('characterization', 0.027), ('jl', 0.027), ('continuous', 0.027), ('measures', 0.027), ('convolution', 0.026), ('berlinet', 0.026), ('hilbert', 0.026), ('ap', 0.026), ('embeddings', 0.026), ('metrize', 0.025), ('multiquadratic', 0.025), ('prohorov', 0.025), ('micchelli', 0.024), ('suppose', 0.024), ('supx', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
2 0.19742571 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.13974661 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
4 0.12044659 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n⁶) to O(n³). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn³) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n⁴) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n²) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
5 0.099723503 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
6 0.077553734 82 jmlr-2010-On Learning with Integral Operators
7 0.075403355 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
8 0.059087697 84 jmlr-2010-On Spectral Learning
9 0.057810325 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
10 0.057200044 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
11 0.048953597 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
12 0.043001719 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
13 0.041408699 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.039239306 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
15 0.038814645 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
16 0.037906744 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
17 0.036152434 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.035896085 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.035786532 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
20 0.034268096 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
topicId topicWeight
[(0, -0.19), (1, -0.06), (2, 0.041), (3, 0.1), (4, -0.12), (5, 0.013), (6, -0.412), (7, -0.098), (8, 0.001), (9, -0.055), (10, -0.069), (11, -0.069), (12, -0.048), (13, -0.17), (14, 0.005), (15, 0.046), (16, -0.039), (17, -0.066), (18, 0.062), (19, 0.046), (20, -0.147), (21, 0.054), (22, -0.17), (23, 0.0), (24, -0.004), (25, 0.228), (26, -0.143), (27, -0.048), (28, 0.0), (29, -0.03), (30, 0.169), (31, 0.006), (32, -0.039), (33, -0.048), (34, 0.081), (35, -0.014), (36, -0.024), (37, 0.003), (38, 0.021), (39, -0.026), (40, 0.106), (41, -0.039), (42, -0.021), (43, 0.032), (44, -0.108), (45, -0.032), (46, 0.047), (47, -0.09), (48, 0.013), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.97923803 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
2 0.73325026 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.53439224 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
4 0.45262095 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
5 0.43192944 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
Author: Gunnar Carlsson, Facundo Mémoli
Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance
6 0.41476905 44 jmlr-2010-Graph Kernels
7 0.40993577 82 jmlr-2010-On Learning with Integral Operators
8 0.33670786 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
9 0.28909647 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
10 0.24964052 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
11 0.24812265 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
12 0.24360193 84 jmlr-2010-On Spectral Learning
13 0.2248069 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
14 0.22077717 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
15 0.1866076 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
16 0.17649226 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
17 0.17452839 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
18 0.16220047 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
19 0.16104618 72 jmlr-2010-Matrix Completion from Noisy Entries
20 0.15641487 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
topicId topicWeight
[(3, 0.556), (8, 0.015), (15, 0.017), (32, 0.031), (36, 0.019), (37, 0.046), (75, 0.136), (81, 0.017), (83, 0.011), (85, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.88785386 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.
2 0.79842681 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
3 0.48499632 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
4 0.46560308 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel
Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL
5 0.46461934 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization
6 0.46392533 84 jmlr-2010-On Spectral Learning
7 0.46318975 82 jmlr-2010-On Learning with Integral Operators
8 0.45274633 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
9 0.44564947 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.43389222 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
11 0.41564655 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
12 0.40866184 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.3990505 60 jmlr-2010-Learnability, Stability and Uniform Convergence
14 0.39576331 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
15 0.39248076 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
16 0.38864681 25 jmlr-2010-Composite Binary Losses
17 0.38477618 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
18 0.36745572 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.36530605 102 jmlr-2010-Semi-Supervised Novelty Detection
20 0.36454275 69 jmlr-2010-Lp-Nested Symmetric Distributions