nips nips2009 nips2009-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. [sent-13, score-0.176]
2 In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. [sent-14, score-0.111]
3 An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. [sent-15, score-0.116]
4 First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. [sent-17, score-0.374]
5 An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. [sent-18, score-0.599]
6 Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. [sent-19, score-1.54]
7 Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). [sent-20, score-0.701]
8 This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. [sent-21, score-0.725]
9 This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. [sent-22, score-0.467]
10 The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. [sent-23, score-0.093]
11 1 Introduction Kernel methods are broadly established as a useful way of constructing nonlinear algorithms from linear ones, by embedding points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) [9]. [sent-24, score-0.23]
12 More specifically, suppose we are given the set P of all Borel probability measures defined on the topological space M , and the RKHS (H, k) of functions on M with k as its reproducing kernel (r.k.). [sent-26, score-0.245]
13 If k is measurable and bounded, then we may define the embedding of P in H as P_k := ∫_M k(·, x) dP(x) ∈ H, and the maximum mean discrepancy (MMD) between P and Q as γ_k(P, Q) := ‖P_k − Q_k‖_H. (1) [sent-31, score-0.104]
14 We say that k is characteristic [4, 14] if the mapping P → P_k is injective, in which case (1) is zero if and only if P = Q, i.e., γ_k is a metric on P. [sent-33, score-0.379]
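For illustration, here is a minimal sketch (not taken from the paper; the Gaussian kernel, the biased plug-in estimator, and the NumPy details are assumptions of this sketch) of how γ_k in (1) can be estimated by replacing P_k and Q_k with the empirical mean embeddings of samples X ~ P and Y ~ Q. Later sketches in this section reuse gaussian_kernel and mmd2.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix with entries exp(-sigma * ||a - b||^2) for rows a of A and rows b of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma * np.maximum(sq, 0.0))

def mmd2(X, Y, sigma=1.0):
    # Biased plug-in estimate of gamma_k(P, Q)^2 = ||P_k - Q_k||_H^2.
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())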
15 In this application domain, the question of whether k is characteristic is key: without this property, the algorithms can fail through inability to distinguish between particular distributions. [sent-37, score-0.412]
16 Characteristic kernels are important in binary classification: the problem of distinguishing distributions is strongly related to binary classification; indeed, one would expect easily distinguishable distributions to be easily classifiable. [sent-38, score-0.447]
17 The importance of using characteristic RKHSs is further underlined by this link: if the property does not hold, then there exist distributions that are unclassifiable in the RKHS H. [sent-41, score-0.423]
18 We further strengthen this by showing that characteristic kernels are necessary (and sufficient under certain conditions) to achieve Bayes risk in the kernel-based classification algorithms. [sent-42, score-0.775]
19 Characterization of characteristic kernels: Given the centrality of the characteristic property to both RKHS classification and RKHS distribution testing, we should take particular care in establishing which kernels satisfy this requirement. [sent-43, score-1.065]
20 In the case of translation invariant kernels, [14] proved the kernel to be characteristic if and only if the support of the Fourier transform of k is the entire Rd , which is a much easier condition to verify. [sent-46, score-0.631]
21 Similar sufficient conditions are obtained by [5] for translation invariant kernels on groups and semi-groups. [sent-47, score-0.411]
22 In Section 3, we expand the class of characteristic kernels to include kernels that may or may not be translation invariant, with the introduction of a novel criterion: strictly positive definite kernels (see Definition 3) on M are characteristic. [sent-48, score-1.511]
23 Choice of characteristic kernels: In expanding the families of allowable characteristic kernels, we have so far neglected the question of which characteristic kernel to choose. [sent-49, score-1.374]
24 A practitioner asking by how much two samples differ does not want to receive a blizzard of answers for every conceivable kernel and bandwidth setting, but a single measure that satisfies some “reasonable” notion of distance across the family of kernels considered. [sent-50, score-0.613]
25 Thus, in Section 4, we propose a generalization of the MMD, yielding a new distance measure between P and Q defined as γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖P_k − Q_k‖_H : k ∈ K}, (2) which is the maximal RKHS distance between P and Q over a family K of positive definite kernels. [sent-51, score-0.136]
26 For example, K can be the family of Gaussian kernels on Rd indexed by the bandwidth parameter. [sent-52, score-0.398]
27 This distance measure is very natural in the light of our results on binary classification (in Section 2): most directly, this corresponds to the problem of learning the kernel by minimizing the risk of the associated Parzen-based classifier. [sent-53, score-0.33]
28 Specifically, we show that when two distributions differ on particular length scales, the kernel selected by the generalized MMD is appropriate to this difference, and the resulting hypothesis test outperforms the heuristic kernel choice employed in earlier studies [6]. [sent-61, score-0.46]
29 2 Characteristic Kernels and Binary Classification One of the most important applications of the maximum mean discrepancy is in nonparametric hypothesis testing [6, 7, 4], where the characteristic property of k is required to distinguish between probability measures. [sent-63, score-0.468]
30 This motivates the need for characteristic k to guarantee that classes arising from different distributions can be classified by kernel-based algorithms. [sent-65, score-0.423]
31 Now, we present the result that relates γk to the optimal risk associated with the Parzen window classifier. [sent-70, score-0.146]
32 , the risk is maximum (note that since 0 ≤ γ_k(P, Q) = −R^L_{F_k}, the maximum risk is zero). [sent-82, score-0.178]
33 In other words, if k is characteristic, then the maximum risk is obtained only when P = Q. [sent-83, score-0.089]
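A minimal sketch of a Parzen-window-style rule consistent with this discussion (the exact loss L and function class F_k of Theorem 1 are not reproduced in this extraction, so the decision rule below is an assumption of the sketch; gaussian_kernel is the helper from the earlier sketch):

import numpy as np

def parzen_window_classifier(X, Y, sigma=1.0):
    # X carries label +1 (sample from P), Y carries label -1 (sample from Q).
    # The rule thresholds the witness <P_m k - Q_n k, k(., t)> at zero.
    def predict(T):
        witness = (gaussian_kernel(T, X, sigma).mean(axis=1)
                   - gaussian_kernel(T, Y, sigma).mean(axis=1))
        return np.sign(witness)
    return predict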
34 This motivates the importance of characteristic kernels in binary classification. [sent-84, score-0.712]
35 In the following, we provide another result which provides a similar motivation for the importance of characteristic kernels in binary classification, wherein we relate γk to the margin of a hard-margin SVM. [sent-85, score-0.754]
36 Assuming the training sample is separable, let f_svm be the solution to the program, inf{‖f‖_H : Y_i f(X_i) ≥ 1, ∀ i}, where H is an RKHS with measurable and bounded k. [sent-91, score-0.152]
37 , a less smooth classifier, f_svm, where smoothness is measured as ‖f_svm‖_H). [sent-97, score-0.152]
38 Suppose k is not characteristic; then γ_k(P_m, Q_n) can be zero for P_m ≠ Q_n and therefore the margin is zero, which means even unlike distributions can become inseparable in this feature representation. [sent-99, score-0.086]
39 Another justification of using characteristic kernels in kernel-based classification algorithms can be provided by studying the conditions on H for which the Bayes risk is realized for all µ. [sent-100, score-0.803]
40 37] have shown that under certain conditions on L, the Bayes risk is achieved for all µ if and only if H is dense in Lp(M, η) for all η, where η = εP + (1 − ε)Q. [sent-102, score-0.117]
41 Denseness of H in Lp(M, η) implies that H + R is dense in Lp(M, η), which therefore yields that k is characteristic [4, 5]. [sent-104, score-0.423]
42 On the other hand, if constant functions are included in H, then it is easy to show that the characteristic property of k is also sufficient to achieve the Bayes risk. [sent-105, score-0.408]
43 As an example, it can be shown that characteristic kernels are necessary (and sufficient if constant functions are in H) for SVMs to achieve the Bayes risk [16, Example 5. [sent-106, score-0.775]
44 Therefore, the characteristic property of k is fundamental in kernel-based classification algorithms. [sent-108, score-0.379]
45 Having shown how characteristic kernels play a role in kernel-based classification, in the following section we provide a novel characterization for them. [sent-109, score-0.725]
46 3 Novel Characterization for Characteristic Kernels A positive definite (pd) kernel, k is said to be characteristic to P if and only if γk (P, Q) = 0 ⇔ P = Q, ∀ P, Q ∈ P. [sent-110, score-0.411]
47 The following result provides a novel characterization for characteristic kernels, which shows that strictly pd kernels are characteristic to P. [sent-111, score-1.41]
48 A measurable and bounded kernel, k, is said to be strictly positive definite if and only if ∫_M ∫_M k(x, y) dµ(x) dµ(y) > 0 for all finite non-zero signed Borel measures µ defined on M. [sent-116, score-0.247]
49 Note that the above definition is not equivalent to the usual definition of strictly pd kernels that involves finite sums [16, Definition 4. [sent-117, score-0.613]
50 The above definition is a generalization of integrally strictly positive definite functions [17, Section 6]: ∫∫ k(x, y) f(x) f(y) dx dy > 0 for all f ∈ L2(Rd), which is the strict positive definiteness of the integral operator given by the kernel. [sent-119, score-0.368]
51 62] shows a kernel that is strictly pd in the finite sum sense but not in the integral sense. [sent-121, score-0.482]
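A small numerical sanity check for the finite-sum sense of strict positive definiteness (this is only a heuristic check on sampled points, not a proof, and the tolerance is an arbitrary assumption of this sketch):

import numpy as np

def looks_strictly_pd_finite_sum(kernel, points, tol=1e-10):
    # For distinct points, a kernel that is strictly pd in the finite-sum sense yields a
    # Gram matrix whose smallest eigenvalue is strictly positive.
    K = kernel(points, points)
    return bool(np.linalg.eigvalsh(K).min() > tol)

For instance, looks_strictly_pd_finite_sum(lambda A, B: gaussian_kernel(A, B, 1.0), np.random.randn(20, 2)) should return True; the integral sense of Definition 3 cannot be certified by such a finite check.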
52 If k is strictly positive definite on M , then k is characteristic to P. [sent-123, score-0.55]
53 The proof idea is to derive necessary and sufficient conditions for a kernel not to be characteristic. [sent-124, score-0.204]
54 We show that choosing k to be strictly pd violates these conditions and k is therefore characteristic to P. [sent-125, score-0.713]
55 Examples of strictly pd kernels on Rd include exp(−σ‖x − y‖_2^2), σ > 0, exp(−σ‖x − y‖_1), σ > 0, (c^2 + ‖x − y‖_2^2)^(−β), β > 0, c > 0, B_{2l+1}-splines, etc. [sent-126, score-0.613]
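Plain NumPy versions of these examples, for concreteness (the parameter names are chosen here for illustration):

import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-sigma * np.sum((x - y) ** 2))        # exp(-sigma * ||x - y||_2^2)

def laplacian(x, y, sigma=1.0):
    return np.exp(-sigma * np.sum(np.abs(x - y)))       # exp(-sigma * ||x - y||_1)

def inverse_multiquadric(x, y, c=1.0, beta=1.0):
    return (c ** 2 + np.sum((x - y) ** 2)) ** (-beta)   # (c^2 + ||x - y||_2^2)^(-beta)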
56 Note that k̃(x, y) = f(x) k(x, y) f(y) is a strictly pd kernel if k is strictly pd, where f : M → R is a bounded continuous function. [sent-127, score-0.621]
57 Therefore, translation-variant strictly pd kernels can be obtained by choosing k to be a translation invariant strictly pd kernel. [sent-128, score-0.995]
58 A simple example of a translation-variant kernel that is a strictly pd kernel on compact sets of Rd is k̃(x, y) = exp(σ x^T y), σ > 0, where we have chosen f(·) = exp(σ‖·‖_2^2 / 2) and the translation-invariant strictly pd kernel k(x, y) = exp(−σ‖x − y‖_2^2 / 2). [sent-129, score-0.658]
59 Therefore, k̃ is characteristic on compact sets of Rd, which is the same result that follows from the universality of k̃ [15, Section 3, Example 1]. [sent-132, score-0.379]
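The factorization used above can be checked numerically; the specific choices of f and k below follow the reconstruction of the truncated sentence and are therefore an assumption of this sketch, though the identity they satisfy is exact:

import numpy as np

sigma = 0.7
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
f = lambda z: np.exp(sigma * np.dot(z, z) / 2.0)               # f(z) = exp(sigma * ||z||^2 / 2)
k = np.exp(-sigma * np.dot(x - y, x - y) / 2.0)                # k(x, y) = exp(-sigma * ||x - y||^2 / 2)
assert np.isclose(f(x) * k * f(y), np.exp(sigma * np.dot(x, y)))   # equals k~(x, y) = exp(sigma * x^T y)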
60 The following result in [10], which is based on the usual definition of strictly pd kernels, can be obtained as a corollary to Theorem 4. [sent-133, score-0.349]
61 Let X = {x_i}_{i=1}^m ⊂ M, Y = {y_j}_{j=1}^n ⊂ M, and assume that x_i ≠ x_j and y_i ≠ y_j for all i ≠ j. [sent-135, score-0.192]
62 Suppose k is strictly pd (more generally, suppose k is characteristic). [sent-150, score-0.344]
63 4 Generalizing the MMD for Classes of Characteristic Kernels The discussion so far has been related to the characteristic property of k that makes γk a metric on P. [sent-154, score-0.379]
64 We have seen that this characteristic property is of prime importance both in distribution testing, and to ensure classifiability of dissimilar distributions in the RKHS. [sent-155, score-0.423]
65 We have not yet addressed how to choose among a selection/family of characteristic kernels, given a particular pair of distributions we wish to discriminate between. [sent-156, score-0.423]
66 {kσ : σ ∈ R+ } is the family of Gaussian kernels and {γkσ : σ ∈ R+ } is the family of MMDs indexed by the kernel parameter, σ. [sent-159, score-0.561]
67 Note that kσ is characteristic for any σ ∈ R++ and therefore γkσ is a metric on P for any σ ∈ R++ . [sent-160, score-0.379]
68 To generalize the MMD to families of kernels, we propose the following modification to γ_k, which yields a pseudometric on P, γ(P, Q) = sup{γ_k(P, Q) : k ∈ K} = sup{‖P_k − Q_k‖_H : k ∈ K}. [sent-170, score-0.099]
69 Since the Parzen window classifier depends on the kernel, k, one can propose to learn the kernel as in support vector machines [8], wherein the kernel is chosen such that R^L_{F_k} in Theorem 1 is minimized over k ∈ K, i. [sent-175, score-0.409]
70 A similar motivation for γ can be provided based on (5) as learning the kernel in a hard-margin SVM by maximizing its margin. [sent-178, score-0.176]
71 We say a translation-invariant kernel, k, on Rd is normalized if ∫_M ψ(y) dy = c (some positive constant independent of the kernel parameter), where k(x, y) = ψ(x − y). [sent-181, score-0.208]
72 K is a normalized kernel family if every kernel in K is normalized. [sent-182, score-0.391]
73 For example, it is easy to see that Kg and Kl are unnormalized kernel families. [sent-184, score-0.234]
74 Therefore, the generalized MMD reduces to a single kernel MMD. [sent-187, score-0.206]
75 A similar result also holds for the normalized inverse-quadratic kernel family, {√(2σ^2/π) (σ^2 + (x − y)^2)^(−1), x, y ∈ R : σ ∈ [σ_0, ∞)}. [sent-188, score-0.176]
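A quick numerical illustration of the normalization condition (the integration grid and the one-dimensional restriction are assumptions of this sketch): the normalized inverse-quadratic family above has an integral that is approximately constant in σ, while an unnormalized Gaussian family does not.

import numpy as np

t = np.linspace(-1000.0, 1000.0, 2_000_001)   # crude Riemann grid on R
dt = t[1] - t[0]
for sigma in (0.5, 1.0, 2.0):
    psi_normalized = np.sqrt(2.0 * sigma ** 2 / np.pi) / (sigma ** 2 + t ** 2)
    psi_unnormalized = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    print(sigma, psi_normalized.sum() * dt, psi_unnormalized.sum() * dt)
# The first column of integrals stays near sqrt(2*pi) for every sigma; the second grows with sigma,
# which is why only unnormalized families give a useful generalized MMD.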
76 These examples show that the generalized MMD definition is usually not very useful if K is a normalized kernel family. [sent-189, score-0.206]
77 In addition, σ0 should be chosen beforehand, which is equivalent to heuristically setting the kernel parameter in γk . [sent-190, score-0.176]
78 Note that σ0 cannot be zero because in the limiting case of σ → 0, the kernels approach a Dirac distribution, which means the limiting kernel is not bounded and therefore the definition of MMD in (1) does not hold. [sent-191, score-0.483]
79 So, in this work, we consider unnormalized kernel families to render the definition of generalized MMD in (6) useful. [sent-192, score-0.296]
80 Now, the question reduces to which of the kernel classes, K have V (K) < ∞. [sent-224, score-0.176]
81 Examples of kernels on Rd that are covered by these classes include the Gaussian, Laplacian, inverse multiquadratics, Matérn class, etc. [sent-228, score-0.307]
82 Other choices for K that are popular in machine learning are the linear combination of kernels, K_lin := {k_λ = Σ_{i=1}^l λ_i k_i | k_λ is pd, Σ_{i=1}^l λ_i = 1} and K_con := {k_λ = Σ_{i=1}^l λ_i k_i | λ_i ≥ 0, Σ_{i=1}^l λ_i = 1}. [sent-229, score-0.152]
83 Therefore, instead of using a class based on a fixed, parameterized kernel, one can also use a finite linear combination of kernels to compute γ. [sent-231, score-0.307]
(11) γ^2(P_m, Q_n) = sup_{σ∈R_+} [ (1/m^2) Σ_{i,j=1}^m k_σ(X_i, X_j) + (1/n^2) Σ_{i,j=1}^n k_σ(Y_i, Y_j) − (2/mn) Σ_{i,j=1}^{m,n} k_σ(X_i, Y_j) ]. The optimum σ* can be obtained by solving (11), and γ(P_m, Q_n) = ‖P_m k_{σ*} − Q_n k_{σ*}‖_H. Example 12. [sent-238, score-0.086]
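A minimal sketch of (11) with the supremum over σ approximated by a finite grid (the grid itself is an assumption of this sketch; mmd2 is the helper from the earlier sketch):

import numpy as np

def generalized_mmd_gaussian(X, Y, sigmas):
    # Evaluate the squared MMD for each candidate sigma and keep the largest value.
    vals = np.array([mmd2(X, Y, s) for s in sigmas])
    best = int(np.argmax(vals))
    return float(np.sqrt(max(vals[best], 0.0))), float(sigmas[best])

For example, generalized_mmd_gaussian(X, Y, 2.0 ** np.arange(-3, 7)) returns both γ(P_m, Q_n) and the selected σ*.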
85 Then, γ(P_m, Q_n) becomes γ^2(P_m, Q_n) = sup_{k∈K_con} ‖P_m k − Q_n k‖_H^2 = sup{λ^T a : λ^T 1 = 1, λ ⪰ 0} = max_{1≤i≤l} (a)_i, where we have replaced k by k_λ = Σ_{i=1}^l λ_i k_i with λ = (λ_1, . . . [sent-240, score-0.19]
, λ_l) and (a)_i = ‖P_m k_i − Q_n k_i‖_{H_i}^2 = (1/m^2) Σ_{a,b=1}^m k_i(X_a, X_b) + (1/n^2) Σ_{a,b=1}^n k_i(Y_a, Y_b) − (2/mn) Σ_{a,b=1}^{m,n} k_i(X_a, Y_b). [sent-246, score-0.257]
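For K_con the supremum over the simplex is attained at a vertex, so the computation reduces to a maximum over the base kernels; a minimal sketch (the kernel callables are assumed to return Gram matrices, as the earlier gaussian_kernel helper does):

import numpy as np

def generalized_mmd_convex_hull(X, Y, base_kernels):
    # (a)_i is the squared single-kernel MMD for base kernel k_i; gamma^2 is their maximum.
    a = [k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean() for k in base_kernels]
    return float(np.sqrt(max(max(a), 0.0)))

Here base_kernels could be, e.g., [lambda A, B, s=s: gaussian_kernel(A, B, s) for s in (0.1, 1.0, 10.0)].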
87 Note that γk0 (single kernel MMD based on k0 ) can be obtained by defining λ(k) = δ(k − k0 ) in β(P, Q). [sent-255, score-0.176]
88 5 Experiments In this section, we present a benchmark experiment illustrating that the generalized MMD proposed in Section 4 is preferable to the single-kernel MMD in which the kernel parameter is set heuristically. [sent-256, score-0.382]
89 Therefore, using this setup we can verify whether the adaptive bandwidth selection achieved by γ (as the test statistic) helps to distinguish between P and Q at higher ν compared to γk with a heuristic σ. [sent-269, score-0.119]
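The paper's test uses its own threshold construction, which is not reproduced in this extraction; as an illustrative stand-in, a permutation test with γ as the statistic might look as follows (the permutation threshold is an assumption of this sketch, not the paper's procedure):

import numpy as np

def permutation_two_sample_test(X, Y, statistic, n_perm=200, alpha=0.05, seed=0):
    # Reject H0: P = Q if the observed statistic exceeds the (1 - alpha) quantile of its
    # permutation null distribution, estimated by re-splitting the pooled sample.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    observed = statistic(X, Y)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(Z))
        null[b] = statistic(Z[idx[:m]], Z[idx[m:]])
    return observed > np.quantile(null, 1.0 - alpha)

Here statistic can be, e.g., lambda A, B: generalized_mmd_gaussian(A, B, 2.0 ** np.arange(-3, 7))[0] from the sketch above.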
90 14 different values for σ are considered on a logarithmic scale of base 2 with exponents (−3, −2, −1, 0, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6), along with the median distance between samples as one more choice. [sent-281, score-0.096]
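For reference, the median heuristic mentioned above is commonly computed from the pooled sample as in the sketch below (how the median distance is mapped to the Gaussian parameter σ depends on the chosen parameterization and is left to the caller, an assumption of this sketch):

import numpy as np

def median_pairwise_distance(X, Y):
    # Median of all pairwise Euclidean distances in the pooled sample Z = [X; Y].
    Z = np.vstack([X, Y])
    D = np.sqrt(np.maximum(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1), 0.0))
    return float(np.median(D[np.triu_indices_from(D, k=1)]))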
91 (e) Box plot of the median distance between points (which is also a choice for σ), grouped by ν. [sent-309, score-0.126]
92 Figure 1(d) shows the box plot for log σ, grouped by ν, where σ is the bandwidth selected by γ. [sent-314, score-0.126]
93 Figure 1(e) shows the box plot of the median distance between points (which is also a choice for σ), grouped by ν. [sent-315, score-0.17]
94 From Figures 1(c) and (e), it is easy to see that the median heuristic exhibits high type-II error for ν = 3/2, while γ exhibits zero type-II error (from Figure 1(a)). [sent-316, score-0.12]
95 It is intuitive to note that as ν increases (which means the characteristic function of Q differs from that of P at higher frequencies), a smaller σ is needed to detect these changes. [sent-318, score-0.379]
96 6 Conclusions In this work, we have shown how MMD appears in binary classification, and thus that characteristic kernels are important in kernel-based classification algorithms. [sent-321, score-0.712]
97 We have broadened the class of characteristic RKHSs to include those induced by strictly positive definite kernels (with particular application to kernels on non-compact domains, and/or kernels that are not translation invariant). [sent-322, score-1.544]
98 We have further provided a convergent generalization of MMD over families of kernel functions, which becomes necessary even when considering relatively simple families of kernels (such as the Gaussian kernels parameterized by their bandwidth). [sent-323, score-0.938]
99 Acknowledgments The authors thank anonymous reviewers for their constructive comments and especially the reviewer who pointed out the connection between characteristic kernels and the achievability of Bayes risk. [sent-325, score-0.686]
100 On the influence of the kernel on the consistency of support vector machines. [sent-465, score-0.176]
wordName wordTfidf (topN-words)
[('mmd', 0.425), ('characteristic', 0.379), ('kernels', 0.307), ('pm', 0.301), ('qn', 0.301), ('kernel', 0.176), ('pd', 0.167), ('parzen', 0.152), ('strictly', 0.139), ('rkhs', 0.126), ('kcon', 0.095), ('kg', 0.095), ('rfk', 0.095), ('risk', 0.089), ('gretton', 0.086), ('um', 0.08), ('yi', 0.079), ('rd', 0.078), ('fukumizu', 0.077), ('classi', 0.076), ('ki', 0.076), ('measurable', 0.076), ('fsvm', 0.076), ('tmn', 0.076), ('xi', 0.063), ('families', 0.061), ('rkhss', 0.061), ('sup', 0.057), ('window', 0.057), ('median', 0.057), ('klin', 0.057), ('krbf', 0.057), ('rademacher', 0.055), ('sch', 0.054), ('borel', 0.054), ('sriperumbudur', 0.054), ('pk', 0.053), ('bandwidth', 0.052), ('nite', 0.051), ('steinwart', 0.05), ('cmn', 0.05), ('yj', 0.05), ('chaos', 0.046), ('distributions', 0.044), ('box', 0.044), ('embeddings', 0.043), ('corollary', 0.043), ('margin', 0.042), ('nition', 0.041), ('translation', 0.04), ('distance', 0.039), ('family', 0.039), ('characterization', 0.039), ('homogeneity', 0.038), ('suppose', 0.038), ('lp', 0.038), ('mmds', 0.038), ('pseudometric', 0.038), ('qk', 0.037), ('invariant', 0.036), ('lkopf', 0.035), ('hilbert', 0.034), ('heuristic', 0.034), ('distinguish', 0.033), ('broadened', 0.033), ('theorem', 0.033), ('positive', 0.032), ('discrepancy', 0.031), ('mpi', 0.031), ('topological', 0.031), ('ece', 0.03), ('mat', 0.03), ('bayes', 0.03), ('generalized', 0.03), ('grouped', 0.03), ('fk', 0.03), ('inf', 0.029), ('unnormalized', 0.029), ('mn', 0.029), ('easy', 0.029), ('cybernetics', 0.028), ('lanckriet', 0.028), ('conditions', 0.028), ('gert', 0.028), ('yb', 0.028), ('banach', 0.028), ('embedding', 0.028), ('statistic', 0.028), ('la', 0.028), ('germany', 0.028), ('kl', 0.028), ('er', 0.028), ('jolla', 0.027), ('ying', 0.027), ('integrable', 0.027), ('binary', 0.026), ('generalization', 0.026), ('established', 0.026), ('injective', 0.026), ('testing', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
2 0.32340088 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur
Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
3 0.19530867 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
4 0.17657088 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
5 0.11681519 95 nips-2009-Fast subtree kernels on graphs
Author: Nino Shervashidze, Karsten M. Borgwardt
Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n^2 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
6 0.11656044 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
7 0.11262318 256 nips-2009-Which graphical models are difficult to learn?
8 0.11194707 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
9 0.10883784 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
10 0.10039666 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
11 0.096591771 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
12 0.096560851 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
13 0.092222877 72 nips-2009-Distribution Matching for Transduction
14 0.081919044 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
15 0.080996923 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
16 0.078492858 3 nips-2009-AUC optimization and the two-sample problem
17 0.075655505 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
18 0.07443437 33 nips-2009-Analysis of SVM with Indefinite Kernels
19 0.072831504 112 nips-2009-Human Rademacher Complexity
20 0.072053343 175 nips-2009-Occlusive Components Analysis
topicId topicWeight
[(0, -0.208), (1, 0.145), (2, -0.062), (3, 0.152), (4, -0.096), (5, -0.092), (6, -0.065), (7, 0.199), (8, -0.142), (9, -0.01), (10, 0.153), (11, -0.184), (12, 0.052), (13, 0.022), (14, -0.023), (15, 0.062), (16, -0.074), (17, 0.083), (18, 0.107), (19, 0.014), (20, 0.065), (21, 0.037), (22, -0.059), (23, -0.087), (24, -0.131), (25, 0.096), (26, 0.057), (27, -0.042), (28, -0.154), (29, 0.069), (30, -0.071), (31, -0.235), (32, -0.028), (33, 0.003), (34, 0.064), (35, 0.06), (36, 0.002), (37, 0.017), (38, -0.059), (39, 0.004), (40, -0.036), (41, -0.014), (42, 0.082), (43, 0.01), (44, 0.017), (45, 0.072), (46, 0.03), (47, -0.054), (48, -0.061), (49, -0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.96078944 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
2 0.87668657 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur
Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
3 0.72684717 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
4 0.66129041 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
Author: Liefeng Bo, Cristian Sminchisescu
Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a setlevel feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art in three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1
5 0.65238696 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
6 0.58869326 95 nips-2009-Fast subtree kernels on graphs
7 0.54194409 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
8 0.5243938 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
9 0.50935519 72 nips-2009-Distribution Matching for Transduction
10 0.49128526 3 nips-2009-AUC optimization and the two-sample problem
11 0.49067694 33 nips-2009-Analysis of SVM with Indefinite Kernels
12 0.43491587 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
13 0.43322009 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
14 0.43321696 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
15 0.38696146 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
16 0.38287884 112 nips-2009-Human Rademacher Complexity
17 0.38136926 11 nips-2009-A General Projection Property for Distribution Families
18 0.3790724 206 nips-2009-Riffled Independence for Ranked Data
19 0.37615591 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding
20 0.37180635 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
topicId topicWeight
[(7, 0.015), (21, 0.011), (24, 0.065), (25, 0.071), (35, 0.043), (36, 0.128), (39, 0.034), (49, 0.068), (58, 0.093), (61, 0.026), (71, 0.063), (72, 0.163), (86, 0.096), (91, 0.018)]
simIndex simValue paperId paperTitle
1 0.94170165 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale
Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck
Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.
same-paper 2 0.89451361 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
3 0.8681215 76 nips-2009-Efficient Learning using Forward-Backward Splitting
Author: Yoram Singer, John C. Duchi
Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2^2, and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1
4 0.80275542 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur
Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ2 random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
5 0.80011922 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.
6 0.77389711 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
7 0.77239966 72 nips-2009-Distribution Matching for Transduction
8 0.77117091 71 nips-2009-Distribution-Calibrated Hierarchical Classification
9 0.76674652 260 nips-2009-Zero-shot Learning with Semantic Output Codes
10 0.76624674 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
11 0.76530534 129 nips-2009-Learning a Small Mixture of Trees
12 0.76401836 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
13 0.76360554 97 nips-2009-Free energy score space
14 0.76295358 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
15 0.76214337 104 nips-2009-Group Sparse Coding
16 0.75865865 27 nips-2009-Adaptive Regularization of Weight Vectors
17 0.7584253 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
18 0.75782084 87 nips-2009-Exponential Family Graph Matching and Ranking
19 0.75690794 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
20 0.75680423 128 nips-2009-Learning Non-Linear Combinations of Kernels