jmlr jmlr2012 jmlr2012-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests. (∗ Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK. † This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München. ‡ This work was carried out while M.J.R. was with the Graz University of Technology. § Also at The Australian National University, Canberra, ACT 0200, Australia.) © 2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola. Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, integral probability metric, hypothesis testing
Reference: text
sentIndex sentText sentNum sentScore
1 Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). [sent-20, score-0.422]
2 We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. [sent-21, score-0.221]
3 Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. [sent-23, score-0.304]
4 We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. [sent-24, score-0.193]
5 Introduction We address the problem of comparing samples from two probability distributions, by proposing statistical tests of the null hypothesis that these distributions are equal against the alternative hypothesis that these distributions are different (this is called the two-sample problem). [sent-44, score-0.396]
6 We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. [sent-52, score-0.263]
7 We call this test statistic the Maximum Mean Discrepancy (MMD). [sent-53, score-0.263]
8 Clearly the quality of the MMD as a statistic depends on the class F of smooth functions that define it. [sent-54, score-0.203]
9 We also propose a test statistic with a computational cost of O(m + n): the associated test can achieve a given Type II error at a lower overall computational cost than the quadratic-cost test, by looking at a larger volume of data. [sent-62, score-0.323]
10 (2007a,b) is rigorous in its treatment of the asymptotic distribution of the test statistic under the null hypothesis. [sent-75, score-0.39]
11 In Section 3, we give an overview of hypothesis testing as it applies to the two-sample problem, and review alternative test statistics, including the L2 distance between kernel density estimates (Anderson et al. [sent-79, score-0.254]
12 We present our first two hypothesis tests in Section 4, based on two different bounds on the deviation between the population and empirical MMD. [sent-81, score-0.241]
13 When large volumes of data are available, the cost of computing the MMD (quadratic in the sample size) may be excessive: we therefore propose in Section 6 a modified version of the MMD statistic that has a linear cost in the number of samples, and an associated asymptotic test. [sent-83, score-0.28]
14 Finally, in Section 8, we demonstrate the performance of MMD-based two-sample tests on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage method. [sent-86, score-0.193]
15 2, we compute both the population MMD and two empirical estimates when the associated function space is a reproducing kernel Hilbert space, and in Section 2. [sent-99, score-0.187]
16 The empirical MMD defined below has an upward bias—we will define an unbiased statistic in the following section. [sent-148, score-0.274]
17 This statistic is unbiased following Serfling (1980, Chapter 5). [sent-207, score-0.247]
18 Moreover, while the empirical statistic for m = n is an unbiased estimate of MMD², it does not have minimum variance, since we ignore the cross-terms k(x_i, y_i), of which there are O(n). [sent-211, score-0.3]
19 The biased statistic in (2) may also be easily computed following the above reasoning. [sent-215, score-0.235]
20 Intuitively we expect the empirical test statistic MMD[F, X,Y ], whether biased or unbiased, to be small if p = q, and large if the distributions are far apart. [sent-218, score-0.36]
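To make the two estimates concrete, the sketch below computes the biased statistic MMD_b (square root of the V-statistic, diagonal terms kept) and the unbiased statistic MMD²_u (off-diagonal U-statistic) with a Gaussian RBF kernel. This is a minimal NumPy sketch rather than the authors' reference code; the kernel choice and function names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2_u: off-diagonal means of Kxx and Kyy minus twice the mean of Kxy."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def mmd_biased(X, Y, sigma=1.0):
    """Biased MMD_b: square root of the V-statistic (diagonal terms retained)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return float(np.sqrt(max(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean(), 0.0)))
```

Under H0 the unbiased estimate fluctuates around zero (and can be negative), while the biased estimate is bounded below by zero; both are small when p = q and large when the distributions are far apart.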
21 The empirical estimate fˆ∗ of the function f ∗ that witnesses the MMD—in other words, the function maximizing the mean discrepancy in (1)—is smooth, negative where the Laplace density exceeds the Gaussian density (at the center and tails), and positive where the Gaussian density is larger. [sent-240, score-0.186]
22 Given samples X ∼ p of size m and Y ∼ q of size n, the statistical test T(X, Y): X^m × X^n → {0, 1} is used to distinguish between the null hypothesis H0: p = q and the alternative hypothesis HA: p ≠ q. [sent-253, score-0.184]
23 This is achieved by comparing the test statistic MMD[F, X, Y] with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population MMD indicates p = q). [sent-254, score-0.326]
24 L2 Distance Between Parzen Window Estimates: The prior work closest to the current approach is the Parzen window-based statistic of Anderson et al. [sent-294, score-0.203]
25 Denote by D_r(p, q) := ‖f_p − f_q‖_r the L_r distance between the densities f_p and f_q corresponding to the distributions p and q, respectively. [sent-306, score-0.215]
26 Assume that f̂_p and f̂_q are given as kernel density estimates with kernel κ(x − x′), that is, f̂_p(x) = m⁻¹ ∑_{i=1}^{m} κ(x_i − x), and f̂_q(y) is defined by analogy. [sent-308, score-0.195]
27 We now describe the asymptotic performance of a two-sample test using the statistic D_2(f̂_p, f̂_q)². [sent-311, score-0.312]
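For a Gaussian window κ of bandwidth h, the squared L2 distance between the two Parzen window estimates expands into pairwise sums over the convolved window κ∗κ, which is again a Gaussian with variance 2h². A sketch of this closed form (the helper names are mine, not Anderson et al.'s; points are rows of the input arrays):

```python
import numpy as np

def _conv_window(A, B, h):
    """Pairwise evaluations of (kappa * kappa), i.e., the N(0, 2 h^2 I_d) density
    at the differences a - b, for a Gaussian window kappa of bandwidth h."""
    d = A.shape[1]
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return (4 * np.pi * h**2) ** (-d / 2) * np.exp(-np.maximum(sq, 0) / (4 * h**2))

def l2_parzen_statistic(X, Y, h):
    """D_2(f_p_hat, f_q_hat)^2 = integral of (f_p_hat - f_q_hat)^2, expanded in closed form."""
    m, n = len(X), len(Y)
    return (_conv_window(X, X, h).sum() / m**2
            + _conv_window(Y, Y, h).sum() / n**2
            - 2 * _conv_window(X, Y, h).sum() / (m * n))
```

The expanded form also shows why an RKHS view subsumes this statistic: it has exactly the shape of a (biased) MMD², with the convolved window playing the role of the kernel.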
28 We consider the power of the test under local departures from the null hypothesis. [sent-312, score-0.176]
29 Anderson et al. (1994) define these to take the form f_q = f_p + δg (7), where δ ∈ ℝ and g is a fixed, bounded, integrable function chosen to ensure that f_q is a valid density for sufficiently small |δ|. [sent-314, score-0.189]
30 Anderson et al. consider two cases: the kernel bandwidth converging to zero with increasing sample size, ensuring consistency of the Parzen window estimates of f_p and f_q; and the case of a fixed bandwidth. [sent-316, score-0.183]
31 In the former case, the minimum distance with which the test can discriminate f_p from f_q is δ = (m + n)^{−1/2} h_{m+n}^{−d/2}. [sent-317, score-0.161]
32 Formally, define s_α as a threshold for the statistic D_2(f̂_p, f̂_q), chosen to ensure the test has level α, and let δ = (m + n)^{−1/2} h_{m+n}^{−d/2} c for some fixed c ≠ 0. [sent-319, score-0.289]
33 The power of the L2 test against local alternatives is greater when the kernel is held fixed, since for any rate of decrease of hm+n with increasing sample size, δ will decrease more slowly than for a fixed kernel. [sent-324, score-0.193]
34 An RKHS-based approach generalizes the L2 statistic in a number of important respects. [sent-325, score-0.203]
35 Finally, the kernel approach leads us to establish consistency against a larger class of local alternatives to the null hypothesis than that considered by Anderson et al. [sent-333, score-0.236]
36 We embed x into an RKHS H via the feature mapping φ(x) := e_x, where e_s is the unit vector in ℝ^d taking value 1 in dimension s, and zero in the remaining entries. [sent-344, score-0.167]
37 (2008, Section 1, long version) note that this L2 statistic may not be the best choice for finite domains, citing a result of Lehmann and Romano (2005, Theorem 14. [sent-348, score-0.203]
38 Chi-squared statistic is optimal for the problem of goodness of fit testing for multinomials. [sent-354, score-0.203]
39 Further Multivariate Two-Sample Tests: Biau and Györfi (2005) (Biau) use as their test statistic the L1 distance between discretized estimates of the probabilities, where the partitioning is refined as the sample size increases. [sent-358, score-0.316]
40 The resulting test relies on the asymptotic normality of the test statistic, and is not distribution-free under the null hypothesis for finite samples (the test threshold depends on p, as with our asymptotic test in Section 5; by contrast, our tests in Section 4 are distribution-free). [sent-362, score-0.607]
41 The resulting statistic is distribution-free under the null hypothesis at finite sample sizes, in which respect it is superior to the Friedman-Rafsky test; on the other hand, it costs O((m + n)³) to compute. [sent-369, score-0.362]
42 As we shall see in our experimental comparisons, the test statistic is costly to compute; Hall and Tajvidi consider only tens of points in their experiments. [sent-371, score-0.263]
43 Pearson’s Chi-squared statistic weights each term in the sum (8) by its corresponding q−1 . [sent-379, score-0.203]
44 Our next goal is to refine this result in a way that allows us to define a test threshold under the null hypothesis p = q. [sent-391, score-0.217]
45 Corollary 9: A hypothesis test of level α for the null hypothesis p = q, that is, for MMD[F, p, q] = 0, has the acceptance region MMD_b[F, X, Y] < √(2K/m) (1 + √(2 log α⁻¹)). [sent-400, score-0.244]
46 In this case, the test statistic has a Student-t distribution with n + m − 2 degrees of freedom, and its Type II error probability converges at the same rate as our test. [sent-405, score-0.287]
47 When F is the unit ball in an RKHS, however, we may very easily define a test via a convergence bound on the unbiased statistic MMD²_u in Lemma 4. [sent-416, score-0.307]
48 Corollary 11: A hypothesis test of level α for the null hypothesis p = q has the acceptance region MMD²_u < (4K/√m) √(log α⁻¹). [sent-422, score-0.244]
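Taking the two acceptance regions above at face value (they assume m = n and a kernel bounded by 0 ≤ k(x, y) ≤ K, so K = 1 for a Gaussian RBF kernel), the distribution-free thresholds are simple to compute; the sketch below merely transcribes those expressions and should be read with the same caveats.

```python
import numpy as np

def threshold_biased(m, K=1.0, alpha=0.05):
    """Acceptance threshold for MMD_b from Corollary 9: sqrt(2K/m) * (1 + sqrt(2 log(1/alpha)))."""
    return np.sqrt(2 * K / m) * (1 + np.sqrt(2 * np.log(1 / alpha)))

def threshold_unbiased_sq(m, K=1.0, alpha=0.05):
    """Acceptance threshold for MMD^2_u from Corollary 11: (4K / sqrt(m)) * sqrt(log(1/alpha))."""
    return (4 * K / np.sqrt(m)) * np.sqrt(np.log(1 / alpha))

# The level-alpha tests accept H0 when the corresponding statistic falls below its threshold.
```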
49 We note first that the threshold for the biased statistic applies to an estimate of MMD, whereas that for the unbiased statistic is for an estimate of MMD2 . [sent-425, score-0.508]
50 Thus for sufficiently large m, the McDiarmid-based threshold will be lower (and the associated test statistic is in any case biased upwards), and its Type II error will be better for a given Type I bound. [sent-427, score-0.321]
51 Our goal is to determine whether the empirical test statistic MMD²_u is so large as to be outside the 1 − α quantile of the null distribution in (10), which gives a level α test. [sent-456, score-0.368]
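In practice the 1 − α null quantile can be estimated by resampling: pool the two samples, repeatedly permute the sample labels, and recompute MMD²_u on each permuted split (the paper's experiments use a bootstrap of this kind). A self-contained sketch with a Gaussian kernel; the names are illustrative:

```python
import numpy as np

def _mmd2_u_from_K(K, m):
    """Unbiased MMD^2_u from a kernel matrix on the pooled sample (first m rows are X)."""
    n = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def permutation_threshold(X, Y, sigma, alpha=0.05, n_perm=1000, seed=0):
    """Estimate the (1 - alpha) quantile of the null distribution of MMD^2_u by permutation."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    K = np.exp(-np.maximum(sq, 0) / (2 * sigma**2))
    m = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null_stats.append(_mmd2_u_from_K(K[np.ix_(idx, idx)], m))
    return float(np.quantile(null_stats, 1 - alpha))

# Reject H0 at level alpha when the statistic on the original ordering exceeds this value.
```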
52 Consistency of this test against local departures from the null hypothesis is provided by the following theorem, proved in Appendix B. [sent-457, score-0.229]
53 Theorem 13: Define ρ_x, ρ_y, and t as in Theorem 12, and write µ_q = µ_p + g_t, where g_t ∈ H is chosen such that µ_p + g_t remains a valid mean embedding, and ‖g_t‖_H is made to approach zero as t → ∞ to describe local departures from the null hypothesis. [sent-459, score-0.432]
54 In Figure 3, we illustrate the Pearson curve fit to the null distribution: the fit is good in the upper quantiles of the distribution, where the test threshold is computed. [sent-496, score-0.164]
55 A Linear Time Statistic and Test: The MMD-based tests are already more efficient than the O(m² log m) and O(m³) tests described in Section 3. [sent-521, score-0.224]
56 Moreover, we would like to obtain tests which have O(1) storage requirements for computing the test statistic, in order to apply the test to data streams. [sent-525, score-0.232]
57 We now describe how to achieve this by computing the test statistic using a subsampling of the terms in the sum. [sent-526, score-0.263]
58 In particular, the statistic can be used in stream computations with the need for only O(1) memory, whereas MMD²_u requires O(m) storage and O(m²) time to compute the kernel h on all interacting pairs. [sent-531, score-0.282]
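The linear-time statistic referred to here averages h((x, y), (x′, y′)) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y) over consecutive, non-overlapping pairs of the sample, so every point is touched once. A vectorized sketch with a Gaussian kernel (a streaming version would keep only a running mean, which gives the O(1) storage mentioned above); the names are illustrative:

```python
import numpy as np

def _k(a, b, sigma):
    """Row-wise Gaussian kernel values between matched points."""
    return np.exp(-np.sum((a - b)**2, axis=-1) / (2 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    """Linear-time estimate MMD^2_l: mean of h over floor(m/2) disjoint pairs."""
    m2 = min(len(X), len(Y)) // 2
    x1, x2 = X[0:2*m2:2], X[1:2*m2:2]
    y1, y2 = Y[0:2*m2:2], Y[1:2*m2:2]
    h = _k(x1, x2, sigma) + _k(y1, y2, sigma) - _k(x1, y2, sigma) - _k(x2, y1, sigma)
    return float(h.mean())
```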
59 It is instructive to compare this asymptotic distribution with that of the quadratic-time statistic MMD²_u under HA, when m = n. [sent-544, score-0.252]
60 That said, it remains to be determined what effect this approximation would have on the distribution of the test statistic under H0 , and hence on the test threshold. [sent-550, score-0.323]
61 4), a two-sample statistic using a distribution over witness functions (Section 7. [sent-556, score-0.241]
62 We have sup_{f∈F} (E_p f − E_q f) + sup_{g∈F} (E_q g − E_r g) ≥ sup_{f∈F} (E_p f − E_q f + E_q f − E_r f) ≥ sup_{f∈F} |E_p f − E_r f|. [sent-575, score-0.188]
63 Kolmogorov-Smirnov Statistic: The Kolmogorov-Smirnov (K-S) test is probably one of the most famous two-sample tests in statistics. [sent-602, score-0.172]
64 With the benefit of hindsight, it is now understandable why the kernel k(X_i, X_j) = (1/(m_i m_j)) ∑_{u,v}^{m_i, m_j} k(x_{iu}, x_{jv}) produces useful results: it is simply the kernel between the empirical means in feature space, ⟨µ(X_i), µ(X_j)⟩ (Hein et al. [sent-666, score-0.185]
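Concretely, that set-level kernel is just the mean of the base kernel over all cross-pairs of the two bags, which equals the inner product of the empirical mean embeddings µ(X_i) and µ(X_j). A short sketch (a Gaussian base kernel is assumed only for illustration):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def set_kernel(Xi, Xj, sigma=1.0):
    """k(X_i, X_j) = mean of all pairwise base-kernel values
    = <mu(X_i), mu(X_j)> in the RKHS of the base kernel."""
    return float(gaussian_kernel(Xi, Xj, sigma).mean())
```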
65 Note, however, that the empirical mean embedding µX may not be the best statistic to use for MIC: we are only interested in determining whether some instances in the domain have the desired property, rather than making a statement regarding the distribution over all instances. [sent-669, score-0.265]
66 We remark that a hypothesis test based on the above kernel statistic is more complicated than for the two-sample problem, since the product of the marginal distributions is in effect simulated by permuting the variables of the original sample. [sent-723, score-0.433]
67 An important issue in the practical application of the MMD-based tests is the selection of the kernel parameters. [sent-758, score-0.191]
68 We illustrate this with a Gaussian RBF kernel, where we must choose the kernel width σ (we use this kernel for univariate and multivariate data, but not for graphs). [sent-759, score-0.187]
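A widely used default for σ in this setting is the median heuristic: set σ to the median distance between points in the aggregate sample. The exact variant (median of distances versus squared distances, pooled versus per-sample) differs across implementations, so treat the sketch below as one reasonable choice rather than the only one:

```python
import numpy as np

def median_heuristic_sigma(X, Y):
    """Set the Gaussian kernel width to the median pairwise distance in the pooled sample."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    dist = np.sqrt(np.maximum(sq, 0))
    return float(np.median(dist[np.triu_indices(len(Z), k=1)]))
```

As the next sentence notes, choosing σ from the sample being tested makes the kernel data dependent, which the distribution-free and asymptotic analyses do not cover.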
69 We further note that setting the kernel using the sample being tested may cause changes to the asymptotic distribution: in particular, the analysis in Sections 4 and 5 assumes the kernel not to be a function of the sample. [sent-763, score-0.235]
70 Figure 5 (x-axis: Dimension; tests compared include FR Wolf, FR Smirnov, and Hall): Type II performance of the various tests when separating two Gaussians, with test level α = 0. [sent-796, score-0.172]
71 Computational Cost: We next investigate the tradeoff between computational cost and performance of the various tests, with a particular focus on how the quadratic-cost MMD tests from Sections 4 and 5 compare with the linear-time MMD-based asymptotic test from Section 6. [sent-971, score-0.221]
72 It may also be worth doing a t-test first in this case, and only running more sophisticated nonparametric tests if the t-test accepts the null hypothesis, to verify the distributions are identical in more than just mean. [sent-989, score-0.228]
73 For the final data set, we performed univariate matching of attributes (FOREST) and multivariate matching of tables (FOREST10D) from two different databases, where each table represents one type of forest. [sent-1016, score-0.182]
74 Both our asymptotic MMD²_u-based tests perform as well as or better than the alternatives, notably for CNUM, where the advantage of MMD²_u is large. [sent-1017, score-0.161]
75 , φn (An )): in other words, the kernel decomposes into kernels on the individual attributes of A (and also decomposes this way on the attributes of B). [sent-1028, score-0.215]
76 Conclusion: We have established three simple multivariate tests for comparing two distributions p and q, based on samples of size m and n from these respective distributions. [sent-1064, score-0.179]
77 Our test statistic is the maximum mean discrepancy (MMD), defined as the maximum deviation in the expectation of a function evaluated on each of the random variables, taken over a sufficiently rich function class: in our case, a reproducing kernel Hilbert space (RKHS). [sent-1065, score-0.422]
78 Equivalently, the statistic can be written as the norm of the difference between distribution feature means in the RKHS. [sent-1066, score-0.203]
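Written out, the identity behind that statement (with x, x′ ∼ p and y, y′ ∼ q drawn independently, and µ_p, µ_q the kernel mean embeddings of p and q) is:

```latex
\mathrm{MMD}^2(\mathcal{F}, p, q)
  \;=\; \left\| \mu_p - \mu_q \right\|_{\mathcal{H}}^{2}
  \;=\; \mathbf{E}_{x,x'}\!\left[k(x,x')\right]
        + \mathbf{E}_{y,y'}\!\left[k(y,y')\right]
        - 2\,\mathbf{E}_{x,y}\!\left[k(x,y)\right].
```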
79 We also give a third test based on quantiles of the asymptotic distribution of the associated test statistic. [sent-1069, score-0.169]
80 All three tests can be computed in O((m + n)²) time; however, when sufficient data are available, a linear-time statistic can be used, which in our experiments was able to achieve a given Type II error at smaller computational cost, by looking at many more samples than the quadratic-cost tests. [sent-1070, score-0.315]
81 We have seen in Section 7 that several classical metrics on probability distributions can be written as integral probability metrics with function classes that are not Hilbert spaces, but rather Banach or seminormed spaces (for instance the Kolmogorov-Smirnov and Earth Mover’s distances). [sent-1071, score-0.164]
82 (2011b) provide expressions for the maximum mean discrepancy in terms of mean embeddings in reproducing kernel Banach spaces. [sent-1074, score-0.2]
83 Harchaoui et al. (2008) define a two-sample test statistic arising from the kernel Fisher discriminant, rather than the difference of RKHS means; and Nguyen et al. [sent-1082, score-0.342]
84 We begin with the asymptotic analysis of MMD²[F, X, Y] under the null hypothesis. [sent-1163, score-0.173]
85 (1994, Appendix) that vanish for large sample sizes, since our statistic is unbiased; • we require greater generality, since our kernels are not necessarily inner products in L2 between probability density functions (although this is a special case: see Section 3. [sent-1166, score-0.328]
86 where b_l ∼ N(0, 1) independent of the a_l, and (1/√(mn)) ∑_{i=1}^{m} ∑_{j=1}^{n} k̃(x_i, y_j) →_D ∑_{l=1}^{∞} λ_l a_l b_l, (28) both jointly in distribution with (27), where (28) is proved at the end of the section. [sent-1196, score-0.226]
87 Then t MMD²_u[F, X, Y] →_D ρ_x⁻¹ ∑_{l=1}^{∞} λ_l (a_l² − 1) + ρ_y⁻¹ ∑_{l=1}^{∞} λ_l (b_l² − 1) − (2/√(ρ_x ρ_y)) ∑_{l=1}^{∞} λ_l a_l b_l = ∑_{l=1}^{∞} λ_l [(ρ_x^{−1/2} a_l − ρ_y^{−1/2} b_l)² − (ρ_x ρ_y)⁻¹]. [sent-1199, score-0.18]
88 The target distribution is written V = ∑_{l=1}^{∞} λ_l a_l b_l, and its truncation is V_L := ∑_{l=1}^{L} λ_l a_l b_l. [sent-1206, score-0.226]
89 4), where we consider a more general class of local departures from the null hypothesis (rather than the class of perturbed densities described in Section 3. [sent-1234, score-0.169]
90 We begin by transforming this statistic by centering the samples X and Y in feature space by µ_p and µ_q, respectively; unlike the H0 case, however, µ_p ≠ µ_q, and the new statistic MMD²_c is not the same as MMD²_u. [sent-1238, score-0.406]
91 The resulting centred statistic is MMD²_c[F, X, Y] = (1/(m(m − 1))) ∑_{i=1}^{m} ∑_{j≠i} ⟨φ(x_i) − µ_p, φ(x_j) − µ_p⟩_H + (1/(n(n − 1))) ∑_{i=1}^{n} ∑_{j≠i} ⟨φ(y_i) − µ_q, φ(y_j) − µ_q⟩_H − (2/(mn)) ∑_{i=1}^{m} ∑_{j=1}^{n} ⟨φ(x_i) − µ_p, φ(y_j) − µ_q⟩_H. [sent-1241, score-0.249]
92 We write µ_q = µ_p + g_t, where g_t ∈ H is chosen such that µ_p + g_t remains a valid distribution embedding, and ‖g_t‖_H can be made to approach zero to describe local departures from the null hypothesis. [sent-1242, score-0.432]
93 Consider the case where the departure from the null hypothesis satisfies ‖g_t‖_H = c t^{−1/2}. [sent-1246, score-0.21]
94 Then, as t → ∞, t MMD²_c[F, X, Y] →_D ∑_{l=1}^{∞} λ_l [(ρ_x^{−1/2} a_l + ρ_y^{−1/2} b_l)² − (ρ_x ρ_y)⁻¹] =: S as before, since the distance between µ_p and µ_q vanishes for large t (as ‖g_t‖_H → 0, and both (1/√n) ∑_{i=1}^{n} ⟨g_t, φ(y_i) − µ_q⟩_H and (1/√m) ∑_{i=1}^{m} ⟨g_t, φ(x_i) − µ_p⟩_H → 0). [sent-1247, score-0.352]
95 Next, the terms ⟨g_t, φ(x_i) − µ_p⟩_H and ⟨g_t, φ(y_i) − µ_q⟩_H in the difference between MMD²_u and MMD²_c are straightforward sums of independent zero mean random variables, and have Gaussian asymptotic distribution. [sent-1248, score-0.207]
96 Defining u_y to be the zero mean Gaussian random variable associated with the first term, (t/n) ∑_{i=1}^{n} ⟨g_t, φ(y_i) − µ_q⟩_H = (c t^{−1/2} t/n) ∑_{i=1}^{n} ⟨g_t/‖g_t‖_H, φ(y_i) − µ_q⟩_H →_D c ρ_y^{−1/2} u_y. Likewise, [sent-1249, score-0.291]
97 (t/m) ∑_{i=1}^{m} ⟨g_t, φ(x_i) − µ_p⟩_H →_D c ρ_x^{−1/2} u_x, where u_x is a zero mean Gaussian random variable independent of u_y (note, however, that u_x and u_y are correlated with terms in S, and are defined on the same probability space as a_l and b_l in this sum). [sent-1250, score-0.247]
98 We investigated three kernel choice strategies: kernel selection on the entire sample from p and q; kernel selection on a hold-out set (10% of data), and testing on the remaining 90%; and kernel selection and testing on 90% of the available data. [sent-1272, score-0.344]
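A sketch of the second strategy (hold-out selection): reserve a fraction of each sample for choosing σ and compute the test statistic only on the remainder, so that the kernel is not a function of the data being tested. The selection rule below (median heuristic on the hold-out split) is an illustrative stand-in; the essential point is the data split.

```python
import numpy as np

def holdout_kernel_selection(X, Y, frac=0.1, seed=0):
    """Split off a hold-out fraction of each sample, pick sigma on it, and return sigma
    together with the disjoint test portions of X and Y."""
    rng = np.random.default_rng(seed)
    ix, iy = rng.permutation(len(X)), rng.permutation(len(Y))
    kx, ky = max(int(frac * len(X)), 2), max(int(frac * len(Y)), 2)
    X_sel, X_test = X[ix[:kx]], X[ix[kx:]]
    Y_sel, Y_test = Y[iy[:ky]], Y[iy[ky:]]
    Z = np.vstack([X_sel, Y_sel])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    sigma = float(np.median(np.sqrt(np.maximum(sq, 0))[np.triu_indices(len(Z), k=1)]))
    return sigma, X_test, Y_test
```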
99 Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. [sent-1292, score-0.187]
100 On the asymptotic properties of a nonparametric l1 -test statistic of homogeneity. [sent-1355, score-0.252]
wordName wordTfidf (topN-words)
[('mmd', 0.687), ('statistic', 0.203), ('ex', 0.167), ('mmdb', 0.164), ('olkopf', 0.158), ('wo', 0.141), ('asch', 0.135), ('mola', 0.135), ('orgwardt', 0.135), ('retton', 0.135), ('gretton', 0.127), ('ey', 0.12), ('tests', 0.112), ('ample', 0.109), ('ernel', 0.094), ('est', 0.094), ('cnum', 0.089), ('sriperumbudur', 0.085), ('ch', 0.082), ('gt', 0.079), ('kernel', 0.079), ('null', 0.078), ('rkhs', 0.076), ('fq', 0.076), ('ser', 0.073), ('pearson', 0.067), ('bio', 0.063), ('fukumizu', 0.063), ('test', 0.06), ('bl', 0.058), ('tmnl', 0.057), ('xi', 0.055), ('hypothesis', 0.053), ('wolf', 0.051), ('attributes', 0.05), ('borgwardt', 0.049), ('anderson', 0.049), ('asymptotic', 0.049), ('population', 0.049), ('discrepancy', 0.048), ('sup', 0.047), ('ez', 0.046), ('attribute', 0.046), ('mn', 0.046), ('dudley', 0.045), ('tmn', 0.044), ('unbiased', 0.044), ('pxy', 0.043), ('forest', 0.043), ('parzen', 0.042), ('sch', 0.042), ('embeddings', 0.041), ('hilbert', 0.039), ('rasch', 0.039), ('metrics', 0.039), ('distributions', 0.038), ('departures', 0.038), ('smirnov', 0.038), ('witness', 0.038), ('ing', 0.037), ('density', 0.037), ('ii', 0.036), ('rkhss', 0.036), ('kernels', 0.036), ('embedding', 0.035), ('matching', 0.035), ('metric', 0.035), ('fr', 0.035), ('type', 0.033), ('biased', 0.032), ('biau', 0.032), ('al', 0.032), ('harchaoui', 0.032), ('neuroii', 0.032), ('rafsky', 0.032), ('smir', 0.032), ('reproducing', 0.032), ('multivariate', 0.029), ('hungarian', 0.029), ('hall', 0.029), ('lkopf', 0.029), ('characteristic', 0.028), ('bootstrap', 0.028), ('sample', 0.028), ('mcdiarmid', 0.028), ('dataset', 0.027), ('uy', 0.027), ('empirical', 0.027), ('yi', 0.026), ('fp', 0.026), ('smola', 0.026), ('theorem', 0.026), ('alternatives', 0.026), ('threshold', 0.026), ('dud', 0.025), ('hein', 0.025), ('stmn', 0.025), ('py', 0.025), ('distance', 0.025), ('probability', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 4 jmlr-2012-A Kernel Two-Sample Test
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
2 0.18848167 44 jmlr-2012-Feature Selection via Dependence Maximization
Author: Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt
Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice. Keywords: kernel methods, feature selection, independence measure, Hilbert-Schmidt independence criterion, Hilbert space embedding of distribution
3 0.081382789 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f* = ∑_{j∈S} f_j*, where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f_j* lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f* based on kernels combined with ℓ1-type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2(P) and L2(Pn) norms over the class F_{d,s,H} of sparse additive models with each univariate function f_j* in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2(P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity s = Ω(√n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
4 0.07735756 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^{−α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
5 0.060482126 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
6 0.057927277 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
7 0.053090803 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
8 0.051769372 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
9 0.050612941 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
10 0.050487302 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
11 0.049768999 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
12 0.0493416 80 jmlr-2012-On Ranking and Generalization Bounds
13 0.049155015 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
14 0.044198234 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
15 0.041210622 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
16 0.040623311 111 jmlr-2012-Structured Sparsity and Generalization
17 0.038839132 59 jmlr-2012-Linear Regression With Random Projections
18 0.036037683 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
19 0.035001047 12 jmlr-2012-Active Clustering of Biological Sequences
20 0.034675032 84 jmlr-2012-Online Submodular Minimization
topicId topicWeight
[(0, -0.199), (1, 0.056), (2, 0.027), (3, 0.146), (4, 0.127), (5, -0.031), (6, -0.115), (7, 0.035), (8, -0.002), (9, 0.049), (10, 0.005), (11, -0.153), (12, -0.068), (13, -0.026), (14, 0.091), (15, 0.15), (16, 0.132), (17, -0.19), (18, 0.056), (19, 0.042), (20, 0.198), (21, 0.082), (22, -0.04), (23, -0.105), (24, -0.272), (25, 0.003), (26, 0.051), (27, 0.208), (28, 0.085), (29, -0.158), (30, -0.162), (31, -0.024), (32, -0.177), (33, 0.158), (34, -0.0), (35, 0.024), (36, -0.028), (37, 0.065), (38, 0.086), (39, 0.074), (40, 0.012), (41, -0.017), (42, -0.01), (43, -0.095), (44, 0.072), (45, 0.081), (46, 0.025), (47, 0.003), (48, -0.16), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.90133274 4 jmlr-2012-A Kernel Two-Sample Test
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
2 0.77939844 44 jmlr-2012-Feature Selection via Dependence Maximization
Author: Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt
3 0.32012022 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
4 0.31586543 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon
Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet
5 0.29549086 100 jmlr-2012-Robust Kernel Density Estimation
Author: JooSeuk Kim, Clayton D. Scott
6 0.28857082 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.28136876 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
8 0.25667909 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
9 0.24954063 109 jmlr-2012-Stability of Density-Based Clustering
10 0.24195449 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
11 0.23617491 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.22283933 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
13 0.22164144 59 jmlr-2012-Linear Regression With Random Projections
14 0.21181047 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
15 0.20432146 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
16 0.20323892 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.20321067 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
18 0.19727093 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
19 0.19613671 111 jmlr-2012-Structured Sparsity and Generalization
20 0.18662053 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
topicId topicWeight
[(21, 0.031), (26, 0.038), (29, 0.47), (35, 0.02), (49, 0.027), (56, 0.016), (64, 0.014), (69, 0.019), (75, 0.096), (79, 0.013), (92, 0.082), (96, 0.064)]
simIndex simValue paperId paperTitle
1 0.92248344 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010). Keywords: causality, graphical models, randomized experiments, structural equation models, latent variables, latent confounders, cycles
2 0.90771133 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
same-paper 3 0.86171848 4 jmlr-2012-A Kernel Two-Sample Test
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
4 0.56244069 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data ; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causallyinspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
5 0.52927226 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
6 0.51363444 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
7 0.50603592 100 jmlr-2012-Robust Kernel Density Estimation
8 0.48498964 109 jmlr-2012-Stability of Density-Based Clustering
9 0.48272541 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
10 0.47874007 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
11 0.47578612 82 jmlr-2012-On the Necessity of Irrelevant Variables
12 0.46474421 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
13 0.46453783 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
14 0.46220136 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
15 0.46057716 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
16 0.45078287 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
17 0.44998991 44 jmlr-2012-Feature Selection via Dependence Maximization
18 0.44764954 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
19 0.4462311 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
20 0.43935865 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach