nips nips2006 nips2006-5 nips2006-5-reference knowledge-graph by maker-knowledge-mining

5 nips-2006-A Kernel Method for the Two-Sample-Problem


Source: pdf

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.


reference text

[1] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.

[2] M. Arcones and E. Gin´ . On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, e 1992.

[3] G. Biau and L. Gyorfi. On the asymptotic properties of a nonparametric l1 -test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965–3973, 2005.

[4] P. Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The Annals of Mathematical Statistics, 40(1):1–23, 1969.

[5] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[6] K. M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Sch¨ lkopf, and A.J. Smola. Integrating o structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.

[7] K. M. Borgwardt, C. S. Ong, S. Schonauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

[8] R. Caruana and T. Joachims. Kdd cup. http://kodiak.cs.cornell.edu/kddcup/index.html, 2004.

[9] G. Casella and R. Berger. Statistical Inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.

[10] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.

[11] R. Fortet and E. Mourier. Convergence de la r´ paration empirique vers la r´ paration th´ orique. Ann. e e e ´ Scient. Ecole Norm. Sup., 70:266–285, 1953.

[12] J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.

[13] A. Gretton, K. Borgwardt, M. Rasch, B. Sch¨ lkopf, and A. Smola. A kernel method for the two sample o problem. Technical Report 157, MPI for Biological Cybernetics, 2007.

[14] G. R. Grimmet and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, third edition, 2001.

[15] P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359–374, 2002.

[16] M. Hein, T.N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in svm’s. In Proceedings of the 26th DAGM Symposium, pages 270–277, Berlin, 2004. Springer.

[17] N. Henze and M. Penrose. On the multivariate runs test. The Annals of Statistics, 27(1):290–298, 1999.

[18] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1 (Second Edition). John Wiley and Sons, 1994.

[19] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[20] P. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society B, 67(4):515–530, 2005.

[21] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.

[22] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67–93, 2002.