jmlr jmlr2012 jmlr2012-4 jmlr2012-4-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
∗ Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK.
† This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.
‡ This work was carried out while M.J.R. was with the Graz University of Technology.
§ Also at The Australian National University, Canberra, ACT 0200, Australia.
© 2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola.
Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, integral probability metric, hypothesis testing
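To make the abstract's statistic concrete: writing F for the unit ball of an RKHS H with kernel k, the MMD and the closed-form squared expression underlying the quadratic-time tests are, in the standard formulation,

```latex
\mathrm{MMD}[\mathcal{F}, p, q]
  = \sup_{\|f\|_{\mathcal{H}} \le 1}
    \bigl( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \bigr),
\qquad
\mathrm{MMD}^2[\mathcal{F}, p, q]
  = \mathbf{E}_{x,x'}[k(x,x')] - 2\,\mathbf{E}_{x,y}[k(x,y)]
    + \mathbf{E}_{y,y'}[k(y,y')].
```

The sketch below shows a quadratic-time empirical version with a Gaussian kernel, the median-heuristic bandwidth referenced in Figure 7, and a permutation threshold as a stand-in for the bootstrap null. It is a minimal illustration, not the authors' implementation; the function names and defaults are assumptions, and the paper's distribution-free and Pearson-curve thresholds are alternatives to the resampling step shown here.

```python
# Minimal sketch (not the authors' code): unbiased quadratic-time MMD^2
# with a Gaussian kernel, median-heuristic bandwidth, and a permutation
# threshold standing in for the bootstrap null of Figure 7 (left panel).
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def median_heuristic(Z):
    # Bandwidth = median pairwise distance over the pooled sample.
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    d = np.sqrt(np.maximum(sq, 0))
    return np.median(d[np.triu_indices_from(d, k=1)])

def mmd2_u(X, Y, sigma):
    # Unbiased estimate: within-sample terms exclude the diagonal.
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    np.fill_diagonal(Kxx, 0)
    np.fill_diagonal(Kyy, 0)
    return (Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1))
            - 2 * Kxy.mean())

def two_sample_test(X, Y, n_perm=500, alpha=0.05, seed=None):
    # Reject H0: p = q at level alpha if the statistic exceeds the
    # (1 - alpha) quantile of the permutation null distribution.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    sigma = median_heuristic(Z)
    stat = mmd2_u(X, Y, sigma)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        null.append(mmd2_u(Z[perm[:len(X)]], Z[perm[len(X):]], sigma))
    return stat, float(np.quantile(null, 1 - alpha))
```

The estimator above costs O((m + n)^2) kernel evaluations; the linear-time approximation mentioned in the abstract instead averages the same kernel combination over disjoint pairs of sample points, trading higher variance for a single pass through the data.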
Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the Annual Conference on Computational Learning Theory, LNCS, pages 139–153. Springer, 2006.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
[Figure 7: Type II error on the Neural Data I set, against sample size m, for a kernel computed via the median heuristic on the full data set ("All"), on a 10% hold-out set ("Train"), and on 90% of the data ("Part"). Results are plotted over 1000 repetitions. Left: bootstrap results. Right: Pearson curve results.]
S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.
M. Arcones and E. Giné. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655–674, 1992.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
G. Biau and L. Györfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965–3973, 2005.
P. Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The Annals of Mathematical Statistics, 40(1):1–23, 1969.
C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics (ISMB), 21(Suppl 1):i47–i56, June 2005.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB), 22(14):e49–e57, 2006.
O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
R. Caruana and T. Joachims. KDD cup, 2004. URL http://kodiak.cs.cornell.edu/kddcup/index.html.
G. Casella and R. Berger. Statistical Inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.
B. Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. Journal of the ACM, 47:1028–1047, 2000.
P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.
C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 5254 of Lecture Notes in Computer Science, pages 38–53. Springer, 2008.
M. Davy, A. Gretton, A. Doucet, and P. J. W. Rayner. Optimized support vector machines for nonstationary signal classification. IEEE Signal Processing Letters, 9(12):442–445, 2002.
V. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer, New York, 1999.
M. Dudík and R. E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Proceedings of the Annual Conference on Computational Learning Theory, pages 123–138. Springer Verlag, 2006.
M. Dudík, S. Phillips, and R. E. Schapire. Performance guarantees for regularized maximum entropy density estimation. In Proceedings of the Annual Conference on Computational Learning Theory, pages 472–486. Springer Verlag, 2004.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
W. Feller. An Introduction to Probability Theory and its Applications. John Wiley and Sons, New York, 2nd edition, 1971.
A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266–285, 1953.
J. Friedman. On multivariate goodness-of-fit and two-sample testing. Technical Report SLAC-PUB-10325, Stanford University Statistics Department, 2003.
J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of the International Conference on Machine Learning, pages 179–186. Morgan Kaufmann Publishers Inc., 2002.
E. Gokcay and J. C. Principe. Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):158–171, 2002.
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005a.
A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, pages 513–520, Cambridge, MA, 2007a. MIT Press.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel approach to comparing distributions. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), pages 1637–1641, 2007b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. Technical Report 157, MPI for Biological Cybernetics, 2008a.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008b. MIT Press.
A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran Associates Inc.
G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, third edition, 2001.
P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359–374, 2002.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
M. Hein, T. N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in SVMs. In Proceedings of the 26th DAGM Symposium, pages 270–277, Berlin, 2004. Springer.
N. Henze and M. Penrose. On the multivariate runs test. The Annals of Statistics, 27(1):290–298, 1999.
W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
T. Jebara and R. Kondor. Bhattacharyya and expected likelihood kernels. In Proceedings of the Annual Conference on Computational Learning Theory, volume 2777 of LNCS, pages 57–71, Heidelberg, Germany, 2003. Springer-Verlag.
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 180–191. VLDB Endowment, 2004.
H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer, 3rd edition, 2005.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
C. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
X. L. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems 20, pages 1089–1096. MIT Press, Cambridge, MA, 2008.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. The Art of Scientific Computation. Cambridge University Press, Cambridge, UK, 1994.
M. Rasch, A. Gretton, Y. Murayama, W. Maass, and N. K. Logothetis. Predicting spiking activity from local field potentials. Journal of Neurophysiology, 99:1461–1476, 2008.
M. Reed and B. Simon. Methods of Modern Mathematical Physics. Vol. 1: Functional Analysis. Academic Press, San Diego, 1980.
M. Reid and R. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.
P. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society B, 67(4):515–530, 2005.
Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997. Download: http://www.kernel-machines.org.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
J. Shawe-Taylor and A. Dolia. A framework for probability density estimation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 468–475, 2007.
H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57:3498–3511, 2009.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, London, 1986.
N. V. Smirnov. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Moscow University Mathematics Bulletin, 2:3–26, 1939.
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning, pages 911–918, San Francisco, 2000. Morgan Kaufmann Publishers.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.
L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the International Conference on Machine Learning, pages 992–999. ACM, 2008.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proceedings of the Annual Conference on Computational Learning Theory, pages 111–122, 2008.
B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran Associates Inc.
B. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. Lanckriet. Non-parametric estimation of integral probability metrics. In International Symposium on Information Theory, pages 1428–1432, 2010a.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010b.
B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011a.
B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint. In Advances in Neural Information Processing Systems 24. Curran Associates Inc., Red Hook, NY, 2011b.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 2006.
D. M. J. Tax and R. P. W. Duin. Data domain description by support vectors. In Proceedings ESANN, pages 251–256, Brussels, 1999. D-Facto.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
J. E. Wilkins. A note on skewness and kurtosis. The Annals of Mathematical Statistics, 15(3):333–335, 1944.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA, 2001. MIT Press.