
47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures


Source: pdf

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γ_k, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γ_k. First, we consider the question of determining the conditions on the kernel k for which γ_k is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on R^d, then it is characteristic if and only if the support of its Fourier transform is the entire R^d. Second, we show that the distance between distributions under γ_k results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γ_k, we relate γ_k to other popular metrics on probability measures, and present conditions on the kernel k und

∗ Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA.
© 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf and Gert R. G. Lanckriet.
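The pseudometric described in the abstract, the RKHS distance between the mean embeddings of two distributions, can be illustrated with a small numerical sketch. The code below is not taken from the paper: it assumes a Gaussian kernel (a translation-invariant kernel whose Fourier transform has full support, so it meets the abstract's condition for being characteristic), it replaces the two distributions by finite samples, and it uses the expansion of the squared embedding distance into kernel expectations. The names `gaussian_kernel`, `mean_kernel`, and `gamma_k`, and the bandwidth `sigma`, are illustrative choices.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on R^d; x and y are equal-length sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs with x in xs and y in ys."""
    total = sum(kernel(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def gamma_k(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate of gamma_k between the empirical measures of xs and ys.

    Uses ||mu_P - mu_Q||^2 = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'),
    with each expectation replaced by a sample average (a biased estimate).
    """
    squared = (mean_kernel(xs, xs, kernel)
               - 2.0 * mean_kernel(xs, ys, kernel)
               + mean_kernel(ys, ys, kernel))
    # Clamp small negative values that can arise from floating-point rounding.
    return math.sqrt(max(squared, 0.0))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    p_sample = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
    q_sample = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
    r_sample = [(rng.gauss(2.0, 1.0),) for _ in range(200)]
    print("same distribution:   ", gamma_k(p_sample, q_sample))
    print("shifted distribution:", gamma_k(p_sample, r_sample))
```

In this sketch the estimate is exactly zero when both arguments are the same sample, and it should come out noticeably larger for the shifted sample than for the two samples drawn from the same Gaussian. The pairwise sums cost O(n²) kernel evaluations, which is acceptable for an illustration but not for large samples.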


reference text

S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 28:131–142, 1966.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
A. D. Barbour and L. H. Y. Chen. An Introduction to Stein's Method. Singapore University Press, Singapore, 2005.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, London, UK, 2004.
K. M. Borgwardt, A. Gretton, M. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
P. Brémaud. Mathematical Principles of Signal Processing. Springer-Verlag, New York, 2001.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
W. Dahmen and C. A. Micchelli. Some remarks on ridge functions. Approx. Theory Appl., 3:139–143, 1987.
E. del Barrio, J. A. Cuesta-Albertos, C. Matrán, and J. M. Rodríguez-Rodríguez. Testing of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, 27:1230–1239, 1999.
L. Devroye and L. Györfi. No empirical probability measure can converge in the total variation sense for all distributions. Annals of Statistics, 18(3):1496–1499, 1990.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, New York, 1999.
B. Fuglede and F. Topsøe. Jensen-Shannon divergence and Hilbert space embedding, 2003. Preprint.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. Annals of Statistics, 37(5):1871–1905, 2009a.
K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480, 2009b.
C. Gasquet and P. Witomski. Fourier Analysis and Applications. Springer-Verlag, New York, 1999.
A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, December 2005a.
A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In Z. Ghahramani and R. Cowell, editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 1–8, 2005b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. Technical Report 157, MPI for Biological Cybernetics, 2007a.
A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007b.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Z. Ghahramani and R. Cowell, editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 136–143, 2005.
M. Hein, T.N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in SVMs. In Proceedings of the 26th DAGM Symposium, pages 270–277, Berlin, 2004. Springer.
E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer-Verlag, New York, 2005.
F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Trans. Information Theory, 52(10):4394–4412, 2006.
T. Lindvall. Lectures on the Coupling Method. John Wiley & Sons, New York, 1992.
C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 1997.
A. Pinkus. Strictly positive definite functions on a real inner product space. Adv. Comput. Math., 20:263–271, 2004.
S. T. Rachev. Probability Metrics and the Stability of Stochastic Models. John Wiley & Sons, Chichester, 1991.
S. T. Rachev and L. Rüschendorf. Mass Transportation Problems. Vol. I Theory, Vol. II Applications. Probability and its Applications. Springer-Verlag, Berlin, 1998.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
M. Reed and B. Simon. Methods of Modern Mathematical Physics: Functional Analysis I. Academic Press, New York, 1980.
M. Rosenblatt. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Annals of Statistics, 3(1):1–14, 1975.
W. Rudin. Functional Analysis. McGraw-Hill, USA, 1991.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57:3498–3511, 2009.
G. R. Shorack. Probability for Statisticians. Springer-Verlag, New York, 2000.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750–1758. MIT Press, 2009a.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. http://arxiv.org/abs/0901.2698v4, October 2009b.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. On the relation between universality, characteristic kernels and RKHS embedding of measures. In Y. W. Teh and M. Titterington, editors, Proc. 13th International Conference on Artificial Intelligence and Statistics, volume 9 of Workshop and Conference Proceedings. JMLR, 2010a.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. http://arxiv.org/abs/1003.0887, March 2010b.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1972.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409–433, 1976.
I. Vajda. Theory of Statistical Inference and Information. Kluwer Academic Publishers, Boston, 1989.
S. S. Vallander. Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl., 18:784–786, 1973.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
N. Weaver. Lipschitz Algebras. World Scientific Publishing Company, 1999.
H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.