jmlr jmlr2013 jmlr2013-57 jmlr2013-57-reference knowledge-graph by maker-knowledge-mining

57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels


Source: pdf

Author: Kenji Fukumizu, Le Song, Arthur Gretton

Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model.

Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space
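The abstract's two ingredients — representing a distribution by the RKHS mean of a weighted sample, and turning a prior plus a joint sample into posterior weights — can be sketched numerically. The sketch below follows the general two-step structure of the kernel Bayes' rule (a regularized transfer of prior weights onto the joint sample, then a "squared" Tikhonov inversion, used because the weighted Gram matrix need not be positive semidefinite). The Gaussian kernel, bandwidth, and regularization constants here are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    # Gaussian kernel Gram matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kbr_posterior_weights(X, Y, U, gamma, y_obs, eps=0.01, delta=0.01, sigma=1.0):
    """Posterior weights w such that E[f(X) | Y = y_obs] is estimated by sum_i w_i f(X_i).

    X, Y : (n, d) joint sample; U : (l, d) prior sample with weights gamma (length l);
    y_obs : 1-D observed value of Y. eps, delta are regularization constants.
    """
    n = X.shape[0]
    GX = gauss_gram(X, X, sigma)
    GY = gauss_gram(Y, Y, sigma)
    GXU = gauss_gram(X, U, sigma)
    # Step 1: transfer the prior weights gamma on U to weights mu on the joint
    # sample, via a regularized inversion of the Gram matrix on X.
    mu = np.linalg.solve(GX + n * eps * np.eye(n), GXU @ gamma)
    # Step 2: "squared" Tikhonov regularization on the reweighted Gram matrix.
    LG = np.diag(mu) @ GY
    ky = gauss_gram(Y, y_obs[None, :], sigma).ravel()
    w = LG @ np.linalg.solve(LG @ LG + n * delta * np.eye(n), mu * ky)
    return w
```

As a sanity check on a toy model with Y = X + noise and a uniform prior over the sample itself, conditioning on a large positive y should put more weight on sample points with large X than conditioning on a large negative y does.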


reference text

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Charles R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

David Blei and Michael Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2006.

Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. Hilbert space embeddings of predictive state representations. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 92–101, 2013.

Adrian W. Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353–360, 1984.

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Arnaud Doucet, Nando De Freitas, and Neil J. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

Mona Eberts and Ingo Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1539–1547. Curran Associates, Inc., 2011.

Heinz W. Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 2000.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496. MIT Press, 2008.

Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Kernel dimension reduction in regression. Annals of Statistics, 37(4):1871–1905, 2009a.

Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009b. Curran Associates, Inc.

Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520, Cambridge, MA, 2007. MIT Press.

Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.

Arthur Gretton, Kenji Fukumizu, Zaid Harchaoui, and Bharath Sriperumbudur. A fast, consistent kernel two-sample test. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 673–681, 2009a.

Arthur Gretton, Kenji Fukumizu, and Bharath K. Sriperumbudur. Discussion of: Brownian distance covariance. Annals of Applied Statistics, 3(4):1285–1294, 2009b.

Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pages 1823–1830, 2012.

Steffen Grünewälder, Arthur Gretton, and John Shawe-Taylor. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 1184–1192, 2013.

Zaid Harchaoui, Francis Bach, and Eric Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 21, pages 609–616, Cambridge, MA, 2008. MIT Press.

Simon J. Julier and Jeffrey K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997.

Annaliisa Kankainen and Nikolai G. Ushakov. A consistent modification of a test for independence based on the empirical characteristic function. Journal of Mathematical Sciences, 89:1582–1589, 1998.

Steven MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics – Simulation and Computation, 23(3):727–741, 1994.

Steven N. MacEachern, Merlise Clyde, and Jun S. Liu. Sequential importance sampling for nonparametric Bayes models: The next generation. The Canadian Journal of Statistics, 27(2):251–267, 1999.

Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.

Florence Merlevède, Magda Peligrad, and Sergey Utev. Sharp conditions for the CLT of linear processes in a Hilbert space. Journal of Theoretical Probability, 10:681–693, 1997.

Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, pages 536–542. MIT Press, 1999.

Valérie Monbet, Pierre Ailliot, and Pierre-François Marteau. L1-convergence of smoothing densities in non-parametric state space models. Statistical Inference for Stochastic Processes, 11:311–325, 2008.

Peter Müller and Fernando A. Quintana. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110, 2004.

Shigeki Nakagome, Kenji Fukumizu, and Shuhei Mano. Kernel approximate Bayesian computation for population genetic inferences. Statistical Applications in Genetics and Molecular Biology, 2013. Accepted.

Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2012), pages 644–653, 2012.

Mats Rudemo. Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9(2):65–78, 1982.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

Albert N. Shiryaev. Probability. Springer, 2nd edition, 1995.

Scott A. Sisson, Yanan Fan, and Mark M. Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.

Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximation. Constructive Approximation, 26:153–172, 2007.

Le Song, Jonathan Huang, Alexander Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 961–968, 2009.

Le Song, Arthur Gretton, and Carlos Guestrin. Nonparametric tree graphical models via kernel embeddings. In Proceedings of AISTATS 2010, pages 765–772, 2010a.

Le Song, Sajid M. Siddiqi, Geoffrey Gordon, and Alexander Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 991–998, 2010b.

Le Song, Arthur Gretton, Danny Bickson, Yucheng Low, and Carlos Guestrin. Kernel belief propagation. In Proceedings of AISTATS 2011, pages 707–715, 2011.

Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

John Stachurski. A Hilbert space central limit theorem for geometrically ergodic Markov chains. Working paper, Australian National University, 2012.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053, 1982.

Simon Tavaré, David J. Balding, Robert C. Griffiths, and Peter Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145:505–518, 1997.

Sebastian Thrun, John Langford, and Dieter Fox. Monte Carlo hidden Markov models: Learning non-parametric models of partially observable stochastic processes. In Proceedings of the International Conference on Machine Learning (ICML 1999), pages 415–424, 1999.

Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.

Mike West, Peter Müller, and Michael D. Escobar. Hierarchical priors and mixture models, with applications in regression and density estimation. In P. Freeman et al., editor, Aspects of Uncertainty: A Tribute to D. V. Lindley, pages 363–386. Wiley, 1994.

Harold Widom. Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109:278–295, 1963.

Harold Widom. Asymptotic behavior of the eigenvalues of certain integral equations II. Archive for Rational Mechanics and Analysis, 17:215–229, 1964.

Kun Zhang, Jan Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.