JMLR 2013, paper 9 — reference knowledge graph (maker-knowledge-mining)
Source: pdf
Author: Sumio Watanabe
Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite; otherwise, it is called singular. In regular statistical models, the Bayes free energy, defined as the minus logarithm of the Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models this approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic-geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) as the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for, or unrealizable by, the true distribution. Since WBIC can be numerically calculated without any information about the true distribution, it is a generalization of BIC to singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion
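The abstract's recipe can be made concrete: WBIC is the tempered-posterior average of the negative log likelihood, E_w^β[nL_n(w)] with β = 1/log n, and it approximates the Bayes free energy F = -log(marginal likelihood). Below is a minimal sketch, not from the paper itself: it uses a toy conjugate model (x_i ~ N(μ, 1) with a N(0, 1) prior on μ, my choice), where F has a closed form, and evaluates the tempered-posterior expectation by 1-D grid quadrature rather than the MCMC one would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0.5, 1.0, size=n)  # toy data; regular, realizable case

# Exact Bayes free energy F = -log Z for N(mu,1) likelihood with N(0,1) prior:
# Z is a Gaussian integral, giving the closed form below.
s, s2 = x.sum(), (x ** 2).sum()
F_exact = 0.5 * n * np.log(2 * np.pi) + 0.5 * np.log(1 + n) \
          + 0.5 * (s2 - s ** 2 / (n + 1))

# WBIC: posterior average of n*L_n(w) at inverse temperature beta = 1/log n.
beta = 1.0 / np.log(n)
mu = np.linspace(-3.0, 4.0, 4001)                # parameter grid
# n L_n(mu) = -sum_i log p(x_i | mu) for each grid point
nL = 0.5 * n * np.log(2 * np.pi) \
     + 0.5 * ((x[:, None] - mu[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * mu ** 2
logw = -beta * nL + log_prior                    # tempered posterior (unnormalized)
w = np.exp(logw - logw.max())                    # stabilized weights
wbic = (w * nL).sum() / w.sum()                  # E_w^beta[ n L_n(w) ]

print(F_exact, wbic)
```

On this regular model the two printed numbers agree up to the O(√(log n)) fluctuation term of the theorem; the point of WBIC is that the same posterior average remains a valid estimate of F in singular models, where the BIC penalty (d/2) log n is wrong.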
References:
T. W. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.
M. Aoyagi and K. Nagata. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix-type singularity. Neural Computation, 24(6):1569–1610, 2012.
M. Aoyagi and S. Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18(7):924–933, 2005.
M. F. Atiyah. Resolution of singularities and division of distributions. Communications on Pure and Applied Mathematics, 13:145–150, 1970.
I. N. Bernstein. Analytic continuation of distributions with respect to a parameter. Functional Analysis and its Applications, 6(4):26–40, 1972.
M. Drton. Likelihood ratio tests and singularities. The Annals of Statistics, 37:979–1012, 2009.
M. Drton. Reduced rank regression. In Workshop on Singular Learning Theory. American Institute of Mathematics, 2010.
M. Drton, B. Sturmfels, and S. Sullivant. Lectures on Algebraic Statistics. Birkhäuser, Berlin, 2009.
I. M. Gelfand and G. E. Shilov. Generalized Functions. Volume I: Properties and Operations. Academic Press, San Diego, 1964.
I. J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, Cambridge, 1965.
H. Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79:109–326, 1964.
M. Kashiwara. B-functions and holonomic systems. Inventiones Mathematicae, 38:33–53, 1976.
F. J. Király, P. von Bünau, F. C. Meinecke, D. A. J. Blythe, and K.-R. Müller. Algebraic geometric comparison of probability distributions. Journal of Machine Learning Research, 13:855–903, 2012.
J. Kollár. Singularities of pairs. In Algebraic Geometry, Santa Cruz 1995, Proceedings of Symposia in Pure Mathematics, volume 62, pages 221–286. American Mathematical Society, 1997.
S. Lin. Algebraic Methods for Evaluating Integrals in Bayesian Statistics. PhD thesis, University of California, Berkeley, 2011.
D. Rusakov and D. Geiger. Asymptotic model selection for naive Bayesian networks. Journal of Machine Learning Research, 6:1–35, 2005.
M. Saito. On real log canonical thresholds. arXiv:0707.2308v1, 2007.
M. Sato and T. Shintani. On zeta functions associated with prehomogeneous vector spaces. Annals of Mathematics, 100:131–170, 1974.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
A. Varchenko. Newton polyhedrons and estimates of oscillatory integrals. Functional Analysis and its Applications, 10(3):13–38, 1976.
S. Watanabe. Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720:39–50, 1999.
S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001a.
S. Watanabe. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8):1049–1060, 2001b.
S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge, UK, 2009.
S. Watanabe. Asymptotic learning curve and renormalizable condition in statistical learning theory. Journal of Physics: Conference Series, 233(1):012014, 2010a. doi: 10.1088/1742-6596/233/1/012014.
S. Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11:3571–3591, 2010b.
K. Yamazaki and S. Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16(7):1029–1038, 2003.
K. Yamazaki and S. Watanabe. Singularities in complete bipartite graph-type Boltzmann machines and upper bounds of stochastic complexities. IEEE Transactions on Neural Networks, 16(2):312–324, 2005.
P. Zwiernik. Asymptotic model selection and identifiability of directed tree models with hidden variables. CRiSM report, 2010.
P. Zwiernik. An asymptotic behaviour of the marginal likelihood for general Markov models. Journal of Machine Learning Research, 12:3283–3310, 2011.