jmlr2013-108 reference knowledge-graph by maker-knowledge-mining
Source: pdf
Authors: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
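The abstract describes the algorithm only at a high level. As a concrete illustration, below is a minimal Python sketch of the stochastic variational inference updates for latent Dirichlet allocation, the simpler of the two topic models analyzed: sample a document, fit its local variational parameters by coordinate ascent, then take a weighted step on the global topic parameters. All function names, hyperparameter defaults, and the dense count-matrix input are illustrative assumptions made here, not the authors' released implementation.

```python
# Minimal sketch of stochastic variational inference (SVI) for LDA.
# Illustrative assumptions: documents are dense word-count vectors,
# the minibatch is a single document, and safeguards are minimal.
import numpy as np
from scipy.special import digamma

def expect_log_dirichlet(param):
    """E[log x] under Dirichlet(param); row-wise for 2-D input."""
    if param.ndim == 1:
        return digamma(param) - digamma(param.sum())
    return digamma(param) - digamma(param.sum(axis=1, keepdims=True))

def svi_lda(counts, K, alpha=0.1, eta=0.01, tau=1.0, kappa=0.7,
            n_iters=10000, local_iters=50, seed=0):
    """counts: (D, V) word-count matrix. Returns topic parameters lam (K, V)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    lam = rng.gamma(100.0, 0.01, size=(K, V))    # global topic parameters
    for t in range(n_iters):
        d = rng.integers(D)                      # sample one document
        n_d = counts[d]
        Elog_beta = expect_log_dirichlet(lam)    # (K, V)
        gamma = np.ones(K)                       # local doc-topic parameters
        for _ in range(local_iters):             # local coordinate ascent
            Elog_theta = expect_log_dirichlet(gamma)
            # phi[k, w] proportional to exp(E[log theta_k] + E[log beta_kw])
            log_phi = Elog_theta[:, None] + Elog_beta
            phi = np.exp(log_phi - log_phi.max(axis=0))
            phi /= phi.sum(axis=0)
            gamma_new = alpha + phi @ n_d
            if np.abs(gamma_new - gamma).mean() < 1e-4:
                gamma = gamma_new
                break
            gamma = gamma_new
        # intermediate global estimate, as if the corpus were D copies of d
        lam_hat = eta + D * phi * n_d[None, :]
        rho = (t + tau) ** (-kappa)              # Robbins-Monro step size
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam
```

The step-size schedule rho_t = (t + tau)^(-kappa) satisfies the Robbins and Monro (1951) conditions for kappa in (0.5, 1], which is what lets these noisy one-document updates converge to a local optimum of the variational objective.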
References:
A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In Web Search and Data Mining, New York, NY, USA, 2012.
S. Amari. Differential geometry of curved exponential families: curvatures and information loss. The Annals of Statistics, 10(2):357–385, 1982.
S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
A. Asuncion, M. Welling, P. Smyth, and Y. Teh. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence, 2009.
H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Uncertainty in Artificial Intelligence, 1999.
H. Attias. A variational Bayesian framework for graphical models. In Neural Information Processing Systems, 2000.
J. Bernardo and A. Smith. Bayesian Theory. John Wiley & Sons Ltd., Chichester, 1994.
C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. In Neural Information Processing Systems, Cambridge, MA, 2003.
D. Blackwell and J. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.
D. Blei and J. Lafferty. Dynamic topic models. In International Conference on Machine Learning, pages 113–120, 2006.
D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1(1):17–35, 2007.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
L. Bottou. On-line learning and stochastic approximations. In On-line Learning in Neural Networks, pages 9–42. Cambridge University Press, 1998.
L. Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, pages 146–168. Springer, 2003.
L. Bottou and O. Bousquet. Learning using large datasets. In Mining Massive Datasets for Security. IOS Press, 2008.
O. Cappé and E. Moulines. On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.
M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Neural Information Processing Systems, 2002.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
M. Do Carmo. Riemannian Geometry. Birkhäuser, 1992.
A. Doucet, N. De Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. Springer, 2001.
E. Erosheva. Bayesian estimation of the grade of membership model. Bayesian Statistics, 7:501–510, 2003.
M. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.
S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41–62, 1998.
E. Fox, E. Sudderth, M. Jordan, and A. Willsky. An HDP-HMM for systems with state persistence. In International Conference on Machine Learning, 2008.
E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011a.
E. Fox, E. Sudderth, M. Jordan, and A. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011b.
S. Geisser. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70:320–328, 1975.
A. Gelfand and A. Smith. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.
A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
S. Gershman and D. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56:1–12, 2012.
S. Gershman, M. Hoffman, and D. Blei. Nonparametric variational inference. In International Conference on Machine Learning, 2012.
Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems, 2000.
Z. Ghahramani and M. Beal. Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems, pages 507–513, 2001.
Z. Ghahramani and M. Jordan. Factorial hidden Markov models. Machine Learning, 31(1), 1997.
M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8), 2006.
P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Neural Information Processing Systems, 2012.
T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
G. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Computational Learning Theory, pages 5–13. ACM, 1993.
N. Hjort, C. Holmes, P. Müller, and S. Walker, editors. Bayesian Nonparametrics. Cambridge University Press, 2010.
M. Hoffman, D. Blei, and F. Bach. On-line learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010a.
M. Hoffman, D. Blei, and P. Cook. Bayesian nonparametric matrix factorization for recorded music. In International Conference on Machine Learning, 2010b.
A. Honkela, M. Tornio, T. Raiko, and J. Karhunen. Natural conjugate gradient in variational inference. In Neural Information Processing Systems, 2008.
T. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Massachusetts Institute of Technology, 1997.
M. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.
M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME: Journal of Basic Engineering, 82:35–45, 1960.
D. Knowles and T. Minka. Non-conjugate variational message passing for multinomial and binary regression. In Neural Information Processing Systems, 2011.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
P. Liang, M. Jordan, and D. Klein. Learning semantic correspondences with less supervision. In Association for Computational Linguistics, 2009.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
J. Maritz and T. Lwin. Empirical Bayes Methods. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1989.
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1989.
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.
T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence, 2002.
K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press, 1999.
D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828, 2009.
J. Paisley and L. Carin. Nonparametric factor analysis with beta process priors. In International Conference on Machine Learning, 2009.
J. Paisley, D. Blei, and M. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012a.
J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(2):235–272, 2012b.
J. Paisley, C. Wang, D. Blei, and M. Jordan. Nested hierarchical Dirichlet processes. arXiv preprint arXiv:1210.6738, 2012c.
G. Parisi. Statistical Field Theory. Perseus Books, 1988.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
C. Peterson and J. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1(5):995–1019, 1987.
J. Pitman. Combinatorial Stochastic Processes. Lecture Notes for St. Flour Summer School. Springer-Verlag, New York, NY, 2002.
J. Platt, E. Kıcıman, and D. Maltz. Fast variational inference for large-scale internet diagnosis. In Neural Information Processing Systems, 2008.
L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.
R. Ranganath, C. Wang, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. In International Conference on Machine Learning, 2013.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, NY, 2004.
R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887, 2008.
M. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
L. Saul and M. Jordan. Exploiting tractable substructures in intractable networks. In Neural Information Processing Systems, 1996.
L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.
J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases, 2010.
J. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley and Sons, 2003.
C. Spearman.