jmlr jmlr2013 jmlr2013-47 jmlr2013-47-reference knowledge-graph by maker-knowledge-mining

47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference


Source: pdf

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions under which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design.

Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
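As a minimal sketch of the bound the abstract refers to (the symbols m, S, phi_n and h_n are our own notation, assuming the standard latent linear model setup; the summary itself defines none of them): for a Gaussian prior N(w; mu, Sigma) and a factorising likelihood prod_n phi_n(w^T h_n), the G-KL approach fits a Gaussian q(w) = N(w; m, S) by maximising

\[
\log Z \;=\; \log \int \mathcal{N}(\mathbf{w};\boldsymbol{\mu},\boldsymbol{\Sigma}) \prod_{n=1}^{N} \phi_n(\mathbf{w}^{\top}\mathbf{h}_n)\, \mathrm{d}\mathbf{w}
\;\;\ge\;\;
\underbrace{-\,\mathrm{KL}\!\left[\mathcal{N}(\mathbf{w};\mathbf{m},\mathbf{S})\,\middle\|\,\mathcal{N}(\mathbf{w};\boldsymbol{\mu},\boldsymbol{\Sigma})\right]
\;+\; \sum_{n=1}^{N} \left\langle \log \phi_n(z)\right\rangle_{\mathcal{N}\left(z;\, \mathbf{m}^{\top}\mathbf{h}_n,\; \mathbf{h}_n^{\top}\mathbf{S}\mathbf{h}_n\right)}}_{\mathcal{B}_{\text{G-KL}}(\mathbf{m},\mathbf{S})}.
\]

Each likelihood-site expectation reduces to a one-dimensional Gaussian integral because w^T h_n is Gaussian under q, which is what keeps the bound tractable; restricting the Cholesky factor of S (for example to banded or chevron forms) is the kind of constrained covariance parameterisation the abstract describes as fast and scalable.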


reference text

D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
D. Barber and C. Bishop. Ensemble learning in Bayesian neural networks. In Neural Networks and Machine Learning, pages 215–237. Springer, 1998.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
E. Challis and D. Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[Table 4: Bayesian logistic regression results for a unit variance Gaussian prior, with parameter dimension D = 1000 and number of test points Ntst = 5000; experimental setup and metrics are described in Section 6.2 of the paper.]
R. Fergus, B. Singh, A. Hertzmann, S. Roweis, and W. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics, 2006.
J. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
M. Gibbs and D. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000.
M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11):2517–2532, 2001.
G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, 2011.
R. Herbrich. On Gaussian expectation propagation. Technical report, Microsoft Research Cambridge, research.microsoft.com/pubs/74554/EP.pdf, 2005.
A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235–3268, 2010.
T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997.
P. Jylänki, J. Vanhatalo, and A. Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12:3187–3225, 2011.
K. Ko and M. Seeger. Large scale variational Bayesian inference for structured scale mixture models. In Proceedings of the 29th International Conference on Machine Learning, 2012.
M. Kuss. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, Technische Universität Darmstadt, Darmstadt, Germany, 2006.
M. Kuss and C. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
D. MacKay and J. Oldfield. Generalization error and the number of hidden units in a multilayer perceptron. Technical report, Cambridge University, www.inference.phy.cam.ac.uk/mackay/gen.ps.gz, 1995.
B. Marlin, M. Khan, and K. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In Proceedings of the 28th International Conference on Machine Learning, 2011.
T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001.
T. Minka. Power EP. Technical report, Department of Statistics, Carnegie Mellon University, research.microsoft.com/pubs/67427/tr-2004-149.pdf, 2004.
R. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Department of Statistics and Department of Computer Science, University of Toronto, arxiv.org/abs/physics/9701026v2, 1997.
H. Nickisch. Bayesian Inference and Experimental Design for Large Generalised Linear Models. PhD thesis, Technische Universität Berlin, Berlin, Germany, 2010.
H. Nickisch. glm-ie: Generalised linear models inference and estimation toolbox. Journal of Machine Learning Research, 13:1699–1703, 2012.
H. Nickisch and C. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
H. Nickisch and M. Seeger. Convex variational Bayesian inference for large scale generalized linear models. In Proceedings of the 26th International Conference on Machine Learning, 2009.
J. Nocedal and S. Wright. Numerical Optimization. Springer, 2006.
B. Olshausen and D. Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7(2):333–339, 1996.
M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.
M. Opper and O. Winther. Expectation consistent approximate inference. Journal of Machine Learning Research, 6:2177–2204, 2005.
J. Ormerod and M. Wand. Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics, 21(1):2–17, 2012.
A. Palmer, D. Wipf, K. Kreutz-Delgado, and B. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems 20, 2006.
G. Papandreou and A. Yuille. Gaussian sampling by local perturbations. In Advances in Neural Information Processing Systems 23, 2010.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008.
C. Rasmussen and H. Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, 2010.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
R. Salakhutdinov, S. Roweis, and Z. Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of the 20th International Conference on Machine Learning, 2003.
L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.
M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In The 18th European Conference on Machine Learning, 2007.
M. Seeger. Bayesian methods for support vector machines and Gaussian processes. Master's thesis, University of Karlsruhe, 1999a.
M. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems 12, 1999b.
M. Seeger. Low rank updates for the Cholesky decomposition. Technical report, University of California at Berkeley, infoscience.epfl.ch/record/161468/files/cholupdate.pdf, 2007.
M. Seeger. Bayesian inference and optimal design in the sparse linear model. Journal of Machine Learning Research, 9:759–813, 2008.
M. Seeger. Sparse linear models: Variational approximate inference and Bayesian experimental design. Journal of Physics: Conference Series, 197(1), 2009.
M. Seeger. Gaussian covariance and scalable variational inference. In Proceedings of the 27th International Conference on Machine Learning, 2010.
M. Seeger and H. Nickisch. Compressed sensing and Bayesian experimental design. In Proceedings of the 25th International Conference on Machine Learning, pages 912–919, 2008.
M. Seeger and H. Nickisch. Large scale variational inference and experimental design for sparse generalized linear models. Technical report, Max Planck Institute for Biological Cybernetics, arxiv.org/abs/0810.0901, 2010.
M. Seeger and H. Nickisch. Fast convergent algorithms for expectation propagation approximate inference. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011a.
M. Seeger and H. Nickisch. Large scale Bayesian inference and experimental design for sparse linear models. SIAM Journal on Imaging Sciences, 4(1):166–199, 2011b.
M. Seeger, S. Gerwinn, and M. Bethge. Bayesian inference for sparse generalized linear models. In The 18th European Conference on Machine Learning, 2007.
M. Tipping. Probabilistic visualisation of high-dimensional binary data. In Advances in Neural Information Processing Systems 11, 1999.
J. Vanhatalo, P. Jylänki, and A. Vehtari. Gaussian process regression with a Student-t likelihood. In Advances in Neural Information Processing Systems 22, 2009.
D. Wipf. Sparse Bayesian learning for basis selection. IEEE Transactions on Signal Processing, 52(8):2153–2164, 2004.