
118 jmlr-2012-Variational Multinomial Logit Gaussian Process


Source: pdf

Author: Kian Ming A. Chai

Abstract: A Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks. One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes. A usual likelihood function for this is the multinomial logistic likelihood function. However, exact inference with this model has proved to be difficult because high-dimensional integrations are required. In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters. Experiments have shown our approximation to be tight. In addition, we provide data-independent bounds on the marginal likelihood of the model, one of which is shown to be much tighter than the existing variational mean-field bound in the experiments. We also derive a proper lower bound on the predictive likelihood that involves the Kullback-Leibler divergence between the approximating and the true posterior. We combine our approach with a recently proposed sparse approximation to give a variational sparse approximation to the Gaussian process multi-class model. We also derive criteria which can be used to select the inducing set, and we show the effectiveness of these criteria over random selection in an experiment.

Keywords: Gaussian process, probabilistic classification, multinomial logistic, variational approximation, sparse approximation
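
For orientation, the multinomial logistic (softmax) likelihood referred to in the abstract has the standard form below, and the second display is the generic variational lower bound that such approximations optimize. This is only a sketch of the standard definitions, not the paper's specific bounds: the notation (f_c for the latent function value of class c, K for the number of classes, q for the approximating posterior) is introduced here for illustration.

\[
  p(y = c \mid \mathbf{f}) = \frac{\exp(f_c)}{\sum_{c'=1}^{K} \exp(f_{c'})}, \qquad c = 1, \dots, K
\]
\[
  \log p(\mathbf{y}) \;\ge\; \mathbb{E}_{q(\mathbf{f})}\bigl[\log p(\mathbf{y} \mid \mathbf{f})\bigr] \;-\; \mathrm{KL}\bigl(q(\mathbf{f}) \,\|\, p(\mathbf{f})\bigr)
\]

The expectation of the log-softmax term under a Gaussian q(f) has no closed form, which is the high-dimensional integration difficulty the abstract refers to.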


reference text

T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra and Its Applications, 26:203–241, 1979.
D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
A. Banerjee. On Bayesian bounds. In International Conference on Machine Learning, 2006.
E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 153–160. Curran Associates, Inc., 2008.
E. J. Bredensteiner and K. P. Bennett. Multicategory classification by support vector machines. Computational Optimization and Applications, 12:53–79, January 1999.
J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs with relationships to statistical pattern recognition. In Neuro-computing: Algorithms, Architectures and Applications, NATO ASI Series in Systems and Computer Science. Springer, 1989.
E. Challis and D. Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In D. Dunson and M. Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: Workshop and Conference Proceedings Series, 2011.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668, 2002.
A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
J. Geweke. Contemporary Bayesian Econometrics and Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New Jersey, 2005.
J. Geweke, M. P. Keane, and D. Runkle. Alternative computational approaches to inference in the multinomial probit model. The Review of Economics and Statistics, 76(4):609–632, 1994.
Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In D. Saad and M. Opper, editors, Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000a.
Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 449–455. MIT Press, Cambridge, MA, 2000b.
M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, Department of Physics, 1997.
M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications, 5(2):168–179, 2002.
Y. Guermeur and E. Monfrini. A quadratic loss multi-class SVM for which a radius-margin bound applies. Informatica, 22(1):73–96, 2011.
P. Hall. Chi squared approximations to the distribution of a sum of independent random variables. The Annals of Probability, 11(4):1028–1036, 1983.
R. Henao and O. Winther. PASS-GP: Predictive active set selection for Gaussian processes. In IEEE International Workshop on Machine Learning for Signal Processing, pages 148–153, 2010.
R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
H.-C. Kim and Z. Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:1948–1959, 2006.
D. A. Knowles and T. P. Minka. Non-conjugate variational message passing for multinomial and binary regression. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 1701–1709, 2011.
S. J. Koopman, N. Shephard, and D. Creal. Testing the assumptions behind importance sampling. Journal of Econometrics, 149(1):2–11, 2009.
S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications. Cambridge University Press, Cambridge, UK, 2004.
F. Lauer and Y. Guermeur. MSVMpack: a multi-class support vector machine package. Journal of Machine Learning Research, 12:2269–2272, 2011. URL http://www.loria.fr/~lauer/MSVMpack.
N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, volume 15, pages 609–616, Cambridge, MA, 2003. MIT Press.
M. Lázaro-Gredilla and A. Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22, pages 1087–1095. Curran Associates, Inc., 2009.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.
D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, 1974.
T. P. Minka. Beyond Newton's method, 2002. URL http://research.microsoft.com/en-us/um/people/minka/papers/minka-newton.pdf.
R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, 1996.
R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics, volume 6, pages 475–501. Oxford University Press, 1998.
R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
J. E. Potter. Matrix quadratic solutions. SIAM Journal on Applied Mathematics, 14(3):496–501, May 1966.
S. Puntanen and G. P. H. Styan. Historical introduction: Issai Schur and the early development of the Schur complement. In F. Zhang, editor, The Schur Complement and Its Applications, Numerical Methods and Algorithms, pages 1–16. Springer, 2005.
J. Quiñonero-Candela, C. E. Rasmussen, and C. K. I. Williams. Approximation methods for Gaussian process regression. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 203–223. MIT Press, 2007.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.
R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
M. Seeger and M. I. Jordan. Sparse Gaussian process classification with multiple classes. Technical report, University of California at Berkeley, Department of Statistics, 2004.
E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18, pages 1257–1264, Cambridge, MA, 2006. MIT Press.
M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR: Workshop and Conference Proceedings Series, pages 567–574, 2009a.
M. Titsias. Variational model selection for sparse Gaussian process regression. Technical report, University of Manchester, School of Computer Science, 2009b. URL http://www.cs.man.ac.uk/~mtitsias/papers/sparseGPv2.pdf.
V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
J. M. Ver Hoef and R. P. Barry. Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2):275–294, 1998.
G. S. Watson. Spectral decomposition of the covariance matrix of a multinomial. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):289–291, 1996.
J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the 7th European Symposium on Artificial Neural Networks, 1999.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 682–688, Cambridge, MA, 2001. MIT Press.
C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.