295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction


Source: pdf

Author: Siwei Lyu

Abstract: When used to learn high-dimensional parametric probabilistic models, classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active development of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods appear unrelated, and a unified framework for understanding them has been lacking. In this work, based on an information-geometric view of parametric learning, we introduce a general non-ML learning principle termed minimum KL contraction, in which we seek parameters that minimize the contraction of the KL divergence between the data and model distributions after both are transformed by a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operator.
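
A minimal sketch of this objective, with notation assumed here rather than quoted from the paper: let p be the data distribution, q_\theta the parametric model, and \Phi a KL contraction operator, i.e., a mapping of distributions that never increases their KL divergence. The learning principle then amounts to

\[
  \mathcal{J}(\theta) \;=\; \mathrm{KL}\bigl(p \,\|\, q_\theta\bigr) \;-\; \mathrm{KL}\bigl(\Phi\{p\} \,\|\, \Phi\{q_\theta\}\bigr) \;\ge\; 0,
  \qquad
  \hat{\theta} \;=\; \arg\min_{\theta}\, \mathcal{J}(\theta).
\]

The contraction property makes \mathcal{J}(\theta) nonnegative, and for a strictly contracting \Phi it is zero only when q_\theta matches the data distribution; the different non-ML objectives listed above correspond to different choices of \Phi.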


reference text

[1] A. U. Asuncion, Q. Liu, A. T. Ihler, and P. Smyth. Learning with blocks: Composite likelihood and contrastive divergence. In AISTATS, 2010.

[2] L. Bahl, P. Brown, P. de Souza, and R. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP, 1986.

[3] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.

[4] D. Brook. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbor systems. Biometrika, 51(3/4):481–483, 1964.

[5] M. A. Carreira-Perpiñán and G. E. Hinton. On contrastive divergence learning. In AISTATS, 2005.

[6] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.

[7] D. R. Cox. Partial likelihood. Biometrika, 62(2):269–276, 1975.

[8] I. Csiszár and P. C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

[9] I. J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, 1965.

[10] M. Gutmann and J. Hirayama. Bregman divergence as general framework to estimate unnormalized statistical models. In Conference on Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain, 2011.

[11] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.

[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

[13] P. J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.

[14] A. Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

[15] A. Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.

[16] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51:2499–2512, 2007.

[17] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In NIPS, 1998.

[18] R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.

[19] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, 1989.

[20] S. L. Lauritzen. Statistical manifolds. In Differential Geometry in Statistical Inference, pages 163–216, 1987.

[21] L. Le Cam. Maximum likelihood: an introduction. ISI Review, 58(2):153–171, 1990.

[22] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. Tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

[23] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In International Conference on Machine Learning, 2008.

[24] B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):22–39, 1988.

[25] S. Lyu. Interpretation and generalization of score matching. In UAI, 2009.

[26] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In Association for the Advancement of Artificial Intelligence (AAAI), 2006.

[27] M. Pihlaja, M. Gutmann, and A. Hyvärinen. A family of computationally efficient and simple estimators for unnormalized statistical models. In UAI, 2010.

[28] J. Sohl-Dickstein, P. Battaglino, and M. DeWeese. Minimum probability flow learning. In ICML, 2011.

[29] D. Strauss and M. Ikeda. Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85:204–212, 1990.

[30] C. Varin and P. Vidoni. A note on composite likelihood inference and model selection. Biometrika, 92(3):519–528, 2005.

[31] D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.