
100 nips-2012-Discriminative Learning of Sum-Product Networks


Source: pdf

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset.
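The abstract's central algorithmic idea, replacing marginal inference with MPE (max) inference when computing the conditional gradient, can be illustrated on a toy network. The Python sketch below is an illustrative assumption, not the authors' implementation: the two-class SPN over a single binary feature, the Leaf/Product/MaxSum classes, and the mistake-driven, renormalized weight update (reward the MPE parse with the true label clamped, penalize the MPE parse with the label left unobserved) are all simplifications chosen to show the flavor of "hard" gradient descent.

# Minimal sketch of a "hard" discriminative update on a toy SPN (assumed structure).
import math

class Leaf:
    """Indicator leaf for (variable, value); unobserved variables are marginalized out."""
    def __init__(self, var, val):
        self.var, self.val = var, val
    def value(self, evidence):                     # log-space evaluation
        v = evidence.get(self.var)
        return 0.0 if v is None or v == self.val else -math.inf

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, evidence):
        return sum(c.value(evidence) for c in self.children)

class MaxSum:
    """Sum node evaluated with max instead of sum, i.e. MPE ("hard") inference."""
    def __init__(self, children, weights):
        self.children, self.weights = children, list(weights)
        self.winner = 0
    def value(self, evidence):
        scores = [math.log(w) + c.value(evidence)
                  for w, c in zip(self.weights, self.children)]
        self.winner = max(range(len(scores)), key=scores.__getitem__)
        return scores[self.winner]

def mpe_edges(node, edges):
    """Collect (sum node, winning child index) pairs along the MPE parse tree."""
    if isinstance(node, MaxSum):
        edges.append((node, node.winner))
        mpe_edges(node.children[node.winner], edges)
    elif isinstance(node, Product):
        for c in node.children:
            mpe_edges(c, edges)
    return edges

def hard_update(root, x, y, lr=0.1):
    """Perceptron-flavored sketch: reward the MPE parse with the true label clamped,
    penalize the MPE parse found with the label left unobserved, then renormalize."""
    root.value({'X': x, 'Y': y})
    clamped = mpe_edges(root, [])                  # MPE with Y = true label
    root.value({'X': x, 'Y': None})
    free = mpe_edges(root, [])                     # MPE with Y unobserved
    for node, i in clamped:
        node.weights[i] += lr
    for node, i in free:
        node.weights[i] = max(node.weights[i] - lr, 1e-6)
    for node, _ in clamped + free:                 # keep touched sum nodes normalized
        z = sum(node.weights)
        node.weights = [w / z for w in node.weights]

# Toy SPN over a binary feature X and class Y; root child k models class Y = k.
x1, x0, y1, y0 = Leaf('X', 1), Leaf('X', 0), Leaf('Y', 1), Leaf('Y', 0)
root = MaxSum([Product([y0, MaxSum([x1, x0], [0.5, 0.5])]),
               Product([y1, MaxSum([x1, x0], [0.5, 0.5])])], [0.5, 0.5])

for x, y in [(1, 1), (0, 0), (1, 1), (0, 0)] * 3:
    hard_update(root, x, y)

root.value({'X': 1, 'Y': None})
print("predicted class for X=1:", root.winner)     # index of the winning class branch

On this toy data the class-1 branch learns to favor X=1 and the class-0 branch X=0, so the label-free MPE pass recovers the correct class; the paper's actual algorithm applies the analogous hard updates to deep SPNs over image features.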


reference text

[1] M. Amer and S. Todorovic. Sum-product networks for modeling activities with stochastic structure. CVPR, 2012.

[2] F. Bach and M.I. Jordan. Thin junction trees. Advances in Neural Information Processing Systems, 14:569–576, 2002.

[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[4] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1729–1736. IEEE, 2011.

[5] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. Advances in Neural Information Processing Systems, 2010.

[6] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. ISER, 2012.

[7] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 115–123, 1996.

[8] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.

[9] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

[10] A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[11] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, volume 8, page 10, 2011.

[12] A. Coates and A.Y. Ng. Selecting receptive fields in deep networks. NIPS, 2011.

[13] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8, Philadelphia, PA, 2002. ACL.

[14] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50:280– 305, 2003.

[15] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[16] O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In Proceedings of the 25th Conference on Neural Information Processing Systems, 2011.

[17] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[18] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[19] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

[20] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.

[21] A. Kulesza and F. Pereira. Structured learning with approximate inference. Advances in Neural Information Processing Systems, 20:785–792, 2007.

[22] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, Williamstown, MA, 2001. Morgan Kaufmann.

[23] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 337–346, 2011.

[24] M.A. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2551–2558. IEEE, 2010.

[25] J. Salojärvi, K. Puolamäki, and S. Kaski. Expectation maximization algorithms for conditional likelihoods. In Proceedings of the 22nd International Conference on Machine Learning, pages 752–759. ACM, 2005.