nips nips2012 nips2012-204 nips2012-204-reference knowledge-graph by maker-knowledge-mining

204 nips-2012-MAP Inference in Chains using Column Generation


Source: pdf

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm also extends to new techniques for approximate inference, to faster 0/1 loss oracles, and to new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs.
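For context, the exact baseline the abstract refers to is the standard Viterbi (max-product) recursion, whose per-position maximization over all label pairs gives the quadratic dependence on the domain size. Below is a minimal sketch of that recursion, not the paper's column-generation method; the array names `unary` and `pairwise` are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Exact MAP decoding of a linear chain by max-product dynamic programming.

    unary:    (n, k) array of per-position label scores.
    pairwise: (k, k) array of transition scores between adjacent labels.

    Each step maximizes over all k previous labels for each of the k current
    labels, so the total cost is O(n * k^2) -- the quadratic dependence on
    domain size that motivates the paper's column-generation approach.
    """
    n, k = unary.shape
    score = unary[0].copy()                # best score of any path ending in each label
    backptr = np.zeros((n, k), dtype=int)

    for t in range(1, n):
        # candidate[i, j]: score of extending a path ending in label i with label j
        candidate = score[:, None] + pairwise + unary[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    # Follow back-pointers to recover the arg-max label sequence.
    labels = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        labels.append(int(backptr[t, labels[-1]]))
    return labels[::-1]
```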


reference text

[1] David Sontag and Tommi Jaakkola. Tree block coordinate descent for MAP in graphical models. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 8, pages 544–551. JMLR: W&CP, 2009.

[2] C. Pal, C. Sutton, and A. McCallum. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 5, pages V–V. IEEE, 2006.

[3] A. Kulesza and F. Pereira. Structured learning with approximate inference. Advances in Neural Information Processing Systems, 20:785–792, 2007.

[4] L. Shen, G. Satta, and A. Joshi. Guided learning for bidirectional sequence classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), page 760, 2007.

[5] D. Tarlow, D. Batra, P. Kohli, and V. Kolmogorov. Dynamic tree block coordinate ascent. In ICML, pages 113–120, 2011.

[6] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation: an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.

[7] M. Lübbecke and J. Desrosiers. Selected topics in column generation. Operations Research, 53:1007–1023, 2004.

[8] D. Bertsimas and J. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.

[9] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[10] S. Riedel, D. Smith, and A. McCallum. Parse, price and cut: delayed column and row generation for graph-based parsers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’12), 2012.

[11] R. Esposito and D.P. Radicioni. CarpeDiem: an algorithm for the fast evaluation of SSL classifiers. In Proceedings of the 24th International Conference on Machine Learning, pages 257–264. ACM, 2007.

[12] N. Kaji, Y. Fujiwara, N. Yoshinaga, and M. Kitsuregawa. Efficient staggered decoding for sequence labeling. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 485–494. Association for Computational Linguistics, 2010.

[13] C. Raphael. Coarse-to-fine dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1379–1390, 2001.

[14] J. McAuley and T. Caetano. Exploiting data-independence for fast belief-propagation. In International Conference on Machine Learning (ICML), pages 767–774, 2010.

[15] J.J. McAuley and T.S. Caetano. Faster algorithms for max-product message-passing. Journal of Machine Learning Research, 12:1349–1388, 2011.

[16] A.M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1–11. Association for Computational Linguistics, 2010.

[17] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems, 2007.

[18] S. Riedel. Improving the accuracy and efficiency of MAP inference for Markov logic. In Proceedings of UAI, pages 468–475, 2008.

[19] R. Kipp Martin, Ronald L. Rardin, and Brian A. Campbell. Polyhedral characterization of discrete dynamic programming. Operations Research, 38(1):127–138, 1990.

[20] R.K. Ahuja, T.L. Magnanti, J.B. Orlin, and K. Weihe. Network flows: theory, algorithms and applications. ZOR-Methods and Models of Operations Research, 41(3):252–254, 1995.

[21] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[22] M. Wick, K. Rohanimanesh, K. Bellare, A. Culotta, and A. McCallum. SampleRank: training factor graphs with atomic gradients. In Proceedings of ICML, 2011.