nips nips2009 nips2009-197 nips2009-197-reference knowledge-graph by maker-knowledge-mining

197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Source: pdf

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming ﬁxed tic-tac-toe pruning. 1

reference text

[1] J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. JASA, 1993.

[2] C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 2006.

[3] P. Blunsom, T. Cohn, C. Dyer, and M. Osborne. A Gibbs sampler for phrasal synchronous grammar induction. In EMNLP, 2009. 8

[4] P. Blunsom, T. Cohn, and M. Osborne. A discriminative latent variable model for statistical machine translation. In ACL-HLT, 2008.

[5] P. Blunsom and M. Osborne. Probabilistic inference for machine translation. In EMNLP, 2008.

[6] D. Burkett and D. Klein. Two languages are better than one (for syntactic parsing). In EMNLP ’08, 2008.

[7] E. Charniak and M. Johnson. Coarse-to-ﬁne n-best parsing and maxent discriminative reranking. In ACL, 2005.

[8] D. Chiang. A hierarchical phrase-based model for statistical machine translation. In ACL, 2005.

[9] S. Clark and J. R. Curran. Parsing the WSJ using CCG and log-linear models. In ACL, 2004.

[10] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, UPenn, 1999.

[11] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 1987.

[12] J. Finkel, A. Kleeman, and C. Manning. Efﬁcient, feature-based, conditional random ﬁeld parsing. In ACL, 2008.

[13] J. R. Finkel, C. D. Manning, and A. Y. Ng. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In EMNLP, 2006.

[14] M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In HLT-NAACL, 2004.

[15] A. Globerson and T. Jaakkola. Approximate inference using planar graph decomposition. In NIPS, 2006.

[16] J. Goodman. Parsing algorithms and metrics. In ACL, 1996.

[17] J. Goodman. Global thresholding and multiple-pass parsing. In EMNLP, 1997.

[18] M. Johnson. Joint and conditional estimation of tagging and parsing models. In ACL, 2001.

[19] M. Johnson, T. L. Grifﬁths, and S. Goldwater. Bayesian inference for PCFGs via Markov Chain Monte Carlo. In ACL, 2007.

[20] P. Liang, M. I. Jordan, and B. Taskar. A permutation-augmented sampler for Dirichlet process mixture models. In ICML, 2007.

[21] P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In NAACL, 2006.

[22] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge U. Press, 2003.

[23] I. W. McKeague and W. Wefelmeyer. Markov chain Monte Carlo and Rao-Blackwellization. Statistical Planning and Inference, 2000.

[24] R. Neal. Slice sampling. Annals of Statistics, 2000.

[25] S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In HLT-NAACL ’07, 2007.

[26] S. Petrov and D. Klein. Discriminative log-linear grammars with latent variables. In NIPS, 2008.

[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.

[28] D. Smith and N. Smith. Bilingual parsing with factored estimation: Using english to parse korean. In EMNLP ’04, 2004.

[29] D. A. Smith and J. Eisner. Dependency parsing by belief propagation. In EMNLP, 2008.

[30] R. H. Swendsen and J. S. Wang. Nonuniversal critical dynamics in MC simulations. Rev. Lett, 1987.

[31] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. JASA, 1987.

[32] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In EMNLP, 2004.

[33] L. Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 1994.

[34] I. Titov and J. Henderson. Loss minimization in parse reranking. In EMNLP, 2006.

[35] J. Turian, B. Wellington, and I. D. Melamed. Scalable discriminative learning for natural language parsing and translation. In NIPS, 2006.

[36] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the inﬁnite hidden Markov model. In ICML, 2008.

[37] S. G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 2007.

[38] J. Wolfe, A. Haghighi, and D. Klein. Fully distributed em for very large datasets. In ICML ’08, 2008.

[39] N. Xue, F-D Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In COLING, 2002.

[40] H. Zhang and D. Gildea. Stochastic lexicalized inversion transduction grammar for alignment. In ACL, 2005.

[41] H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In ACL, 2008. 9