91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Source: pdf

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.

Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
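As a rough illustration of what "constraints hold in expectation" can mean in the abstract above, the following minimal sketch projects a small discrete posterior p onto the set of distributions q with E_q[phi] <= b by minimizing KL(q || p). It relies on the standard result that such a projection has the form q(y) proportional to p(y) exp(-lam * phi(y)) with lam >= 0. The function names (`tilt`, `expectation`, `kl_project`), the single scalar feature, and the bisection search are assumptions made for this toy example; this is not the authors' implementation.

```python
import math


def tilt(p, phi, lam):
    """Return q with q(y) proportional to p(y) * exp(-lam * phi(y)), for lam >= 0."""
    # Shift by the smallest feature value on the support so every exponent is <= 0.
    shift = min(f for p_y, f in zip(p, phi) if p_y > 0)
    weights = [p_y * math.exp(-lam * (f - shift)) if p_y > 0 else 0.0
               for p_y, f in zip(p, phi)]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(q, phi):
    """Expected feature value under q."""
    return sum(q_y * f for q_y, f in zip(q, phi))


def kl_project(p, phi, b, tol=1e-10, max_iter=200):
    """Toy KL projection of p onto {q : E_q[phi] <= b} (hypothetical helper).

    Returns (q, lam), where lam is the dual variable of the constraint.
    """
    if expectation(p, phi) <= b:
        return list(p), 0.0  # constraint already satisfied, nothing to do
    if min(f for p_y, f in zip(p, phi) if p_y > 0) >= b:
        raise ValueError("constraint cannot be met strictly on the support of p")
    # E_q[phi] decreases monotonically in lam, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while expectation(tilt(p, phi, hi), phi) > b:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if expectation(tilt(p, phi, mid), phi) > b:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilt(p, phi, hi), hi


# Model posterior over three outcomes; the feature fires only on the first one.
p = [0.5, 0.3, 0.2]
phi = [1.0, 0.0, 0.0]
q, lam = kl_project(p, phi, b=0.2)
```

In this example the projection moves mass off the first outcome until its probability is 0.2 and rescales the other two proportionally, giving q close to [0.2, 0.48, 0.32] with lam close to ln 4. The unconstrained model p is left untouched; only the auxiliary distribution q is forced to satisfy the constraint.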

reference text

A. Abeillé. Treebanks: Building and Using Parsed Corpora. Springer, 2003.
S. Afonso, E. Bick, R. Haber, and D. Santos. Floresta Sinta(c)tica: a treebank for Portuguese. In Proc. LREC, 2002.
Y. Altun, M. Johnson, and T. Hofmann. Investigating loss functions and optimization methods for discriminative learning of label sequences. In Proc. EMNLP, 2003.
R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró. Freeling 1.3: Syntactic and semantic services in an open-source nlp library. In Proc. LREC, 2006.
M. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In Proc. COLT, 2005.
C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proc. ACL, 2005.
K. Bellare, G. Druck, and A. McCallum. Alternating projections for learning with expectation constraints. In Proc. UAI, 2009.
D. P. Bertsekas. Nonlinear Programming: 2nd Edition. Athena Scientific, 1999.
J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proc. EMNLP, 2006.
J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proc. ACL, 2007.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. COLT, 1998.
U. Brefeld, C. Büscher, and T. Scheffer. Multi-view hidden markov perceptrons. In Proc. LWA, 2005.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, M. J. Goldsmith, J. Hajic, R. L. Mercer, and S. Mohanty. But dictionaries are data too. In Proc. HLT, 1993.
P. F. Brown, S. Della Pietra, V. J. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1994.
C. Callison-Burch. Paraphrasing and Translation. PhD thesis, University of Edinburgh, 2007.
C. Callison-Burch. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proc. EMNLP, 2008.
A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka Jr., and T. M. Mitchell. Coupled Semi-Supervised Learning for Information Extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM), 2010.
M. Chang, L. Ratinov, and D. Roth. Guiding semi-supervision with constraint-driven learning. In Proc. ACL, 2007.
M.W. Chang, L. Ratinov, N. Rizzolo, and D. Roth. Learning and inference with constraints. In Proceedings of the National Conference on Artificial Intelligence (AAAI). AAAI, 2008.
C. Chelba, D. Engle, F. Jelinek, V. Jimenez, S. Khudanpur, L. Mangu, H. Printz, E. Ristad, R. Rosenfeld, A. Stolcke, and D. Wu. Structure and performance of a dependency language model. In Proc. Eurospeech, 1997.
D. Chiang, A. Lopez, N. Madnani, C. Monz, P. Resnik, and M. Subotin. The hiero machine translation system: extensions, evaluation, and analysis. In Proc. HLT-EMNLP, 2005.
M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.
M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proc. SIGDAT-EMNLP, 1999.
H. Daumé III. Cross-task knowledge-constrained self training. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Royal Statistical Society, Ser. B, 39(1):1–38, 1977.
G. Druck, G. Mann, and A. McCallum. Semi-supervised learning of dependency parsers using generalized expectation criteria. In Proc. ACL-IJCNLP, 2009.
J. Eisner. Three new probabilistic models for dependency parsing: an exploration. In Proc. CoLing, 1996.
H. Fox. Phrasal cohesion and statistical machine translation. In Proc. EMNLP, 2002.
M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In Proc. HLT-NAACL, 2004.
K. Ganchev, J. Graça, J. Blitzer, and B. Taskar. Multi-view learning over structured and nonidentical outputs. In Proc. UAI, 2008a.
K. Ganchev, J. Graça, and B. Taskar. Better alignments = better translations? In Proc. ACL, 2008b.
K. Ganchev, J. Gillenwater, and B. Taskar. Dependency grammar induction via bitext projection constraints. In Proc. ACL-IJCNLP, 2009.
J. Gao and M. Johnson. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proc. EMNLP, 2008.
S. Goldwater and T. Griffiths. A fully bayesian approach to unsupervised part-of-speech tagging. In Proc. ACL, 2007.
J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In Proc. NIPS, 2007.
J. Graça, K. Ganchev, F. Pereira, and B. Taskar. Parameter vs. posterior sparsity in latent variable models. In Proc. NIPS, 2009a.
J. Graça, K. Ganchev, and B. Taskar. Postcat - posterior constrained alignment toolkit. In The Third Machine Translation Marathon, 2009b.
J. Graça, K. Ganchev, and B. Taskar. Learning tractable word alignment models with complex constraints. Computational Linguistics, 36, September 2010.
A. Haghighi and D. Klein. Prototype-driven learning for sequence models. In Proc. NAACL, 2006.
A. Haghighi, A. Ng, and C. Manning. Robust textual inference via graph matching. In Proc. EMNLP, 2005.
R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11:11–311, 2005.
M. Johnson. Why doesn’t EM find good HMM POS-taggers. In Proc. EMNLP-CoNLL, 2007.
T. Kailath. The divergence and bhattacharyya distance measures in signal selection. IEEE Transactions on Communications, 15(1):52–60, 2 1967. ISSN 0096-2244.
S. Kakade and D. Foster. Multi-view regression via canonical correlation analysis. In Proc. COLT, 2007.
D. Klein and C. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. ACL, 2004.
P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005.
P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. NAACL, 2003.
S. Lee and K. Choi. Reestimation and best-first parsing algorithm for probabilistic dependency grammar. In Proc. WVLC-5, 1997.
Z. Li and J. Eisner. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 40–51, Singapore, 2009.
P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In Proc. HLT-NAACL, 2006.
P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In Proc. ICML, 2009.
G. S. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proc. ICML, 2007.
G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proc. ACL, 2008.
M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330, 1993.
E. Matusov, R. Zens, and H. Ney. Symmetric word alignments for statistical machine translation. In Proc. COLING, 2004.
E. Matusov, N. Ueffing, and H. Ney. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proc. EACL, 2006.
R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In Proc. ACL, 2005.
R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer, 1998.
J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. The CoNLL 2007 shared task on dependency parsing. In Proc. EMNLP-CoNLL, 2007.
F. J. Och and H. Ney. Improved statistical alignment models. In Proc. ACL, 2000.
F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003. ISSN 0891-2017.
A. Pauls, J. Denero, and D. Klein. Consensus training for consensus decoding in machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1418–1427, Singapore, 2009. Association for Computational Linguistics.
N. Quadrianto, J. Petterson, and A. Smola. Distribution matching for transduction. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1500–1508. MIT Press, 2009.
C. Quirk, A. Menezes, and C. Cherry. Dependency treelet translation: syntactically informed phrasal smt. In Proc. ACL, 2005.
M. Rogati, S. McCarley, and Y. Yang. Unsupervised learning of arabic stemming using a parallel corpus. In Proc. ACL, 2003.
D. Rosenberg and P. Bartlett. The rademacher complexity of co-regularized kernel classes. In Proc. AI Stats, 2007.
E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL and LLL, 2000.
E. F. Tjong Kim Sang and F. De Meulder. Introduction to the conll-2003 shared task: language-independent named entity recognition. In Proc. HLT-NAACL, 2003.
L. Shen, J. Xu, and R. Weischedel. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proc. ACL, 2008.
K. Simov, P. Osenova, M. Slavcheva, S. Kolkovska, E. Balabanova, D. Doikoff, K. Ivanova, A. Simov, E. Simov, and M. Kouylekov. Building a linguistically interpreted corpus of bulgarian: the bultreebank. In Proc. LREC, 2002.
V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proc. ICML, 2005.
A. Smith, T. Cohn, and M. Osborne. Logarithmic opinion pools for conditional random fields. In Proc. ACL, 2005.
B. Snyder and R. Barzilay. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL, 2008.
B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay. Adding more languages improves unsupervised multilingual part-of-speech tagging: a bayesian non-parametric approach. In Proc. NAACL, 2009.
J. Tiedemann. Building a multilingual parallel subtitle corpus. In Proc. CLIN, 2007.
K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. HLT-NAACL, 2003.
P. Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27–44, 2004.
Y. Tsuruoka and J. Tsujii. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proc. HLT-EMNLP, 2005.
L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8:189–201, 1979.
S. Vogel, H. Ney, and C. Tillmann. Hmm-based word alignment in statistical translation. In Proc. COLING, 1996.
H. Yamada and Y. Matsumoto. Statistical dependency analysis with support vector machines. In Proc. IWPT, 2003.
D. Yarowsky and G. Ngai. Inducing multilingual pos taggers and np bracketers via robust projection across aligned corpora. In Proc. NAACL, 2001.