acl acl2011 acl2011-188 acl2011-188-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matt Post
Abstract: In this paper, we show that local features computed from the derivations of tree substitution grammars, such as the identity of particular fragments and a count of large and small fragments, are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model.
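As a rough illustration of the feature setup described in the abstract, the sketch below builds fragment-identity features plus counts of small and large fragments from TSG derivation fragments and trains a linear classifier. It is not the paper's implementation: the toy fragment strings, the fragment_features helper, and the crude node count are illustrative assumptions, and scikit-learn's LinearSVC stands in for the LIBLINEAR toolkit cited in the references.

from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def fragment_features(fragments):
    # Local features for one sentence's derivation: the identity of each
    # fragment, plus counts of "small" (one-node) and "large" fragments.
    feats = Counter()
    for frag in fragments:
        feats["frag=" + frag] += 1
        size = frag.count("(")  # crude node count, for this sketch only
        feats["small" if size <= 1 else "large"] += 1
    return feats

# Toy training data: (derivation fragments, label), 1 = grammatical.
data = [
    (["(S (NP (PRP he)) VP)", "(VP (VBD ran))"], 1),
    (["(NP (DT the))", "(NP (DT the))", "(VP VBD)"], 0),
    (["(S NP (VP (VBZ likes) NP))", "(NP (NNS dogs))"], 1),
    (["(PP (IN of))", "(PP (IN of))", "(DT a)"], 0),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(fragment_features(frags) for frags, _ in data)
y = [label for _, label in data]

classifier = LinearSVC().fit(X, y)  # linear model, in the spirit of LIBLINEAR
print(classifier.predict(X))

In the paper the fragments come from derivations under a Bayesian-learned TSG rather than from hand-written strings, but the feature construction follows the same pattern: fragment identities plus coarse size counts, fed to a linear classifier.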
References:

Mohit Bansal and Dan Klein. 2010. Simple, accurate parsing with an all-fragments grammar. In Proc. ACL, Uppsala, Sweden, July.

Rens Bod. 1993. Using an annotated corpus as a stochastic grammar. In Proc. ACL, Columbus, Ohio, USA.

Rens Bod. 2001. What is the minimal set of fragments that achieves maximal parse accuracy? In Proc. ACL, Toulouse, France, July.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proc. ACL, Ann Arbor, Michigan, USA, June.

Eugene Charniak. 1996. Tree-bank grammars. In Proc. of the National Conference on Artificial Intelligence.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proc. NAACL, Seattle, Washington, USA, April–May.

Colin Cherry and Chris Quirk. 2008. Discriminative, syntactic language modeling through latent SVMs. In Proc. AMTA, Waikiki, Hawaii, USA, October.

Trevor Cohn, Sharon Goldwater, and Phil Blunsom. 2009. Inducing compact but accurate tree-substitution grammars. In Proc. NAACL, Boulder, Colorado, USA, June.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Jennifer Foster and Øistein E. Andersen. 2009. GenERRate: Generating errors for use in grammatical error detection. In Proc. of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, pages 82–90.

Jennifer Foster and Carl Vogel. 2004. Good reasons for noting bad grammar: Constructing a corpus of ungrammatical language. In Pre-Proceedings of the International Conference on Linguistic Evidence: Empirical, Theoretical and Computational Perspectives.

Joshua Goodman. 1996. Efficient algorithms for parsing the DOP model. In Proc. EMNLP, Philadelphia, Pennsylvania, USA, May.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. ACL, Columbus, Ohio, USA, June.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.

Aravind K. Joshi and Yves Schabes. 1997. Tree-adjoining grammars. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages: Beyond Words, volume 3, pages 71–122.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. GLEU: Automatic evaluation of sentence-level fluency. In Proc. ACL, Prague, Czech Republic, June.

Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, et al. 2004. A smorgasbord of features for statistical machine translation. In Proc. NAACL.

Daisuke Okanohara and Jun’ichi Tsujii. 2007. A discriminative language model with pseudo-negative samples. In Proc. ACL, Prague, Czech Republic, June.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proc. COLING/ACL, Sydney, Australia, July.

Matt Post and Daniel Gildea. 2009a. Bayesian learning of a tree substitution grammar. In Proc. ACL (short paper track), Suntec, Singapore, August.

Matt Post and Daniel Gildea. 2009b. Language modeling with tree substitution grammars. In NIPS Workshop on Grammar Induction, Representation of Language, and Language Learning, Whistler, British Columbia.

Matt Post. 2010. Syntax-based Language Models for Statistical Machine Translation. Ph.D. thesis, University of Rochester.

Remko Scha. 1990. Taaltheorie en taaltechnologie; competence en performance [Language theory and language technology; competence and performance]. In R. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22, Almere, the Netherlands.

Andreas Stolcke. 2002. SRILM: An extensible language modeling toolkit. In Proc. International Conference on Spoken Language Processing.

Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, and Chin-Yew Lin. 2007. Detecting erroneous sentences using automatically mined sequential patterns. In Proc. ACL, Prague, Czech Republic, June.

Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2009. Judging grammaticality: Experiments in sentence classification. CALICO Journal, 26(3):474–490.

Sze-Meng Jojo Wong and Mark Dras. 2010. Parser features for sentence grammaticality classification. In Proc. Australasian Language Technology Association Workshop, Melbourne, Australia, December.

Andreas Zollmann and Khalil Sima’an. 2005. A consistent and efficient estimator for Data-Oriented Parsing. Journal of Automata, Languages and Combinatorics, 10(2/3):367–388.

Willem Zuidema. 2007. Parsimonious Data-Oriented Parsing. In Proc. EMNLP, Prague, Czech Republic, June.