76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

Source: pdf

Author: Shane Bergsma; Emily Pitler; Dekang Lin

Abstract: In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.

reference text

Nir Ailon and Mehryar Mohri. 2008. An efficient reduction of ranking to classification. In COLT.
Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In ACL.
Cory Barr, Rosie Jones, and Moira Regelson. 2008. The linguistic structure of English web-search queries. In EMNLP.
Shane Bergsma, Dekang Lin, and Randy Goebel. 2009. Web-scale N-gram models for lexical disambiguation. In IJCAI.
John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL.
Thorsten Brants and Alex Franz. 2006. The Google Web 1T 5-gram Corpus Version 1.1. LDC2006T13.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In EMNLP.
Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In ANLP.
Andrew Carlson, Tom M. Mitchell, and Ian Fette. 2008. Data analysis project: Leveraging massive textual corpora using n-gram statistics. Technical Report CMU-ML-08-107.
Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In EMNLP-CoNLL.
Hal Daumé III. 2007. Frustratingly easy domain adaptation. In ACL.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9.
Dan Gildea. 2001. Corpus variation and parser performance. In EMNLP.
Andrew R. Golding and Dan Roth. 1999. A Winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107–130.
Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In KDD.
Frank Keller and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484.
Adam Kilgarriff and Gregory Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics, 29(3):333–347.
Seth Kulick, Ann Bies, Mark Liberman, Mark Mandel, Ryan McDonald, Martha Palmer, Andrew Schein, Lyle Ungar, Scott Winters, and Pete White. 2004. Integrated annotation for biomedical information extraction. In BioLINK 2004: Linking Biological Literature, Ontologies and Databases.
Mirella Lapata and Frank Keller. 2005. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1–31.
Mark Lauer. 1995a. Corpus statistics meet the noun compound: Some empirical results. In ACL.
Mark Lauer. 1995b. Designing Statistical Language Learners: Experiments on Noun Compounds. Ph.D. thesis, Macquarie University.
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, and Sushant Narsale. 2010. New tools for web-scale N-grams. In LREC.
Robert Malouf. 2000. The order of prenominal adjectives in natural language generation. In ACL.
Mitchell P. Marcus, Beatrice Santorini, and Mary Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Mitchell P. Marcus. 1980. Theory of Syntactic Recognition for Natural Languages. MIT Press, Cambridge, MA, USA.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and self-training for parser adaptation. In COLING-ACL.
Margaret Mitchell. 2009. Class-based ordering of prenominal modifiers. In 12th European Workshop on Natural Language Generation.
Natalia N. Modjeska, Katja Markert, and Malvina Nissim. 2003. Using the Web in machine learning for other-anaphora resolution. In EMNLP.
Preslav Nakov and Marti Hearst. 2005. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In CoNLL.
Preslav Ivanov Nakov. 2007. Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Ph.D. thesis, University of California, Berkeley.
Xuan-Hieu Phan. 2006. CRFTagger: CRF English POS Tagger. crftagger.sourceforge.net.
Laura Rimell and Stephen Clark. 2008. Adapting a lexicalized-grammar parser to contrasting domains. In EMNLP.
James Shaw and Vasileios Hatzivassiloglou. 1999. Ordering among premodifiers. In ACL.
Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Advances in Informatics.
Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.
David Vadas and James R. Curran. 2007a. Adding noun phrase structure to the Penn Treebank. In ACL.
David Vadas and James R. Curran. 2007b. Large-scale supervised models for noun phrase bracketing. In PACLING.
Xiaofeng Yang, Jian Su, and Chew Lim Tan. 2005. Improving pronoun resolution using statistics-based semantic compatibility information. In ACL.