92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

Source: pdf

Author: Shane Bergsma ; Aditya Bhargava ; Hua He ; Grzegorz Kondrak

Abstract: In many applications, replacing a complex word form by its stem can reduce sparsity, revealing connections in the data that would not otherwise be apparent. In this paper, we focus on prefix verbs: verbs formed by adding a prefix to an existing verb stem. A prefix verb is considered compositional if it can be decomposed into a semantically equivalent expression involving its stem. We develop a classifier to predict compositionality via a range of lexical and distributional features, including novel features derived from web-scale N-gram data. Results on a new annotated corpus show that prefix verb compositionality can be predicted with high accuracy. Our system also performs well when trained and tested on conventional morphological segmentations of prefix verbs.
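
The abstract's definition of a prefix verb (a prefix added to an existing verb stem) can be illustrated with a minimal Python sketch. It is not the authors' system: the prefix list, the verb vocabulary, and the function name `candidate_splits` are all hypothetical, chosen only for illustration. The sketch finds candidate prefix/stem splits and says nothing about whether a split is compositional, which is the question the paper's classifier addresses.

```python
# Hypothetical prefix inventory and verb vocabulary, for illustration only.
PREFIXES = ("re", "un", "over", "out", "mis", "pre")
KNOWN_VERBS = {"write", "do", "cook", "run", "take", "view", "read"}


def candidate_splits(verb, prefixes=PREFIXES, vocab=KNOWN_VERBS):
    """Return every (prefix, stem) pair where stripping the prefix
    from `verb` leaves a stem found in the verb vocabulary."""
    splits = []
    for prefix in prefixes:
        if verb.startswith(prefix):
            stem = verb[len(prefix):]
            if stem in vocab:
                splits.append((prefix, stem))
    return splits


if __name__ == "__main__":
    for verb in ("rewrite", "overtake", "read"):
        print(verb, candidate_splits(verb))
```

Under these assumptions, "rewrite" and "overtake" both yield a split, while "read" does not, because "ad" is not in the vocabulary. Deciding whether a split such as over + take preserves the stem's meaning is a separate prediction step.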

reference text

Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In LREC.
R. Harald Baayen and Antoinette Renouf. 1996. Chronicling the Times: Productive lexical innovations in an English newspaper. Language, 72(1).
Harald Baayen and Richard Sproat. 1996. Estimating lexical priors for low-frequency morphologically ambiguous forms. Comput. Linguist., 22(2):155–166.
R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1996. The CELEX2 lexical database. LDC96L14.
Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In ACL 2003 Workshop on Multiword Expressions.
Colin Bannard, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions.
Marco Baroni, Johannes Matiasek, and Harald Trost. 2002. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In ACL-02 Workshop on Morphological and Phonological Learning (SIGPHON), pages 48–57.
Matthew W. Bilotti, Boris Katz, and Jimmy Lin. 2004. What works better for question answering: Stemming or morphological query expansion? In Information Retrieval for Question Answering (IR4QA) Workshop at SIGIR 2004.
Thorsten Brants and Alex Franz. 2006. The Google Web 1T 5-gram Corpus Version 1.1. LDC2006T13.
Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process., 4(1):1–34.
Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In HLT-EMNLP.
Carl de Marken. 1996. Linguistic structure as composition and perturbation. In ACL.
Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In EMNLP.
Tomaž Erjavec and Sašo Džeroski. 2004. Machine learning of morphosyntactic structure: Lemmatising unknown Slovene words. Applied Artificial Intelligence, 18:17–41.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874.
Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Comput. Linguist., 35(1):61–103.
John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Comput. Linguist., 27(2):153–198.
David Graff. 2003. English Gigaword. LDC2003T05.
Vera Hollink, Jaap Kamps, Christof Monz, and Maarten de Rijke. 2004. Monolingual document retrieval for European languages. IR, 7(1):33–52.
Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006. Generating query substitutions. In WWW.
Daniel Jurafsky and James H. Martin. 2000. Speech and language processing. Prentice Hall.
Daniel Karp, Yves Schabes, Martin Zaidel, and Dania Egedi. 1992. A freely available wide coverage morphological analyzer for English. In COLING.
Francis Katamba. 1993. Morphology. MacMillan Press.
Samarth Keshava and Emily Pitler. 2006. A simpler, intuitive approach to morpheme induction. In 2nd Pascal Challenges Workshop.
Jonathan K. Kummerfeld and James R. Curran. 2008. Classification of verb particle constructions with the Google Web1T Corpus. In Australasian Language Technology Association Workshop.
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, and Sushant Narsale. 2010. New tools for web-scale N-grams. In LREC.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In COLING-ACL.
Dekang Lin. 1999. Automatic identification of noncompositional phrases. In ACL.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Nat. Lang. Eng., 7(3):207–223.
Preslav Nakov and Marti Hearst. 2005. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In CoNLL.
Preslav Ivanov Nakov. 2007. Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Ph.D. thesis, University of California, Berkeley.
Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In HLT-NAACL.
Martin F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3).
Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In LLL/CoNLL.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In NAACL.
Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In ACL-IJCNLP.
Peter D. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning.
Antal Van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In ACL.
Richard Wicentowski. 2004. Multilingual noise-robust supervised morphological analysis using the wordframe model. In ACL SIGPHON.
Ying Xu, Christoph Ringlstetter, and Randy Goebel. 2009. A continuum-based approach for tightness analysis of Chinese semantic units. In PACLIC.
David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In ACL.