emnlp emnlp2011 emnlp2011-2 emnlp2011-2-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schütze
Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because, unlike other work on MWUs, tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.
References:
Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina, and Josef van Genabith. 2010. Automatic extraction of Arabic multiword expressions. In Proceedings of the 2010 Workshop on Multiword Expressions, pages 19–27, Beijing, China. Coling 2010 Organizing Committee.
Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Sapporo, Japan. Association for Computational Linguistics.
Helena Caseli, Aline Villavicencio, André Machado, and Maria José Finatto. 2009. Statistically-driven alignment-based multiword expression identification for technical domains. In Proceedings of the 2009 Workshop on Multiword Expressions, pages 1–8, Singapore. Association for Computational Linguistics.
Yaacov Choueka. 1988. Looking for needles in a haystack. In Proceedings of RIAO88, pages 609–623.
Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the 2007 on Multiword Expressions, pages 41–48, Prague, Czech Republic. Association for Computational Linguistics.
Mona Diab and Pravin Bhutada. 2009. Verb noun construction MWE token classification. In Proceedings of the 2009 Workshop on Multiword Expressions, pages 17–22, Singapore. Association for Computational Linguistics.
Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 188–195. Association for Computational Linguistics.
Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung (IMS), Universität Stuttgart.
Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the 2006 Workshop on Multiword Expressions, pages 12–19, Sydney, Australia. Association for Computational Linguistics.
Linlin Li and Caroline Sporleder. 2010. Linguistic cues for distinguishing literal and non-literal usages. In Coling 2010: Posters, pages 683–691, Beijing, China. Coling 2010 Organizing Committee.
Dekang Lin. 1999. Automatic identification of noncompositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324, College Park, Maryland, USA. Association for Computational Linguistics.
Marianne Lykke, Birger Larsen, Haakon Lund, and Peter Ingwersen. 2010. Developing a test collection for the evaluation of integrated search. In Advances in Information Retrieval, 32nd European Conference on IR Research, ECIR 2010, Milton Keynes, UK, March 28-31, 2010. Proceedings, pages 627–630.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA.
Donald Metzler and W. Bruce Croft. 2004. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735–750.
B.V. Moirón and Jörg Tiedemann. 2006. Identifying Idiomatic Expressions Using Automatic Word-Alignment. In Multi-Word-Expressions in a Multilingual Context, page 33.
Pavel Pecina. 2010. Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1-2):138–158.
Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. mwetoolkit: a framework for multiword expression identification. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15, Mexico City.
Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, Pennsylvania, USA. Association for Computational Linguistics.
Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
ChengXiang Zhai and John D. Lafferty. 2002. Two-stage language models for information retrieval. In SIGIR, pages 49–56. ACM.