134 acl-2011-Extracting and Classifying Urdu Multiword Expressions


Source: pdf

Author: Annette Hautli ; Sebastian Sulger

Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
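
As a rough illustration of the two steps the abstract names (a postposition-based classification heuristic and an F-score evaluation against a gold standard), a minimal sketch might look like the following. The function names, the transliterated postposition sets, and the majority-count rule are all assumptions made for illustration; the abstract does not say which postpositions or which decision rule the authors used.

```python
from collections import Counter

# Hypothetical postposition sets (transliterated). Which postpositions signal
# which class is an assumption for this sketch, not taken from the paper.
LOCATION_POSTPOSITIONS = {"meN", "se", "tak"}
PERSON_POSTPOSITIONS = {"nE", "kO"}


def classify_mwe(following_tokens):
    """Label one MWE from the tokens observed directly after it in a corpus.

    Counts how often the MWE is followed by a location-type or a person-type
    postposition and returns the majority class, or "unknown" on a tie.
    """
    counts = Counter(following_tokens)
    location_hits = sum(counts[p] for p in LOCATION_POSTPOSITIONS)
    person_hits = sum(counts[p] for p in PERSON_POSTPOSITIONS)
    if location_hits == person_hits:
        return "unknown"
    return "location" if location_hits > person_hits else "person"


def f_score(predicted, gold):
    """Balanced F-score of a predicted set of MWEs against a gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Here `classify_mwe` would be called once per extracted MWE with the list of tokens that follow it across the corpus, and `f_score` once per class, comparing the MWEs assigned to that class with the hand-annotated ones.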


reference text

Mohammed Attia, Antonio Toral, Lamia Tounsi, Pavel Pecina, and Josef van Genabith. 2010. Automatic Extraction of Arabic Multiword Expressions. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010).
Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics.
Kenneth Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.
Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger. 2007. Developing a Finite-State Morphological Analyzer for Urdu and Hindi: Some Issues. In Proceedings of FSMNLP07, Potsdam, Germany.
Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger. 2009. Urdu and the Modular Architecture of ParGram. In Proceedings of the Conference on Language and Technology 2009 (CLT09).
Miriam Butt and Tracy Holloway King. 2007. Urdu in a Parallel Grammar Development Environment. Language Resources and Evaluation, 41(2): 191–207.
Miriam Butt, Tracy Holloway King, María-Eugenia Niño, and Frédérique Segond. 1999. A Grammar Writer’s Cookbook. CSLI Publications.
Miriam Butt. 1993. The Structure of Complex Predicates in Urdu. Ph.D. thesis, Stanford University.
Debasri Chakrabarti, Vaijayanthi M. Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. In Proceedings of COLING 2008, pages 27–30.
Tanmoy Chakraborty and Sivaji Bandyopadhyay. 2010. Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule-Based Approach. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 72–75.
Dick Crouch, Mary Dalrymple, Ronald M. Kaplan, Tracy Holloway King, John T. Maxwell III, and Paula Newman. 2010. XLE Documentation. Palo Alto Research Center.
Mary Dalrymple. 2001. Lexical Functional Grammar, volume 34 of Syntax and Semantics. Academic Press.
Dipankar Das, Santanu Pal, Tapabrata Mondal, Tanmoy Chakraborty, and Sivaji Bandyopadhyay. 2010. Automatic Extraction of Complex Predicates in Bengali. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 37–45.
Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, IMS, University of Stuttgart.
Sarmad Hussain. 2008. Resources for Urdu Language Processing. In Proceedings of the 6th Workshop on Asian Language Resources, IJCNLP’08.
John Kizito, Ismail Fahmi, Erik Tjong Kim Sang, Gosse Bouma, and John Nerbonne. 2009. Computational Linguistics and the History of Science. In Liborio Dibattista, editor, Storia della Scienza e Linguistica Computazionale. FrancoAngeli.
Muhammad Kamran Malik, Tafseer Ahmed, Sebastian Sulger, Tina Bögel, Atif Gulzar, Ghulam Raza, Sarmad Hussain, and Miriam Butt. 2010. Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10).
Scott Martens and Vincent Vandeghinste. 2010. An Efficient, Generic Approach to Extracting Multi-Word Expressions from Dependency Trees. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 84–87.
Amitabha Mukerjee, Ankit Soni, and Achla M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties (MWE ’06), pages 28–35.
David Pearce. 2001. Synonymy in Collocation Extraction. In WordNet and Other Lexical Resources: Applications, Extensions & Customizations, pages 41–46.
Carlos Ramisch, Paulo Schreiner, Marco Idiart, and Aline Villavicencio. 2008. An Evaluation of Methods for the Extraction of Multiword Expressions. In Proceedings of the Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions (MWE 2008).
R. Mahesh K. Sinha. 2009. Mining Complex Predicates in Hindi Using a Parallel Hindi-English Corpus. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46.
Sina Zarrieß and Jonas Kuhn. 2009. Exploiting Translational Correspondences for Pattern-Independent MWE Identification. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30.