
259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Source: pdf

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high-precision bilingual extraction of rare words from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents, in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain a very high F-Measure, between 80% and 98%, for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on one pair of languages and tested on a different pair, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a Spanish-French training corpus. Our method is therefore potentially applicable even to low-resource languages without training data.
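The two features named in the abstract can be illustrated with a minimal sketch. The abstract does not say how either feature is computed, so everything here is an assumption made for illustration: the cosine measure, the Jaccard-style co-occurrence ratio, the window size, and the names `context_vector`, `cooccurrence`, and `features` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def context_vector(word, docs, mapping, window=2):
    """Count context words around `word`, projected through `mapping`.

    `mapping` sends a context word to a shared dimension (assumed here to be
    a seed bilingual lexicon on the source side, identity on the target side).
    """
    vec = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] in mapping:
                    vec[mapping[tokens[j]]] += 1
    return vec


def cooccurrence(src_word, tgt_word, aligned_pairs):
    """Jaccard-style ratio over aligned document pairs (an assumed measure)."""
    both = either = 0
    for src_tokens, tgt_tokens in aligned_pairs:
        in_src = src_word in src_tokens
        in_tgt = tgt_word in tgt_tokens
        both += in_src and in_tgt
        either += in_src or in_tgt
    return both / either if either else 0.0


def features(src_word, tgt_word, aligned_pairs, seed_lexicon):
    """Return (context similarity, co-occurrence) for one candidate pair."""
    src_docs = [s for s, _ in aligned_pairs]
    tgt_docs = [t for _, t in aligned_pairs]
    identity = {t: t for t in seed_lexicon.values()}
    src_vec = context_vector(src_word, src_docs, seed_lexicon)
    tgt_vec = context_vector(tgt_word, tgt_docs, identity)
    return cosine(src_vec, tgt_vec), cooccurrence(src_word, tgt_word, aligned_pairs)


# Toy aligned comparable documents (invented data, for illustration only).
aligned_pairs = [
    (["le", "chat", "mange", "poisson"], ["the", "cat", "eats", "fish"]),
    (["le", "chien", "mange", "viande"], ["the", "dog", "eats", "meat"]),
]
seed_lexicon = {"le": "the", "mange": "eats"}

print(features("chat", "cat", aligned_pairs, seed_lexicon))
print(features("chat", "dog", aligned_pairs, seed_lexicon))
```

Each candidate pair yields a two-value feature vector, which could then be passed to any supervised classifier trained on known translation pairs. Because neither feature refers to language-specific vocabulary, a sketch like this is at least consistent with the cross-language training result reported in the abstract.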

reference text

Enrique Alfonseca, Slaven Bilac, and Stefan Pharies. 2008. Decompounding query keywords from compounding languages. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL’08), pages 253–256.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02), pages 1208–1212.

Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61–74.

Stefan Evert. 2008. Corpora and collocations. In A. Lüdeling and M. Kytö, editors, Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin.

John Firth. 1957. A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, Philological. Longman.

Pascale Fung. 2000. A statistical view on bilingual lexicon extraction–from parallel corpora to non-parallel corpora. In Jean Véronis, editor, Parallel Text Processing, page 428. Kluwer Academic Publishers.

William A. Gale and Kenneth W. Church. 1991. Identifying word correspondence in parallel texts. In Proceedings of the workshop on Speech and Natural Language, HLT’91, pages 152–157, Morristown, NJ, USA. Association for Computational Linguistics.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations, 11.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In 23rd International Conference on Computational Linguistics (Coling 2010), pages 617–625, Beijing, China, Aug.

Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. 2007. Bilingual Terminology Mining - Using Brain, not brawn comparable corpora. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL’07), pages 664–671, Prague, Czech Republic.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.

Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev, and Andrea Mulloni. 2006. Finding translations for low-frequency words in comparable corpora. Machine Translation, 20(4):247–266.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 403–411.

Tao Tao and ChengXiang Zhai. 2005. Mining comparable bilingual text corpora for cross-language information integration. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 691–696, New York, NY, USA. ACM.

Shanheng Zhao and Hwee Tou Ng. 2007. Identification and resolution of Chinese zero pronouns: A machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic.