
139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Source: pdf

Author: Jagadeesh Jagarlamudi ; Hal Daumé III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize better to different domains: our approach gives better performance than either a word-by-word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.
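
The first step described in the abstract, finding candidate document alignments with a bilingual dictionary, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes tokenized documents, a toy dictionary, bag-of-words cosine similarity, and a greedy best-match rule, all of which are illustrative choices. The function names (translate_bow, cosine, candidate_alignments) are hypothetical, and the robust learning step that follows in the paper is not shown.

```python
import math
from collections import Counter


def translate_bow(tokens, dictionary):
    """Project source tokens into the target vocabulary via the dictionary.

    Every listed translation of a token is counted once; tokens missing
    from the dictionary are dropped. (Illustrative assumption.)
    """
    bag = Counter()
    for tok in tokens:
        for translation in dictionary.get(tok, ()):
            bag[translation] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(cnt * b[w] for w, cnt in a.items() if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_alignments(src_docs, tgt_docs, dictionary):
    """Pair each source document with its most similar target document.

    Greedy and therefore noisy: several source documents may pick the
    same target. Assumes tgt_docs is non-empty.
    """
    tgt_bags = [Counter(doc) for doc in tgt_docs]
    pairs = []
    for i, doc in enumerate(src_docs):
        bag = translate_bow(doc, dictionary)
        scores = [cosine(bag, t) for t in tgt_bags]
        j = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, j, scores[j]))
    return pairs


# Toy data (hypothetical): a Spanish-English dictionary and two documents per side.
dictionary = {"casa": ["house", "home"], "perro": ["dog"], "gato": ["cat"]}
src_docs = [["perro", "gato"], ["casa", "casa"]]
tgt_docs = [["house", "garden"], ["dog", "cat", "dog"]]

pairs = candidate_alignments(src_docs, tgt_docs, dictionary)
for src_idx, tgt_idx, score in pairs:
    print(src_idx, tgt_idx, round(score, 3))
```

Because such pairs are only candidates, whatever consumes them has to tolerate wrong matches, which is the motivation the abstract gives for a robust learning algorithm.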

reference text

Lisa Ballesteros and W. Bruce Croft. 1996. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International Conference on Database and Expert Systems Applications, DEXA ’96, pages 791–801, London, UK. Springer-Verlag.

Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Uncertainty in Artificial Intelligence.

Scott Deerwester. 1988. Improving Information Retrieval with Latent Semantic Indexing. In Christine L. Borgman and Edward Y. H. Pai, editors, Proceedings of the 51st ASIS Annual Meeting (ASIS ’88), volume 25, Atlanta, Georgia, October. American Society for Information Science.

William A. Gale and Kenneth W. Church. 1991. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 177–184, Morristown, NJ, USA. Association for Computational Linguistics.

Wei Gao, John Blitzer, and Ming Zhou. 2008. Using english information in non-english web search. In iNEWS ’08: Proceeding of the 2nd ACM workshop on Improving non english web searching, pages 17–24, New York, NY, USA. ACM.

Wei Gao, John Blitzer, Ming Zhou, and Kam-Fai Wong. 2009. Exploiting bilingual information to improve web search. In Proceedings of Human Language Technologies: The 2009 Conference of the Association for Computational Linguistics, ACL-IJCNLP ’09, pages 1075–1083, Morristown, NJ, USA. ACL.

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring statistical dependence with hilbert-schmidt norms. In Proceedings of Algorithmic Learning Theory, pages 63–77. Springer-Verlag.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: HLT, pages 771–779, Columbus, Ohio, June. Association for Computational Linguistics.

Ulf Hermjakob, Kevin Knight, and Hal Daumé III. 2008. Name translation in statistical machine translation: Learning when to transliterate. In Proceedings of ACL-08: HLT, pages 389–397, Columbus, Ohio, June. Association for Computational Linguistics.

H. Hotelling. 1936. Relations between two sets of variates. Biometrika, 28:322–377.

Jagadeesh Jagarlamudi, Seth Juarez, and Hal Daumé III. 2010. Kernelized sorting for natural language processing. In Proceedings of AAAI Conference on Artificial Intelligence.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In Advances in Information Retrieval, 32nd European Conference on IR Research, ECIR, volume 5993, pages 444–456, Milton Keynes, UK. Springer.

R. Jonker and A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 817–824, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2, EMNLP ’09, pages 880–889, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist., 31:477–504, December.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Ari Pirkola, Turid Hedlund, Heikki Keskustalo, and Kalervo Järvelin. 2001. Dictionary-based cross-language information retrieval: Problems, methods, and research findings. Information Retrieval, 4:209–230.

John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 251–261, Stroudsburg, PA, USA.

Novi Quadrianto, Le Song, and Alex J. Smola. 2009. Kernelized sorting. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1289–1296.

Piyush Rai and Hal Daumé III. 2009. Multi-label prediction via sparse infinite cca. In Advances in Neural Information Processing Systems, Vancouver, Canada.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated english and german corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 519–526, Stroudsburg, PA, USA.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45, Boulder, Colorado, June.

Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. 1993. Network flows: Theory, algorithms, and applications.

Susan T. Dumais, Thomas K. Landauer, and Michael L. Littman. 1996. Automatic cross-linguistic information retrieval using latent semantic indexing. In Working Notes of the Workshop on Cross-Linguistic Information Retrieval, SIGIR, pages 16–23, Zurich, Switzerland. ACM.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res. (JAIR), 37:141–188.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2009. Mint: A method for effective and scalable mining of named entity transliterations from large comparable corpora. In EACL, pages 799–807. The Association for Computer Linguistics.

Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. 2003. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems, pages 1473–1480, Cambridge, MA. MIT Press.

Thuy Vu, AiTi Aw, and Min Zhang. 2009. Feature-based method for document alignment in comparable news corpora. In EACL, pages 843–851.

Manfred K. Warmuth and Dima Kuzmin. 2006. Randomized pca algorithms with regret bounds that are logarithmic in the dimension. In Neural Information Processing Systems, pages 1481–1488.