
73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

Source: pdf

Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices, which in turn lead to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weight the word pairs. Second, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents show the efficacy of sparse covariance matrices on two data sets from two different language pairs.
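The two steps named in the abstract (weighting word pairs, then selecting which pairs to keep) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the paper's method: pointwise mutual information (PMI) stands in for "word association measure", keeping the top-k target words per source word stands in for "selection strategy", and the function names and binary document-term matrices are hypothetical.

```python
import numpy as np

def cooccurrence(X, Y):
    """Joint document counts for every (source word, target word) pair.

    X: (n_docs, n_src_words) binary term-presence matrix, language 1
    Y: (n_docs, n_tgt_words) binary term-presence matrix, language 2
    Row i of X and row i of Y describe the same aligned document pair.
    """
    return X.T @ Y

def pmi_weights(X, Y):
    """PMI score for every word pair (assumed association measure)."""
    n = X.shape[0]
    joint = cooccurrence(X, Y) / n        # p(x, y)
    px = X.mean(axis=0)[:, None]          # p(x)
    py = Y.mean(axis=0)[None, :]          # p(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (px * py))
    pmi[~np.isfinite(pmi)] = -np.inf      # pairs that never co-occur
    return pmi

def sparsify_top_k(C, scores, k=1):
    """Zero every entry of C except the k best-scoring targets per source word."""
    keep = np.zeros_like(C, dtype=bool)
    top = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    rows = np.arange(C.shape[0])[:, None]
    keep[rows, top] = True
    keep &= np.isfinite(scores)           # never keep pairs with no co-occurrence
    return np.where(keep, C, 0)

# Toy aligned corpus: 4 document pairs, 3 words per language.
# Target word 2 occurs in every document, so it co-occurs with everything (noise).
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
Y = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]])

C = cooccurrence(X, Y)                    # dense, noisy co-occurrence matrix
S = sparsify_top_k(C, pmi_weights(X, Y), k=1)
print("nonzeros before:", np.count_nonzero(C), "after:", np.count_nonzero(S))
```

In this toy corpus the always-present target word has PMI 0 with every source word, so its column is dropped for the source words that have a genuinely associated partner, while the raw counts for those partners are preserved.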

reference text

Francis R. Bach and Michael I. Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. Technical report, Department of Statistics, University of California, Berkeley, CA.

Lisa Ballesteros and W. Bruce Croft. 1996. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International Conference on Database and Expert Systems Applications, DEXA ’96, pages 791–801, London, UK. Springer-Verlag.

Onureena Banerjee, Alexandre d’Aspremont, and Laurent El Ghaoui. 2005. Sparse covariance selection via robust maximum likelihood estimation. CoRR, abs/cs/0506023.

Nuria Bel, Cornelis H. A. Koster, and Marta Villegas. 2003. Cross-lingual text categorization.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 407–412, Portland, Oregon, USA, June. Association for Computational Linguistics.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist., 19(1):61–74, March.

William A. Gale and Kenneth W. Church. 1991. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 177–184, Morristown, NJ, USA. Association for Computational Linguistics.

Wei Gao, John Blitzer, Ming Zhou, and Kam-Fai Wong. 2009. Exploiting bilingual information to improve web search. In Proceedings of Human Language Technologies: The 2009 Conference of the Association for Computational Linguistics, ACL-IJCNLP ’09, pages 1075–1083, Morristown, NJ, USA. ACL.

Aria Haghighi, Percy Liang, Taylor B. Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: HLT, pages 771–779, Columbus, Ohio, June. Association for Computational Linguistics.

David R. Hardoon and John Shawe-Taylor. 2011. Sparse canonical correlation analysis. Journal of Machine Learning, 83(3):331–353.

Ulf Hermjakob, Kevin Knight, and Hal Daumé III. 2008. Name translation in statistical machine translation - learning when to transliterate. In Proceedings of ACL-08: HLT, pages 389–397, Columbus, Ohio, June. Association for Computational Linguistics.

Hung Huu Hoang, Su Nam Kim, and Min-Yen Kan. 2009. A Re-examination of Lexical Association Measures. In Proceedings of ACL-IJCNLP 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, Singapore, August. Association for Computational Linguistics.

H. Hotelling. 1936. Relation between two sets of variables. Biometrika, 28:322–377.

Diana Zaiu Inkpen and Graeme Hirst. 2002. Acquiring collocations for lexical choice between near-synonyms. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9, ULA ’02, pages 67–76, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In Advances in Information Retrieval, 32nd European Conference on IR Research, ECIR, volume 5993, pages 444–456, Milton Keynes, UK. Springer.

Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. 2011. From bilingual dictionaries to interlingual document representations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 147–152, Portland, Oregon, USA, June. Association for Computational Linguistics.

R. Jonker and A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 817–824, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP ’09, pages 880–889, Stroudsburg, PA, USA. Association for Computational Linguistics.

Robert C. Moore. 2004. On Log-Likelihood-Ratios and the Significance of Rare Events. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 333–340, Barcelona, Spain, July. Association for Computational Linguistics.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist., 31:477–504, December.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 251–261, Stroudsburg, PA, USA. Association for Computational Linguistics.

Piyush Rai and Hal Daumé III. 2009. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information Processing Systems, Vancouver, Canada.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 519–526, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45, Boulder, Colorado, June. Association for Computational Linguistics.

Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. 1993. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc.

Harry T. Reis and Charles M. Judd. 2000. Handbook of Research Methods in Social and Personality Psychology. Cambridge University Press.

Alexander Schrijver. 2003. Combinatorial Optimization. Springer.

Susan T. Dumais, Thomas K. Landauer, and Michael L. Littman. 1996. Automatic cross-linguistic information retrieval using latent semantic indexing. In Working Notes of the Workshop on Cross-Linguistic Information Retrieval, SIGIR, pages 16–23, Zurich, Switzerland. ACM.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res. (JAIR), 37:141–188.

Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. 2003. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems, pages 1473–1480, Cambridge, MA. MIT Press.

Thuy Vu, AiTi Aw, and Min Zhang. 2009. Feature-based method for document alignment in comparable news corpora. In EACL, pages 843–851.

Duo Zhang, Qiaozhu Mei, and ChengXiang Zhai. 2010. Cross-lingual latent topic extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1128–1137, Uppsala, Sweden, July. Association for Computational Linguistics.