
109 emnlp-2010-Translingual Document Representations from Discriminative Projections

Source: pdf

Author: John Platt ; Kristina Toutanova ; Wen-tau Yih

Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.
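The retrieval task mentioned in the abstract can be illustrated with a minimal sketch. It assumes the documents have already been projected into a shared translingual space, and it assumes cosine similarity as the matching score, which the abstract does not specify. The vectors, document ids, and the helper names `cosine` and `retrieve` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, candidates):
    """Return the id of the candidate whose vector is most similar to query_vec."""
    return max(candidates, key=lambda doc_id: cosine(query_vec, candidates[doc_id]))


# Hypothetical 3-dimensional projections of English documents (made-up numbers).
english_docs = {
    "en_politics": [0.9, 0.1, 0.0],
    "en_sports": [0.1, 0.8, 0.2],
    "en_science": [0.0, 0.2, 0.9],
}

# Hypothetical projection of a Spanish document into the same space.
spanish_query = [0.2, 0.7, 0.3]

best = retrieve(spanish_query, english_docs)
print(best)
```

Because both languages share one vector space in this sketch, no translation step is needed at query time: the nearest English vector is taken as the retrieved counterpart of the Spanish document.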

reference text

Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 28–36.

Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. 2009. On smoothing and inference for topic models. In Proceedings of Uncertainty in Artificial Intelligence, pages 27–34.

Lisa Ballesteros and Bruce Croft. 1996. Dictionary methods for cross-lingual information retrieval. In Proceedings of the 7th International DEXA Conference on Database and Expert Systems Applications, pages 791–801.

David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Christopher J.C. Burges, John C. Platt, and Soumya Jana. 2003. Distortion discriminant analysis for audio fingerprinting. IEEE Transactions on Speech and Audio Processing, 11(3):165–174.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Konstantinos I. Diamantaras and S.Y. Kung. 1996. Principal Component Neural Networks: Theory and Applications. Wiley-Interscience.

Susan T. Dumais, Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval.

Susan T. Dumais. 1990. Enhancing performance in latent semantic indexing (LSI) retrieval. Technical Report TM-ARH-017527, Bellcore.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of COLING-ACL, pages 414–420.

Kuzman Ganchev, Joao Graca, Jennifer Gillenwater, and Ben Taskar. 2009. Posterior regularization for structured latent variable models. Technical Report MSCIS-09-16, University of Pennsylvania.

Joao Graca, Kuzman Ganchev, and Ben Taskar. 2008. Expectation maximization and posterior constraints. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press, Cambridge, MA.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. ACL, pages 771–779.

Xiaodong He. 2007. Using word-dependent transition models in HMM based word alignment for statistical machine translation. In ACL 2nd Statistical MT workshop, pages 80–87.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, pages 289–296.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting multilingual topics from unaligned comparable corpora. In ECIR.

David Mimno, Hanna W. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing, pages 880–889.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31:477–504.

Douglas W. Oard and Anne R. Diekema. 1998. Cross-language information retrieval. In Martha Williams, editor, Annual Review of Information Science (ARIST), volume 33, pages 223–256.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proc. EMNLP, pages 79–86.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the ACL, pages 519–526.

Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. 2010. Spherical topic models. In Proc. ICML.

Nicola Ueffing, Michel Simard, Samuel Larkin, and J. Howard Johnson. 2007. NRC’s PORTAGE system for WMT 2007. In ACL-2007 2nd Workshop on SMT, pages 185–188.

Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. 2003. Inferring a semantic representation of text via cross-language correlation analysis. In S. Thrun, S. Becker, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1473–1480, Cambridge, MA. MIT Press.

Fridolin Wild, Christina Stahl, Gerald Stermsek, and Gustaf Neumann. 2005. Parameters driving effectiveness of automated essay scoring with LSA. In Proceedings 9th International Computer-Assisted Assessment Conference, pages 485–494.

Duo Zhang, Qiaozhu Mei, and ChengXiang Zhai. 2010. Cross-lingual latent topic extraction. In Proc. ACL, pages 1128–1137, Uppsala, Sweden. Association for Computational Linguistics.