44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Source: pdf

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Affiliation: Xiaodong He, Microsoft Research, Redmond, WA 98052, xiaohe@microsoft.com ; Jianfeng Gao, Microsoft Research, Redmond, WA 98052, jfgao@microsoft.com

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection, as well as by combining in-domain and general-domain systems during decoding.
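
The abstract does not spell out the three selection methods, so the following is only a minimal sketch of one cross-entropy based criterion: ranking general-domain sentences by the difference between their cross-entropy under an in-domain language model and under a general-domain language model, in the spirit of Moore and Lewis (2010). The unigram models with add-one smoothing, the function names, and the `keep` parameter are assumptions made for illustration; a real system would more likely use higher-order n-gram models from a dedicated toolkit.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return an add-one-smoothed unigram log-probability function.

    Assumption: a unigram model stands in for the n-gram language
    models a real selection pipeline would use.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy (in nats) of a non-empty sentence."""
    tokens = sentence.split()
    return -sum(logprob(t) for t in tokens) / len(tokens)


def select_pseudo_in_domain(in_domain, general, keep):
    """Keep the `keep` general-domain sentences with the lowest
    cross-entropy difference H_in(s) - H_gen(s).

    Lower scores mean the sentence looks more like the in-domain data
    and less like the general-domain corpus as a whole.
    """
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    candidates = [s for s in general if s.split()]  # skip empty lines
    ranked = sorted(
        candidates,
        key=lambda s: cross_entropy(lp_in, s) - cross_entropy(lp_gen, s),
    )
    return ranked[:keep]


if __name__ == "__main__":
    # Toy, made-up corpora purely for illustration.
    in_domain = ["the patient received a dose", "the dose was reduced"]
    general = [
        "the market fell sharply",
        "the patient was given a dose",
        "stocks rose on monday",
    ]
    print(select_pseudo_in_domain(in_domain, general, keep=1))
```

In this toy run the medical-sounding sentence ranks first because its words are relatively more probable under the in-domain model than under the general-domain one. Choosing `keep` as a small fraction of the general corpus mirrors the abstract's point that a subcorpus about 1% the size of the original can be enough.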

reference text

Amittai Axelrod. 2006. Factored Language Models for Statistical Machine Translation. M.Sc. Thesis. University of Edinburgh, Scotland.
Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG Supertags in Factored Translation Models. Workshop on Statistical Machine Translation, Association for Computational Linguistics.
Stanley Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report 10-98, Computer Science Group, Harvard University.
Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Language Model Adaptation for Statistical Machine Translation based on Information Retrieval. Language Resources and Evaluation.
George Foster and Roland Kuhn. 2007. Mixture-Model Adaptation for SMT. Workshop on Statistical Machine Translation, Association for Computational Linguistics.
George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation. Empirical Methods in Natural Language Processing.
Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a Unified Approach to Statistical Language Modeling for Chinese. ACM Transactions on Asian Language Information Processing.
Xiaodong He. 2007. Using Word-Dependent Transition Models in HMM-based Word Alignment for Statistical Machine Translation. Workshop on Statistical Machine Translation, Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Demo Session, Association for Computational Linguistics.
Philipp Koehn and Josh Schroeder. 2007. Experiments in Domain Adaptation for Statistical Machine Translation. Workshop on Statistical Machine Translation, Association for Computational Linguistics.
Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving Statistical Machine Translation Performance by Training Data Selection and Optimization. Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Spyros Matsoukas, Antti-Veikko Rosti, and Bing Zhang. 2009. Discriminative Corpus Weight Estimation for Machine Translation. Empirical Methods in Natural Language Processing.
Robert Moore and William Lewis. 2010. Intelligent Selection of Language Model Training Data. Association for Computational Linguistics.
Preslav Nakov. 2008. Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing. Workshop on Statistical Machine Translation, Association for Computational Linguistics.
Franz Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics.
Franz Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. Spoken Language Processing.
Keiji Yasuda, Ruiqiang Zhang, Hirofumi Yamamoto, and Eiichiro Sumita. 2008. Method of Selecting Training Data to Build a Compact and Efficient Translation Model. International Joint Conference on Natural Language Processing.