
151 acl-2010-Intelligent Selection of Language Model Training Data


Source: pdf

Author: Robert C. Moore; William Lewis

Abstract: We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data selection and two other previously proposed methods.
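The selection criterion the abstract describes can be sketched with toy unigram models: score each candidate sentence by its cross-entropy under the in-domain model minus its cross-entropy under the general model, and prefer low scores. This is an illustrative sketch only (the function names are hypothetical, and real systems use smoothed n-gram models rather than add-alpha unigrams):

```python
import math
from collections import Counter

def unigram_logprobs(corpus, alpha=1.0):
    """Build an add-alpha smoothed unigram model from a list of token lists.

    Returns a function mapping a token to its log2 probability.
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves mass for unseen tokens
    def logprob(tok):
        return math.log2((counts.get(tok, 0) + alpha) / (total + alpha * vocab_size))
    return logprob

def cross_entropy(sent, logprob):
    """Per-token cross-entropy (in bits) of a sentence under a unigram model."""
    return -sum(logprob(tok) for tok in sent) / len(sent)

def moore_lewis_scores(sentences, lp_in, lp_gen):
    """Score = H_in(s) - H_gen(s); lower scores look more in-domain."""
    return [cross_entropy(s, lp_in) - cross_entropy(s, lp_gen) for s in sentences]
```

A sentence resembling the in-domain corpus receives a lower (more negative) score than an out-of-domain one, so sorting by score and keeping the lowest-scoring fraction selects the most domain-relevant training data.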


reference text

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28–30, Prague, Czech Republic, 858–867.

François Denis, Rémi Gilleron, and Marc Tommasi. 2002. Text classification from positive and unlabeled examples. In The 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002), 1927–1934.

Charles Elkan and Keith Noto. 2008. Learning classifiers from only positive and unlabeled data. In KDD 2008, August 24–27, Las Vegas, Nevada, USA, 213–220.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing, 1(1):3–33.

Dietrich Klakow. 2000. Selecting articles from the language model training corpus. In ICASSP 2000, June 5–9, Istanbul, Turkey, vol. 3, 1695–1698.

Philipp Koehn. 2005. Europarl: a parallel corpus for statistical machine translation. In MT Summit X, September 12–16, Phuket, Thailand, 79–86.

Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien, Keh-Jiann Chen, and Lin-Shan Lee. 1997. Chinese language model adaptation based on document classification and multiple domain-specific language models. In EUROSPEECH-1997, 1463–1466.

Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8:1–38.