acl acl2013 acl2013-235 acl2013-235-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuki Arase ; Ming Zhou
Abstract: We propose a method for automatically detecting low-quality Web text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences in large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore, it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.
Alexandra Antonova and Alexey Misyurev. 2011. Building a web-based parallel corpus and filtering out machine translated text. In Proceedings of the Workshop on Building and Using Comparable Corpora, pages 136–144.
Eleftherios Avramidis, Maja Popovic, David Vilar Torres, and Aljoscha Burchardt. 2011. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Workshop on Statistical Machine Translation (WMT 2011), pages 65–70.
Mohit Bansal, Chris Quirk, and Robert C. Moore. 2011. Gappy phrasal alignment by agreement. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 1308–1317.
Marco Baroni and Silvia Bernardini. 2005. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274.
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 263–270.
Simon Corston-Oliver, Michael Gamon, and Chris Brockett. 2001. A machine learning approach to the automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 148–155.
Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1535–1545.
Michael Gamon, Anthony Aue, and Martine Smets. 2005. Sentence-level MT evaluation without reference translations: Beyond language modeling. In Proceedings of European Association for Machine Translation (EAMT 2005).
Google N-gram Corpus. 2006. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13.
Google Translate. 2006. http://code.google.com/apis/language/.
Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. 2010. Identification of translationese: A machine learning approach. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2010), pages 503–511.
Tatsuya Ishisaka, Masao Utiyama, Eiichiro Sumita, and Kazuhide Yamamoto. 2009. Development of a Japanese-English software manual parallel corpus. In Proceedings of the Machine Translation Summit (MT Summit XII).
Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, and Qingsheng Zhu. 2009. Mining bilingual data from the web with adaptively learnt patterns. In Proceedings of the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), pages 870–878.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 230–237.
David Kurokawa, Cyril Goutte, and Pierre Isabelle. 2009. Automatic detection of translated text and its impact on machine translation. In Proceedings of the Machine Translation Summit (MT Summit XII).
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Morgan and Claypool Publishers.
Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys, 40(3):1–49.
Xiaoyi Ma. 2006. Champollion: a robust parallel text sentence aligner. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), pages 489–492.
Microsoft Translator. 2009. http://www.microsofttranslator.com/dev/.
Microsoft Web N-gram Services. 2010. http://research.microsoft.com/web-ngram.
Robert Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 220–224.
Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), pages 1135–1145.
Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the Annual International ACM SIGIR Conference (SIGIR 1999), pages 74–81.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318.
Kristen Parton, Joel Tetreault, Nitin Madnani, and Martin Chodorow. 2011. E-rating machine translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT 2011), pages 108–115.
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. 2001. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the International Conference on Data Engineering (ICDE 2001), pages 215–224.
Spencer Rarrick, Chris Quirk, and Will Lewis. 2011. MT detection in web-scraped parallel corpora. In Proceedings of the Machine Translation Summit (MT Summit XIII).
Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Lei Shi, Cheng Niu, Ming Zhou, and Jianfeng Gao. 2006. A DOM tree alignment model for mining parallel data from the web. In Proceedings of the International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006), pages 489–496.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of International Conference on World Wide Web (WWW 2007), pages 697–706.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003), pages 252–259.
Vladimir N. Vapnik. 1995. The nature of statistical learning theory. Springer.
Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. In Proceedings of International Conference on World Wide Web (WWW 2009), pages 101–110.