219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Source: pdf

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

reference text

Alfred V. Aho and Margaret J. Corasick. 1975. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, June.

Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Proceedings of NAACL HLT 2010, pages 229–237, Los Angeles, USA.

Jamie Callan and Mark Hoy. 2009. ClueWeb09 Dataset. Available at http://boston.lti.cs.cmu.edu/Data/clueweb09/.

Simon Carter, Wouter Weerkamp, and Manos Tsagkias. to appear. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation Journal.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA.

Hakan Ceylan and Yookyung Kim. 2009. Language identification of search engine queries. In Proceedings of ACL2009, pages 1066–1074, Singapore.

George Forman. 2003. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305, October.

Rayid Ghani, Rosie Jones, and Dunja Mladenic. 2004. Building Minority Language Corpora by Learning to Generate Web Search Queries. Knowledge and Information Systems, 7(1):56–83, February.

Harald Hammarström. 2007. A Fine-Grained Model for Language Identification. In Proceedings of iNEWS07, pages 14–20.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT summit, 11.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA.

J.R. Quinlan. 1986. Induction of Decision Trees. Machine Learning, 1(1):81–106, October.

Penelope Sibun and Jeffrey C. Reynar. 1996. Language determination: Examining the issues. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval, pages 125–135, Las Vegas, USA.

Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. Recent Advances in Natural Language Processing, V:237–248.

Erik Tromp and Mykola Pechenizkiy. 2011. Graph-Based N-gram Language Identification on Short Texts. In Proceedings of Benelearn 2011, pages 27–35, The Hague, Netherlands.

Tommi Vatanen, Jaakko J. Väyrynen, and Sami Virpioja. 2010. Language identification of short text segments with n-gram models. In Proceedings of LREC 2010, pages 3423–3430.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML 97.