
204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Source: pdf

Author: Dong Nguyen ; A. Seza Doğruöz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.
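The first step in the abstract, tagging each word with a language model, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes character bigram models with add-one smoothing, and the word lists, the language codes `nl` and `tr`, and the names `CharNgramModel` and `tag_words` are all hypothetical choices made for illustration. The dictionary lookup and the context step mentioned in the abstract are not shown.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, words, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for word in words:
            for gram in char_ngrams(word, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        # Distinct final characters seen in training, plus one slot for unseen ones.
        self.vocab_size = len({gram[-1] for gram in self.ngram_counts}) + 1

    def log_prob(self, word):
        """Smoothed log-probability of the word under this model."""
        total = 0.0
        for gram in char_ngrams(word, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def tag_words(tokens, models):
    """Assign each token the language whose model scores it highest."""
    return [max(models, key=lambda lang: models[lang].log_prob(tok)) for tok in tokens]


# Hypothetical toy training data; a real system would train on large corpora.
models = {
    "nl": CharNgramModel(["vandaag", "school", "niet", "goed", "morgen", "het", "een", "gaan"]),
    "tr": CharNgramModel(["merhaba", "nasılsın", "teşekkürler", "güzel", "bugün", "çok", "değil", "için"]),
}

tokens = "bugün school niet çok".split()
print(list(zip(tokens, tag_words(tokens, models))))
```

Each word is scored independently here, so a short word that is plausible in both languages can easily be mislabeled. That weakness is the kind of error that using the surrounding words as context is meant to reduce.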

reference text

J. Androutsopoulos. 2007. The multilingual Internet: Language, culture, and communication online, chapter Language choice and code-switching in German-based diasporic web-forums, pages 340–361. Oxford: Oxford University Press.

P. Auer and L. Wei. 2007. Introduction: Multilingualism as a problem? Monolingualism as a problem? In Handbook of Multilingualism and Multilingual Communication, volume 5 of Handbooks of Applied Linguistics, pages 1–14. Mouton de Gruyter.

P. Auer. 1999. From codeswitching via language mixing to fused lects: Toward a dynamic typology of bilingual speech. International Journal of Bilingualism, 3(4):309–332.

T. Baldwin and M. Lui. 2010. Language identification: the long and the short of the matter. In Proceedings of NAACL 2010.

S. Bergsma, P. McNamee, M. Bagdouri, C. Fink, and T. Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media.

S. Carter, W. Weerkamp, and M. Tsagkias. 2012. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, pages 1–21.

W. B. Cavnar and J. M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval.

H. Ceylan and Y. Kim. 2009. Language identification of search engine queries. In Proceedings of ACL 2009.

B. Danet and S. C. Herring. 2007. The multilingual Internet: Language, culture, and communication online. Oxford University Press, Oxford.

M. Durham. 2003. Language choice on a Swiss mailing list. Journal of Computer-Mediated Communication, 9(1).

T. Gottron and N. Lipka. 2010. A comparison of language identification approaches on short, query-style texts. In Proceedings of ECIR 2010.

H. Hammarström. 2007. A fine-grained model for language identification. In Proceedings of iNEWS-07 Workshop at SIGIR 2007.

B. Hughes, T. Baldwin, S. Bird, J. Nicholson, and A. Mackinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of LREC 2006.

B. King and S. Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of NAACL 2013.

J. Lafferty, A. McCallum, and F. C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001.

M. Lui and T. Baldwin. 2012. langid.py: an off-the-shelf language identification tool. In Proceedings of ACL 2012.

P. McNamee. 2005. Language identification: a solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.

J. C. Paolillo. 2011. "Conversational" codeswitching on Usenet and Internet Relay Chat. Language@Internet, 8(3).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

H. Sak, T. Güngör, and M. Saraçlar. 2008. Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In GoTAL 2008, volume 5221 of LNCS, pages 417–427. Springer.

R. Schäfer and F. Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Proceedings of LREC 2012.

J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. 2006. Effects of age and gender on blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

D. Trieschnigg, D. Hiemstra, M. Theune, F. Jong, and T. Meder. 2012. An exploration of language identification techniques for the Dutch folktale database. In Adaptation of Language Resources and Tools for Processing Cultural Heritage workshop (LREC 2012).

T. Vatanen, J. J. Väyrynen, and S. Virpioja. 2010. Language identification of short text segments with n-gram models. In Proceedings of LREC 2010.

H. Yamaguchi and K. Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of ACL 2012.