326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

Source: pdf

Author: Hany Hassan ; Arul Menezes

Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of normalization equivalences from unlabeled text. The approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences in a large unlabeled text corpus. We show that the approach has a very high precision (92.43) and a reasonable recall (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, it improved translation quality on social media text by 6%. The approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.
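The idea of walking a bipartite word/context graph can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`build_bipartite_graph`, `walk_distribution`), the one-word context window, the count-based edge weights, the fixed number of steps, and the choice to propagate exact probabilities instead of sampling walks. The abstract does not give the paper's actual graph construction, walk cost, or any additional filtering, so none of that is reproduced.

```python
from collections import defaultdict


def build_bipartite_graph(sentences, window=1):
    """Link each token to the (left words, right words) context it occurs in.

    Returns two adjacency maps with co-occurrence counts as edge weights:
    word -> context and context -> word.
    """
    word_to_ctx = defaultdict(lambda: defaultdict(int))
    ctx_to_word = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        toks = ["<s>"] * window + sent.split() + ["</s>"] * window
        for i in range(window, len(toks) - window):
            ctx = (tuple(toks[i - window:i]), tuple(toks[i + 1:i + 1 + window]))
            word_to_ctx[toks[i]][ctx] += 1
            ctx_to_word[ctx][toks[i]] += 1
    return word_to_ctx, ctx_to_word


def _normalize(weights):
    """Turn raw edge weights into transition probabilities."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


def walk_distribution(start, word_to_ctx, ctx_to_word, vocab, steps=2):
    """Propagate probability mass word -> context -> word for `steps` rounds.

    Mass that lands on an in-vocabulary word is absorbed there; mass on
    out-of-vocabulary words keeps walking. Returns candidates sorted by
    absorbed probability, highest first.
    """
    current = {start: 1.0}
    absorbed = defaultdict(float)
    for _ in range(steps):
        nxt = defaultdict(float)
        for word, p in current.items():
            for ctx, p_ctx in _normalize(word_to_ctx.get(word, {})).items():
                for cand, p_cand in _normalize(ctx_to_word[ctx]).items():
                    mass = p * p_ctx * p_cand
                    if cand in vocab:
                        absorbed[cand] += mass
                    else:
                        nxt[cand] += mass
        current = nxt
    return sorted(absorbed.items(), key=lambda kv: -kv[1])


# Toy corpus (hypothetical data): "tmrw" shares contexts with "tomorrow".
sentences = [
    "see you tomorrow morning",
    "see you tmrw morning",
    "call you tomorrow night",
    "call you tmrw night",
    "see you today morning",
]
vocab = {"see", "you", "call", "tomorrow", "today", "morning", "night"}
word_to_ctx, ctx_to_word = build_bipartite_graph(sentences)
ranked = walk_distribution("tmrw", word_to_ctx, ctx_to_word, vocab)
print(ranked)
```

On this toy corpus `tomorrow` ranks above `today` because it shares both of the contexts in which `tmrw` occurs, while `today` shares only one. Context overlap alone would also pair words that are merely interchangeable in context, which is why this sketch should be read as an illustration of the walk only.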

reference text

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 33-40, Sydney, Australia.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In ACL 2000: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Englewood Cliffs, NJ, USA.
Ye-In Chang, Jiun-Rung Chen, and Min-Tze Hsu. 2010. A hash trie filter method for approximate string matching in genomic databases. Applied Intelligence, 33(1), pages 21-38, Springer US.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, vol. 10, pp. 157-174.
Danish Contractor, Tanveer Faruquie, and Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics, pages 189-196.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC 09: Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 71-78, Boulder, USA.
Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600-609, Portland, Oregon.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 82-90, Edinburgh, Scotland.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pages 368-378, Portland, Oregon, USA.
Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically Constructing a Normalisation Dictionary for Microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), pages 421-432, Jeju Island, Korea.
Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, Rwth Aachen, Alexandra Constantin, Marcello Federico, Nicola Bertoldi, Chris Dyer, Brooke Cowan, Wade Shen, Christine Moran, and Ondrej Bojar. 2007. Moses: Open source toolkit for statistical machine translation.
Thad Hughes and Daniel Ramage. 2007. Lexical semantic relatedness with random graph walks. In Proceedings of Conference on Empirical Methods in Natural Language Processing EMNLP, pp. 581-589, Prague.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 19-24, Portland, Oregon.
Dan Melamed. 1999. Bitext Maps and Alignment via Pattern Recognition. Computational Linguistics, 25, pages 107-130.
Einat Minkov and William Cohen. 2012. Graph Based Similarity Measures for Synonym Extraction from Parsed Text. In Proceedings of the TextGraphs workshop.
J. Norris. 1997. Markov Chains. Cambridge University Press.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL-2002: 40th Annual meeting of the Association for Computational Linguistics, pages 311-318.
Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words.
Xu Sun, Jianfeng Gao, Daniel Micol, and Chris Quirk. 2010. Learning Phrase-Based Spelling Error Models from Clickthrough Data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 266-274, Sweden.
Martin Szummer and Tommi. 2002. Partially labeled classification with markov random walks. In Advances in Neural Information Processing Systems, pages 945-952.
Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL, pages 144-151, Philadelphia, USA.
Justin Zobel and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In Proceedings of the Eighteenth ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 166-173, Zurich, Switzerland.