acl acl2013 acl2013-37 acl2013-37-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li
Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform the best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.
AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In ACL, pages 33–40.
Zhuowei Bao, Benny Kimelfeld, and Yunyao Li. 2011. A graph approach to spelling correction in domain-centric search. In ACL, pages 905–914.
Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In ACL, pages 770–779.
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An algebraic approach to declarative information extraction. In ACL, pages 128–137.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. IJDAR, 10(3-4):157–174.
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, pages 1–8.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In CALC, pages 71–78.
Mark Davies. 2008-. The corpus of contemporary American English: 450 million words, 1990-present. Available online at: http://corpus.byu.edu/coca/.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In ACL, pages 368–378.
Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In EMNLP-CoNLL, pages 421–432.
Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are two metaphors better than one? In COLING, pages 441–448.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In ACL, pages 71–76.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In ACL, pages 1035–1044.
Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC-06, pages 449–454.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
Deana Pennell and Yang Liu. 2010. Normalization of text messages for text-to-speech. In ICASSP, pages 4842–4845.
Deana Pennell and Yang Liu. 2011. A character-level machine translation approach for normalization of SMS abbreviations. In IJCNLP, pages 974–982.
Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in Tweets: An experimental study. In EMNLP, pages 1524–1534.
Richard Sproat, Alan W. Black, Stanley F. Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech & Language, 15(3):287–333.
Zhenzhen Xue, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Analyzing Microtext, volume WS-11-05 of AAAI Workshops.