9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization

Source: pdf

Author: Yi Yang ; Jacob Eisenstein

Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximum-likelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
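
As a rough illustration of what "a log-linear model, permitting arbitrary features" means in practice, the sketch below scores a few non-standard candidate spellings against one standard token and normalizes the scores into a probability distribution. The feature names, the weights, the candidate list, and the direction of conditioning are all assumptions made for this example; they are not taken from the paper or from UNLOL.

```python
import math


def features(source, target):
    """Hypothetical string-pair features (illustrative, not the paper's feature set)."""
    return {
        "exact_match": 1.0 if source == target else 0.0,
        "same_first_char": 1.0 if source[:1] == target[:1] else 0.0,
        "length_diff": float(abs(len(source) - len(target))),
    }


def score(weights, source, target):
    """Dot product of the weight vector with the feature vector."""
    feats = features(source, target)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def conditional(weights, target, candidates):
    """P(source | target) over a finite candidate set: a softmax of the scores."""
    scores = [score(weights, s, target) for s in candidates]
    top = max(scores)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)
    return {s: e / z for s, e in zip(candidates, exps)}


# Hand-picked weights for the example; in the paper the weights are learned.
weights = {"exact_match": 2.0, "same_first_char": 1.0, "length_diff": -0.5}
probs = conditional(weights, "you", ["you", "u", "yu", "ya"])
```

Here the normalizer is a sum over four candidates, which is trivial. The abstract's point is that the real label space is far too large for exact dynamic programming, which is what motivates sequential Monte Carlo training.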

reference text

S. Argamon, M. Koppel, J. Pennebaker, and J. Schler. 2007. Mining the blogosphere: age, gender, and the varieties of self-expression. First Monday, 12(9).
AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of ACL, pages 33–40.
Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing sms messages. In Proceedings of ACL, pages 770–779.
Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582–590.
Samuel Brody and Nicholas Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP.
John D. Burger, John C. Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on twitter. In Proceedings of EMNLP.
Olivier Cappe, Simon J. Godsill, and Eric Moulines. 2007. An overview of existing methods and recent advances in sequential monte carlo. Proceedings of the IEEE, 95(5):899–924, May.
M. Choudhury, R. Saraf, V. Jain, A. Mukherjee, S. Sarkar, and A. Basu. 2007a. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10(3):157–174.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007b. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition (IJDAR), 10(3-4):157–174.
Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189–196.
Paul Cook and Suzanne Stevenson. 2009. An unsupervised model for text message normalization. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, CALC ’09, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.
A. Doucet, N.J. Gordon, and V. Krishnamurthy. 2001. Particle filters for state estimation of jump markov linear systems. Trans. Sig. Proc., 49(3):613–624, March.
Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of EMNLP.
Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of ACL.
Jacob Eisenstein. 2013a. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media.
Jacob Eisenstein. 2013b. What to do about bad language on the internet. In Proceedings of NAACL, pages 359–369.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for twitter: annotation, features, and experiments. In Proceedings of ACL.
Simon J. Godsill, Arnaud Doucet, and Mike West. 2004. Monte carlo smoothing for non-linear time series. In Journal of the American Statistical Association, pages 156–168.
Stephan Gouws, Dirk Hovy, and Donald Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First Workshop on Unsupervised Learning in NLP, EMNLP ’11.
Lisa J. Green. 2002. African American English: A Linguistic Introduction. Cambridge University Press, September.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368–378.
Bo Han, Paul Cook, and Timothy Baldwin. 2013. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5.
Hany Hassan and Arul Menezes. 2013. Social text normalization using contextual graph random walks. In Proceedings of ACL.
Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing sms: are two metaphors better than one? In Proceedings of COLING, pages 441–448.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.
John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. The Journal of Machine Learning Research, 10:777–801.
D. D. Lee and H. S. Seung. 2001. Algorithms for Non-Negative Matrix Factorization. In Advances in Neural Information Processing Systems (NIPS), volume 13, pages 556–562.
Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, deletion, or substitution?: normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71–76.
Fei Liu, Fuliang Weng, and Xiao Jiang. 2012a. A broad-coverage normalization system for social media language. In Proceedings of ACL, pages 1035–1044.
Xiaohua Liu, Ming Zhou, Xiangyang Zhou, Zhongyang Fu, and Furu Wei. 2012b. Joint inference of named entity recognition and normalization for tweets. In Proceedings of ACL.
Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. The edinburgh twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25–26.
Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 354–362, Stroudsburg, PA, USA. Association for Computational Linguistics.
R. Sproat, A.W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards. 2001. Normalization of non-standard words. Computer Speech & Language, 15(3):287–333.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP, pages 901–904.
Sali Tagliamonte and Rosalind Temple. 2005. New perspectives on an ol’ variable: (t,d) in british english. Language Variation and Change, 17:281–302, September.
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, and Yunyao Li. 2013. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159–1168.