emnlp emnlp2011 emnlp2011-98 emnlp2011-98-reference knowledge-graph by maker-knowledge-mining

98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

Source: pdf

Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni

Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http : / / github .com/ aritt er /twitte r_nlp

reference text

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA. To appear. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res. Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled sata with co-training. In COLT, pages 92–100. Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Classbased n-gram models of natural language. Comput. Linguist. Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’ 10. Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. 1993. Equations for part-ofspeech tagging. In AAAI, pages 784–789. Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing. Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In Proceedings of the 20th international joint conference on Artifical intelligence. Doug Downey, Oren Etzioni, and Stephen Soderland. 2010. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif. Intell., 174(1 1):726–748. Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09. Oren Etzioni, Michael Cafarella, Doug Downey, AnaMaria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Text Language Data With Amazon ’s Mechanical Turk. Association for Computational Linguistics, June. 1533 Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05. Radu Florian. 2002. Named entity recognition as a house of cards: classifier stacking. In Proceedings of the 6th conference on Natural language learning - Volume 20, COLING-02. Eric N. Forsythand and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In Proceedings of the International Conference on Semantic Computing. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for twitter: Annotation, features, and experiments. In ACL. Joshua T. Goodman. 2001. A bit of progress in language modeling. Technical report, Microsoft Research. Stephan Gouws, Donald Metzler, Congxing Cai, and Eduard Hovy. 2011. Contextual bearing on linguistic variation in social media. In ACL Workshop on Language in Social Media, Portland, Oregon, USA. To appear. T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, April. Mark Hachman. 2011. Humanity’s tweets: Just 20 terabytes. In PCMAG.COM. Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA. To appear. Catherine Kobus, Fran ¸cois Yvon, and G ´eraldine Damnati. 2008. Normalizing sms: are two metaphors better than one ? In COLING, pages 441–448. Zornitsa Kozareva and Eduard H. Hovy. 2010. Not all seeds are equal: Measuring the quality of text mining seeds. In HLT-NAACL. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. Recognizing named entities in tweets. In ACL. Brian Locke and James Martin. 2009. Named entity recognition: Adapting to microblogging. In Senior Thesis, University of Colorado. Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of english: The penn treebank. Computational Linguistics. Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. In ht tp: ://mal l . et cs . uma s s .e du. Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 10. Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 443–450, Morristown, NJ, USA. Association for Computational Linguistics. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009. Christoph M ¨uller and Michael Strube. 2006. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn, and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt a.M., Germany. Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP ’09, pages 248–256, Morristown, NJ, USA. Association for Computational Linguistics. Christina Sauper, Aria Haghighi, and Regina Barzilay. 2010. Incorporating content structure into text analysis applications. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 10, pages 377–387, Morristown, NJ, USA. Association for Computational Linguistics. Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the NorthAmerican Chapter ofthe Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03. Sameer Singh, Dustin Hillard, and Chris Leggetter. 2010. Minimally-supervised extraction of entities from text advertisements. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). 1534 Charles Sutton. 2004. Collective segmentation and labeling of distant entities in information extraction. Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1473–1481 . Association for Computational Linguistics. Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the conll-2000 shared task: chunking. In Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, ConLL ’00. Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’ 10. Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceed- ings of the 33rd annual meeting on Association for Computational Linguistics, ACL ’95.