emnlp emnlp2011 emnlp2011-41 emnlp2011-41-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John D. Burger; John Henderson; George Kim; Guido Zarrella
Abstract: Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show the degree to which classifier accuracy varies based on tweet volumes as well as when various kinds of profile metadata are included in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly out-perform both baseline models and almost all humans on the same task.
Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2007. Mining the blogosphere: Age, gender, and the varieties of self-expression. First Monday, 12(9), September.
John D. Burger and John C. Henderson. 2006. An exploration of observable features related to blogger age. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. AAAI Press.
Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10. Association for Computational Linguistics.
Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
A.P. Dawid and A.M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1).
Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Conference on Empirical Methods on Natural Language Processing.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
Bill Heil and Mikolaj Jan Piskorski. 2009. New Twitter research: Men follow men and nobody tweets. Harvard Business Review, June 1.
Susan C. Herring, Inna Kouper, Lois Ann Scheidt, and Elijah L. Wright. 2004. Women and children last: The discursive construction of weblogs. In L. Gurak, S. Antonijevic, L. Johnson, C. Ratliff, and J. Reyman, editors, Into the Blogosphere: Rhetoric, Community, and Culture of Weblogs. http://blog.lib.umn.edu/blogosphere/.
David Huffaker. 2004. Gender similarities and differences in online identity and language use among teenage bloggers. Master’s thesis, Georgetown University. http://cct.georgetown.edu/thesis/DavidHuffaker.pdf.
Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the Second Human Computation Workshop (KDD-HCOMP 2010).
Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, April.
Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Claire Cain Miller. 2010. Why Twitter’s C.E.O. demoted himself. New York Times, October 30. http://www.nytimes.com/2010/10/31/technology/31ev.html.
Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October. Association for Computational Linguistics.
Sasa Petrovic, Miles Osborne, and Victor Lavrenko. 2010. The Edinburgh Twitter corpus. In Computational Linguistics in a World of Social Media. AAAI Press. Workshop at NAACL.
Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In 2nd International Workshop on Search and Mining User-Generated Content. ACM.
Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James Pennebaker. 2006. Effects of age and gender on blogging. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. AAAI Press, March.
Robin Wauters. 2010. Only 50% of Twitter messages are in English, study says. TechCrunch, February 1. http://techcrunch.com/2010/02/24/twitter-languages/.