Using Conceptual Class Attributes to Characterize Social Media Users (ACL 2013, paper 373)


Source: pdf

Authors: Shane Bergsma; Benjamin Van Durme

Abstract: We describe a novel approach for automatically predicting the hidden demographic properties of social media users. Building on prior work in common-sense knowledge acquisition from third-person text, we first learn the distinguishing attributes of certain classes of people. For example, we learn that people in the Female class tend to have maiden names and engagement rings. We then show that this knowledge can be used in the analysis of first-person communication; knowledge of distinguishing attributes allows us to both classify users and to bootstrap new training examples. Our novel approach enables substantial improvements on the widely-studied task of user gender prediction, obtaining a 20% relative error reduction over the current state-of-the-art.


reference text

Enrique Alfonseca, Marius Paşca, and Enrique Robledo-Arnuncio. 2010. Acquisition of instance attributes via labeled and related instances. In Proc. SIGIR, pages 58–65.
Abdulrahman Almuhareb and Massimo Poesio. 2004. Attribute-based and value-based clustering: An evaluation. In Proc. EMNLP, pages 158–165.
Kedar Bellare, Partha P. Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, and Mark Dredze. 2007. Lightly-Supervised Attribute Extraction. In NIPS Workshop on Machine Learning for Web Search.
Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proc. Coling-ACL, pages 33–40.
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pages 65–74.
Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, and David Yarowsky. 2013. Broadly improving user classification via communication-based name and location clustering on Twitter. In Proc. NAACL.
Matthew Berland and Eugene Charniak. 1999. Finding parts in very large corpora. In Proc. ACL, pages 57–64.
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.
John D. Burger and John C. Henderson. 2006. An exploration of observable features related to blogger age. In Proc. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 15–20.
John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proc. EMNLP, pages 1301–1309.
Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: exploring the power of tables on the web. Proc. PVLDB, 1(1):538–549.
Kenneth W. Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1).
Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26.
Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proc. ACL, pages 1365–1374.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874.
Clayton Fink, Jonathon Kopecky, Nathan Bos, and Max Thomas. 2012. Mapping the Twitterverse in the developing world: An analysis of social media use in Nigeria. In Proc. International Conference on Social Computing, Behavioral Modeling, and Prediction, pages 164–171.
John L. Fischer. 1968. Social influences on the choice of a linguistic variant. Word, 14:47–56.
Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proc. ACL-IJCNLP, pages 710–718.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28:245–288.
Roxana Girju, Adriana Badulescu, and Dan Moldovan. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. Coling, pages 539–545.
Emre Kiciman. 2010. Language differences and metadata features on Twitter. In Proc. SIGIR 2010 Web N-gram Workshop, pages 47–51.
Moshe Koppel and Jonathan Schler. 2004. Authorship verification as a one-class classification problem. In Proc. ICML, pages 489–495.
Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.
Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author’s native language by mining a text for errors. In Proc. KDD, pages 624–628.
William Labov. 1972. Sociolinguistic Patterns. University of Pennsylvania Press.
Douglas B. Lenat, R. V. Guha, Karen Pittman, Dexter Pratt, and Mary Shepherd. 1990. CYC: toward programs with common sense. Commun. ACM, 33(8):30–49.
Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proc. ACL-IJCNLP, pages 1030–1038.
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, and Sushant Narsale. 2010. New tools for web-scale N-grams. In Proc. LREC, pages 2221–2227.
John McCarthy. 1959. Programs with common sense. In Proc. Teddington Conference on the Mechanization of Thought Processes, pages 75–91. London: Her Majesty’s Stationery Office.
Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.
George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4).
Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proc. EMNLP, pages 207–217.
Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proc. ICWSM, pages 122–129.
Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proc. of NAACL.
Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: leveraging generic patterns for automatically harvesting semantic relations. In Proc. Coling-ACL, pages 113–120.
Marius Paşca and Benjamin Van Durme. 2007. What you seek is what you get: extraction of class attributes from query logs. In Proc. IJCAI, pages 2832–2837.
Marius Paşca and Benjamin Van Durme. 2008. Weakly-supervised acquisition of open-domain classes and class attributes from web documents and query logs. In Proc. ACL-08: HLT, pages 19–27.
Michael Paul and Mark Dredze. 2011. You are what you tweet: Analyzing Twitter for public health. In Proc. ICWSM, pages 265–272.
Marco Pennacchiotti and Ana-Maria Popescu. 2011. A machine learning approach to Twitter user classification. In Proc. ICWSM, pages 281–288.
Xuan-Hieu Phan. 2006. CRFTagger: CRF English POS Tagger. crftagger.sourceforge.net.
Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In Proc. International Workshop on Search and Mining User-Generated Contents, pages 37–44.
Delip Rao, Michael Paul, Clay Fink, David Yarowsky, Timothy Oates, and Glen Coppersmith. 2011. Hierarchical Bayesian models for latent attribute detection in social media. In Proc. ICWSM, pages 598–601.
Joseph Reisinger and Marius Paşca. 2009. Latent variable models of concept-attribute attachment. In Proc. ACL-IJCNLP, pages 620–628.
Stephen D. Richardson, William B. Dolan, and Lucy Vanderwende. 1998. MindNet: Acquiring and structuring semantic information from text. In Proc. ACL-Coling, pages 1098–1102.
Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In Proc. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 199–205.
Lenhart Schubert. 2002. Can we derive general world knowledge from texts? In Proc. HLT, pages 84–87.
Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Comput. Surv., 34:1–47.
Ian Soboroff, Dean McCullough, Jimmy Lin, Craig Macdonald, Iadh Ounis, and Richard McCreadie. 2012. Evaluating real-time search over tweets. In Proc. ICWSM.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4).
Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proc. NAACL-HLT, pages 477–487.
Kosuke Tokunaga, Jun’ichi Kazama, and Kentaro Torisawa. 2005. Automatic discovery of attribute words from web documents. In Proc. IJCNLP, pages 106–118.
Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. ACL, pages 417–424.
Benjamin Van Durme, Ting Qian, and Lenhart Schubert. 2008. Class-driven attribute extraction. In Proc. Coling, pages 921–928.
Benjamin Van Durme. 2012. Streaming analysis of discourse participants. In Proc. EMNLP-CoNLL, pages 48–58.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.
Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proc. CIKM, pages 41–50.
Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using “annotator rationales” to improve machine learning for text categorization. In Proc. NAACL-HLT.