61 emnlp-2010-Improving Gender Classification of Blog Authors


Source: pdf

Author: Arjun Mukherjee; Bing Liu

Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams, for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.


reference text

Agrawal, R. and Srikant, R. 1994. Fast Algorithms for Mining Association Rules. VLDB. pp. 487-499.
Argamon, S., Koppel, M., Fine, J., Shimoni, A. R. 2003. Gender, genre, and writing style in formal written texts. Text-Interdisciplinary Journal, 2003.
Argamon, S., Koppel, M., Pennebaker, J. W., Schler, J. 2007. Mining the Blogosphere: Age, Gender and the varieties of self-expression. First Monday, 2007.
Baayen, H., van Halteren, H., Tweedie, F. 1996. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11, 1996.
Blum, A. and Langley, P. 1997. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245-271.
BookBlog, Gender Genie, Copyright 2003-2007. http://www.bookblog.net/gender/genie.html
Borgelt, C. 2003. Bayes Classifier Induction. http://www.borgelt.net/doc/bayes/bayes.html
Chung, C. K. and Pennebaker, J. W. 2007. Revealing people’s thinking in natural language: Using an automated meaning extraction method in open-ended self-descriptions. J. of Research in Personality.
Corney, M., Vel, O., Anderson, A., Mohay, G. 2002. Gender Preferential Text Mining of E-mail Discourse. 18th annual Computer Security Applications Conference (ACSAC), 2002.
Dean, J. and Ghemawat, S. 2004. Mapreduce: Simplified data processing on large clusters. Operating Systems Design and Implementation, 2004.
Forman, G. 2003. An extensive empirical study of feature selection metrics for text classification. JMLR, 3:1289-1306, 2003.
Garganté, R. A., Marchiori, T. E., and Kowalczyk, S. R. W. 2007. A Genetic Algorithm to Ensemble Feature Selection. Masters Thesis. Vrije Universiteit, Amsterdam.
Gefen, D., Straub, D. W. 1997. Gender differences in the perception and use of e-mail: An extension to the technology acceptance model. MIS Quart. 21(4) 389–400.
Herring, S. C., & Paolillo, J. C. 2006. Gender and genre variation in weblogs. Journal of Sociolinguistics, 10(4), 439-459.
Heylighen, F., and Dewaele, J. 2002. Variation in the contextuality of language: an empirical measure. Foundations of Science, 7, 293–340.
Houvardas, J. and Stamatatos, E. 2006. N-gram Feature Selection for Authorship Identification. Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, pp. 77-86.
Joachims, T. 1999. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999.
Joachims, T. 1997. Text categorization with support vector machines. Technical report, LS VIII Number 23, University of Dortmund, 1997.
Kohavi, R. and John, G. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324.
Koppel, M., Argamon, S., Shimoni, A. R. 2002. Automatically Categorizing Written Text by Author Gender. Literary and Linguistic Computing.
Krawetz, N. 2006. Gender Guesser. Hacker Factor Solutions. http://www.hackerfactor.com/GenderGuesser.html
Mladenic, D. 1998. Feature subset selection in text learning. In Proc. of ECML-98, pp. 95–100.
Mladenic, D. and Grobelnik, D. 1998. Feature selection for classification based on text hierarchy. Proceedings of the Workshop on Learning from Text and the Web, 1998.
Nowson, S., Oberlander, J., Gill, A. J. 2005. Gender, Genres, and Individual Differences. In Proceedings of the 27th annual meeting of the Cognitive Science Society (p. 1666–1671). Stresa, Italy.
Riloff, E., Patwardhan, S., Wiebe, J. 2006. Feature Subsumption for opinion Analysis. EMNLP.
Rogati, M. and Yang, Y. 2002. High performing and scalable feature selection for text classification. In CIKM, pp. 659-661, 2002.
Schiffman, H. 2002. Bibliography of Gender and Language. http://ccat.sas.upenn.edu/~haroldfs/popcult/bibliogs/gender/genbib.htm
Schler, J., Koppel, M., Argamon, S., and Pennebaker, J. 2006. Effects of age and gender on blogging. In Proc. of the AAAI Spring Symposium Computational Approaches to Analyzing Weblogs.
Silva, J., Dias, F., Guillore, S., Lopes, G. 1999. Using LocalMaxs Algorithm for the Extraction of Contiguous and Noncontiguous Multiword Lexical Units. Springer Lecture Notes in AI 1695, 1999.
Srikant, R. and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), Avignon, France.
Tannen, D. 1990. You just don’t understand. New York: Ballantine.
Tsuruoka, Y. and Tsujii, J. 2005. Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. HLT/EMNLP 2005, pp. 467-474.
Tuv, E., Borisov, A., Runger, G., and Torkkola, K. 2009. Feature selection with ensembles, artificial variables, and redundancy elimination. JMLR, 10.
Yan, X., Yan, L. 2006. Gender Classification of Weblog Authors. Computational Approaches to Analyzing Weblogs, AAAI.