
97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

Source: pdf

Author: Jacob Eisenstein; Noah A. Smith; Eric P. Xing

Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
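The row-sparsity effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' optimization procedure. It assumes a coefficient matrix stored as a list of rows (one row per feature, one column per output) and uses hypothetical helper names (`l1_inf_norm`, `project_l1_ball`, `prox_linf`, `prox_l1_inf`). The ℓ1,∞ norm sums, over rows, the largest absolute coefficient in each row. Its proximal step is computed here through the standard Moreau decomposition (subtract the projection onto an ℓ1 ball), with a sort-based projection of the kind described by Duchi et al. (2008). A row whose ℓ1 norm is at most the penalty weight is set to zero in its entirety.

```python
def l1_inf_norm(B):
    """Sum over rows of the largest absolute coefficient in each row."""
    return sum(max(abs(b) for b in row) for row in B)


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : sum(|x_i|) <= radius} (sort-based)."""
    if radius <= 0:
        return [0.0 for _ in v]
    mags = [abs(x) for x in v]
    if sum(mags) <= radius:
        return list(v)  # already inside the ball
    cumulative = 0.0
    theta = 0.0
    # Find the soft-threshold theta from the magnitudes in decreasing order.
    for k, m in enumerate(sorted(mags, reverse=True), start=1):
        cumulative += m
        candidate = (cumulative - radius) / k
        if m > candidate:
            theta = candidate
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - theta, 0.0) for x in v]


def prox_linf(v, lam):
    """Proximal operator of lam * max_i |v_i| via Moreau decomposition."""
    proj = project_l1_ball(v, lam)
    return [x - p for x, p in zip(v, proj)]


def prox_l1_inf(B, lam):
    """Apply the l-infinity proximal step independently to every row of B."""
    return [prox_linf(row, lam) for row in B]


# Toy coefficients (illustrative values, not data from the paper).
B = [
    [0.1, -0.2, 0.05],  # weak feature: l1 norm 0.35 is below lam
    [3.0, -1.0, 0.5],   # strong feature: only its largest entry shrinks
]
B_sparse = prox_l1_inf(B, lam=0.5)
zero_rows = [i for i, row in enumerate(B_sparse) if all(x == 0 for x in row)]
```

In this toy case the first row is removed as a unit, while the second row keeps all of its outputs and only its largest coefficient shrinks. This is the sense in which the penalty selects whole features across outputs, where an element-wise penalty would zero individual coefficients.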

reference text

Henrietta J. Cedergren and David Sankoff. 1974. Variable rules: Performance as a statistical reflection of competence. Language, 50(2):333–355.
Jonathan Chang, Itamar Rosenn, Lars Backstrom, and Cameron Marlow. 2010. ePluribus: Ethnicity on social networks. In Proceedings of ICWSM.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 2008. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of ICML.
Kevin Duh, Katsuhito Sudoh, Hajime Tsukada, Hideki Isozaki, and Masaaki Nagata. 2010. n-best reranking by multitask learning. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics.
Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model of geographic lexical variation. In Proceedings of EMNLP.
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
Joshua Goodman. 2004. Exponential priors for maximum entropy models. In Proceedings of NAACL-HLT.
Daniel E. Johnson. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass, 3(1):359–383.
Jun’ichi Kazama and Jun’ichi Tsujii. 2003. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of EMNLP.
William Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics.
Han Liu, Mark Palatucci, and Jian Zhang. 2009. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of ICML.
John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.
Brendan O’Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith. 2010. A mixture model of demographic lexical variation. In Proceedings of NIPS Workshop on Machine Learning in Computational Social Science.
Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. 2009. An efficient projection for ℓ1,∞ regularization. In Proceedings of ICML.
John R. Rickford. 1999. African American Vernacular English. Blackwell.
Stefan Riezler and Alexander Vasserman. 2004. Incremental feature selection and ℓ1 regularization for relaxed maximum-entropy modeling. In Proceedings of EMNLP.
Gerard Rushton, Marc P. Armstrong, Josephine Gittler, Barry R. Greene, Claire E. Pavlik, Michele M. West, and Dale L. Zimmerman, editors. 2008. Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control, Research, and Practice. CRC Press.
Hinrich Schütze and Jan Pedersen. 1993. A vector model for syntagmatic and paradigmatic relatedness. In Proceedings of the 9th Annual Conference of the UW Centre for the New OED and Text Research.
Aaron Smith and Lee Rainie. 2010. Who tweets? Technical report, Pew Research Center, December.
Berwin A. Turlach, William N. Venables, and Stephen J. Wright. 2005. Simultaneous variable selection. Technometrics, 47(3):349–363.
Larry Wasserman and Kathryn Roeder. 2009. High-dimensional variable selection. Annals of Statistics, 37(5A):2178–2201.
Larry Wasserman. 2003. All of Statistics: A Concise Course in Statistical Inference. Springer.
Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. 2010. Screen and clean: A tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34(3):275–285.
Qing Zhang. 2005. A Chinese yuppie in Beijing: Phonological variation and the construction of a new professional identity. Language in Society, 34:431–466.
Kathryn Zickuhr and Aaron Smith. 2010. 4% of online Americans use location-based services. Technical report, Pew Research Center, November.