
6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation


Source: pdf

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.
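Predicting an author's location from text implies some way of measuring how far a predicted coordinate falls from the true one. The reference list cites Sinnott (1984) on the haversine formula, so a minimal sketch of that great-circle distance is given here. The function names, the Earth-radius constant, and the use of mean error as a summary are illustrative assumptions, not details taken from the paper.

```python
import math

# Mean Earth radius in kilometres (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def mean_error_km(predicted, actual):
    """Average haversine distance over paired (lat, lon) tuples.

    Hypothetical helper: one possible way to summarize prediction error.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(haversine_km(p[0], p[1], a[0], a[1])
                for p, a in zip(predicted, actual))
    return total / len(predicted)
```

A call such as `mean_error_km([(40.7, -74.0)], [(34.1, -118.2)])` returns the distance in kilometres between one predicted and one true coordinate pair.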


reference text

L. Backstrom, J. Kleinberg, R. Kumar, and J. Novak. 2008. Spatial variation in search engine queries. In Proceedings of WWW.
C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
D. M. Blei and M. I. Jordan. 2006. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.
D. M. Blei and J. Lafferty. 2006a. Correlated topic models. In NIPS.
D. M. Blei and J. Lafferty. 2006b. Dynamic topic models. In Proceedings of ICML.
D. M. Blei and J. D. McAuliffe. 2007. Supervised topic models. In NIPS.
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
M. Bucholtz, N. Bermudez, V. Fung, L. Edwards, and R. Vargas. 2007. Hella Nor Cal or totally So Cal? The perceptual dialectology of California. Journal of English Linguistics, 35(4):325–352.
F. G. Cassidy and J. H. Hall. 1985. Dictionary of American Regional English, volume 1. Harvard University Press.
J. Chambers. 2009. Sociolinguistic Theory: Linguistic Variation and its Social Significance. Blackwell.
D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. 2009. Mapping the world’s photos. In Proceedings of WWW, pages 761–770.
J. Friedman, T. Hastie, and R. Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).
D. E. Johnson. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass, 3(1):359–383.
B. Johnstone. 2010. Language and place. In R. Mesthrie and W. Wolfram, editors, Cambridge Handbook of Sociolinguistics. Cambridge University Press.
M. Joshi, D. Das, K. Gimpel, and N. A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proceedings of NAACL-HLT.
H. Kurath. 1949. A Word Geography of the Eastern United States. University of Michigan Press.
H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of WWW.
W. Labov, S. Ash, and C. Boberg. 2006. The Atlas of North American English: Phonetics, Phonology, and Sound Change. Walter de Gruyter.
W. Labov. 1966. The Social Stratification of English in New York City. Center for Applied Linguistics.
Q. Mei, C. Liu, H. Su, and C. X. Zhai. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of WWW, page 542.
Q. Mei, X. Ling, M. Wondra, H. Su, and C. X. Zhai. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of WWW.
T. P. Minka. 2003. Estimating a Dirichlet distribution. Technical report, Massachusetts Institute of Technology.
J. Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1).
B. O’Connor, M. Krieger, and D. Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of ICWSM.
J. C. Paolillo. 2002. Analyzing Linguistic Variation: Statistical Models and Methods. CSLI Publications.
M. Paul and R. Girju. 2010. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Proceedings of AAAI.
W. D. Penny. 2001. Variational Bayes for d-dimensional Gaussian mixture models. Technical report, University College London.
D. Sankoff, S. A. Tagliamonte, and E. Smith. 2005. Goldvarb X: A variable rule application for Macintosh and Windows. Technical report, Department of Linguistics, University of Toronto.
R. W. Sinnott. 1984. Virtues of the Haversine. Sky and Telescope, 68(2).
B. Szmrecsanyi. 2010. Geography is overrated. In S. Hansen, C. Schwarz, P. Stoeckle, and T. Streck, editors, Dialectological and Folk Dialectological Concepts of Space. Walter de Gruyter.
S. A. Tagliamonte and D. Denis. 2008. Linguistic ruin? LOL! Instant messaging and teen language. American Speech, 83.
S. A. Tagliamonte. 2006. Analysing Sociolinguistic Variation. Cambridge University Press.
M. J. Wainwright and M. I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers.
E. P. Xing. 2005. On topic evolution. Technical Report 05-115, Center for Automated Learning and Discovery, Carnegie Mellon University.