
106 emnlp-2011-Predicting a Scientific Community's Response to an Article

Source: pdf

Author: Dani Yogatama ; Michael Heilman ; Brendan O'Connor ; Chris Dyer ; Bryan R. Routledge ; Noah A. Smith

Abstract: We consider the problem of predicting measurable responses to scientific articles based primarily on their text content. Specifically, we consider papers in two fields (economics and computational linguistics) and make predictions about downloads and within-community citations. Our approach is based on generalized linear models, allowing interpretability; a novel extension that captures first-order temporal effects is also presented. We demonstrate that text features significantly improve accuracy of predictions over metadata features like authors, topical categories, and publication venues.

reference text

A. Ahmed and E. P. Xing. 2010. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In Proc. of UAI.
A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
S. Bethard and D. Jurafsky. 2010. Who should I cite? Learning literature search models from citation behavior. In Proc. of CIKM.
D. Blei and J. Lafferty. 2006. Dynamic topic models. In Proc. of ICML.
K. Borner, C. Chen, and K. Boyack. 2003. Visualizing knowledge domains. In B. Cronin, editor, Annual Review of Information Science and Technology, volume 37, pages 179–255. Information Today, Inc.
G. Box, G. M. Jenkins, and G. Reinsel. 2008. Time Series Analysis: Forecasting and Control. Wiley Series in Probability and Statistics.
S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
J. Brank and J. Leskovec. 2003. The download estimation task on KDD Cup 2003. SIGKDD Explorations, 5(2):160–162.
A. Cameron and P. Trivedi. 1998. Regression Analysis of Count Data. Cambridge University Press.
E. Erosheva, S. Fienberg, and J. Lafferty. 2004. Mixed membership models of scientific publications. In Proc. of PNAS.
S. Gerrish and D. M. Blei. 2010. A language-based approach to measuring scholarly impact. In Proc. of ICML.
D. Hall, D. Jurafsky, and C. D. Manning. 2008. Studying the history of ideas using topic models. In Proc. of EMNLP.
J. D. Hamilton. 1994. Time Series Analysis. Princeton University Press.
T. Hastie, R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
A. E. Hoerl and R. W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
T. Joachims. 2002. Optimizing search engines using clickthrough data. In Proc. of KDD.
M. Joshi, D. Das, K. Gimpel, and N. A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proc. of HLT-NAACL.
S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith. 2009. Predicting risk from financial reports with regression. In Proc. of HLT-NAACL.
D. C. Liu and J. Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528.
P. McCullagh and A. J. Nelder. 1989. Generalized Linear Models. London: Chapman & Hall.
P. McCullagh. 1980. Regression models for ordinal data. Journal of the Royal Statistical Society B, 42(2):109–142.
A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and D. Jensen. 2003. Exploiting relational structure to understand publication patterns in high-energy physics. SIGKDD Explorations, 5(2):165–172.
J. Michel, Y. Shen, A. Aiden, A. Veres, M. Gray, The Google Books Team, J. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. Nowak, and E. Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
V. Qazvinian and D. R. Radev. 2008. Scientific paper summarization using citation summary networks. In Proc. of COLING.
D. R. Radev, M. T. Joseph, B. Gibson, and P. Muthukrishnan. 2009a. A bibliometric and network analysis of the field of computational linguistics. Journal of the American Society for Information Science and Technology.
D. R. Radev, P. Muthukrishnan, and V. Qazvinian. 2009b. The ACL anthology network corpus. In Proc. of ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries.
D. Ramage, C. D. Manning, and D. A. McFarland. 2010. Which universities lead and lag? Toward university rankings based on scholarly output. In Proc. of NIPS Workshop on Computational Social Science and the Wisdom of the Crowds.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society B, 67(1):91–108.
X. Wang and A. McCallum. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In Proc. of KDD.
C. Wang, D. Blei, and D. Heckerman. 2008. Continuous time dynamic topic models. In Proc. of UAI.