
129 nips-2007-Mining Internet-Scale Software Repositories

Source: pdf

Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi

Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised, probabilistic author-topic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary of program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84, roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.
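
The abstract does not say how CodeRank is computed. Since the reference list cites the PageRank paper [15], the following is a minimal sketch that assumes a PageRank-style power iteration over a directed graph of code entities. The function `pagerank`, the graph `deps`, the entity names, and the damping factor of 0.85 are illustrative assumptions, not details taken from the paper.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Rank nodes of a directed graph by power iteration.

    graph maps each node to the list of nodes it references
    (for code, e.g. the classes it calls or imports).
    """
    # Collect every node, including ones that only appear as targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Nodes with no outgoing edges spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}

        # Every other node splits its rank evenly among its targets.
        for src in nodes:
            targets = graph.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical dependency graph: "App" uses "Parser" and "Logger", and so on.
deps = {
    "App": ["Parser", "Logger"],
    "Parser": ["Logger", "StringUtil"],
    "Logger": ["StringUtil"],
    "StringUtil": [],
}

scores = pagerank(deps)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the widely reused utility ends up ranked highest and the top-level application lowest. A retrieval system could combine such a structural score with a text-match score, although the abstract does not specify how the two are combined.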

reference text

[1] S. Ugurel, R. Krovetz, and C. L. Giles. What’s the code?: automatic classification of source code archives. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 632–638, New York, NY, USA, 2002. ACM Press.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

[3] W. Buntine. Open source search: a data mining platform. SIGIR Forum, 39(1):4–10, 2005.

[4] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 306–315, New York, NY, USA, 2004. ACM Press.

[5] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In ISI, pages 93–104, 2006.

[6] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[7] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI ’04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 487–494, Arlington, Virginia, United States, 2004. AUAI Press.

[8] D. Newman and S. Block. Probabilistic topic decomposition of an eighteenth-century American newspaper. J. Am. Soc. Inf. Sci. Technol., 57(6):753–767, 2006.

[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[10] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl Acad Sci U S A, 101 Suppl 1:5228–5235, April 2004.

[11] A. Schröter, T. Zimmermann, R. Premraj, and A. Zeller. If your bug database could talk... In Proceedings of the 5th International Symposium on Empirical Software Engineering, Volume II: Short Papers and Posters, pages 18–20, September 2006.

[12] E. Brill. Some advances in transformation-based part of speech tagging. In National Conference on Artificial Intelligence, pages 722–727, 1994.

[13] E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191–211, 1992.

[14] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin. Aspect-oriented programming. In Mehmet Akşit and Satoshi Matsuoka, editors, Proceedings European Conference on Object-Oriented Programming, volume 1241, pages 220–242. Springer-Verlag, Berlin, Heidelberg, and New York, 1997.

[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see: http://dbpubs.stanford.edu/pub/1999-66).