nips nips2007 nips2007-129 knowledge-graph by maker-knowledge-mining

129 nips-2007-Mining Internet-Scale Software Repositories


Source: pdf

Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi

Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84, roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 [Abstract] Large repositories of source code create new challenges and opportunities for statistical machine learning. [sent-3, score-0.645]

2 Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. [sent-4, score-0.427]

3 For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. [sent-6, score-0.628]

4 We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. [sent-8, score-0.577]

5 Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0. [sent-10, score-0.584]

6 [1 Introduction] Large repositories of private or public software source code, such as the open source projects available on the Internet, create considerable new opportunities and challenges for statistical machine learning, information retrieval, and software engineering. [sent-17, score-1.204]

7 Mining such repositories is important, for instance, to understand software structure, function, complexity, and evolution, as well as to improve software information retrieval systems and identify relationships between humans and the software they produce. [sent-18, score-0.874]

8 Tools to mine source code for functionality, structural organization, team structure, and developer contributions are also of interest to private industry, where these tools can be applied to such problems as in-house code reuse and project staffing. [sent-19, score-1.413]

9 Mining large software repositories requires leveraging both the textual and structural aspects of software data, as well as any relevant meta data. [sent-22, score-0.644]

10 We then develop and apply unsupervised author-topic probabilistic models to discover the topics embedded in the code and extract topic-word and author-topic distributions. [sent-25, score-0.577]

11 Finally, we leverage the dual textual and graphical nature of software to improve code search and retrieval. [sent-26, score-0.619]

12 [2 Infrastructure and Data] To allow for the Internet-scale analysis of source code we have built Sourcerer, an extensive infrastructure designed for the automated crawling, downloading, parsing, organization, and storage of large software repositories in a relational database. [sent-27, score-1.037]

13 While the infrastructure is general, we apply it here to a sample of projects in Java. [sent-30, score-0.224]

14 Specifically, for the results reported, we download 12,151 projects from Sourceforge and Apache and filter out distributions packaged without source code (binaries only). [sent-31, score-0.676]

15 The end result is a repository consisting of 4,632 projects, containing 244,342 source files, with over 38 million lines of code. [sent-32, score-0.303]

16 For the software author-topic modeling approach we also employ the Eclipse 3.0 code base. [sent-34, score-0.266]

17 Though only a single project, Eclipse is a large, active open source effort that has been widely studied. [sent-36, score-0.234]

18 In this case, we consider 2,119 source files, associated with about 700,000 lines of code, a vocabulary of 15,391 words, and 59 programmers. [sent-37, score-0.234]

19 A complete list of all the projects contained in our repository is available from the supplementary materials web pages. [sent-39, score-0.33]

20 [3 Statistical Analysis] During the parsing process our system performs a static analysis on project source code files to extract code entities and their relationships, storing them in a relational database. [sent-40, score-1.126]

21 For java these entities consist of packages, classes, interfaces, methods, and fields, as well as more specific constructs such as constructors and static initializers. [sent-41, score-0.247]

22 The populated database represents a substantial foundation on which to base statistical analysis of source code. [sent-43, score-0.267]

23 Parsing the multi-project repository described above yields a repository of over 5 million entities organized into 48 thousand packages, 560 thousand classes, and over 3 million methods. [sent-44, score-0.277]

24 Table 1: Frequency of java keyword occurrence. Keywords, most frequent first: public, if, new, return, import, int, null, void, private, static, final, else, throws. [sent-51, score-0.314]

25 Recent techniques include Latent Dirichlet Allocation (LDA), which probabilistically models text documents as mixtures of latent topics, where topics correspond to key concepts presented in the corpus [2] (see also [3]). [sent-107, score-0.353]

26 Author-Topic (AT) modeling is an extension of topic modeling that captures the relationship of authors to topics in addition to extracting the topics themselves. [sent-108, score-0.828]

27 Despite previous work in classifying code based on concepts [1], applications of LDA and AT models have been limited to traditional text corpora such as academic publications, news reports, corporate emails, and historical documents [7, 8]. [sent-111, score-0.407]

28 At the most basic level, however, a code repository can be viewed as a text corpus, where source files are analogous to documents and developers to authors. [sent-112, score-0.794]

29 Though vocabulary, syntax, and conventions differentiate a programming language from a natural language, the tokens present in a source file are still indicative of its function (i.e. [sent-113, score-0.234]

30 Thus here we develop and apply probabilistic AT models to software data. [sent-115, score-0.232]

31 As in [7], our model assumes that each topic t is associated with a multinomial distribution φ•t over words w, and each author a is associated with a multinomial distribution θ•a over topics. [sent-122, score-0.406]

32 Given a document d containing Nd words with known authors, in generative mode each word is assigned to one of the authors a of the document uniformly, then the corresponding θ•a is sampled to derive a topic t, and finally the corresponding φ•t is sampled to derive a word w. [sent-124, score-0.407]
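
To make the generative process concrete, the following is a minimal sketch in Python with numpy; the array layouts and names (theta as a topics-by-authors matrix, phi as a words-by-topics matrix) are illustrative assumptions, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_document(authors, theta, phi, n_words):
        """Sample the words of one document under the Author-Topic model.

        authors : indices of the document's known authors
        theta   : (n_topics, n_authors) author-topic distributions, columns sum to 1
        phi     : (n_vocab, n_topics) topic-word distributions, columns sum to 1
        """
        words = []
        for _ in range(n_words):
            a = rng.choice(authors)                        # author chosen uniformly
            t = rng.choice(theta.shape[0], p=theta[:, a])  # topic t sampled from theta_.a
            w = rng.choice(phi.shape[0], p=phi[:, t])      # word w sampled from phi_.t
            words.append(w)
        return words

Inference inverts this process, typically via Gibbs sampling as in [7], to recover theta and phi from the observed words and author lists.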

33 Once the data is obtained, applying this basic AT model to software requires the development of several tools to facilitate the processing and modeling of source code. [sent-129, score-0.5]

34 In addition to the crawling infrastructure described above, the primary functions of the remaining tools are to extract and resolve author names from source code, as well as convert the source code to the bag-of-words format. [sent-130, score-1.26]

35 The author-document matrix is binary, with entry [i,j]=1 if author i contributed to document j, and 0 otherwise. [sent-133, score-0.25]
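
As a sketch, such a matrix can be built directly from per-document author lists; the names and types below are illustrative.

    import numpy as np

    def author_document_matrix(doc_authors, author_index):
        """Binary matrix A with A[i, j] = 1 iff author i contributed to document j.

        doc_authors  : list of author-name sets, one per document
        author_index : dict mapping a resolved author name to its row index
        """
        A = np.zeros((len(author_index), len(doc_authors)), dtype=np.int8)
        for j, names in enumerate(doc_authors):
            for name in names:
                A[author_index[name], j] = 1
        return A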

36 Extracting author information is ultimately a matter of tokenizing the code and associating developer names with file (document) names when this information is available. [sent-134, score-1.154]

37 This process is further simplified for java software due to the prevalence of javadoc tags, which present this metadata in the form of attribute-value pairs. [sent-135, score-0.413]

38 Inspection of the Eclipse 3.0 code base, however, shows that most source files are credited to “The IBM Corporation” rather than to specific developers. [sent-137, score-0.542]

39 Thus, to generate a list of authors for specific source files, we parsed the Eclipse bug data available in [11]. [sent-138, score-0.486]

40 After pruning files not associated with any author, this input dataset consists of 2,119 Java source files, comprising 700,000 lines of code, from a total of 59 developers. [sent-139, score-0.234]

41 While leveraging bug data is convenient (and necessary) to generate the developer list for Eclipse 3.0, it is also desirable to develop a more flexible approach that uses only the source code itself, and not other data sources. [sent-140, score-0.627; sent-141, score-0.542]

43 Thus to extract author names from source code we also develop a lightweight parser that examines the code for javadoc '@author' tags, as well as free-form labels such as 'author' and 'developer.' Occurrences of these labels are used to isolate and identify developer names. [sent-142, score-1.271; sent-143, score-0.439]
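
A rough sketch of such a lightweight extractor is shown below; the regular expressions are illustrative approximations of the patterns described, not the paper's actual parser.

    import re

    # Javadoc @author tags plus free-form "author:" / "developer:" labels.
    AUTHOR_PATTERNS = [
        re.compile(r"@author\s+(.+)"),
        re.compile(r"\b(?:author|developer)\s*:\s*(.+)", re.IGNORECASE),
    ]

    def extract_authors(source_text):
        """Return candidate developer names found in source comments."""
        names = set()
        for line in source_text.splitlines():
            for pattern in AUTHOR_PATTERNS:
                match = pattern.search(line)
                if match:
                    # Strip trailing comment markers and surrounding whitespace.
                    names.add(match.group(1).strip(" */"))
        return names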

45 This multitude of formats, combined with the fact that author names are typically labeled in the code header, is key to our decision to extract developer names using our own parsing utilities, rather than part-of-speech taggers [12] leveraged in other text mining projects. [sent-145, score-1.379]

46 A further complication for author name extraction is the fact that the same developer may write his name in several different ways. [sent-146, score-0.752]

47 When parsing is complete for all projects, the global author list is resolved using the same process, but with a new threshold, t2, such that t2 > t1. [sent-155, score-0.317]

48 This approach effectively implements more conservative name resolution across projects, in light of the observation that the scope of most developer activities is limited to a relatively small number (often just one) of open source efforts. [sent-156, score-0.87]
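
The paper does not spell out the resolution algorithm itself; a plausible sketch of q-gram name matching with a tunable threshold follows, where q = 3, the overlap coefficient, and the greedy merge are all assumptions.

    def qgrams(name, q=3):
        """Set of character q-grams of a lowercased name."""
        s = name.lower()
        return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}

    def qgram_similarity(a, b, q=3):
        """Overlap coefficient between the q-gram sets of two names."""
        ga, gb = qgrams(a, q), qgrams(b, q)
        return len(ga & gb) / min(len(ga), len(gb))

    def resolve(names, threshold):
        """Greedily merge names whose similarity to a canonical name exceeds the threshold."""
        canonical = []
        for name in names:
            if not any(qgram_similarity(name, rep) > threshold for rep in canonical):
                canonical.append(name)
        return canonical

Per the text, this would run once per project with threshold t1 and then once over the global author list with a stricter threshold t2 > t1.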

49 As an important step in processing source files, our tool removes commonly occurring stop words. [sent-165, score-0.234]

50 This is done to specifically avoid extracting common topics relating to the Java collections framework. [sent-170, score-0.256]
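
A minimal sketch of this preprocessing step appears below; the tokenization rule and the stop list are illustrative assumptions, since the paper's actual lists are not reproduced here.

    import re
    from collections import Counter

    # English stop words plus Java keywords and collections-framework terms;
    # this particular list is an assumption, not the paper's.
    STOP_WORDS = {"the", "a", "of", "public", "static", "void", "new",
                  "return", "list", "map", "set", "iterator", "string"}

    def bag_of_words(source_text):
        """Tokenize source code into a word-count bag, dropping stop words."""
        tokens = re.findall(r"[a-z]+", source_text.lower())
        return Counter(t for t in tokens if t not in STOP_WORDS)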

51 [Topic and Author-Topic Modeling Results] A representative subset of 6 topics extracted via Author-Topic modeling on the selected 2,119 source files from Eclipse 3.0 is shown in Table 2. [sent-176, score-0.491]

52 Each topic is described by several words associated with the topic concept. [sent-178, score-0.438]

53 To the right of each topic is a list of the most likely authors for each topic with their probabilities. [sent-179, score-0.555]

54 For example, topic 1 clearly corresponds to unit testing, topic 2 to debugging, topic 4 to building projects, and topic 6 to automated code completion. [sent-181, score-1.254]

55 Remaining topics range from package browsing to compiler options. [sent-182, score-0.304]

56 Table 2: Representative topics and authors from Eclipse 3.0. Topic 1: junit run listener item suite; Topic 2: target source debug breakpoint location; Topic 3: ast button cplist entries astnode; each listed with its most probable authors (e.g. egamma, with probability 0.97, for the JUnit topic). [sent-183, score-0.285; sent-184, score-0.336]

58 Major sub-domains of software development are clearly represented among the topics, with the first corresponding to web applications, the second to databases, the third to network applications, and the fourth to file processing. [sent-216, score-0.451]

59 Topic 5 also demonstrates the inherent difficulty of resolving author names, and the shortcomings of the q-gram algorithm, as the developer “gert van ham” and the developer “hamgert” are most likely the same person documenting their name in different ways. [sent-218, score-1.128]

60 Though the majority of topics can be intuitively mapped to their corresponding domains, some topics are too noisy to associate with any functional description. [sent-220, score-0.446]

61 For example, one topic extracted from our repository consists of Spanish words unrelated to software engineering which seem to represent the subset of source files with comments in Spanish. [sent-221, score-0.754]

62 Other topics appear to be very project specific, and while they may indeed describe a function of code, they are not easily understood by those who are only casually familiar with the software artifacts in the codebase. [sent-222, score-0.51]

63 This is especially true with Eclipse, which is limited in both the number and diversity of source files. [sent-223, score-0.234]

64 Examining the author assignments (and probabilities) for the various topics provides a simple means by which to discover developer contributions and infer their competencies. [sent-226, score-0.849]

65 It should come as no surprise that the most probable developer assigned to the JUnit framework topic is “egamma”, or Erich Gamma. [sent-227, score-0.658]

66 In this case, there is a 97% chance that any source file in our dataset assigned to this topic will have him as a contributor. [sent-228, score-0.453]

67 (Table 3: Representative topics and authors from the multi-project repository. Topic 1: servlet session response request http; Topic 2: sql column jdbc type result; Topic 3: packet type session snmpwalkmv address; author probabilities include craig r mcclanahan 0.01505.) This is of course a particularly attractive example because Erich Gamma is widely known for being a founder of the JUnit project, a fact which lends credibility to the ability of the topic modeling algorithm to assign developers to reasonable topics. [sent-230, score-0.354; sent-231]

69 For example, developer “daudel” is assigned to the topic corresponding to automatic code completion with probability . [sent-264, score-0.966]

70 In addition to determining developer contributions, one may also be curious to know the scope of a developer’s involvement. [sent-268, score-0.439]

71 Does a developer work across application areas, or are his contributions highly focused? [sent-269, score-0.439]

72 How does the breadth of one developer compare to another? [sent-270, score-0.49]

73 These are natural questions that arise in the software development process. [sent-271, score-0.232]

74 To answer these questions within the framework of author-topic models, we can measure the breadth of an author a by the entropy H(a) = −Σt θta log θta of the corresponding distribution over topics. [sent-272, score-0.238]
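
Computing this breadth from the fitted θ is a one-liner; the sketch below assumes natural logarithms, since the paper does not state a base.

    import numpy as np

    def author_breadth(theta_a, eps=1e-12):
        """Entropy H(a) = -sum_t theta_ta * log(theta_ta) of an author's topic distribution."""
        p = np.asarray(theta_a, dtype=float)
        p = p / p.sum()  # guard against unnormalized input
        return float(-(p * np.log(p + eps)).sum())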

75 The developer with the lowest entropy is “thierry danard,” with . [sent-275, score-0.439]

76 The developer with the highest entropy is “wdi” with 4. [sent-277, score-0.439]

77 (Figure: Eclipse 3.0 authors clustered by KL divergence.) Just as the entropy of an author's distribution over topics measures the author's breadth, the similarity between two authors can be measured by comparing their respective distributions over topics. [sent-281, score-0.534]
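
KL divergence is asymmetric, so clustering requires some symmetrization; the symmetrized form below is a common choice and an assumption on our part, as the paper does not state which variant it uses.

    import numpy as np

    def kl(p, q, eps=1e-12):
        """KL divergence D(p || q) between two topic distributions."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float((p * np.log(p / q)).sum())

    def author_distance(theta_a, theta_b):
        """Symmetrized KL divergence between two authors' topic distributions."""
        return 0.5 * (kl(theta_a, theta_b) + kl(theta_b, theta_a))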

78 The boxes represent individual developers, and are arranged such that developers with similar topic distributions are nearest one another. [sent-284, score-0.303]

79 This information is especially useful when considering how to form a development team, choosing suitable programmers to perform code updates, or deciding to whom to direct technical questions. [sent-286, score-0.342]

80 Two other important distributions that can be retrieved from the AT modeling approach are the distribution of topics across documents, and the distribution of documents across topics (not shown). [sent-287, score-0.537]

81 The corresponding entropies provide an automated and novel way to precisely formalize and measure topic scattering and document tangling, two fundamental concepts of software design [14], which are important to software architects when performing activities such as code refactoring. [sent-288, score-1.175]
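
Under one natural reading (the exact definitions are our assumption), scattering is the entropy of a topic's distribution over documents, and tangling is the entropy of a document's distribution over topics.

    import numpy as np

    def entropy(p, eps=1e-12):
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        return float(-(p * np.log(p + eps)).sum())

    def tangling(doc_topic):
        """Per-document entropy over topics (doc_topic: n_docs x n_topics);
        high values mean many concerns are tangled in one file."""
        return [entropy(row) for row in doc_topic]

    def scattering(doc_topic):
        """Per-topic entropy over documents; high values mean one
        concern is scattered across many files."""
        return [entropy(col) for col in doc_topic.T]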

82 [5 Code Search and Retrieval] Sourcerer relies on a deep analysis of code to extract pertinent textual and structural features that can be used to improve the quality and performance of source code search, as well as augment the ways in which code can be searched. [sent-289, score-1.249]

83 By combining standard text information retrieval techniques with source-specific heuristics and a relational representation of code, we obtain a comprehensive platform for searching software components. [sent-290, score-0.393]

84 Programs are best modeled as graphs, with code entities as the nodes and various relations as the edges. [sent-294, score-0.444]

85 This ranking approach can be applied to source code as well, as code entities referenced by many other entities are likely to be more robust than those with few references. [sent-297, score-0.642]

86 The Code Rank of a code entity (package, class, or method) A is given by: CR(A) = (1 − d) + d(CR(T1)/C(T1) + … + CR(Tn)/C(Tn)), where T1, …, Tn are the code entities referring to A, C(A) is the number of outgoing links of A, and d is a damping factor. [sent-299, score-0.308; sent-305, score-0.408]
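
A direct sketch of the fixed-point iteration for this formula follows; the damping value 0.85 and the fixed iteration count are assumptions borrowed from common PageRank practice.

    def code_rank(out_links, d=0.85, iters=50):
        """Iterate CR(A) = (1 - d) + d * sum over entities T referring to A of CR(T)/C(T).

        out_links : dict mapping each code entity to the entities it references
        """
        cr = {node: 1.0 for node in out_links}
        for _ in range(iters):
            new = {node: 1.0 - d for node in out_links}
            for src, targets in out_links.items():
                if not targets:
                    continue
                share = d * cr[src] / len(targets)  # C(src) = number of outgoing links
                for t in targets:
                    if t in new:
                        new[t] += share
            cr = new
        return cr

For example, code_rank({'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}) ranks entity A highest, since both B and C refer to it.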

88 Moreover, graph-based techniques can be combined with a variety of heuristics to further improve code search. [sent-308, score-0.352]

89 For example, keyword hits to the right of the fully-qualified name can be boosted, hits in comments can be discounted, and terms indicative of test articles can be ignored. [sent-309, score-0.281]
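
Such heuristics can be folded into scoring as simple multipliers; the flags and weights in the sketch below are purely illustrative assumptions.

    def adjust_score(base_score, hit):
        """Apply source-specific boosts and discounts to a keyword hit.

        hit : dict of flags describing where the keyword matched;
              field names and multipliers here are hypothetical.
        """
        score = base_score
        if hit.get("in_fqn_suffix"):  # hit at the right of the fully-qualified name
            score *= 2.0
        if hit.get("in_comment"):     # hits in comments are discounted
            score *= 0.5
        if hit.get("is_test"):        # terms indicative of test articles are ignored
            score = 0.0
        return score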

90 We are conducting detailed experiments to assess the effectiveness of graph-based algorithms in conjunction with standard IR techniques to search source code. [sent-310, score-0.268]

91 Best hits were determined manually, with a team of 3 software engineers serving as human judges of result quality, modularity, and ease of reuse. [sent-314, score-0.309]

92 Results clearly indicate that the general Google search engine is ineffective for locating relevant source code, with a mean AUC of . [sent-315, score-0.268]

93 By restricting its corpus to code alone, Google’s code search engine yields substantial improvement with an AUC of approximately . [sent-317, score-0.681]

94 Despite this improvement, this system essentially relies only on regular expression matching of code keywords. [sent-319, score-0.308]

95 Using a Java-specific keyword and comment parser our infrastructure yields an immediate improvement with an AUC of . [sent-320, score-0.266]

96 [6 Conclusion] Here we have leveraged a comprehensive code processing infrastructure to facilitate the mining of large-scale software repositories. [sent-326, score-0.692]

97 We conduct a statistical analysis of source code on a previously unreported scale, identifying robust power-law behavior among several code entities. [sent-327, score-0.85]

98 Results indicate that the algorithm automatically produces reasonable and interpretable topics and author-topic assignments. [sent-330, score-0.293]

99 Finally, by combining term-based information retrieval techniques with graphical information derived from program structure, we are able to significantly improve software search and retrieval performance. [sent-332, score-0.416]

100 Analyzing entities and topics in news articles using statistical topic models. [sent-369, score-0.542]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('developer', 0.439), ('code', 0.308), ('source', 0.234), ('software', 0.232), ('topics', 0.223), ('topic', 0.219), ('eclipse', 0.203), ('author', 0.187), ('java', 0.147), ('projects', 0.134), ('keyword', 0.132), ('names', 0.11), ('les', 0.109), ('repositories', 0.103), ('bug', 0.101), ('entities', 0.1), ('infrastructure', 0.09), ('apache', 0.084), ('developers', 0.084), ('sourcerer', 0.084), ('package', 0.081), ('parsing', 0.075), ('packages', 0.075), ('retrieval', 0.075), ('automated', 0.07), ('repository', 0.069), ('darin', 0.068), ('name', 0.063), ('document', 0.063), ('mining', 0.062), ('authors', 0.062), ('ta', 0.059), ('documents', 0.057), ('project', 0.055), ('list', 0.055), ('auc', 0.054), ('breadth', 0.051), ('coderank', 0.051), ('crawling', 0.051), ('darins', 0.051), ('daudel', 0.051), ('dmegert', 0.051), ('egamma', 0.051), ('jlanneluc', 0.051), ('junit', 0.051), ('kkolosow', 0.051), ('maeschli', 0.051), ('pagerank', 0.051), ('scattering', 0.051), ('sourceforge', 0.051), ('tangling', 0.051), ('teicher', 0.051), ('google', 0.048), ('extract', 0.046), ('textual', 0.045), ('parser', 0.044), ('heuristics', 0.044), ('hits', 0.043), ('text', 0.042), ('etc', 0.04), ('ranking', 0.04), ('million', 0.039), ('supplementary', 0.037), ('dirichlet', 0.037), ('relations', 0.036), ('materials', 0.035), ('cr', 0.035), ('private', 0.035), ('modeling', 0.034), ('directory', 0.034), ('erich', 0.034), ('gert', 0.034), ('ham', 0.034), ('hamgert', 0.034), ('inheritance', 0.034), ('jaburns', 0.034), ('javadoc', 0.034), ('jburns', 0.034), ('johna', 0.034), ('kjohnson', 0.034), ('krbarnes', 0.034), ('lbourlier', 0.034), ('mkeller', 0.034), ('nick', 0.034), ('othomann', 0.034), ('parsed', 0.034), ('pmulet', 0.034), ('programmers', 0.034), ('sloc', 0.034), ('staf', 0.034), ('tmaeder', 0.034), ('twatson', 0.034), ('wmelhem', 0.034), ('team', 0.034), ('search', 0.034), ('extracting', 0.033), ('database', 0.033), ('le', 0.033), ('leveraging', 0.032), ('corpus', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 129 nips-2007-Mining Internet-Scale Software Repositories

Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi

Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84, roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.

2 0.14972346 189 nips-2007-Supervised Topic Models

Author: Jon D. Mcauliffe, David M. Blei

Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

3 0.13914798 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

Author: Bing Zhao, Eric P. Xing

Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.

4 0.13133904 110 nips-2007-Learning Bounds for Domain Adaptation

Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman

Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.

5 0.12845188 183 nips-2007-Spatial Latent Dirichlet Allocation

Author: Xiaogang Wang, Eric Grimson

Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA.

6 0.12150881 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

7 0.11593682 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

8 0.11333051 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

9 0.10999414 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

10 0.10485528 47 nips-2007-Collapsed Variational Inference for HDP

11 0.060616694 197 nips-2007-The Infinite Markov Model

12 0.059462447 97 nips-2007-Hidden Common Cause Relations in Relational Learning

13 0.059334867 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning

14 0.057655722 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

15 0.051930122 49 nips-2007-Colored Maximum Variance Unfolding

16 0.051401891 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

17 0.048337203 84 nips-2007-Expectation Maximization and Posterior Constraints

18 0.047215044 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

19 0.047201648 143 nips-2007-Object Recognition by Scene Alignment

20 0.047199652 169 nips-2007-Retrieved context and the discovery of semantic structure


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.154), (1, 0.079), (2, -0.07), (3, -0.234), (4, 0.085), (5, 0.005), (6, 0.075), (7, -0.107), (8, 0.017), (9, 0.048), (10, 0.027), (11, -0.034), (12, 0.031), (13, 0.043), (14, 0.039), (15, -0.055), (16, 0.053), (17, -0.111), (18, -0.052), (19, -0.025), (20, 0.008), (21, -0.0), (22, -0.016), (23, -0.042), (24, 0.022), (25, -0.075), (26, 0.059), (27, 0.092), (28, 0.083), (29, -0.132), (30, -0.02), (31, -0.196), (32, -0.094), (33, 0.007), (34, 0.048), (35, 0.087), (36, -0.014), (37, 0.092), (38, -0.101), (39, 0.126), (40, -0.029), (41, -0.048), (42, -0.002), (43, -0.066), (44, -0.022), (45, -0.033), (46, -0.195), (47, -0.124), (48, -0.062), (49, -0.096)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97801155 129 nips-2007-Mining Internet-Scale Software Repositories

Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi

Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84, roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.

2 0.58965719 110 nips-2007-Learning Bounds for Domain Adaptation

Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman

Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.

3 0.54268909 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion

Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

4 0.53036773 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

Author: Bing Zhao, Eric P. Xing

Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.

5 0.52198452 189 nips-2007-Supervised Topic Models

Author: Jon D. Mcauliffe, David M. Blei

Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

6 0.51513553 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

7 0.45784551 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning

8 0.42191306 47 nips-2007-Collapsed Variational Inference for HDP

9 0.40833044 183 nips-2007-Spatial Latent Dirichlet Allocation

10 0.40568209 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

11 0.36147934 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

12 0.30098391 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

13 0.29251739 49 nips-2007-Colored Maximum Variance Unfolding

14 0.28734496 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

15 0.27161464 101 nips-2007-How SVMs can estimate quantiles and the median

16 0.26530328 197 nips-2007-The Infinite Markov Model

17 0.26366383 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

18 0.25818789 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

19 0.25694859 150 nips-2007-Optimal models of sound localization by barn owls

20 0.25210714 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.032), (13, 0.036), (16, 0.015), (18, 0.012), (19, 0.016), (21, 0.043), (31, 0.014), (34, 0.02), (35, 0.017), (47, 0.07), (49, 0.011), (83, 0.073), (85, 0.017), (87, 0.523), (90, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91266793 129 nips-2007-Mining Internet-Scale Software Repositories

Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi

Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84, roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html.

2 0.91094172 183 nips-2007-Spatial Latent Dirichlet Allocation

Author: Xiaogang Wang, Eric Grimson

Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” and “documents” when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA.

3 0.81678665 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation

Author: Leonid Sigal, Alexandru Balan, Michael J. Black

Abstract: Estimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a conditional mixture of kernel regressors. This predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height, weight, etc.) in both calibrated and uncalibrated camera environments.

4 0.77484292 59 nips-2007-Continuous Time Particle Filtering for fMRI

Author: Lawrence Murray, Amos J. Storkey

Abstract: We construct a biologically motivated stochastic differential model of the neural and hemodynamic activity underlying the observed Blood Oxygen Level Dependent (BOLD) signal in Functional Magnetic Resonance Imaging (fMRI). The model poses a difficult parameter estimation problem, both theoretically due to the nonlinearity and divergence of the differential system, and computationally due to its time and space complexity. We adapt a particle filter and smoother to the task, and discuss some of the practical approaches used to tackle the difficulties, including use of sparse matrices and parallelisation. Results demonstrate the tractability of the approach in its application to an effective connectivity study.

5 0.53170955 189 nips-2007-Supervised Topic Models

Author: Jon D. Mcauliffe, David M. Blei

Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

6 0.47892967 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

7 0.46685347 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

8 0.43320197 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

9 0.42675632 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

10 0.41912526 143 nips-2007-Object Recognition by Scene Alignment

11 0.40812936 47 nips-2007-Collapsed Variational Inference for HDP

12 0.4073095 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning

13 0.39163968 113 nips-2007-Learning Visual Attributes

14 0.38574669 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

15 0.37970069 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

16 0.37879634 169 nips-2007-Retrieved context and the discovery of semantic structure

17 0.37666842 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

18 0.3736738 56 nips-2007-Configuration Estimates Improve Pedestrian Finding

19 0.35860655 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression

20 0.35817292 180 nips-2007-Sparse Feature Learning for Deep Belief Networks