acl acl2013 acl2013-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard
Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1
Reference: text
sentIndex sentText sentNum sentScore
1 We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. [sent-2, score-1.771]
2 We model the problem as a joint dependency parsing and semantic role labeling task. [sent-3, score-0.058]
3 Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. [sent-4, score-1.548]
4 0% F-Score accu- , racy compared to an F-Score of 66. [sent-6, score-0.059]
5 7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing). [sent-7, score-0.814]
6 1 1 Introduction The general problem of translating natural language specifications into executable code has been around since the field of computer science was founded. [sent-8, score-0.613]
7 Early attempts to solve this problem produced what were essentially verbose, clumsy, and ultimately unsuccessful versions of standard formal programming languages. [sent-9, score-0.236]
8 edu (a) Text Specification: The input contains a single integer T that indicates the number of test cases. [sent-15, score-0.344]
9 Each test case begins with a line contains an integer N, representing the size of wall. [sent-17, score-0.268]
10 The j-th character of the i-th line figures out the color . [sent-20, score-0.1]
11 (b) Specification Tree:the input a single integer T test cases an integer N the next N lines N characters (c) Two Program Input Examples: Y. [sent-23, score-0.55]
12 WYWY Figure 1: An example of (a) one natural language specification describing program input data; (b) the corresponding specification tree representing the program input structure; and (c) two input examples however, researchers have had success addressing specific aspects of this problem. [sent-28, score-2.518]
13 Recent advances in this area include the successful translation of natural language commands into database queries (Wong and Mooney, 2007; Zettlemoyer and Collins, 2009; Poon and Domingos, 2009; Liang et al. [sent-29, score-0.141]
14 , 2011) and the successful mapping of natural language instructions into Windows command sequences (Branavan et al. [sent-30, score-0.131]
15 In this paper we explore a different aspect of this general problem: the translation of natural language input specifications into executable code that correctly parses the input data and generates 1294 Proce dingsS o f ita h,e B 5u1lgsta Arinan,u Aaulg Musete 4ti-n9g 2 o0f1 t3h. [sent-33, score-1.021]
16 Ac s2s0o1ci3a Atiosnso fcoirat Cio nm foprut Caotimonpaulta Lti nognuails Lti cnsg,u piasgteics 1294–1303, data structures for holding the data. [sent-35, score-0.045]
17 The need to automate this task arises because input format specifications are almost always described in natural languages, with these specifications then manually translated by a programmer into the code for reading the program inputs. [sent-36, score-1.484]
18 Our method highlights potential to automate this translation, thereby eliminating the manual software development overhead. [sent-37, score-0.242]
19 If the desired parser is implemented in C++, it should create a C++ class whose instance objects hold the different fields of the input. [sent-39, score-0.281]
20 For example, one of the fields of this class is an integer, i. [sent-40, score-0.069]
21 , “a single integer T” identified in the text specification in Figure 1a. [sent-42, score-0.913]
22 Instead of directly generating code from the text specification, we first translate the specification into a specification tree (see Figure 1b), then map this tree into parser code (see Figure 2). [sent-43, score-2.123]
23 We focus on the translation from the text specification to the specification tree. [sent-44, score-1.508]
24 2 We assume that each text specification is accompanied by a set of input examples that the desired input parser is required to successfully read. [sent-45, score-1.381]
25 In standard software development contexts, such input examples are usually available and are used to test the correctness of the input parser. [sent-46, score-0.477]
26 Note that this source of supervision is noisy the generated parser may still be incorrect even when it success— fully reads all of the input examples. [sent-47, score-0.483]
27 Specifically, the parser may interpret the input examples differently from the text specification. [sent-48, score-0.449]
28 For example, the program input in Figure 1c can be interpreted simply as a list of strings. [sent-49, score-0.305]
29 The parser may also fail to parse some correctly formatted input files not in the set of input examples. [sent-50, score-0.644]
30 Therefore, our goal is to design a technique that can effectively learn from this weak supervision. [sent-51, score-0.026]
31 We model our problem as a joint dependency parsing and role labeling task, assuming a Bayesian generative process. [sent-52, score-0.096]
32 The distribution over the space of specification trees is informed by two sources of information: (1) the correlation between the text and the corresponding specification tree and (2) the success of the generated parser in reading input examples. [sent-53, score-2.207]
33 Our method uses a joint probability distribution to take both of these sources of information into account, and uses a sampling framework for the inference of specifi2During the second step of the process, the specification tree is deterministically translated into code. [sent-54, score-0.973]
wordName wordTfidf (topN-words)
[('specification', 0.721), ('specifications', 0.352), ('input', 0.181), ('integer', 0.163), ('rinard', 0.144), ('parser', 0.13), ('code', 0.126), ('success', 0.113), ('executable', 0.11), ('branavan', 0.11), ('program', 0.098), ('tree', 0.097), ('automate', 0.097), ('reading', 0.07), ('programmer', 0.064), ('struct', 0.064), ('regina', 0.062), ('translated', 0.061), ('formatted', 0.059), ('racy', 0.059), ('verbose', 0.055), ('contest', 0.055), ('commands', 0.055), ('desired', 0.053), ('unsuccessful', 0.052), ('supervision', 0.051), ('reads', 0.05), ('intention', 0.05), ('successful', 0.049), ('translate', 0.049), ('format', 0.048), ('sources', 0.047), ('wy', 0.047), ('domingos', 0.047), ('deterministically', 0.047), ('command', 0.047), ('parsers', 0.047), ('holding', 0.045), ('poon', 0.045), ('noisy', 0.044), ('examples', 0.044), ('fields', 0.043), ('lines', 0.043), ('accompanied', 0.042), ('bayesian', 0.04), ('line', 0.04), ('eliminating', 0.04), ('tao', 0.039), ('highlights', 0.039), ('lti', 0.039), ('programming', 0.038), ('correlation', 0.038), ('generative', 0.038), ('software', 0.038), ('begins', 0.038), ('zettlemoyer', 0.038), ('translation', 0.037), ('ultimately', 0.037), ('ao', 0.036), ('lei', 0.036), ('arises', 0.035), ('color', 0.035), ('windows', 0.035), ('instructions', 0.035), ('wong', 0.035), ('fan', 0.034), ('correctly', 0.034), ('mooney', 0.034), ('interpret', 0.034), ('informed', 0.033), ('correctness', 0.033), ('labeling', 0.032), ('massachusetts', 0.032), ('files', 0.031), ('file', 0.031), ('differently', 0.031), ('addressing', 0.031), ('tn', 0.031), ('barzilay', 0.03), ('essentially', 0.03), ('hold', 0.029), ('text', 0.029), ('produced', 0.028), ('thereby', 0.028), ('fail', 0.028), ('representing', 0.027), ('generating', 0.027), ('generated', 0.027), ('class', 0.026), ('weak', 0.026), ('interpreted', 0.026), ('attempts', 0.026), ('role', 0.026), ('describing', 0.025), ('humans', 0.025), ('versions', 0.025), ('liang', 0.025), ('translating', 0.025), ('figures', 0.025), ('ai', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 163 acl-2013-From Natural Language Specifications to Program Input Parsers
Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard
Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1
2 0.083012961 257 acl-2013-Natural Language Models for Predicting Programming Comments
Author: Dana Movshovitz-Attias ; William W. Cohen
Abstract: Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to codecompletion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing. 1 Introduction and Related Work Statistical language models have traditionally been used to describe and analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. (2012) observe that, as code is created by humans it is likely to be repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion (Han et al., 2009; Jacob and Tairas, 2010), analysis of names in code (Lawrie et al., 2006; Binkley et al., 2011) and mining software repositories (Gabel and Su, 2008). An important part of software programming and maintenance lies in documentation, which may come in the form of tutorials describing the code, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may William W. Cohen Computer Science Department Carnegie Mellon University wcohen @ c s .cmu .edu include examples of use-cases for specific code segments or identifiers such as classes, methods and variables. Well documented code is easier to read and maintain in the long-run but writing comments is a laborious task that is often overlooked or at least postponed by many programmers. Code commenting not only provides a summarization of the conceptual idea behind the code (Sridhara et al., 2010), but can also be viewed as a form of document expansion where the comment contains significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, code categorization, and locating parts of the code that are relevant to a specific topic or idea (Tseng and Juang, 2003; Wan et al., 2007; Kumar and Carterette, 2013; Shepherd et al., 2007; Rastkar et al., 2011). A related and well studied NLP task is that of predicting natural language caption and commentary for images and videos (Blei and Jordan, 2003; Feng and Lapata, 2010; Feng and Lapata, 2013; Wu and Li, 2011). In this work, our goal is to apply statistical language models for predicting class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected as n-grams have been shown as a strong model for language and speech prediction that is hard to improve upon (Rosenfeld, 2000). In some cases however, for example in a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA (Blei et al., 2003) and link-LDA (Erosheva et al., 2004) topic models, which are more relevant for the term ex- traction scenario. We find that the topic model performance can be improved by distinguishing code and text tokens in the code. 35 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 35–40, 2 Method 2.1 Models We train n-gram models (n = 1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets (described below). We use the Berkeley Language Model package (Pauls and Klein, 2011) with absolute discounting (Kneser-Ney smoothing; (1995)) which includes a backoff strategy to lower-order n-grams. Next, we use LDA topic models (Blei et al., 2003) trained on the same data, with 1, 5, 10 and 20 topics. The joint distribution of a topic mixture θ, and a set of N topics z, for a single source code document with N observed word tokens, d = {wi}iN=1, given the Dirichlet parameters α sa,n dd β, {isw th}erefore p(θ, z, w|α, β) = p(θ|α) Yp(z|θ)p(w|z, (1) β) Yw Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, d = where tthexet text ws,o drd =s are tok}ens f,r{owm comment and string literals, and the code words include the programming language syntax tokens (e.g., publ ic, private, for, etc’ ) and all identifiers. In this case, we train link-LDA models (Erosheva et al., 2004) with 1, 5, 10 and 20 topics. Under the linkLDA model, the mixed-membership joint distribution of a topic mixture, words and topics is then ({wciode}iC=n1, {witext}iT=n1), p(θ, z, w|α, β) = p(θ|α) Y wYtext · p(ztext|θ)p(wtext|ztext,β)· (2) Y p(zcode|θ)p(wcode|zcode,β) wYcode where θ is the joint topic distribution, w is the set of observed document words, ztext is a topic associated with a text word, and zcode a topic associated with a code word. The LDA and link-LDA models use Gibbs sampling (Griffiths and Steyvers, 2004) for topic inference, based on the implementation of Balasubramanyan and Cohen (201 1) with single or multiple entities per document, respectively. 2.2 Testing Methodology Our goal is to predict the tokens of the JAVA class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word wi is given by considering previous words p(wi |wi−1 , . . . , w0). This probability is estimated given the previous n 1tokens as p(wi|wi−1, wi−(n−1)). For t|hwe topic models, we separate the docu- ..., − ment tokens into the class definition and the comment we wish to predict. The set of tokens of the class comment are all considered as text tokens. The rest of the tokens in the document are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the tokens wc, wr, wr p(θ,zr|wr,α,β) =p(θp,(zwr,rw|αr,|αβ),β) (3) This gives us an estimate of the document distribution, θ, with which we infer the probability of the comment tokens as p(wc|θ,β) = Xp(wc|z,β)p(z|θ) (4) Xz Following Blei et al. (2003), for the case of a single entity LDA, the inference problem from equation (3) can be solved by considering p(θ, z, w|α, β), as in equation (1), and by taking tph(eθ marginal )di,s atrsib iunti eoqnu aotfio othne ( 1d)o,c aunmde bnyt t toakkeinngs as a continuous mixture distribution for the set w = by integrating over θ and summing over the set of topics z wr, p(w|α,β) =Zp(θ|α)· (5) YwXzp(z|θ)p(w|z,β)!dθ For the case of link-LDA where the document is comprised of two entities, in our case code tokens and text tokens, we can consider the mixedmembership joint distribution θ, as in equation (2), and similarly the marginal distribution p(w|α, β) over bimoithla rclyod teh ean mda tregxint tlok deisntsri bfruotmion w pr(.w |Sαi,nβce) comment words in are all considered as text tokens they are sampled using text topics, namely ztext, in equation (4). wc 36 3 Experimental Settings 3.1 Data and Training Methodology We use source code from nine open source JAVA projects: Ant, Cassandra, Log4j, Maven, MinorThird, Batik, Lucene, Xalan and Xerces. For each project, we divide the source files into a training and testing dataset. Then, for each project in turn, we consider the following three main training scenarios, leading to using three training datasets. To emulate a scenario in which we are predicting comments in the middle of project development, we can use data (documented code) from the same project. In this case, we use the in-project training dataset (IN). Alternatively, if we train a comment prediction model at the beginning of the development, we need to use source files from other, possibly related projects. To analyze this scenario, for each of the projects above we train models using an out-of-project dataset (OUT) containing data from the other eight projects. Typically, source code files contain a greater amount ofcode versus comment text. Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. We use data from the popular Q&A; website StackOverflow (SO) where users ask and answer technical questions about software development, tools, algorithms, etc’ . We downloaded a dataset of all actions performed on the site since it was launched in August 2008 until August 2012. The data includes 3,453,742 questions and 6,858,133 answers posted by 1,295,620 users. We used only posts that are tagged as JAVA related questions and answers. All the models for each project are then tested on the testing set of that project. We report results averaged over all projects in Table 1. Source files were tokenized using the Eclipse JDT compiler tools, separating code tokens and identifiers. Identifier names (of classes, methods and variables), were further tokenized by camel case notation (e.g., ’minMargin’ was converted to ’min margin’). Non alpha-numeric tokens (e.g., dot, semicolon) were discarded from the code, as well as numeric and single character literals. Text from comments or any string literals within the code were further tokenized with the Mallet statistical natural language processing package (Mc- Callum, 2002). Posts from SO were parsed using the Apache Tika toolkit1 and then tokenized with the Mallet package. We considered as raw code tokens anything labeled using a markup (as indicated by the SO users who wrote the post). 3.2 Evaluation Since our models are trained using various data sources the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric which aims at quantifying the percentage of characters that can be saved by using the model in a word-completion settings, similar to standard code completion tools built into code editors. For a comment word with n characters, w = w1, . . . , wn, we predict the two most likely words given each model filtered by the first 0, . . . , n characters ofw. Let k be the minimal ki for which w is in the top two predicted word tokens where tokens are filtered by the first ki characters. Then, the number of saved characters for w is n k. In Table 1we report the average percentage o−f ksa.v Iend T Tcahbalera 1cte wrse per ocrotm thmee avnet using eearcchen not-f the above models. The final results are also averaged over the nine input projects. As an example, in the predicted comment shown in Table 2, taken from the project Minor-Third, the token entity is the most likely token according to the model SO trigram, out of tokens starting with the prefix ’en’ . The saved characters in this case are ’tity’ . − 4 Results Table 1 displays the average percentage of characters saved per class comment using each of the models. Models trained on in-project data (IN) perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47. 1% characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments, and identifier names that appear in the comment of one class may appear in the code of another class in the same project. Clearly, in-project data should be used when available as it improves comment prediction leading to an average increase of between 6% for the worst model (26.6 for OUT unigram versus 33.05 for IN) and 14% for the best (32.96 for OUT trigram versus 47. 1for IN). 1http://tika.apache.org/ 37 Model n / topics n-gram LDA Link-LDA 1 2 3 20 10 5 1 20 10 5 1 IN 33.05 (3.62) 43.27 (5.79) 47.1 (6.87) 34.20 (3.63) 33.93 (3.67) 33.63 (3.67) 33.05 (3.62) 35.76 (3.95) 35.81 (4.12) 35.37 (3.98) 34.59 (3.92) OUT 26.6 (3.37) 31.52 (4.17) 32.96 (4.33) 26.79 (3.26) 26.8 (3.36) 26.86 (3.44) 26.6 (3.37) 28.03 (3.60) 28 (3.56) 28 (3.67) 27.82 (3.62) SO 27.8 (3.51) 33.29 (4.40) 34.56 (4.78) 27.25 (3.67) 27.22 (3.44) 27.34 (3.55) 27.8 (3.51) 28.08 (3.48) 28.12 (3.58) 27.94 (3.56) 27.9 (3.45) Table 1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on three training sets: IN, OUT, and SO. The results are averaged over nine JAVA projects (with standard deviations in parenthesis). Model Predicted Comment trigram IN link-LDA OUT trigram SO trigram “Train “Train “Train “Train IN named-entity a named-entity a named-entity a named-entity a extractor“ extractor“ extractor“ extractor“ Table 2: Sample comment from the Minor-Third project predicted using IN, OUT and SO based models. Saved characters are underlined. Of the out-of-project data sources, models using a greater amount of text (SO) mostly outperformed models based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger word dictionary associated with the SO data. Note that in the scope of this work we did not investigate the contribution of each of the background projects used in OUT, and how their relevance to the target prediction project effects their performance. The trigram model shows the best performance across all training data sources (47% for IN, 32% for OUT and 34% for SO). Amongst the tested topic models, link-LDA models which distinguish code and text tokens perform consistently better than simple LDA models in which all tokens are considered as text. We did not however find a correlation between the number of latent topics learned by a topic model and its performance. In fact, for each of the data sources, a different num- ber of topics gave the optimal character saving results. Note that in this work, all topic models are based on unigram tokens, therefore their results are most comparable with that of the unigram in Dataset n-gram link-LDA IN 2778.35 574.34 OUT 1865.67 670.34 SO 1898.43 638.55 Table 3: Average words per project for which each tested model completes the word better than the other. This indicates that each of the models is better at predicting a different set of comment words. Table 1, which does not benefit from the backoff strategy used by the bigram and trigram models. By this comparison, the link-LDA topic model proves more successful in the comment prediction task than the simpler models which do not distin- guish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown). Table 2 shows a sample comment segment for which words were predicted using trigram models from all training sources and an in-project linkLDA. The comment is taken from the TrainExtractor class in the Minor-Third project, a machine learning library for annotating and categorizing text. Both IN models show a clear advantage in completing the project-specific word Train, compared to models based on out-of-project data (OUT and SO). Interestingly, in this example the trigram is better at completing the term namedentity given the prefix named. However, the topic model is better at completing the word extractor which refers to the target class. This example indicates that each model type may be more successful in predicting different comment words, and that combining multiple models may be advantageous. 38 This can also be seen by the analysis in Table 3 where we compare the average number of words completed better by either the best n-gram or topic model given each training dataset. Again, while n-grams generally complete more words better, a considerable portion of the words is better completed using a topic model, further motivating a hybrid solution. 5 Conclusions We analyze the use of language models for predicting class comments for source file documents containing a mixture of code and text tokens. Our experiments demonstrate the effectiveness of using language models for comment completion, showing a saving of up to 47% of the comment characters. When available, using in-project training data proves significantly more successful than using out-of-project data. However, we find that when using out-of-project data, a dataset based on more words than code performs consistently better. The results also show that different models are better at predicting different comment words, which motivates a hybrid solution combining the advantages of multiple models. Acknowledgments This research was supported by the NSF under grant CCF-1247088. References Ramnath Balasubramanyan and William W Cohen. 2011. Block-lda: Jointly modeling entity-annotated text and entity-entity links. In Proceedings ofthe 7th SIAM International Conference on Data Mining. Dave Binkley, Matthew Hearn, and Dawn Lawrie. 2011. Improving identifier informativeness using part of speech information. In Proc. of the Working Conference on Mining Software Repositories. ACM. David M Blei and Michael I Jordan. 2003. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM. David M Blei, Andrew Y Ng, and Michael IJordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research. Elena Erosheva, Stephen Fienberg, and John Lafferty. 2004. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America. Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth? automatic caption generation for news images. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE transactions on pattern analysis and machine intelligence. Mark Gabel and Zhendong Su. 2008. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 339–349. ACM. Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America. Sangmok Han, David R Wallace, and Robert C Miller. 2009. Code completion from abbreviated input. In Automated Software Engineering, 2009. ASE’09. 24th IEEE/ACM International Conference on, pages 332–343. IEEE. Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE. Ferosh Jacob and Robert Tairas. 2010. Code template inference using language models. In Proceedings of the 48th Annual Southeast Regional Conference. ACM. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., volume 1, pages 181–184. IEEE. Naveen Kumar and Benjamin Carterette. 2013. Time based feedback and query expansion for twitter search. In Advances in Information Retrieval, pages 734–737. Springer. Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. Whats in a name? a study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE. Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. Adam Pauls and Dan Klein. 2011. Faster and smaller language models. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 258–267. n-gram Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 103–1 12. IEEE. 39 Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8): 1270–1278. David Shepherd, Zachary P Fry, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2007. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th international conference on Aspect-oriented software development, pages 212–224. ACM. Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pages 43–52. ACM. Yuen-Hsien Tseng and Da-Wei Juang. 2003. Document-self expansion for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 399–400. ACM. Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Single document summarization with document expansion. In Proc. of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Roung-Shiunn Wu and Po-Chun Li. 2011. Video annotation using hierarchical dirichlet process mixture model. Expert Systems with Applications, 38(4):3040–3048. 40
3 0.062033985 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
4 0.059779651 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
Author: Gozde Ozbal ; Daniele Pighin ; Carlo Strapparava
Abstract: Daniele Pighin Google Inc. Z ¨urich, Switzerland danie le . pighin@ gmai l com . Carlo Strapparava FBK-irst Trento, Italy st rappa@ fbk . eu you”. As another scenario, creative sentence genWe present BRAINSUP, an extensible framework for the generation of creative sentences in which users are able to force several words to appear in the sentences and to control the generation process across several semantic dimensions, namely emotions, colors, domain relatedness and phonetic properties. We evaluate its performance on a creative sentence generation task, showing its capability of generating well-formed, catchy and effective sentences that have all the good qualities of slogans produced by human copywriters.
5 0.056758888 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
Author: Francesco Sartorio ; Giorgio Satta ; Joakim Nivre
Abstract: We present a novel transition-based, greedy dependency parser which implements a flexible mix of bottom-up and top-down strategies. The new strategy allows the parser to postpone difficult decisions until the relevant information becomes available. The novel parser has a ∼12% error reduction in unlabeled attach∼ment score over an arc-eager parser, with a slow-down factor of 2.8.
6 0.054538373 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
7 0.053091243 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce
8 0.050907169 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
9 0.050790843 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
10 0.050389573 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
11 0.048756495 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
12 0.048655711 312 acl-2013-Semantic Parsing as Machine Translation
13 0.048370492 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
14 0.048190903 80 acl-2013-Chinese Parsing Exploiting Characters
15 0.047637474 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
16 0.046944909 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
17 0.046849728 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
18 0.04434678 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing
19 0.044067163 275 acl-2013-Parsing with Compositional Vector Grammars
20 0.043859594 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
topicId topicWeight
[(0, 0.114), (1, -0.043), (2, -0.048), (3, -0.006), (4, -0.028), (5, -0.001), (6, 0.048), (7, -0.039), (8, 0.023), (9, -0.008), (10, -0.025), (11, 0.025), (12, -0.022), (13, -0.004), (14, 0.016), (15, -0.036), (16, -0.019), (17, -0.018), (18, -0.018), (19, -0.021), (20, -0.011), (21, -0.026), (22, 0.01), (23, 0.023), (24, -0.017), (25, 0.029), (26, 0.003), (27, -0.022), (28, -0.009), (29, -0.003), (30, -0.01), (31, 0.019), (32, -0.017), (33, -0.038), (34, -0.003), (35, 0.038), (36, -0.07), (37, -0.033), (38, 0.027), (39, 0.089), (40, -0.096), (41, 0.047), (42, -0.036), (43, -0.071), (44, -0.052), (45, 0.038), (46, 0.006), (47, -0.016), (48, 0.01), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.9486025 163 acl-2013-From Natural Language Specifications to Program Input Parsers
Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard
Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1
2 0.59041721 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing
Author: Dan Goldwasser ; Dan Roth
Abstract: Semantic parsing is a domain-dependent process by nature, as its output is defined over a set of domain symbols. Motivated by the observation that interpretation can be decomposed into domain-dependent and independent components, we suggest a novel interpretation model, which augments a domain dependent model with abstract information that can be shared by multiple domains. Our experiments show that this type of information is useful and can reduce the annotation effort significantly when moving between domains.
3 0.58553654 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
4 0.57935399 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
Author: Joohyun Kim ; Raymond Mooney
Abstract: We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. Instead, we show how the weak supervision of response feedback (e.g. successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.
5 0.56387907 176 acl-2013-Grounded Unsupervised Semantic Parsing
Author: Hoifung Poon
Abstract: We present the first unsupervised approach for semantic parsing that rivals the accuracy of supervised approaches in translating natural-language questions to database queries. Our GUSP system produces a semantic parse by annotating the dependency-tree nodes and edges with latent states, and learns a probabilistic grammar using EM. To compensate for the lack of example annotations or question-answer pairs, GUSP adopts a novel grounded-learning approach to leverage database for indirect supervision. On the challenging ATIS dataset, GUSP attained an accuracy of 84%, effectively tying with the best published results by supervised approaches.
6 0.5499987 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension
7 0.54078174 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
8 0.54036188 288 acl-2013-Punctuation Prediction with Transition-based Parsing
9 0.53857076 312 acl-2013-Semantic Parsing as Machine Translation
10 0.52445585 14 acl-2013-A Novel Classifier Based on Quantum Computation
11 0.52328181 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications
12 0.51669466 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
13 0.50431293 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE
14 0.49821213 335 acl-2013-Survey on parsing three dependency representations for English
15 0.49561468 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
16 0.494133 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
17 0.48719376 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
18 0.4775092 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions
19 0.47741777 65 acl-2013-BRAINSUP: Brainstorming Support for Creative Sentence Generation
20 0.47152737 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars
topicId topicWeight
[(0, 0.05), (6, 0.016), (11, 0.068), (15, 0.036), (24, 0.066), (26, 0.082), (35, 0.055), (40, 0.288), (42, 0.06), (48, 0.039), (64, 0.018), (70, 0.034), (88, 0.033), (90, 0.012), (95, 0.05)]
simIndex simValue paperId paperTitle
1 0.79005432 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn
Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
2 0.78571874 94 acl-2013-Coordination Structures in Dependency Treebanks
Author: Martin Popel ; David Marecek ; Jan StÄłpanek ; Daniel Zeman ; ZdÄłnÄłk Zabokrtsky
Abstract: Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences such as high frequency of parsing errors related to coordination. In other words, coordination is a pending problem in dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a systematizing view of various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, empirical observations on convertibility between selected styles of representations are shown too.
same-paper 3 0.78301549 163 acl-2013-From Natural Language Specifications to Program Input Parsers
Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard
Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1
4 0.7193588 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
Author: Matthew R. Gormley ; Jason Eisner
Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ?) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.
5 0.64538592 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
Author: Yuki Arase ; Ming Zhou
Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.
6 0.5986861 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
7 0.51914901 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation
8 0.50671852 318 acl-2013-Sentiment Relevance
9 0.50052726 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
10 0.49496818 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
11 0.48953217 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
12 0.48903996 225 acl-2013-Learning to Order Natural Language Texts
13 0.48857823 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
14 0.48792803 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
15 0.48719496 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
16 0.48685977 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
17 0.48631483 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
18 0.48542449 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks
19 0.48486492 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
20 0.48413953 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction