acl acl2013 acl2013-305 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu
Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
Reference: text
sentIndex sentText sentNum sentScore
1 The quality of automatic translation is affected by many factors. [sent-4, score-0.22]
2 One is the divergence between the specific source and target languages. [sent-5, score-0.136]
3 Another lies in the source text itself, as some texts are more complex than others. [sent-6, score-0.148]
4 One way to handle such texts is to modify them prior to translation. [sent-7, score-0.044]
5 Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. [sent-8, score-0.358]
6 In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. [sent-9, score-0.634]
7 An important source of problems lies in the source text itself: some texts are more complex to translate than others. [sent-15, score-0.252]
8 Consider the following English-to-French translation by a popular service, BING TRANSLATOR: Head of Mali defense seeks more arms → Défense de la tête du Mali cherche bras plus. [sent-16, score-0.304]
9 and arms have been translated as if they were – [sent-18, score-0.11]
10 The fact that the formulation of the source can strongly influence the quality of the translation has long been known, and there have been studies indicating that adherence to so-called “Controlled Language” guidelines, such as Simplified Technical English, can reduce the MT post-edition effort. [sent-24, score-0.324]
11 We need to analyze the effect that rules are having on different language pairs and MT systems, and we need to tune our rule sets and texts accordingly”. [sent-28, score-0.044]
12 In the software system presented here, SORT (SOurce Rewriting Tool), we build on the basic insight that formulation of the source needs to be geared to the specific MT model being used, and propose the following approach. [sent-29, score-0.136]
13 First, we assume that the original source text in English (say) is not necessarily under the user’s control, but may be given to her. [sent-30, score-0.148]
14 While she is a fluent English speaker, she does not know at all the target language, but uses an MT system; crucially, this system is able to provide estimates of the quality of its translations (Specia et al. [sent-31, score-0.295]
15 SORT then automatically produces a number of rewritings of each English sentence, translates them with the MT system, and displays to the user those rewritings for which the translation quality estimates are higher than the estimate for the original source. [sent-33, score-1.077]
16 The user then interactively selects one such rewriting per sentence, checking that it does not distort the original meaning, and finally the translations of these [sent-34, score-0.815]
17 (© 2013 Association for Computational Linguistics, pages 85–90) reformulations are made available. [sent-38, score-0.117]
18 One advantage of this framework is that the proposed rewritings are implicitly “aware” of the underlying strengths and limitations of the specific MT model. [sent-39, score-0.322]
19 A good quality estimation (see footnote 3) component, for instance, will feel more confident about the translation of an unambiguous word like weapon than about that of an ambiguous one such as arm, or about the translation of a known term in its domain than about a term not seen during training. [sent-40, score-0.43]
20 Such a tool is especially relevant for business situations where post-edition costs are very high, for instance because of lack of people both expert in the domain and competent in the target language. [sent-41, score-0.075]
21 2 The rewriting tool In this section we describe SORT, our implementation of the aforementioned rewriting approach. [sent-44, score-1.069]
22 While the entire process can in principle be fully automated, we focus here on an interactive process where the user views and approves suggested rewritings. [sent-45, score-0.201]
23 The details of the rewriting methods and of the quality estimation used in the current implementation are described in Sections 3 and 4. [sent-46, score-0.622]
24 With this interface, the user uploads the document that needs to be translated. [sent-48, score-0.223]
25 The translation confidence of each sentence is computed and displayed next to it. [sent-49, score-0.421]
26 The confidence scores are color-coded to enable quickly focusing on the sentences that require more attention. [sent-50, score-0.242]
27 Green denotes sentences for which the translation confidence is high, and which are thus expected to produce good translations. [sent-51, score-0.405]
28 Red marks sentences that are estimated to be poorly translated, and all those in between are marked with an orange label. [sent-52, score-0.066]
29 We attempt to suggest rewritings only for sentences that are estimated to be not so well translated. [sent-53, score-0.388]
30 When we are able to propose rewriting(s) with higher translation confidence than the original, a magnifying glass icon is displayed next to the sentence. [sent-54, score-0.511]
31 Clicking it displays, on the right side of [Footnote 3: Also known as confidence estimation.] [sent-55, score-0.205]
32 the screen, an ordered list of the more confident rewritings, along with their corresponding confidence estimations. [sent-56, score-0.252]
33 The first sentence on the list is always the original one, to let it be edited, and to make it easier to view the difference between the original and the rewritings. [sent-57, score-0.121]
34 An example is shown on the right side of Figure 1, where we see a rewriting suggestion for the fourth sentence in the document. [sent-58, score-0.551]
35 Here, the suggestion is simply to replace the word captured with the word caught, a rewriting that is estimated to improve the translation of the sentence. [sent-59, score-0.743]
36 The user can select one of the suggestions or choose to edit either the original or one of the rewritings. [sent-60, score-0.164]
37 The current sentence which is being examined is marked with a different color and the alternative under focus is marked with a small icon (the bidirectional arrows). [sent-61, score-0.09]
38 The differences between the alternatives and the original are highlighted. [sent-62, score-0.074]
39 After the user’s confirmation (with the check mark icon), the display of the document on the left-hand side is updated based on her selection, including the updated confidence estimation. [sent-63, score-0.281]
40 At any time, the user (if she speaks the target language) can click on the cogwheel icon and view the translation of the source or of its rewritten version. [sent-64, score-0.542]
41 When done, the user can save the edited text or its translation. [sent-65, score-0.153]
42 The Model part is formed by Java classes representing the application state (user input, selected text lines, associated rewriting propositions and scores). [sent-70, score-0.513]
43 The Controller consists of several servlet components handling each user interaction with the backend server (file uploads, SMT tools calls via XML-RPC or use of the embedded Java library that handles the actual rewritings). [sent-71, score-0.221]
44 Figure 2 shows the system architecture of SORT, [sent-76, score-0.06]
45 The entire process is performed via a client-server architecture in order to provide responsiveness, as required in an interactive system. [sent-82, score-0.109]
46 The user communicates with the system through the interface shown in Figure 1. [sent-83, score-0.19]
47 When a document is loaded, its sentences are translated in parallel by an SMT Moses server (Koehn et al. [sent-84, score-0.209]
48 Then, the source and the target are sent to the confidence estimator, and the translation model information is also made available to it. [sent-86, score-0.504]
49 The confidence estimator extracts features from that input and returns a confidence score. [sent-87, score-0.41]
50 Specifically, the language model features are computed with two SRILM servers (Stolcke, 2002), one for the source language and one for the target language. [sent-88, score-0.136]
51 Rewritings are produced by the rewriting modules (see Section 3 for the implemented rewriting methods). [sent-89, score-1.026]
52 For each rewriting, the same process of translation and confidence estimation is performed. [sent-90, score-0.42]
53 Translations are cached during the session; thus, when the user wishes to view a translation or download the translations of the entire document, the response is immediate. [sent-91, score-0.413]
54 3 Source rewriting Various methods can be used to rewrite a source text. [sent-92, score-0.648]
55 In what follows we describe two rewriting methods, based on Text Simplification techniques, which we implemented and integrated in the current version of SORT. [sent-93, score-0.513]
56 Simplification operations include the replacement of words by simpler ones, removal of complicated syntactic structures, shortening of sentences etc. [sent-94, score-0.165]
57 Our assumption is that simpler sentences are more likely to yield higher quality translations. [sent-96, score-0.148]
58 Clearly, this is not always the case; yet, we leave this decision to the confidence estimation component. [sent-97, score-0.257]
59 Sentence-level simplification: Specia (2010) proposed to model text simplification as a Statistical Machine Translation (SMT) task where the goal is to translate sentences to their simplified version in the same language. [sent-98, score-0.367]
60 In this approach, a simplification model is learnt from a parallel corpus of texts and their simplified versions. [sent-99, score-0.24]
61 Given a source text, it is translated to its simpler version, and its n-best translations are assessed by the confidence estimation component. [sent-103, score-0.598]
62 Lexical simplification: One of the primary operations for text simplification is lexical substitution (Table 2 in (Specia, 2010)). [sent-104, score-0.178]
63 Hence, in addition to rewriting a full sentence using the previous technique, we implemented a second method, addressing lexical simplification directly, and only modifying local aspects of the source sentence. [sent-105, score-0.784]
64 The approach here is to extract relevant synonyms from our trained SMT model of English to Simplified English, and use them as substitutions to simplify new sentences. [sent-106, score-0.078]
65 We check whether their lemmas were synonyms in WordNet (Fellbaum, 1998) (with all possible parts-of-speech as this information was not available in the SMT model). [sent-108, score-0.067]
66 When a match of an English word is found in the source sentence it is replaced with its simpler synonym to generate an alternative for the source. [sent-110, score-0.185]
67 For example, using this rewriting method for the source sentence “Why the Galileo research program superseded rival programs,” three rewritings of the sentence are generated when rival is substituted by competitor or superseded by replaced, and when both substitutions occur together. [sent-111, score-1.182]
68 In the current version of SORT, both sentence-level and lexical simplification methods are used in conjunction to suggest rewritings for sentences with low confidence scores. [sent-115, score-0.698]
69 4 Confidence estimation Our confidence estimator is based on the system and data provided for the 2012 Quality estimation shared task (Callison-Burch et al. [sent-116, score-0.475]
70 In this task, participants were required to estimate the quality of automated translations. [sent-118, score-0.057]
71 Their estimates were compared to human scores of the translation which referred to the suitability of the translation for post-editing. [sent-119, score-0.375]
72 The scores ranged from 1 to 5, where 1 corresponded to translations that practically need to be done from scratch, and 5 to translations that require little to no editing. [sent-120, score-0.26]
73 The task’s training set consisted of approximately 1800 source sentences in English, their Moses translations to Spanish and the scores given to the translations by the three judges. [sent-121, score-0.335]
74 5 Initial evaluation and analysis We performed an initial evaluation of our approach in an English to Spanish translation setting, using the 2008 News Commentary data. [sent-126, score-0.163]
75 440 pairs of the original sentence and the selected alternative were then both translated to Spanish and were presented as competitors to [sent-129, score-0.15]
76 The sentences were placed within their context in the original document, taken from the Spanish side of the corpus. [sent-132, score-0.081]
77 The order of presentation of the two competitors was random. [sent-133, score-0.048]
78 In this evaluation, the translation of the original was preferred 20. [sent-134, score-0.239]
79 Among the two rewriting methods, the sentence-level method more often resulted in preferred translations. [sent-137, score-0.545]
80 These results suggest that rewriting is estimated to improve translation quality. [sent-138, score-0.705]
81 , when two source synonyms were translated to the same target word. [sent-142, score-0.229]
82 Also, often a wrong synonym was suggested as a replacement for a word (e. [sent-143, score-0.057]
83 This was somewhat surprising as we had expected the language model features of the confidence estimator to help remove these cases. [sent-146, score-0.339]
84 Putting more emphasis on context features in the confidence estimation or explicitly verifying the context-suitability of lexical substitutions could help address this issue. [sent-148, score-0.3]
85 6 Related work Some related approaches focus on the authoring process and control a priori the range of possible texts, either by interactively enforcing lexical and syntactic constraints on the source that simplify the operations of a rule-based translation system (Carbonell et al. [sent-149, score-0.603]
86 ing a monolingual author in the generation of multilingual texts (Power and Scott, 1998; Dymetman et al. [sent-151, score-0.094]
87 A recent approach (Venkatapathy and Mirkin, 2012) proposes an authoring tool that consults the MT system itself to propose phrases that should be used during composition to obtain better translations. [sent-153, score-0.266]
88 All these methods address the authoring of the source text from scratch. [sent-154, score-0.295]
89 , 2005) propose an interactive system where the author helps a rule-based translation system disambiguate a source text inside a structured document editor. [sent-157, score-0.456]
90 Closer to our approach of modifying the source text, one approach is to paraphrase the source or to generate sentences entailed by it (Callison-Burch et al. [sent-159, score-0.278]
91 These works, however, focus on handling out-of-vocabulary (OOV) words, do not assess the translatability of the source sentence and are not interactive. [sent-164, score-0.194]
92 Monolingual speakers of the source and target language collaborate to improve the translation. [sent-167, score-0.136]
93 Unlike our approach, here both the feedback for poorly translated sentences and the actual modification of the source are done by humans. [sent-168, score-0.237]
94 7 Conclusions and future work We introduced a system for rewriting texts for translation under the control of a confidence esti- mator. [sent-170, score-0.985]
95 Based on an evaluation of the quality of the generated alternatives as well as on user selection decisions, we may be able to learn a quality estimator for the rewriting operations themselves. [sent-174, score-0.955]
96 Such methods could be useful both in an interactive mode, to minimize the effort of the monolingual source user, as well as in an automatic mode, to avoid misinterpretation. [sent-175, score-0.235]
97 In this work we used an available baseline feature extraction module for confidence estimation. [sent-176, score-0.205]
98 A better estimator could benefit our system significantly, as we argued above. [sent-177, score-0.166]
99 Lastly, we wish to further improve the user interface of the tool, based on feedback from actual users. [sent-178, score-0.223]
100 The value of monolingual crowdsourcing in a real-world translation scenario: simulation using Haitian Creole emergency SMS messages. [sent-222, score-0.213]
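The lexical simplification step (sentences 63 to 67) and the filtering step (sentences 15, 30 and 33) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenisation and the toy_confidence scorer are assumptions made for illustration only. In SORT the confidence score comes from a trained quality estimation component applied to the Moses translation of each candidate, which is replaced here by a hypothetical stand-in.

```python
from itertools import combinations


def lexical_rewritings(sentence, substitutions):
    """Apply every non-empty subset of the matching substitutions.

    `substitutions` maps a word to its simpler synonym. Tokenisation is a
    plain whitespace split (an assumption; punctuation is not handled).
    """
    tokens = sentence.split()
    present = [word for word in substitutions if word in tokens]
    rewritings = []
    for size in range(1, len(present) + 1):
        for subset in combinations(present, size):
            chosen = set(subset)
            rewritings.append(
                " ".join(substitutions[t] if t in chosen else t for t in tokens)
            )
    return rewritings


def suggest(sentence, candidates, confidence):
    """Return the original first, then the candidates that beat it, best first."""
    baseline = confidence(sentence)
    scored = [(confidence(c), c) for c in candidates]
    better = [(score, c) for score, c in scored if score > baseline]
    better.sort(key=lambda pair: pair[0], reverse=True)
    return [(sentence, baseline)] + [(c, score) for score, c in better]


# Hypothetical stand-in for the confidence estimator: it only penalises
# words from a hand-picked list and never looks at a translation.
HARD_WORDS = {"superseded", "rival"}


def toy_confidence(sentence):
    return -sum(token in HARD_WORDS for token in sentence.split())


SOURCE = "Why the Galileo research program superseded rival programs"
SUBSTITUTIONS = {"rival": "competitor", "superseded": "replaced"}
CANDIDATES = lexical_rewritings(SOURCE, SUBSTITUTIONS)
RANKED = suggest(SOURCE, CANDIDATES, toy_confidence)
```

With the two substitutions from the paper's example, CANDIDATES holds three rewritings (each substitution alone and both together), and RANKED lists the original sentence first, as the interface description requires, followed by the rewritings the stand-in scorer prefers.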
wordName wordTfidf (topN-words)
[('rewriting', 0.513), ('rewritings', 0.322), ('confidence', 0.205), ('authoring', 0.191), ('mirkin', 0.176), ('translation', 0.163), ('dymetman', 0.157), ('specia', 0.148), ('estimator', 0.134), ('simplification', 0.134), ('user', 0.12), ('reformulations', 0.117), ('shachar', 0.117), ('sort', 0.109), ('source', 0.104), ('translations', 0.097), ('venkatapathy', 0.096), ('mali', 0.096), ('icon', 0.09), ('choumane', 0.088), ('interactive', 0.081), ('lucia', 0.077), ('server', 0.07), ('mt', 0.069), ('marc', 0.068), ('aziz', 0.067), ('cancedda', 0.067), ('smt', 0.063), ('spanish', 0.062), ('simplified', 0.062), ('alps', 0.059), ('bras', 0.059), ('controller', 0.059), ('fense', 0.059), ('superseded', 0.059), ('translatability', 0.059), ('uploads', 0.059), ('translated', 0.058), ('quality', 0.057), ('java', 0.056), ('carbonell', 0.056), ('simpler', 0.054), ('displayed', 0.053), ('christmas', 0.052), ('dry', 0.052), ('arms', 0.052), ('estimation', 0.052), ('marton', 0.05), ('monolingual', 0.05), ('estimates', 0.049), ('competitors', 0.048), ('confident', 0.047), ('nicola', 0.047), ('original', 0.044), ('document', 0.044), ('operations', 0.044), ('texts', 0.044), ('tool', 0.043), ('substitutions', 0.043), ('linux', 0.043), ('moses', 0.042), ('koehn', 0.041), ('interactively', 0.041), ('sriram', 0.041), ('rival', 0.041), ('winner', 0.041), ('interface', 0.038), ('suggestion', 0.038), ('feedback', 0.038), ('sentences', 0.037), ('mode', 0.036), ('hu', 0.035), ('synonyms', 0.035), ('chris', 0.034), ('edited', 0.033), ('accessed', 0.033), ('view', 0.033), ('modifying', 0.033), ('preferred', 0.032), ('check', 0.032), ('target', 0.032), ('system', 0.032), ('rewrite', 0.031), ('handling', 0.031), ('alternatives', 0.03), ('du', 0.03), ('replacement', 0.03), ('controlled', 0.03), ('ido', 0.029), ('zhu', 0.029), ('estimated', 0.029), ('fluent', 0.028), ('control', 0.028), ('english', 0.028), ('srilm', 0.028), ('assessed', 0.028), ('power', 0.028), ('architecture', 0.028), ('synonym', 0.027), ('wish', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu
Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
2 0.13254265 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality.
3 0.13114782 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
Author: Daniel Beck ; Lucia Specia ; Trevor Cohn
Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.
4 0.12784243 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; ; ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
5 0.11463288 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
Author: Matthew Shardlow
Abstract: Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (the process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories, which is then used as the basis for several experiments. Firstly, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are carried out into the CW identification techniques of: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more naïve technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade off in recall.
6 0.11288077 322 acl-2013-Simple, readable sub-sentences
7 0.11187659 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
8 0.10499107 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
9 0.10384126 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
10 0.10304727 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
12 0.096639737 135 acl-2013-English-to-Russian MT evaluation campaign
13 0.096550584 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
14 0.095331624 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
15 0.093760744 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
16 0.093759924 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
17 0.092852928 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
18 0.092551239 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
19 0.089291304 255 acl-2013-Name-aware Machine Translation
20 0.083968192 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
topicId topicWeight
[(0, 0.2), (1, -0.069), (2, 0.16), (3, 0.034), (4, -0.007), (5, -0.01), (6, 0.002), (7, -0.028), (8, 0.061), (9, 0.049), (10, -0.057), (11, 0.07), (12, -0.072), (13, 0.063), (14, -0.05), (15, -0.065), (16, -0.048), (17, 0.002), (18, 0.014), (19, 0.023), (20, 0.002), (21, -0.025), (22, -0.077), (23, 0.042), (24, -0.068), (25, -0.008), (26, -0.035), (27, 0.109), (28, -0.002), (29, 0.035), (30, -0.119), (31, -0.037), (32, 0.003), (33, 0.083), (34, -0.0), (35, 0.021), (36, 0.012), (37, 0.02), (38, -0.034), (39, -0.021), (40, -0.14), (41, -0.084), (42, -0.048), (43, -0.03), (44, -0.054), (45, 0.043), (46, -0.032), (47, -0.061), (48, -0.092), (49, 0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.92009449 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu
Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
2 0.80475163 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
3 0.73947865 322 acl-2013-Simple, readable sub-sentences
Author: Sigrid Klerke ; Anders Søgaard
Abstract: We present experiments using a new unsupervised approach to automatic text simplification, which builds on sampling and ranking via a loss function informed by readability research. The main idea is that a loss function can distinguish good simplification candidates among randomly sampled sub-sentences of the input sentence. Our approach is rated as equally grammatical and beginner reader appropriate as a supervised SMT-based baseline system by native speakers, but our setup performs more radical changes that better resemble the variation observed in human generated simplifications.
4 0.71814638 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
5 0.70665938 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; ; ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
6 0.70084924 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
7 0.68902147 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
8 0.67063278 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
9 0.66670597 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
10 0.65099102 250 acl-2013-Models of Translation Competitions
11 0.64561969 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
12 0.64339608 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
13 0.63262141 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
14 0.6156463 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
15 0.60181397 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
17 0.59135777 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
18 0.59092063 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
19 0.58563268 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
20 0.58466697 255 acl-2013-Name-aware Machine Translation
topicId topicWeight
[(0, 0.05), (6, 0.053), (11, 0.034), (24, 0.052), (26, 0.423), (28, 0.011), (35, 0.051), (42, 0.049), (48, 0.024), (70, 0.02), (88, 0.02), (90, 0.052), (95, 0.088)]
simIndex simValue paperId paperTitle
1 0.97540259 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
Author: Alexandra Balahur ; Hristo Tanev
Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no longer remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media, and the public’s perception of them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.
2 0.97119039 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina
Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results.

1 Unsupervised part-of-speech tagging

Currently, part-of-speech (POS) taggers are available for many widely spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widely spoken languages, including Bengali, Javanese, and Lahnda, have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. At the same time, with the growing quantity of text available online, and in particular multilingual parallel texts from sources such as multilingual websites, government documents and large archives of human translations of books, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers.

In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (2011). In this approach, a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manually-labelled data in the target language).
Our approach is substantially simpler than that of Das and Petrov, the current state of the art, yet performs comparably well.

2 Related work

There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008).

A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001; Das and Petrov, 2011). Das and Petrov (2011) achieved the current state of the art for unsupervised tagging by exploiting high-confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices.
Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler than that of Das and Petrov in that it does not require convex optimization for label propagation or a feature-based HMM, yet it achieves comparable results.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 634–639, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

Model                               Coverage   Accuracy
Many-to-1 alignments                88%        68%
1-to-1 alignments                   68%        78%
1-to-1 alignments: Top 60k sents    91%        80%

Table 1: Token coverage and accuracy of many-to-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only.

3 Tagset

Our tagger exploits the idea of projecting tag information from a resource-rich to a resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, language-specific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese.

4 A Simpler Unsupervised POS Tagger

Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise.
We describe the seed model and then explain how it is successively refined through self-training and revision.

4.1 Seed Model

The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1 provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data.

Algorithm 1 Build seed model
1: Tag source side.
2: Word align the corpus with Giza++ and remove the many-to-one mappings.
3: Project tags from source to target using the remaining 1-to-1 alignments.
4: Select the top n sentences based on sentence alignment score.
5: Estimate emission and transition probabilities.
6: Build seed tagger T.

We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN.

We build a French tagger based on English–French data from the Europarl Corpus (Koehn, 2005). We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1 confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that high-confidence tags are important, and hence eliminate the many-to-one alignments.
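Steps 2 and 3 of Algorithm 1 can be illustrated with a small sketch. This is not the authors' code: the function name `project_tags`, the representation of an alignment as `(source_index, target_index)` pairs, and the rule of dropping every link whose source or target word takes part in more than one link are assumptions made for illustration.

```python
from collections import Counter


def project_tags(source_tags, alignment):
    """Project source tags onto target tokens through 1-to-1 links only.

    source_tags: list of tags, one per source token.
    alignment:   list of (source_index, target_index) pairs (hypothetical format).
    Returns a dict mapping target_index -> projected tag.
    """
    # Count how many links each source and each target token participates in.
    source_links = Counter(s for s, _ in alignment)
    target_links = Counter(t for _, t in alignment)

    projected = {}
    for s, t in alignment:
        # Keep a link only if both ends are used exactly once (1-to-1).
        if source_links[s] == 1 and target_links[t] == 1:
            projected[t] = source_tags[s]
    return projected


# English "laws apply" aligned to French "les lois s'appliquent":
# "laws" links to both "les" and "lois", so neither French word gets a tag.
tags = project_tags(["NOUN", "VERB"], [(0, 0), (0, 1), (1, 2)])
print(tags)
```

In this toy case only the third French token receives a tag, which mirrors the trade-off in Table 1: fewer tagged tokens, but none of the noise that the "les"/"laws" link would introduce.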
In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff to 60k to balance the accuracy and size of the seed model (footnote 1). Returning to our preliminary English–French experiments in Table 1, this process gives improvements in both accuracy and coverage (footnote 2).

Footnote 1: We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model.
Footnote 2: We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments.

The number of parameters for the emission probability is $|V| \times |T|$, where $V$ is the vocabulary and $T$ is the tag set. The transition probability, on the other hand, has only $|T|^3$ parameters for the trigram model we use. Because of this difference in number of parameters, in Step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has fewer parameters, we again focus on “better” sentences, by estimating this probability from only those sentences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability.
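The two sentence-selection criteria and the parameter counts above can be made concrete with a short sketch. The helper names `usable_for_transitions` and `parameter_counts`, the use of `None` for a token without a projected tag, and the vocabulary size of 50,000 are illustrative assumptions, not values from the paper.

```python
def usable_for_transitions(tags, min_coverage=0.9, min_length=4):
    """Return True if a sentence qualifies for transition estimation.

    tags: one entry per target token; None marks a token that received
          no tag through direct projection (hypothetical encoding).
    Both thresholds are strict, as in the text: coverage > 90%, length > 4.
    """
    if len(tags) <= min_length:
        return False
    covered = sum(tag is not None for tag in tags)
    return covered / len(tags) > min_coverage


def parameter_counts(vocab_size, tagset_size):
    """Emission parameters |V| x |T| and trigram transition parameters |T|^3."""
    return vocab_size * tagset_size, tagset_size ** 3


# With the 12 universal tags and an assumed vocabulary of 50,000 word types.
emission, transition = parameter_counts(50_000, 12)
print(emission, transition)
```

Under these assumed sizes the emission table has several hundred times more parameters than the transition table, which is consistent with estimating emissions from all selected sentences and transitions from the smaller, cleaner subset.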
For unaligned words, we simply assign a random POS and very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TNT tagger (Brants, 2000), an implementation of a trigram HMM tagger.

4.2 Self training and revision

For self training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores.

We believe that sentence alignment score might correspond to difficulty to tag. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide sentences into blocks of 60k.

Algorithm 2 Self training and revision
1: Divide target language sentences into blocks of n sentences.
2: Tag the first block with the seed tagger.
3: Revise the tagged block.
4: Train a new tagger on the tagged block.
5: Add the previous tagger’s lexicon to the new tagger.
6: Use the new tagger to tag the next block.
7: Go to 3 and repeat until all blocks are tagged.

In Step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection. Suppose source language word $w_i^s$ is aligned with target language word $w_j^t$ with probability $p(w_j^t \mid w_i^s)$, $T_i^s$ is the tag for $w_i^s$ using the tagger available for the source language, and $T_j^t$ is the tag for $w_j^t$ using the tagger learned for the target language. If $p(w_j^t \mid w_i^s) > S$, where $S$ is a threshold which we heuristically set to 0.7, we replace $T_j^t$ by $T_i^s$.

Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of, i.e., if $p(w_j^t \mid w_i^s) < S$ and $T_j^t \neq T_i^s$ then $T_j^t = \text{Null}$. So, on the target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model.

Step 4 estimates the emission and transition probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage.

5 Experimental Results

Using parallel data from Europarl (Koehn, 2005) we apply our method to build taggers for the same eight target languages as Das and Petrov (2011): Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish, with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset which we were unable to obtain). The evaluation metric and test data are the same as those used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003).
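Returning to the revision step of Section 4.2, the replacement and removal rules for a single aligned target token can be sketched as follows. The function name `revise_tag` and the use of `None` for the Null tag are assumptions; the text does not say what happens when the alignment probability equals the threshold exactly, so this sketch treats that case as low confidence.

```python
def revise_tag(model_tag, projected_tag, align_prob, threshold=0.7):
    """Revise the tag of one aligned target token.

    model_tag:     tag assigned by the current target-language tagger.
    projected_tag: tag of the aligned source word.
    align_prob:    alignment probability p(target word | source word).
    threshold:     the value S, set to 0.7 in the text.
    Returns the revised tag, or None when the tag is removed.
    """
    if align_prob > threshold:
        # Confident alignment: trust the projected source tag.
        return projected_tag
    if model_tag != projected_tag:
        # Low-confidence alignment and a disagreement: drop the tag.
        return None
    # Low-confidence alignment, but tagger and projection agree.
    return model_tag


print(revise_tag("NOUN", "VERB", 0.9))
print(revise_tag("NOUN", "VERB", 0.5))
print(revise_tag("NOUN", "NOUN", 0.5))
```

Dropping uncertain tags rather than keeping the tagger's guess is what keeps early errors from being fed back into the next training block.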
                           Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
Seed model                 83.7    81.1   83.6    77.8   78.6     84.9        81.4     78.9     81.3
Self training + revision   85.6    84.0   85.4    80.4   81.4     86.3        83.3     81.0     83.4
Das and Petrov (2011)      83.2    79.5   82.8    82.5   86.8     87.9        84.2     80.5     83.4

Table 2: Token-level POS tagging accuracy for our seed model, self training and revision, and the method of Das and Petrov (2011). The best results on each language, and on average, are shown in bold.

Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens for Italian (left) and Dutch (right). The horizontal axis of both panels is the iteration.

Table 2 shows results for our seed model, self training and revision, and the results reported by Das and Petrov. Self training and revision improve the accuracy for every language over the seed model, and give an average improvement of roughly two percentage points. The average accuracy of self training and revision is on par with that reported by Das and Petrov. On individual languages, self training and revision and the method of Das and Petrov are split: each performs better on half of the cases. Interestingly, our method achieves higher accuracies on Germanic languages (the family of our source language, English), while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which might be more accurate for more-related languages, whereas Das and Petrov additionally rely on label propagation.

Compared to Das and Petrov, our model performs worst on Italian, in terms of percentage point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and proportion of known tokens for each iteration of our model for Italian; iteration 0 is the seed model, and iteration 31 is the final model.
Our model performs poorly on unknown words, as indicated by the low accuracy on unknown words and the high accuracy on known words compared to the overall accuracy. The poor performance on unknown words is expected because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did.

We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then subsequently improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow.

6 Conclusion

We have proposed a method for unsupervised POS tagging that performs on par with the current state of the art (Das and Petrov, 2011), but is substantially less sophisticated (specifically not requiring convex optimization or a feature-based HMM). The complexity of our algorithm is $O(n \log n)$ compared to $O(n^2)$ for that of Das and Petrov (2011), where $n$ is the size of training data (footnote 3). We have made our code available for download (footnote 4). In future work we intend to consider using a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more-closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language.
Using our final model with unsupervised HMM methods might improve the final performance too, i.e., use our final model as the initial state for an HMM, then experiment with different inference algorithms such as Expectation Maximization (EM), Variational Bayes (VB) or Gibbs sampling (GS) (footnote 5). Gao and Johnson (2008) compare EM, VB and GS for unsupervised English POS tagging. In many cases, GS outperformed other methods, thus we would like to try GS first for our model.

Footnote 3: We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight core Intel Xeon 3.16GHz CPU with 32 GB RAM, but only 15 minutes for our model.
Footnote 4: https://code.google.com/p/universal-tagger/
Footnote 5: We in fact have tried EM, but it did not help. The overall performance dropped slightly. This might be because self-training with revision already found the local maximum.

7 Acknowledgements

This work is funded by Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us the Greek Treebank and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the sixth conference on Applied natural language processing (ANLP ’00), pages 224–231. Seattle, Washington, USA.
Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA.
Pascal Denis and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort.
In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China.
Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), pages 549–554. Genoa, Italy.
Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA.
Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP ’04), pages 222–229. Barcelona, Spain.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL ’06), pages 152–159. New York, USA.
Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2089–2096. Istanbul, Turkey.
Siva Reddy and Serge Sharoff. 2011. Cross language POS Taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources.
In Proceedings of the IJCNLP 2011 workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand.
Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 1041–1050. Honolulu, Hawaii.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL ’03), pages 173–180. Edmonton, Canada.
David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies (NAACL ’01), pages 1–8. Pittsburgh, Pennsylvania, USA.
3 0.95937049 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
Author: Kristina Toutanova ; Byung-Gyu Ahn
Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in performance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
4 0.95905435 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies
Author: Reut Tsarfaty
Abstract: Stanford Dependencies (SD) provide a functional characterization of the grammatical relations in syntactic parse-trees. The SD representation is useful for parser evaluation, for downstream applications, and, ultimately, for natural language understanding. However, the design of SD focuses on structurally-marked relations and under-represents morphosyntactic realization patterns observed in Morphologically Rich Languages (MRLs). We present a novel extension of SD, called Unified-SD (U-SD), which unifies the annotation of structurally- and morphologically-marked relations via an inheritance hierarchy. We create a new resource composed of U-SD-annotated constituency and dependency treebanks for the MRL Modern Hebrew, and present two systems that can automatically predict U-SD annotations, for gold segmented input as well as raw texts, with high baseline accuracy.
5 0.93924546 257 acl-2013-Natural Language Models for Predicting Programming Comments
Author: Dana Movshovitz-Attias ; William W. Cohen
Abstract: Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to code completion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing.

1 Introduction and Related Work

Statistical language models have traditionally been used to describe and analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. (2012) observe that, as code is created by humans, it is likely to be repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion (Han et al., 2009; Jacob and Tairas, 2010), analysis of names in code (Lawrie et al., 2006; Binkley et al., 2011) and mining software repositories (Gabel and Su, 2008).

An important part of software programming and maintenance lies in documentation, which may come in the form of tutorials describing the code, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may include examples of use-cases for specific code segments or identifiers such as classes, methods and variables.
Well documented code is easier to read and maintain in the long run, but writing comments is a laborious task that is often overlooked or at least postponed by many programmers. Code commenting not only provides a summarization of the conceptual idea behind the code (Sridhara et al., 2010), but can also be viewed as a form of document expansion where the comment contains significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, code categorization, and locating parts of the code that are relevant to a specific topic or idea (Tseng and Juang, 2003; Wan et al., 2007; Kumar and Carterette, 2013; Shepherd et al., 2007; Rastkar et al., 2011). A related and well studied NLP task is that of predicting natural language caption and commentary for images and videos (Blei and Jordan, 2003; Feng and Lapata, 2010; Feng and Lapata, 2013; Wu and Li, 2011).

In this work, our goal is to apply statistical language models for predicting class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected as n-grams have been shown to be a strong model for language and speech prediction that is hard to improve upon (Rosenfeld, 2000). In some cases however, for example in a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA (Blei et al., 2003) and link-LDA (Erosheva et al., 2004) topic models, which are more relevant for the term extraction scenario. We find that the topic model performance can be improved by distinguishing code and text tokens in the code.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 35–40, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

2 Method

2.1 Models

We train n-gram models (n = 1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets (described below). We use the Berkeley Language Model package (Pauls and Klein, 2011) with absolute discounting (Kneser-Ney smoothing; Kneser and Ney, 1995), which includes a backoff strategy to lower-order n-grams.

Next, we use LDA topic models (Blei et al., 2003) trained on the same data, with 1, 5, 10 and 20 topics. The joint distribution of a topic mixture $\theta$, and a set of $N$ topics $z$, for a single source code document with $N$ observed word tokens, $d = \{w_i\}_{i=1}^{N}$, given the Dirichlet parameters $\alpha$ and $\beta$, is therefore

$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{w} p(z \mid \theta)\, p(w \mid z, \beta)$  (1)

Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, $d = (\{w_i^{code}\}_{i=1}^{C_n}, \{w_i^{text}\}_{i=1}^{T_n})$, where the text words are tokens from comment and string literals, and the code words include the programming language syntax tokens (e.g., public, private, for, etc.) and all identifiers. In this case, we train link-LDA models (Erosheva et al., 2004) with 1, 5, 10 and 20 topics. Under the link-LDA model, the mixed-membership joint distribution of a topic mixture, words and topics is then

$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \cdot \prod_{w^{text}} p(z^{text} \mid \theta)\, p(w^{text} \mid z^{text}, \beta) \cdot \prod_{w^{code}} p(z^{code} \mid \theta)\, p(w^{code} \mid z^{code}, \beta)$  (2)

where $\theta$ is the joint topic distribution, $w$ is the set of observed document words, $z^{text}$ is a topic associated with a text word, and $z^{code}$ a topic associated with a code word. The LDA and link-LDA models use Gibbs sampling (Griffiths and Steyvers, 2004) for topic inference, based on the implementation of Balasubramanyan and Cohen (2011) with single or multiple entities per document, respectively.
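To make the idea of discounting with backoff to lower-order n-grams concrete, here is a minimal sketch of an interpolated absolute-discounting bigram model. It is an illustration under stated assumptions, not the Berkeley Language Model implementation: the function name `train_bigram`, the fixed discount of 0.75, and the use of a plain unigram distribution as the lower-order model (full Kneser-Ney uses continuation counts instead) are all simplifications.

```python
from collections import Counter, defaultdict


def train_bigram(tokens, discount=0.75):
    """Build a bigram probability function with absolute discounting.

    A fixed discount is subtracted from every observed bigram count, and the
    freed probability mass is redistributed according to unigram frequencies.
    """
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    followers = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        followers[prev][cur] += 1

    def prob(word, prev):
        p_unigram = unigrams[word] / total
        context = followers.get(prev)
        if not context:
            # Unseen context: back off entirely to the unigram model.
            return p_unigram
        context_total = sum(context.values())
        discounted = max(context[word] - discount, 0.0) / context_total
        # Mass removed by discounting, one discount per distinct follower.
        backoff_mass = discount * len(context) / context_total
        return discounted + backoff_mass * p_unigram

    return prob


comment_tokens = "returns the min margin returns the max margin".split()
prob = train_bigram(comment_tokens)
print(prob("min", "the"), prob("returns", "the"))
```

Because the discounted mass is handed to the unigram model, a word never seen after a given context still receives a non-zero probability, while observed continuations such as "min" after "the" remain more likely.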
2.2 Testing Methodology

Our goal is to predict the tokens of the JAVA class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word $w_i$ is given by considering previous words, $p(w_i \mid w_{i-1}, \ldots, w_0)$. This probability is estimated given the previous $n-1$ tokens as $p(w_i \mid w_{i-1}, \ldots, w_{i-(n-1)})$.

For the topic models, we separate the document tokens into the class definition and the comment we wish to predict. The set of tokens of the class comment, $w^c$, are all considered as text tokens. The rest of the tokens in the document, $w^r$, are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the $w^r$ tokens:

$p(\theta, z^r \mid w^r, \alpha, \beta) = \frac{p(\theta, z^r, w^r \mid \alpha, \beta)}{p(w^r \mid \alpha, \beta)}$  (3)

This gives us an estimate of the document distribution, $\theta$, with which we infer the probability of the comment tokens as

$p(w^c \mid \theta, \beta) = \sum_{z} p(w^c \mid z, \beta)\, p(z \mid \theta)$  (4)

Following Blei et al.
(2003), for the case of single-entity LDA, the inference problem from equation (3) can be solved by considering $p(\theta, z, w \mid \alpha, \beta)$, as in equation (1), and by taking the marginal distribution of the document tokens as a continuous mixture distribution for the set $w = w^r$, by integrating over θ and summing over the set of topics z:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{w} \sum_{z} p(z \mid \theta)\, p(w \mid z, \beta) \right) d\theta \tag{5}$$

For the case of link-LDA, where the document is composed of two entities, in our case code tokens and text tokens, we can consider the mixed-membership joint distribution, as in equation (2), and similarly the marginal distribution $p(w \mid \alpha, \beta)$ over both code and text tokens from $w^r$. Since comment words in $w^c$ are all considered as text tokens, they are sampled using text topics, namely $z^{text}$, in equation (4).

3 Experimental Settings

3.1 Data and Training Methodology

We use source code from nine open source JAVA projects: Ant, Cassandra, Log4j, Maven, MinorThird, Batik, Lucene, Xalan and Xerces. For each project, we divide the source files into a training and a testing dataset. Then, for each project in turn, we consider the following three main training scenarios, which yield three training datasets.

To emulate a scenario in which we are predicting comments in the middle of project development, we can use data (documented code) from the same project. In this case, we use the in-project training dataset (IN). Alternatively, if we train a comment prediction model at the beginning of the development, we need to use source files from other, possibly related projects. To analyze this scenario, for each of the projects above we train models using an out-of-project dataset (OUT) containing data from the other eight projects.

Typically, source code files contain a greater amount of code versus comment text.
Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. We use data from the popular Q&A website StackOverflow (SO), where users ask and answer technical questions about software development, tools, algorithms, etc. We downloaded a dataset of all actions performed on the site from its launch in August 2008 until August 2012. The data includes 3,453,742 questions and 6,858,133 answers posted by 1,295,620 users. We used only posts that are tagged as JAVA-related questions and answers.

All the models for each project are then tested on the testing set of that project. We report results averaged over all projects in Table 1.

Source files were tokenized using the Eclipse JDT compiler tools, separating code tokens and identifiers. Identifier names (of classes, methods and variables) were further tokenized by camel case notation (e.g., 'minMargin' was converted to 'min margin'). Non-alphanumeric tokens (e.g., dot, semicolon) were discarded from the code, as well as numeric and single-character literals. Text from comments or any string literals within the code was further tokenized with the Mallet statistical natural language processing package (McCallum, 2002). Posts from SO were parsed using the Apache Tika toolkit¹ and then tokenized with the Mallet package. We considered as raw code tokens anything labeled using a markup (as indicated by the SO users who wrote the post).

3.2 Evaluation

Since our models are trained using various data sources, the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric, which aims at quantifying the percentage of characters that can be saved by using the model in a word-completion setting, similar to standard code completion tools built into code editors.
For a comment word with n characters, $w = w_1, \ldots, w_n$, we predict the two most likely words given each model, filtered by the first 0, ..., n characters of w. Let k be the minimal $k_i$ for which w is in the top two predicted word tokens, where tokens are filtered by the first $k_i$ characters. Then, the number of saved characters for w is n − k. In Table 1 we report the average percentage of saved characters per comment using each of the above models. The final results are also averaged over the nine input projects. As an example, in the predicted comment shown in Table 2, taken from the MinorThird project, the token entity is the most likely token according to the SO trigram model, out of tokens starting with the prefix 'en'. The saved characters in this case are 'tity'.

4 Results

Table 1 displays the average percentage of characters saved per class comment using each of the models. Models trained on in-project data (IN) perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47.1% of characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments, and identifier names that appear in the comment of one class may appear in the code of another class in the same project. Clearly, in-project data should be used when available, as it improves comment prediction, leading to an average increase of between 6 percentage points for the worst model (26.6 for OUT unigram versus 33.05 for IN) and 14 percentage points for the best (32.96 for OUT trigram versus 47.1 for IN).
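The character saving metric defined in Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind Table 1: the function name `saved_characters` and the `scores` interface (a dictionary from candidate words to model probabilities at the current position) are assumptions, and the probabilities in `example_scores` are made up so that the run mirrors the 'en' / 'tity' example above.

```python
def saved_characters(word, scores, top=2):
    """Number of characters saved when completing `word`.

    scores -- dict mapping candidate words to model probabilities for
              the current position (hypothetical interface)
    top    -- how many suggestions are shown (two in the paper)
    """
    for k in range(len(word) + 1):
        prefix = word[:k]
        # Keep only candidates consistent with the first k typed characters.
        candidates = [c for c in scores if c.startswith(prefix)]
        candidates.sort(key=lambda c: scores[c], reverse=True)
        if word in candidates[:top]:
            return len(word) - k      # n - k characters need not be typed
    return 0                          # never suggested: nothing is saved

# Made-up probabilities, for illustration only.
example_scores = {"the": 0.4, "each": 0.2, "extractor": 0.15,
                  "entity": 0.1, "end": 0.05}

print(saved_characters("entity", example_scores))  # prints 4 ('tity' is saved)
```

With a context-dependent model such as a trigram, `scores` would be recomputed for every comment position from the preceding tokens before calling the function.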
¹ http://tika.apache.org/

Model      n / topics   IN             OUT            SO
n-gram     1            33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
n-gram     2            43.27 (5.79)   31.52 (4.17)   33.29 (4.40)
n-gram     3            47.1 (6.87)    32.96 (4.33)   34.56 (4.78)
LDA        20           34.20 (3.63)   26.79 (3.26)   27.25 (3.67)
LDA        10           33.93 (3.67)   26.8 (3.36)    27.22 (3.44)
LDA        5            33.63 (3.67)   26.86 (3.44)   27.34 (3.55)
LDA        1            33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
Link-LDA   20           35.76 (3.95)   28.03 (3.60)   28.08 (3.48)
Link-LDA   10           35.81 (4.12)   28 (3.56)      28.12 (3.58)
Link-LDA   5            35.37 (3.98)   28 (3.67)      27.94 (3.56)
Link-LDA   1            34.59 (3.92)   27.82 (3.62)   27.9 (3.45)

Table 1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on three training sets: IN, OUT, and SO. The results are averaged over nine JAVA projects (with standard deviations in parentheses).

Model         Predicted Comment
IN trigram    "Train a named-entity extractor"
IN link-LDA   "Train a named-entity extractor"
OUT trigram   "Train a named-entity extractor"
SO trigram    "Train a named-entity extractor"

Table 2: Sample comment from the MinorThird project predicted using IN, OUT and SO based models. Saved characters are underlined.

Of the out-of-project data sources, models using a greater amount of text (SO) mostly outperformed models based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger word dictionary associated with the SO data. Note that in the scope of this work we did not investigate the contribution of each of the background projects used in OUT, and how their relevance to the target prediction project affects their performance.

The trigram model shows the best performance across all training data sources (47% for IN, 32% for OUT and 34% for SO). Amongst the tested topic models, link-LDA models, which distinguish code and text tokens, perform consistently better than simple LDA models in which all tokens are considered as text. We did not, however, find a correlation between the number of latent topics learned by a topic model and its performance.
In fact, for each of the data sources, a different number of topics gave the optimal character saving results. Note that in this work, all topic models are based on unigram tokens, therefore their results are most comparable with that of the unigram in Table 1, which does not benefit from the backoff strategy used by the bigram and trigram models. By this comparison, the link-LDA topic model proves more successful in the comment prediction task than the simpler models which do not distinguish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown).

Dataset   n-gram    link-LDA
IN        2778.35   574.34
OUT       1865.67   670.34
SO        1898.43   638.55

Table 3: Average words per project for which each tested model completes the word better than the other. This indicates that each of the models is better at predicting a different set of comment words.

Table 2 shows a sample comment segment for which words were predicted using trigram models from all training sources and an in-project link-LDA. The comment is taken from the TrainExtractor class in the MinorThird project, a machine learning library for annotating and categorizing text. Both IN models show a clear advantage in completing the project-specific word Train, compared to models based on out-of-project data (OUT and SO). Interestingly, in this example the trigram is better at completing the term named-entity given the prefix named. However, the topic model is better at completing the word extractor, which refers to the target class. This example indicates that each model type may be more successful in predicting different comment words, and that combining multiple models may be advantageous. This can also be seen by the analysis in Table 3, where we compare the average number of words completed better by either the best n-gram or topic model given each training dataset.
Again, while n-grams generally complete more words better, a considerable portion of the words is better completed using a topic model, further motivating a hybrid solution.

5 Conclusions

We analyze the use of language models for predicting class comments for source file documents containing a mixture of code and text tokens. Our experiments demonstrate the effectiveness of using language models for comment completion, showing a saving of up to 47% of the comment characters. When available, using in-project training data proves significantly more successful than using out-of-project data. However, we find that when using out-of-project data, a dataset based on more words than code performs consistently better. The results also show that different models are better at predicting different comment words, which motivates a hybrid solution combining the advantages of multiple models.

Acknowledgments

This research was supported by the NSF under grant CCF-1247088.

References

Ramnath Balasubramanyan and William W Cohen. 2011. Block-LDA: Jointly modeling entity-annotated text and entity-entity links. In Proceedings of the 7th SIAM International Conference on Data Mining.
Dave Binkley, Matthew Hearn, and Dawn Lawrie. 2011. Improving identifier informativeness using part of speech information. In Proc. of the Working Conference on Mining Software Repositories. ACM.
David M Blei and Michael I Jordan. 2003. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.
Elena Erosheva, Stephen Fienberg, and John Lafferty. 2004. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America.
Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth?
automatic caption generation for news images. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Mark Gabel and Zhendong Su. 2008. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 339–349. ACM.
Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America.
Sangmok Han, David R Wallace, and Robert C Miller. 2009. Code completion from abbreviated input. In Automated Software Engineering, 2009. ASE'09. 24th IEEE/ACM International Conference on, pages 332–343. IEEE.
Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE.
Ferosh Jacob and Robert Tairas. 2010. Code template inference using language models. In Proceedings of the 48th Annual Southeast Regional Conference. ACM.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., volume 1, pages 181–184. IEEE.
Naveen Kumar and Benjamin Carterette. 2013. Time based feedback and query expansion for twitter search. In Advances in Information Retrieval, pages 734–737. Springer.
Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. What's in a name? A study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE.
Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit.
Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models.
In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 258–267.
Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 103–112. IEEE.
Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278.
David Shepherd, Zachary P Fry, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2007. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th international conference on Aspect-oriented software development, pages 212–224. ACM.
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pages 43–52. ACM.
Yuen-Hsien Tseng and Da-Wei Juang. 2003. Document-self expansion for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 399–400. ACM.
Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Single document summarization with document expansion. In Proc. of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
Roung-Shiunn Wu and Po-Chun Li. 2011. Video annotation using hierarchical dirichlet process mixture model. Expert Systems with Applications, 38(4):3040–3048.
same-paper 6 0.90439278 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
8 0.7131688 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study
9 0.71106833 209 acl-2013-Joint Modeling of News Reader's and Comment Writer's Emotions
10 0.70463449 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing
11 0.69684809 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
12 0.69464302 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
13 0.69230443 163 acl-2013-From Natural Language Specifications to Program Input Parsers
14 0.68843955 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction
15 0.68314087 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
16 0.68036878 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
17 0.67935807 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
18 0.66947168 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
19 0.66848463 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification
20 0.6596204 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing