acl acl2011 acl2011-90 knowledge-graph by maker-knowledge-mining

90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals


Source: pdf

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. [sent-5, score-0.806]

2 We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. [sent-6, score-0.677]

3 Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. [sent-7, score-0.539]

4 We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. [sent-8, score-1.148]

5 We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. [sent-10, score-0.949]

6 The total cost is more than an order of magnitude lower than professional translation. [sent-11, score-0.459]

7 1 Introduction In natural language processing research, translations are most often used in statistical machine translation (SMT), where systems are trained using bilingual sentence-aligned parallel corpora. [sent-12, score-0.752]

8 These include harvesting the web for translations or comparable corpora (Resnik and Smith, 2003; Munteanu and Marcu, 2005; Smith et al. [sent-17, score-0.447]

9 , 2003; Niessen and Ney, 2004), or designing models that are capable of learning translations from monolingual corpora (Rapp, 1995; Fung and Yee, 1998; Schafer and Yarowsky, 2002; Haghighi et al. [sent-22, score-0.447]

10 For example, Germann (2001) estimated the cost of hiring professional translators to create a Tamil-English corpus at $0. [sent-25, score-0.698]

11 In this paper we examine the idea of creating low-cost translations via crowdsourcing. [sent-29, score-0.577]

12 We use Amazon’s Mechanical Turk to hire a large group of nonprofessional translators, and have them recreate an Urdu–English evaluation set at a fraction of the cost of professional translators. [sent-30, score-0.55]

13 The original dataset already has professionally-produced reference translations, which allows us to objectively and quantitatively compare the quality of professional and nonprofessional translations. [sent-31, score-0.572]

14 get high quality translations in aggregate by soliciting multiple translations, redundantly editing them, and then selecting the best of the bunch. [sent-35, score-0.721]

15 To select the best translation, we use a machine-learning-inspired approach that assigns a score to each translation we collect. [sent-36, score-0.301]

16 The scores discriminate acceptable translations from those that are not (and competent translators from those who are not). [sent-37, score-0.688]

17 Therefore, soliciting translations from anonymous non-professionals carries a significant risk of poor translation quality. [sent-51, score-0.736]

18 Whereas hiring a professional translator ensures a degree of quality and care, it is not very difficult to find bad translations provided by Turkers. [sent-52, score-1.064]

19 The translations often reflect non-native English, but are generally done conscientiously (in spite of the relatively small payment). [sent-56, score-0.447]

20 We then describe a principled approach to discriminate good translations from bad ones, given a set of redundant translations for the same source sentence. [sent-63, score-1.053]

21 The set includes four different reference translations for each source sentence, produced by professional translation agencies. [sent-67, score-1.211]

22 NIST contracted the LDC to oversee the translation process and perform quality control. [sent-68, score-0.306]

23 This particular dataset, with its multiple reference translations, is very useful because we can measure the quality range for professional translators, which gives us an idea of whether or not the crowdsourced translations approach the quality of a professional translator. [sent-69, score-1.448]

24 2 Translation HIT design We solicited English translations for the Urdu sentences in the NIST dataset. [sent-71, score-0.515]

25 Our HIT involved showing the worker a sequence of Urdu sentences, and asking them to provide an English translation for each one. [sent-74, score-0.405]

26 In our first collection effort, we solicited only one translation per Urdu sentence. [sent-79, score-0.306]

27 After confirming that the task is feasible due to the large pool of workers willing and able to provide translations, we carried out a second collection effort, this time soliciting three translations per Urdu sentence (from three distinct translators). [sent-80, score-0.609]

28 The translations from the first pass were of noticeably low quality, most likely due to Turkers using automatic translation systems. [sent-84, score-0.723]

29 That said, we do not discard the translations from the first pass, and we do include them in our experiments. [sent-86, score-0.447]

30 3 Post-editing and Ranking HITs In addition to collecting four translations per source sentence, we also collected post-edited versions of the translations, as well as ranking judgments about their quality. [sent-88, score-0.672]

31 Figure 2 gives examples of the unedited translations that we collected in the translation pass. [sent-89, score-0.781]

32 We posted another MTurk task where we asked workers to edit the translations into more fluent and grammatical sentences. [sent-91, score-0.611]

33 We presented the translations in groups of four, and the annotator’s task was to rank the sentences by fluency, from best to worst (allowing ties). [sent-94, score-0.549]

34 Each translation is edited three times (by three distinct editors). [sent-96, score-0.337]

35 We solicited only one edit per translation from our first pass translation effort. [sent-97, score-0.635]

36 So, in total, we had 10 post-edited translations for each source sentence. These translations are put through a subsequent editing step, where multiple edited versions are produced. [sent-98, score-1.083]

37 We select the best translation from the set using features that predict the quality of each translation and each translator. [sent-99, score-0.619]

38 In the ranking task, we collected judgments from five distinct workers for each translation group. [sent-101, score-0.462]

39 1 We also use about 10% of the existing professional references in most of our experiments (see 4. [sent-115, score-0.371]

40 4 Quality Control Model Our approach to building a translation set from the available data is to select, for each Urdu sentence, the one translation that our model believes to be the best out of the available translations. [sent-129, score-0.476]

41 We evaluate various selection techniques by comparing the selected Turker translations against existing professionally-produced translations. [sent-130, score-0.447]

42 The more the selected translations resemble the professional translations, the higher the quality. [sent-131, score-0.818]

43 For a source sentence si, our model assigns a score to each sentence in the set of available translations {ti,1, . [sent-134, score-0.518]

44 • Sentence length features: a good translation tends to be comparable in length to the source sentence, whereas an overly short or long translation is probably bad. [sent-150, score-0.513]

45 • Edit rate to other translations: a bad translation is likely not to be very similar to the other translations, since there are many more ways a translation can be bad than for it to be good. [sent-155, score-0.597]

46 So, we compute the average edit rate distance from the other translations (using the TER metric). [sent-156, score-0.531]
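
The sentences above describe a simple feature-based selection model: compute a feature vector for every candidate translation and keep the highest-scoring one per source sentence. The sketch below illustrates this with two of the named features, a length ratio to the source and an average edit rate against the other redundant translations; a plain word-level Levenshtein distance stands in for the TER metric, and the function names and toy data are illustrative assumptions, not taken from the authors' implementation.

```python
# Minimal sketch of feature-based selection over redundant translations.
# word_edit_rate is a plain word-level Levenshtein, used here only as a
# stand-in for the TER metric mentioned above.

def word_edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (hw != rw)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

def features(candidate, others, source_len):
    """Two illustrative features: length ratio to the source sentence and
    average edit rate against the other redundant translations."""
    length_ratio = len(candidate.split()) / max(source_len, 1)
    avg_edit = sum(word_edit_rate(candidate, o) for o in others) / max(len(others), 1)
    return [length_ratio, avg_edit]

def select_best(candidates, source_len, weights):
    """Assign a linear-model score to each candidate and return the argmax."""
    def score(c):
        others = [o for o in candidates if o is not c]
        return sum(w * f for w, f in zip(weights, features(c, others, source_len)))
    return max(candidates, key=score)

# Toy usage with made-up candidates for one source sentence of 8 words.
candidates = ["the minister issued a statement to the press today",
              "minister gave statement today press",
              "the minister gave a statement to the press today"]
print(select_best(candidates, source_len=8, weights=[0.2, -1.0]))
```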

47 Other features (not investigated here) could include source-target information, such as translation model scores or the number of source words translated correctly according to a bilingual dictionary. [sent-170, score-0.382]

48 2 Parameter Tuning Once features are computed for the sentences, we must set the model’s weight vector. Naturally, the weights should be chosen so that good translations get high scores, and bad translations get low scores. [sent-172, score-0.985]

49 We optimize translation quality against a small subset (10%) of reference (professional) translations. [sent-173, score-0.382]

50 MERT is an iterative algorithm used to tune parameters of an MT system, which operates by iteratively generating new candidate translations and adjusting the weights to give good translations a high score, then regenerating new candidates based on the updated weights, etc. [sent-176, score-0.922]

51 In our work, the set of candidate translations is fixed (the 14 English sentences for each source sentence), and therefore iterating the procedure is not applicable. [sent-177, score-0.484]
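
Because the candidate pool is fixed and never regenerated, tuning reduces to a single search for a weight vector whose selections score best on the small reference subset. The sketch below uses random search and a difflib similarity in place of MERT's line search and BLEU; the data layout and function names are assumptions for illustration only.

```python
import difflib
import random

def quality(hyp, ref):
    """Crude stand-in for a sentence-level metric such as BLEU."""
    return difflib.SequenceMatcher(None, hyp.split(), ref.split()).ratio()

def tune_weights(tuning_data, n_feats, trials=500, seed=0):
    """tuning_data: list of (feature_vectors, candidates, reference) triples,
    one per source sentence in the ~10% reference subset.  Returns the random
    weight vector whose selected translations score best -- a single-pass
    substitute for MERT, since the candidate pool is never regenerated."""
    rng = random.Random(seed)
    best_w, best_total = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(-1.0, 1.0) for _ in range(n_feats)]
        total = 0.0
        for feat_vecs, cands, ref in tuning_data:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feat_vecs]
            chosen = cands[scores.index(max(scores))]
            total += quality(chosen, ref)
        if total > best_total:
            best_w, best_total = w, total
    return best_w
```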

52 3 The Worker Calibration Feature Since we use a small portion of the reference translations to perform weight tuning, we can also use that data to compute another worker-specific feature. [sent-180, score-0.523]

53 Namely, we can evaluate the competency of each worker by scoring their translations against the reference translations. [sent-181, score-0.69]
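
A minimal sketch of such a calibration score, under an assumed data layout: average each worker's translations against the references available for the tuning segments, and attach the per-worker average as a feature of every translation that worker produced. The difflib similarity again stands in for a real metric.

```python
import difflib
from collections import defaultdict

def similarity(hyp, ref):
    # stand-in for scoring a translation against a professional reference
    return difflib.SequenceMatcher(None, hyp.split(), ref.split()).ratio()

def worker_calibration(translations, references):
    """translations: iterable of (worker_id, segment_id, text);
    references: dict segment_id -> professional reference (the ~10% subset).
    Returns worker_id -> average score on the calibration segments."""
    per_worker = defaultdict(list)
    for worker, seg, text in translations:
        if seg in references:
            per_worker[worker].append(similarity(text, references[seg]))
    return {w: sum(s) / len(s) for w, s in per_worker.items()}

# The returned value can then be used as the "worker calibration" feature of
# every translation produced by that worker (with some default value for
# workers who translated none of the calibration segments).
```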

54 The intuition is that workers known to produce good translations are likely to continue to produce good translations, and the opposite is likely true as well. [sent-183, score-0.558]

55 4 Evaluation Strategy To measure the quality of the translations, we make use of the existing professional translations. [sent-185, score-0.439]

56 Since we have four professional translation sets, we can calculate the BLEU score (Papineni et al. [sent-186, score-0.685]

57 , 2002) for one professional translator P1 using the other three P2,3,4 as a reference set. [sent-187, score-0.503]

58 We repeat the process four times, scoring each professional translator against the others, to calculate the expected range of professional quality translation. [sent-188, score-0.908]

59 We can see how a translation set T (chosen by our model) compares to this range by calculating T’s BLEU scores against the same four sets of three reference translations. [sent-189, score-0.39]
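
The leave-one-out computation behind this range can be sketched as follows; the corpus-level scorer here is a crude multi-reference similarity standing in for BLEU, and the data layout (four aligned lists of professional translations) is an assumption for illustration.

```python
import difflib
from statistics import mean

def seg_score(hyp, refs):
    """Best similarity of a hypothesis against any of its references."""
    return max(difflib.SequenceMatcher(None, hyp.split(), r.split()).ratio()
               for r in refs)

def corpus_score(hyps, multi_refs):
    """hyps: list of strings; multi_refs: per-segment tuples of references."""
    return mean(seg_score(h, rs) for h, rs in zip(hyps, multi_refs))

def professional_range(reference_sets):
    """reference_sets: four aligned lists of professional translations.
    Score each professional set against the other three, which gives the
    expected range of professional-quality scores."""
    scores = []
    for i, held_out in enumerate(reference_sets):
        others = [s for j, s in enumerate(reference_sets) if j != i]
        scores.append(corpus_score(held_out, list(zip(*others))))
    return min(scores), max(scores)

def score_selection(selected, reference_sets):
    """Score a selected Turker translation set against the same four
    three-reference sets, for comparison with the professional range."""
    return [corpus_score(selected,
                         list(zip(*[s for j, s in enumerate(reference_sets)
                                    if j != i])))
            for i in range(len(reference_sets))]
```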

60 We also evaluate Turker translation quality by using them as reference sets to score various submissions to the NIST MT evaluation. [sent-191, score-0.416]

61 Specifically, we measure the correlation (using Pearson’s r) between BLEU scores of MT systems measured against nonprofessional translations, and BLEU scores measured against professional translations. [sent-192, score-0.53]
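
Pearson's r over the two lists of system-level scores is straightforward to compute; the numbers in the usage example below are placeholders, not results from the paper.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two aligned lists of system scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder scores for a handful of MT submissions, measured against
# non-professional and professional reference sets respectively.
bleu_vs_turkers = [21.4, 18.7, 25.1, 19.9, 23.0]
bleu_vs_professionals = [22.0, 19.1, 26.3, 20.2, 23.8]
print(pearson_r(bleu_vs_turkers, bleu_vs_professionals))
```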

62 2 5 Experimental Results We establish the performance of professional translators, calculate oracle upper bounds on Turker translation quality, and carry out a set of experiments that demonstrate the effectiveness of our model and that determine which features are most helpful. [sent-195, score-0.686]

63 For the worker calibration feature, we utilize the references for 10% of the data (which is within the 80% portion). [sent-201, score-0.339]

64 13 on average, which highlights the loss in quality when collecting translations from amateurs. [sent-207, score-0.548]

65 We perform two oracle experiments to determine if there exist high-quality Turker translations in the first place. [sent-212, score-0.478]

66 The first oracle operates on the segment level: for each source segment, choose from the four translations the one that scores highest against the reference sentence. [sent-213, score-0.73]

67 The second oracle operates on the worker level: for each source segment, choose from the four translations the one provided by the worker whose translations (over all sentences) score the highest. [sent-214, score-1.4]
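
The two oracles can be sketched as follows, under an assumed data layout of (worker, translation) pairs per segment; the similarity function again stands in for scoring a candidate against the reference.

```python
import difflib
from collections import defaultdict

def sim(hyp, ref):
    # stand-in for scoring a candidate translation against the reference
    return difflib.SequenceMatcher(None, hyp.split(), ref.split()).ratio()

def segment_oracle(candidates_per_seg, refs):
    """Segment-level oracle: per segment, keep the candidate translation
    that scores highest against that segment's reference."""
    return [max(cands, key=lambda wt: sim(wt[1], ref))[1]
            for cands, ref in zip(candidates_per_seg, refs)]

def worker_oracle(candidates_per_seg, refs):
    """Worker-level oracle: rate each worker by average score over all of
    their translations, then per segment keep the translation from the
    best-rated worker available for that segment."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cands, ref in zip(candidates_per_seg, refs):
        for worker, text in cands:
            totals[worker] += sim(text, ref)
            counts[worker] += 1
    avg = {w: totals[w] / counts[w] for w in totals}
    return [max(cands, key=lambda wt: avg.get(wt[0], 0.0))[1]
            for cands in candidates_per_seg]
```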

68 The second method selects the translation that received the best average rank, using the rank labels assigned by other Turkers (see 3. [sent-221, score-0.39]

69 Figure 3: BLEU scores for different selection methods, measured against the reference sets. [sent-243, score-0.427]

70 The five right-most bars are colored in orange to indicate selection over a set that includes both original translations as well as edited versions of them. [sent-245, score-0.574]

71 06, which is within the range of scores for the professional translators. [sent-251, score-0.405]

72 The results, in Table 1, tell a fairly similar story as evaluating with BLEU: references and oracles naturally perform very well, and the loss in quality when selecting arbitrary Turker translations is largely eliminated using our selection strategy. [sent-254, score-0.573]

73 3 6 Analysis The oracles indicate that there is usually an acceptable translation from the Turkers for any given sentence. [sent-257, score-0.296]

74 Since the oracles select from a small group of only 4 translations per source segment, they are not overly optimistic, and rather reflect the true potential of the collected translations. [sent-258, score-0.636]

75 4) is quite low considering the amount of collected data, it would be more attractive if the cost could be reduced further without losing much in translation quality. [sent-273, score-0.391]

76 To that end, we investigated lowering cost along two dimensions: eliminating the need for professional translations, and decreasing the amount of edited translations. [sent-274, score-0.597]

77 The professional translations are used in our approach for computing the worker calibration feature (subsection 4. [sent-278, score-1.188]

78 We use a relatively small amount for this purpose, but we investigate a different setup whereby no professional translations are used at all. [sent-280, score-0.818]

79 This eliminates the worker calibration feature, but, perhaps more critically, the feature weights must be set in a different fashion, since we cannot optimize BLEU on reference data anymore. [sent-281, score-0.446]

80 3) as a proxy for BLEU, and set the weights so that better ranked translations receive higher scores. [sent-283, score-0.447]

81 Completely eliminating the edited translations has an adverse effect, as expected (Figure 4). [sent-291, score-0.585]

82 Another option, rather than eliminating the editing phase altogether, would be to consider the edited translations of only the translation receiving the best rank labels. [sent-292, score-1.085]

83 This would reflect a data collection process whereby the editing task is delayed until after the rank labels are collected, with the rank labels used to determine which translations are most promising to post-edit (in addition to using the rank labels for the ranking features). [sent-297, score-1.041]

84 Using this approach enables us to greatly reduce the number of edited translations collected, while maintaining good performance, obtaining a BLEU score of 38. [sent-298, score-0.58]

85 It is therefore our recommendation that crowdsourced translation efforts adhere to the following pipeline: collect multiple translations for each source sentence, collect rank labels for the translations, and finally collect edited versions of the top ranked translations. [sent-300, score-1.167]
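
A compact sketch of that recommended ordering, under an assumed data layout: only the translation with the best (lowest) average rank per source sentence is forwarded to the post-editing stage.

```python
def pick_for_editing(candidates, rank_labels):
    """candidates: dict segment_id -> list of translation ids;
    rank_labels: dict (segment_id, translation_id) -> list of ranks collected
    from Turkers (1 = best).  Returns the single translation per segment that
    should be sent on for post-editing."""
    chosen = {}
    for seg, trans_ids in candidates.items():
        def avg_rank(tid):
            ranks = rank_labels.get((seg, tid), [])
            return sum(ranks) / len(ranks) if ranks else float("inf")
        chosen[seg] = min(trans_ids, key=avg_rank)
    return chosen
```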

86 One such method was to collect reference translations to score MT output. [sent-311, score-0.606]

87 It was only a pilot study (50 sentences in each of several languages), but it showed the possibility of obtaining high-quality translations from non-professionals. [sent-312, score-0.447]

88 As a follow-up, Bloodgood and Callison-Burch (2010) solicited a single translation of the NIST Urdu-to-English dataset we used. [sent-313, score-0.306]

89 Their evaluation was similar to our correlation experiments, examining how well the collected translations agreed with the professional translations when evaluating three MT systems. [sent-314, score-1.364]

90 Two relevant papers from that workshop were by Ambati and Vogel (2010), focusing on the design of the translation HIT, and by Irvine and Klementiev (2010), who created translation lexicons between English and 42 rare languages. [sent-316, score-0.507]

91 (2010) explore a very interesting way of creating translations on MTurk, relying only on monolingual speakers. [sent-318, score-0.489]

92 Speakers of the target language iteratively identified problems in machine translation output, and speakers of the source language paraphrased the corresponding source portion. [sent-319, score-0.312]

93 8 Conclusion and Future Work We have demonstrated that it is possible to obtain high-quality translations from non-professional translators, and that the cost is an order of magnitude cheaper than professional translation. [sent-321, score-0.906]

94 We believe that crowdsourcing can play a pivotal role in future efforts to create parallel translation datasets. [sent-322, score-0.374]

95 Beyond the cost and scalability, crowdsourcing provides access to languages that currently fall outside the scope of statistical machine translation research. [sent-323, score-0.422]

96 We have begun an ongoing effort to collect translations for several low resource languages, including Tamil, Yoruba, and dialectal Arabic. [sent-324, score-0.496]

97 We would like to thank Ben Bederson, Philip Resnik, and Alain Désilets for organizing workshops focused on crowdsourcing translation (Bederson and Resnik, 2010; Désilets, 2010). [sent-333, score-0.334]

98 Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. [sent-360, score-0.306]

99 A study of translation edit rate with targeted human annotation. [sent-451, score-0.322]

100 Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. [sent-473, score-0.306]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translations', 0.447), ('professional', 0.371), ('translation', 0.238), ('urdu', 0.236), ('turker', 0.202), ('calibration', 0.172), ('worker', 0.167), ('turkers', 0.167), ('translators', 0.162), ('bleu', 0.162), ('mechanical', 0.127), ('workers', 0.111), ('mturk', 0.109), ('rank', 0.102), ('amazon', 0.102), ('turk', 0.101), ('edited', 0.099), ('crowdsourcing', 0.096), ('hits', 0.091), ('editing', 0.09), ('cost', 0.088), ('hiring', 0.077), ('reference', 0.076), ('mt', 0.071), ('nist', 0.07), ('quality', 0.068), ('solicited', 0.068), ('collected', 0.065), ('resnik', 0.063), ('oracles', 0.058), ('bederson', 0.057), ('nonprofessional', 0.057), ('translator', 0.056), ('edit', 0.053), ('soliciting', 0.051), ('labels', 0.05), ('collect', 0.049), ('ranking', 0.048), ('crowdsourced', 0.047), ('professionals', 0.047), ('features', 0.046), ('bad', 0.045), ('discriminate', 0.045), ('four', 0.042), ('creating', 0.042), ('parallel', 0.04), ('hit', 0.039), ('eliminating', 0.039), ('pass', 0.038), ('calibrate', 0.038), ('dawid', 0.038), ('niessen', 0.038), ('probst', 0.038), ('redundantly', 0.038), ('residence', 0.038), ('schafer', 0.038), ('whitehill', 0.038), ('irvine', 0.038), ('source', 0.037), ('ldc', 0.037), ('native', 0.036), ('segment', 0.035), ('correlation', 0.034), ('score', 0.034), ('scores', 0.034), ('esilets', 0.034), ('payment', 0.034), ('recreate', 0.034), ('aidan', 0.034), ('pakistan', 0.034), ('joshua', 0.033), ('collecting', 0.033), ('chris', 0.033), ('reward', 0.032), ('redundant', 0.032), ('rate', 0.031), ('oracle', 0.031), ('lexicons', 0.031), ('disfluent', 0.031), ('ambati', 0.031), ('bloodgood', 0.031), ('alain', 0.031), ('oard', 0.031), ('solicit', 0.031), ('unedited', 0.031), ('feature', 0.031), ('omar', 0.031), ('select', 0.029), ('barak', 0.029), ('english', 0.029), ('philip', 0.029), ('operates', 0.028), ('perplexity', 0.028), ('bars', 0.028), ('pay', 0.028), ('germann', 0.028), ('ulrich', 0.028), ('smt', 0.028), ('aggregate', 0.027), ('bilingual', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

2 0.24575409 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

3 0.21595559 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

4 0.19121476 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

5 0.18189716 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

Author: Yanjun Ma ; Yifan He ; Andy Way ; Josef van Genabith

Abstract: We present a discriminative learning method to improve the consistency of translations in phrase-based Statistical Machine Translation (SMT) systems. Our method is inspired by Translation Memory (TM) systems which are widely used by human translators in industrial settings. We constrain the translation of an input sentence using the most similar ‘translation example’ retrieved from the TM. Differently from previous research which used simple fuzzy match thresholds, these constraints are imposed using discriminative learning to optimise the translation performance. We observe that using this method can benefit the SMT system by not only producing consistent translations, but also improved translation outputs. We report a 0.9 point improvement in terms of BLEU score on English–Chinese technical documents.

6 0.17880622 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

7 0.17557167 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

8 0.17014155 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

9 0.15988255 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

10 0.14692508 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

11 0.13453983 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

12 0.1315054 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

13 0.13138732 313 acl-2011-Two Easy Improvements to Lexical Weighting

14 0.13113862 264 acl-2011-Reordering Metrics for MT

15 0.12000754 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

16 0.11347043 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

17 0.11276019 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions

18 0.10923114 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

19 0.10913432 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

20 0.10795546 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.247), (1, -0.145), (2, 0.114), (3, 0.192), (4, 0.027), (5, 0.039), (6, 0.117), (7, -0.028), (8, 0.079), (9, -0.057), (10, -0.062), (11, -0.151), (12, 0.019), (13, -0.174), (14, 0.03), (15, 0.042), (16, -0.045), (17, -0.04), (18, 0.02), (19, -0.079), (20, 0.035), (21, 0.031), (22, 0.082), (23, -0.003), (24, -0.067), (25, 0.024), (26, -0.069), (27, 0.009), (28, 0.043), (29, -0.022), (30, -0.041), (31, -0.078), (32, -0.043), (33, 0.079), (34, 0.031), (35, 0.15), (36, 0.095), (37, 0.029), (38, -0.017), (39, -0.113), (40, 0.089), (41, 0.007), (42, 0.159), (43, 0.028), (44, -0.016), (45, 0.022), (46, 0.023), (47, 0.004), (48, 0.026), (49, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95845652 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

2 0.82236075 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

3 0.74857169 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

Author: Yanjun Ma ; Yifan He ; Andy Way ; Josef van Genabith

Abstract: We present a discriminative learning method to improve the consistency of translations in phrase-based Statistical Machine Translation (SMT) systems. Our method is inspired by Translation Memory (TM) systems which are widely used by human translators in industrial settings. We constrain the translation of an input sentence using the most similar ‘translation example’ retrieved from the TM. Differently from previous research which used simple fuzzy match thresholds, these constraints are imposed using discriminative learning to optimise the translation performance. We observe that using this method can benefit the SMT system by not only producing consistent translations, but also improved translation outputs. We report a 0.9 point improvement in terms of BLEU score on English–Chinese technical documents.

4 0.73621029 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

Author: Chi-kiu Lo ; Dekai Wu

Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacyjudgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1

5 0.72001475 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Author: Jonathan H. Clark ; Chris Dyer ; Alon Lavie ; Noah A. Smith

Abstract: In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.

6 0.68885851 313 acl-2011-Two Easy Improvements to Lexical Weighting

7 0.67162913 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

8 0.66731983 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

9 0.66237956 264 acl-2011-Reordering Metrics for MT

10 0.6544351 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

11 0.6542809 151 acl-2011-Hindi to Punjabi Machine Translation System

12 0.65421289 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

13 0.64555347 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

14 0.64504564 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

15 0.61976379 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages

16 0.61730194 220 acl-2011-Minimum Bayes-risk System Combination

17 0.61065716 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

18 0.59887141 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

19 0.59218484 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

20 0.5691517 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.024), (17, 0.049), (26, 0.063), (31, 0.02), (37, 0.071), (39, 0.04), (41, 0.05), (45, 0.184), (55, 0.026), (59, 0.039), (72, 0.096), (75, 0.03), (91, 0.036), (96, 0.181), (97, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90456122 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

Author: Roger Levy

Abstract: A system making optimal use of available information in incremental language comprehension might be expected to use linguistic knowledge together with current input to revise beliefs about previous input. Under some circumstances, such an error-correction capability might induce comprehenders to adopt grammatical analyses that are inconsistent with the true input. Here we present a formal model of how such input-unfaithful garden paths may be adopted and the difficulty incurred by their subsequent disconfirmation, combining a rational noisy-channel model of syntactic comprehension under uncertain input with the surprisal theory of incremental processing difficulty. We also present a behavioral experiment confirming the key empirical predictions of the theory.

same-paper 2 0.84607601 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

3 0.83502495 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

Author: Asli Celikyilmaz ; Dilek Hakkani-Tur

Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.

4 0.82193929 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

Author: Thomas Mueller ; Hinrich Schuetze

Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.

5 0.77701181 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

Author: Ryo Nagata ; Edward Whittaker ; Vera Sheinman

Abstract: The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallowparsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POStagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.

6 0.77193028 252 acl-2011-Prototyping virtual instructors from human-human corpora

7 0.76700968 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

8 0.7625944 261 acl-2011-Recognizing Named Entities in Tweets

9 0.75741756 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

10 0.7551651 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

11 0.75330889 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

12 0.7521106 299 acl-2011-The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

13 0.75157309 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

14 0.75075912 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

15 0.74973476 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

16 0.74932575 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

17 0.7490868 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

18 0.74768567 141 acl-2011-Gappy Phrasal Alignment By Agreement

19 0.74704802 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

20 0.74697757 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation