emnlp emnlp2011 emnlp2011-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. [sent-4, score-0.411]
2 It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. [sent-5, score-0.769]
3 However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. [sent-6, score-0.333]
4 In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. [sent-7, score-0.711]
5 It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems. [sent-10, score-0.577]
6 1 Introduction The dominant framework of machine translation (MT) today is statistical machine translation (SMT) (Hutchins, 2007). [sent-11, score-0.398]
7 In SMT, the parameter space is explored by a tuning algorithm, typically MERT (Minimum Error Rate Training) (Och, 2003), though the exact method is not important for our purpose. [sent-14, score-0.182]
8 The tuning algorithm carries out repeated experiments with different decoder parameter values over a development data set, for which reference translations are given. [sent-15, score-0.391]
9 An automatic MT evaluation metric compares the output of the decoder against the reference(s), and guides the tuning algorithm towards iteratively better decoder parameters and output translations. [sent-16, score-0.475]
10 The quality of the automatic MT evaluation metric therefore has an immediate effect on the whole system. [sent-17, score-0.201]
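The metric-driven tuning loop described above can be sketched in a few lines. The following toy stand-in is not Z-MERT or Joshua's actual optimizer; the random search replaces MERT's exact line search, and the function and argument names are illustrative assumptions. It only makes the point concrete: whichever metric scores the n-best candidates is the quantity the search maximizes, so the metric directly shapes the learned decoder weights.

```python
# Toy stand-in for a metric-driven tuning loop; not Z-MERT's line search.
import random

def corpus_score_of_selection(nbest_feats, cand_metric_scores, weights):
    """Pick the highest model-scoring candidate per sentence and return the
    average metric score of the selected candidates."""
    total = 0.0
    for feats, scores in zip(nbest_feats, cand_metric_scores):
        best = max(range(len(feats)),
                   key=lambda k: sum(w * f for w, f in zip(weights, feats[k])))
        total += scores[best]
    return total / len(nbest_feats)

def toy_tune(nbest_feats, cand_metric_scores, dim, trials=1000, seed=0):
    """Random search over weight vectors, keeping the one whose 1-best
    selections maximize the chosen evaluation metric (BLEU, TER, TESLA, ...)."""
    rng = random.Random(seed)
    best_w, best_s = [0.0] * dim, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        s = corpus_score_of_selection(nbest_feats, cand_metric_scores, w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w
```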
11 The first automatic MT evaluation metric to show a high correlation with human judgment is BLEU (Papineni et al. [sent-18, score-0.398]
12 Together with its close variant the NIST metric, they have quickly become the standard way of tuning statistical machine translation systems. [sent-20, score-0.381]
13 While BLEU is an impressively simple and effective metric, recent evaluations have shown that many new generation metrics can outperform BLEU in terms of correlation with human judgment (Callison-Burch et al. [sent-21, score-0.434]
14 Some of these new metrics include METEOR (Banerjee and Lavie, 2005; Lavie and Agarwal, 2007), TER (Snover et al. [sent-24, score-0.134]
15 Given the close relationship between automatic MT and automatic MT evaluation, the logical expectation is that a better MT evaluation metric would lead to a better MT system. [sent-27, score-0.244]
16 In the SMT community, MT tuning still uses BLEU almost exclusively. [sent-31, score-0.182]
17 Some researchers have investigated the use of better metrics for MT tuning, with mixed results. [sent-32, score-0.134]
18 (2009) reported improved human judgment using their entailment-based metric. [sent-34, score-0.127]
19 However, the metric is heavyweight and slow in practice, with an estimated runtime of 40 days on the NIST MT 2002/2006/2008 dataset, and the authors had to resort to a two-phase MERT process with a reduced n-best list. [sent-35, score-0.189]
20 (2010) compared tuning a phrase-based SMT system with BLEU, NIST, METEOR, and TER, and concluded that BLEU and NIST are still the best choices for MT tuning, despite the proven higher correlation of METEOR and TER with human judgment. [sent-38, score-0.291]
21 Our empirical study is carried out in the context of WMT 2010, for the French-English, Spanish-English, and German-English machine translation tasks. [sent-41, score-0.199]
22 We show that Joshua responds well to the change of evaluation metric, in that a system trained on metric M typically does well when judged by the same metric M. [sent-42, score-0.354]
23 We further evaluate the different systems with manual judgments and show that the TESLA family of metrics (both TESLA-M and TESLA-F) significantly outperforms BLEU when used to guide the MERT search. [sent-43, score-0.208]
24 In Section 2, we describe the four evaluation metrics used. [sent-45, score-0.134]
25 Section 3 outlines our experimental setup using the WMT 2010 machine translation tasks. [sent-46, score-0.199]
26 2 Evaluation metrics This section describes the metrics used in our experiments. [sent-49, score-0.268]
27 Given a reference text R and a translation candidate T, we generate the bag of all n-grams contained in R and T for n = 1, 2, 3, 4, and denote them as BNG_n(R) and BNG_n(T) respectively. [sent-54, score-0.263]
28 Its use of the brevity penalty is however questionable, as subsequent research on n-gram-based metrics has consistently found that recall is in fact a more potent indicator than precision (Banerjee and Lavie, 2005; Zhou et al. [sent-57, score-0.251]
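For concreteness, a compact sketch of BLEU-style scoring over the n-gram bags described above follows: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by the brevity penalty. This is a simplified single-reference version without smoothing, not the official BLEU/NIST scorer.

```python
import math
from collections import Counter

def ngram_bag(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_sketch(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_bag, cand_bag = ngram_bag(ref, n), ngram_bag(cand, n)
        # Clipped precision: each candidate n-gram is credited at most as
        # many times as it occurs in the reference.
        clipped = sum(min(count, ref_bag[g]) for g, count in cand_bag.items())
        log_prec_sum += math.log(max(clipped, 1e-9) / max(sum(cand_bag.values()), 1))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```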
29 The metric is defined as the minimum number of edits needed to change a candidate translation T to the reference R, normalized by the length of the reference, i.e., the number of edits divided by |R|. [sent-62, score-0.4]
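A rough sketch of that quantity using a plain word-level Levenshtein distance is given below; real TER additionally allows block shifts of phrases (as noted later in the discussion), which this simplification ignores.

```python
def ter_sketch(reference, candidate):
    """Approximate TER as word-level edit distance divided by reference length;
    the shift operation of true TER is not modeled here."""
    r, c = reference.split(), candidate.split()
    dist = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dist[i][0] = i                      # delete all remaining reference words
    for j in range(len(c) + 1):
        dist[0][j] = j                      # insert all candidate words
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,       # deletion
                             dist[i][j - 1] + 1,       # insertion
                             dist[i - 1][j - 1] + sub) # substitution / match
    return dist[len(r)][len(c)] / max(len(r), 1)
```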
30 TER is a strong contender as the leading new generation automatic metric and has been used in major evaluation campaigns such as GALE. [sent-66, score-0.233]
31 3 TESLA-M TESLA1 is a family of linear programming-based metrics proposed by Liu et al. [sent-70, score-0.171]
32 First, the metric emphasizes the content words by discounting the weight of an n-gram by 0. [sent-76, score-0.158]
33 The goal of the linear programming problem is to assign weights to the links between the two BNGs, so as to maximize the sum of the products of the link weights and their corresponding similarity scores. [sent-81, score-0.127]
34 Once the solution is found, let the maximized objective function value be S; the precision is then computed as S over the sum of weights of the translation candidate n-grams. [sent-92, score-0.205]
35 Similarly, the recall is S over the sum of weights of the reference n-grams. [sent-93, score-0.129]
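The matching problem can be written as a small transportation-style linear program. The sketch below is an assumed reimplementation (not the authors' released code, and it assumes scipy is available); the similarity function is a user-supplied stand-in, and the function-word discounting and the combination across n-gram orders are left out or simplified.

```python
import numpy as np
from scipy.optimize import linprog

def lp_match_f1(ref_ngrams, cand_ngrams, similarity):
    """ref_ngrams / cand_ngrams: lists of (ngram, weight) pairs.
    similarity(a, b) returns a score in [0, 1] for two n-grams."""
    R, C = len(ref_ngrams), len(cand_ngrams)
    sim = np.array([[similarity(r, c) for c, _ in cand_ngrams]
                    for r, _ in ref_ngrams])
    obj = -sim.flatten()          # maximize total similarity = minimize its negation
    # The total link weight attached to each n-gram must not exceed its weight.
    A_ub = np.zeros((R + C, R * C))
    b_ub = np.zeros(R + C)
    for i, (_, w) in enumerate(ref_ngrams):
        A_ub[i, i * C:(i + 1) * C] = 1.0
        b_ub[i] = w
    for j, (_, w) in enumerate(cand_ngrams):
        A_ub[R + j, j::C] = 1.0
        b_ub[R + j] = w
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    S = -res.fun                  # maximized objective value
    precision = S / sum(w for _, w in cand_ngrams)
    recall = S / sum(w for _, w in ref_ngrams)
    return 2 * precision * recall / (precision + recall + 1e-9)
```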
36 TESLA-M gains an edge over the previous two metrics by the use of lightweight linguistic features such as lemmas, synonym dictionaries, and POS tags. [sent-97, score-0.225]
Table 1: Selected system-level Spearman’s rho correlation with the human judgment for the into-English task, as reported in WMT 2010. [sent-102, score-0.288]
Table 2: Selected system-level Spearman’s rho correlation with the human judgment for the out-of-English task, as reported in WMT 2010. [sent-108, score-0.288]
39 According to the system-level correlation with human judgments (Tables 1 and 2), it ranks top for the out-of-English task and very close to the top for the into-English task (Callison-Burch et al. [sent-112, score-0.146]
40 Footnote 3: TESLA-F refers to the metric called TESLA in (Liu et al. [sent-121, score-0.158]
41 To minimize confusion, in this work we call the metric TESLA-F and refer to the whole family of metrics as TESLA. [sent-123, score-0.329]
42 Let R and T be the reference and the translation candidate respectively, both in English. [sent-139, score-0.21]
43 and Hello , can be found in the English-French phrase table, and the proper name Querrien is out-of-vocabulary, then a likely segmentation is: R: || | Good morning , sir . [sent-144, score-0.187]
44 Each English phrase is then mapped to a bag of weighted French phrases using the phrase table, transforming the English sentences into confusion networks resembling Figures 2 and 3. [sent-146, score-0.18]
45 French n-grams are extracted from these confusion network representations, known as pivot language n-grams. [sent-147, score-0.143]
46 The bag of pivot language n-grams generated by R is then matched against that generated by T with the same linear programming formulation used in TESLA-M. [sent-148, score-0.17]
47 Unlike BLEU and TESLA-M, which rely on simple averages (geometric and arithmetic average respectively) to combine the component scores, TESLA-F trains the weights over a set of human judgments using a linear ranking support vector machine (RSVM). [sent-150, score-0.164]
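A standard way to realize such a ranking SVM with off-the-shelf tools is the pairwise reduction sketched below (an illustration assuming scikit-learn, not the authors' training setup): every human preference pair contributes a difference vector of component scores, and a linear SVM trained on these differences yields the combination weights.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_combination_weights(preference_pairs):
    """preference_pairs: list of (better_scores, worse_scores), each a 1-D
    numpy array of component metric scores for two competing translations."""
    X, y = [], []
    for better, worse in preference_pairs:
        X.append(better - worse); y.append(+1)   # preferred minus dispreferred
        X.append(worse - better); y.append(-1)   # mirrored negative example
    clf = LinearSVC(C=1.0, fit_intercept=False)
    clf.fit(np.array(X), np.array(y))
    return clf.coef_.ravel()                     # weights of the linear combination
```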
48 However, the added complexity, in particular the use of the language model score and the tuning of the component weights, appears to make it less stable than TESLA-M in practice. [sent-154, score-0.229]
49 3 Experimental setup We run our experiments in the setting of the WMT 2010 news commentary machine translation campaign, for three language pairs: 1. [sent-156, score-0.199]
50 Then, we create suffix arrays and extract translation grammars for the development and test set with Joshua in its default setting. [sent-174, score-0.158]
51 Parameter tuning is carried out using ZMERT (Zaidan, 2009). [sent-177, score-0.182]
52 The best score according to each metric is shown in bold. [sent-184, score-0.158]
53 We note that Joshua generally responds well to the change of tuning metric. [sent-186, score-0.22]
54 The order of the translation candidates is randomized so that the judges will not see any patterns. [sent-247, score-0.22]
55 Table 6: Percentage of times each system produces the best translation. Let P(E) be the proportion of times that they would agree by chance. [sent-263, score-0.198]
56 Kappa values between 0.4 and 0.6 are considered moderate, and our values are in line with those reported in the WMT 2010 translation campaign. [sent-270, score-0.158]
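For reference, the agreement statistic discussed here is kappa = (P(A) - P(E)) / (1 - P(E)), where P(E) is the chance-agreement term above. The helper below computes a Cohen-style kappa for two judges; the WMT campaigns use a closely related formulation, so treat this as an illustration rather than the official evaluation script.

```python
from collections import Counter

def cohen_kappa(judge_a, judge_b):
    """judge_a, judge_b: parallel lists of categorical judgments."""
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: probability both judges pick the same label independently.
    p_chance = sum((counts_a[l] / n) * (counts_b[l] / n)
                   for l in set(counts_a) | set(counts_b))
    return (p_agree - p_chance) / (1.0 - p_chance)
```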
57 Table 6 shows the proportion of times each system produces the best translation among the four. [sent-273, score-0.198]
58 Note that the values in each column do not add up to 100%, since the candidate translations are often identical, and even a different translation can receive the same human judgment. [sent-275, score-0.308]
59 All differences are statistically significant at p < 0.01, with the exception of TESLA-M vs TESLA-F in the French-English task, BLEU vs TER in the Spanish-English task, and TESLA-M vs TESLA-F and BLEU vs TER in the German-English task. [sent-279, score-0.4]
60 The results provide strong evidence that tuning machine [sent-280, score-0.223]
61 the strikeout BLEU vs TER, and TESLA-M vs TESLA-F. [sent-309, score-0.279]
62 Each cell shows the proportion of time the system tuned on A is preferred over the system tuned on B, and the proportion of time the opposite happens. [sent-311, score-0.143]
63 translation systems using the TESLA metrics leads to significantly better translation output. [sent-313, score-0.45]
64 5 Discussion We examined the results manually, and found the relationship between the types of mistakes each system makes and the characteristics of the corresponding metric to be intricate. [sent-314, score-0.158]
65 As a metric, TESLA-M is certainly much more similar to BLEU than TER is, yet they behave very differently when used as a tuning metric. [sent-321, score-0.182]
66 Comparing the translations from the two groups, the tendency of BLEU and TER to pick shorter paraphrases and to drop function words is unmistakable, often to the detriment of the translation quality. [sent-328, score-0.33]
67 Interestingly, the human translations average only 22 words, so BLEU and TER translations are in fact much closer on average to the reference lengths, yet their translations often feel too short. [sent-330, score-0.424]
68 Interestingly, by placing much more emphasis on the recall, TESLA-M and TESLA-F produce translations that are statistically too long, but feel much more ‘correct’ lengthwise. [sent-334, score-0.14]
69 Another reason may be that since the metric does not care much about function words, the language model is given more freedom to pick function words as it sees fit, without the fear of large penalties. [sent-338, score-0.158]
70 Paradoxically, by reducing the weights of function words, we end up making better translations for them. [sent-339, score-0.158]
71 TER is the only metric that allows cheap block movements, regardless of size or distance. [sent-340, score-0.158]
72 Since extra resources, including bitexts, are needed to use TESLA-F, TESLA-M emerges as the MT evaluation metric of choice for tuning SMT systems. [sent-345, score-0.34]
73 6 Future work We have presented empirical evidence that the TESLA metrics outperform BLEU for MT tuning in a hierarchical phrase-based SMT system. [sent-346, score-0.393]
74 (Cer et al., 2010) investigated the effect of tuning a phrase-based SMT system and found that of the MT evaluation metrics that they tried, none of them could outperform BLEU. [sent-350, score-0.355]
75 We would like to verify whether TESLA tuning is still preferred over BLEU tuning in a phrase-based SMT system. [sent-351, score-0.364]
76 Based on our observations, it may be possible to improve the performance of BLEU-based tuning by (1) increasing the brevity penalty; (2) introducing a recall measure and emphasizing it over precision; and/or (3) introducing function word discounting. [sent-352, score-0.425]
Figure 4: Comparison of selected translations from the French-English task. [sent-408, score-0.141]
BLEU: in the future , americans want a phone that allow the user to .
TER: in the future , americans want a phone that allow the user to . [sent-355, score-0.193]
TESLA-M: in the future , the americans want a cell phone , which allow the user to . [sent-358, score-0.222]
TESLA-F: in the future , the americans want a phone that allow the user to . [sent-361, score-0.193]
also BLEU: and it is ... TER: and it is for interest on debt of the state . [sent-379, score-0.208]
BLEU: it is not certain that the state can act without money . [sent-392, score-0.173]
TER: it is not certain that the state can act without money . [sent-393, score-0.173]
TESLA-M: it is not certain that the state can act without this money . [sent-394, score-0.173]
TESLA-F: it is not certain that the state can act without this money . [sent-395, score-0.173]
TESLA-M: but at the expense of a greater debt of the state . [sent-402, score-0.259]
TESLA-F: but at the expense of a great debt of the state . [sent-405, score-0.259]
88 In the ideal case, such a modified BLEU metric would deliver results similar to that of TESLA-M, yet with a runtime cost closer to BLEU. [sent-409, score-0.189]
89 It would also make porting existing tuning code easier. [sent-410, score-0.215]
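A hedged sketch of such a modification follows: it keeps BLEU-style n-gram matching but also computes recall, combines precision and recall with a recall-heavy weighted harmonic mean, and down-weights n-grams made entirely of function words. The 0.1 discount, the alpha value, and the exact discounting rule are illustrative choices, not a specification from the paper.

```python
from collections import Counter

def weighted_ngram_bag(tokens, n, function_words, discount=0.1):
    bag = Counter()
    for i in range(len(tokens) - n + 1):
        g = tuple(tokens[i:i + n])
        # One possible discounting rule: down-weight n-grams of function words only.
        bag[g] += discount if all(t in function_words for t in g) else 1.0
    return bag

def recall_oriented_bleu(reference, candidate, function_words, alpha=0.8, max_n=4):
    ref, cand = reference.split(), candidate.split()
    score = 0.0
    for n in range(1, max_n + 1):
        ref_bag = weighted_ngram_bag(ref, n, function_words)
        cand_bag = weighted_ngram_bag(cand, n, function_words)
        match = sum(min(ref_bag[g], w) for g, w in cand_bag.items())
        prec = match / max(sum(cand_bag.values()), 1e-9)
        rec = match / max(sum(ref_bag.values()), 1e-9)
        if prec + rec > 0:
            # Weighted harmonic mean; alpha = 1 reduces to pure recall.
            score += prec * rec / ((1.0 - alpha) * rec + alpha * prec)
    return score / max_n
```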
90 7 Conclusion We demonstrate for the first time that a practical new generation MT evaluation metric can significantly improve the quality of automatic MT compared to BLEU, as measured by human judgment. [sent-411, score-0.272]
91 We hope this work will encourage the MT research community to finally move away from BLEU and to consider tuning their systems with a new generation metric. [sent-412, score-0.244]
92 METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. [sent-422, score-0.31]
93 Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. [sent-431, score-0.374]
94 The best lexical metric for phrase-based statistical MT system optimization. [sent-436, score-0.158]
95 MaxSim: A maximum similarity metric for machine translation evaluation. [sent-440, score-0.357]
96 METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. [sent-469, score-0.31]
97 Measuring machine translation quality as semantic equivalence: A metric based on entailment features. [sent-493, score-0.357]
98 A study of translation edit rate with targeted human annotation. [sent-501, score-0.228]
99 Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. [sent-509, score-0.199]
100 Re-evaluating machine translation results with paraphrase support. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. [sent-514, score-0.158]
wordName wordTfidf (topN-words)
[('bleu', 0.395), ('ter', 0.389), ('tesla', 0.226), ('wmt', 0.214), ('tuning', 0.182), ('metric', 0.158), ('translation', 0.158), ('debt', 0.152), ('mt', 0.143), ('metrics', 0.134), ('teslaf', 0.131), ('translations', 0.111), ('bng', 0.105), ('bonjour', 0.105), ('querrien', 0.105), ('vs', 0.1), ('meteor', 0.095), ('americans', 0.091), ('rho', 0.091), ('transfers', 0.091), ('smt', 0.09), ('judgment', 0.088), ('pivot', 0.084), ('sir', 0.082), ('strikeout', 0.079), ('tunbe', 0.079), ('joshua', 0.076), ('chan', 0.074), ('money', 0.071), ('morning', 0.071), ('phone', 0.071), ('correlation', 0.07), ('hello', 0.068), ('maxsim', 0.068), ('mert', 0.066), ('lavie', 0.064), ('french', 0.064), ('judges', 0.062), ('kappa', 0.062), ('cer', 0.061), ('tendency', 0.061), ('spearman', 0.059), ('confusion', 0.059), ('state', 0.056), ('hwee', 0.053), ('bag', 0.053), ('bleutuned', 0.053), ('monsieur', 0.053), ('salut', 0.053), ('teslam', 0.053), ('reference', 0.052), ('expense', 0.051), ('tou', 0.051), ('bp', 0.051), ('banerjee', 0.051), ('brevity', 0.05), ('weights', 0.047), ('decoder', 0.046), ('act', 0.046), ('cept', 0.045), ('omar', 0.044), ('automatic', 0.043), ('nist', 0.042), ('machine', 0.041), ('proportion', 0.04), ('human', 0.039), ('outperform', 0.039), ('monz', 0.038), ('responds', 0.038), ('seng', 0.038), ('hierarchical', 0.038), ('judgments', 0.037), ('family', 0.037), ('penalty', 0.037), ('christof', 0.036), ('matching', 0.035), ('preferred', 0.034), ('rr', 0.034), ('alon', 0.034), ('hardly', 0.034), ('synsets', 0.034), ('phrase', 0.034), ('code', 0.033), ('programming', 0.033), ('findings', 0.032), ('evaluations', 0.032), ('comp', 0.032), ('edits', 0.032), ('generation', 0.032), ('user', 0.031), ('edit', 0.031), ('lemmas', 0.031), ('bags', 0.031), ('rankings', 0.031), ('runtime', 0.031), ('recall', 0.03), ('hope', 0.03), ('cell', 0.029), ('yee', 0.029), ('produce', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
2 0.25373065 125 emnlp-2011-Statistical Machine Translation with Local Language Models
Author: Christof Monz
Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.
3 0.17145585 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
4 0.16577967 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao
Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
5 0.1560545 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
Author: Yang Gao ; Philipp Koehn ; Alexandra Birch
Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrasebased model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reodering.
6 0.14715932 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
7 0.13248159 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
9 0.1254736 138 emnlp-2011-Tuning as Ranking
10 0.1234337 136 emnlp-2011-Training a Parser for Machine Translation Reordering
11 0.11785474 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
12 0.11730335 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
13 0.10409559 38 emnlp-2011-Data-Driven Response Generation in Social Media
14 0.10402872 3 emnlp-2011-A Correction Model for Word Alignments
15 0.1006292 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
16 0.10020766 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
17 0.097261265 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
18 0.097240463 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
19 0.09434557 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
20 0.08614596 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
topicId topicWeight
[(0, 0.269), (1, 0.175), (2, 0.156), (3, -0.283), (4, 0.032), (5, -0.092), (6, 0.096), (7, 0.019), (8, -0.033), (9, -0.038), (10, 0.06), (11, -0.041), (12, -0.02), (13, -0.019), (14, -0.029), (15, 0.25), (16, -0.03), (17, 0.034), (18, 0.032), (19, -0.026), (20, -0.122), (21, 0.065), (22, 0.067), (23, 0.124), (24, 0.066), (25, 0.057), (26, -0.074), (27, 0.091), (28, -0.128), (29, -0.134), (30, -0.097), (31, 0.011), (32, 0.133), (33, 0.006), (34, -0.073), (35, 0.035), (36, 0.104), (37, -0.078), (38, -0.052), (39, -0.028), (40, 0.067), (41, 0.021), (42, 0.056), (43, 0.025), (44, 0.073), (45, -0.016), (46, -0.018), (47, -0.03), (48, 0.004), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.9773407 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
2 0.7891233 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
3 0.74963182 125 emnlp-2011-Statistical Machine Translation with Local Language Models
Author: Christof Monz
Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.
4 0.66735375 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao
Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
5 0.61921632 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk
Abstract: Minimum error rate training is a crucial component to many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, this algorithm finds a linear model that is globally optimal with respect to this set. We find that this algorithm is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that let us scale to reasonably large tuning sets (e.g., one thousand sentences), by either searching only promising regions of the parameter space, or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard Och algorithm.
7 0.55783182 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
8 0.53881246 138 emnlp-2011-Tuning as Ranking
9 0.51929265 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
10 0.51296937 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
11 0.50547564 38 emnlp-2011-Data-Driven Response Generation in Social Media
12 0.45942378 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
13 0.41556323 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
14 0.41149592 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
15 0.40992787 110 emnlp-2011-Ranking Human and Machine Summarization Systems
16 0.39380783 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
17 0.38680831 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
18 0.38301528 3 emnlp-2011-A Correction Model for Word Alignments
19 0.3648904 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
20 0.36150002 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
topicId topicWeight
[(15, 0.01), (23, 0.119), (28, 0.22), (36, 0.021), (37, 0.019), (45, 0.073), (53, 0.074), (54, 0.047), (57, 0.016), (62, 0.034), (64, 0.028), (65, 0.012), (66, 0.03), (69, 0.015), (79, 0.086), (82, 0.021), (87, 0.02), (90, 0.011), (94, 0.01), (96, 0.053), (98, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.79854929 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
2 0.63498229 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
Author: Yang Gao ; Philipp Koehn ; Alexandra Birch
Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrasebased model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reodering.
3 0.62287253 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
Author: Tom Kwiatkowski ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman
Abstract: We consider the problem of learning factored probabilistic CCG grammars for semantic parsing from data containing sentences paired with logical-form meaning representations. Traditional CCG lexicons list lexical items that pair words and phrases with syntactic and semantic content. Such lexicons can be inefficient when words appear repeatedly with closely related lexical content. In this paper, we introduce factored lexicons, which include both lexemes to model word meaning and templates to model systematic variation in word usage. We also present an algorithm for learning factored CCG lexicons, along with a probabilistic parse-selection model. Evaluations on benchmark datasets demonstrate that the approach learns highly accurate parsers, whose generalization performance greatly from the lexical factoring. benefits
4 0.62019265 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
5 0.61779183 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
Author: Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.
6 0.61569196 136 emnlp-2011-Training a Parser for Machine Translation Reordering
7 0.61158925 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
8 0.6109401 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
9 0.61062431 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
10 0.61029935 128 emnlp-2011-Structured Relation Discovery using Generative Models
11 0.61028773 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
12 0.60912609 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
13 0.60752678 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
14 0.60707074 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
15 0.60706198 38 emnlp-2011-Data-Driven Response Generation in Social Media
16 0.60619897 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
17 0.60510969 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
18 0.60175318 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
19 0.60169464 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
20 0.60163516 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation