acl acl2012 acl2012-158 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. [sent-5, score-0.258]
2 In principle, tuning on these metrics should yield better systems than tuning on BLEU. [sent-6, score-0.594]
3 This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. [sent-8, score-0.667]
4 PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time. [sent-12, score-0.504]
5 1 Introduction Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems. [sent-17, score-0.161]
6 These metrics fill two roles: to allow rapid (though sometimes inaccurate) comparisons between different systems or between different versions of the same system, and to perform tuning of parameter values during system training. [sent-19, score-0.291]
7 The latter has become important since the invention of minimum error rate training (MERT) (Och, 2003) and related tuning methods. [sent-20, score-0.264]
8 These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations. [sent-21, score-0.176]
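To make the tuning loop concrete, the following is a minimal Python sketch of metric-driven weight search over a fixed n-best list. The random-search strategy and the tune_weights/metric names are illustrative placeholders, not Och's (2003) line-search MERT:

```python
import random

def tune_weights(nbest, metric, n_features, iters=1000, seed=0):
    """Pick the weight vector that maximizes `metric` on a fixed dev n-best list.

    nbest:  one list of candidates per source sentence; each candidate is a
            (feature_vector, hypothesis_string) pair.
    metric: callable mapping the list of selected hypotheses to a score
            (e.g., corpus BLEU or PORT).
    """
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(iters):
        w = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        # "Decode": for each sentence, pick the candidate with the highest
        # linear model score under the current weights.
        hyps = [max(cands,
                    key=lambda c: sum(wi * fi for wi, fi in zip(w, c[0])))[1]
                for cands in nbest]
        score = metric(hyps)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```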
9 Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al.). [sent-33, score-0.22]
10 However, BLEU remains the de facto standard tuning metric, for two reasons. [sent-38, score-0.264]
11 First, there is no evidence that any other tuning metric yields better MT systems. [sent-39, score-0.371]
12 Cer et al. (2010) showed that BLEU tuning is more robust than tuning with other metrics (METEOR, TER, etc.). [sent-41, score-0.594]
13 Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly. [sent-43, score-0.529]
14 Liu et al. (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment. [sent-45, score-0.556]
15 In this work, our goal is to devise a metric that, like BLEU, is computationally cheap and language-independent, but that yields better MT systems than BLEU when used for tuning. [sent-51, score-0.107]
16 The final version, PORT, combines precision, recall, strict brevity penalty (Chiang et al. [sent-53, score-0.23]
17 , 2008) and strict redundancy penalty (Chen and Kuhn, 2011) in a quadratic mean expression. [sent-54, score-0.25]
18 This expression is then further combined with a new measure of word ordering, v, designed to reflect long-distance as well as short-distance word reordering (BLEU only reflects short-distance reordering). [sent-55, score-0.148]
19 Results given below show that PORT correlates better with human judgments of translation quality than BLEU does, and sometimes outperforms METEOR in this respect, based on data from WMT (2008-2010). [sent-58, score-0.131]
20 If there are multiple references, we use the closest reference length for each translation hypothesis to compute the number of reference n-grams. [sent-64, score-0.19]
21 SBP = min(1.0, e^(1 − len(R)/len(T))) (5) PORT has five components: precision, recall, strict brevity penalty (Chiang et al. [sent-68, score-0.23]
22 , 2008), strict redundancy penalty (Chen and Kuhn, 2011) and an ordering measure v. [sent-69, score-0.384]
23 1 Precision and Recall The average precision and average recall used in PORT (unlike those used in BLEU) are the arithmetic averages of the n-gram precisions and recalls: Pa(N) = (1/N) Σ_{n=1..N} p(n) (6) and Ra(N) = (1/N) Σ_{n=1..N} r(n) (7). We use two penalties to avoid too long or too short MT outputs. [sent-76, score-0.111]
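A minimal sketch of equations (6)-(7), assuming whitespace tokenization, a single reference, and clipped n-gram matching; the helper names ngrams and avg_precision_recall are hypothetical:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def avg_precision_recall(hyp, ref, max_n=4):
    """Arithmetic averages Pa(N) and Ra(N) of clipped n-gram precision/recall,
    as in equations (6)-(7); single reference, whitespace tokenization."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
        matches = sum((h & r).values())              # clipped n-gram matches
        precisions.append(matches / max(sum(h.values()), 1))
        recalls.append(matches / max(sum(r.values()), 1))
    return sum(precisions) / max_n, sum(recalls) / max_n

print(avg_precision_recall("I visited Paris recently", "Recently , I visited Paris"))
```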
24 The first, the strict brevity penalty (SBP), is proposed in (Chiang et al., 2008). [sent-77, score-0.23]
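Under one reading of the garbled equation (5), the strict brevity penalty might be sketched as below; this is an assumption based on the extraction, not a verified transcription of Chiang et al. (2008), and the strict redundancy penalty of Chen and Kuhn (2011) is not reconstructed here:

```python
import math

def strict_brevity_penalty(len_ref, len_hyp):
    """Penalize hypotheses T shorter than the reference R: the penalty is
    e^(1 - len(R)/len(T)) < 1 when len(T) < len(R), capped at 1 otherwise.
    (One reading of equation (5); an assumption, not a verified formula.)"""
    return min(1.0, math.exp(1.0 - len_ref / len_hyp))

print(strict_brevity_penalty(10, 8))   # short output -> penalty < 1
print(strict_brevity_penalty(10, 12))  # long output  -> 1.0 (no penalty)
```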
25 2 Ordering Measure Word ordering measures for MT compare two permutations of the original source-language word sequence: the permutation represented by the sequence of corresponding words in the MT output, and the permutation in the reference. [sent-84, score-0.257]
26 Several ordering measures have been integrated into MT evaluation metrics recently. [sent-85, score-0.231]
27 Birch and Osborne (2011) use either Hamming Distance or Kendall’s τ Distance (Kendall, 1938) in their metric LRscore, thus obtaining two versions of LRscore. [sent-86, score-0.107]
28 We use word alignment to compute the two permutations (LRscore also uses word alignment). [sent-90, score-0.183]
29 The ordering metric v is computed from two distance measures. [sent-102, score-0.282]
30 This metric is similar to Spearman’s ρ (Spearman, 1904). [sent-104, score-0.107]
31 For instance, ν1 is more tolerant than ρ of the movement of “recently” in this example: Ref: “Recently, I visited Paris”; Hyp: “I visited Paris recently”. The second measure, v2, is inspired by HMM word alignment (Vogel et al., 1996) and punishes large jump widths. [sent-106, score-0.18]
32 In the following, only two groups of words have moved, so the jump width punishment is light: Ref: “In the winter of 2010, I visited Paris”; Hyp: “I visited Paris in the winter of 2010”. So the second distance measure is DIST2(P1, P2) = Σ_{i=1..n} |(p1,i − p1,i−1) − (p2,i − p2,i−1)|, where we set p1,0 = 0 and p2,0 = 0. [sent-109, score-0.21]
33 The ordering measure vs is the harmonic mean of v1 and v2: vs = 2/(1/v1 + 1/v2) . [sent-111, score-0.29]
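A sketch of the segment-level ordering measure follows. The two distances match the descriptions above (absolute displacement for v1, jump-width differences for v2), but the normalizations that turn them into similarities in (0, 1] are assumptions, since the paper's exact constants are not recoverable from this extraction:

```python
def ordering_similarity(perm_hyp, perm_ref):
    """Segment-level ordering measure vs: harmonic mean of v1 and v2.

    perm_hyp, perm_ref: for each source word, its position (1..n) in the
    hypothesis / reference, derived from word alignments.
    """
    n = len(perm_ref)
    # DIST1: total absolute displacement (Spearman-footrule style), behind v1.
    dist1 = sum(abs(a - b) for a, b in zip(perm_hyp, perm_ref))
    # DIST2: difference in successive jump widths, with p_0 = 0, behind v2.
    def jumps(p):
        return [p[i] - (p[i - 1] if i else 0) for i in range(n)]
    dist2 = sum(abs(a - b) for a, b in zip(jumps(perm_hyp), jumps(perm_ref)))
    # Assumed normalizations: divide by (loose) upper bounds on each distance.
    v1 = max(1.0 - dist1 / (n * n / 2.0), 1e-9)
    v2 = max(1.0 - dist2 / (2.0 * n * n), 1e-9)
    return 2.0 / (1.0 / v1 + 1.0 / v2)   # vs = 2/(1/v1 + 1/v2)

# "Recently, I visited Paris" vs. "I visited Paris recently":
print(ordering_similarity([4, 1, 2, 3], [1, 2, 3, 4]))
```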
34 For multiple references, we compute vs for each, and then choose the largest as the segment-level ordering similarity. [sent-113, score-0.218]
35 We compute document-level ordering with a weighted arithmetic mean: v = (Σ_{s=1..l} vs × len(Rs)) / (Σ_{s=1..l} len(Rs)) (16), where l is the number of segments in the document and len(Rs) is the length of the reference for segment s. [sent-114, score-0.206]
36 The quadratic mean Qmean(N) of Eq. (10) and the word ordering measure v are combined in a harmonic mean: PORT = 2/(1/Qmean(N) + 1/v^α) (17). Here α is a free parameter that is tuned on held-out data. [sent-118, score-0.238]
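Putting the pieces together, a sketch of equations (16)-(17) as reconstructed above; qmean stands in for the quadratic-mean combination of Eq. (10), which is not reproduced in this extraction, and the default alpha is a placeholder rather than the tuned value:

```python
def document_v(segment_vs, ref_lens):
    """Document-level ordering: length-weighted arithmetic mean (eq. 16)."""
    return sum(v * l for v, l in zip(segment_vs, ref_lens)) / sum(ref_lens)

def port_score(qmean, v, alpha=0.25):
    """Harmonic combination of Qmean and v^alpha (eq. 17, as reconstructed).

    alpha is the free parameter tuned on held-out data; 0.25 here is an
    arbitrary placeholder. Larger alpha gives the ordering measure v more
    influence on the final score.
    """
    return 2.0 / (1.0 / qmean + 1.0 / (v ** alpha))

print(port_score(qmean=0.35, v=document_v([0.8, 0.6], [20, 30])))
```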
37 As it increases, the importance of the ordering measure v goes up. [sent-119, score-0.187]
38 1 Experiments PORT as an Evaluation Metric We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions. [sent-124, score-0.153]
39 We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation. [sent-129, score-0.269]
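For reference, the system-level Spearman correlation can be computed with SciPy; the scores below are made-up placeholders, one metric score and one human score per system:

```python
from scipy.stats import spearmanr

# One metric score and one human score per MT system (made-up numbers).
metric_scores = [0.21, 0.25, 0.19, 0.30, 0.27]
human_scores = [0.40, 0.55, 0.35, 0.60, 0.50]

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```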
40 PORT achieved the best segment-level correlation with human judgment on both the “into English” and “out of English” tasks. [sent-150, score-0.127]
41 This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather its system tuning performance. [sent-152, score-0.287]
42 Most WMT submissions involve language pairs with similar word order, so the ordering factor v in PORT won’t play a big role. [sent-155, score-0.168]
43 Also, v depends on source-target word alignments for reference and test sets. [sent-156, score-0.12]
44 1 Experimental details The first set of experiments to study PORT as a tuning metric involved Chinese-to-English (zh-en); there were two data conditions. [sent-161, score-0.371]
45 The first is the small data condition, where FBIS is used to train the translation and reordering models. [sent-162, score-0.175]
46 All allowed bilingual corpora except UN, Hong Kong Laws and Hong Kong Hansard were used to train the translation model and reordering models. [sent-169, score-0.123]
47 The dev set comprised mainly data from the NIST 2005 test set, and also some balanced-genre web-text from NIST. [sent-173, score-0.105]
48 There is one reference for all dev and test sets. [sent-181, score-0.151]
49 The News test 2008 set is used as the dev set; News test 2009, 2010, and 2011 are used as test sets. [sent-188, score-0.151]
50 One reference is provided for all dev and test sets. [sent-189, score-0.151]
51 In all tuning experiments, both BLEU and PORT performed lowercase matching of n-grams up to n = 4. [sent-198, score-0.264]
52 We also conducted experiments with tuning on a version of BLEU that incorporates SBP (Chiang et al., 2008). [sent-199, score-0.264]
53 2 Comparisons with automatic metrics First, let us see if BLEU-tuning and PORT-tuning yield systems with different translations for the same input. [sent-204, score-0.13]
54 The first row of Table 3 shows the percentage of identical sentence outputs for the two tuning types on test data. [sent-205, score-0.357]
55 For example, for the two zh-en tasks, the two tuning types give systems whose outputs are about 25-30% different at the word level. [sent-208, score-0.36]
56 Table 4 shows translation quality for BLEU- and PORT-tuned systems, as assessed by automatic metrics. [sent-223, score-0.114]
57 Table 3 shows that fr-en outputs are very similar for both tuning types, so the fr-en results are perhaps less informative than the others. [sent-232, score-0.334]
58 Overall, PORT tuning has a striking advantage over BLEU tuning. [sent-233, score-0.264]
59 A previous study (2011) showed that with MERT, if you want the best possible score for a system’s translations according to metric M, then you should tune with M. [sent-236, score-0.15]
60 This doesn’t appear to be true when PORT and BLEU tuning are compared in Table 4. [sent-237, score-0.264]
61 For the two Chinese-to-English tasks in the table, PORT tuning yields a better BLEU score than BLEU tuning, with statistical significance. [sent-238, score-0.264]
62 We are currently investigating why PORT tuning gives higher BLEU scores than BLEU tuning for Chinese-English and German-English. [sent-240, score-0.528]
63 3 Human Evaluation We conducted a human evaluation on outputs from BLEU- and PORT-tuned systems. [sent-244, score-0.121]
64 First, we eliminated examples where the reference had fewer than 10 words or more than 50 words, or where outputs of the BLEU-tuned and PORT-tuned systems were identical. [sent-251, score-0.116]
65 The evaluators (colleagues not involved with this paper) objected to comparing two bad translations, so we then selected for human evaluation only translations that had high sentence-level (1 − TER) scores. [sent-252, score-0.141]
66 To be fair to both metrics, for each condition, we took the union of examples whose BLEU-tuned output was in the top n% of BLEU outputs and those whose PORT-tuned output was in the top n% of PORT outputs (based on (1 − TER)). [sent-253, score-0.14]
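A sketch of this selection step, assuming per-sentence (1 − TER) scores for each system's output; the function and parameter names are hypothetical:

```python
def select_for_human_eval(bleu_scores, port_scores, n_pct=10):
    """Union of sentence ids in the top n% by (1 - TER) under either system.

    bleu_scores / port_scores: dict mapping sentence id -> (1 - TER) of the
    BLEU-tuned / PORT-tuned output for that sentence.
    """
    def top_ids(scores):
        k = max(1, len(scores) * n_pct // 100)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return {sid for sid, _ in ranked[:k]}
    return top_ids(bleu_scores) | top_ids(port_scores)
```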
67 PORT tuning seems to have a bigger advantage over BLEU tuning when the translation task is hard. [sent-272, score-0.6]
68 Of the Table 5 language pairs, the one where PORT tuning helps most has the lowest BLEU in Table 4 (German-English); the one where it helps least in Table 5 has the highest BLEU in Table 4 (French-English). [sent-273, score-0.264]
69 Results (below) for Qmean, a version of PORT without the word ordering factor v, suggest v may be defined suboptimally for French-English. [sent-279, score-0.168]
70 4 Computation time A good tuning metric should run very fast; this is one of the advantages of BLEU. [sent-284, score-0.371]
71 Table 6 shows the time required to score the 100-best hypotheses for the dev set for each data condition during MERT for BLEU and PORT in similar implementations. [sent-285, score-0.134]
72 PORT takes somewhat longer to compute than BLEU, which is reasonable for a tuning metric. [sent-289, score-0.29]
73 5 Robustness to word alignment errors PORT, unlike BLEU, depends on word alignments. [sent-293, score-0.118]
74 How does quality of word alignment between source and reference affect PORT tuning? [sent-294, score-0.138]
75 The table shows tuning with BLEU, PORT with human word alignment (PORT + HWA), and PORT with GIZA++ word alignment (PORT + GWA); the condition is zh-en small. [sent-303, score-0.528]
76 Even at an error rate of 0.32 for automatic word alignment, PORT tuning works about as well with this alignment as with the gold standard CTB one. [sent-305, score-0.377]
77 Table 8 compares tuning with BLEU, PORT, and Qmean. [sent-320, score-0.264]
78 The parameter α was tuned (Eq. 18) for Chinese-English, making the influence of the word ordering measure v in PORT too strong for the European pairs, which have similar word order. [sent-324, score-0.239]
79 What would results be on that language pair if we were to replace v in PORT with another ordering measure? [sent-326, score-0.142]
80 Table 9 gives a partial answer, replacing v in PORT with Spearman’s ρ or Kendall’s τ for the zh-en small condition (CTB with human word alignment is the dev set). [sent-327, score-0.254]
81 A related question is how much word ordering improvement we obtained from tuning with PORT. [sent-337, score-0.432]
82 We evaluate Chinese-English word ordering with three measures: Spearman’s ρ, Kendall’s τ distance as applied to two permutations (see section 2.2), and the ordering measure v. [sent-338, score-0.24]
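The first two measures are available off the shelf in SciPy; a small sketch with made-up permutations:

```python
from scipy.stats import kendalltau, spearmanr

# Positions of each source word in the hypothesis vs. the reference,
# recovered from word alignments (made-up permutations).
perm_hyp = [2, 3, 4, 1, 5]
perm_ref = [1, 2, 3, 4, 5]

rho, _ = spearmanr(perm_hyp, perm_ref)
tau, _ = kendalltau(perm_hyp, perm_ref)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```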
83 Table 10 shows the effects of BLEU and PORT tuning on these three measures, for three test sets in the zh-en large condition. [sent-341, score-0.287]
84 A large value of ρ, τ, or v implies outputs have ordering similar to that in the reference. [sent-343, score-0.212]
85 From the table, we see that the PORT-tuned system yielded better word order than the BLEU-tuned system in all nine combinations of test sets and ordering measures. [sent-344, score-0.191]
86 The advantage of PORT tuning is particularly noticeable on the most reliable test set: the hand-aligned CTB data. [sent-345, score-0.287]
87 What is the impact of the strict redundancy penalty on PORT? [sent-346, score-0.197]
88 Note that in Table 8, even though Qmean has no ordering measure, it outperforms BLEU. [sent-347, score-0.142]
89 Table 11 shows the BLEU brevity penalty (BP) and (number of matching 1- & 4-grams)/(number of total 1- & 4-grams) for the translations. [sent-348, score-0.163]
90 We believe this is because of the strict redundancy penalty in Qmean. [sent-354, score-0.197]
91 As usual, French-English is the outlier: the two outputs here are typically so similar that BLEU and Qmean tuning yield very similar n-gram statistics. [sent-355, score-0.334]
92 4 Conclusions In this paper, we have proposed a new tuning metric for SMT systems. [sent-356, score-0.371]
93 PORT incorporates precision, recall, strict brevity penalty and strict redundancy penalty, plus a new word ordering measure v. [sent-357, score-0.549]
94 Most important, our results show that PORT-tuned MT systems yield better translations than BLEU-tuned systems on several language pairs, according both to automatic metrics and human evaluations. [sent-359, score-0.158]
95 METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. [sent-365, score-0.208]
96 MAXSIM: A maximum similarity metric for machine translation evaluation. [sent-420, score-0.179]
97 Decomposability of translation metrics for improved evaluation and efficient algorithms. [sent-438, score-0.161]
98 Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. [sent-444, score-0.153]
99 The METEOR metric for automatic evaluation of machine translation. [sent-507, score-0.151]
100 MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. [sent-529, score-0.179]
wordName wordTfidf (topN-words)
[('port', 0.765), ('bleu', 0.283), ('tuning', 0.264), ('ordering', 0.142), ('qmean', 0.132), ('metric', 0.107), ('meteor', 0.105), ('penalty', 0.091), ('dev', 0.082), ('mt', 0.076), ('sbp', 0.074), ('brevity', 0.072), ('translation', 0.072), ('wmt', 0.071), ('outputs', 0.07), ('spearman', 0.069), ('ctb', 0.067), ('strict', 0.067), ('alignment', 0.066), ('metrics', 0.066), ('kendall', 0.062), ('kuhn', 0.054), ('condition', 0.052), ('pg', 0.051), ('reordering', 0.051), ('evaluators', 0.047), ('judgment', 0.046), ('reference', 0.046), ('osborne', 0.045), ('measure', 0.045), ('lrscore', 0.044), ('porttuned', 0.044), ('punishes', 0.044), ('bp', 0.044), ('visited', 0.044), ('translations', 0.043), ('len', 0.041), ('permutations', 0.039), ('redundancy', 0.039), ('giza', 0.039), ('srp', 0.038), ('arithmetic', 0.038), ('birch', 0.038), ('mert', 0.038), ('tesla', 0.035), ('snover', 0.035), ('paris', 0.034), ('distance', 0.033), ('conditions', 0.033), ('ter', 0.031), ('tunable', 0.031), ('files', 0.031), ('judgments', 0.031), ('amber', 0.029), ('parton', 0.029), ('correlation', 0.029), ('metricsmatr', 0.028), ('monz', 0.028), ('cer', 0.028), ('human', 0.028), ('precision', 0.027), ('comparisons', 0.027), ('lms', 0.027), ('quadratic', 0.027), ('vs', 0.026), ('word', 0.026), ('mean', 0.026), ('compute', 0.026), ('maxsim', 0.026), ('hyp', 0.026), ('hansard', 0.026), ('madnani', 0.026), ('target', 0.025), ('harmonic', 0.025), ('chen', 0.025), ('alignments', 0.025), ('permutation', 0.025), ('segment', 0.024), ('lavie', 0.024), ('recall', 0.024), ('grams', 0.023), ('denkowski', 0.023), ('boxing', 0.023), ('ref', 0.023), ('fifth', 0.023), ('test', 0.023), ('correlate', 0.023), ('evaluation', 0.023), ('chan', 0.023), ('chiang', 0.022), ('lm', 0.022), ('winter', 0.022), ('fleiss', 0.022), ('picked', 0.022), ('precisions', 0.022), ('automatic', 0.021), ('preferred', 0.021), ('koehn', 0.021), ('pado', 0.021), ('assessed', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
2 0.20717347 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
Author: Xiaodong He ; Li Deng
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 201 1 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.
3 0.13221845 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
Author: Chang Liu ; Hwee Tou Ng
Abstract: In this work, we introduce the TESLACELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLACELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLACELAB significantly outperforms characterlevel BLEU in the English-Chinese translation evaluation tasks. –
4 0.12325128 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierachical phrase-based model, and various syntaxbased models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algrithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning. 1
5 0.11047187 140 acl-2012-Machine Translation without Words through Substring Alignment
Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara
Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.
8 0.10363259 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
9 0.1014908 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
10 0.097530402 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
11 0.097481772 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
12 0.097438119 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
13 0.093504235 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
14 0.090622105 136 acl-2012-Learning to Translate with Multiple Objectives
15 0.089631729 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation
16 0.085764065 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
17 0.084066361 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
18 0.083801635 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation
19 0.083383858 178 acl-2012-Sentence Simplification by Monolingual Machine Translation
20 0.078111425 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes
topicId topicWeight
[(0, -0.187), (1, -0.183), (2, 0.084), (3, 0.066), (4, 0.075), (5, -0.045), (6, -0.021), (7, -0.009), (8, -0.026), (9, 0.023), (10, -0.028), (11, -0.01), (12, -0.008), (13, -0.006), (14, -0.031), (15, 0.04), (16, 0.021), (17, 0.017), (18, -0.028), (19, 0.003), (20, 0.106), (21, -0.117), (22, 0.092), (23, -0.004), (24, 0.084), (25, 0.025), (26, 0.006), (27, 0.127), (28, 0.079), (29, 0.063), (30, -0.036), (31, -0.005), (32, 0.095), (33, -0.065), (34, 0.159), (35, -0.105), (36, -0.08), (37, -0.066), (38, 0.033), (39, -0.013), (40, 0.053), (41, 0.14), (42, 0.027), (43, 0.055), (44, 0.107), (45, 0.155), (46, -0.06), (47, 0.112), (48, -0.071), (49, 0.183)]
simIndex simValue paperId paperTitle
same-paper 1 0.94405872 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
2 0.81971741 136 acl-2012-Learning to Translate with Multiple Objectives
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
3 0.75408381 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
Author: Chang Liu ; Hwee Tou Ng
Abstract: In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis – Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
4 0.58426845 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
Author: Xiaodong He ; Li Deng
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 201 1 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.
5 0.55606169 163 acl-2012-Prediction of Learning Curves in Machine Translation
Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy
Abstract: Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assesment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios, 1) Monolingual samples in the source and target languages are available and 2) An additional small amount of parallel corpus is also available. We propose methods for predicting learning curves in both these scenarios.
6 0.52375817 34 acl-2012-Automatically Learning Measures of Child Language Development
7 0.50543135 178 acl-2012-Sentence Simplification by Monolingual Machine Translation
8 0.48967579 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
10 0.47820163 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
11 0.47199482 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
12 0.43136021 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
13 0.3983292 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
14 0.39143732 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
15 0.38412258 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
16 0.3827965 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
17 0.37913471 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
18 0.37853587 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
19 0.3691929 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
20 0.36631674 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes
topicId topicWeight
[(25, 0.021), (26, 0.035), (28, 0.078), (30, 0.034), (37, 0.034), (39, 0.034), (57, 0.024), (59, 0.014), (73, 0.19), (74, 0.088), (82, 0.021), (84, 0.017), (85, 0.075), (90, 0.11), (92, 0.047), (94, 0.043), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.76550448 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale
Author: Michael Wick ; Sameer Singh ; Andrew McCallum
Abstract: Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As these drawbacks become increasingly restrictive, the need to replace the pairwise approaches with a more expressive, highly scalable alternative is becoming urgent. In this paper we propose a novel discriminative hierarchical model that recursively partitions entities into trees of latent sub-entities. These trees succinctly summarize the mentions providing a highly compact, information-rich structure for reasoning about entities and coreference uncertainty at massive scales. We demonstrate that the hierarchical model is several orders of magnitude faster than pairwise, allowing us to perform coreference on six million author mentions in under four hours on a single CPU.
same-paper 2 0.75126767 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
3 0.65037787 136 acl-2012-Learning to Translate with Multiple Objectives
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
Author: Joern Wuebker ; Hermann Ney ; Richard Zens
Abstract: In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions. Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2. Two look-ahead methods are shown to further increase translation speed by a factor of 2 without changing the search space and a factor of 4 with the side-effect of some additional search errors. We compare our approach with Moses and observe the same performance, but a substantially better trade-off between translation quality and speed. At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU, whereas our approach yields 20.0% with identical models.
5 0.63314313 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.
6 0.63298589 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
8 0.63183701 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
10 0.62927812 140 acl-2012-Machine Translation without Words through Substring Alignment
11 0.6279887 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
12 0.62631613 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
13 0.62391144 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
14 0.62252629 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
15 0.62144023 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability
16 0.61777753 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
17 0.61678571 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing
18 0.61606818 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
19 0.61529636 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
20 0.61312628 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages