acl acl2011 acl2011-264 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandra Birch ; Miles Osborne
Abstract: One of the major challenges facing statistical machine translation is how to model differences in word order between languages. Although a great deal of research has focussed on this problem, progress is hampered by the lack of reliable metrics. Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. We also show that the LRscore can successfully be used as the objective function when training translation model parameters. Training with the LRscore leads to output which is preferred by humans. Moreover, the translations incur no penalty in terms of BLEU scores.
Reference: text
sentIndex sentText sentNum sentScore
1 Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. [sent-9, score-0.392]
2 This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. [sent-10, score-0.278]
3 We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. [sent-11, score-0.423]
4 1 Introduction Research in machine translation has focused broadly on two main goals, improving word choice and improving word order in translation output. [sent-15, score-0.292]
5 Current machine translation metrics rely upon indirect methods for measuring the quality of the word order, and their ability to capture the quality of word order is poor (Birch et al. [sent-16, score-0.425]
6 This method does not consider the position of matching words, and only captures ordering differences if there is an exact match between the words in the translation and the reference. [sent-22, score-0.229]
7 They both search for an alignment between the translation and the reference, and from this they calculate a penalty based on the number of differences in order between the two sentences. [sent-25, score-0.267]
8 Importantly, none of these metrics capture the distance by which words are out of order. [sent-27, score-0.258]
9 Also, they conflate reordering performance with the quality of the lexical items in the translation, making it difficult to tease apart the impact of changes. [sent-28, score-0.401]
10 This results in a simple, decomposable metric which makes it easy for researchers to pinpoint the effect of their changes. [sent-34, score-0.218]
11 In this paper we show that the LRscore is more consistent with human judgements. [sent-35, score-0.262]
12 We also apply the LRscore during Minimum Error Rate Training (MERT) to see whether information on reordering allows the translation model to produce better reorderings. [sent-38, score-0.463]
13 Section 2 describes the reordering and lexical metrics that are used and how they are combined. [sent-44, score-0.53]
14 Section 3 presents the experiments on consistency with human judgements and describes how to train the language independent parameter of the LRscore. [sent-45, score-0.333]
15 2 The LRscore In this section we present the LRscore which measures reordering using permutation distance metrics. [sent-48, score-0.559]
16 These reordering metrics have been demonstrated to correlate strongly with human judgements of word order quality (Birch et al. [sent-49, score-0.802]
17 The LRscore combines the reordering metrics with lexical metrics to provide a complete metric for evaluating machine translations. [sent-51, score-0.928]
18 1 Reordering metrics The relative ordering of words in the source and target sentences is encoded in alignments. [sent-53, score-0.307]
19 We can interpret alignments as permutations, which allows us to apply research into metrics for ordered encodings to measuring and evaluating reorderings. [sent-54, score-0.32]
20 We use distance metrics over permutations to evaluate reordering performance. [sent-55, score-0.665]
21 In Figure 1, (a) represents the identity permutation, which would result from a monotone alignment; (b) represents a small reordering consisting of two words whose orders are inverted; and (c) represents a large reordering where the two halves of the sentence are inverted in the target. [sent-58, score-0.844]
22 Three permutations: (a) monotone, (b) with a small reordering, and (c) with a large reordering. [sent-62, score-0.42]
23 We choose permutation distance metrics which are sensitive to the number of words that are out of order, as humans are assumed to be sensitive to this property of a sentence. [sent-77, score-0.463]
24 The metrics are normalised so that 0 means that the permutations are completely inverted, and 1 means that they are identical. [sent-79, score-0.232]
25 All metrics are adjusted so that 100 is the best score and 0 the worst. [sent-93, score-0.244]
26 The Hamming distance is the simplest permutation distance metric and is useful as a baseline. [sent-95, score-0.449]
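Since the section walks through these distances, a minimal sketch may help; nothing below comes from the paper itself, and it assumes a permutation is stored as a list of 0-based target positions, with the identity list as the monotone reference:

```python
def hamming_distance(perm):
    """Normalised Hamming distance d_h from the monotone (identity)
    permutation: the fraction of positions whose symbol is unmoved.
    1 = identical, 0 = every symbol displaced."""
    n = len(perm)
    matches = sum(1 for i, p in enumerate(perm) if p == i)
    return matches / n
```

Scaling the result by 100 gives the 0-100 presentation used in the paper's tables.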
27 2 Kendall’s Tau Distance Kendall’s tau distance is the minimum number of transpositions of two adjacent symbols necessary to transform one permutation into another (Kendall, 1938). [sent-99, score-0.368]
28 The distance is normalised: d_k = 1 − (Σ_ij z_ij) / Z, where z_ij = 1 if π(i) < π(j) and σ(i) > σ(j), and 0 otherwise, and Z = (n² − n)/2. Kendall's tau seems particularly appropriate for measuring word order differences as the relative ordering of words is taken into account. [sent-102, score-0.28]
29 However, most human and machine ordering differences are much closer to monotone than to inverted. [sent-103, score-0.232]
30 This adjusted dk is also more correlated with human judgements of reordering quality (Birch et al. [sent-106, score-0.729]
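A matching sketch for Kendall's tau distance follows; the square-root step is one plausible reading of the "adjusted dk" above, taken from the published LRscore formulation rather than from this extract, so treat it as an assumption:

```python
from itertools import combinations

def kendalls_tau_distance(perm, adjusted=True):
    """Kendall's tau distance d_k from the monotone permutation:
    1 minus the (optionally square-root adjusted) fraction of
    discordant pairs.  1 = monotone, 0 = fully inverted."""
    n = len(perm)
    # z_ij = 1 when pair (i, j) appears out of order in perm
    z = sum(1 for i, j in combinations(range(n), 2) if perm[i] > perm[j])
    Z = (n * n - n) / 2.0  # total number of pairs
    frac = z / Z
    # sqrt spreads out scores near monotone, where most reorderings lie
    return 1.0 - (frac ** 0.5 if adjusted else frac)
```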
31 We use the example in Figure 1 to highlight the problem with current MT metrics, and to demonstrate how the permutation distance metrics are calculated. [sent-108, score-0.36]
32 The metrics are calculated by comparing the permutation string with the monotone permutation. [sent-110, score-0.374]
33 BLEU and METEOR fail to recognise that (b) represents a small reordering and (c) a large reordering and they assign a lower score to (b). [sent-112, score-0.738]
34 Both the Hamming distance dh and the Kendall’s tau distance dk correctly assign (c) a worse score than (b). [sent-118, score-0.461]
35 Note that for (c), the Hamming distance was not able to reward the permutation for the correct relative ordering of words within the two large blocks and gave (c) a score of 0, whereas Kendall’s tau takes relative ordering into account. [sent-119, score-0.595]
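Running the two sketches above on permutations shaped like those in Figure 1 reproduces this behaviour; the 8-word length is assumed, since the figure itself is not recoverable from this extract:

```python
perms = {
    "(a) monotone":         [0, 1, 2, 3, 4, 5, 6, 7],
    "(b) small reordering": [0, 1, 3, 2, 4, 5, 6, 7],
    "(c) halves inverted":  [4, 5, 6, 7, 0, 1, 2, 3],
}
for name, p in perms.items():
    print(f"{name}: d_h={hamming_distance(p):.2f} "
          f"d_k={kendalls_tau_distance(p, adjusted=False):.2f}")
# d_h scores (c) as 0.00 -- every word is in the wrong absolute
# position -- while d_k still credits the correct relative order
# inside each half, and both rank (c) below (b).
```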
36 The reordering component is the average difference of absolute and relative word positions, which has no clear meaning. [sent-124, score-0.384]
37 This score is not intuitive or easily decomposable and it is more similar to METEOR, with synonym and stem functionality mixed with a reordering penalty, than to our metric. [sent-125, score-0.436]
38 2 Combined Metric The LRscore consists of a reordering distance metric which is linearly interpolated with a lexical score to form a complete machine translation evaluation metric. [sent-127, score-0.804]
39 The metric is decomposable because the individual lexical and reordering components can be looked at individually. [sent-128, score-0.583]
40 The following formula describes how to calculate the LRscore: LRscore = αR + (1 − α)L (1) The metric contains only one parameter, α, which balances the contribution of the reordering metric, R, and the lexical metric, L. [sent-129, score-0.574]
41 R is the average permutation distance metric adjusted by the brevity penalty and it is calculated as follows: R = (1/|S|) Σ_{s∈S} d_s · BP_s (2) where S is the set of test sentences, d_s is the reordering distance for sentence s, and BP_s is its brevity penalty. [sent-131, score-1.019]
42 The brevity penalty within the reordering component is necessary as the distance-based metric would provide the same score for a one-word translation as it would for a longer monotone translation. [sent-136, score-0.918]
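Putting equations (1) and (2) together, a corpus-level sketch; the exp(1 − r/c) form of the brevity penalty is the usual BLEU-style definition, which this extract never spells out, so it is an assumption here:

```python
import math

def brevity_penalty(cand_len, ref_len):
    """Assumed BLEU-style brevity penalty: 1 if the candidate is at
    least as long as the reference, exp(1 - r/c) otherwise."""
    return 1.0 if cand_len >= ref_len else math.exp(1.0 - ref_len / cand_len)

def lr_score(distances, bps, lexical_score, alpha):
    """LRscore = alpha * R + (1 - alpha) * L   (equation 1), with
    R = (1/|S|) * sum over s of d_s * BP_s    (equation 2)."""
    R = sum(d * bp for d, bp in zip(distances, bps)) / len(distances)
    return alpha * R + (1.0 - alpha) * lexical_score
```

Here `lexical_score` would typically be the 4-gram BLEU score discussed next.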
43 The 4-gram BLEU score includes some measure of the local reordering success in the precision of the longer n-grams. [sent-140, score-0.402]
44 BLEU is an important baseline, and improving on it by including more reordering information is an interesting result. [sent-141, score-0.34]
45 The lexical component of the system can be any meaningful metric for a particular target language. [sent-142, score-0.232]
46 3 Consistency with Human Judgements Automatic metrics must be validated by comparing their scores with human judgements. [sent-145, score-0.232]
47 We train the metric parameter to optimise consistency with human preference judgements across different language pairs and then we show that the LRscore is more consistent with humans than other commonly used metrics. [sent-146, score-0.638]
48 In total there were 52,265 pairwise rank judgements collected. [sent-152, score-0.214]
49 Our reordering metric relies upon word alignments that are generated between the source and the reference sentences, and the source and the translated sentences. [sent-153, score-0.625]
50 In an ideal scenario, the translation system outputs the alignments and the reference set can be selected to have gold standard human alignments. [sent-154, score-0.288]
51 However, the data that we use to evaluate metrics does not have any gold standard alignments and we must train automatic alignment models to generate them. [sent-155, score-0.267]
52 The metric scores are calculated for the test set from the 2009 workshop on machine translation. [sent-162, score-0.235]
53 METEOR has 3 parameters which have been trained for human judgements of rank (Lavie and Agarwal, 2008). [sent-168, score-0.261]
54 We ascertained how consistent the automatic metrics were with the human judgements in the following manner. [sent-182, score-0.489]
55 We take each pairwise comparison of translation output for single sentences by a particular judge, and we record whether or not the metrics are consistent with the human rank. [sent-183, score-0.423]
56 That is, we count cases where both the metric and the human judge agree that one system is better than another. [sent-186, score-0.227]
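A sketch of that count, assuming per-sentence metric scores are available for each system; the extract does not say how exact metric ties are handled, so here a tie counts as a disagreement:

```python
def consistency(judgements, scores):
    """judgements: iterable of (sentence_id, preferred_sys, other_sys)
    human pairwise ranks; scores: dict (sentence_id, system) -> float.
    Returns the fraction of judgements the metric agrees with."""
    agree = sum(1 for sid, win, lose in judgements
                if scores[(sid, win)] > scores[(sid, lose)])
    return agree / len(judgements)
```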
57 The average Kendall’s tau reordering distance between the test and reference sentences. [sent-194, score-0.618]
58 Using multiple language pairs, we train the parameter according to the amount of reordering seen in each test set. [sent-197, score-0.401]
59 They can simply calculate the amount of reordering in the test set and adjust the parameter accordingly. [sent-199, score-0.424]
60 The amount of reordering is calculated as the Kendall’s tau distance between the source and the reference sentences as compared to dummy monotone sentences. [sent-200, score-0.792]
61 The amount of reordering for the test sentences is reported in Table 2. [sent-201, score-0.39]
62 German-English shows more reordering than the other language pairs, as it has a lower dk score of 73. [sent-202, score-0.472]
63 The language independent parameter (θ) is adjusted by applying the reordering amount (dk) as an exponent. [sent-204, score-0.441]
64 α represents the percentage contribution of the reordering component in the LRscore: α = θ^dk (4) The language independent parameter θ is trained once, over multiple language pairs. [sent-208, score-0.442]
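Equation (4) in code form; dk is taken here as a fraction in [0, 1] rather than a 0-100 score, and the theta value is purely illustrative since the trained value is not recoverable from this extract:

```python
def reordering_weight(theta, dk):
    """alpha = theta ** dk: with theta < 1, a test set showing more
    reordering (a lower dk) receives a larger reordering weight."""
    return theta ** dk

# e.g. German-English, the pair with the most reordering (dk = 0.73):
print(reordering_weight(0.5, 0.73))  # ~0.60, versus ~0.54 at dk = 0.90
```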
65 2 Results Table 3 reports the optimal consistency of the LRscore and baseline metrics with human judgements for each language pair. [sent-213, score-0.488]
66 The LRscore variations are named as follows: LR refers to the LRscore, “H” refers to the Hamming distance and “K” to Kendall’s tau distance. [sent-214, score-0.228]
67 This is an important result which shows that combining lexical and reordering information makes for a stronger metric than the baseline metrics which do not have a strong reordering component. [sent-217, score-1.031]
68 Here LR-KB4 is the best metric, which shows that metrics which are sensitive to the distance by which words are out of order are more appropriate for situations with a reasonable amount of reordering. [sent-220, score-0.329]
69 MERT minimises translation errors according to some automatic evaluation metric while searching for the best parameter settings over the N-best output. [sent-223, score-0.317]
70 Training optimises what the metric rewards, but will be blind to aspects of translation quality that are not directly captured by the metric. [sent-226, score-0.32]
71 We apply the LRscore in order to improve the reordering performance of a phrase-based translation model. [sent-227, score-0.486]
72 1 Experimental Design We hypothesise that the LRscore is a good metric for training translation models. [sent-229, score-0.301]
73 We used the Moses translation toolkit, including a lexicalised reordering model. [sent-243, score-0.463]
74 The parameter setting representing the % impact of the reordering component for the different versions of the LRscore metric. [sent-250, score-0.4]
75 The reordering metrics require alignments which were created using the Berkeley word alignment package version 1. [sent-255, score-0.575]
76 We first extracted the LRscore's Kendall's tau distance from the monotone for the Chinese-English test set, and this value was 66. [sent-259, score-0.308]
77 This is far more reordering than the other language pairs shown in Table 2. [sent-261, score-0.358]
78 We then calculated the optimal parameter setting, using the reordering amount as a power exponent. [sent-262, score-0.428]
79 The optimal amount of reordering for LR-HB4 is low, but the results show it still makes an important contribution. [sent-264, score-0.368]
80 2 Human Evaluation Setup Human judgements of translation quality are necessary to determine whether humans prefer sentences from models trained with the BLEU score or with the LRscore. [sent-267, score-0.519]
81 Workers who got less than 60% of these gold questions correct were disqualified and their judgements discarded. [sent-281, score-0.227]
82 Three judgements were collected from the trusted workers for each of the 120 test sentences. [sent-289, score-0.286]
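The filtering rule just described is simple enough to sketch; the data layout is hypothetical:

```python
def trusted_workers(gold_answers, threshold=0.6):
    """gold_answers: dict worker_id -> list of booleans, one per gold
    question (True = answered correctly).  Workers below the threshold
    are disqualified and their judgements discarded."""
    return {w for w, answers in gold_answers.items()
            if sum(answers) / len(answers) >= threshold}
```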
83 1 Automatic Evaluation of MERT In this experiment we demonstrate that the reordering metrics can be used as a learning criterion in minimum error rate training to improve parameter estimation for machine translation. [sent-293, score-0.597]
84 isolation, and also as part of the LRscore together with the Hamming distance and Kendall’s tau distance. [sent-306, score-0.228]
85 The first thing we note in Table 5 is that we would expect the highest scores when training with the same metric as that used for evaluation, since MERT maximises the objective function on the development data set. [sent-308, score-0.241]
86 The reordering component is more discerning than the BLEU score. [sent-310, score-0.367]
87 This might make the reordering metric easier to optimise, leading to the joint best scores at test time. [sent-312, score-0.525]
88 Although it is interesting to look at the model weights, any final conclusion on the impact of the metrics on training must depend on human evaluation of translation quality. [sent-320, score-0.348]
89 2 Human Evaluation We collect human preference judgements for output from systems trained using the BLEU score and the LRscore in order to determine whether training with the LRscore leads to genuine improvements in translation quality. [sent-327, score-0.529]
90 In order to judge how reliable our judgements are, we calculate the inter-annotator agreement. [sent-338, score-0.264]
91 We expect that more substantial gains can be made in the future by using models which have more powerful reordering capabilities. [sent-346, score-0.34]
92 A richer set of reordering features and a model capable of longer-distance reordering would better leverage metrics which reward good word orderings. [sent-347, score-0.979]
93 5 Conclusion We introduced the LRscore which combines a lexical and a reordering metric. [sent-348, score-0.385]
94 The main motivation for this metric is the fact that it measures the reordering quality of MT output by using permutation distance metrics. [sent-349, score-0.783]
95 It is a simple, decomposable metric which interpolates the reordering component with a lexical component, the BLEU score. [sent-350, score-0.633]
96 This paper demonstrates that the LRscore metric is more consistent with human preference judgements of machine translation quality than other machine translation metrics. [sent-351, score-0.766]
97 Ultimately, the availability of a metric which reliably measures reordering performance should accelerate progress towards developing more powerful reordering models. [sent-354, score-0.865]
98 Meteor, m-BLEU and m-TER: Evaluation metrics for high-correlation with human rankings of machine translation output. [sent-415, score-0.354]
99 ORANGE: a method for evaluating automatic evaluation metrics for machine translation. [sent-424, score-0.217]
100 Measuring machine translation quality as semantic equivalence: A metric based on entailment features. [sent-434, score-0.343]
wordName wordTfidf (topN-words)
[('lrscore', 0.726), ('reordering', 0.34), ('judgements', 0.195), ('bleu', 0.182), ('metrics', 0.165), ('metric', 0.161), ('tau', 0.135), ('translation', 0.123), ('meteor', 0.109), ('kendall', 0.107), ('permutation', 0.102), ('hamming', 0.098), ('distance', 0.093), ('monotone', 0.08), ('dk', 0.075), ('workers', 0.068), ('mert', 0.067), ('ordering', 0.067), ('permutations', 0.067), ('consistency', 0.062), ('brevity', 0.057), ('decomposable', 0.057), ('reference', 0.05), ('penalty', 0.049), ('ter', 0.046), ('human', 0.043), ('adjusted', 0.04), ('humans', 0.04), ('alignments', 0.04), ('score', 0.039), ('objective', 0.039), ('preference', 0.039), ('quality', 0.036), ('preferred', 0.036), ('mechanical', 0.036), ('birch', 0.033), ('parameter', 0.033), ('gold', 0.032), ('hillclimbing', 0.032), ('kittur', 0.032), ('preferring', 0.032), ('unmatched', 0.032), ('aligned', 0.031), ('cer', 0.031), ('alignment', 0.03), ('mt', 0.03), ('evaluating', 0.029), ('null', 0.028), ('amount', 0.028), ('component', 0.027), ('output', 0.027), ('inverted', 0.027), ('calculated', 0.027), ('dh', 0.026), ('translations', 0.025), ('lexical', 0.025), ('judgement', 0.025), ('balances', 0.025), ('consistent', 0.024), ('measures', 0.024), ('scores', 0.024), ('reports', 0.023), ('worker', 0.023), ('interpolates', 0.023), ('subtracting', 0.023), ('trusted', 0.023), ('optimise', 0.023), ('calculate', 0.023), ('turk', 0.023), ('trained', 0.023), ('machine', 0.023), ('longer', 0.023), ('order', 0.023), ('lavie', 0.023), ('judge', 0.023), ('block', 0.022), ('sentences', 0.022), ('blocks', 0.022), ('prefer', 0.022), ('miles', 0.02), ('combines', 0.02), ('matching', 0.02), ('sensitive', 0.02), ('bp', 0.02), ('minimum', 0.019), ('moses', 0.019), ('measuring', 0.019), ('represents', 0.019), ('banerjee', 0.019), ('pairwise', 0.019), ('target', 0.019), ('differences', 0.019), ('necessary', 0.019), ('bulletin', 0.019), ('reward', 0.018), ('pairs', 0.018), ('whereas', 0.018), ('relative', 0.017), ('training', 0.017), ('source', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 264 acl-2011-Reordering Metrics for MT
Author: Alexandra Birch ; Miles Osborne
Abstract: One of the major challenges facing statistical machine translation is how to model differences in word order between languages. Although a great deal of research has focussed on this problem, progress is hampered by the lack of reliable metrics. Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. We also show that the LRscore can successfully be used as the objective function when training translation model parameters. Training with the LRscore leads to output which is preferred by humans. Moreover, the translations incur no penalty in terms of BLEU scores.
2 0.23815753 266 acl-2011-Reordering with Source Language Collocations
Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Ting Liu ; Sheng Li
Abstract: This paper proposes a novel reordering model for statistical machine translation (SMT) by means of modeling the translation orders of the source language collocations. The model is learned from a word-aligned bilingual corpus where the collocated words in source sentences are automatically detected. During decoding, the model is employed to softly constrain the translation orders of the source language collocations, so as to constrain the translation orders of those source phrases containing these collocated words. The experimental results show that the proposed method significantly improves the translation quality, achieving the absolute improvements of 1.1~1.4 BLEU score over the baseline methods. 1
3 0.17171964 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
Author: Nadir Durrani ; Helmut Schmid ; Alexander Fraser
Abstract: We present a novel machine translation model which models translation by a linear sequence of operations. In contrast to the “N-gram” model, this sequence includes not only translation but also reordering operations. Key ideas of our model are (i) a new reordering approach which better restricts the position to which a word or phrase can be moved, and is able to handle short and long distance reorderings in a unified way, and (ii) a joint sequence model for the translation and reordering probabilities which is more flexible than standard phrase-based MT. We observe statistically significant improvements in BLEU over Moses for German-to-English and Spanish-to-English tasks, and comparable results for a French-to-English task.
4 0.16603106 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1
5 0.15769219 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
Author: Markos Mylonakis ; Khalil Sima'an
Abstract: While it is generally accepted that many translation phenomena are correlated with linguistic structures, employing linguistic syntax for translation has proven a highly non-trivial task. The key assumption behind many approaches is that translation is guided by the source and/or target language parse, employing rules extracted from the parse tree or performing tree transformations. These approaches enforce strict constraints and might overlook important translation phenomena that cross linguistic constituents. We propose a novel flexible modelling approach to introduce linguistic information of varying granularity from the source side. Our method induces joint probability synchronous grammars and estimates their parameters, by select- ing and weighing together linguistically motivated rules according to an objective function directly targeting generalisation over future data. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
6 0.15697135 263 acl-2011-Reordering Constraint Based on Document-Level Context
8 0.1445553 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
9 0.13963225 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
10 0.13113862 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
11 0.12591407 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
12 0.12248671 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
13 0.10751865 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
14 0.10690181 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
15 0.10270482 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
16 0.10249069 69 acl-2011-Clause Restructuring For SMT Not Absolutely Helpful
17 0.09804792 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
18 0.093490772 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
19 0.088302016 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
20 0.079741009 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation
topicId topicWeight
[(0, 0.187), (1, -0.154), (2, 0.109), (3, 0.146), (4, 0.035), (5, 0.045), (6, 0.019), (7, -0.0), (8, 0.049), (9, -0.001), (10, 0.015), (11, -0.103), (12, -0.019), (13, -0.178), (14, -0.091), (15, 0.046), (16, -0.057), (17, 0.035), (18, -0.103), (19, -0.019), (20, -0.017), (21, 0.056), (22, -0.039), (23, -0.087), (24, -0.062), (25, 0.05), (26, 0.058), (27, -0.014), (28, 0.043), (29, 0.093), (30, -0.01), (31, -0.019), (32, 0.117), (33, 0.066), (34, 0.05), (35, -0.08), (36, -0.025), (37, 0.053), (38, 0.012), (39, -0.001), (40, 0.161), (41, -0.117), (42, -0.039), (43, 0.074), (44, 0.027), (45, -0.006), (46, -0.033), (47, 0.093), (48, 0.042), (49, -0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.93461639 264 acl-2011-Reordering Metrics for MT
Author: Alexandra Birch ; Miles Osborne
Abstract: One of the major challenges facing statistical machine translation is how to model differences in word order between languages. Although a great deal of research has focussed on this problem, progress is hampered by the lack of reliable metrics. Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. We also show that the LRscore can successfully be used as the objective function when training translation model parameters. Training with the LRscore leads to output which is preferred by humans. Moreover, the translations incur no penalty in terms of BLEU scores.
2 0.82801741 266 acl-2011-Reordering with Source Language Collocations
Author: Zhanyi Liu ; Haifeng Wang ; Hua Wu ; Ting Liu ; Sheng Li
Abstract: This paper proposes a novel reordering model for statistical machine translation (SMT) by means of modeling the translation orders of the source language collocations. The model is learned from a word-aligned bilingual corpus where the collocated words in source sentences are automatically detected. During decoding, the model is employed to softly constrain the translation orders of the source language collocations, so as to constrain the translation orders of those source phrases containing these collocated words. The experimental results show that the proposed method significantly improves the translation quality, achieving the absolute improvements of 1.1~1.4 BLEU score over the baseline methods. 1
3 0.77626216 263 acl-2011-Reordering Constraint Based on Document-Level Context
Author: Takashi Onishi ; Masao Utiyama ; Eiichiro Sumita
Abstract: One problem with phrase-based statistical machine translation is the problem of longdistance reordering when translating between languages with different word orders, such as Japanese-English. In this paper, we propose a method of imposing reordering constraints using document-level context. As the documentlevel context, we use noun phrases which significantly occur in context documents containing source sentences. Given a source sentence, zones which cover the noun phrases are used as reordering constraints. Then, in decoding, reorderings which violate the zones are restricted. Experiment results for patent translation tasks show a significant improvement of 1.20% BLEU points in JapaneseEnglish translation and 1.41% BLEU points in English-Japanese translation.
4 0.73457789 69 acl-2011-Clause Restructuring For SMT Not Absolutely Helpful
Author: Susan Howlett ; Mark Dras
Abstract: There are a number of systems that use a syntax-based reordering step prior to phrasebased statistical MT. An early work proposing this idea showed improved translation performance, but subsequent work has had mixed results. Speculations as to cause have suggested the parser, the data, or other factors. We systematically investigate possible factors to give an initial answer to the question: Under what conditions does this use of syntax help PSMT?
5 0.70851678 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
Author: Nadir Durrani ; Helmut Schmid ; Alexander Fraser
Abstract: We present a novel machine translation model which models translation by a linear sequence of operations. In contrast to the “N-gram” model, this sequence includes not only translation but also reordering operations. Key ideas of our model are (i) a new reordering approach which better restricts the position to which a word or phrase can be moved, and is able to handle short and long distance reorderings in a unified way, and (ii) a joint sequence model for the translation and reordering probabilities which is more flexible than standard phrase-based MT. We observe statistically significant improvements in BLEU over Moses for German-to-English and Spanish-to-English tasks, and comparable results for a French-to-English task.
6 0.67940378 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
7 0.65392417 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
8 0.64353192 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
9 0.60142595 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
10 0.58603907 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
11 0.57607132 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
12 0.55798376 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
14 0.53775746 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
15 0.51271355 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
16 0.5124321 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
17 0.51088566 220 acl-2011-Minimum Bayes-risk System Combination
18 0.5007152 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
19 0.49720445 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
20 0.48095816 151 acl-2011-Hindi to Punjabi Machine Translation System
topicId topicWeight
[(5, 0.028), (17, 0.054), (26, 0.018), (31, 0.015), (37, 0.073), (39, 0.044), (41, 0.047), (55, 0.036), (59, 0.04), (72, 0.048), (88, 0.22), (91, 0.026), (96, 0.252)]
simIndex simValue paperId paperTitle
1 0.95092803 194 acl-2011-Language Use: What can it tell us?
Author: Marjorie Freedman ; Alex Baron ; Vasin Punyakanok ; Ralph Weischedel
Abstract: For 20 years, information extraction has focused on facts expressed in text. In contrast, this paper is a snapshot of research in progress on inferring properties and relationships among participants in dialogs, even though these properties/relationships need not be expressed as facts. For instance, can a machine detect that someone is attempting to persuade another to action or to change beliefs or is asserting their credibility? We report results on both English and Arabic discussion forums. 1
same-paper 2 0.86974823 264 acl-2011-Reordering Metrics for MT
Author: Alexandra Birch ; Miles Osborne
Abstract: One of the major challenges facing statistical machine translation is how to model differences in word order between languages. Although a great deal of research has focussed on this problem, progress is hampered by the lack of reliable metrics. Most current metrics are based on matching lexical items in the translation and the reference, and their ability to measure the quality of word order has not been demonstrated. This paper presents a novel metric, the LRscore, which explicitly measures the quality of word order by using permutation distance metrics. We show that the metric is more consistent with human judgements than other metrics, including the BLEU score. We also show that the LRscore can successfully be used as the objective function when training translation model parameters. Training with the LRscore leads to output which is preferred by humans. Moreover, the translations incur no penalty in terms of BLEU scores.
3 0.82692409 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Author: Ivan Titov
Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
4 0.8079983 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
Author: Chung-chi Huang ; Mei-hua Chen ; Shih-ting Huang ; Jason S. Chang
Abstract: We introduce a new method for learning to detect grammatical errors in learner’s writing and provide suggestions. The method involves parsing a reference corpus and inferring grammar patterns in the form of a sequence of content words, function words, and parts-of-speech (e.g., “play ~ role in Ving” and “look forward to Ving”). At runtime, the given passage submitted by the learner is matched using an extended Levenshtein algorithm against the set of pattern rules in order to detect errors and provide suggestions. We present a prototype implementation of the proposed method, EdIt, that can handle a broad range of errors. Promising results are illustrated with three common types of errors in nonnative writing. 1
5 0.78438711 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
Author: Lane Schwartz ; Chris Callison-Burch ; William Schuler ; Stephen Wu
Abstract: This paper describes a novel technique for incorporating syntactic knowledge into phrasebased machine translation through incremental syntactic parsing. Bottom-up and topdown parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporat- ing syntax into phrase-based translation. We give a formal definition of one such lineartime syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
6 0.78253162 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
7 0.78129637 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
8 0.78049695 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment
9 0.78044021 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
10 0.78039682 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
11 0.78031337 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
12 0.78019339 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
13 0.77996969 76 acl-2011-Comparative News Summarization Using Linear Programming
14 0.7794919 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
15 0.77929491 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
16 0.77888632 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
17 0.77838427 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages
18 0.77805322 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
19 0.77767366 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
20 0.77746189 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents