acl acl2012 acl2012-136 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. [sent-7, score-0.202]
2 It is simple to implement on top of existing single-objective optimization methods (e. [sent-12, score-0.142]
3 We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization. [sent-15, score-0.287]
4 1 Introduction Weight optimization is an important step in building machine translation (MT) systems. [sent-16, score-0.199]
5 Discriminative optimization methods such as MERT (Och, 2003), MIRA (Crammer et al. [sent-17, score-0.142]
6 These methods are effective because they tune the system to maximize an automatic evaluation metric such as BLEU, which serves as a surrogate objective for translation quality. [sent-19, score-0.184]
7 However, we know that a single metric such as BLEU is not enough. [sent-20, score-0.099]
8 Ideally, we want to tune towards an automatic metric that has perfect correlation with human judgments of translation quality. [sent-21, score-0.208]
9 ∗Now at Nara Institute of Science & Technology (NAIST). While many alternatives have been proposed, such a perfect evaluation metric remains elusive. [sent-22, score-0.099]
10 As a result, many MT evaluation campaigns now report multiple evaluation metrics (Callison-Burch et al. [sent-23, score-0.121]
11 Different evaluation metrics focus on different aspects of translation quality. [sent-25, score-0.158]
12 , 2006) allows arbitrary chunk movements, while permutation metrics like RIBES (Isozaki et al. [sent-29, score-0.101]
13 Arguably, all these metrics correspond to our intuitions on what is a good translation. [sent-35, score-0.101]
14 The current approach of optimizing MT towards a single metric runs the risk of sacrificing other metrics. [sent-36, score-0.165]
15 Our goal is to propose a multi-objective optimization method that avoids “overfitting to a single metric”. [sent-39, score-0.142]
16 In general, we cannot expect to improve multiple metrics jointly if there are some inherent tradeoffs. [sent-41, score-0.101]
17 A hypothesis is pareto-optimal if there exists no other hypothesis better in all metrics. [sent-45, score-0.132]
18 We show that PMO outperforms the alternative (single-objective optimization of the linearly-combined metrics) in multi-objective space, and especially obtains stronger results for metrics that may be difficult to tune individually. [sent-48, score-0.271]
19 1 Definitions and Concepts The idea of Pareto optimality comes originally from economics (Pareto, 1906), where the goal is to characterize situations when a change in allocation of goods does not make anybody worse off. [sent-53, score-0.138]
20 Here, we will explain it in terms of MT: Let h ∈ L be a hypothesis from an N-best list L. [sent-54, score-0.087]
21 Without loss of generality, we assume metric scores are bounded between 0 and 1, with 1 being perfect. [sent-57, score-0.099]
22 For two hypotheses h1, h2, we write M(h1) > M(h2) if h1 is better than h2 in all metrics, and M(h1) ≥ M(h2) if h1 is better than or equal to h2 in all metrics. [sent-64, score-0.087]
23 Ten hypotheses are plotted by their scores in two metrics. [sent-68, score-0.087]
24 The line shows the convex hull, which attains only a subset of pareto-optimal points. [sent-70, score-0.105]
25 The triangle (△) is a point that is weakly pareto-optimal but not pareto-optimal. [sent-71, score-0.091]
26 Pareto Optimal: A hypothesis h∗ ∈ L is pareto-optimal iff there does not exist another hypothesis h ∈ L such that M(h) ≥ M(h∗) and M(h) ≠ M(h∗). [sent-74, score-0.132]
27 In Figure 1, the hypotheses indicated by circle (o) are pareto-optimal, while those with plus (+) are not. [sent-75, score-0.109]
28 To visualize this, take for instance the pareto-optimal point (0. [sent-76, score-0.108]
29 Weakly Pareto Optimal: A hypothesis h∗ ∈ L is weakly pareto-optimal iff there is no other hypothesis h ∈ L such that M(h) > M(h∗). [sent-97, score-0.195]
30 A hypothesis is weakly pareto-optimal if there is no other hypothesis that improves all the metrics; a hypothesis is paretooptimal if there is no other hypothesis that improves at least one metric without detriment to other metrics. [sent-99, score-0.484]
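To make the dominance relations above concrete, here is a minimal Python sketch of the two definitions (our own illustration, not code from the paper; all function names are invented):

```python
def better_in_all(m1, m2):
    """M(h1) > M(h2): h1 strictly better than h2 in every metric."""
    return all(a > b for a, b in zip(m1, m2))

def dominates(m1, m2):
    """m1 Pareto-dominates m2: at least as good everywhere, strictly better somewhere."""
    return all(a >= b for a, b in zip(m1, m2)) and any(a > b for a, b in zip(m1, m2))

def is_weakly_pareto_optimal(m, candidates):
    """No candidate improves ALL metrics at once."""
    return not any(better_in_all(c, m) for c in candidates)

def is_pareto_optimal(m, candidates):
    """No candidate improves some metric without detriment to another."""
    return not any(dominates(c, m) for c in candidates)

# Toy example: (0.5, 0.8) dominates (0.5, 0.7), so (0.5, 0.7) is weakly
# pareto-optimal (nothing beats it in both metrics) but not pareto-optimal.
points = [(0.2, 0.9), (0.5, 0.8), (0.5, 0.7), (0.9, 0.3)]
print([p for p in points if is_pareto_optimal(p, points)])
# -> [(0.2, 0.9), (0.5, 0.8), (0.9, 0.3)]
```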
31 8) is weakly pareto-optimal but not pareto-optimal, because of the competing point (0. [sent-102, score-0.173]
32 The Pareto Frontier has two desirable properties from the multi-objective optimization perspective: 1. [sent-108, score-0.142]
33 optimizing towards points on the Frontier and away from those that are not, and giving no preference to different pareto-optimal hypotheses. [sent-114, score-0.11]
34 Multi-objective problems can be formulated as: argmax_w [M1(h); M2(h); . . . ; MK(h)] (1), where h = Decode(w, f). Here, the MT system's Decode function, parameterized by weight vector w, takes in a foreign sentence f and returns a translated hypothesis h. [sent-118, score-0.117]
35 The argmax operates in vector space and our goal is to find w leading to hypotheses on the Pareto Frontier. [sent-119, score-0.108]
36 ; MK(h)] with a linear combination: argmax_w Σ_{k=1}^{K} pk·Mk(h) (2), where h = Decode(w, f). Here, pk are positive real numbers indicating the relative importance of each metric (without loss of generality, assume Σ_k pk = 1). [sent-124, score-0.344]
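As a concrete illustration of the linear-combination baseline in Eq. 2, the following sketch (ours; the scores and weights are made up) scalarizes each hypothesis's metric vector with fixed weights pk and picks the maximizer:

```python
def linear_combination(metric_vector, p):
    """Eq. 2: collapse a K-dimensional metric vector into one scalar with weights p_k."""
    return sum(pk * mk for pk, mk in zip(p, metric_vector))

# Toy N-best list scored as (BLEU, RIBES), with weights p = (0.75, 0.25):
nbest_scores = [(0.30, 0.90), (0.55, 0.60), (0.40, 0.75)]
p = (0.75, 0.25)
best = max(nbest_scores, key=lambda m: linear_combination(m, p))
print(best)  # -> (0.55, 0.6), the maximizer of 0.75*BLEU + 0.25*RIBES
```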
37 Eq. 2 attains only pareto-optimal points that are on the convex hull. [sent-140, score-0.127]
38 This may have ramifications for issues like metric tunability and local optima. [sent-155, score-0.186]
39 Pareto optimality and multi-objective optimization form a deep field with active inquiry in engineering, operations research, economics, etc. [sent-159, score-0.142]
40 Given a dominant point, it is easy to filter out many points that are dominated by it. [sent-167, score-0.109]
41 After successive rounds, any remaining points that are not filtered out are pareto-optimal. (Footnote: We note that scalarization by the exponentiated combination Σ_k pk·Mk(h)^q, for a suitable q > 0, does satisfy necessary conditions for pareto optimality.) [sent-168, score-0.836]
42 In line 3, we take a point h∗ and check if it is dominating or dominated in the for-loop (lines 4-8). [sent-176, score-0.187]
43 The second loop (lines 9-11) further filters the list for points that are dominated by h∗ but were iterated before h∗ in the first for-loop. [sent-178, score-0.154]
44 The outer while-loop stops exactly after P iterations, where P is the actual number of pareto-optimal points in L. [sent-179, score-0.125]
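The successive-filtering idea behind FindParetoFrontier can be sketched as follows (a reconstruction under our own reading of the description, not the paper's exact pseudo-code): each round takes a candidate h∗, discards every point it dominates, and keeps h∗ unless some other point dominates it.

```python
def dominates(m1, m2):
    return all(a >= b for a, b in zip(m1, m2)) and any(a > b for a, b in zip(m1, m2))

def find_pareto_frontier(points):
    """Return the pareto-optimal subset of a list of K-dimensional metric vectors."""
    remaining = list(points)
    frontier = []
    while remaining:
        h_star = remaining.pop(0)
        h_star_dominated = False
        survivors = []
        for h in remaining:
            if dominates(h, h_star):        # h_star is dominated; it will be discarded
                h_star_dominated = True
                survivors.append(h)
            elif not dominates(h_star, h):  # keep h only if h_star does not dominate it
                survivors.append(h)
        if not h_star_dominated:
            frontier.append(h_star)
        remaining = survivors
    return frontier

print(find_pareto_frontier([(0.2, 0.9), (0.5, 0.8), (0.5, 0.7), (0.9, 0.3)]))
# -> [(0.2, 0.9), (0.5, 0.8), (0.9, 0.3)]
```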
45 2 PMO-PRO Algorithm We are now ready to present an algorithm for multi-objective optimization. [sent-187, score-0.113]
46 As we will see, it can be seen as a generalization of the pairwise ranking optimization (PRO) of Hopkins and May (2011), so we call it PMO-PRO. [sent-188, score-0.142]
47 All dominated points can be filtered in one pass by comparing with the most-recent dominating point. [sent-193, score-0.136]
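For the two-metric case used in the experiments, a one-pass filter of this kind reduces to a standard sort-and-sweep, sketched below (our own illustration): after sorting by the first metric in descending order, a point survives only if it beats the best second-metric value seen so far, i.e. it is compared against the most recent dominating point.

```python
def pareto_frontier_2d(points):
    """One-pass Pareto filter for K=2 after sorting by the first metric (descending)."""
    frontier, best_m2 = [], float("-inf")
    for p in sorted(points, key=lambda m: (-m[0], -m[1])):
        if p[1] > best_m2:      # not dominated by any point already kept
            frontier.append(p)
            best_m2 = p[1]
    return frontier

print(pareto_frontier_2d([(0.2, 0.9), (0.5, 0.8), (0.5, 0.7), (0.9, 0.3)]))
# -> [(0.9, 0.3), (0.5, 0.8), (0.2, 0.9)]
```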
48 The main difference is that rather than trying to maximize a single metric, we maximize the number of pareto points, in order to expand the Pareto Frontier. We will explain PMO-PRO in terms of the pseudo-code shown in Algorithm 2. [sent-195, score-0.79]
49 In line 6, we evaluate each hypothesis h with respect to the K metrics, giving a set of K-dimensional vectors {M(h)}. [sent-197, score-0.09]
50 In particular, first we call FindParetoFrontier (Algorithm 1), which returns a set of pareto hypotheses; pareto-optimal hypotheses will get label 1 while non-optimal hypotheses will get label 0. [sent-199, score-0.972]
51 We will follow PRO in using a pairwise classifier in line 10, which finds w∗ that separates hypotheses with labels 1 vs. 0. [sent-201, score-0.132]
52 In line 13 we evaluate each weight w on K metrics across the entire corpus and call FindParetoFrontier in line 14. [sent-206, score-0.242]
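The labeling step at the heart of PMO-PRO can be sketched as follows (our own illustration; the paper feeds the labels into a pairwise ranker such as SVMRank, which we do not reproduce here). It reuses the find_pareto_frontier helper from the sketch above, and the stub metrics are invented for demonstration:

```python
def pmo_labels(nbest, metric_fns):
    """Label a hypothesis 1 if its K-metric vector is pareto-optimal, else 0."""
    scored = [(h, tuple(m(h) for m in metric_fns)) for h in nbest]
    frontier = set(find_pareto_frontier([s for _, s in scored]))
    return [(h, 1 if s in frontier else 0) for h, s in scored]

# Toy usage with two stub scorers standing in for sentence-BLEU and RIBES:
hyps = ["a b c", "a b", "c"]
metrics = [lambda h: len(h) / 5.0, lambda h: float(h.count("a"))]
for h, label in pmo_labels(hyps, metrics):
    print(h, label)   # "a b c" gets 1; the two dominated hypotheses get 0
```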
53 3 Discussion Variants: In practice we find that a slight modification of line 8 in Algorithm 2 leads to more stable results. (Footnote: this is the same FindParetoFrontier algorithm as used in line 7.) [sent-210, score-0.116]
54 Both operate on sets of points in K-dimensional space, induced from either weights {w} or hypotheses {h}. [sent-211, score-0.154]
55 Algorithm 2 (Proposed PMO-PRO algorithm). Input: devset, max number of iterations I. Output: a set of (pareto-optimal) weight vectors. 1: Initialize w. [sent-212, score-0.104]
56 The PMO approach of Section 2.2 can be easily applied to other MT optimization techniques. [sent-223, score-0.142]
57 For example, by replacing the optimization subroutine (line 10, Algorithm 2) with a Powell search (Och, 2003), one can get PMO-MERT. [sent-224, score-0.171]
58 Virtually all MT optimization algorithms have a place where metric scores feed back into the optimization procedure; the idea of PMO is to replace these raw scores with labels derived from Pareto optimality. [sent-227, score-0.383]
59 We use sentence-BLEU for optimization but corpus-BLEU for evaluation here. [sent-231, score-0.142]
60 As metrics we use BLEU and RIBES (which demonstrated good human correlation in this language pair (Goto et al. [sent-233, score-0.125]
61 For each method, this generates 5x20=100 results, and we plot the Pareto Frontier of these points in a 2-dimensional metric space (e. [sent-264, score-0.187]
62 We report devset results here; testset trends are similar but not included due to space constraints. [sent-268, score-0.174]
63 edu/˜snover/tercom. An aside: For comparing optimization methods, we believe devset comparison is preferable to testset, since data mismatch may confound results. [sent-275, score-0.295]
64 If one worries about generalization, we advocate re-decoding the devset with the final weights and evaluating its 1-best output (which is done here). [sent-276, score-0.131]
65 This is preferable to simply reporting the achieved scores on the devset N-best (as done in some open-source scripts), since the learned weights may pick out good hypotheses in the N-best but perform poorly when re-decoding the same devset. [sent-277, score-0.24]
66 The re-decode devset approach avoids being overly optimistic while accurately measuring optimization performance. [sent-278, score-0.273]
67 Table 1: Task characteristics: #sentences in Train/Dev, # of features, and metrics used. [Table cells garbled in extraction; the legible fragments suggest the metrics columns list BLEU, RIBES, and NTER.] [sent-281, score-0.101]
68 We use SVMRank (Joachims, 2006) as the optimization subroutine for PRO, which efficiently handles all pairwise samples without the need for sampling. [sent-284, score-0.171]
69 The third observation relates to the issue of metric tunability (Liu et al. [sent-308, score-0.186]
70 [Figure 3: NIST Results] not to optimize it directly, but jointly with the more tunable metric BLEU. [sent-318, score-0.163]
71 The learning curve in Figure 4 shows that single-objective optimization of RIBES quickly falls into a local optimum (at iteration 3) whereas PMO can zigzag and sacrifice RIBES in intermediate iterations (e. [sent-319, score-0.232]
72 This finding suggests that multi-objective approaches may be preferred, especially when dealing with new metrics that may be difficult to tune. [sent-323, score-0.101]
73 While FindParetoFrontier scales quadratically with the size of the N-best list, Figure 5 shows that the runtime is trivial. [Figure 4: Learning Curve on RIBES, comparing single-objective optimization and PMO.] [sent-327, score-0.186]
74 [Figure 6: Average number of Pareto points] The number of pareto hypotheses gives a rough indication of the diversity of hypotheses that can be exploited by PMO. [sent-338, score-1.01]
75 Nevertheless, we note that tens of Pareto points is far too few compared to the large size of N-best lists used at later iterations of PMO-PRO. [sent-341, score-0.099]
76 Theoretically, the number will eventually level off as it gets increasingly harder to generate new Pareto points in a crowded space (Bentley et al. [sent-343, score-0.088]
77 Practical recommendation: We present the Pareto approach as a way to agnostically optimize multiple metrics jointly. [sent-345, score-0.174]
78 However, in practice, one may have intuitions about metric tradeoffs even if one cannot specify {pk}. [sent-346, score-0.121]
79 In this case, we recommend the following trick: Set up a multi-objective problem where one metric is BLEU and the other is 3/4 BLEU + 1/4 RIBES. [sent-348, score-0.099]
80 This encourages PMO to explore the joint metric space but avoid solutions that sacrifice too much BLEU, and should also outperform Linear Combination, which searches only along the (3/4, 1/4) direction. [sent-349, score-0.18]
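A sketch of how this recommendation could be wired up (our own illustration; bleu and ribes are placeholders for real sentence-level scorers passed in by the caller, not implemented here):

```python
def recommended_objectives(hyp, ref, bleu, ribes):
    """Metric 1 is BLEU itself; metric 2 is the fixed blend 3/4*BLEU + 1/4*RIBES."""
    b, r = bleu(hyp, ref), ribes(hyp, ref)
    return (b, 0.75 * b + 0.25 * r)

# Stub scorers just to demonstrate the call:
stub_bleu = lambda hyp, ref: 0.4
stub_ribes = lambda hyp, ref: 0.8
print(recommended_objectives("hyp", "ref", stub_bleu, stub_ribes))  # -> (0.4, 0.5)
```

The resulting 2-dimensional vectors feed directly into the Pareto machinery sketched earlier.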
81 5 Related Work Multi-objective optimization for MT is a relatively new area. [sent-350, score-0.142]
82 As far as we know, the only work that directly proposes a multi-objective technique is (He and Way, 2009), which modifies MERT to optimize a single metric subject to the constraint that it does not degrade others. [sent-353, score-0.143]
83 The tunability of metrics is a problem that is gaining recognition (Liu et al. [sent-357, score-0.188]
84 If a good evaluation metric could not be used for tuning, it would be a pity. [sent-359, score-0.099]
85 One unsolved question is whether metric tunability is a problem inherent to the metric only, or depends also on the underlying optimization algorithm. [sent-365, score-0.427]
86 Our positive results with PMO suggest that the choice of optimization algorithm can help. [sent-366, score-0.168]
87 , 2011) investigates joint optimization of a supervised parsing objective and some extrinsic objectives based on downstream applications. [sent-371, score-0.178]
88 Leveraging the diverse perspectives of different evaluation metrics has the potential to improve overall quality. [sent-377, score-0.101]
89 Based on Pareto Optimality, PMO is easy to implement and achieves better solutions compared to linear-combination baselines, for any setting of combination weights. [sent-378, score-0.095]
90 Further, we observe that multi-objective approaches can be helpful for optimizing difficult-to-tune metrics; this is beneficial for quickly introducing new metrics developed in MT evaluation into MT optimization, especially when good {pk} are not yet known. [sent-379, score-0.231]
91 Small N-best lists lead to sparsely-sampled Pareto Frontiers, and a much better approach would be to enlarge the hypothesis space using lattices (Macherey et al. [sent-382, score-0.087]
92 The binary distinction between pareto vs. non-pareto points ignores the fact that 2nd-place non-pareto points may also lead to good practical solutions. [sent-386, score-0.134]
93 A better approach may be to adopt a graded definition of Pareto optimality as done in some multi-objective works (Deb et al. [sent-387, score-0.115]
94 Opportunities: (1) There is still much we do not understand about metric tunability; we can learn much by looking at joint metric-spaces and examining how new metrics correlate with established ones. [sent-392, score-0.2]
95 Can we learn to jointly optimize cascaded systems, such as speech translation or pivot translation? [sent-397, score-0.101]
96 A re-examination of machine learning approaches for sentence-level MT evaluation. [sent-409, score-0.102]
97 The best lexical metric for phrase-based statistical MT system optimization. [sent-440, score-0.099]
98 METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. [sent-516, score-0.225]
99 Better evaluation metrics lead to better machine translation. [sent-532, score-0.101]
100 Measuring machine translation quality as semantic equivalence: A metric based on entailment features. [sent-572, score-0.156]
wordName wordTfidf (topN-words)
[('pareto', 0.769), ('pmo', 0.232), ('ribes', 0.164), ('frontier', 0.162), ('optimization', 0.142), ('devset', 0.131), ('optimality', 0.115), ('pk', 0.108), ('mt', 0.102), ('metrics', 0.101), ('metric', 0.099), ('bleu', 0.088), ('multiobjective', 0.087), ('pubmed', 0.087), ('tunability', 0.087), ('hypotheses', 0.087), ('pro', 0.075), ('mk', 0.068), ('points', 0.067), ('hypothesis', 0.066), ('weakly', 0.063), ('findparetofrontier', 0.058), ('paretooptimal', 0.058), ('translation', 0.057), ('agarwal', 0.046), ('line', 0.045), ('loop', 0.045), ('optimize', 0.044), ('linearcombination', 0.044), ('marler', 0.044), ('singleobjective', 0.044), ('optimizing', 0.043), ('dominated', 0.042), ('theorem', 0.039), ('designer', 0.038), ('convex', 0.038), ('objectives', 0.036), ('opportunities', 0.035), ('meteor', 0.034), ('decode', 0.033), ('iterations', 0.032), ('solutions', 0.031), ('mert', 0.03), ('agnostically', 0.029), ('albrecht', 0.029), ('aret', 0.029), ('argwmax', 0.029), ('bentley', 0.029), ('godfrey', 0.029), ('indp', 0.029), ('miettinen', 0.029), ('nter', 0.029), ('pkmk', 0.029), ('precn', 0.029), ('sacrifice', 0.029), ('sawaragi', 0.029), ('skyline', 0.029), ('subroutine', 0.029), ('curve', 0.029), ('nist', 0.028), ('tune', 0.028), ('point', 0.028), ('cer', 0.028), ('dominating', 0.027), ('scotland', 0.026), ('algorithm', 0.026), ('nelder', 0.025), ('gimnez', 0.025), ('arora', 0.025), ('ihn', 0.025), ('nyi', 0.025), ('rhe', 0.025), ('ter', 0.025), ('hopkins', 0.024), ('competing', 0.024), ('correlation', 0.024), ('vectors', 0.024), ('economics', 0.023), ('trick', 0.023), ('sacrificing', 0.023), ('weight', 0.022), ('optimizer', 0.022), ('goto', 0.022), ('dominates', 0.022), ('visualize', 0.022), ('circle', 0.022), ('owczarzak', 0.022), ('testset', 0.022), ('attains', 0.022), ('bp', 0.022), ('tradeoffs', 0.022), ('explain', 0.021), ('space', 0.021), ('combination', 0.02), ('edinburgh', 0.02), ('spitkovsky', 0.02), ('tunable', 0.02), ('pado', 0.02), ('campaigns', 0.02), ('recommendation', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 136 acl-2012-Learning to Translate with Multiple Objectives
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
2 0.10485422 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
Author: Xiaodong He ; Li Deng
Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In the IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.
3 0.090622105 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
4 0.076467901 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li
Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.
5 0.074496306 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
Author: Chang Liu ; Hwee Tou Ng
Abstract: In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis, Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
6 0.071981311 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
7 0.070346959 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
8 0.069919638 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
9 0.067760214 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
10 0.067465723 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
11 0.065951891 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
12 0.062794395 140 acl-2012-Machine Translation without Words through Substring Alignment
13 0.060058679 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation
14 0.0600572 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation
15 0.058152433 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
16 0.056921322 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
17 0.052923806 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
18 0.052670568 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation
19 0.052474756 163 acl-2012-Prediction of Learning Curves in Machine Translation
20 0.052295476 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
topicId topicWeight
[(0, -0.142), (1, -0.087), (2, 0.035), (3, 0.023), (4, 0.027), (5, -0.003), (6, -0.004), (7, 0.014), (8, -0.019), (9, 0.017), (10, -0.04), (11, -0.002), (12, -0.014), (13, 0.012), (14, 0.002), (15, 0.019), (16, 0.01), (17, 0.038), (18, -0.015), (19, -0.01), (20, 0.069), (21, -0.131), (22, 0.087), (23, -0.026), (24, 0.038), (25, -0.004), (26, 0.032), (27, 0.068), (28, 0.013), (29, 0.058), (30, -0.017), (31, -0.055), (32, 0.093), (33, -0.044), (34, 0.067), (35, -0.063), (36, -0.048), (37, -0.045), (38, 0.011), (39, -0.072), (40, 0.095), (41, 0.091), (42, 0.124), (43, 0.052), (44, 0.02), (45, 0.058), (46, -0.097), (47, 0.087), (48, -0.056), (49, 0.062)]
simIndex simValue paperId paperTitle
same-paper 1 0.91669303 136 acl-2012-Learning to Translate with Multiple Objectives
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
2 0.81747413 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
3 0.67611426 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
Author: Chang Liu ; Hwee Tou Ng
Abstract: In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis, Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.
4 0.63281053 163 acl-2012-Prediction of Learning Curves in Machine Translation
Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy
Abstract: Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios, 1) Monolingual samples in the source and target languages are available and 2) An additional small amount of parallel corpus is also available. We propose methods for predicting learning curves in both these scenarios.
5 0.61695528 34 acl-2012-Automatically Learning Measures of Child Language Development
Author: Sam Sahakian ; Benjamin Snyder
Abstract: We propose a new approach for the creation of child language development metrics. A set of linguistic features is computed on child speech samples and used as input in two age prediction experiments. In the first experiment, we learn a child-specific metric and predict the ages at which speech samples were produced. We then learn a more general developmental index by applying our method across children, predicting relative temporal orderings of speech samples. In both cases we compare our results with established measures of language development, showing improvements in age prediction performance.
6 0.59568173 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
7 0.53424472 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models
8 0.46894488 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
10 0.44793776 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
11 0.42658326 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
12 0.41844118 178 acl-2012-Sentence Simplification by Monolingual Machine Translation
13 0.40014553 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation
14 0.39434698 131 acl-2012-Learning Translation Consensus with Structured Label Propagation
15 0.37772676 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
16 0.37255904 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
17 0.36505929 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
18 0.34596282 186 acl-2012-Structuring E-Commerce Inventory
19 0.34541798 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
20 0.33932495 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT
topicId topicWeight
[(25, 0.019), (26, 0.034), (28, 0.057), (30, 0.022), (37, 0.037), (39, 0.038), (57, 0.024), (59, 0.015), (62, 0.252), (74, 0.059), (82, 0.024), (84, 0.029), (85, 0.061), (90, 0.088), (92, 0.052), (94, 0.045), (99, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.73447597 136 acl-2012-Learning to Translate with Multiple Objectives
Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata
Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.
Author: Weiwei Sun ; Xiaojun Wan
Abstract: We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. We empirically analyze the diversity between two representative corpora, i.e. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible. The analysis is further exploited to improve processing accuracy by (1) integrating systems that are respectively trained on heterogeneous annotations to reduce the approximation error, and (2) re-training models with high quality automatically converted data to reduce the estimation error. Evaluation on the CTB and PPD data shows that our novel model achieves a relative error reduction of 11% over the best reported result in the literature.
3 0.49596879 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
Author: Danilo Croce ; Alessandro Moschitti ; Roberto Basili ; Martha Palmer
Abstract: In this paper, we propose innovative representations for automatic classification of verbs according to mainstream linguistic theories, namely VerbNet and FrameNet. First, syntactic and semantic structures capturing essential lexical and syntactic properties of verbs are defined. Then, we design advanced similarity functions between such structures, i.e., semantic tree kernel functions, for exploiting distributional and grammatical information in Support Vector Machines. The extensive empirical analysis on VerbNet class and frame detection shows that our models capture meaningful syntactic/semantic structures, which allows for improving the state-of-the-art.
4 0.49406561 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.
5 0.49333885 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin
Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
6 0.49161267 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
8 0.48926651 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
9 0.48911682 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
10 0.48821807 140 acl-2012-Machine Translation without Words through Substring Alignment
11 0.48800081 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
12 0.48749676 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities
13 0.48682263 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
14 0.48646528 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation
15 0.48585206 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
17 0.48283601 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
18 0.48274669 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API
19 0.48173621 83 acl-2012-Error Mining on Dependency Trees
20 0.48099235 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition