acl acl2013 acl2013-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
Reference: text
sentIndex sentText sentNum sentScore
1 English→Russian MT evaluation campaign Pavel Braslavski Kontur Labs / Ural Federal University,Russia pbras @ yandex . [sent-1, score-0.309]
2 ru Alexander Beloborodov Ural Federal University Russia xander-beloborodov @ yandex . [sent-2, score-0.201]
3 ru Abstract This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. [sent-3, score-0.065]
4 The quality of generated translations was assessed using automatic metrics and human evaluation. [sent-4, score-0.11]
5 We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations. [sent-5, score-0.728]
6 1 Introduction Machine Translation (MT) between English and Russian was one of the first translation directions tested at the dawn of MT research in the 1950s (Hutchins, 2000). [sent-6, score-0.1]
7 Since then, MT paradigms have changed many times and many systems for this language pair have appeared (and disappeared), but as far as we know there has been no systematic quantitative evaluation of a range of systems, analogous to DARPA’94 (White et al. [sent-7, score-0.131]
8 The Workshop on Statistical MT (WMT) in 2013 has announced a Russian evaluation track for the first time. [sent-9, score-0.075]
9 However, this evaluation is currently ongoing; it should include new methods for building statistical MT (SMT) systems for Russian from the data provided in this track, but it will not cover the performance of existing systems, especially rule-based (RBMT) or hybrid ones. [sent-10, score-0.221]
10 Recently, there have been a number of MT shared tasks for combinations of several European languages. [sent-12, score-0.059]
11 One of the main challenges in developing MT systems for Russian and for evaluating them is the need to deal with its free word order and complex morphology. [sent-17, score-0.115]
12 Long-distance dependencies are common, and this creates problems for both RBMT and SMT systems (especially for phrasebased ones). [sent-18, score-0.056]
13 The language direction was chosen to be English→Russian, first because of the availability of native speakers for evaluation, second because the systems taking part in this evaluation are mostly used in translation of English texts for Russian readers. [sent-20, score-0.365]
14 2 Corpus preparation In designing the set of texts for evaluation, we had two issues in mind. [sent-21, score-0.112]
15 First, it is known that the domain and genre can influence MT performance (Langlais, 2002; Babych et al. [sent-22, score-0.074]
16 Second, we were aiming at using sources allowing distribution of texts under a Creative Commons licence. [sent-24, score-0.114]
17 The newswire texts were collected from the English Wikinews website. [sent-26, score-0.065]
18 4 The second genre was represented by ‘regulations’ (laws, contracts, rules, etc), which were collected from the Web using a genre classification method described in (Sharoff, 2010). [sent-27, score-0.148]
19 The method provided a sufficient accuracy (74%) for the initial selection of texts. [sent-28, score-0.065]
20 The initial corpus consists of 8,356 original English texts that make up 148,864 sentences. [sent-41, score-0.065]
21 We chose to retain the entire texts in the corpus rather than individual sentences, since some MT systems may use information beyond isolated sentences. [sent-42, score-0.121]
22 100,889 sentences originated from Wikinews; 47,975 sentences came from the ‘regulations’ corpus. [sent-43, score-0.241]
23 The first 1,002 sentences were published in advance to allow potential participants time to adjust their systems to the corpus format. [sent-44, score-0.23]
24 The remaining 147,862 sentences were the corpus for testing translation into Russian. [sent-45, score-0.16]
25 Two examples of texts in the corpus: 90237 Ambassadors from the United States of America, Australia and Britain have all met with Fijian military officers to seek insurances that there wasn’t going to be a coup. [sent-46, score-0.065]
26 102835 If you are given a discount for booking more than one person onto the same date and you later wish to transfer some of the delegates to another event, the fees will be recalculated and you will be asked to pay additional fees due as well as any administrative charge. [sent-47, score-0.16]
27 For automatic evaluation we randomly selected 947 ‘clean’ sentences, i. [sent-48, score-0.125]
28 759 sentences originated from the ‘news’ part of the corpus, the remaining 188 came from the ‘regulations’ part. [sent-52, score-0.181]
29 The sentences came from sources without published translations into Russian, so that some of the participating systems do not get unfair advantage by using them for training. [sent-53, score-0.336]
30 For manual evaluation, we randomly selected 330 sentences out of 947 used for automatic evaluation, specifically, 190 from the ‘news’ part and 140 from the ‘regulations’ part. [sent-55, score-0.11]
31 6 These resources are not related to the test corpus of the evaluation campaign. [sent-57, score-0.075]
32 to make it easier to participate in the shared task for teams without sufficient data for this language pair. [sent-63, score-0.081]
33 3 Evaluation methodology The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous. [sent-64, score-0.277]
34 This is different from simultaneous ranking of several MT outputs, as commonly used in WMT evaluation campaigns. [sent-66, score-0.169]
35 In case of a large number of participating systems each assessor ranks only a subset of MT outputs. [sent-67, score-0.311]
36 However, a fair overall ranking cannot be always derived from such partial rankings (CallisonBurch et al. [sent-68, score-0.174]
37 The pairwise comparisons we used can be directly converted into unambiguous overall rankings. [sent-70, score-0.271]
38 This task is also much simpler for human judges to complete. [sent-71, score-0.112]
39 On the other hand, pairwise comparisons require a larger number of evaluation decisions, which is feasible only for few participants (and we indeed had relatively few submissions in this campaign). [sent-72, score-0.46]
40 Below we also discuss how to reduce the amount of human efforts for evaluation. [sent-73, score-0.099]
41 In our case the assessors were asked to make a pairwise comparison of two sentences translated by two different MT systems against a gold standard translation. [sent-74, score-0.5]
42 The question for them was to judge translation adequacy, i. [sent-75, score-0.167]
43 , which MT output conveys information from the reference translation better. [sent-77, score-0.1]
44 The translator also had access to the entire text, while the assessors could only see a single sentence. [sent-79, score-0.237]
45 For human evaluation we employed the multifunctional TAUS DQF tool7 in the ‘Quick Comparison’ mode. [sent-80, score-0.135]
46 Assessors’ judgements resulted in rankings for each sentence in the test set. [sent-81, score-0.169]
47 when the ranks of the systems in positions 2-4 and 7-8 were tied, their ranks became: 1 3 3 3 5 6 7 . [sent-84, score-0.336]
48 To produce the final ranking, the sentence-level ranks were averaged over all sentences. [sent-87, score-0.19]
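A minimal sketch of this rank aggregation, assuming the per-sentence pairwise outcomes have already been reduced to one score (e.g. number of wins) per system; the tie convention here (tied systems share the average of their positions) is one common choice consistent with the averaged ranks described in the text, and all names are illustrative rather than taken from the campaign's tooling.

```python
from collections import defaultdict

def ranks_with_ties(scores):
    """Assign ranks (1 = best) so that systems with equal scores share
    the average of the positions they occupy."""
    order = sorted(scores, key=scores.get, reverse=True)   # higher score = better
    ranks, i = {}, 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0                     # average of the tied positions
        for system in order[i:j + 1]:
            ranks[system] = shared
        i = j + 1
    return ranks

def final_ranking(per_sentence_scores):
    """per_sentence_scores: one {system: score} dict per evaluated sentence."""
    totals = defaultdict(float)
    for scores in per_sentence_scores:
        for system, rank in ranks_with_ties(scores).items():
            totals[system] += rank
    n = len(per_sentence_scores)
    return sorted((totals[s] / n, s) for s in totals)      # lowest average rank = best
```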
49 Pairwise comparisons are time-consuming: n systems require n(n−1)/2 comparisons per sentence. [sent-88, score-0.163]
50 In this study we also simulated a ‘human-assisted’ insertion sort algorithm and its variant with binary search. [sent-115, score-0.257]
51 The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required. [sent-116, score-0.255]
52 This assumes that human perception of quality is transitive: if we know that A < B and B < C, we can spare evaluation of A and C. [sent-117, score-0.135]
53 This approach also implies that sentence pairs to judge are generated and presented to assessors on the fly; each decision contributes to selection of the pairs to be judged in the next step. [sent-118, score-0.304]
54 If the systems are pre-sorted in a reasonable way (e. [sent-119, score-0.056]
55 by an MT metric, under assumption that automatic pre-ranking is closer to the ‘ideal’ ranking than a random one), then we can potentially save even more pairwise comparison operations. [sent-121, score-0.291]
56 Presorting makes ranking somewhat biased in favour of the order established by an MT metric. [sent-122, score-0.129]
57 For example, if it favours one system against another, while in human judgement they are equal, the final ranking will preserve the initial order. [sent-123, score-0.154]
58 Insertion sort of n sentences requires n − 1 comparisons in the best case of already sorted data and n(n−1)/2 in the worst case (reversely ordered data). [sent-124, score-0.312]
59 Insertion sort with binary search requires ∼ n log n comparisons regardless of the initial order. [sent-125, score-0.128]
60 For this study we ran exhaustive pairwise evaluation and used its results to simulate human-assisted sorting. [sent-126, score-0.322]
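The sketch below illustrates the human-assisted sorting idea: a binary insertion sort in which every comparison is delegated to a callback that could prompt a human judge. The function and variable names are assumptions for illustration, not the tool actually used in the campaign.

```python
def human_assisted_sort(items, prefer):
    """Order `items` from best to worst, calling prefer(a, b) -> True/False
    (a human judgement: is a better than b?) only when a comparison is needed.
    Binary search for the insertion point keeps the number of judgements
    close to n*log2(n) regardless of the initial order."""
    ranked = []
    for item in items:                      # items can be pre-sorted by an MT metric
        lo, hi = 0, len(ranked)
        while lo < hi:                      # binary search over already ranked items
            mid = (lo + hi) // 2
            if prefer(item, ranked[mid]):   # one human judgement per loop iteration
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, item)
    return ranked                           # best first
```

In the simulation described in the text, `prefer` would simply look up the outcome of the already collected exhaustive comparisons instead of querying a judge on the fly.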
61 In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al. [sent-127, score-0.11]
62 We also wanted to estimate the correlations of these metrics with human judgements for the English→Russian pair on the corpus level and on the level of individual sentences. [sent-131, score-0.145]
63 4 Results We received results from five teams; two teams submitted two runs each, which totals seven participants’ runs (referred to as P1. [sent-132, score-0.289]
64 The evaluation runs also included the translations of the 947 test sentences produced by four free online systems in their default modes (referred to as OS 1. [sent-137, score-0.397]
65 For 11 runs automatic evaluation measures were calculated; eight runs underwent manual evaluation (four online systems plus four participants’ runs; no manual evaluation was done by agreement with the participants for the runs P3, P6, and P7 to reduce the workload). [sent-140, score-0.871]
66 P1 is a hybrid system with analysis and generation driven by statistical evaluation of hypotheses. [sent-142, score-0.165]
67 Table 1 gives the automatic scores for each of the participating runs and four online systems. [sent-151, score-0.269]
68 14 assessors were recruited for evaluation (participating team members and volunteers); the total volume of evaluation is 10,920 pairwise sentence comparisons. [sent-153, score-0.534]
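The reported total is consistent with the setup: 8 systems give 28 pairwise comparisons per sentence, 330 sentences were judged once, and (as described in the agreement discussion further below) 60 of them were judged by a second annotator. A quick check, with that decomposition taken as an assumption:

```python
systems = 8
pairs_per_sentence = systems * (systems - 1) // 2   # 28 pairwise comparisons per sentence

once = 330 * pairs_per_sentence                      # every selected sentence judged once
twice = 60 * pairs_per_sentence                      # 60 sentences judged by a second assessor
print(once + twice)                                  # 10920 pairwise comparisons in total
```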
69 Table 2 presents the rankings of the participating systems using averaged ranks from the human evaluation. [sent-154, score-0.501]
70 05) in the overall ranks within the following groups: (OS1, OS3, P1) < (OS2, OS4) < P5 < (P2, P4). [sent-156, score-0.14]
71 OS3 (mostly RBMT) belongs to the troika of leaders in human evaluation contrary to the results of its automatic scores (Table 1). [sent-157, score-0.185]
72 To investigate applicability of the automatic measures to the English-Russian language direction, we computed Spearman’s ρ correlation between the ranks given by the evaluators and by the respective measures. [sent-163, score-0.292]
73 All measures exhibit reasonable correlation on the corpus level (330 sentences), but the sentence-level results are less impressive. [sent-166, score-0.102]
74 While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness on the sentence level make TER much less reliable for Russian. [sent-167, score-0.199]
75 METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements. [sent-168, score-0.121]
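As an illustration of the corpus-level computation, Spearman's ρ only needs the two rank orderings of the runs; the numbers below are invented, and scipy.stats.spearmanr is just one convenient implementation, not necessarily the one used in the campaign.

```python
from scipy.stats import spearmanr

# Hypothetical average human ranks (lower = better) and metric scores (higher = better)
human_ranks = [1.8, 2.4, 2.6, 3.9, 4.2, 5.5, 6.1, 7.0]
metric_scores = [0.31, 0.30, 0.24, 0.27, 0.22, 0.19, 0.18, 0.15]

# Negate the metric so that both vectors increase from best to worst
rho, p_value = spearmanr(human_ranks, [-s for s in metric_scores])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```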
76 The lower part of Table 2 also reports the results of simulated dynamic ranking (using the NIST rankings as the initial order for the sort operation). [sent-169, score-0.343]
77 It resulted in a slightly different final ranking of the systems since we did not account for ties and ‘averaged ranks’ . [sent-170, score-0.236]
78 However, the ranking is practically the same up to the statistically significant rank differences in reference ranking (see above). [sent-171, score-0.188]
79 The advantage is that it requires a significantly lower number of pairwise comparisons. [sent-172, score-0.147]
80 5 per sentence; 56% of exhaustive comparisons for 330 sentences and 8 systems); binary insertion sort yielded 4,327 comparisons (13. [sent-174, score-0.581]
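These savings can be related to the exhaustive baseline in a couple of lines; the 4,327 figure is the one reported in the text, everything else follows from 330 sentences and 8 systems.

```python
sentences, systems = 330, 8
exhaustive = sentences * systems * (systems - 1) // 2   # 9240 pairwise comparisons

binary_insertion = 4327                                  # reported above
print(binary_insertion / exhaustive)                     # ~0.47 of the exhaustive workload
print(binary_insertion / sentences)                      # ~13.1 comparisons per sentence
```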
81 Out of the original set of 330 sentences for human evaluation, 60 sentences were evaluated by two annotators (which resulted in 60*28=1680 pairwise comparisons), so we were able to calculate the standard Cohen’s κ and Krippendorff’s α scores (Artstein and Poesio, 2008). [sent-176, score-0.376]
82 48, which is similar to sentence ranking reported in other evaluation campaigns (Callison-Burch et al. [sent-180, score-0.216]
83 It was interesting to see the agreement results distinguishing the top three systems against the rest, i. [sent-183, score-0.095]
84 53, which indicates that the judges agree on the difference in quality between the top three systems and the rest. [sent-186, score-0.108]
85 On the other hand, the agreement results within the top three systems are low: κ = 0. [sent-187, score-0.095]
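A minimal sketch of the agreement computation, assuming the doubly-judged comparisons are available as two parallel lists of preferences (1 = first system better, 2 = second better, 0 = tie); sklearn's cohen_kappa_score is used here only for convenience, and Krippendorff's α would require a separate implementation.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative preferences of two judges over the same comparisons
judge_a = [1, 2, 0, 1, 1, 2, 0, 2]
judge_b = [1, 2, 1, 1, 0, 2, 0, 2]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa = {kappa:.2f}")
```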
86 5 Conclusions and future plans This was the first attempt at making proper quantitative and qualitative evaluation of the English→Russian MT systems. [sent-191, score-0.075]
87 In the future editions, we will be aiming at developing a new test corpus with a wider genre palette. [sent-192, score-0.123]
88 We will probably complement the campaign with Russian→English translation direction. [sent-193, score-0.198]
89 We will also address the problem of tailoring automatic evaluation measures to Russian accounting for complex morphology and free word order. [sent-195, score-0.225]
90 To this end we will re-use human evaluation data gathered within the 2013 campaign. [sent-196, score-0.135]
91 While the campaign was based exclusively on data in one language direction, the correlation results for automatic MT quality measures should be applicable to other languages with free word order and complex morphology. [sent-197, score-0.309]
92 We have made the corpus comprising the source sentences, their human translations, translations by participating MT systems and the human evaluation data publicly available. [sent-198, score-0.409]
93 Acknowledgements We would like to thank the translators, assessors, as well as Anna Tsygankova, Maxim Gubin, and Marina Nekrestyanova for project coordination and organisational help. [sent-200, score-0.065]
94 METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. [sent-215, score-0.246]
95 An evaluation of statistical post-editing systems applied to RBMT and SMT systems. [sent-219, score-0.131]
96 Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. [sent-236, score-0.175]
97 Improving a general-purpose statistical translation engine by terminological lexicons. [sent-258, score-0.1]
98 BLEU: a method for automatic evaluation of machine translation. [sent-271, score-0.125]
99 Exploring different human judgments with a tunable MT metric. [sent-282, score-0.06]
100 The ARPA MT evaluation methodologies: Evolution, lessons, and further approaches. [sent-292, score-0.075]
wordName wordTfidf (topN-words)
[('rbmt', 0.347), ('russian', 0.313), ('assessors', 0.237), ('mt', 0.206), ('regulations', 0.193), ('pairwise', 0.147), ('ranks', 0.14), ('yandex', 0.136), ('sort', 0.128), ('sharoff', 0.126), ('comparisons', 0.124), ('serge', 0.119), ('gtm', 0.116), ('participating', 0.115), ('participants', 0.114), ('runs', 0.104), ('translation', 0.1), ('campaign', 0.098), ('smt', 0.094), ('ranking', 0.094), ('hybrid', 0.09), ('insertion', 0.088), ('teams', 0.081), ('rankings', 0.08), ('meteor', 0.078), ('hutchins', 0.077), ('leeds', 0.077), ('romip', 0.077), ('taus', 0.077), ('wikinews', 0.077), ('evaluation', 0.075), ('genre', 0.074), ('abbyy', 0.068), ('char', 0.068), ('judge', 0.067), ('texts', 0.065), ('ru', 0.065), ('babych', 0.063), ('fees', 0.063), ('came', 0.062), ('correlation', 0.061), ('human', 0.06), ('sentences', 0.06), ('ural', 0.059), ('maxim', 0.059), ('originated', 0.059), ('free', 0.059), ('exhaustive', 0.057), ('federal', 0.056), ('systems', 0.056), ('marina', 0.054), ('translators', 0.054), ('labs', 0.052), ('judges', 0.052), ('genres', 0.052), ('ter', 0.05), ('averaged', 0.05), ('automatic', 0.05), ('resulted', 0.049), ('aiming', 0.049), ('preparation', 0.047), ('turian', 0.047), ('campaigns', 0.047), ('artstein', 0.046), ('os', 0.046), ('adequacy', 0.046), ('wanted', 0.045), ('snover', 0.044), ('callisonburch', 0.044), ('monz', 0.043), ('simulate', 0.043), ('translations', 0.043), ('moses', 0.042), ('simulated', 0.041), ('measures', 0.041), ('banerjee', 0.041), ('judgements', 0.04), ('sparseness', 0.04), ('christof', 0.04), ('wmt', 0.04), ('professional', 0.039), ('au', 0.039), ('efforts', 0.039), ('agreement', 0.039), ('white', 0.038), ('ties', 0.037), ('iwslt', 0.037), ('pe', 0.036), ('direction', 0.035), ('bleu', 0.035), ('established', 0.035), ('booking', 0.034), ('organisers', 0.034), ('trimmed', 0.034), ('bre', 0.034), ('jungle', 0.034), ('underwent', 0.034), ('memoirs', 0.034), ('assisting', 0.034), ('aus', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
2 0.17548069 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu
Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue that by preserving the meaning of the translations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent of the translation model paradigm, so any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.
3 0.13191749 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Author: Guillaume Wisniewski
Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decisions based on a single rating of each example.
4 0.12698819 250 acl-2013-Models of Translation Competitions
Author: Mark Hopkins ; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).
5 0.12263224 255 acl-2013-Name-aware Machine Translation
Author: Haibo Li ; Jing Zheng ; Heng Ji ; Qi Li ; Wen Wang
Abstract: We propose a Name-aware Machine Translation (MT) approach which can tightly integrate name processing into the MT model, by jointly annotating parallel corpora, extracting name-aware translation grammar and rules, adding a name phrase table and name translation driven decoding. Additionally, we also propose a new MT metric to appropriately evaluate the translation quality of informative words, by assigning different weights to different words according to their importance values in a document. Experiments on Chinese-English translation demonstrated the effectiveness of our approach on enhancing the quality of overall translation, name translation and word alignment over a high-quality MT baseline.
6 0.11331266 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
7 0.096639737 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
8 0.093142152 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
10 0.089493498 289 acl-2013-QuEst - A translation quality estimation framework
11 0.087770097 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
12 0.082523182 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
13 0.08105436 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
14 0.078819729 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
15 0.077595875 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
16 0.07722453 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
17 0.076862276 312 acl-2013-Semantic Parsing as Machine Translation
18 0.076317571 253 acl-2013-Multilingual Affect Polarity and Valence Prediction in Metaphor-Rich Texts
19 0.075114638 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
20 0.073613882 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
topicId topicWeight
[(0, 0.188), (1, -0.052), (2, 0.146), (3, 0.014), (4, -0.007), (5, -0.01), (6, 0.024), (7, 0.013), (8, 0.057), (9, 0.058), (10, -0.053), (11, 0.083), (12, -0.102), (13, 0.038), (14, -0.068), (15, 0.029), (16, -0.054), (17, -0.04), (18, 0.007), (19, 0.026), (20, 0.079), (21, -0.01), (22, -0.084), (23, -0.034), (24, -0.073), (25, 0.041), (26, -0.084), (27, 0.085), (28, 0.045), (29, 0.039), (30, -0.054), (31, -0.132), (32, 0.048), (33, -0.02), (34, 0.014), (35, -0.083), (36, 0.017), (37, 0.095), (38, -0.054), (39, -0.029), (40, -0.028), (41, 0.039), (42, -0.015), (43, -0.092), (44, -0.021), (45, -0.044), (46, 0.045), (47, -0.025), (48, 0.032), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.95019102 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
2 0.79888147 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Author: Guillaume Wisniewski
Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decisions based on a single rating of each example.
3 0.75512826 250 acl-2013-Models of Translation Competitions
Author: Mark Hopkins ; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT). 1 The WMT Translation Competition Every year, the Workshop on Machine Transla- , tion (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. For each track, the organizers also assemble a panel of judges, typically machine translation specialists.1 The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example2 where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas.” Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best? 1Although in recent competitions, some ofthejudging has also been crowdsourced (Callison-Burch et al., 2010). 2The example does not use actual system output. jmay} @ sdl . com Table21r:a(451tniekW)MsTuycbejskhmtdeiunltmics“Hp r“eHt derfa eongris densolacstneogi tnsog.”bto. y”asking judges to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French “Il ne va pas.” ble 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system. 2 A Ranking Problem For several years, WMT used the following heuristic for ranking the translation systems: ORIGWMT(s) =win(sw)in +(s ti)e( +s t)ie +(s lo)ss(s) For system s, win (s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred. Recently, (Bojar et al., 2011) questioned the adequacy of this heuristic through the following ar1416 Proce dingsS o f ita h,e B 5u1lgsta Arinan,u Aaulg Musete 4ti-n9g 2 o0f1 t3h.e ? Ac s2s0o1ci3a Atiosnso fcoirat Cio nm foprut Caotimonpaulta Lti nognuails Lti cnsg,u piasgteics 1416–1424, gument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C. 
This vulnerability prompted (Bojar et al., 2011) to offer the following revision: BOJAR(s) = win(s) / (win(s) + loss(s)). The following year, it was BOJAR's turn to be criticized, this time by (Lopez, 2012): Superficially, this appears to be an improvement... couldn't a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn't a system be rewarded simply by being compared against a bad system more frequently than its competitors? Lopez's concern, while reasonable, is less obviously damning than (Bojar et al., 2011)'s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver. The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez's ranking scheme and provided a variant called Most Probable Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than another? Or is this even the right question to ask? 3 A Problem with Rankings Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We'll call these close1, close2, and close3. Nevertheless, close1 is very slightly better3 than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking? (1) terrific, close3, close1, close2, or (2) close1, terrific, close2, close3? Spearman's rho4 would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman's rho. The problem is the disconnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information they lack, and define more rigorously what we hope to learn from a translation competition. 4 From Rankings to Relative Ability Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all translation systems. Hereafter, we will refer to S as the space of students. We choose this term to evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by "ability"? Before formally defining this term, first recognize that it means little without context, namely: 3What does "better" mean? We'll return to this question.
4Or Pearson’s correlation coefficient. 1417 1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents otherwise the question is underspecified. – 2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (CallisonBurch et al., 2012) suggests otherwise. It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Posteditors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means. With this in mind, let’s define some additional elements of a translation competition. Let X be the space osf o afll a possible segments toitfi source text, J h bee tshpea space lolf p paolls possible judges, fa snodu rΠc = {0, 1, 2} bthee tshpea space ol fp pairwise d pgreesf,e arenndc Πes=. 5 W0,e1 assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variaSb,l ev j represents a judge af sroemgm J, ta fnrod mva Xria,b vlea π represents a preference fero fmro mΠ. J Moreover, adbelfein πe the negation ˆπ of preference π such that ˆπ = 2 (if π = 1), ˆπ = 1(if π = 2), and ˆπ = 0 (if π = 0). Now assume a joint distribution P(s1, s2, x, j,π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words: P(s1, s2, x, j,π) = P(π|s1, s2, x,j) · P(x|s1, s2, j) = ·P(j|s1,s2) · P(s1,s2) P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2) = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x,j) X(x) 5As a reminder, 0 indicates no preference. It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity: P(π|s1, s2) = X XPX(x) · PJ(j) · P(π|s1,s2,x,j) Xx∈X Xj∈J We refer to this as the hPX, PJi-relative ability of Wstued reenftesr s1 hanisd a s2. By using d-rifeflearteinvet marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). With this machinery, we can express the purpose of a translation competition more clearly: to estimate the hPX, PJi-relative ability of a set toof eststuidmenattes. Ien h Pthe case orefl WMT, PJ presumably6 defines a space of competent source-totarget bilingual speakers, while PX defines a space of newswire documents. We’ll refer to an estimate of P(π|s1 , s2) as a preference rm toode anl. Istni moattheer o words, a prefer- ence model is a distribution Q(π|s1 , s2). 
Given a set of pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1, s2) such that Q is "close" to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 1951), but we cannot use it here because P is unknown. Fortunately, if we have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this heldout test data. Let D be a sequence of triples ⟨s1, s2, π⟩ where the preferences π are i.i.d. samples from P(π|s1, s2). The perplexity of preference model Q on test data D is: perplexity(Q|D) = 2^(−Σ_{⟨s1,s2,π⟩∈D} (1/|D|) log2 Q(π|s1, s2)). How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. 6One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community. Let C be the set of comparisons ⟨s1, s2, x, j, π⟩ obtained from a translation competition. Competition data C is not necessarily7 sampled i.i.d. from P(s1, s2, x, j, π) because we may intentionally8 bias data collection towards certain students, judges or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons. To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C′ of comparisons whose source text appears in at most k comparisons, where k is the smallest positive integer such that |C′| >= 2000. We then create the test set D from C′: D = {⟨s1, s2, π⟩ | ⟨s1, s2, x, j, π⟩ ∈ C′}. We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track. Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative9 answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model. 5 Baselines Let's begin then, and create some simple preference models to serve as baselines. 5.1 Uniform The simplest preference model is a uniform distribution over preferences, for any choice of students s1, s2: Q(π|s1, s2) = 1/3 ∀π ∈ Π. This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e., equal to the number of possible preferences). 5.2 Adjusted Uniform Now suppose we have a set C of comparisons available for training. Let Cπ ⊆ C denote the subset of comparisons with preference π, and let C(s1, s2) denote the subset comparing students s1 and s2. 7In WMT, it certainly is not. 8To collect judge agreement statistics, for instance. 9As opposed to philosophical. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e. preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences: Q(π|s1, s2) = |C0|/|C| if π = 0, and (1 − |C0|/|C|)/2 otherwise.
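To make the evaluation protocol concrete, here is a small Python sketch (ours, not the authors' code; the variable names and the toy training/test triples are illustrative) of the perplexity computation and the two baselines just described:

```python
import math

def perplexity(model, test_data):
    """perplexity(Q|D) = 2 ** ( -(1/|D|) * sum of log2 Q(pi | s1, s2) over D )."""
    log_sum = sum(math.log2(model(pi, s1, s2)) for s1, s2, pi in test_data)
    return 2 ** (-log_sum / len(test_data))

def uniform_model(pi, s1, s2):
    # Section 5.1: every preference gets probability 1/3.
    return 1.0 / 3.0

def make_adjusted_uniform(train):
    # Section 5.2: estimate the tie rate |C_0|/|C| and split the rest evenly.
    tie_rate = sum(1 for _, _, pi in train if pi == 0) / len(train)
    def model(pi, s1, s2):
        return tie_rate if pi == 0 else (1.0 - tie_rate) / 2.0
    return model

# train/test are lists of (s1, s2, preference) with preference in {0, 1, 2}.
train = [("sysA", "sysB", 0), ("sysA", "sysB", 1), ("sysB", "sysC", 2),
         ("sysA", "sysC", 1), ("sysB", "sysC", 0), ("sysA", "sysC", 0)]
test = [("sysA", "sysB", 1), ("sysB", "sysC", 0)]
print(perplexity(uniform_model, test))               # always 3.0
print(perplexity(make_adjusted_uniform(train), test))
```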
6 Simple Bayesian Models 6.1 Independent Pairs Another simple model is the direct estimation of each relative ability P(π|s1, s2) independently. In other words, for each pair of students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be: Q(π|s1, s2) = (|Cπ(s1, s2)| + |Cˆπ(s2, s1)|) / (|C(s1, s2)| + |C(s2, s1)|). However the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate: Q(π|s1, s2) = (α + |Cπ(s1, s2)| + |Cˆπ(s2, s1)|) / (3α + |C(s1, s2)| + |C(s2, s1)|). We call this the Independent Pairs preference model. 6.2 Independent Students The Independent Pairs model makes a strong independence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1, s2) of students s1 and s2, we could instead try to estimate the universal ability P(π|s1) = Σ_{s2∈S} P(π|s1, s2) · P(s2|s1) of each individual student s1 and then try to reconstruct the relative abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each preference distribution, which gives us the following Bayesian estimate: Q(π|s1) = (α + Σ_{s2∈S} [|Cπ(s1, s2)| + |Cˆπ(s2, s1)|]) / (3α + Σ_{s2∈S} [|C(s1, s2)| + |C(s2, s1)|]). The estimates Q(π|s1) do not yet constitute a preference model. A downside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2: Q(π|s1, s2) = Q(π|s1). The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities: Q(π|s1, s2) = (Q(π|s1) + Q(ˆπ|s2)) / 2 and Q(π|s1, s2) = [Q(π|s1) · Q(ˆπ|s2)]^(1/2), respectively. We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and the BOJAR ranking heuristic. These three models are our attempt to render the BOJAR heuristic as preference models.
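A compact sketch of the two Bayesian estimators above (again ours, with illustrative names; α = 1.0 here only mirrors the Dirichlet strength reported for the baselines later in the text, and the geometric reconstruction is left unnormalized, exactly as in the formula):

```python
from collections import defaultdict

ALPHA = 1.0  # symmetric Dirichlet strength (assumed value)

def negate(pi):
    # Negation of a preference: 1 <-> 2, ties stay ties.
    return {0: 0, 1: 2, 2: 1}[pi]

def make_independent_pairs(train, alpha=ALPHA):
    """Q(pi|s1,s2) = (alpha + |C_pi(s1,s2)| + |C_negpi(s2,s1)|)
                     / (3*alpha + |C(s1,s2)| + |C(s2,s1)|)."""
    pref_counts, pair_counts = defaultdict(float), defaultdict(float)
    for s1, s2, pi in train:
        pref_counts[(s1, s2, pi)] += 1
        pair_counts[(s1, s2)] += 1
    def model(pi, s1, s2):
        num = alpha + pref_counts[(s1, s2, pi)] + pref_counts[(s2, s1, negate(pi))]
        den = 3 * alpha + pair_counts[(s1, s2)] + pair_counts[(s2, s1)]
        return num / den
    return model

def make_independent_students(train, alpha=ALPHA, reconstruction="geometric"):
    """Universal abilities Q(pi|s1), recombined via one of the three
    ad-hoc reconstructions (asymmetric / arithmetic / geometric)."""
    pref_counts, totals = defaultdict(float), defaultdict(float)
    for s1, s2, pi in train:
        pref_counts[(s1, pi)] += 1
        pref_counts[(s2, negate(pi))] += 1
        totals[s1] += 1
        totals[s2] += 1
    def universal(pi, s):
        return (alpha + pref_counts[(s, pi)]) / (3 * alpha + totals[s])
    def model(pi, s1, s2):
        if reconstruction == "asymmetric":
            return universal(pi, s1)
        if reconstruction == "arithmetic":
            return (universal(pi, s1) + universal(negate(pi), s2)) / 2
        return (universal(pi, s1) * universal(negate(pi), s2)) ** 0.5
    return model

# Tiny usage example; these models plug into the perplexity harness sketched earlier.
train = [("sysA", "sysB", 1), ("sysA", "sysB", 0), ("sysB", "sysC", 2)]
print(make_independent_pairs(train)(1, "sysA", "sysB"))
print(make_independent_students(train)(1, "sysA", "sysB"))
```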
7 Item-Response Theoretic (IRT) Models Let's revisit (Lopez, 2012)'s objection to the BOJAR ranking heuristic: "...couldn't a system still be penalized simply by being compared to [good systems] more frequently than its competitors?" The official WMT 2012 findings (Callison-Burch et al., 2012) echo this concern in justifying the exclusion of reference translations from the 2012 competition: [W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often. Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012's decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on. We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g. who a student has been paired with) to influence the estimation of the student abilities. Figure 1 (left) shows plate notation (Koller and Friedman, 2009) for the model's independence structure. First, each student's ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student's ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations. We investigate two parameterizations of this model: Gaussian and categorical. Figure 1 (right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an above-average student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14's ability distribution), while item 205 is not student 6's best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1 pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item's actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision radius (which here is 0.5), then he states no preference (i.e. a preference of 0). Otherwise he prefers the item with the higher observed quality. Figure 1: Plate notation (left) showing the independence structure of the IRT models. Example instantiated subnetwork (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons. The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. The student ability distributions are categorical distributions over {1, 2, ..., Λ}, and the student ability prior is a symmetric Dirichlet with strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ − λ}. Noise ν is drawn from a discretized zero-mean Gaussian with standard deviation σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs).
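To make the generative story concrete, here is a small forward-simulation sketch of the Gaussian parameterization (our own illustration, not the authors' code; the hyperparameter values are the ones reported later in footnote 13). It only samples data from the model; the paper's actual parameter fitting is done with Gibbs sampling, which is not shown here:

```python
import random

SIGMA_0, SIGMA_A, SIGMA_OBS, RADIUS = 1.0, 0.5, 1.0, 0.4

def sample_comparison(mean1, mean2, rng):
    """Forward-sample one judgment under the Gaussian IRT story."""
    item1 = rng.gauss(mean1, SIGMA_A)         # quality of student 1's item
    item2 = rng.gauss(mean2, SIGMA_A)         # quality of student 2's item
    obs1 = item1 + rng.gauss(0.0, SIGMA_OBS)  # judge observes each quality noisily
    obs2 = item2 + rng.gauss(0.0, SIGMA_OBS)
    if abs(obs1 - obs2) < RADIUS:
        return 0                              # within the decision radius: no preference
    return 1 if obs1 > obs2 else 2

rng = random.Random(0)
# Student ability means drawn from the zero-mean prior with std dev SIGMA_0.
ability = {s: rng.gauss(0.0, SIGMA_0) for s in ["sysA", "sysB", "sysC"]}
prefs = [sample_comparison(ability["sysA"], ability["sysB"], rng) for _ in range(1000)]
print(ability["sysA"], ability["sysB"], prefs.count(1) / len(prefs))
```

Even this toy simulation shows how judge noise and the decision radius blur the gap between two students' true abilities.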
We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently10 for both parameterizations. Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query: Pr(comp.c′.pref = π | item.i′.author = s1, item.i″.author = s2, comp.c′.opt1 = i′, comp.c′.opt2 = i″), where c′, i′, i″ are new comparison and item ids that do not appear in the training data. 10We ran 200 iterations with a burn-in of 50. We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles11 the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: "Give me a translation of the source text which is better than mine." The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion. 11These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999). Figure 2: WMT10 model perplexities (x-axis: number of comparisons). The perplexity of the uniform preference model is 3.0 for all training sizes. 8 Experiments We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following: • Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200}.12 • Trained the preference model on these comparisons. • Evaluated the perplexity of the trained model on the test preferences, as described in Section 4. For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size. These graphs are shown in Figure 2 (WMT10)13, Figure 3 (WMT11), and Figure 4 (WMT12). 12If k was greater than the total number of training comparisons, then we took the entire set. Figure 3: WMT11 model perplexities. Figure 4: WMT12 model perplexities. For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Student models. This somewhat validates the organizers' decision to exclude the references, particularly given WMT's use of the BOJAR ranking heuristic (the nucleus of the Independent Student models) for its official rankings. 13Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows.
The Dirichlet strength for each baseline was 1. For IRT-Gaussian: σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4. For IRT-Categorical: Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0. Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one stddev of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2. Figure 5: WMT10 model perplexities (crowdsourced versus expert training). The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than "expert"14 comparisons) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010). 14I.e., machine translation specialists. Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by (Lopez, 2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian's analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian's induced preference models as a heatmap15 (Figure 6, right), four bands are discernable. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11. 15In the heatmap, cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2. 9 Conclusion WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions – by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like: 1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g. presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data. 2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models.
However, the IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons. 3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided. References F.B. Baker. 2001. The basics of item response theory. ERIC. Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics. Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics. Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation. S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741. R.K. Hambleton. 1991. Fundamentals of item response theory, volume 2. Sage Publications, Incorporated. D. Koller and N. Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT Press. S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT. R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons. R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434. R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437–446. Morgan Kaufmann Publishers Inc. G. Rasch. 1960. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273–286. W.J. van der Linden and R.K. Hambleton. 1996. Handbook of modern item response theory. Springer.
4 0.74037611 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu
Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue that by preserving the meaning of the translations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent of the translation model paradigm, so any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.
5 0.72110248 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
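The abstract above describes a standard regression setup; purely as an illustration (not the authors' code — the feature values, targets, and hyperparameters below are invented), a scikit-learn SVR over the three features ⟨L, DP, SC⟩ might look like:

```python
from sklearn.svm import SVR

# Hypothetical training data: each row is (length, degree of polysemy,
# structural complexity); targets are gold TDI values from eye tracking.
X = [[12, 3.1, 0.42], [25, 5.0, 0.77], [8, 2.2, 0.31], [31, 6.4, 0.85]]
y = [0.35, 0.72, 0.21, 0.90]

model = SVR(kernel="rbf", C=1.0, epsilon=0.05).fit(X, y)
print(model.predict([[18, 4.0, 0.55]]))  # predicted TDI for a new sentence
```

Predicted TDI values could then be thresholded to bin sentences into the "easy", "medium", and "hard" categories the abstract mentions.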
6 0.68623286 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
7 0.68441188 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
8 0.67736769 289 acl-2013-QuEst - A translation quality estimation framework
9 0.64250177 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
10 0.63446003 255 acl-2013-Name-aware Machine Translation
11 0.62887561 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
13 0.5973503 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
14 0.59294885 322 acl-2013-Simple, readable sub-sentences
15 0.58381498 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
16 0.56858635 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
17 0.56120265 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
18 0.54884166 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
19 0.53877664 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
20 0.5365023 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
topicId topicWeight
[(0, 0.053), (6, 0.042), (11, 0.055), (15, 0.023), (24, 0.032), (26, 0.051), (35, 0.063), (42, 0.049), (48, 0.044), (70, 0.036), (80, 0.267), (88, 0.024), (90, 0.078), (95, 0.118)]
simIndex simValue paperId paperTitle
1 0.86303133 14 acl-2013-A Novel Classifier Based on Quantum Computation
Author: Ding Liu ; Xiaofang Yang ; Minghu Jiang
Abstract: In this article, we propose a novel classifier based on quantum computation theory. Different from existing methods, we consider the classification as an evolutionary process of a physical system and build the classifier by using the basic quantum mechanics equation. The performance of the experiments on two datasets indicates feasibility and potentiality of the quantum classifier.
2 0.83848631 227 acl-2013-Learning to lemmatise Polish noun phrases
Author: Adam Radziszewski
Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.
3 0.78419495 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning
Author: Song Feng ; Jun Seok Kang ; Polina Kuznetsova ; Yejin Choi
Abstract: Understanding the connotation of words plays an important role in interpreting subtle shades of sentiment beyond denotative or surface meaning of text, as seemingly objective statements often allude nuanced sentiment of the writer, and even purposefully conjure emotion from the readers’ minds. The focus of this paper is drawing nuanced, connotative sentiments from even those words that are objective on the surface, such as “intelligence ”, “human ”, and “cheesecake ”. We propose induction algorithms encoding a diverse set of linguistic insights (semantic prosody, distributional similarity, semantic parallelism of coordination) and prior knowledge drawn from lexical resources, resulting in the first broad-coverage connotation lexicon.
same-paper 4 0.77094984 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
5 0.72914803 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
Author: Emmanuel Lassalle ; Pascal Denis
Abstract: This paper proposes a new method for significantly improving the performance of pairwise coreference models. Given a set of indicators, our method learns how to best separate types of mention pairs into equivalence classes for which we construct distinct classification models. In effect, our approach finds an optimal feature space (derived from a base feature set and indicator set) for discriminating coreferential mention pairs. Although our approach explores a very large space of possible feature spaces, it remains tractable by exploiting the structure of the hierarchies built from the indicators. Our experiments on the CoNLL-2012 Shared Task English datasets (gold mentions) indicate that our method is robust relative to different clustering strategies and evaluation metrics, showing large and consistent improvements over a single pairwise model using the same base features. Our best system obtains a competitive 67.2 of average F1 over MUC, B3, and CEAF, which, despite its simplicity, places it above the mean score of other systems on these datasets.
6 0.68210888 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
7 0.58714551 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
8 0.57953119 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
9 0.57650703 250 acl-2013-Models of Translation Competitions
10 0.57400823 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
11 0.57386637 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
12 0.5737114 240 acl-2013-Microblogs as Parallel Corpora
13 0.56613505 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
14 0.56515837 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
15 0.56381077 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
16 0.56338012 97 acl-2013-Cross-lingual Projections between Languages from Different Families
17 0.56305999 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
18 0.5617801 24 acl-2013-A Tale about PRO and Monsters
19 0.56176299 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation
20 0.56150091 267 acl-2013-PARMA: A Predicate Argument Aligner