
Models of Translation Competitions



Authors: Mark Hopkins and Jonathan May

Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).

1 The WMT Translation Competition

Every year, the Workshop on Machine Translation (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English.

For each track, the organizers also assemble a panel of judges, typically machine translation specialists.¹ The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example² where a judge (we'll call him "jdoe") has ranked five translations of the French sentence "Il ne va pas."

[Table 1: WMT elicits preferences by asking judges to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French "Il ne va pas."]

Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best?

[Table 2: The ten pairwise comparisons encoded by the ranking in Table 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system.]

¹ Although in recent competitions, some of the judging has also been crowdsourced (Callison-Burch et al., 2010).
² The example does not use actual system output.

2 A Ranking Problem

For several years, WMT used the following heuristic for ranking the translation systems:

ORIGWMT(s) = (win(s) + tie(s)) / (win(s) + tie(s) + loss(s))

For system s, win(s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred.

Recently, Bojar et al. (2011) questioned the adequacy of this heuristic through the following argument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C.
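To make the argument concrete, here is a minimal Python sketch. The data layout, triples (system1, system2, preference) with 1 or 2 naming the preferred side and 0 a tie, is our own convention for illustration, not WMT's format. Listing each equally likely outcome exactly once makes the sample ratios equal the expected values from the argument above.

```python
from fractions import Fraction

def orig_wmt(system, comparisons):
    """ORIGWMT(s) = (win(s) + tie(s)) / (win(s) + tie(s) + loss(s))."""
    win = tie = loss = 0
    for s1, s2, pref in comparisons:
        if system not in (s1, s2):
            continue
        if pref == 0:
            tie += 1
        elif (pref == 1 and s1 == system) or (pref == 2 and s2 == system):
            win += 1
        else:
            loss += 1
    return Fraction(win + tie, win + tie + loss)

# One comparison per equally likely outcome stands in for the expectation:
# A vs. B is a win, a loss, and a tie, each with probability 1/3.
ab = [("A", "B", 1), ("A", "B", 2), ("A", "B", 0)]
print(orig_wmt("A", ab), orig_wmt("B", ab))    # 2/3 2/3

# Duplicate B as C: A vs. C mirrors A vs. B, while B vs. C always ties.
abc = ab + [("A", "C", 1), ("A", "C", 2), ("A", "C", 0)] + [("B", "C", 0)] * 3
print(orig_wmt("A", abc), orig_wmt("B", abc))  # 2/3 5/6
```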
This vulnerability prompted Bojar et al. (2011) to offer the following revision:

BOJAR(s) = win(s) / (win(s) + loss(s))

The following year, it was BOJAR's turn to be criticized, this time by Lopez (2012):

  Superficially, this appears to be an improvement. ...couldn't a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn't a system be rewarded simply by being compared against a bad system more frequently than its competitors?

Lopez's concern, while reasonable, is less obviously damning than Bojar et al. (2011)'s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in the competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver.

The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez's ranking scheme and provided a variant called Most Probable Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than another? Or is this even the right question to ask?

3 A Problem with Rankings

Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We'll call these close1, close2, and close3. Nevertheless, close1 is very slightly better³ than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking?

(1) terrific, close3, close1, close2
(2) close1, terrific, close2, close3

Spearman's rho⁴ would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system.

The problem is not with Spearman's rho. The problem is the disconnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information rankings lack, and define more rigorously what we hope to learn from a translation competition.

³ What does "better" mean? We'll return to this question.
⁴ Or Pearson's correlation coefficient.
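As a quick check on the example above, the following sketch computes Spearman's rho for the two candidate rankings against the gold ordering terrific > close1 > close2 > close3. Since each ranking is a strict ordering, the standard no-ties closed form rho = 1 - 6·Σd²/(n(n² - 1)) applies.

```python
def spearman_rho(gold, candidate):
    """Spearman's rho between two strict orderings of the same systems."""
    n = len(gold)
    d2 = sum((gold.index(s) - candidate.index(s)) ** 2 for s in gold)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

gold = ["terrific", "close1", "close2", "close3"]      # best to worst
ranking1 = ["terrific", "close3", "close1", "close2"]  # minor mistakes
ranking2 = ["close1", "terrific", "close2", "close3"]  # one severe mistake

print(spearman_rho(gold, ranking1))  # 0.4
print(spearman_rho(gold, ranking2))  # 0.8 -- rho prefers the worse ranking
```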
4 From Rankings to Relative Ability

Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all translation systems. Hereafter, we will refer to S as the space of students. We choose this term to evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by "ability"? Before formally defining this term, first recognize that it means little without context, namely:

1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question "which system is better?" requires us to state how much we care about travel versus newswire documents; otherwise the question is underspecified.

2. Who are we trying to impress? While it's tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (Callison-Burch et al., 2012) suggests otherwise. It's also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Post-editors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what "better" means.

With this in mind, let's define some additional elements of a translation competition. Let X be the space of all possible segments of source text, J be the space of all possible judges, and Π = {0, 1, 2} be the space of pairwise preferences.⁵ We assume all spaces are countable.

Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variable j represents a judge from J, and variable π represents a preference from Π. Moreover, define the negation π̂ of preference π such that π̂ = 2 (if π = 1), π̂ = 1 (if π = 2), and π̂ = 0 (if π = 0).

Now assume a joint distribution P(s1, s2, x, j, π) specifying the probability that we ask judge j to evaluate students s1 and s2's respective translations of source text x, and that judge j's preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words:

P(s1, s2, x, j, π) = P(π|s1, s2, x, j) · P(x|s1, s2, j) · P(j|s1, s2) · P(s1, s2)
                   = P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2)
                   = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x, j)

It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity:

P(π|s1, s2) = Σx∈X Σj∈J PX(x) · PJ(j) · P(π|s1, s2, x, j)

We refer to this as the ⟨PX, PJ⟩-relative ability of students s1 and s2. By using different marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer, or evenly among all fluent bilingual speakers of a language pair).

With this machinery, we can express the purpose of a translation competition more clearly: to estimate the ⟨PX, PJ⟩-relative ability of a set of students. In the case of WMT, PJ presumably⁶ defines a space of competent source-to-target bilingual speakers, while PX defines a space of newswire documents.

We'll refer to an estimate of P(π|s1, s2) as a preference model. In other words, a preference model is a distribution Q(π|s1, s2).

⁵ As a reminder, 0 indicates no preference.
⁶ One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community.
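To illustrate the marginalization, here is a toy sketch for one fixed student pair. All distributions are invented for illustration; nothing here comes from WMT data.

```python
# Toy spaces: two source segments, two judges, preferences 0 (tie), 1, 2.
P_X = {"x_news": 0.8, "x_tweet": 0.2}     # marginal over source text
P_J = {"j_expert": 0.5, "j_layman": 0.5}  # marginal over judges

# Hypothetical conditional P(pi | s1, s2, x, j) for one fixed student pair.
P_pref = {
    ("x_news", "j_expert"):  {0: 0.2, 1: 0.6, 2: 0.2},
    ("x_news", "j_layman"):  {0: 0.4, 1: 0.4, 2: 0.2},
    ("x_tweet", "j_expert"): {0: 0.2, 1: 0.3, 2: 0.5},
    ("x_tweet", "j_layman"): {0: 0.5, 1: 0.2, 2: 0.3},
}

def relative_ability(pi):
    """P(pi | s1, s2) = sum_x sum_j P_X(x) * P_J(j) * P(pi | s1, s2, x, j)."""
    return sum(P_X[x] * P_J[j] * P_pref[(x, j)][pi]
               for x in P_X for j in P_J)

print({pi: round(relative_ability(pi), 3) for pi in (0, 1, 2)})
# {0: 0.31, 1: 0.45, 2: 0.24} -- shifting P_X or P_J changes the answer
```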
Given a set of pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1, s2) such that Q is "close" to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 1951), but we cannot use it here because P is unknown. Fortunately, if we have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this held-out test data.

Let D be a sequence of triples ⟨s1, s2, π⟩ where the preferences π are i.i.d. samples from P(π|s1, s2). The perplexity of preference model Q on test data D is:

perplexity(Q|D) = 2^( −Σ⟨s1,s2,π⟩∈D (1/|D|) · log2 Q(π|s1, s2) )

How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. Let C be the set of comparisons ⟨s1, s2, x, j, π⟩ obtained from a translation competition. Competition data C is not necessarily⁷ sampled i.i.d. from P(s1, s2, x, j, π) because we may intentionally⁸ bias data collection towards certain students, judges, or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons.

To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C′ of comparisons whose source text appears in at most k comparisons, where k is the smallest positive integer such that |C′| ≥ 2000. We then create the test set D from C′:

D = {⟨s1, s2, π⟩ | ⟨s1, s2, x, j, π⟩ ∈ C′}

We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track.

Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative⁹ answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model.

⁷ In WMT, it certainly is not.
⁸ To collect judge agreement statistics, for instance.
⁹ As opposed to philosophical.

5 Baselines

Let's begin then, and create some simple preference models to serve as baselines.

5.1 Uniform

The simplest preference model is a uniform distribution over preferences, for any choice of students s1, s2:

Q(π|s1, s2) = 1/3, ∀π ∈ Π

This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e., equal to the number of possible preferences).

5.2 Adjusted Uniform

Now suppose we have a set C of comparisons available for training. Let Cπ ⊆ C denote the subset of comparisons with preference π, and let C(s1, s2) denote the subset comparing students s1 and s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e., preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences:

Q(π|s1, s2) = |C0| / |C|            if π = 0
            = (1 − |C0|/|C|) / 2    otherwise
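Computing perplexity is mechanical. Here is a minimal sketch (the data layout is our own convention again: a preference model as a function from a student pair to a distribution over {0, 1, 2}), which also confirms the claimed perplexity of 3 for the uniform baseline.

```python
import math

def perplexity(Q, D):
    """perplexity(Q|D) = 2^(-(1/|D|) * sum of log2 Q(pi | s1, s2) over D)."""
    log_sum = sum(math.log2(Q(s1, s2)[pi]) for s1, s2, pi in D)
    return 2 ** (-log_sum / len(D))

# Uniform baseline: ignores the students entirely.
uniform = lambda s1, s2: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

D = [("sysA", "sysB", 1), ("sysA", "sysB", 0), ("sysB", "sysC", 2)]
print(perplexity(uniform, D))  # 3.0, matching the claim above
```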
6 Simple Bayesian Models

6.1 Independent Pairs

Another simple model is the direct estimation of each relative ability P(π|s1, s2) independently. In other words, for each pair of students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be:

Q(π|s1, s2) = (|Cπ(s1, s2)| + |Cπ̂(s2, s1)|) / (|C(s1, s2)| + |C(s2, s1)|)

However, the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate:

Q(π|s1, s2) = (α + |Cπ(s1, s2)| + |Cπ̂(s2, s1)|) / (3α + |C(s1, s2)| + |C(s2, s1)|)

We call this the Independent Pairs preference model.

6.2 Independent Students

The Independent Pairs model makes a strong independence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C.

Instead of directly estimating the relative ability P(π|s1, s2) of students s1 and s2, we could instead try to estimate the universal ability P(π|s1) = Σs2∈S P(π|s1, s2) · P(s2|s1) of each individual student s1 and then try to reconstruct the relative abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each preference distribution, which gives us the following Bayesian estimate:

Q(π|s1) = (α + Σs2∈S (|Cπ(s1, s2)| + |Cπ̂(s2, s1)|)) / (3α + Σs2∈S (|C(s1, s2)| + |C(s2, s1)|))

The estimates Q(π|s1) do not yet constitute a preference model. A downside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2:

Q(π|s1, s2) = Q(π|s1)

The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities:

Q(π|s1, s2) = (Q(π|s1) + Q(π̂|s2)) / 2
Q(π|s1, s2) = [Q(π|s1) · Q(π̂|s2)]^(1/2)

We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and the BOJAR ranking heuristic. These three models are our attempt to render the BOJAR heuristic as preference models.
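A sketch of the Independent Pairs estimator follows, using the same hypothetical comparison-triple layout as the earlier sketches; α and the negation π̂ follow the definitions above.

```python
from collections import Counter

NEGATE = {0: 0, 1: 2, 2: 1}  # the negation pi-hat defined in Section 4

def independent_pairs(comparisons, alpha=1.0):
    """Dirichlet-smoothed estimate of Q(pi | s1, s2) from Section 6.1."""
    counts = Counter()  # counts[(s1, s2, pi)] = |C_pi(s1, s2)|
    totals = Counter()  # totals[(s1, s2)]     = |C(s1, s2)|
    for s1, s2, pi in comparisons:
        counts[(s1, s2, pi)] += 1
        totals[(s1, s2)] += 1

    def Q(pi, s1, s2):
        num = alpha + counts[(s1, s2, pi)] + counts[(s2, s1, NEGATE[pi])]
        den = 3 * alpha + totals[(s1, s2)] + totals[(s2, s1)]
        return num / den

    return Q

Q = independent_pairs([("A", "B", 1), ("A", "B", 1), ("B", "A", 2), ("A", "B", 0)])
print(Q(1, "A", "B"))  # (1 + 2 + 1) / (3 + 3 + 1) = 4/7: evidence favoring A
```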
7 Item-Response Theoretic (IRT) Models

Let's revisit Lopez (2012)'s objection to the BOJAR ranking heuristic:

  ...couldn't a system still be penalized simply by being compared to [good systems] more frequently than its competitors?

The official WMT 2012 findings (Callison-Burch et al., 2012) echo this concern in justifying the exclusion of reference translations from the 2012 competition:

  [W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often.

Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012's decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on.

We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g., who a student has been paired with) to influence the estimation of the student abilities. Figure 1 (left) shows plate notation (Koller and Friedman, 2009) for the model's independence structure. First, each student's ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student's ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations.

[Figure 1: Plate notation (left) showing the independence structure of the IRT models. Example instantiated subnetwork (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons.]

We investigate two parameterizations of this model: Gaussian and categorical. Figure 1 (right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an above-average student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14's ability distribution), while item 205 is not student 6's best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2.

Comparison 1 pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item's actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision radius (which here is 0.5), then he states no preference (i.e., a preference of 0). Otherwise he prefers the item with the higher observed quality.

The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. The student ability distributions are categorical distributions over {1, 2, ..., Λ}, and the student ability prior is a symmetric Dirichlet with strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ − λ}. Noise ν is drawn from a discretized zero-mean Gaussian with standard deviation σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs).
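The Gaussian parameterization's generative story is easy to forward-sample. The sketch below mirrors the description above; the hyperparameter values are illustrative, and the function is our own construction, not the authors' code.

```python
import random

SIGMA_0, SIGMA_A, SIGMA_OBS, RADIUS = 1.0, 0.5, 1.0, 0.5

def simulate(num_students=10, num_comparisons=1000, seed=0):
    """Forward-sample the Gaussian IRT model described above."""
    rng = random.Random(seed)
    # Each student's ability mean is drawn from the zero-mean prior.
    ability = [rng.gauss(0.0, SIGMA_0) for _ in range(num_students)]
    data = []
    for _ in range(num_comparisons):
        s1, s2 = rng.sample(range(num_students), 2)
        # Item qualities, drawn from each author's ability distribution.
        q1 = rng.gauss(ability[s1], SIGMA_A)
        q2 = rng.gauss(ability[s2], SIGMA_A)
        # The judge observes each quality with additive Gaussian noise.
        o1 = q1 + rng.gauss(0.0, SIGMA_OBS)
        o2 = q2 + rng.gauss(0.0, SIGMA_OBS)
        if abs(o1 - o2) < RADIUS:
            pref = 0          # within the decision radius: no preference
        else:
            pref = 1 if o1 > o2 else 2
        data.append((s1, s2, pref))
    return ability, data
```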
We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently¹⁰ for both parameterizations. Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query:

Pr(comp.c′.pref = π | item.i′.author = s1, item.i″.author = s2, comp.c′.opt1 = i′, comp.c′.opt2 = i″)

where c′, i′, i″ are new comparison and item ids that do not appear in the training data.

¹⁰ We ran 200 iterations with a burn-in of 50.

We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles¹¹ the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing.

From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: "Give me a translation of the source text which is better than mine." The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion.

¹¹ These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999).

8 Experiments

We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following:

• Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200}.¹²
• Trained the preference model on these comparisons.
• Evaluated the perplexity of the trained model on the test preferences, as described in Section 4.

For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size.

¹² If k was greater than the total number of training comparisons, then we took the entire set.
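Concretely, this protocol might be scripted as in the sketch below, which reuses the hypothetical perplexity and independent_pairs helpers from the earlier sketches; it is not the authors' actual experiment code.

```python
import random

def evaluate(train, test, sizes=(100, 200, 400, 800, 1600, 3200), trials=5):
    """Average test perplexity as a function of training size."""
    results = {}
    for k in sizes:
        perp = []
        for trial in range(trials):
            rng = random.Random(trial)
            # Take the whole set if k exceeds the available comparisons.
            subset = rng.sample(train, min(k, len(train)))
            Q = independent_pairs(subset)  # train one preference model
            perp.append(perplexity(lambda s1, s2: {pi: Q(pi, s1, s2)
                                                   for pi in (0, 1, 2)},
                                   test))
        results[k] = sum(perp) / len(perp)
    return results
```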
These graphs are shown in Figure 2 (WMT10)¹³, Figure 3 (WMT11), and Figure 4 (WMT12).

[Figure 2: WMT10 model perplexities as a function of training size (number of comparisons). The perplexity of the uniform preference model is 3.0 for all training sizes.]
[Figure 3: WMT11 model perplexities.]
[Figure 4: WMT12 model perplexities.]

For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Students models. This somewhat validates the organizers' decision to exclude the references, particularly given WMT's use of the BOJAR ranking heuristic (the nucleus of the Independent Students models) for its official rankings.

¹³ Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows. The Dirichlet strength for each baseline was 1. For IRT-Gaussian: σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4. For IRT-Categorical: Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0.

The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than "expert"¹⁴ comparisons) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010).

[Figure 5: WMT10 model perplexities (crowdsourced versus expert training).]

¹⁴ I.e., machine translation specialists.

Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by Lopez (2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian's analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian's induced preference models as a heatmap¹⁵ (Figure 6, right), four bands are discernible. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11.

[Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one standard deviation of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.]

¹⁵ In the heatmap, cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.

9 Conclusion

WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions: by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like:

1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g., presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data.

2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models.
However, IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons.

3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided.

References

F.B. Baker. 2001. The Basics of Item Response Theory. ERIC.

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

R.K. Hambleton. 1991. Fundamentals of Item Response Theory, volume 2. Sage Publications.

D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT.

R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons.

R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434.

R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437–446. Morgan Kaufmann Publishers Inc.

G. Rasch. 1960. Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests.

Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273–286.

W.J. van der Linden and R.K. Hambleton. 1996. Handbook of Modern Item Response Theory. Springer.

