acl acl2013 acl2013-250 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Hopkins; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).

1 The WMT Translation Competition

Every year, the Workshop on Machine Translation (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. For each track, the organizers also assemble a panel of judges, typically machine translation specialists.[1] The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example[2] where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas.” Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best?

[1] Although in recent competitions, some of the judging has also been crowdsourced (Callison-Burch et al., 2010).
[2] The example does not use actual system output.

Table 1: WMT elicits preferences by asking judges to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French “Il ne va pas.”

Table 2: The ten pairwise comparisons encoded by the ranking in Table 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system.

2 A Ranking Problem

For several years, WMT used the following heuristic for ranking the translation systems:

    ORIGWMT(s) = (win(s) + tie(s)) / (win(s) + tie(s) + loss(s))

For system s, win(s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred.

Recently, Bojar et al. (2011) questioned the adequacy of this heuristic through the following argument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C.
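To make the arithmetic concrete, here is a minimal sketch (ours, not part of the paper) that reproduces the expected scores in this argument from expected win/tie/loss counts:

    # Sketch: the ORIGWMT heuristic and the duplicate-system flaw above.
    # Counts below are expected counts, assuming 300 judgments per system pair.

    def orig_wmt(win, tie, loss):
        # ORIGWMT(s) = (win(s) + tie(s)) / (win(s) + tie(s) + loss(s))
        return (win + tie) / (win + tie + loss)

    # Systems A and B, equally good: per 300 comparisons, each wins 100,
    # loses 100, and ties 100.
    print(orig_wmt(100, 100, 100))        # A and B both score 2/3

    # Now duplicate B as C, with 300 comparisons per pairing.
    # A vs. B and A vs. C have identical expected counts, so A is unchanged:
    print(orig_wmt(200, 200, 200))        # A: still 2/3
    # B wins/loses 100 against A, but ties all 300 comparisons against C:
    print(orig_wmt(100, 100 + 300, 100))  # B: 5/6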
This vulnerability prompted Bojar et al. (2011) to offer the following revision:

    BOJAR(s) = win(s) / (win(s) + loss(s))

The following year, it was BOJAR’s turn to be criticized, this time by Lopez (2012):

    Superficially, this appears to be an improvement... couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn’t a system be rewarded simply by being compared against a bad system more frequently than its competitors?

Lopez’s concern, while reasonable, is less obviously damning than Bojar et al. (2011)’s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition outcomes significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver.

The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez’s ranking scheme and provided a variant called Most Probable Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than another? Or is this even the right question to ask?

3 A Problem with Rankings

Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We’ll call these close1, close2, and close3. Nevertheless, close1 is very slightly better[3] than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking?

    (1) terrific, close3, close1, close2
    (2) close1, terrific, close2, close3

Spearman’s rho[4] would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman’s rho. The problem is the disconnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information rankings lack, and define more rigorously what we hope to learn from a translation competition.

[3] What does “better” mean? We’ll return to this question.
[4] Or Pearson’s correlation coefficient.
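To see rho’s verdict concretely, the following sketch (ours; it assumes SciPy is available) scores both rankings against the gold ranking terrific, close1, close2, close3:

    # Sketch: Spearman's rho for the two candidate rankings in Section 3.
    # Rank positions: 1 = best; lists are ordered [terrific, close1, close2, close3].
    from scipy.stats import spearmanr

    gold     = [1, 2, 3, 4]   # gold ranking
    ranking1 = [1, 3, 4, 2]   # terrific, close3, close1, close2
    ranking2 = [2, 1, 3, 4]   # close1, terrific, close2, close3

    rho1, _ = spearmanr(gold, ranking1)
    rho2, _ = spearmanr(gold, ranking2)
    print(rho1, rho2)  # 0.4 vs. 0.8: rho prefers ranking (2), intuition does not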
4 From Rankings to Relative Ability

Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all translation systems. Hereafter, we will refer to S as the space of students. We choose this term to evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by “ability”? Before formally defining this term, first recognize that it means little without context, namely:

1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents; otherwise the question is underspecified.

2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (Callison-Burch et al., 2012) suggests otherwise. It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think of a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Post-editors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means.

With this in mind, let’s define some additional elements of a translation competition. Let X be the space of all possible segments of source text, J be the space of all possible judges, and Π = {0, 1, 2} be the space of pairwise preferences.[5] We assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variable j represents a judge from J, and variable π represents a preference from Π. Moreover, define the negation π̂ of preference π such that π̂ = 2 (if π = 1), π̂ = 1 (if π = 2), and π̂ = 0 (if π = 0).

Now assume a joint distribution P(s1, s2, x, j, π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words:

    P(s1, s2, x, j, π) = P(π|s1, s2, x, j) · P(x|s1, s2, j) · P(j|s1, s2) · P(s1, s2)
                       = P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2)
                       = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x, j)

It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity:

    P(π|s1, s2) = Σ_{x∈X} Σ_{j∈J} PX(x) · PJ(j) · P(π|s1, s2, x, j)

We refer to this as the ⟨PX, PJ⟩-relative ability of students s1 and s2. By using different marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). With this machinery, we can express the purpose of a translation competition more clearly: to estimate the ⟨PX, PJ⟩-relative ability of a set of students. In the case of WMT, PJ presumably[6] defines a space of competent source-to-target bilingual speakers, while PX defines a space of newswire documents. We’ll refer to an estimate of P(π|s1, s2) as a preference model. In other words, a preference model is a distribution Q(π|s1, s2).

[5] As a reminder, 0 indicates no preference.
[6] One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community.
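As an illustration of this definition, here is a small sketch (ours, with invented numbers) that computes a ⟨PX, PJ⟩-relative ability for one fixed student pair by marginalizing over a toy space of source texts and judges:

    # Sketch: computing the <PX, PJ>-relative ability P(pi | s1, s2) for one
    # fixed student pair. All distributions here are invented for illustration.

    P_X = {"news": 0.7, "travel": 0.3}    # marginal over source segments
    P_J = {"expert": 0.5, "layman": 0.5}  # marginal over judges

    # Hypothetical P(pi | s1, s2, x, j): 0 = tie, 1 = prefer s1, 2 = prefer s2.
    P_pref = {
        ("news",   "expert"): {0: 0.2, 1: 0.6, 2: 0.2},
        ("news",   "layman"): {0: 0.4, 1: 0.4, 2: 0.2},
        ("travel", "expert"): {0: 0.2, 1: 0.3, 2: 0.5},
        ("travel", "layman"): {0: 0.4, 1: 0.3, 2: 0.3},
    }

    def relative_ability(pi):
        # P(pi | s1, s2) = sum_x sum_j PX(x) * PJ(j) * P(pi | s1, s2, x, j)
        return sum(P_X[x] * P_J[j] * P_pref[(x, j)][pi]
                   for x in P_X for j in P_J)

    print({pi: round(relative_ability(pi), 3) for pi in (0, 1, 2)})
    # {0: 0.3, 1: 0.44, 2: 0.26} -- shifting mass in P_X or P_J changes the answer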
Given a set of pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1, s2) such that Q is “close” to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 1951), but we cannot use it here because P is unknown. Fortunately, if we have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this held-out test data. Let D be a sequence of triples ⟨s1, s2, π⟩ where the preferences π are i.i.d. samples from P(π|s1, s2). The perplexity of preference model Q on test data D is:

    perplexity(Q|D) = 2^( −Σ_{⟨s1,s2,π⟩∈D} (1/|D|) · log2 Q(π|s1, s2) )

How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. Let C be the set of comparisons ⟨s1, s2, x, j, π⟩ obtained from a translation competition. Competition data C is not necessarily[7] sampled i.i.d. from P(s1, s2, x, j, π) because we may intentionally[8] bias data collection towards certain students, judges, or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons. To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C′ of comparisons whose source text appears in at most k comparisons, where k is the smallest positive integer such that |C′| ≥ 2000. We then create the test set D from C′:

    D = {⟨s1, s2, π⟩ | ⟨s1, s2, x, j, π⟩ ∈ C′}

We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track.

Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative[9] answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model.

[7] In WMT, it certainly is not.
[8] To collect judge agreement statistics, for instance.
[9] As opposed to philosophical.

5 Baselines

Let’s begin then, and create some simple preference models to serve as baselines.

5.1 Uniform

The simplest preference model is a uniform distribution over preferences, for any choice of students s1, s2:

    Q(π|s1, s2) = 1/3   for all π ∈ Π

This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e., equal to the number of possible preferences).

5.2 Adjusted Uniform

Now suppose we have a set C of comparisons available for training. Let Cπ ⊆ C denote the subset of comparisons with preference π, and let C(s1, s2) denote the subset comparing students s1 and s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e., preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences:

    Q(π|s1, s2) = |C0| / |C|           if π = 0
                = (1 − |C0|/|C|) / 2   otherwise
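The evaluation metric and the two baselines above are simple enough to state in a few lines; the sketch below (ours) assumes comparisons are given as (s1, s2, pi) triples:

    # Sketch: test-set perplexity (Section 4) and the Uniform and
    # Adjusted Uniform baselines (Section 5).
    import math

    def perplexity(Q, D):
        # perplexity(Q|D) = 2^(-(1/|D|) * sum over D of log2 Q(pi | s1, s2))
        total = sum(math.log2(Q(pi, s1, s2)) for (s1, s2, pi) in D)
        return 2.0 ** (-total / len(D))

    def uniform(pi, s1, s2):
        return 1.0 / 3.0  # its perplexity is exactly 3 on any test set

    def make_adjusted_uniform(train):
        # train: list of (s1, s2, pi); estimate P(tie) = |C0| / |C|
        p_tie = sum(1 for (_, _, pi) in train if pi == 0) / len(train)
        return lambda pi, s1, s2: p_tie if pi == 0 else (1.0 - p_tie) / 2.0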
6 Simple Bayesian Models

6.1 Independent Pairs

Another simple model is the direct estimation of each relative ability P(π|s1, s2) independently. In other words, for each pair of students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be:

    Q(π|s1, s2) = ( |Cπ(s1, s2)| + |Cπ̂(s2, s1)| ) / ( |C(s1, s2)| + |C(s2, s1)| )

However, the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate:

    Q(π|s1, s2) = ( α + |Cπ(s1, s2)| + |Cπ̂(s2, s1)| ) / ( 3α + |C(s1, s2)| + |C(s2, s1)| )

We call this the Independent Pairs preference model.

6.2 Independent Students

The Independent Pairs model makes a strong independence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1, s2) of students s1 and s2, we could instead try to estimate the universal ability

    P(π|s1) = Σ_{s2∈S} P(π|s1, s2) · P(s2|s1)

of each individual student s1 and then try to reconstruct the relative abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each preference distribution, which gives us the following Bayesian estimate:

    Q(π|s1) = ( α + Σ_{s2∈S} |Cπ(s1, s2)| + |Cπ̂(s2, s1)| ) / ( 3α + Σ_{s2∈S} |C(s1, s2)| + |C(s2, s1)| )

The estimates Q(π|s1) do not yet constitute a preference model. A downside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2:

    Q(π|s1, s2) = Q(π|s1)

The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities:

    Q(π|s1, s2) = ( Q(π|s1) + Q(π̂|s2) ) / 2
    Q(π|s1, s2) = [ Q(π|s1) · Q(π̂|s2) ]^(1/2)

We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and the BOJAR ranking heuristic. These three models are our attempt to render the BOJAR heuristic as preference models.
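The smoothed estimates above translate directly into counts over training triples; here is a sketch (ours) of the Independent Pairs model and the geometric Independent Students reconstruction:

    # Sketch: Dirichlet-smoothed Independent Pairs and (Geometric) Independent
    # Students estimates, from training triples (s1, s2, pi).
    from collections import Counter

    NEG = {0: 0, 1: 2, 2: 1}  # preference negation pi-hat

    def independent_pairs(train, alpha=1.0):
        c_pi, c_n = Counter(), Counter()
        for s1, s2, pi in train:
            c_pi[(s1, s2, pi)] += 1
            c_n[(s1, s2)] += 1
        def Q(pi, s1, s2):
            num = alpha + c_pi[(s1, s2, pi)] + c_pi[(s2, s1, NEG[pi])]
            den = 3 * alpha + c_n[(s1, s2)] + c_n[(s2, s1)]
            return num / den
        return Q

    def independent_students_geometric(train, alpha=1.0):
        u_pi, u_n = Counter(), Counter()
        for s1, s2, pi in train:
            u_pi[(s1, pi)] += 1; u_n[s1] += 1        # s1 saw preference pi
            u_pi[(s2, NEG[pi])] += 1; u_n[s2] += 1   # s2 saw its negation
        def q(pi, s):  # universal ability estimate Q(pi | s)
            return (alpha + u_pi[(s, pi)]) / (3 * alpha + u_n[s])
        # note: the geometric reconstruction is not normalized over pi
        return lambda pi, s1, s2: (q(pi, s1) * q(NEG[pi], s2)) ** 0.5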
7 Item-Response Theoretic (IRT) Models

Let’s revisit Lopez (2012)’s objection to the BOJAR ranking heuristic: “...couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors?” The official WMT 2012 findings (Callison-Burch et al., 2012) echo this concern in justifying the exclusion of reference translations from the 2012 competition:

    [W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often.

Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012’s decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on.

We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g., who a student has been paired with) to influence the estimation of the student abilities. Figure 1 (left) shows plate notation (Koller and Friedman, 2009) for the model’s independence structure. First, each student’s ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student’s ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations.

We investigate two parameterizations of this model: Gaussian and categorical. Figure 1 (right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an above-average student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14’s ability distribution), while item 205 is not student 6’s best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1 pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item’s actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision radius (which here is 0.5), then he states no preference (i.e., a preference of 0). Otherwise he prefers the item with the higher observed quality.

Figure 1: Plate notation (left) showing the independence structure of the IRT models. Example instantiated subnetwork (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons.

The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. The student ability distributions are categorical distributions over {1, 2, ..., Λ}, and the student ability prior is a symmetric Dirichlet with strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ − λ}. Noise ν is drawn from a discretized zero-mean Gaussian with standard deviation σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs) at ν.
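The generative story for the Gaussian parameterization is short enough to forward-sample. The following sketch (ours; hyperparameter values simply mirror the running example) shows how a single pairwise preference is generated; for brevity it draws fresh students and items per comparison, whereas the real model shares them across comparisons:

    # Sketch: forward-sampling one comparison from the Gaussian IRT model.
    import random
    from collections import Counter

    SIGMA_0, SIGMA_A, SIGMA_OBS, RADIUS = 1.0, 0.5, 1.0, 0.5

    def sample_comparison(rng):
        mu1 = rng.gauss(0.0, SIGMA_0)   # each author's ability mean ~ prior
        mu2 = rng.gauss(0.0, SIGMA_0)
        q1 = rng.gauss(mu1, SIGMA_A)    # item quality ~ author's ability dist.
        q2 = rng.gauss(mu2, SIGMA_A)
        o1 = q1 + rng.gauss(0.0, SIGMA_OBS)  # judge observes quality noisily
        o2 = q2 + rng.gauss(0.0, SIGMA_OBS)
        if abs(o1 - o2) < RADIUS:       # within decision radius: no preference
            return 0
        return 1 if o1 > o2 else 2

    rng = random.Random(0)
    print(Counter(sample_comparison(rng) for _ in range(10000)))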
We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently[10] for both parameterizations. Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query:

    Pr(comp.c′.pref = π | item.i′.author = s1, item.i′′.author = s2,
       comp.c′.opt1 = i′, comp.c′.opt2 = i′′)

where c′, i′, i′′ are new comparison and item ids that do not appear in the training data. We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles[11] the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: “Give me a translation of the source text which is better than mine.” The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion.

[10] We ran 200 iterations with a burn-in of 50.
[11] These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999).

8 Experiments

We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following:

• Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200}.[12]
• Trained the preference model on these comparisons.
• Evaluated the perplexity of the trained model on the test preferences, as described in Section 4.

For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size. These graphs are shown in Figure 2 (WMT10),[13] Figure 3 (WMT11), and Figure 4 (WMT12).

[12] If k was greater than the total number of training comparisons, then we took the entire set.
[13] Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows: the Dirichlet strength for each baseline was 1; for IRT-Gaussian, σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4; for IRT-Categorical, Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0.

Figure 2: WMT10 model perplexities as a function of training size (number of comparisons). The perplexity of the uniform preference model is 3.0 for all training sizes.

Figure 3: WMT11 model perplexities.

Figure 4: WMT12 model perplexities.
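Assembled from the pieces above, a sketch (ours) of this learning-curve protocol; `perplexity` is as in the earlier baseline sketch, and `train_fn` is any function that builds a preference model Q from a list of comparisons (e.g., the `independent_pairs` sketch):

    # Sketch: average test perplexity over 5 random training subsets per size.
    import random

    SIZES = (100, 200, 400, 800, 1600, 3200)

    def learning_curve(train, test, train_fn, sizes=SIZES, trials=5, seed=0):
        rng = random.Random(seed)
        curve = {}
        for k in sizes:
            k = min(k, len(train))  # footnote 12: cap at the full training set
            perps = [perplexity(train_fn(rng.sample(train, k)), test)
                     for _ in range(trials)]
            curve[k] = sum(perps) / trials
        return curve

    # e.g., learning_curve(train, test, independent_pairs) for one track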
For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Students models. This somewhat validates the organizers’ decision to exclude the references, particularly given WMT’s use of the BOJAR ranking heuristic (the nucleus of the Independent Students models) for its official rankings.

The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than “expert”[14] comparisons) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010).

[14] I.e., machine translation specialists.

Figure 5: WMT10 model perplexities (crowdsourced versus expert training).

Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by Lopez (2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian’s analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian’s induced preference models as a heatmap[15] (Figure 6, right), four bands are discernible. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11.

[15] In the heatmap, cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.

Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one standard deviation of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.

9 Conclusion

WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions: by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like:

1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g., presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data.

2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models. However, IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons.
3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided.

References

F.B. Baker. 2001. The basics of item response theory. ERIC.

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omřej Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1-11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345.

C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17-53. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721-741.

R.K. Hambleton. 1991. Fundamentals of item response theory, volume 2. Sage Publications, Incorporated.

D. Koller and N. Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT Press.

S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.

Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT.

R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons.

R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434.

R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437-446. Morgan Kaufmann Publishers Inc.

G. Rasch. 1960. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests.

Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273-286.

W.J. van der Linden and R.K. Hambleton. 1996. Handbook of modern item response theory. Springer.
Reference: text
sentIndex sentText sentNum sentScore
1 Models of Translation Competitions Mark Hopkins and Jonathan May SDL Research 6060 Drive, Suite 150 Angeles, CA 90045 Center Los {mhopkins Abstract What do we want to learn from a translation competition and how do we learn it with confidence? [sent-1, score-0.386]
2 We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. [sent-2, score-0.491]
3 In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. [sent-3, score-0.3]
4 1 The WMT Translation Competition Every year, the Workshop on Machine Transla- , tion (WMT) conducts a competition between machine translation systems. [sent-5, score-0.386]
5 The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. [sent-6, score-0.154]
6 For each track, the organizers also assemble a panel of judges, typically machine translation specialists. [sent-7, score-0.154]
7 1 The role of a judge is to repeatedly rank five different translations of the same source text. [sent-8, score-0.268]
8 In Table 1, we show an example2 where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas. [sent-10, score-0.235]
9 For each competition track, WMT typically elicits between 5000 and 20000 comparisons. [sent-12, score-0.338]
10 Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best? [sent-13, score-0.295]
11 y”asking judges to simultaneously rank five translations, with ties permitted. [sent-20, score-0.167]
12 A preference of 0 means neither translation was preferred. [sent-23, score-0.398]
13 The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. [sent-32, score-0.171]
14 Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. [sent-33, score-0.3]
15 It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. [sent-45, score-0.51]
16 Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver. [sent-47, score-0.154]
17 Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. [sent-53, score-0.522]
18 While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. [sent-64, score-0.225]
19 The problem is the disconnnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. [sent-66, score-0.206]
20 4 From Rankings to Relative Ability Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. [sent-69, score-0.645]
21 a Wftee c,h woeos wei ltlh ires teerrm to t So evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. [sent-73, score-0.492]
22 While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (CallisonBurch et al. [sent-86, score-0.173]
23 Let X be the space osf o afll a possible segments toitfi source text, J h bee tshpea space lolf p paolls possible judges, fa snodu rΠc = {0, 1, 2} bthee tshpea space ol fp pairwise d pgreesf,e arenndc Πes=. [sent-95, score-0.177]
24 Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variaSb,l ev j represents a judge af sroemgm J, ta fnrod mva Xria,b vlea π represents a preference fero fmro mΠ. [sent-97, score-0.749]
25 J Moreover, adbelfein πe the negation ˆπ of preference π such that ˆπ = 2 (if π = 1), ˆπ = 1(if π = 2), and ˆπ = 0 (if π = 0). [sent-98, score-0.312]
26 Now assume a joint distribution P(s1, s2, x, j,π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. [sent-99, score-1.017]
27 We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. [sent-100, score-0.558]
28 We can marginalize over the source segments and judges to obtain a useful quantity: P(π|s1, s2) = X XPX(x) · PJ(j) · P(π|s1,s2,x,j) Xx∈X Xj∈J We refer to this as the hPX, PJi-relative ability of Wstued reenftesr s1 hanisd a s2. [sent-103, score-0.288]
29 Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). [sent-105, score-0.269]
30 With this machinery, we can express the purpose of a translation competition more clearly: to estimate the hPX, PJi-relative ability of a set toof eststuidmenattes. [sent-106, score-0.569]
31 We’ll refer to an estimate of P(π|s1 , s2) as a preference rm toode anl. [sent-108, score-0.36]
32 , Table 2), the challenge is to estimate a preference model Q(π|s1 , s2) such that Q is “close” to P. [sent-112, score-0.36]
33 data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this heldout test data. [sent-117, score-0.441]
34 The perplexity of preference model Q on stest data D is: perplexity(Q|D) = 2−Phs1,s2,πi∈D |D1|log2Q(π|s1,s2) How do we obtain such a test set from competition data? [sent-123, score-0.697]
35 Recall that a WMT competition produces pairwise comparisons like those in Table 2. [sent-124, score-0.556]
36 1418 Let C be the set of comparisons hs1, s2, x, j,πi Lobettai Cne bde f trhoem s a t orfan csolamtipoanr competition. [sent-126, score-0.21]
37 fpreotmiti P(s1, s2, x, j,π) n beeccaeusssaer we may intentionally8 bias data collection towards certain students, judges or source text. [sent-130, score-0.153]
38 data, we isolate the subset C0 of comparisons whose source text appears isne ta tC most k comparisons, where k is the smallest positive integer such that |C0| >= 2000. [sent-135, score-0.243]
39 We then cporesaitteiv teh ien tteegste sre stu uDch hfr thomat |CC0: D = {hs1, s2, πi|hs1, s2, x,j, πi ∈ C0} We reserve the remaining comparisons for training preference models. [sent-136, score-0.554]
40 Table 3 shows the resulting dataset sizes for each competition track. [sent-137, score-0.3]
41 Unlike with raw rankings, the claim that one preference model is better than another has testable implications. [sent-138, score-0.312]
42 We can then publish a system ranking based on the most trustworthy preference model. [sent-141, score-0.415]
43 5 Baselines Let’s begin then, and create some simple preference models to serve as baselines. [sent-142, score-0.344]
44 1 Uniform The simplest preference model is a uniform distribution over preferences, for any choice of students s1 s2: , Q(π|s1,s2) =31 ∀π ∈ Π This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i. [sent-144, score-0.684]
45 2 Adjusted Uniform Now suppose we have a set C of comparisons aNvoawilab sluep pfoors training. [sent-148, score-0.245]
46 L aet s Cπ ⊆ fC c odmenpoatreis otnhes subset of comparisons wLiteht preference π, oatned hleet 7In WMT, it certainly is not. [sent-149, score-0.522]
47 C(s1 , s2) denote the subset comparing students s1 aCn(ds s2. [sent-152, score-0.256]
48 1 Independent Pairs Another simple model is the direct estimation of each relative ability P(π|s1 , s2) independently. [sent-157, score-0.173]
49 In oetahcher words, f aobri eliatych P pair sof students s1 and s2, we estimate a separate preference distribution. [sent-158, score-0.616]
50 The maximum likelihood estimate of each distribution would be: Q(π|s1,s2) =|C|Cπ((ss11,,ss22))|| ++ | CC πˆ(s(2s,2s,1s)1|)| However the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. [sent-159, score-0.2]
51 To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. [sent-160, score-0.346]
52 It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. [sent-164, score-1.8]
53 QA( dπo|swnside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. [sent-167, score-0.526]
54 Notice the similarities between the universal ability estimates Q(π|s1) and ttwhee eBnO tJhAeR u ranking h aebuilritiysti ecs. [sent-170, score-0.325]
55 iTmhaetsees t Qhr(eπe| smodels are our attempt to render the BOJAR heuristic as preference models. [sent-171, score-0.363]
56 Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. [sent-178, score-0.498]
57 But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. [sent-179, score-0.795]
58 A model that only works when the students are all about the same level is not one we should rely on. [sent-182, score-0.256]
59 who a student has been paired with) to influence the estimation of the student abilities. [sent-185, score-0.612]
60 First, each student’s ability distribution is drawn from a common prior distribution. [sent-187, score-0.179]
61 Each item is authored by a student and has a quality drawn from the student’s ability distribution. [sent-189, score-0.677]
62 The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations. [sent-192, score-0.846]
63 The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. [sent-195, score-0.592]
64 In the example, we show the ability distributions for students 6 (an aboveaverage student, whose mean is 0. [sent-196, score-0.459]
65 3 (drawn from student 14’s ability distribution), while item 205 is not student 6’s best work (he produces a mean quality of 0. [sent-201, score-0.916]
66 A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item’s actual quality to get an observed quality. [sent-205, score-0.373]
67 For the first option (item 43), the judge draws a noise of -0. [sent-206, score-0.29]
68 For the second option (item 205), the judge draws a noise of 0. [sent-209, score-0.29]
69 rTahteh srtau d menetm ability tdhiest rdiibsuctrieotens are categorical distributions over {1, 2, . [sent-227, score-0.174]
70 , Λ}, and the student ability prior sis o a symmetric ,DΛir}ic,h alnetd dw tihthe strength αa. [sent-230, score-0.459]
71 Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, . [sent-231, score-0.282]
72 Given the parameter estimates, we obtain a preference model Q(π|s1 , s2) through the inference query: Pr(comp. [sent-240, score-0.312]
73 In particular, the Gaussian parameterization of our IRT models strongly resembles11 the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. [sent-254, score-0.486]
74 From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: “Give me a translation of the source text which is better than mine. [sent-255, score-0.406]
75 ” The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. [sent-256, score-0.373]
76 An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion. [sent-257, score-0.213]
77 The perplexity of the uniform preference model is 3. [sent-262, score-0.428]
78 8 Experiments We organized the competition data as described at the end of Section 4. [sent-264, score-0.3]
79 To compare the preference models, we did the following: • • • Randomly chose a subset of k comparRisoannsd mfrloym hthosee training set, kfor c km ∈ {100, 200, 400, 800, 1600, 3200}. [sent-265, score-0.312]
80 For each model and training size, we averaged the perplexities from 5 trials of each competition track. [sent-268, score-0.38]
81 This somewhat validates the organizers’ decision to exclude the references, particularly given WMT’s use of the BOJAR ranking heuristic (the nucleus of the Independent Student models) for its official rankings. [sent-276, score-0.154]
82 In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1 , s2) skews in favor of student s1, lighter if it skews in favor of student s2. [sent-290, score-1.316]
83 Figure 5: WMT10 model perplexities sourced versus expert training). [sent-291, score-0.157]
84 (crowd- The IRT models proved the most robust at handling judge noise. [sent-292, score-0.213]
85 We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than “expert”14 comparisons) for training. [sent-293, score-0.28]
86 Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. [sent-295, score-0.171]
87 This is rather impressive, since the crowdsourced judges agree only 46. [sent-297, score-0.19]
88 Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. [sent-304, score-0.322]
89 For training size 1600 of the WMT1 1 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. [sent-305, score-0.387]
90 The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). [sent-306, score-0.177]
91 According to IRT-Gaussian’s analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. [sent-308, score-0.421]
92 Viewing one of IRT-Gaussian’s induced preference models as a heatmap15 (Figure 6, right), four bands are discernable. [sent-310, score-0.344]
93 Next come students 2-7, followed by the slightly lighter (weaker) students 810, followed by the lightest (weakest) student 11. [sent-312, score-0.851]
94 In this paper, we showed how WMT can restore confidence in 15In the heatmap, cell (s1, s2) is darker ifpreference model Q(π|s1 , s2) skews in favor of student s1, lighter if it skews iQn (fπa|vsor of student s2. [sent-314, score-0.909]
95 Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like: 1. [sent-316, score-0.326]
96 Rather than the current anecdotal approach to comparing competition analyses (e. [sent-318, score-0.3]
97 How much of an impact does judge noise have on my conclusions? [sent-322, score-0.25]
98 We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models. [sent-323, score-0.291]
99 Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. [sent-327, score-0.374]
100 This suggests that we can elicit far fewer comparisons and still derive confident conclusions. [sent-328, score-0.21]
wordName wordTfidf (topN-words)
[('preference', 0.312), ('competition', 0.3), ('student', 0.29), ('students', 0.256), ('wmt', 0.252), ('comparisons', 0.21), ('irt', 0.195), ('judge', 0.181), ('ability', 0.135), ('item', 0.131), ('origwmt', 0.122), ('terrific', 0.122), ('judges', 0.12), ('ranking', 0.103), ('gaussian', 0.101), ('mislevy', 0.098), ('obs', 0.098), ('skews', 0.098), ('rankings', 0.088), ('translation', 0.086), ('perplexity', 0.085), ('lopez', 0.082), ('pj', 0.08), ('bojar', 0.08), ('thurstone', 0.073), ('crowdsourced', 0.07), ('px', 0.07), ('noise', 0.069), ('abilities', 0.068), ('parameterization', 0.068), ('organizers', 0.068), ('preferences', 0.063), ('competitors', 0.06), ('geman', 0.06), ('versus', 0.06), ('radius', 0.056), ('translations', 0.054), ('elicitation', 0.054), ('independent', 0.054), ('geometric', 0.052), ('heuristic', 0.051), ('ll', 0.05), ('lighter', 0.049), ('win', 0.049), ('almond', 0.049), ('hambleton', 0.049), ('heatmap', 0.049), ('hpx', 0.049), ('linden', 0.049), ('luce', 0.049), ('rasch', 0.049), ('sdl', 0.049), ('tshpea', 0.049), ('expert', 0.049), ('estimate', 0.048), ('perplexities', 0.048), ('bradley', 0.048), ('ties', 0.047), ('pairwise', 0.046), ('universal', 0.046), ('favor', 0.046), ('drawn', 0.044), ('kullback', 0.043), ('impress', 0.043), ('deviation', 0.042), ('penalized', 0.041), ('quality', 0.041), ('estimates', 0.041), ('draws', 0.04), ('competitions', 0.04), ('distributions', 0.039), ('relative', 0.038), ('elicits', 0.038), ('darker', 0.038), ('response', 0.036), ('concern', 0.036), ('cc', 0.036), ('authored', 0.036), ('suppose', 0.035), ('marginal', 0.035), ('track', 0.034), ('strength', 0.034), ('independence', 0.033), ('reconstruct', 0.033), ('tso', 0.033), ('source', 0.033), ('mass', 0.032), ('paired', 0.032), ('models', 0.032), ('trials', 0.032), ('reserve', 0.032), ('plate', 0.032), ('uniform', 0.031), ('question', 0.031), ('theoretic', 0.031), ('analytical', 0.031), ('judged', 0.031), ('findings', 0.031), ('close', 0.03), ('mean', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 250 acl-2013-Models of Translation Competitions
Author: Mark Hopkins ; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT). 1 The WMT Translation Competition Every year, the Workshop on Machine Transla- , tion (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. For each track, the organizers also assemble a panel of judges, typically machine translation specialists.1 The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example2 where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas.” Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best? 1Although in recent competitions, some ofthejudging has also been crowdsourced (Callison-Burch et al., 2010). 2The example does not use actual system output. jmay} @ sdl . com Table21r:a(451tniekW)MsTuycbejskhmtdeiunltmics“Hp r“eHt derfa eongris densolacstneogi tnsog.”bto. y”asking judges to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French “Il ne va pas.” ble 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system. 2 A Ranking Problem For several years, WMT used the following heuristic for ranking the translation systems: ORIGWMT(s) =win(sw)in +(s ti)e( +s t)ie +(s lo)ss(s) For system s, win (s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred. Recently, (Bojar et al., 2011) questioned the adequacy of this heuristic through the following ar1416 Proce dingsS o f ita h,e B 5u1lgsta Arinan,u Aaulg Musete 4ti-n9g 2 o0f1 t3h.e ? Ac s2s0o1ci3a Atiosnso fcoirat Cio nm foprut Caotimonpaulta Lti nognuails Lti cnsg,u piasgteics 1416–1424, gument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C. 
This vulnerability prompted (Bojar et al., 2011) to offer the following revision: BOJAR(s) =win(sw)in +(s lo)ss(s) The following year, it was BOJAR’s turn to be criticized, this time by (Lopez, 2012): Superficially, this appears to be an improvement....couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn’t a system be rewarded simply by being compared against a bad system more frequently than its competitors? Lopez’s concern, while reasonable, is less obviously damning than (Bojar et al., 2011)’s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver. The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez’s ranking scheme and provided a variant called Most Proba- ble Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than other? Or is this even the right question to ask? 3 A Problem with Rankings Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We’ll call these close1, close2, and close3. Nevertheless, close1 is very slightly better3 than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking? terrific, close3, close1, close2 close1, terrific, close2, close3 (1) (2) Spearman’s rho4 would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman’s rho. The problem is the disconnnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information they lack, and define more rigorously what we hope to learn from a translation competition. 4 From Rankings to Relative Ability Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all otrfan trsalnatsiloanti systems. Hereafter, we hwei lslp raecfeer o tfo Sll as nthslea space ostfe smtus.de Hntesr. a Wftee c,h woeos wei ltlh ires teerrm to t So evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by “ability”? Before formally defining this term, first recognize that it means little without context, namely: 3What does “better” mean? We’ll return to this question. 
4Or Pearson’s correlation coefficient. 1417 1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents otherwise the question is underspecified. – 2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (CallisonBurch et al., 2012) suggests otherwise. It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Posteditors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means. With this in mind, let’s define some additional elements of a translation competition. Let X be the space osf o afll a possible segments toitfi source text, J h bee tshpea space lolf p paolls possible judges, fa snodu rΠc = {0, 1, 2} bthee tshpea space ol fp pairwise d pgreesf,e arenndc Πes=. 5 W0,e1 assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variaSb,l ev j represents a judge af sroemgm J, ta fnrod mva Xria,b vlea π represents a preference fero fmro mΠ. J Moreover, adbelfein πe the negation ˆπ of preference π such that ˆπ = 2 (if π = 1), ˆπ = 1(if π = 2), and ˆπ = 0 (if π = 0). Now assume a joint distribution P(s1, s2, x, j,π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words: P(s1, s2, x, j,π) = P(π|s1, s2, x,j) · P(x|s1, s2, j) = ·P(j|s1,s2) · P(s1,s2) P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2) = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x,j) X(x) 5As a reminder, 0 indicates no preference. It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity: P(π|s1, s2) = X XPX(x) · PJ(j) · P(π|s1,s2,x,j) Xx∈X Xj∈J We refer to this as the hPX, PJi-relative ability of Wstued reenftesr s1 hanisd a s2. By using d-rifeflearteinvet marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). With this machinery, we can express the purpose of a translation competition more clearly: to estimate the hPX, PJi-relative ability of a set toof eststuidmenattes. Ien h Pthe case orefl WMT, PJ presumably6 defines a space of competent source-totarget bilingual speakers, while PX defines a space of newswire documents. We’ll refer to an estimate of P(π|s1 , s2) as a preference rm toode anl. Istni moattheer o words, a prefer- ence model is a distribution Q(π|s1 , s2). 
Given a set of pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1, s2) such that Q is "close" to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 1951), but we cannot use it here because P is unknown. Fortunately, if we have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this held-out test data. Let D be a sequence of triples ⟨s1, s2, π⟩ where the preferences π are i.i.d. samples from P(π|s1, s2). The perplexity of preference model Q on test data D is:

perplexity(Q|D) = 2^( −(1/|D|) Σ_{⟨s1,s2,π⟩∈D} log2 Q(π|s1, s2) )

How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. Let C be the set of comparisons ⟨s1, s2, x, j, π⟩ obtained from a translation competition. Competition data C is not necessarily sampled i.i.d. from P(s1, s2, x, j, π) (in WMT, it certainly is not), because we may intentionally bias data collection towards certain students, judges, or source text (to collect judge agreement statistics, for instance). Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons.

To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C′ of comparisons whose source text appears in C in at most k comparisons, where k is the smallest positive integer such that |C′| ≥ 2000. We then create the test set D from C′:

D = {⟨s1, s2, π⟩ | ⟨s1, s2, x, j, π⟩ ∈ C′}

We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track.

Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative (as opposed to philosophical) answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model.

5 Baselines

Let's begin then, and create some simple preference models to serve as baselines.

5.1 Uniform

The simplest preference model is a uniform distribution over preferences, for any choice of students s1, s2:

Q(π|s1, s2) = 1/3   ∀π ∈ Π

This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e., equal to the number of possible preferences).

5.2 Adjusted Uniform

Now suppose we have a set C of comparisons available for training. Let Cπ ⊆ C denote the subset of comparisons with preference π, and let C(s1, s2) denote the subset comparing students s1 and s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e., preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences:

Q(π|s1, s2) = |C0| / |C|             if π = 0
Q(π|s1, s2) = (1 − |C0| / |C|) / 2   otherwise
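Here is a minimal sketch of the perplexity evaluation and of these two baselines. The representation of comparisons as plain (s1, s2, π) triples and all function names are our own illustration, not the paper's code.

```python
import math

# Perplexity of a preference model Q on held-out triples D, as defined
# above. Q(pi, s1, s2) returns the model probability of preference pi;
# D is a list of (s1, s2, pi) triples.
def perplexity(Q, D):
    avg_log2 = sum(math.log2(Q(pi, s1, s2)) for s1, s2, pi in D) / len(D)
    return 2.0 ** (-avg_log2)

# Section 5.1: the uniform baseline. Its perplexity is always 3.
def uniform(pi, s1, s2):
    return 1.0 / 3.0

# Section 5.2: the adjusted-uniform baseline, which estimates the tie
# rate from training comparisons C and splits the rest evenly. (If the
# tie rate were ever exactly 0 or 1, a smoothed variant would be needed
# to avoid zero probabilities.)
def make_adjusted_uniform(C):
    tie_rate = sum(1 for _, _, pi in C if pi == 0) / len(C)
    def Q(pi, s1, s2):
        return tie_rate if pi == 0 else (1.0 - tie_rate) / 2.0
    return Q
```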
6 Simple Bayesian Models

6.1 Independent Pairs

Another simple model is the direct estimation of each relative ability P(π|s1, s2) independently. In other words, for each pair of students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be:

Q(π|s1, s2) = (|Cπ(s1, s2)| + |Cπ̂(s2, s1)|) / (|C(s1, s2)| + |C(s2, s1)|)

However, the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate:

Q(π|s1, s2) = (α + |Cπ(s1, s2)| + |Cπ̂(s2, s1)|) / (3α + |C(s1, s2)| + |C(s2, s1)|)

We call this the Independent Pairs preference model.

6.2 Independent Students

The Independent Pairs model makes a strong independence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1, s2) of students s1 and s2, we could instead try to estimate the universal ability

P(π|s1) = Σ_{s2∈S} P(π|s1, s2) · P(s2|s1)

of each individual student s1 and then try to reconstruct the relative abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each preference distribution, which gives us the following Bayesian estimate:

Q(π|s1) = (α + Σ_{s2∈S} |Cπ(s1, s2)| + |Cπ̂(s2, s1)|) / (3α + Σ_{s2∈S} |C(s1, s2)| + |C(s2, s1)|)

The estimates Q(π|s1) do not yet constitute a preference model. A downside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2:

Q(π|s1, s2) = Q(π|s1)

The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities:

Q(π|s1, s2) = (Q(π|s1) + Q(π̂|s2)) / 2
Q(π|s1, s2) = [Q(π|s1) · Q(π̂|s2)]^(1/2)

We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and the BOJAR ranking heuristic. These three models are our attempt to render the BOJAR heuristic as preference models.
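The Independent Pairs estimate of Section 6.1 is straightforward to implement from its counts. The following sketch (again ours, with invented function names) smooths the pair-specific counts with the symmetric Dirichlet prior described above.

```python
from collections import Counter

# Preference negation: 1 <-> 2, 0 -> 0.
def negate(pi):
    return {0: 0, 1: 2, 2: 1}[pi]

# Dirichlet-smoothed Independent Pairs estimate.
# C: training comparisons as (s1, s2, pi) triples; alpha: prior strength.
def make_independent_pairs(C, alpha=1.0):
    counts = Counter()   # counts[(s1, s2, pi)] = |C_pi(s1, s2)|
    totals = Counter()   # totals[(s1, s2)]     = |C(s1, s2)|
    for s1, s2, pi in C:
        counts[(s1, s2, pi)] += 1
        totals[(s1, s2)] += 1

    def Q(pi, s1, s2):
        num = alpha + counts[(s1, s2, pi)] + counts[(s2, s1, negate(pi))]
        den = 3 * alpha + totals[(s1, s2)] + totals[(s2, s1)]
        return num / den
    return Q
```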
7 Item-Response Theoretic (IRT) Models

Let's revisit (Lopez, 2012)'s objection to the BOJAR ranking heuristic: "...couldn't a system still be penalized simply by being compared to [good systems] more frequently than its competitors?" The official WMT 2012 findings (Callison-Burch et al., 2012) echo this concern in justifying the exclusion of reference translations from the 2012 competition:

"[W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often."

Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012's decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on.

We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g., who a student has been paired with) to influence the estimation of the student abilities. Figure 1 (left) shows plate notation (Koller and Friedman, 2009) for the model's independence structure. First, each student's ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student's ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations.

[Figure 1: Plate notation (left) showing the independence structure of the IRT models. Example instantiated subnetwork (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons.]

We investigate two parameterizations of this model: Gaussian and categorical. Figure 1 (right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an above-average student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14's ability distribution), while item 205 is not student 6's best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1 pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item's actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision radius (which here is 0.5), then he states no preference (i.e., a preference of 0). Otherwise he prefers the item with the higher observed quality.

The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. The student ability distributions are categorical distributions over {1, 2, ..., Λ}, and the student ability prior is a symmetric Dirichlet with strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ − λ}. Noise ν is drawn from a discretized zero-mean Gaussian with standard deviation σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs).
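The Gaussian parameterization's generative story is easy to simulate. The sketch below hard-codes the quantities of the running example (item qualities, noise draws, and the 0.5 decision radius) rather than sampling them; it illustrates only the judge's decision rule, not the Gibbs inference used to fit the model.

```python
# Judge decision rule of the Gaussian IRT parameterization: add observation
# noise to each item's latent quality, then compare within a decision radius.
def judge_preference(q1, q2, noise1, noise2, radius):
    obs1, obs2 = q1 + noise1, q2 + noise2   # observed qualities
    if abs(obs1 - obs2) < radius:
        return 0                            # within radius: no preference
    return 1 if obs1 > obs2 else 2

# The running example: item 43 (student 14) vs. item 205 (student 6).
quality_43, quality_205 = -0.3, 0.2
noise_43, noise_205 = -0.12, 0.15           # the judge's noise draws

pref = judge_preference(quality_43, quality_205,
                        noise_43, noise_205, radius=0.5)
print(pref)  # 2 -- observed -0.42 vs. 0.35 differ by more than 0.5,
             # so the judge prefers the second option (item 205)
```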
We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently for both parameterizations (we ran 200 iterations with a burn-in of 50). Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query:

Pr(comp.c′.pref = π | item.i′.author = s1, item.i′′.author = s2, comp.c′.opt1 = i′, comp.c′.opt2 = i′′)

where c′, i′, i′′ are new comparison and item ids that do not appear in the training data.

We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. (These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999).)

From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: "Give me a translation of the source text which is better than mine." The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion.

8 Experiments

We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following:

• Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200} (if k was greater than the total number of training comparisons, then we took the entire set).
• Trained the preference model on these comparisons.
• Evaluated the perplexity of the trained model on the test preferences, as described in Section 4.

For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size. These graphs are shown in Figure 2 (WMT10), Figure 3 (WMT11), and Figure 4 (WMT12). Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows: the Dirichlet strength for each baseline was 1; for IRT-Gaussian, σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4; for IRT-Categorical, Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0.

[Figure 2: WMT10 model perplexities as a function of training size (number of comparisons). The perplexity of the uniform preference model is 3.0 for all training sizes.]
[Figure 3: WMT11 model perplexities.]
[Figure 4: WMT12 model perplexities.]

For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Students models. This somewhat validates the organizers' decision to exclude the references, particularly given WMT's use of the BOJAR ranking heuristic (the nucleus of the Independent Students models) for its official rankings.
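A compact sketch of this evaluation protocol, assuming a trainer make_model that maps a list of comparisons to a preference model Q (as in the earlier baseline sketches) and the perplexity function sketched earlier; both names and the overall structure are ours, for illustration.

```python
import random

# For each training size k, train on a random subset of the training
# comparisons and measure test perplexity, averaging over several trials.
def learning_curve(make_model, train, test, trials=5,
                   sizes=(100, 200, 400, 800, 1600, 3200)):
    curve = {}
    for k in sizes:
        scores = []
        for _ in range(trials):
            # If k exceeds the training set, take the entire set.
            subset = random.sample(train, min(k, len(train)))
            Q = make_model(subset)
            scores.append(perplexity(Q, test))  # defined in earlier sketch
        curve[k] = sum(scores) / len(scores)
    return curve
```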
The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than the "expert" comparisons, i.e., those of machine translation specialists) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010).

[Figure 5: WMT10 model perplexities (crowdsourced versus expert training).]

Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by (Lopez, 2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian's analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian's induced preference models as a heatmap (Figure 6, right), four bands are discernible. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11.

[Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one standard deviation of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.]
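As an illustration of reading a ranking, and its near-ties, off the fitted model, the sketch below orders students by estimated ability means and flags adjacent pairs whose gap falls within one standard deviation of the trial-to-trial variation. The numbers are invented stand-ins, not the actual WMT11 estimates.

```python
import statistics

# Illustrative post-processing (not from the paper): ability means from
# the 5 trials, per student. The values below are made up to mirror the
# near-tie among online-B, cu-bojar, and cu-marecek described above.
ability_samples = {
    "online-B":   [0.41, 0.39, 0.44, 0.40, 0.42],
    "cu-bojar":   [0.40, 0.43, 0.38, 0.41, 0.39],
    "cu-marecek": [0.38, 0.42, 0.40, 0.37, 0.41],
    "weak-sys":   [-0.55, -0.60, -0.57, -0.58, -0.61],
}

means = {s: statistics.mean(v) for s, v in ability_samples.items()}
stdevs = {s: statistics.stdev(v) for s, v in ability_samples.items()}
ranking = sorted(means, key=means.get, reverse=True)

for a, b in zip(ranking, ranking[1:]):
    gap = means[a] - means[b]
    if gap < max(stdevs[a], stdevs[b]):
        print(f"{a} vs {b}: ordering essentially arbitrary (gap {gap:.3f})")
```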
9 Conclusion

WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions: by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like:

1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g., presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data.

2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models. However, IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons.

3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided.

References

F.B. Baker. 2001. The basics of item response theory. ERIC.

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

R.K. Hambleton. 1991. Fundamentals of item response theory, volume 2. Sage Publications, Incorporated.

D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT.

R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons.

R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434.

R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437–446. Morgan Kaufmann Publishers Inc.

G. Rasch. 1960. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests.

Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273–286.

W.J. van der Linden and R.K. Hambleton. 1996. Handbook of Modern Item Response Theory. Springer.
2 0.13773829 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Author: Guillaume Wisniewski
Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decision based on a single rating of each example.
3 0.12698819 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language directfioorn. t Teh Een quality Rofu generated utraagnsel datiiroencswas assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
4 0.10054207 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1
Author: Trevor Cohn ; Lucia Specia
Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.
6 0.086297892 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
7 0.083465345 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation
8 0.082853891 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics
9 0.080978006 107 acl-2013-Deceptive Answer Prediction with User Preference Graph
10 0.077933133 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
11 0.069220662 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
12 0.069134198 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
13 0.067293376 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
14 0.066450126 297 acl-2013-Recognizing Partial Textual Entailment
15 0.065910339 121 acl-2013-Discovering User Interactions in Ideological Discussions
16 0.063182376 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
17 0.062629342 289 acl-2013-QuEst - A translation quality estimation framework
18 0.061466079 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
19 0.059402183 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
20 0.058497265 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
topicId topicWeight
[(0, 0.17), (1, -0.014), (2, 0.093), (3, -0.031), (4, -0.005), (5, -0.027), (6, 0.026), (7, -0.056), (8, 0.007), (9, 0.024), (10, -0.006), (11, 0.042), (12, -0.066), (13, -0.011), (14, -0.06), (15, -0.011), (16, -0.059), (17, 0.008), (18, 0.034), (19, 0.016), (20, 0.055), (21, -0.019), (22, -0.033), (23, -0.039), (24, -0.047), (25, 0.013), (26, -0.076), (27, 0.054), (28, 0.053), (29, 0.021), (30, -0.003), (31, -0.136), (32, 0.075), (33, 0.015), (34, -0.002), (35, -0.039), (36, 0.057), (37, 0.021), (38, -0.131), (39, -0.006), (40, 0.014), (41, -0.02), (42, -0.066), (43, -0.028), (44, -0.051), (45, 0.02), (46, -0.058), (47, 0.068), (48, 0.01), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.93363369 250 acl-2013-Models of Translation Competitions
Author: Mark Hopkins ; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT). 1 The WMT Translation Competition Every year, the Workshop on Machine Transla- , tion (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. For each track, the organizers also assemble a panel of judges, typically machine translation specialists.1 The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example2 where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas.” Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best? 1Although in recent competitions, some ofthejudging has also been crowdsourced (Callison-Burch et al., 2010). 2The example does not use actual system output. jmay} @ sdl . com Table21r:a(451tniekW)MsTuycbejskhmtdeiunltmics“Hp r“eHt derfa eongris densolacstneogi tnsog.”bto. y”asking judges to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French “Il ne va pas.” ble 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system. 2 A Ranking Problem For several years, WMT used the following heuristic for ranking the translation systems: ORIGWMT(s) =win(sw)in +(s ti)e( +s t)ie +(s lo)ss(s) For system s, win (s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred. Recently, (Bojar et al., 2011) questioned the adequacy of this heuristic through the following ar1416 Proce dingsS o f ita h,e B 5u1lgsta Arinan,u Aaulg Musete 4ti-n9g 2 o0f1 t3h.e ? Ac s2s0o1ci3a Atiosnso fcoirat Cio nm foprut Caotimonpaulta Lti nognuails Lti cnsg,u piasgteics 1416–1424, gument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C. 
This vulnerability prompted (Bojar et al., 2011) to offer the following revision: BOJAR(s) =win(sw)in +(s lo)ss(s) The following year, it was BOJAR’s turn to be criticized, this time by (Lopez, 2012): Superficially, this appears to be an improvement....couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn’t a system be rewarded simply by being compared against a bad system more frequently than its competitors? Lopez’s concern, while reasonable, is less obviously damning than (Bojar et al., 2011)’s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver. The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez’s ranking scheme and provided a variant called Most Proba- ble Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than other? Or is this even the right question to ask? 3 A Problem with Rankings Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We’ll call these close1, close2, and close3. Nevertheless, close1 is very slightly better3 than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking? terrific, close3, close1, close2 close1, terrific, close2, close3 (1) (2) Spearman’s rho4 would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman’s rho. The problem is the disconnnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information they lack, and define more rigorously what we hope to learn from a translation competition. 4 From Rankings to Relative Ability Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all otrfan trsalnatsiloanti systems. Hereafter, we hwei lslp raecfeer o tfo Sll as nthslea space ostfe smtus.de Hntesr. a Wftee c,h woeos wei ltlh ires teerrm to t So evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by “ability”? Before formally defining this term, first recognize that it means little without context, namely: 3What does “better” mean? We’ll return to this question. 
4Or Pearson’s correlation coefficient. 1417 1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents otherwise the question is underspecified. – 2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (CallisonBurch et al., 2012) suggests otherwise. It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Posteditors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means. With this in mind, let’s define some additional elements of a translation competition. Let X be the space osf o afll a possible segments toitfi source text, J h bee tshpea space lolf p paolls possible judges, fa snodu rΠc = {0, 1, 2} bthee tshpea space ol fp pairwise d pgreesf,e arenndc Πes=. 5 W0,e1 assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variaSb,l ev j represents a judge af sroemgm J, ta fnrod mva Xria,b vlea π represents a preference fero fmro mΠ. J Moreover, adbelfein πe the negation ˆπ of preference π such that ˆπ = 2 (if π = 1), ˆπ = 1(if π = 2), and ˆπ = 0 (if π = 0). Now assume a joint distribution P(s1, s2, x, j,π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words: P(s1, s2, x, j,π) = P(π|s1, s2, x,j) · P(x|s1, s2, j) = ·P(j|s1,s2) · P(s1,s2) P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2) = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x,j) X(x) 5As a reminder, 0 indicates no preference. It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity: P(π|s1, s2) = X XPX(x) · PJ(j) · P(π|s1,s2,x,j) Xx∈X Xj∈J We refer to this as the hPX, PJi-relative ability of Wstued reenftesr s1 hanisd a s2. By using d-rifeflearteinvet marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). With this machinery, we can express the purpose of a translation competition more clearly: to estimate the hPX, PJi-relative ability of a set toof eststuidmenattes. Ien h Pthe case orefl WMT, PJ presumably6 defines a space of competent source-totarget bilingual speakers, while PX defines a space of newswire documents. We’ll refer to an estimate of P(π|s1 , s2) as a preference rm toode anl. Istni moattheer o words, a prefer- ence model is a distribution Q(π|s1 , s2). 
Given a cseet moofd pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1 , s2) such that Q is “close” to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 195 1), but we cannot use it here because P is unknown. Fortunately, ifwe have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this heldout test data. Let D be a sequence of triples hs1, s2, πi wteshter dea tah.e L preferences π are i o.if.d t.r samples fr,oπmi P(π|s1 , s2). The perplexity of preference model Q on stest data D is: perplexity(Q|D) = 2−Phs1,s2,πi∈D |D1|log2Q(π|s1,s2) How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. 6One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community. 1418 Let C be the set of comparisons hs1, s2, x, j,πi Lobettai Cne bde f trhoem s a t orfan csolamtipoanr competition. ,Cjo,mπipetition data C is not necessarily7 sampled i.i.d. fpreotmiti P(s1, s2, x, j,π) n beeccaeusssaer we may intentionally8 bias data collection towards certain students, judges or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons. To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C0 of comparisons whose source text appears isne ta tC most k comparisons, where k is the smallest positive integer such that |C0| >= 2000. We then cporesaitteiv teh ien tteegste sre stu uDch hfr thomat |CC0: D = {hs1, s2, πi|hs1, s2, x,j, πi ∈ C0} We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track. Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative9 answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model. 5 Baselines Let’s begin then, and create some simple preference models to serve as baselines. 5.1 Uniform The simplest preference model is a uniform distribution over preferences, for any choice of students s1 s2: , Q(π|s1,s2) =31 ∀π ∈ Π This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e. equal to number of possible preferences). 5.2 Adjusted Uniform Now suppose we have a set C of comparisons aNvoawilab sluep pfoors training. L aet s Cπ ⊆ fC c odmenpoatreis otnhes subset of comparisons wLiteht preference π, oatned hleet 7In WMT, it certainly is not. 8To collect judge agreement statistics, for instance. 9As opposed to philosophical. C(s1 , s2) denote the subset comparing students s1 aCn(ds s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e. preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences: 6SQim(pπ|lse1B,sa2y)e=sia n1M−o2d|C Ce0| lsiofthπer=wi0se 6.1 Independent Pairs Another simple model is the direct estimation of each relative ability P(π|s1 , s2) independently. 
In oetahcher words, f aobri eliatych P pair sof students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be: Q(π|s1,s2) =|C|Cπ((ss11,,ss22))|| ++ | CC πˆ(s(2s,2s,1s)1|)| However the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate: Q(π|s1,s2) =α3α + + |C |πC((ss11,,ss22))|| + + | |CC πˆ((ss22,,ss11))|| We call this the Independent model. Pairs preference 6.2 Independent Students The Independent Pairs model makes a strong inde- pendence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1 , s2) of students s1 and s2, we ctoivueld a binilsittyead P Ptry tso estimate the universal ability P(π|s1) Ps2∈S P(π|s1, s2) · P(s2|s1) of ietaych P i(nπd|sividual sPtud∈enSt s1 πa|nsd the)n try tso reconstruct the relativeP abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each = 1419 preference distribution, which gives us the following Bayesian estimate: Q(π|s1) =α3α + +PPs2s∈2S∈|SC|πC( s 1 , s 2 ) | + + | CCˆ π( s 2 , s 1 ) | The estimates Q(π|Ps1) do not yet constitute a preference mimoadteesl. QA( dπo|swnside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2: Q(π|s1, s2) = Q(π|s1) The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities: Q(π|s1,s2) Q(π|s1, s2) = Q(π|s1) +2 Q( πˆ|s2) = [Q(π|s1) ∗ Q(ˆ π|s2)]21 We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and ttwhee eBnO tJhAeR u ranking h aebuilritiysti ecs. iTmhaetsees t Qhr(eπe| smodels are our attempt to render the BOJAR heuristic as preference models. 7 Item-Response Theoretic (IRT) Models Let’s revisit (Lopez, 2012)’s objection to the BO- JAR ranking heuristic: “...couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors?” The official WMT 2012 findings (Callison-Burch et al., 2012) echoes this concern in justifying the exclusion of reference translations from the 2012 competition: [W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often. Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012’s decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. 
If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on. We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g. who a student has been paired with) to influence the estimation of the student abilities. Figure 1(left) shows plate notation (Koller and Friedman, 2009) for the model’s independence structure. First, each student’s ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student’s ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations. We investigate two parameterizations of this model: Gaussian and categorical. Figure 1(right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an aboveaverage student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14’s ability distribution), while item 205 is not student 6’s best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item’s actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision 1420 Figure 1: Plate notation (left) showing the independence tiated subnetwork structure of the IRT Models. Example instan- (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons. radius (which here is 0.5), then he states no preference (i.e. a preference of 0). Otherwise he prefers the item with the higher observed quality. The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. rTahteh srtau d menetm ability tdhiest rdiibsuctrieotens are categorical distributions over {1, 2, ..., Λ}, and the student ability prior sis o a symmetric ,DΛir}ic,h alnetd dw tihthe strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ λ}. Noise ν is drawn from a di∈scre {ti1ze −d zero-mean λG}a.u Nssoiisaen wν i sth d srtaawndna frrdo mdev ai daitsiocnre σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs). 
aWuses ieasntim Na(0te,dσ the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently10 for both parameterizations. Given the parameter estimates, we obtain a preference model Q(π|s1 , s2) through the inference query: Pr(comp.c0.pref = π | item.i0.author = s1, item.i00.author = s2 , comp.c0.opt1 = i0, comp.c0.opt2 = i00) − 10We ran 200 iterations with a burn-in of 50. where c0, i0, i00 are new comparison and item ids that do not appear in the training data. We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991 ; van der Linden and Hambleton, 1996; Baker, 2001) literature. Itemresponse theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles11 the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: “Give me a translation of the source text which is better than mine.” The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion. 11These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999). 1421 (number of comparisons). Figure 2: WMT10 model perplexities. The perplexity of the uniform preference model is 3.0 for all training sizes. 8 Experiments We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following: • • • Randomly chose a subset of k comparRisoannsd mfrloym hthosee training set, kfor c km ∈ {100, 200, 400, 800, 1600, 3200}.12 Trained the preference model on these comparisons. Evaluated the perplexity of the trained model on athluea tteedst t preferences, as dtheesc trriabienedd din m Soedec-l tion 4. For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size. These graphs are shown 12If k was greater than the total number of training comparisons, then we took the entire set. Figure 3: WMT1 1model perplexities. Figure 4: WMT12 model perplexities. in Figure 2 (WMT10)13, and Figure 4 (WMT12). For WMT10 and WMT1 1, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Student models. This somewhat validates the organizers’ decision to exclude the references, particularly given WMT’s use of the BOJAR ranking heuristic (the nucleus of the Independent Student models) for its official rankings. 13Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows. 
The Dirichlet strength for each baseline was 1. For IRT-Gaussian: σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4. For IRT-Categorical: Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0. 1422 Figure 6: English-Czech WMT1 1 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one stddev of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1 , s2) skews in favor of student s1, lighter if it skews in favor of student s2. Figure 5: WMT10 model perplexities sourced versus expert training). (crowd- The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than “expert”14 comparisons) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among 14I.e., machine translation specialists. expert judges (Callison-Burch et al., 2010). Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT1 1 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by (Lopez, 2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian’s analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian’s induced preference models as a heatmap15 (Figure 6, right), four bands are discernable. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 810, followed by the lightest (weakest) student 11. 9 Conclusion WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in 15In the heatmap, cell (s1, s2) is darker ifpreference model Q(π|s1 , s2) skews in favor of student s1, lighter if it skews iQn (fπa|vsor of student s2. 1423 its conclusions – by shifting the focus from rank- ings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like: 1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g. presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data. 2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models. 
However, the IRTGaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons. 3. How many comparisons should Ielicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided. References F.B. Baker. 2001. The basics of item response theory. ERIC. Ondej Bojar, Milo sˇ Ercegov cˇevi ´c, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the wmt manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–1 1, Edinburgh, Scotland, July. Association for Computational Linguistics. Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324– 345. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010joint workshop on statistical machine trans- lation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17– 53. Association for Computational Linguistics. Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation. S. Geman and D. Geman. 1984. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741 . R.K. Hambleton. 1991 . Fundamentals of item response theory, volume 2. Sage Publications, Incorporated. D. Koller and N. Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT press. S. Kullback and R.A. Leibler. 195 1. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT. R. Ducan Luce. 1959. Individual Choice Behavior a Theoretical Analysis. John Wiley and sons. R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434. R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence, pages 437–446. Morgan Kaufmann Publishers Inc. G. Rasch. 1960. Studies in mathematical psychology: I. probabilistic models for some intelligence and attainment tests. Louis L Thurstone. 1927. A law of comparative judgment. Psychological review, 34(4):273–286. W.J. van der Linden and R.K. Hambleton. Handbook of modern item response Springer. 1424 1996. theory.