acl acl2013 acl2013-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. [sent-6, score-0.195]
2 We first define and quantify translation difficulty in terms of TDI. [sent-7, score-0.333]
3 We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. [sent-8, score-0.106]
4 We, rather, rely on cognitive evidences from eye tracking. [sent-9, score-0.112]
5 TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. [sent-10, score-0.22]
6 We then establish that TDI is correlated with three properties of the input sentence, viz. [sent-11, score-0.088]
7 length (L), degree of polysemy (DP) and structural complexity (SC). [sent-12, score-0.259]
8 The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. [sent-14, score-0.047]
9 The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. [sent-15, score-0.071]
10 This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. [sent-16, score-0.303]
11 This can also provide a way of monitoring progress of second language learners. [sent-17, score-0.033]
12 1 Introduction Difficulty in translation stems from the fact that most words are polysemous and sentences can be long and have complex structure. [sent-18, score-0.168]
13 While length of sentence is commonly used as a translation difficulty indicator, lexical and structural properties of a sentence also contribute to translation difficulty. [sent-19, score-0.715]
14 (length-8) Clearly, sentence 1 is more difficult to process and translate than sentence 2, since it has lexical ambiguity ( “Shoot” as an act of firing a shot or taking a photograph? [sent-25, score-0.33]
15 ) and structural ambiguity (Shot with a gun or policeman with a gun? [sent-26, score-0.225]
16 To produce fluent and adequate translations, effort has to be put into analyzing both the lexical and syntactic properties of the sentences. [sent-28, score-0.041]
17 The most recent work on studying translation difficulty is by Campbell and Hale (1999) who identified several areas of difficulty in lexis and grammar. [sent-29, score-0.528]
18 “Reading” researchers have focused on developing readability formulae, since 1970. [sent-30, score-0.122]
19 , 1975), the Fry Readability Formula (Fry, 1977) and the Dale-Chall readability formula (Chall and Dale, 1999) are popular and influential. [sent-32, score-0.174]
20 These formulae use factors such as vocabulary difficulty (or semantic factors) and sentence length (or syntactic factors). [sent-33, score-0.355]
21 (2012) correlate eye fixations and scanpaths of readers with sentence processing. [sent-35, score-0.299]
22 While these approaches are successful in quantifying readability, they may not be applicable to translation scenarios. [sent-36, score-0.183]
23 The reason is that translation is not merely a reading activity. [sent-37, score-0.22]
24 Currently, for domain specific Machine Translation systems, parallel corpora are gathered through translation crowdsourcing/outsourcing. [sent-41, score-0.138]
25 Our proposed Translation Difficulty Index (TDI) quantifies the translation difficulty of a sentence considering both lexical and structural properties. [sent-45, score-0.451]
26 This measure can, in turn, be used to cluster sentences according to their difficulty levels (viz. [sent-46, score-0.225]
27 For example, appropriate examples at particular levels of difficulty can be chosen for giving assignments and monitoring progress. [sent-50, score-0.228]
28 Section 2 describes TDI as a function of translation processing time. [sent-52, score-0.138]
29 Section 3 is on measuring translation processing time through eye tracking. [sent-53, score-0.315]
30 Section 4 gives the correlation of linguistic complexity with observed TDI. [sent-54, score-0.097]
31 In section 5, we describe a technique for predicting TDIs and ranking unseen sentences using Support Vector Machines. [sent-55, score-0.063]
32 2 Quantifying Translation Difficulty As a first approximation, TDI of a sentence can be the time taken to translate the sentence, which can be measured through simple translation experiments. [sent-57, score-0.373]
33 This is based on the assumption that more difficult sentences will require more time to translate. [sent-58, score-0.096]
34 However, “time taken to translate” may not be strongly related to the translation difficulty for two reasons. [sent-59, score-0.372]
35 First, it is difficult to know what fraction of the total translation time is actually spent on the translation-related-thinking. [sent-60, score-0.204]
36 For example, translators may spend a considerable amount of time typing/writing translations, which is irrelevant to the translation difficulty. [sent-61, score-0.318]
37 Second, the translation time is sensitive to distractions from the environment. [sent-62, score-0.177]
38 So, instead of the “time taken to translate”, we are more interested in the “time for which translation related processing is carried out by the brain”. [sent-63, score-0.177]
39 Mathematically, $T_p = T_{p\_comp} + T_{p\_gen}$ (1) where $T_{p\_comp}$ and $T_{p\_gen}$ are the processing times for source text comprehension and target text generation, respectively. [sent-65, score-0.219]
40 The empirical TDI is computed by normalizing $T_p$ with sentence length. [sent-66, score-0.047]
41 $TDI = \frac{T_p}{\text{sentence length}}$ (2) Measuring $T_p$ is a difficult task as translators often switch between thinking and writing activities. [sent-67, score-0.133]
42 3 Measuring Tp by eye-tracking We measure Tp by analyzing the gaze behavior of translators through eye-tracking. [sent-69, score-0.293]
43 The rationale behind using eye-tracking is that humans spend time on what they see, and this “time” is correlated with the complexity of the information being processed, as shown in Figure 1. [sent-70, score-0.165]
44 Two fundamental components of eye behavior are (a) Gaze-fixation or simply, Fixation and (b) Saccade. [sent-71, score-0.143]
45 The former is a long stay of the visual gaze on a single location. [sent-72, score-0.156]
46 The latter is a very rapid movement of the eyes between positions of rest. [sent-73, score-0.055]
47 An intuitive feel for these two concepts can be had by considering the example of translating the sentence “The camera-man shot the policeman with a gun” mentioned in the introduction. [sent-74, score-0.325]
48 It is conceivable that the eye will linger long on the word “shot” which is ambiguous and will rapidly move across “shot”, “camera-man” and “gun” to ascertain the clue for disambiguation. [sent-75, score-0.112]
49 The terms Tp comp and Tp gen in (1) can now be looked upon as the sum of fixation and saccadic durations for both source and target sentences respectively. [sent-76, score-0.297]
50 Here, Fs and Ss correspond to sets of fixations and saccades for the source sentence, and Ft and St correspond to those for the target sentence, respectively. [sent-79, score-0.183]
51 dur is a function returning the duration of fixations and saccades. [sent-80, score-0.165]
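The definitions in equations (1) and (2), together with the description of Tp as a sum of fixation and saccade durations over source and target, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the duration values, and the choice of counting sentence length in words are all assumptions made here.

```python
from typing import Sequence


def processing_time(
    fix_src: Sequence[float],
    sac_src: Sequence[float],
    fix_tgt: Sequence[float],
    sac_tgt: Sequence[float],
) -> float:
    """Tp = Tp_comp + Tp_gen, each a sum of fixation and saccade durations."""
    t_comp = sum(fix_src) + sum(sac_src)  # source-side (comprehension) time
    t_gen = sum(fix_tgt) + sum(sac_tgt)   # target-side (generation) time
    return t_comp + t_gen


def empirical_tdi(tp: float, sentence_length: int) -> float:
    """Equation (2): Tp normalized by sentence length (assumed: word count)."""
    if sentence_length <= 0:
        raise ValueError("sentence_length must be positive")
    return tp / sentence_length


# Hypothetical durations in milliseconds, invented for illustration only.
fs = [220.0, 310.0, 180.0]  # fixations on the source sentence
ss = [30.0, 25.0]           # saccades on the source sentence
ft = [400.0, 260.0]         # fixations on the target sentence
st = [35.0]                 # saccades on the target sentence

tp = processing_time(fs, ss, ft, st)
tdi = empirical_tdi(tp, sentence_length=8)
```

Any further scaling of the raw value (for example, capping at an upper bound) is not modeled in this sketch.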
52 1 Computing TDI using eye-tracking database We obtained TDIs for a set of sentences from the Translation Process Research Database (TPR 1. [sent-82, score-0.057]
53 The database contains translation studies for which gaze data is recorded through the Translog software (Carl, 2012). [sent-84, score-0.321]
54 Out of the 57 available sessions, we selected 40 translation sessions comprising 80 sentence translations (footnote 2). [sent-86, score-0.217]
55 The translators were young professional linguists or students pursuing a PhD in linguistics. [sent-89, score-0.106]
56 To correct this, we applied automatic error correction technique (Mishra et al. [sent-91, score-0.062]
57 Note that, gaze and saccadic durations may also depend on the translator’s reading speed. [sent-93, score-0.328]
58 We tried to rule out this effect by sampling out translations for which the variance in participant’s reading speed is minimum. [sent-94, score-0.122]
59 Variance in reading speed was calculated after taking a sample of source text for each participant and measuring the time taken to read the text. [sent-95, score-0.213]
60 After preprocessing the data, TDI was computed for each sentence by using (2) and (3). [sent-96, score-0.047]
61 Footnote 2: 20% of the translation sessions were discarded as it was difficult to rectify the gaze logs for these sessions. [sent-102, score-0.353]
62 Footnote 3: Anything beyond the upper bound is hard to translate and can be assigned with the maximum score. [sent-103, score-0.113]
63 If the “time taken to translate” and Tp were strongly correlated, we would rather have opted for “time taken to translate” as the measurement of TDI. [sent-105, score-0.078]
64 The reason is that “time taken to translate” is relatively easy to compute and does not require expensive setup for conducting “eye-tracking” experiments. [sent-106, score-0.068]
65 But our experiments show that there is a weak correlation (coefficient = 0. [sent-107, score-0.053]
66 4 Relating TDI to sentence features Our claim is that translation difficulty is mainly caused by three features: Length, Degree of Polysemy and Structural Complexity. [sent-110, score-0.38]
67 2 Degree of Polysemy (DP) The degree of polysemy of a sentence is the sum of the senses that each word has in WordNet, normalized by the sentence length. [sent-114, score-0.162]
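The DP definition reduces to a sum of per-word sense counts divided by the number of words. The sketch below assumes the sense counts are supplied as a plain dictionary; the counts shown are invented placeholders, not real WordNet figures, and treating a word missing from the dictionary as having zero senses is an assumption of this example.

```python
from typing import Mapping, Sequence


def degree_of_polysemy(tokens: Sequence[str], sense_counts: Mapping[str, int]) -> float:
    """Sum of per-word sense counts divided by sentence length."""
    if not tokens:
        raise ValueError("sentence must contain at least one token")
    # Assumption: words absent from the lexicon contribute zero senses.
    total_senses = sum(sense_counts.get(tok.lower(), 0) for tok in tokens)
    return total_senses / len(tokens)


# Hypothetical sense counts, for illustration only.
toy_sense_counts = {"shot": 4, "policeman": 1, "gun": 2}

sentence = "The camera-man shot the policeman with a gun".split()
dp = degree_of_polysemy(sentence, toy_sense_counts)
```

In practice the dictionary lookup would be replaced by a query against a WordNet interface; that step is left out here to keep the sketch self-contained.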
68 If the attachment units lie far from each other, the sentence has higher structural complexity. [sent-119, score-0.143]
69 Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence. [sent-120, score-0.038]
70 Figure 4: Prediction of TDI using linguistic properties such as Length (L), Degree of Polysemy (DP) and Structural Complexity (SC) Example: The man who the boy attacked escaped. [sent-121, score-0.041]
71 Using Lin’s formula, the SC score for the example sentence turns out to be 15. [sent-124, score-0.047]
72 Lin’s way of computing SC is affected by sentence length since the number of dependency links for a sentence depends on its length. [sent-125, score-0.132]
73 So we normalize SC by the length of the sentence. [sent-126, score-0.038]
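The SC computation described above (total dependency link length, then normalization by sentence length) can be sketched as follows. The head assignments for the example sentence are an assumed parse written by hand for this illustration, since the original dependency figure is not reproduced here; under this assumed parse the total link length agrees with the score of 15 quoted in the text.

```python
from typing import Sequence


def total_dependency_length(heads: Sequence[int]) -> int:
    """Sum of |position - head position| over all non-root words.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the root.
    """
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)


def structural_complexity(heads: Sequence[int]) -> float:
    """Total dependency length normalized by sentence length."""
    if not heads:
        raise ValueError("sentence must contain at least one word")
    return total_dependency_length(heads) / len(heads)


# "The man who the boy attacked escaped."
# Assumed heads: The->man, man->escaped, who->attacked, the->boy,
# boy->attacked, attacked->man, escaped = root.
heads = [2, 7, 6, 5, 6, 2, 0]

raw_sc = total_dependency_length(heads)
normalized_sc = structural_complexity(heads)
```

A different parser or dependency scheme could attach some words differently and give a different total, so the head list should be treated as one plausible analysis.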
74 4 How are TDI and linguistic features related To validate that translation difficulty depends on the above mentioned linguistic features, we tried to find out the correlation coefficients between each feature and empirical TDI. [sent-129, score-0.46]
75 For each sample, sentence selection was done with a view to varying one feature, keeping the other two constant. [sent-131, score-0.047]
76 These positive correlation coefficients indicate that all the features contribute to the translation difficulty. [sent-136, score-0.225]
77 5 Predicting TDI Our system predicts TDI from the linguistic properties of a sentence as shown in Figure 4. [sent-137, score-0.088]
78 ing translator’s behavior (using equations (1) and (2)) instead of asking people to rate sentences with TDI. [sent-151, score-0.061]
79 We are now prepared to give the regression scenario for predicting TDI. [sent-152, score-0.092]
80 1 Preparing the dataset Our dataset contains 80 sentences for which TDI have been measured (Section 3. [sent-154, score-0.055]
81 We also measured the Pearson correlation coefficient between the empirical and predicted TDI for our test-sets. [sent-163, score-0.119]
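For reference, the Pearson correlation coefficient used to compare empirical and predicted TDIs can be computed as in the sketch below. The observed and predicted values are invented for illustration and are not taken from the paper's tables.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed vs. predicted TDI values.
observed = [0.2, 0.5, 0.9]
predicted = [0.25, 0.45, 0.8]
r = pearson(observed, predicted)
```

A high coefficient only says that the two series move together linearly; the predictions can still be offset or scaled relative to the observations, which is why an error measure is reported alongside it.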
82 Table 1 indicates Mean Square Error percentages for different kernel methods used for SVR. [sent-164, score-0.04]
83 MSE (%) indicates by what percentage the predicted TDIs differ from the observed TDIs. [sent-165, score-0.041]
84 The predicted TDIs are well correlated with the empirical TDIs. [sent-168, score-0.088]
85 This tells us that even if the predicted scores are not as accurate as desired, the system is capable of ranking sentences in correct order. [sent-169, score-0.071]
86 Table 2 presents examples from the test dataset for which the observed TDI (TDIO) and the TDI predicted by polynomial kernel based SVR (TDIP) are shown. [sent-170, score-0.106]
87 For that, we tried to manually assign three different class labels to sentences viz. [sent-190, score-0.07]
88 easy, medium and hard based on the empirical TDI scores. [sent-191, score-0.087]
89 The ranges of scores chosen for easy, medium and hard categories were [0-0. [sent-192, score-0.087]
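The binning step can be sketched as a simple threshold function. The actual score ranges are truncated in the text above, so the two boundary values used below are placeholders chosen only to make the example runnable; they are not the paper's ranges.

```python
def bin_tdi(tdi: float, easy_upper: float, medium_upper: float) -> str:
    """Map a TDI score to 'easy', 'medium' or 'hard' using two thresholds."""
    if not easy_upper < medium_upper:
        raise ValueError("easy_upper must be smaller than medium_upper")
    if tdi <= easy_upper:
        return "easy"
    if tdi <= medium_upper:
        return "medium"
    # Anything beyond the medium range is treated as hard.
    return "hard"


# Placeholder thresholds (hypothetical, NOT the ranges used in the paper).
EASY_UPPER = 0.3
MEDIUM_UPPER = 0.6

labels = [bin_tdi(t, EASY_UPPER, MEDIUM_UPPER) for t in (0.1, 0.45, 0.9)]
```

Keeping the thresholds as explicit arguments makes it easy to swap in boundaries decided some other way, for instance from annotator agreement.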
90 Then we trained a Support Vector Rank (Joachims, 2006) with default parameters using different kernel methods. [sent-198, score-0.04]
91 6 Conclusion This paper introduces an approach to quantifying translation difficulty and automatically assigning difficulty levels to unseen sentences. [sent-202, score-0.573]
92 , length (L), degree of polysemy (DP) and structural complexity (SC), on one hand and the Translation Difficulty Index (TDI), on the other. [sent-204, score-0.259]
93 Future work includes deeper investigation into other linguistic factors such as presence of domain specific terms, target language properties etc. [sent-205, score-0.075]
94 We would like to make use of inter-annotator agreement to decide the boundaries for the translation difficulty categories. [sent-207, score-0.333]
95 Readability revisited: the new Dale-Chall readability formula Cambridge, Mass. [sent-226, score-0.174]
96 1977 Fry’s readability graph: Clarification, validity, and extension to level 17 Journal of Reading, 21(3), 242-252. [sent-236, score-0.122]
97 2002 Cleaning up systematic error in eye-tracking data by using required fixation locations. [sent-240, score-0.139]
98 1996 On the structural complexity of natural language sentences. [sent-265, score-0.115]
99 2012 A heuristic-based approach for systematic error correction of gaze data for reading. [sent-270, score-0.244]
100 2012 Scanpaths in reading are informative about sentence processing. [sent-279, score-0.129]
wordName wordTfidf (topN-words)
[('tdi', 0.715), ('difficulty', 0.195), ('tp', 0.192), ('gaze', 0.156), ('sc', 0.156), ('translation', 0.138), ('tdis', 0.126), ('shot', 0.124), ('readability', 0.122), ('eye', 0.112), ('translators', 0.106), ('dp', 0.096), ('fixations', 0.089), ('translate', 0.085), ('carl', 0.083), ('fixation', 0.083), ('fry', 0.083), ('reading', 0.082), ('gun', 0.078), ('critt', 0.076), ('dur', 0.076), ('policeman', 0.076), ('structural', 0.071), ('mse', 0.071), ('mishra', 0.067), ('polysemy', 0.066), ('medium', 0.059), ('correlation', 0.053), ('svr', 0.053), ('formula', 0.052), ('chall', 0.051), ('dragsted', 0.051), ('halverson', 0.051), ('hornof', 0.051), ('ibc', 0.051), ('malsburg', 0.051), ('michaelcarl', 0.051), ('saccadic', 0.051), ('scanpaths', 0.051), ('tdio', 0.051), ('tdip', 0.051), ('correlated', 0.047), ('comp', 0.047), ('gen', 0.047), ('sentence', 0.047), ('quantifying', 0.045), ('kincaid', 0.045), ('choudhary', 0.045), ('hale', 0.045), ('tpr', 0.045), ('complexity', 0.044), ('joachims', 0.043), ('formulae', 0.041), ('cbs', 0.041), ('predicted', 0.041), ('properties', 0.041), ('tried', 0.04), ('kernel', 0.04), ('degree', 0.04), ('taken', 0.039), ('time', 0.039), ('durations', 0.039), ('poly', 0.039), ('length', 0.038), ('deg', 0.037), ('mathematically', 0.035), ('spend', 0.035), ('campbell', 0.034), ('coefficients', 0.034), ('factors', 0.034), ('predicting', 0.033), ('monitoring', 0.033), ('regression', 0.032), ('correction', 0.032), ('sessions', 0.032), ('comprehension', 0.031), ('bhattacharyya', 0.031), ('behavior', 0.031), ('error', 0.03), ('sentences', 0.03), ('danish', 0.03), ('easy', 0.029), ('hard', 0.028), ('movement', 0.028), ('translator', 0.028), ('senses', 0.028), ('organizing', 0.028), ('index', 0.028), ('database', 0.027), ('scenario', 0.027), ('rapid', 0.027), ('dale', 0.027), ('committee', 0.027), ('difficult', 0.027), ('participant', 0.027), ('measuring', 0.026), ('systematic', 0.026), ('measured', 0.025), ('polynomial', 0.025), ('lie', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
2 0.07014107 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
Author: Yang Feng ; Trevor Cohn
Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.
Author: Trevor Cohn ; Lucia Specia
Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.
Author: Guangyou Zhou ; Fang Liu ; Yang Liu ; Shizhu He ; Jun Zhao
Abstract: Community question answering (CQA) has become an increasingly popular research topic. In this paper, we focus on the problem of question retrieval. Question retrieval in CQA can automatically find the most relevant and recent questions that have been solved by other users. However, the word ambiguity and word mismatch problems bring about new challenges for question retrieval in CQA. State-of-the-art approaches address these issues by implicitly expanding the queried questions with additional words or phrases using monolingual translation models. While useful, the effectiveness of these models is highly dependent on the availability of quality parallel monolingual corpora (e.g., question-answer pairs) in the absence of which they are troubled by noise issue. In this work, we propose an alternative way to address the word ambiguity and word mismatch problems by taking advantage of potentially rich semantic information drawn from other languages. Our proposed method employs statistical machine translation to improve question retrieval and enriches the question representation with the translated words from other languages via matrix factorization. Experiments conducted on a real CQA data show that our proposed approach is promising.
5 0.063284062 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre
Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm.
The decoder is written in C++ and follows an object-oriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on Github (https://github.com/chardmeier/docent/wiki) to make it easy for other researchers to use it in their own experiments. 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target language through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit.
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. 
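The hill climbing procedure described above (propose a random change, accept it only if the score improves, stop after too many failed attempts or a step limit) can be sketched generically. This is a toy illustration and not Docent's code: the function names, the integer search space, and the reading of "a certain number of attempts" as consecutive rejections are assumptions made here.

```python
import random
from typing import Callable, TypeVar

S = TypeVar("S")


def hill_climb(
    initial: S,
    propose: Callable[[S, random.Random], S],
    score: Callable[[S], float],
    max_steps: int = 1000,
    max_rejects: int = 50,
    seed: int = 0,
) -> S:
    """Accept a proposed successor only if it scores strictly better.

    Stops after max_steps moves or max_rejects consecutive rejections.
    """
    rng = random.Random(seed)
    state, best = initial, score(initial)
    rejects = 0
    for _ in range(max_steps):
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if candidate_score > best:
            state, best = candidate, candidate_score
            rejects = 0
        else:
            rejects += 1
            if rejects >= max_rejects:
                break
    return state


# Toy problem: find the integer maximizing -(x - 3)^2 using +/-1 moves.
result = hill_climb(
    initial=0,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
    score=lambda x: -((x - 3) ** 2),
)
```

On this toy objective the search should settle at the single maximum; a document-level decoder would replace the integer state with a full translation and the proposal function with phrase-level operations.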
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained by DP search. In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The move-phrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearise-phrases can reorder a sequence of phrases into random order and back into the order corresponding to the source language. Since the search algorithm in Docent is stochastic, repeated runs of the decoder will generally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder.
They have not yet reached a state of maturity that would make them suitable for production use. The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight of our distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of document-level readability models.
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple document-level model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5.
Docent can read input files both in the NIST-XML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourse-level corpus annotations such as coreference links. These annotations can then be accessed by the feature models. To allow for additional target-language information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses.
In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding.
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted. The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. 
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. 
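The incremental update and the two-step acceptance test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Docent's code: the class `ToyBigramModel`, the function `try_move`, the per-sentence score cache and the fixed acceptance `threshold` are all hypothetical, and the toy model decomposes over whole sentences rather than over individual n-grams.

```python
from typing import Dict, List, Sequence, Tuple

Bigram = Tuple[str, str]


class ToyBigramModel:
    """Hypothetical feature model: document score = sum of bigram log-probabilities."""

    def __init__(self, logprobs: Dict[Bigram, float], floor: float = -10.0):
        self.logprobs = logprobs          # known bigram log-probabilities (all <= 0)
        self.floor = floor                # assumed score for unseen bigrams (must be <= 0)
        self.sentence_scores: List[float] = []
        self.lookups = 0                  # counts model lookups, for illustration only

    def _score_sentence(self, words: Sequence[str]) -> float:
        total = 0.0
        for bigram in zip(words, words[1:]):
            self.lookups += 1
            total += self.logprobs.get(bigram, self.floor)
        return total

    def init_document(self, doc: List[List[str]]) -> float:
        """Score the whole document from scratch and cache per-sentence scores."""
        self.sentence_scores = [self._score_sentence(s) for s in doc]
        return sum(self.sentence_scores)

    def estimate_score_update(self, old_score: float, sent_index: int) -> float:
        """Upper bound: subtract the old score of the affected part, add nothing.

        Log-probabilities are never positive, so the exact new score cannot
        exceed this value. No model lookups are needed.
        """
        return old_score - self.sentence_scores[sent_index]

    def update_score(self, old_score: float, sent_index: int,
                     new_words: Sequence[str]) -> Tuple[float, float]:
        """Exact score: subtract the old part and add the freshly scored new part."""
        new_part = self._score_sentence(new_words)
        return old_score - self.sentence_scores[sent_index] + new_part, new_part


def try_move(model: ToyBigramModel, doc: List[List[str]], score: float,
             sent_index: int, new_words: Sequence[str],
             threshold: float) -> Tuple[float, bool]:
    """Propose replacing one sentence; return (current score, accepted?)."""
    if model.estimate_score_update(score, sent_index) < threshold:
        return score, False               # rejected on the cheap bound alone
    new_score, new_part = model.update_score(score, sent_index, new_words)
    if new_score < threshold:
        return score, False               # rejected on the exact score
    doc[sent_index] = list(new_words)     # commit the move
    model.sentence_scores[sent_index] = new_part
    return new_score, True
```

In this sketch a move that fails the bound check never triggers a lookup, which mirrors the motivation given for separating the estimate from the exact update.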
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method can simply return the precise score as a tight upper bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all.
5 Readability: A Case Study
As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability of texts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities.
Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we always have access to a full document translation, which makes it straightforward to include features directly into the decoder.
We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent.
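To make the two ratio features concrete, here is a small Python sketch of TTR and OVIX computed over a tokenised document. The text above does not give the formulas, so the OVIX definition used here, log(n) / log(2 − log(u) / log(n)) for n tokens and u distinct types, is the commonly cited one and should be read as an assumption; the function names are hypothetical.

```python
import math
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)


def ovix(tokens: Sequence[str]) -> float:
    """OVIX under the assumed definition log(n) / log(2 - log(u) / log(n))."""
    n = len(tokens)
    u = len(set(tokens))
    if n < 2:
        raise ValueError("OVIX needs at least two tokens")
    denominator = math.log(2 - math.log(u) / math.log(n))
    if denominator == 0:
        return math.inf  # every token is a distinct type
    return math.log(n) / denominator
```

Both measures fall as a text repeats more of its vocabulary, which is why they can reward consistent word choice across a whole document.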
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).
We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.
Table 2: Example translation snippets with comments
Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb
Table 1: Results for adding single lexical consistency features to Docent
Feature | BLEU | OVIX | LIX
Baseline | 0.243 | 56.88 | 51.17
TTR | 0.243 | 55.25 | 51.04
OVIX | 0.243 | 54.65 | 51.00
QW | 0.242 | 57.16 | 51.16
QP | 0.243 | 57.07 | 51.06
All | 0.235 | 47.80 | 49.29
To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
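For readers unfamiliar with LIX, the following Python sketch shows how such a score can be computed. The formula is not stated in the text; the version below (average sentence length plus the percentage of words longer than six characters) is the commonly cited definition and is an assumption here, as are the function name and the `long_word_length` parameter.

```python
from typing import Sequence


def lix(sentences: Sequence[Sequence[str]], long_word_length: int = 7) -> float:
    """LIX = words per sentence + percentage of long words (assumed definition)."""
    words = [word for sentence in sentences for word in sentence]
    if not sentences or not words:
        raise ValueError("LIX is undefined for an empty text")
    # A word counts as long if it has at least `long_word_length` characters.
    long_words = sum(1 for word in words if len(word) >= long_word_length)
    return len(words) / len(sentences) + 100 * long_words / len(words)
```

Because both terms shrink when sentences and words get shorter, replacing a long compound by a shorter expression lowers the score.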
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.
We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality.
We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.
There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.
6 Conclusions
In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.
References
Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.
6 0.062982082 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
7 0.062952273 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
8 0.062888347 322 acl-2013-Simple, readable sub-sentences
9 0.054126453 255 acl-2013-Name-aware Machine Translation
10 0.053216986 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
11 0.051712856 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
12 0.051542338 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations
13 0.051125959 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
14 0.049629867 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
15 0.048687629 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
16 0.047482688 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora
17 0.04723575 289 acl-2013-QuEst - A translation quality estimation framework
18 0.045773134 16 acl-2013-A Novel Translation Framework Based on Rhetorical Structure Theory
19 0.044511702 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
20 0.042900246 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
topicId topicWeight
[(0, 0.126), (1, -0.03), (2, 0.062), (3, 0.004), (4, -0.023), (5, -0.012), (6, 0.009), (7, 0.0), (8, 0.022), (9, 0.021), (10, -0.015), (11, 0.023), (12, -0.051), (13, -0.005), (14, -0.009), (15, -0.016), (16, -0.048), (17, 0.007), (18, -0.02), (19, 0.011), (20, 0.024), (21, -0.023), (22, -0.031), (23, 0.013), (24, -0.05), (25, 0.028), (26, -0.013), (27, 0.032), (28, 0.008), (29, 0.043), (30, -0.036), (31, -0.05), (32, 0.04), (33, 0.043), (34, 0.02), (35, 0.023), (36, -0.005), (37, -0.006), (38, -0.057), (39, -0.03), (40, -0.008), (41, 0.002), (42, -0.011), (43, -0.031), (44, -0.028), (45, 0.065), (46, -0.036), (47, -0.051), (48, -0.027), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.89384764 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
2 0.79215175 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu
Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
3 0.74758196 250 acl-2013-Models of Translation Competitions
Author: Mark Hopkins ; Jonathan May
Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1416–1424, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics
1 The WMT Translation Competition
Every year, the Workshop on Machine Translation (WMT) conducts a competition between machine translation systems. The WMT organizers invite research groups to submit translation systems in eight different tracks: Czech to/from English, French to/from English, German to/from English, and Spanish to/from English. For each track, the organizers also assemble a panel of judges, typically machine translation specialists.[1] The role of a judge is to repeatedly rank five different translations of the same source text. Ties are permitted. In Table 1, we show an example[2] where a judge (we’ll call him “jdoe”) has ranked five translations of the French sentence “Il ne va pas.” Each such elicitation encodes ten pairwise comparisons, as shown in Table 2. For each competition track, WMT typically elicits between 5000 and 20000 comparisons. Once the elicitation process is complete, WMT faces a large database of comparisons and a question that must be answered: whose system is the best?
[1] Although in recent competitions, some of the judging has also been crowdsourced (Callison-Burch et al., 2010).
[2] The example does not use actual system output.
[Table 1 (contents lost in extraction): judges are asked to simultaneously rank five translations, with ties permitted. In this (fictional) example, the source sentence is the French “Il ne va pas.”]
[Table 2 (contents lost in extraction): the pairwise comparisons encoded by Table 1. A preference of 0 means neither translation was preferred. Otherwise the preference specifies the preferred system.]
2 A Ranking Problem
For several years, WMT used the following heuristic for ranking the translation systems:
$$\mathrm{ORIGWMT}(s) = \frac{\mathrm{win}(s) + \mathrm{tie}(s)}{\mathrm{win}(s) + \mathrm{tie}(s) + \mathrm{loss}(s)}$$
For system s, win(s) is the number of pairwise comparisons in which s was preferred, loss(s) is the number of comparisons in which s was dispreferred, and tie(s) is the number of comparisons in which s participated but neither system was preferred.
Recently, (Bojar et al., 2011) questioned the adequacy of this heuristic through the following argument. Consider a competition with systems A and B. Suppose that the systems are different but equally good, such that one third of the time A is judged better than B, one third of the time B is judged better than A, and one third of the time they are judged to be equal. The expected values of ORIGWMT(A) and ORIGWMT(B) are both 2/3, so the heuristic accurately judges the systems to be equivalently good. Suppose however that we had duplicated B and had submitted it to the competition a second time as system C. Since B and C produce identical translations, they should always tie with one another. The expected value of ORIGWMT(A) would not change, but the expected value of ORIGWMT(B) would increase to 5/6, buoyed by its ties with system C.
This vulnerability prompted (Bojar et al., 2011) to offer the following revision:
$$\mathrm{BOJAR}(s) = \frac{\mathrm{win}(s)}{\mathrm{win}(s) + \mathrm{loss}(s)}$$
The following year, it was BOJAR’s turn to be criticized, this time by (Lopez, 2012): Superficially, this appears to be an improvement....couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors?
On the other hand, couldn’t a system be rewarded simply by being compared against a bad system more frequently than its competitors?
Lopez’s concern, while reasonable, is less obviously damning than (Bojar et al., 2011)’s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver.
The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez’s ranking scheme and provided a variant called Most Probable Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than another? Or is this even the right question to ask?
3 A Problem with Rankings
Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We’ll call these close1, close2, and close3. Nevertheless, close1 is very slightly better[3] than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking?
(1) terrific, close3, close1, close2
(2) close1, terrific, close2, close3
Spearman’s rho[4] would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first.
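The claim about Spearman’s rho can be checked with a short Python sketch. It assumes the gold ranking implied by the description (terrific first, then close1, close2, close3) and uses the textbook formula for rankings without ties, rho = 1 − 6 Σ d² / (n (n² − 1)); the function name `spearman_rho` is hypothetical.

```python
from typing import Sequence


def spearman_rho(gold: Sequence[str], candidate: Sequence[str]) -> float:
    """Spearman's rank correlation between two tie-free rankings of the same items."""
    n = len(gold)
    if n < 2 or sorted(gold) != sorted(candidate):
        raise ValueError("rankings must contain the same items (at least two)")
    position = {item: rank for rank, item in enumerate(candidate)}
    # Sum of squared rank differences between the two rankings.
    d_squared = sum((rank - position[item]) ** 2 for rank, item in enumerate(gold))
    return 1 - 6 * d_squared / (n * (n * n - 1))


gold = ["terrific", "close1", "close2", "close3"]          # assumed gold ranking
ranking_1 = ["terrific", "close3", "close1", "close2"]
ranking_2 = ["close1", "terrific", "close2", "close3"]
```

Under these assumptions, ranking (1) has squared rank differences summing to 6 and ranking (2) to 2, so the second ranking receives the higher correlation even though it misplaces the one system that clearly stands apart.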
While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman’s rho. The problem is the disconnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information they lack, and define more rigorously what we hope to learn from a translation competition.
[3] What does “better” mean? We’ll return to this question.
[4] Or Pearson’s correlation coefficient.
4 From Rankings to Relative Ability
Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all translation systems. Hereafter, we will refer to S as the space of students. We choose this term to evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by “ability”? Before formally defining this term, first recognize that it means little without context, namely:
1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents – otherwise the question is underspecified.
2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (Callison-Burch et al., 2012) suggests otherwise.
It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Posteditors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means.
With this in mind, let’s define some additional elements of a translation competition. Let X be the space of all possible segments of source text, J be the space of all possible judges, and Π = {0, 1, 2} be the space of pairwise preferences.[5] We assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variable j represents a judge from J, and variable π represents a preference from Π. Moreover, define the negation π̂ of preference π such that π̂ = 2 (if π = 1), π̂ = 1 (if π = 2), and π̂ = 0 (if π = 0).
[5] As a reminder, 0 indicates no preference.
Now assume a joint distribution P(s1, s2, x, j, π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words:
$$\begin{aligned} P(s_1, s_2, x, j, \pi) &= P(\pi \mid s_1, s_2, x, j) \cdot P(x \mid s_1, s_2, j) \cdot P(j \mid s_1, s_2) \cdot P(s_1, s_2) \\ &= P(\pi \mid s_1, s_2, x, j) \cdot P(x) \cdot P(j) \cdot P(s_1, s_2) \\ &= P_X(x) \cdot P_J(j) \cdot P(s_1, s_2) \cdot P(\pi \mid s_1, s_2, x, j) \end{aligned}$$
It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity:
$$P(\pi \mid s_1, s_2) = \sum_{x \in X} \sum_{j \in J} P_X(x) \cdot P_J(j) \cdot P(\pi \mid s_1, s_2, x, j)$$
We refer to this as the ⟨PX, PJ⟩-relative ability of students s1 and s2.
By using different marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair).
With this machinery, we can express the purpose of a translation competition more clearly: to estimate the ⟨PX, PJ⟩-relative ability of a set of students. In the case of WMT, PJ presumably[6] defines a space of competent source-to-target bilingual speakers, while PX defines a space of newswire documents.
We’ll refer to an estimate of P(π|s1, s2) as a preference model. In other words, a preference model is a distribution Q(π|s1, s2). Given a set of pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1, s2) such that Q is “close” to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 1951), but we cannot use it here because P is unknown.
Fortunately, if we have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this held-out test data. Let D be a sequence of triples ⟨s1, s2, π⟩ where the preferences π are i.i.d. samples from P(π|s1, s2). The perplexity of preference model Q on test data D is:
$$\mathrm{perplexity}(Q \mid D) = 2^{-\sum_{\langle s_1, s_2, \pi \rangle \in D} \frac{1}{|D|} \log_2 Q(\pi \mid s_1, s_2)}$$
How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2.
[6] One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community.
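The perplexity definition above translates directly into code. The sketch below is illustrative only: it assumes a preference model is represented as a Python callable `model(s1, s2, pref)` returning Q(π|s1, s2), and that test data is a list of `(s1, s2, pref)` triples; both representations are assumptions rather than anything specified in the text.

```python
import math
from typing import Callable, Sequence, Tuple

Triple = Tuple[str, str, int]                    # (s1, s2, preference in {0, 1, 2})
PreferenceModel = Callable[[str, str, int], float]


def perplexity(model: PreferenceModel, test_data: Sequence[Triple]) -> float:
    """2 raised to the negative average log2-probability of the observed preferences."""
    if not test_data:
        raise ValueError("perplexity is undefined for an empty test set")
    total_log2 = 0.0
    for s1, s2, pref in test_data:
        probability = model(s1, s2, pref)
        if probability <= 0.0:
            return math.inf                      # a zero estimate gives infinite perplexity
        total_log2 += math.log2(probability)
    return 2 ** (-total_log2 / len(test_data))
```

A model that always assigns probability 1/3 to every preference scores a perplexity of 3 on any test set, and lower values indicate a model that predicts the held-out preferences better.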
Let C be the set of comparisons ⟨s1, s2, x, j, π⟩ obtained from a translation competition. Competition data C is not necessarily[7] sampled i.i.d. from P(s1, s2, x, j, π) because we may intentionally[8] bias data collection towards certain students, judges or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons. To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C' of comparisons whose source text appears in at most k comparisons, where k is the smallest positive integer such that |C'| >= 2000. We then create the test set D from C':

$$
D = \{ \langle s_1, s_2, \pi \rangle \mid \langle s_1, s_2, x, j, \pi \rangle \in C' \}
$$

[7] In WMT, it certainly is not.
[8] To collect judge agreement statistics, for instance.

We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track.

Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative[9] answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model.

[9] As opposed to philosophical.

5 Baselines

Let’s begin then, and create some simple preference models to serve as baselines.

5.1 Uniform

The simplest preference model is a uniform distribution over preferences, for any choice of students s1, s2:

$$
Q(\pi \mid s_1, s_2) = \frac{1}{3} \quad \forall \pi \in \Pi
$$

This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e. equal to the number of possible preferences).

5.2 Adjusted Uniform

Now suppose we have a set C of comparisons available for training. Let Cπ ⊆ C denote the subset of comparisons with preference π, and let C(s1, s2) denote the subset comparing students s1 and s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e. preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences:

$$
Q(\pi \mid s_1, s_2) =
\begin{cases}
\dfrac{|C_0|}{|C|} & \text{if } \pi = 0 \\[2ex]
\dfrac{1}{2}\left(1 - \dfrac{|C_0|}{|C|}\right) & \text{otherwise}
\end{cases}
$$

6 Simple Bayesian Models

6.1 Independent Pairs

Another simple model is the direct estimation of each relative ability P(π|s1, s2) independently. In other words, for each pair of students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be:

$$
Q(\pi \mid s_1, s_2) = \frac{|C_\pi(s_1, s_2)| + |C_{\hat{\pi}}(s_2, s_1)|}{|C(s_1, s_2)| + |C(s_2, s_1)|}
$$

However, the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate:

$$
Q(\pi \mid s_1, s_2) = \frac{\alpha + |C_\pi(s_1, s_2)| + |C_{\hat{\pi}}(s_2, s_1)|}{3\alpha + |C(s_1, s_2)| + |C(s_2, s_1)|}
$$

We call this the Independent Pairs preference model.

6.2 Independent Students

The Independent Pairs model makes a strong independence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1, s2) of students s1 and s2, we could instead try to estimate the universal ability

$$
P(\pi \mid s_1) = \sum_{s_2 \in S} P(\pi \mid s_1, s_2) \cdot P(s_2 \mid s_1)
$$

of each individual student s1 and then try to reconstruct the relative abilities from these estimates.
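The Independent Pairs estimate of Section 6.1 can be sketched as smoothed counting. This is a minimal illustration, assuming training comparisons are given as `(s1, s2, pi)` triples with preferences encoded as 0 (tie), 1 and 2; the function name `independent_pairs` and the toy data are hypothetical. Each comparison is also counted from the opposite student's point of view with the negated preference, which is what the second count in the numerator expresses.

```python
from collections import Counter

# Negation of a preference: swapping the two students swaps 1 and 2.
NEGATE = {0: 0, 1: 2, 2: 1}


def independent_pairs(comparisons, alpha=1.0):
    """Return a preference model q(pi, s1, s2) with Dirichlet smoothing.

    comparisons: iterable of (s1, s2, pi) training triples, s1 != s2.
    alpha:       strength of the symmetric Dirichlet prior.
    """
    counts = Counter()
    for s1, s2, pi in comparisons:
        counts[(s1, s2, pi)] += 1
        # The same judgment, seen with the students in reverse order.
        counts[(s2, s1, NEGATE[pi])] += 1

    def q(pi, s1, s2):
        total = sum(counts[(s1, s2, p)] for p in (0, 1, 2))
        return (alpha + counts[(s1, s2, pi)]) / (3 * alpha + total)

    return q


# Toy training data (student names are placeholders).
train = [("A", "B", 1), ("A", "B", 1), ("B", "A", 2), ("A", "B", 0)]
q_pairs = independent_pairs(train, alpha=1.0)
```

For a pair that never appears in training, the estimate falls back to 1/3 for every preference, so the model never assigns zero probability to a test preference.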
For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each preference distribution, which gives us the following Bayesian estimate:

$$
Q(\pi \mid s_1) = \frac{\alpha + \sum_{s_2 \in S} \left( |C_\pi(s_1, s_2)| + |C_{\hat{\pi}}(s_2, s_1)| \right)}{3\alpha + \sum_{s_2 \in S} \left( |C(s_1, s_2)| + |C(s_2, s_1)| \right)}
$$

The estimates Q(π|s1) do not yet constitute a preference model. A downside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2:

$$
Q(\pi \mid s_1, s_2) = Q(\pi \mid s_1)
$$

The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities:

$$
Q(\pi \mid s_1, s_2) = \frac{Q(\pi \mid s_1) + Q(\hat{\pi} \mid s_2)}{2}
$$

$$
Q(\pi \mid s_1, s_2) = \left[ Q(\pi \mid s_1) \cdot Q(\hat{\pi} \mid s_2) \right]^{\frac{1}{2}}
$$

We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and the BOJAR ranking heuristic. These three models are our attempt to render the BOJAR heuristic as preference models.

7 Item-Response Theoretic (IRT) Models

Let’s revisit (Lopez, 2012)’s objection to the BOJAR ranking heuristic: “...couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors?” The official WMT 2012 findings (Callison-Burch et al., 2012) echo this concern in justifying the exclusion of reference translations from the 2012 competition:

[W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often.

Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible.
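Returning to the three Independent Students reconstructions, the following sketch shows how each one combines two universal ability estimates. It assumes each estimate Q(π|s) is already available as a dictionary over the preferences 0, 1, 2; the function name `reconstruct` and the toy distributions are hypothetical. The geometric form is implemented exactly as the formula states, without renormalization, so its three values need not sum to one.

```python
import math

# Negation of a preference: swapping the two students swaps 1 and 2.
NEGATE = {0: 0, 1: 2, 2: 1}


def reconstruct(q_s1, q_s2, pi, mode):
    """Combine universal abilities of s1 and s2 into Q(pi | s1, s2).

    q_s1, q_s2: dicts mapping preference -> universal ability estimate.
    mode:       "asymmetric", "arithmetic" or "geometric".
    """
    own = q_s1[pi]
    # From s2's point of view, the same outcome is the negated preference.
    other = q_s2[NEGATE[pi]]
    if mode == "asymmetric":
        return own
    if mode == "arithmetic":
        return (own + other) / 2.0
    if mode == "geometric":
        return math.sqrt(own * other)
    raise ValueError("unknown reconstruction mode: " + mode)


# Toy universal abilities (values are made up for illustration).
q_a = {0: 0.2, 1: 0.6, 2: 0.2}
q_b = {0: 0.2, 1: 0.3, 2: 0.5}
arith = reconstruct(q_a, q_b, 1, "arithmetic")
geom = reconstruct(q_a, q_b, 1, "geometric")
```

In this toy case student A tends to win (0.6) and student B tends to lose (0.5 when listed first), so both averaged reconstructions favor A over B.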
Still, WMT 2012’s decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on.

We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g. who a student has been paired with) to influence the estimation of the student abilities. Figure 1 (left) shows plate notation (Koller and Friedman, 2009) for the model’s independence structure. First, each student’s ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student’s ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations.

We investigate two parameterizations of this model: Gaussian and categorical. Figure 1 (right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an above-average student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14’s ability distribution), while item 205 is not student 6’s best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1 pits these items against one another.
A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item’s actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision radius (which here is 0.5), then he states no preference (i.e. a preference of 0). Otherwise he prefers the item with the higher observed quality.

Figure 1: Plate notation (left) showing the independence structure of the IRT Models. Example instantiated subnetwork (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons.

The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. The student ability distributions are categorical distributions over {1, 2, ..., Λ}, and the student ability prior is a symmetric Dirichlet with strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ − λ}. Noise ν is drawn from a discretized zero-mean Gaussian with standard deviation σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs).

We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently[10] for both parameterizations.
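The judging step of the Gaussian parameterization, as walked through in the worked example with items 43 and 205, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the encoding of a preference for the first option as 1 and for the second option as 2 is an assumption (the text only fixes 0 as "no preference").

```python
import random


def judge_preference(quality1, quality2, noise1, noise2, radius):
    """Preference stated by a judge given item qualities and drawn noise.

    Returns 0 (no preference), 1 (first option) or 2 (second option).
    The 1/2 encoding is an assumption made for this sketch.
    """
    observed1 = quality1 + noise1
    observed2 = quality2 + noise2
    if abs(observed1 - observed2) < radius:
        return 0
    return 1 if observed1 > observed2 else 2


def sample_preference(quality1, quality2, sigma_obs, radius, rng):
    """Same as judge_preference, but draws the observation noise."""
    noise1 = rng.gauss(0.0, sigma_obs)
    noise2 = rng.gauss(0.0, sigma_obs)
    return judge_preference(quality1, quality2, noise1, noise2, radius)


# Worked example from the text: item 43 (quality -0.3, noise -0.12)
# against item 205 (quality 0.2, noise 0.15), decision radius 0.5.
example = judge_preference(-0.3, 0.2, -0.12, 0.15, 0.5)

# A random draw with hypothetical settings for sigma_obs and radius.
draw = sample_preference(-0.3, 0.2, sigma_obs=1.0, radius=0.4,
                         rng=random.Random(0))
```

In the worked example the observed qualities are -0.42 and 0.35, whose difference of 0.77 exceeds the radius of 0.5, so the judge prefers the second option (item 205).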
Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query:

Pr(comp.c'.pref = π | item.i'.author = s1, item.i''.author = s2, comp.c'.opt1 = i', comp.c'.opt2 = i'')

where c', i', i'' are new comparison and item ids that do not appear in the training data.

[10] We ran 200 iterations with a burn-in of 50.

We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles[11] the Thurstone (Thurstone, 1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (Rasch, 1960) models of student testing. From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: “Give me a translation of the source text which is better than mine.” The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion.

[11] These models are not traditionally expressed using graphical models, although it is not unprecedented (Mislevy and Almond, 1997; Mislevy et al., 1999).

Figure 2: WMT10 model perplexities. The perplexity of the uniform preference model is 3.0 for all training sizes.

8 Experiments

We organized the competition data as described at the end of Section 4.
To compare the preference models, we did the following:

• Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200}.[12]
• Trained the preference model on these comparisons.
• Evaluated the perplexity of the trained model on the test preferences, as described in Section 4.

[12] If k was greater than the total number of training comparisons, then we took the entire set.

For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size. These graphs are shown in Figure 2 (WMT10)[13], Figure 3 (WMT11), and Figure 4 (WMT12).

[13] Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows. The Dirichlet strength for each baseline was 1. For IRT-Gaussian: σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4. For IRT-Categorical: Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0.

Figure 3: WMT11 model perplexities.

Figure 4: WMT12 model perplexities.

For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Student models. This somewhat validates the organizers’ decision to exclude the references, particularly given WMT’s use of the BOJAR ranking heuristic (the nucleus of the Independent Student models) for its official rankings.

Figure 5: WMT10 model perplexities (crowdsourced versus expert training).

The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but using the unfiltered crowdsourced comparisons (rather than “expert”[14] comparisons) for training. Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010).

[14] I.e., machine translation specialists.

Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one stddev of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.

Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by (Lopez, 2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian’s analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian’s induced preference models as a heatmap[15] (Figure 6, right), four bands are discernible. First, the reference sentences are clearly the darkest (best).
Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11.

[15] In the heatmap, cell (s1, s2) is darker if preference model Q(π|s1, s2) skews in favor of student s1, lighter if it skews in favor of student s2.

9 Conclusion

WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like:

1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g. presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data.

2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models. However, the IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons.

3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided.

References

F.B. Baker. 2001. The basics of item response theory. ERIC.
Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.
Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.
C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation.
S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
R.K. Hambleton. 1991. Fundamentals of item response theory, volume 2. Sage Publications, Incorporated.
D. Koller and N. Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT Press.
S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.
Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT.
R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons.
R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434.
R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437–446. Morgan Kaufmann Publishers Inc.
G. Rasch. 1960. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests.
Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273–286.
W.J. van der Linden and R.K. Hambleton. 1996. Handbook of modern item response theory. Springer.
4 0.73106647 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
5 0.69674301 322 acl-2013-Simple, readable sub-sentences
Author: Sigrid Klerke ; Anders Søgaard
Abstract: We present experiments using a new unsupervised approach to automatic text simplification, which builds on sampling and ranking via a loss function informed by readability research. The main idea is that a loss function can distinguish good simplification candidates among randomly sampled sub-sentences of the input sentence. Our approach is rated as equally grammatical and beginner reader appropriate as a supervised SMT-based baseline system by native speakers, but our setup performs more radical changes that better resemble the variation observed in human generated simplifications.
6 0.68544048 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
7 0.66777694 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
8 0.64261359 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
9 0.64217567 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
10 0.63949859 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
11 0.627231 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
12 0.62225175 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
13 0.61979938 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
14 0.6182496 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
15 0.61482245 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
16 0.61266536 289 acl-2013-QuEst - A translation quality estimation framework
17 0.61102402 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
18 0.60678911 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
19 0.58883029 255 acl-2013-Name-aware Machine Translation
20 0.58208561 312 acl-2013-Semantic Parsing as Machine Translation
topicId topicWeight
[(0, 0.049), (6, 0.018), (11, 0.037), (15, 0.015), (24, 0.028), (26, 0.039), (28, 0.011), (35, 0.049), (42, 0.513), (48, 0.027), (70, 0.036), (88, 0.023), (90, 0.021), (95, 0.055)]
simIndex simValue paperId paperTitle
Author: Sina Zarriess ; Jonas Kuhn
Abstract: We suggest a generation task that integrates discourse-level referring expression generation and sentence-level surface realization. We present a data set of German articles annotated with deep syntax and referents, including some types of implicit referents. Our experiments compare several architectures varying the order of a set of trainable modules. The results suggest that a revision-based pipeline, with intermediate linearization, significantly outperforms standard pipelines or a parallel architecture.
2 0.98119098 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation
Author: Isao Goto ; Masao Utiyama ; Eiichiro Sumita ; Akihiro Tamura ; Sadao Kurohashi
Abstract: This paper proposes new distortion models for phrase-based SMT. In decoding, a distortion model estimates the source word position to be translated next (NP) given the last translated source word position (CP). We propose a distortion model that can consider the word at the CP, a word at an NP candidate, and the context of the CP and the NP candidate simultaneously. Moreover, we propose a further improved model that considers richer context by discriminating label sequences that specify spans from the CP to NP candidates. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
3 0.97886872 372 acl-2013-Using CCG categories to improve Hindi dependency parsing
Author: Bharat Ram Ambati ; Tejaswini Deoskar ; Mark Steedman
Abstract: We show that informative lexical categories from a strongly lexicalised formalism such as Combinatory Categorial Grammar (CCG) can improve dependency parsing of Hindi, a free word order language. We first describe a novel way to obtain a CCG lexicon and treebank from an existing dependency treebank, using a CCG parser. We use the output of a supertagger trained on the CCGbank as a feature for a state-of-the-art Hindi dependency parser (Malt). Our results show that using CCG categories improves the accuracy of Malt on long distance dependencies, for which it is known to have weak rates of recovery.
same-paper 4 0.9553411 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
Author: Abhijit Mishra ; Pushpak Bhattacharyya ; Michael Carl
Abstract: In this paper we introduce Translation Difficulty Index (TDI), a measure of difficulty in text translation. We first define and quantify translation difficulty in terms of TDI. We realize that any measure of TDI based on direct input by translators is fraught with subjectivity and adhocism. We, rather, rely on cognitive evidences from eye tracking. TDI is measured as the sum of fixation (gaze) and saccade (rapid eye movement) times of the eye. We then establish that TDI is correlated with three properties of the input sentence, viz. length (L), degree of polysemy (DP) and structural complexity (SC). We train a Support Vector Regression (SVR) system to predict TDIs for new sentences using these features as input. The prediction done by our framework is well correlated with the empirical gold standard data, which is a repository of < L, DP, SC > and TDI pairs for a set of sentences. The primary use of our work is a way of “binning” sentences (to be translated) in “easy”, “medium” and “hard” categories as per their predicted TDI. This can decide pricing of any translation task, especially useful in a scenario where parallel corpora for Machine Translation are built through translation crowdsourcing/outsourcing. This can also provide a way of monitoring progress of second language learners.
5 0.94600058 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
6 0.92917246 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations
7 0.92662066 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features
8 0.9198153 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
9 0.74283463 166 acl-2013-Generalized Reordering Rules for Improved SMT
10 0.73135549 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
11 0.72236675 281 acl-2013-Post-Retrieval Clustering Using Third-Order Similarity Measures
12 0.71617866 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
13 0.69885606 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
14 0.69333458 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources
15 0.67779672 69 acl-2013-Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation
16 0.67603153 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
17 0.6611464 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts
18 0.65482259 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
19 0.65082568 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
20 0.63839042 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation