acl acl2013 acl2013-289 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lucia Specia; Kashif Shah; Jose G.C. de Souza; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
QuEst - A translation quality estimation framework
Lucia Specia§, Kashif Shah§, Jose G.C. de Souza†, Trevor Cohn
†Fondazione Bruno Kessler, University of Trento, Italy
desouza@fbk.eu
1 Introduction

As Machine Translation (MT) systems become widely adopted both for gisting purposes and to produce professional quality translations, automatic methods are needed for predicting the quality of a translated segment.
Different from standard MT evaluation metrics, quality estimation (QE) metrics do not have access to reference (human) translations; they are aimed at MT systems in use.
Work in QE for MT started in the early 2000s, inspired by the confidence scores used in Speech Recognition: mostly the estimation of word posterior probabilities. Back then it was called confidence estimation, which we believe is a narrower term. Early work (Blatz et al., 2004) aimed to estimate automatic metrics such as BLEU (Papineni et al., 2002). These metrics are difficult to interpret, particularly at the sentence level, and the results of their many trials proved unsuccessful. The overall quality of MT was considerably lower at the time, and therefore pinpointing the very few good quality segments was a hard problem. Neither software nor datasets were made available after the workshop.
A new surge of interest in the field started recently, motivated by the widespread use of MT systems in the translation industry, as a consequence of better translation quality, more user-friendly tools, and higher demand for translation. In order to make MT maximally useful in this scenario, a quantification of the quality of translated segments, similar to the "fuzzy match scores" from translation memory systems, is needed.
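Fuzzy match scores in translation memory tools are typically edit-distance-based similarity ratios between a new source segment and previously translated segments. As a rough illustration only (not the metric of any particular TM tool), Python's difflib provides one such ratio:

```python
from difflib import SequenceMatcher

def fuzzy_match(source: str, tm_source: str) -> float:
    """Similarity ratio between a new source segment and a segment
    stored in a translation memory (0.0 = disjoint, 1.0 = identical)."""
    return SequenceMatcher(None, source, tm_source).ratio()

exact = fuzzy_match("the cat sat on the mat", "the cat sat on the mat")
close = fuzzy_match("the cat sat on the mat", "the cat sat on a mat")
```

Such surface matching is exactly what QE aims to go beyond.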
QE work addresses this problem by using more complex metrics that go beyond matching the source segment with previously translated data. QE can also be useful for end-users reading translations for gisting, particularly those who cannot read the source language. Examples include improving post-editing efficiency by filtering out low quality segments which would require more effort or time to correct than translating from scratch (Specia et al., 2009; Specia, 2011); selecting high quality segments to be published as they are, without post-editing (Soricut and Echihabi, 2010); selecting a translation from either an MT system or a translation memory for post-editing (He et al., 2010); selecting the best translation from multiple MT systems (Specia et al., 2010); and highlighting sub-segments that need revision (Bach et al.).
QE is generally addressed as a supervised machine learning task, using a variety of algorithms to induce models from examples of translations described through a number of features and annotated for quality. For an overview of various algorithms and features we refer the reader to the WMT12 shared task on QE (Callison-Burch et al., 2012). Most of the research work lies in deciding which aspects of quality are more relevant for a given task and designing feature extractors for them. While simple features such as counts of tokens and language model scores can be easily extracted, feature engineering for more advanced and useful information can be quite labour-intensive. Different language pairs, or optimisation against specific quality scores (e.g., post-editing time vs. translation adequacy), can benefit from very different feature sets.
QUEST, our framework for quality estimation, provides a wide range of feature extractors from source and translation texts and from external resources and tools (Section 2). These include features that rely on information from the MT system that generated the translations, and features that are oblivious to the way translations were produced (Section 2.1). In Section 3 we present experiments using the framework with nine QE datasets. In addition to providing a practical platform for quality estimation, by freeing researchers from feature engineering, QUEST will facilitate work on the learning aspect of the problem. Moreover, QE is highly non-linear: unlike many other problems in language processing, considerable improvements can be achieved using non-linear kernel techniques. Also, different applications of the quality predictions may benefit from different machine learning techniques, an aspect that has been mostly neglected so far. Finally, the framework will also facilitate research on ways of using quality predictions in novel extrinsic tasks, such as self-training of statistical machine translation systems, and for estimating quality in other text output applications such as text summarisation.
2 The QUEST framework

QUEST consists of two main modules: a feature extraction module and a machine learning module. The first module provides a number of feature extractors, including the most commonly used features in the literature and by systems submitted to the WMT12 shared task on QE (Callison-Burch et al., 2012). It is implemented in Java and provides abstract classes for features, resources and preprocessing steps, so that extractors for new features can be easily added.
The basic functioning of the feature extraction module requires raw text files with the source and translation texts, and a few resources (where available) such as the source MT training corpus and language models of the source and target languages. Configuration files are used to indicate the resources available and a list of features that should be extracted.
The machine learning module provides scripts connecting the feature files with the scikit-learn toolkit (http://scikit-learn.org/). It also uses GPy, a Python toolkit for Gaussian Process regression, which outperformed algorithms commonly used for the task, such as SVM regressors.
2.1 Feature sets

Figure 1 shows the types of features that can be extracted in QUEST. Although the text unit for which features are extracted can be of any length, most features are more suitable for sentences.
From the source segments QUEST can extract features that attempt to quantify the complexity of those segments.

[Figure 1: Families of features in QUEST: complexity indicators (source text), confidence indicators (MT system), and adequacy and fluency indicators (translation).]
From the translated segments QUEST can extract features that attempt to measure the fluency of such translations. Examples of features include:
• number of tokens in the target segment;
• average number of occurrences of each target word within the target segment;
• LM probability of the target segment, using a large corpus of the target language to build the LM.
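As a toy illustration, the first two counts might be computed as follows (a hypothetical Python re-implementation, not QuEst's Java extractors; the LM probability would additionally require a trained language model):

```python
from collections import Counter

def fluency_features(target_segment: str) -> dict:
    """Two simple black-box fluency indicators computed from the
    target segment alone (whitespace tokenisation for simplicity)."""
    tokens = target_segment.split()
    counts = Counter(tokens)
    return {
        "num_tokens": len(tokens),
        # average number of occurrences of each distinct target word
        "avg_occurrences": len(tokens) / len(counts) if counts else 0.0,
    }

feats = fluency_features("the cat sat on the mat")
```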
From the comparison between the source and target segments, QUEST can extract adequacy features, which attempt to measure whether the structure and meaning of the source are preserved in the translation.
When available, information from the MT system used to produce the translations can be very useful, particularly for statistical machine translation (SMT). These features can provide an indication of the confidence of the MT system in the translations. To extract these features, QUEST assumes the output of Moses-like SMT systems, taking into account word- and phrase-alignment information, a dump of the decoder's standard output (search graph information), global model score and feature values, n-best lists, etc.
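For instance, a Moses n-best list stores one hypothesis per line with "|||"-separated fields; a minimal parser for the global model score could look like the sketch below (the feature names and values in the example line are made up for illustration):

```python
def parse_nbest_line(line: str):
    """Parse one line of a Moses-style n-best list:
    sentence id ||| translation ||| feature scores ||| global model score."""
    sent_id, translation, feature_scores, total = (
        field.strip() for field in line.split("|||"))
    return int(sent_id), translation, feature_scores, float(total)

sid, hyp, feat_scores, score = parse_nbest_line(
    "0 ||| the house is small ||| LM0= -12.3 TM0= -4.5 ||| -8.2")
```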
Some word-level features have also been implemented: these include standard word posterior probabilities and n-gram probabilities for each target word. The complete list of features available is given as part of QUEST's documentation. At the current stage, the number of black-box (BB) features varies from 80 to 123 depending on the language pair, while glass-box (GB) features go from 39 to 48 depending on the SMT system used (see Section 3).
2.2 Machine learning

QUEST provides a command-line interface module for the scikit-learn library, implemented in Python. This module is completely independent from the feature extraction code: it uses the extracted feature sets to build QE models. Its dependencies are the scikit-learn library and all of its dependencies (such as NumPy and SciPy). The module can be configured to run different regression and classification algorithms, feature selection methods, and grid search for hyper-parameter optimisation. The pipeline with feature selection and hyper-parameter optimisation can be set using a configuration file.
Currently, the module has an interface for Support Vector Regression (SVR), Support Vector Classification, and Lasso learning algorithms. These can be used in conjunction with the feature selection algorithms (Randomised Lasso and Randomised decision trees) and the grid search implementation of scikit-learn to fit an optimal model for a given dataset.
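A hand-written scikit-learn equivalent of such a pipeline, on synthetic data, might look as follows. Note that SelectKBest stands in here for the Randomised Lasso selector, which recent scikit-learn releases no longer ship:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(40, 10)                                  # 40 segments, 10 candidate features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)    # quality depends mostly on feature 0

pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=3)),       # keep the 3 most predictive features
    ("svr", SVR(kernel="rbf")),
])
# Grid search over the SVR hyper-parameters, with 3-fold cross-validation
search = GridSearchCV(pipe, {"svr__C": [1, 10], "svr__gamma": [0.01, 0.1]}, cv=3)
search.fit(X, y)
predictions = search.predict(X)
```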
Additionally, QUEST includes Gaussian Process (GP) regression (Rasmussen and Williams, 2006) using the GPy toolkit (https://github.com/SheffieldML/GPy). GPs are an advanced machine learning framework incorporating Bayesian non-parametrics and kernel machines, and are widely regarded as the state of the art for regression. In contrast to SVR, inference in GP regression can be expressed analytically and the model hyperparameters optimised directly using gradient ascent, thus avoiding the need for costly grid search.
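The same idea can be sketched with scikit-learn's GaussianProcessRegressor as a stand-in for GPy, on toy data; fitting maximises the log marginal likelihood with gradient-based optimisation, so no parameter grid is needed:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.05, size=30)

# Kernel hyperparameters (RBF length scale, noise level) are fitted by
# maximising the log marginal likelihood with gradient-based search.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)
gp.fit(X, y)
mean, std = gp.predict(X, return_std=True)   # predictive mean and uncertainty
```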
This also makes the method very suitable for feature selection. (The contrast with SVR follows from the optimisation objective: GPs use a quadratic loss, the log-likelihood of a Gaussian, whereas SVR penalises absolute margin violations.)
3 Benchmarking

In this section we benchmark QUEST on nine existing datasets, using feature selection and learning algorithms known to perform well on the task.
3.1 Datasets

The statistics of the datasets used in the experiments are shown in Table 1.

WMT12: English-Spanish sentence translations produced by an SMT system and judged for post-editing effort in 1-5 (worst-best), taking a weighted average of three annotators.

EAMT11: English-Spanish (EAMT11-en-es) and French-English (EAMT11-fr-en) sentence translations judged for post-editing effort in 1-4.

EAMT09: English sentences translated by four SMT systems into Spanish and scored for post-editing effort in 1-4.

GALE11: Arabic sentences translated by two SMT systems into English and scored for adequacy in 1-4.
3.2 Settings

Amongst the various learning algorithms available in QUEST, to make our results comparable we selected SVR with a radial basis function (RBF) kernel, which has been shown to perform very well on this task (Callison-Burch et al., 2012). The optimisation of parameters is done with grid search over the penalty parameter C and the RBF kernel width γ. For feature selection, we have experimented with two techniques: Randomised Lasso and Gaussian Processes. (The datasets can be downloaded from the project website.)
Randomised Lasso (Meinshausen and Bühlmann, 2010) repeatedly resamples the training data and fits a Lasso regression model on each sample. A feature is said to be selected if it was chosen (i.e., given a non-zero weight) in a large enough proportion of the samples.
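A minimal sketch of this resampling step on synthetic data (omitting Randomised Lasso's additional random rescaling of per-feature penalties):

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, n_resamples=50, sample_frac=0.75, alpha=0.05, seed=0):
    """Fraction of random subsamples in which each feature receives a
    non-zero Lasso weight (the resampling core of Randomised Lasso)."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_resamples):
        idx = rng.choice(n_samples, size=int(sample_frac * n_samples), replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += model.coef_ != 0
    return counts / n_resamples

rng = np.random.RandomState(1)
X = rng.rand(100, 5)
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 is informative
freq = selection_frequencies(X, y)
```

Features with a high selection frequency (here, feature 0) are kept.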
Feature selection with Gaussian Processes is done by fitting per-feature RBF widths (also known as the automatic relevance determination kernel). The RBF width denotes the importance of a feature: the narrower the RBF, the more important a change in the feature value is to the model prediction.
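This can be sketched with scikit-learn's anisotropic RBF kernel (one length scale per feature) as a stand-in for GPy's ARD kernel, on synthetic data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.sin(4.0 * X[:, 0]) + rng.normal(scale=0.05, size=60)   # only feature 0 matters

# One RBF length scale ("width") per feature; a narrow fitted width means
# the prediction is sensitive to that feature, so rank by 1 / width.
kernel = RBF(length_scale=np.ones(X.shape[1]))
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, random_state=0)
gp.fit(X, y)
relevance = 1.0 / gp.kernel_.length_scale
ranking = np.argsort(-relevance)   # most relevant feature first
```

Taking the top-k entries of `ranking` gives a feature selection comparable to the FS(GP) systems below.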
To make the results comparable with our baseline systems, we select the 17 top-ranked features and then train an SVR on these features. As feature sets, we select all features available in QUEST for each of our datasets. We differentiate between black-box (BB) and glass-box (GB) features, as only BB features are available for all datasets (we did not have access to the MT systems that produced the other datasets). For each dataset we build four systems:
• BL: 17 baseline features that performed well across languages in previous work and were used as the baseline in the WMT12 QE task;
• AF: all features available in QUEST for the dataset;
• FS(RL): feature selection over AF with Randomised Lasso;
• FS(GP): feature selection over AF with Gaussian Processes.
3.3 Results

The error scores for all datasets with BB features are reported in Table 2, while Table 3 shows the results with GB features, and Table 4 the results with BB and GB features together. It can be seen from the results that adding more BB features (systems AF) improves the results in most cases as compared to the baseline systems (more features resulted in further performance gains on most tasks, with 25–35 features giving the best results). This behaviour is to be expected, as adding more features may bring more relevant information, but at the same time it makes the representation more sparse and the learning prone to overfitting.
In most cases, feature selection with both or either of RL and GP improves over using all features (AF). It should be noted that RL automatically selects the number of features used for training, while FS(GP) was limited to selecting the top 17 features in order to make the results comparable with our baseline feature set. It is interesting to note that system FS(GP) outperformed the other systems in spite of using fewer features. This technique is promising as it reduces the time requirements and overall computational complexity of training the model, while achieving similar results compared to systems with many more features.
Another interesting question is whether these feature selection techniques identify a common subset of features across the various datasets. Interestingly, not all top-ranked features are among the baseline 17 features which are reportedly best in the literature. GB features on their own perform worse than BB features, but in all three datasets the combination of GB and BB followed by feature selection resulted in significantly lower errors than using only BB features with feature selection, showing that the two feature sets are complementary.
4 Remarks

The source code for the framework, the datasets and extra resources can be downloaded from http://www. The project is also set to receive contributions from interested researchers via a GitHub repository: https://github. The license for the Java code and the Python and shell scripts is BSD, a permissive license with no restrictions on the use or extension of the software for any purposes, including commercial. For pre-existing dependencies, e.g. scikit-learn, GPy and the Berkeley parser, their own licenses apply, but features relying on these resources can be easily discarded if necessary.
wordName wordTfidf (topN-words)
[('quest', 0.46), ('qe', 0.323), ('bb', 0.257), ('segments', 0.223), ('lasso', 0.145), ('cikit', 0.139), ('svr', 0.132), ('gb', 0.129), ('randomised', 0.129), ('mt', 0.128), ('gpy', 0.126), ('segment', 0.126), ('translations', 0.122), ('optimisation', 0.121), ('specia', 0.12), ('gp', 0.116), ('quality', 0.106), ('source', 0.1), ('module', 0.1), ('gisting', 0.095), ('features', 0.094), ('translation', 0.09), ('extractors', 0.088), ('gaussian', 0.087), ('smt', 0.087), ('lm', 0.081), ('datasets', 0.078), ('grid', 0.076), ('feature', 0.075), ('rbf', 0.072), ('estimation', 0.071), ('target', 0.069), ('awvoerrda', 0.063), ('meinshausen', 0.063), ('tchee', 0.063), ('regression', 0.06), ('rl', 0.06), ('selection', 0.058), ('deciding', 0.057), ('github', 0.057), ('adequacy', 0.057), ('rasmussen', 0.056), ('translated', 0.054), ('af', 0.052), ('soricut', 0.05), ('narrower', 0.048), ('fie', 0.048), ('blatz', 0.048), ('que', 0.048), ('mae', 0.048), ('effort', 0.047), ('gps', 0.046), ('proportion', 0.046), ('oef', 0.045), ('ratio', 0.045), ('postediting', 0.044), ('bach', 0.044), ('estimating', 0.043), ('percentage', 0.042), ('files', 0.041), ('kernel', 0.04), ('shah', 0.04), ('algorithms', 0.04), ('selecting', 0.039), ('metrics', 0.038), ('revision', 0.038), ('bl', 0.038), ('code', 0.037), ('license', 0.037), ('systems', 0.037), ('python', 0.036), ('resources', 0.035), ('confidence', 0.034), ('ooff', 0.034), ('fs', 0.033), ('nine', 0.033), ('framework', 0.032), ('advanced', 0.031), ('scripts', 0.031), ('provides', 0.031), ('probabilities', 0.031), ('ba', 0.03), ('java', 0.03), ('downloaded', 0.029), ('eu', 0.029), ('benchmark', 0.028), ('kashif', 0.028), ('odfe', 0.028), ('quartiles', 0.028), ('lstm', 0.028), ('reportedly', 0.028), ('penalises', 0.028), ('nfs', 0.028), ('shell', 0.028), ('functioning', 0.028), ('resamples', 0.028), ('gweit', 0.028), ('luc', 0.028), ('userfriendly', 0.028), ('onfu', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; ; ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
Author: Trevor Cohn ; Lucia Specia
Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.
3 0.2817525 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
Author: Daniel Beck ; Lucia Specia ; Trevor Cohn
Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on humanannotated datasets, which are very costly due to its task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors. ,t .
4 0.12784243 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu
Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s ap- proval. Such a system can reduce postediting effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
5 0.11161759 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
Author: Chi-kiu Lo ; Karteek Addanki ; Markus Saers ; Dekai Wu
Abstract: We present the first ever results showing that tuning a machine translation system against a semantic frame based objective function, MEANT, produces more robustly adequate translations than tuning against BLEU or TER as measured across commonly used metrics and human subjective evaluation. Moreover, for informal web forum data, human evaluators preferred MEANT-tuned systems over BLEU- or TER-tuned systems by a significantly wider margin than that for formal newswire—even though automatic semantic parsing might be expected to fare worse on informal language. We argue thatbypreserving the meaning ofthe trans- lations as captured by semantic frames right in the training process, an MT system is constrained to make more accurate choices of both lexical and reordering rules. As a result, MT systems tuned against semantic frame based MT evaluation metrics produce output that is more adequate. Tuning a machine translation system against a semantic frame based objective function is independent ofthe translation model paradigm, so, any translation model can benefit from the semantic knowledge incorporated to improve translation adequacy through our approach.
6 0.11097826 240 acl-2013-Microblogs as Parallel Corpora
7 0.10564272 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
8 0.096862286 255 acl-2013-Name-aware Machine Translation
9 0.096453249 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
10 0.095383957 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
11 0.092496976 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
12 0.090210475 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
13 0.090162411 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
14 0.089894317 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
15 0.089493498 135 acl-2013-English-to-Russian MT evaluation campaign
16 0.087712392 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
17 0.085734293 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
18 0.081064388 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.076920673 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
20 0.076583661 154 acl-2013-Extracting bilingual terminologies from comparable corpora
topicId topicWeight
[(0, 0.206), (1, -0.073), (2, 0.137), (3, 0.042), (4, 0.002), (5, -0.004), (6, 0.034), (7, -0.023), (8, 0.056), (9, 0.06), (10, -0.036), (11, 0.085), (12, -0.124), (13, 0.068), (14, -0.119), (15, -0.048), (16, -0.11), (17, 0.01), (18, 0.002), (19, -0.004), (20, 0.155), (21, 0.013), (22, -0.152), (23, 0.04), (24, -0.098), (25, -0.031), (26, -0.033), (27, 0.038), (28, -0.02), (29, 0.025), (30, -0.088), (31, -0.045), (32, 0.098), (33, 0.081), (34, -0.13), (35, -0.005), (36, -0.011), (37, 0.097), (38, -0.004), (39, 0.106), (40, -0.174), (41, -0.088), (42, 0.074), (43, -0.042), (44, 0.013), (45, -0.077), (46, -0.044), (47, 0.103), (48, -0.018), (49, -0.121)]
simIndex simValue paperId paperTitle
same-paper 1 0.90766424 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; ; ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
2 0.86868811 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
Author: Daniel Beck ; Lucia Specia ; Trevor Cohn
Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on humanannotated datasets, which are very costly due to its task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors. ,t .
Author: Trevor Cohn ; Lucia Specia
Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.
4 0.71980202 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Author: Guillaume Wisniewski
Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decision based on a single rating of each example.
5 0.65833384 135 acl-2013-English-to-Russian MT evaluation campaign
Author: Pavel Braslavski ; Alexander Beloborodov ; Maxim Khalilov ; Serge Sharoff
Abstract: This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language directfioorn. t Teh Een quality Rofu generated utraagnsel datiiroencswas assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
6 0.65093362 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
7 0.60594827 250 acl-2013-Models of Translation Competitions
8 0.57533163 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
9 0.53450602 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
10 0.5264321 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
11 0.52627403 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
12 0.51825464 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
13 0.51325101 13 acl-2013-A New Syntactic Metric for Evaluation of Machine Translation
14 0.51230514 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning
15 0.50694406 255 acl-2013-Name-aware Machine Translation
16 0.46274358 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
17 0.46057707 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
18 0.4568851 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
19 0.45402214 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
20 0.45074368 312 acl-2013-Semantic Parsing as Machine Translation
topicId topicWeight
[(0, 0.051), (6, 0.103), (11, 0.042), (23, 0.125), (24, 0.032), (26, 0.083), (29, 0.011), (31, 0.011), (35, 0.057), (42, 0.042), (48, 0.019), (70, 0.032), (88, 0.02), (90, 0.032), (95, 0.242), (99, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.94193983 289 acl-2013-QuEst - A translation quality estimation framework
Author: Lucia Specia ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn
Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.
2 0.8692323 37 acl-2013-Adaptive Parser-Centric Text Normalization
Author: Congle Zhang ; Tyler Baldwin ; Howard Ho ; Benny Kimelfeld ; Yunyao Li
Abstract: Text normalization is an important first step towards enabling many Natural Language Processing (NLP) tasks over informal text. While many of these tasks, such as parsing, perform best over fully grammatically correct text, most existing text normalization approaches narrowly define the task in the word-to-word sense; that is, the task is seen as that of mapping all out-of-vocabulary non-standard words to their in-vocabulary standard forms. In this paper, we take a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text. To understand the real effect of normalization on the parser, we tie normalization performance directly to parser performance. Additionally, we design a customizable framework to address the often overlooked concept of domain adaptability, and illustrate that the system allows for transfer to new domains with a minimal amount of data and effort. Our experimental study over datasets from three domains demonstrates that our approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.
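The word-to-word baseline that this abstract argues is too narrow can be sketched as a simple lookup: map out-of-vocabulary non-standard tokens to in-vocabulary standard forms. The table entries and vocabulary below are hypothetical, for illustration only; the paper's parser-centric system goes well beyond this.

```python
# Hypothetical word-to-word normalization baseline (not the paper's
# parser-centric system): out-of-vocabulary tokens are replaced via a
# lookup table; everything else passes through unchanged.
NORMALIZATION_TABLE = {  # illustrative entries, not from the paper
    "u": "you", "2morrow": "tomorrow", "gr8": "great",
}
VOCABULARY = {"you", "tomorrow", "great", "see", "i", "will"}

def normalize(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in VOCABULARY:
            out.append(low)  # already a standard in-vocabulary form
        else:
            # fall back to identity when no mapping is known
            out.append(NORMALIZATION_TABLE.get(low, low))
    return out
```

Because the mapping is token-local, it cannot repair grammatical structure (word order, missing function words), which is exactly the gap the parser-centric view targets.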
3 0.86759174 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
Author: Silvana Hartmann ; Iryna Gurevych
Abstract: We present a new bilingual FrameNet lexicon for English and German. It is created through a simple, but powerful approach to construct a FrameNet in any language using Wiktionary as an interlingual representation. Our approach is based on a sense alignment of FrameNet and Wiktionary, and subsequent translation disambiguation into the target language. We perform a detailed evaluation of the created resource and a discussion of Wiktionary as an interlingual connection for the cross-language transfer of lexical-semantic resources. The created resource is publicly available at http://www.ukp.tu-darmstadt.de/fnwkde/.
4 0.86441851 66 acl-2013-Beam Search for Solving Substitution Ciphers
Author: Malte Nuhn ; Julian Schamper ; Hermann Ney
Abstract: In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy to implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near to optimal) results than previously published. Unlike the previous state-of-the-art approach that uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model. Furthermore we use our algorithm to solve large vocabulary substitution ciphers and improve the best published decipherment error rate based on the Gigaword corpus of 7.8% to 6.0% error rate.
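The core idea in this abstract, beam search over partial substitution keys scored by a character n-gram language model, can be sketched as follows. This is a toy illustration, not the paper's implementation: it uses an add-one-smoothed character bigram model rather than a 6-gram model, and the beam width is an arbitrary assumption.

```python
import heapq
from collections import Counter
from math import log

def bigram_lm(text):
    # Add-one-smoothed character-bigram log-probabilities from training text.
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    vocab_size = len(set(text))
    def logp(a, b):
        return log((pairs[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return logp

def beam_decipher(cipher, plain_alphabet, logp, beam_size=100):
    # Extend partial keys one cipher symbol at a time (most frequent first),
    # keeping only the beam_size highest-scoring hypotheses per step.
    symbols = [s for s, _ in Counter(cipher).most_common()]
    beams = [({}, 0.0)]  # (partial key: cipher symbol -> plaintext letter, score)
    for sym in symbols:
        candidates = []
        for key, _ in beams:
            used = set(key.values())
            for letter in plain_alphabet:
                if letter in used:
                    continue  # 1:1 substitution: each letter used at most once
                new_key = dict(key, **{sym: letter})
                # Score only cipher bigrams whose both symbols are already mapped.
                score = sum(
                    logp(new_key[a], new_key[b])
                    for a, b in zip(cipher, cipher[1:])
                    if a in new_key and b in new_key
                )
                candidates.append((new_key, score))
        beams = heapq.nlargest(beam_size, candidates, key=lambda kv: kv[1])
    return beams[0][0]  # best complete key found
```

With a tiny bigram model the recovered key is not guaranteed to be correct; the point is the search structure, where stronger (higher-order) language models make the beam scores far more discriminative, matching the abstract's 3-gram vs. 6-gram comparison.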
5 0.86192811 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
Author: Kareem Darwish
Abstract: Some languages lack large knowledge bases and good discriminative features for Name Entity Recognition (NER) that can generalize to previously unseen named entities. One such language is Arabic, which: a) lacks a capitalization feature; and b) has relatively small knowledge bases, such as Wikipedia. In this work we address both problems by incorporating cross-lingual features and knowledge bases from English using cross-lingual links. We show that such features have a dramatic positive effect on recall. We show the effectiveness of cross-lingual features and resources on a standard dataset as well as on two new test sets that cover both news and microblogs. On the standard dataset, we achieved a 4.1% relative improvement in F-measure over the best reported result in the literature. The features led to improvements of 17.1% and 20.5% on the new news and microblogs test sets respectively.
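The cross-lingual feature idea in this abstract can be illustrated with a minimal sketch: follow a cross-lingual link from an Arabic token to its English counterpart and check an English gazetteer. The link table and gazetteer entries below are hypothetical placeholders; the paper derives them from Wikipedia cross-lingual links and English knowledge bases.

```python
# Hypothetical cross-lingual gazetteer feature for Arabic NER
# (illustrative only; not the paper's feature set). A token fires the
# feature if its English counterpart appears in an English entity list.
AR_TO_EN = {"لندن": "london", "باريس": "paris"}     # placeholder link table
EN_GAZETTEER = {"london", "paris", "obama"}          # placeholder English gazetteer

def cross_lingual_feature(arabic_token):
    # Returns 1 if the linked English form is a known entity, else 0.
    english = AR_TO_EN.get(arabic_token)
    return 1 if english in EN_GAZETTEER else 0
```

Such a binary feature would then be fed to the NER classifier alongside monolingual features, compensating for the missing capitalization cue and the small Arabic knowledge bases that the abstract identifies.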
6 0.8613919 328 acl-2013-Stacking for Statistical Machine Translation
7 0.85937876 359 acl-2013-Translating Dialectal Arabic to English
8 0.85890317 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
9 0.85679692 240 acl-2013-Microblogs as Parallel Corpora
10 0.85337967 333 acl-2013-Summarization Through Submodularity and Dispersion
11 0.84447348 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information
12 0.84405208 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
13 0.84361356 135 acl-2013-English-to-Russian MT evaluation campaign
14 0.84339476 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews
15 0.84224725 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
16 0.83874017 255 acl-2013-Name-aware Machine Translation
17 0.83804309 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits
18 0.83463538 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
19 0.83024716 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
20 0.82690185 97 acl-2013-Cross-lingual Projections between Languages from Different Families