acl acl2011 acl2011-146 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan
Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict wordlevel and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improve- ments between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. [sent-3, score-0.312]
2 The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. [sent-9, score-0.438]
3 Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. [sent-13, score-0.311]
4 However, a remaining open question is how to predict confidence scores for machine translated words and sentences. [sent-24, score-0.283]
5 Other areas, such as cross-lingual question-answering, information extraction and retrieval, can also benefit from the confidence scores of MT output. [sent-32, score-0.227]
6 Numerous attempts have been made to tackle the confidence estimation problem. [sent-34, score-0.262]
7 Blatz et al. (2004) is perhaps the best known study of sentence and word level features and their impact on translation error prediction. [sent-36, score-0.255]
8 Soricut and Echihabi (2010) developed regression models which are used to predict the expected BLEU score of a given translation hypothesis. [sent-41, score-0.212]
9 Improvement also can be obtained by using target part-of-speech and null dependency link in a MaxEnt classifier (Xiong et al. [sent-42, score-0.232]
10 Literally, it translates the MT output backward into the source language to see whether the output of the backward translation matches the original source sentence. [sent-47, score-0.444]
11 Blatz et al. (2004) only investigated source n-gram frequency statistics and source language model features, while other work mainly focused on target-side features. [sent-51, score-0.357]
12 [Page header: Association for Computational Linguistics, pages 211–219.] … the translation references, which is different from predicting the human-targeted translation edit rate (HTER), which is crucial in post-editing applications (Snover et al. [sent-59, score-0.312]
13 Finally, the backtranslation approach faces a serious issue when forward and backward translation models are symmetric. [sent-62, score-0.197]
14 In this paper, we predict the error type of each word in the MT output with a confidence score, extend it to the sentence level, then apply it to the n-best list reranking task to improve MT quality, and finally design a visualization prototype. [sent-64, score-0.455]
15 We try to answer the following questions: • Can we use a rich feature set, such as source-side information, alignment context, and dependency structures, to improve error prediction performance? [sent-65, score-0.348]
16 • Do confidence measures help the MT system to select a better translation? [sent-69, score-0.227]
17 • How can confidence scores be presented to improve end-user perception? [sent-70, score-0.227]
18 We describe novel features including source-side, alignment context, and dependency structures in Section 3. [sent-72, score-0.301]
19 Section 5 and 6 present applications of confidence scores. [sent-74, score-0.227]
20 We first estimate each individual word confidence and extend it to the whole sentence. [sent-77, score-0.227]
21 Given a training instance x, y is the true label of x; f stands for its feature vector f(x, y); and w is the feature weight vector. [sent-91, score-0.176]
22 To estimate the confidence of a sentence S we rely on the information from the forward-backward inference. [sent-112, score-0.227]
23 However, this quantity is the confidence measure for the label sequence predicted by the classifier, and it does not represent the goodness of the whole MT output. [sent-114, score-0.547]
24 The goodness of a sentence S is defined as follows: goodness(S) = (1/k) ∑_{i=1}^{k} p(y_i = Good | S) (7); it ranges between 0 and 1, where 0 is equivalent to an absolutely wrong translation and 1 is a perfect translation. [sent-118, score-0.199]
25 Essentially, goodness(S) is the arithmetic mean which represents the goodness of translation per word in the whole sentence. [sent-119, score-0.378]
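The goodness computation of Equation 7 can be sketched in a few lines; the probability list below is an illustrative stand-in for the classifier's per-word marginals, not values from the paper.

```python
# Sketch of Equation 7: sentence-level goodness as the arithmetic mean of the
# per-word probabilities of the Good label, p(y_i = Good | S).

def goodness(word_good_probs):
    """Average p(y_i = Good | S) over the k words of the MT output.

    Returns a value in [0, 1]: 0 for an absolutely wrong translation,
    1 for a perfect one.
    """
    if not word_good_probs:
        raise ValueError("sentence must contain at least one word")
    return sum(word_good_probs) / len(word_good_probs)

# Example: four words with made-up marginal Good probabilities.
probs = [0.9, 0.8, 0.3, 0.6]
print(round(goodness(probs), 2))  # 0.65
```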
26 In this section, we describe three new feature sets introduced on top of our baseline classifier which has WPP and target POS features (Ueffing and Ney, 2007; Xiong et al. [sent-122, score-0.285]
27 3.1 Source-side features. From the MT decoder log, we can track which source phrases generate target phrases. [sent-125, score-0.268]
28 Furthermore, one can infer the alignment between source and target words within the phrase pair using simple aligners such as IBM Model-1 alignment. [sent-126, score-0.354]
29 Source phrase features: These features are designed to capture the likelihood that source phrase and target word co-occur with a given error label. [sent-127, score-0.376]
30 Figure 1a illustrates this feature template where the first line is source POS tags, the second line is the Buckwalter romanized source Arabic sequence, and the third line is MT output. [sent-141, score-0.272]
31 The source phrase feature is defined as follows: f102(process) = 1 if source-POS = “DT DTNN”, and 0 otherwise. [sent-142, score-0.246]
Source POS and phrase context features: This feature set allows us to look at the surrounding context of the source phrase. [sent-146, score-0.323]
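A minimal sketch of this kind of binary indicator template; the factory function and the feature's internal name are hypothetical illustrations, not the paper's implementation.

```python
# Sketch of a source-phrase indicator feature: it fires (returns 1) when the
# source phrase that generated the target word carries a given POS sequence.

def make_source_pos_feature(pos_pattern):
    """Build an indicator feature for one source-POS pattern."""
    def feature(target_word, source_pos):
        # target_word is unused by this template but kept to show the
        # (target word, source phrase) pairing the features are defined over.
        return 1 if source_pos == pos_pattern else 0
    return feature

f102 = make_source_pos_feature("DT DTNN")
print(f102("process", "DT DTNN"))  # 1
print(f102("process", "RB VBP"))   # 0
```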
33 We also have other information such as on the right hand side the next two phrases are “ayda” and “tshyr” or the sequence of source target POS on the right hand side is “RB VBP”. [sent-148, score-0.299]
34 2 Alignment context features The IBM Model-1 feature performed relatively well in comparison with the WPP feature as shown by Blatz et al. [sent-151, score-0.232]
[Figure residue.] (c) Left target (d) Source POS & right target. Figure 2: Alignment context features. [sent-158, score-0.253]
… not only the IBM Model-1 feature but also the surrounding alignment context. [sent-159, score-0.177]
37 The key intuition is that collocation is a reliable indicator for judging if a target word is generated by a particular source word (Huang, 2009). [sent-160, score-0.209]
38 Moreover, the IBM Model-1 feature was already used in several steps of a translation system such as word alignment, phrase extraction and scoring. [sent-161, score-0.256]
39 The IBM Model-1 assumes one target word can only be aligned to one source word. [sent-164, score-0.254]
40 Therefore, given a target word we can always identify which source word it is aligned to. [sent-165, score-0.254]
41 Source alignment context feature: We anchor the target word and derive context features surrounding its source word. [sent-166, score-0.461]
For example, in Figures 2a and 2b we have an alignment between “tshyr” and “refers”. The source contexts of “tshyr” with a window of one word are “ayda” to the left and “aly” to the right. [sent-167, score-0.214]
43 Target alignment context feature: Similar to source alignment context features, we anchor the source word and derive context features surrounding the aligned target word. [sent-168, score-0.761]
44 Figure 2c shows a left target context feature of word “refers”. [sent-169, score-0.213]
Combining alignment context with POS tags: Instead of using lexical context, we have features that look at the source and target POS alignment context. [sent-171, score-0.572]
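A hedged sketch of the source alignment context features: anchor a target word, find its Model-1-aligned source word, and emit the source neighbours within a window of one word. The one-to-one alignment map and feature names are illustrative assumptions; the data mirrors the Figure 2 example ("tshyr" aligned to "refers", with "ayda" on the left and "aly" on the right).

```python
# Source alignment context: given a target word, look up its aligned source
# word (Model-1 style: each target word aligns to exactly one source word)
# and collect the surrounding source words as features.

def source_context_features(tgt_idx, alignment, source_words, window=1):
    """alignment maps target index -> single source index."""
    s = alignment[tgt_idx]
    feats = []
    for offset in range(1, window + 1):
        if s - offset >= 0:
            feats.append(("src_left%d" % offset, source_words[s - offset]))
        if s + offset < len(source_words):
            feats.append(("src_right%d" % offset, source_words[s + offset]))
    return feats

source = ["ayda", "tshyr", "aly"]
align = {0: 1}  # target word 0 ("refers") is aligned to source word "tshyr"
print(source_context_features(0, align, source))
# [('src_left1', 'ayda'), ('src_right1', 'aly')]
```

The target alignment context feature is the mirror image: anchor the source word and collect the target-side neighbours of the word it aligns to.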
3.3 Source and target dependency structure features. The contextual and source information in the previous sections only takes into account surface structures of the source and target sentences. [sent-175, score-0.608]
47 Meanwhile, dependency structures have been extensively used in various translation systems (Shen et al. [sent-176, score-0.287]
48 The adoption of dependency structures might enable the classifier to utilize deep structures to predict translation errors. [sent-180, score-0.456]
49 Child-Father agreement: The motivation is to take advantage of the long distance dependency relations between source and target words. [sent-186, score-0.281]
Given an alignment between a source word si and a target word tj, [sent-187, score-0.363]
a child-father agreement exists when sk is aligned to tl, where sk and tl are the fathers of si and tj in the source and target dependency trees, respectively. [sent-188, score-0.369]
Children agreement: In the child-father agreement feature we look up the dependency tree; with a similar motivation, we can also look down the tree. [sent-196, score-0.21]
53 Essentially, given an alignment between a source word si and a target word tj, how many children of si and tj are aligned together? [sent-197, score-0.408]
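The child-father agreement check can be sketched as follows; the head arrays, the toy trees, and the alignment representation are illustrative assumptions, not the paper's data structures.

```python
# Child-father agreement: given an alignment s_i ~ t_j, the feature fires when
# the dependency fathers (heads) of s_i and t_j are themselves aligned.

def child_father_agreement(i, j, src_head, tgt_head, aligned):
    """src_head/tgt_head map a word index to its father's index (-1 = root);
    aligned is a set of (source_index, target_index) links."""
    sk, tl = src_head[i], tgt_head[j]
    if sk < 0 or tl < 0:          # a root has no father to agree on
        return False
    return (sk, tl) in aligned

# Toy trees: on both sides, word 1 is the father of word 0.
src_head = [1, -1]
tgt_head = [1, -1]
links = {(0, 0), (1, 1)}
print(child_father_agreement(0, 0, src_head, tgt_head, links))  # True
```

The children-agreement feature looks the other way down the tree: for aligned s_i and t_j, it counts how many of their children are aligned to each other.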
4.1 Arabic-English translation system. The SMT engine is a phrase-based system similar to the description in (Tillmann, 2006), where various features are combined within a log-linear framework. [sent-200, score-0.215]
55 These features include source-to-target phrase translation score, source-to-target and target-to-source wordto-word translation scores, language model score, distortion model scores and word count. [sent-201, score-0.405]
56 We trained two types of classifiers to predict the error type of each word in MT output, namely Good/Bad with a binary classifier and Good/Insertion/Substitution/Shift with a 4-class classifier. [sent-217, score-0.22]
WPP + target POS: only WPP and target POS features are used. [sent-219, score-0.212]
Our features: the classifier has source-side, alignment context, and dependency structure features; WPP and target POS features are excluded. [sent-222, score-0.394]
[Table 1 residue: dev/test F-scores of binary and 4-class classifiers for WPP + source side + alignment context.] [sent-225, score-0.413]
[Table 1 residue: WPP + target POS + source side + alignment context + dependency structures.] [sent-241, score-0.537]
Experimental results also indicate that source-side information, alignment context and dependency … [Figure 4a residue: bar chart “Predicting Good/Bad words”, F-score axis.] [sent-260, score-0.224]
[Figure 4 legend residue: All-Good, WPP, WPP+target POS, Our features, WPP+Our features, WPP+target POS+Our features; panel (a) Binary. Figure 4b residue: “Predicting Good/Insertion/Substitution/Shift words”, F-score axis.] [sent-268, score-0.177]
[Figure 4b legend residue; panel (b) 4-class.] Figure 4: Performance of binary and 4-class classifiers trained with different feature sets on the development and unseen test sets. [sent-276, score-0.369]
Among the three proposed feature sets, we observe that the source-side information contributes the most gain, followed by the alignment context and dependency structure features. [sent-278, score-0.438]
65 On the unseen test set our proposed features outperform WPP and target POS features by 2. [sent-283, score-0.28]
4.4 Correlation between Goodness and HTER. We estimate the sentence-level confidence score based on Equation 7. [sent-294, score-0.227]
67 Figure 5 illustrates the correlation between our proposed goodness sentence level confidence score and the human-targeted translation edit rate (HTER). [sent-295, score-0.66]
The Pearson correlation between goodness and HTER is 0.6. [sent-296, score-0.277]
Table 2: Detailed performance in precision, recall and F-score of binary and 4-class classifiers with WPP+target POS+Our features on the unseen test set. [sent-312, score-0.185]
… bars are thresholds used to visualize good and bad sentences, respectively. [sent-313, score-0.166]
We also experimented with computing goodness in Equation 7 using the geometric mean and the harmonic mean; their Pearson correlation values are 0. [sent-314, score-0.277]
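The correlation reported in this section can be reproduced with a plain Pearson computation; the two short score lists below are made-up stand-ins, not the paper's data. Since HTER counts edits, a useful confidence measure should correlate negatively with it.

```python
# Pearson correlation between sentence-level goodness and HTER.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

goodness_scores = [0.9, 0.7, 0.5, 0.3]   # illustrative goodness(S) values
hter_scores = [0.1, 0.3, 0.4, 0.8]       # illustrative HTER values
print(round(pearson(goodness_scores, hter_scores), 2))  # -0.96
```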
5 Improving MT quality with N-best list reranking. Experiments reported in Section 4 indicate that the proposed confidence measure has a high correlation with HTER. [sent-317, score-0.366]
However, it is not very clear if the core MT system can benefit from the confidence measure by providing better translations. [sent-318, score-0.227]
74 The MT system generates top n hypotheses and for each hypothesis we compute sentence-level confidence scores. [sent-320, score-0.227]
The best candidate is the hypothesis with the highest confidence score. [sent-321, score-0.227]
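The reranking step above can be sketched as follows; the candidate strings and per-word probabilities are illustrative, and the scorer simply reuses the goodness mean of Equation 7.

```python
# Confidence-based n-best reranking: score each hypothesis with goodness(S)
# and keep the highest-scoring candidate.

def rerank(nbest):
    """nbest: list of (hypothesis, word_good_probs) pairs."""
    def score(item):
        probs = item[1]
        return sum(probs) / len(probs)   # goodness(S), Equation 7
    return max(nbest, key=score)[0]

candidates = [
    ("the navy forces", [0.6, 0.5, 0.7]),    # goodness = 0.60
    ("the naval forces", [0.9, 0.8, 0.7]),   # goodness = 0.80
]
print(rerank(candidates))  # the naval forces
```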
76 Table 3 shows the performance of reranking systems using goodness scores from our best classifier in various n-best sizes. [sent-322, score-0.36]
77 Figure 6 shows the improvement of reranking with goodness score. [sent-374, score-0.306]
6 Visualizing translation errors. Besides the application of the confidence score in the n-best list reranking task, we propose a method to visualize translation errors using confidence scores. [sent-377, score-0.283]
Our purpose is to visualize word and sentence-level confidence scores with the following objectives: 1) easy spotting of translation errors; 2) simple and intuitive presentation; and 3) improved post-editing productivity. [sent-378, score-0.319]
On the word level, the marginal probability of the Good label is used to visualize translation errors as follows: a word wi is labeled good when p(yi = Good | S) is at or above the upper threshold, bad when it is at or below the lower threshold, and decent otherwise. [sent-380, score-0.359]
On the sentence level, the goodness score is used analogously: a sentence S is labeled good, decent, or bad according to whether goodness(S) is above the upper threshold, between the two thresholds, or below the lower threshold. [sent-382, score-0.222]
Table 4: Choices of layout. Font size: big = bad, small = good, medium = decent. Colors: red = bad, black = good, orange = decent. Different font sizes and colors are used to catch the attention of post-editors whenever translation errors are likely to appear, as shown in Table 4. [sent-383, score-0.604]
The idea of using font size and colour to visualize translation confidence is similar to the idea of using a tag/word cloud to describe the content of websites. [sent-385, score-0.648]
84 The reason we are using big font size and red color is to attract post-editors’ attention and help them find translation errors quickly. [sent-386, score-0.378]
85 Figure 7 shows an example of visualizing confidence scores by font size and colours. [sent-387, score-0.367]
It shows that “not to deprive yourself”, displayed in big font and red color, is likely to be a bad translation. [sent-388, score-0.241]
Meanwhile, other words, such as “you”, “different”, “from”, and “assimilation”, displayed in small font and black color, are likely to be good translations. [sent-389, score-0.174]
Words in medium font and orange color are decent translations. [sent-390, score-0.233]
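The Table 4 layout choices can be sketched as a simple mapping from a word's Good probability to HTML styling; the 0.5/0.8 thresholds and the function name are illustrative assumptions, not the paper's values.

```python
# Map a word's p(Good) to the font-size/colour choices of Table 4:
# big red = bad, medium orange = decent, small black = good.

def render_word(word, p_good, low=0.5, high=0.8):
    if p_good <= low:        # likely error: big red font
        return '<span style="font-size:large;color:red">%s</span>' % word
    if p_good >= high:       # likely correct: small black font
        return '<span style="font-size:small;color:black">%s</span>' % word
    return '<span style="font-size:medium;color:orange">%s</span>' % word

print(render_word("deprive", 0.2))
# <span style="font-size:large;color:red">deprive</span>
```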
MT output: you totally different from zaid amr , and not to deprive yourself in a basement of imitation and assimilation . [sent-446, score-0.346]
We predict and visualize: you totally different from zaid amr , and not to deprive yourself in a basement of imitation and assimilation . [sent-447, score-0.537]
Human correction: you are quite different from zaid and amr , so do not cram yourself in the tunnel of simulation , imitation and assimilation . [sent-448, score-0.244]
92 (b) Figure 7: MT errors visualization based on confidence scores. [sent-539, score-0.308]
93 7 Conclusions In this paper we proposed a method to predict confidence scores for machine translated words and sentences based on a feature-rich classifier using linguistic and context features. [sent-540, score-0.378]
94 Our major contributions are three novel feature sets including source side information, alignment context, and dependency structures. [sent-541, score-0.397]
95 Experimental results show that by combining the source side information, alignment context, and dependency structure features with word posterior probability and target POS context (Ueffing & Ney 2007; Xiong et al. [sent-542, score-0.537]
Furthermore, we show that the proposed confidence scores can help the MT system to select better translations, and as a result improvements between 0.4 and 0.9 TER reduction are obtained. [sent-550, score-0.227]
97 Finally, we demonstrate a prototype to visualize translation errors. [sent-553, score-0.283]
98 First, we plan to apply confidence estimation to perform a second-pass constraint decoding. [sent-555, score-0.262]
After the first-pass decoding, our confidence estimation model can label which words are likely to be correctly translated; the second-pass decoding then utilizes this confidence information. [sent-556, score-0.262]
100 A new string-to-dependency machine translation algorithm with a target dependency language model. [sent-646, score-0.334]
wordName wordTfidf (topN-words)
[('wpp', 0.531), ('confidence', 0.227), ('goodness', 0.222), ('mt', 0.222), ('tshyr', 0.163), ('translation', 0.156), ('font', 0.14), ('hter', 0.125), ('dtjj', 0.123), ('vbp', 0.122), ('alignment', 0.111), ('bach', 0.108), ('target', 0.106), ('source', 0.103), ('alamlyt', 0.102), ('ayda', 0.102), ('hdhh', 0.102), ('ter', 0.1), ('pos', 0.094), ('visualize', 0.092), ('reranking', 0.084), ('dtj', 0.082), ('ueffing', 0.078), ('dependency', 0.072), ('blatz', 0.072), ('aly', 0.072), ('assimilation', 0.072), ('rb', 0.07), ('ibm', 0.07), ('feature', 0.066), ('adm', 0.061), ('albhryt', 0.061), ('aljnsyt', 0.061), ('almtaddt', 0.061), ('alqwat', 0.061), ('deprive', 0.061), ('dtn', 0.061), ('dtnn', 0.061), ('imitation', 0.061), ('qdrt', 0.061), ('zaid', 0.061), ('structures', 0.059), ('features', 0.059), ('unseen', 0.056), ('xiong', 0.056), ('predict', 0.056), ('correlation', 0.055), ('classifier', 0.054), ('sanchis', 0.05), ('amr', 0.05), ('color', 0.049), ('mira', 0.048), ('visualization', 0.048), ('wt', 0.046), ('side', 0.045), ('aligned', 0.045), ('bleu', 0.045), ('decent', 0.044), ('label', 0.044), ('follow', 0.043), ('tj', 0.043), ('nguyen', 0.043), ('correction', 0.043), ('backward', 0.041), ('prepositional', 0.041), ('aonnd', 0.041), ('basement', 0.041), ('climate', 0.041), ('dgbeaocdodent', 0.041), ('dthte', 0.041), ('dtnns', 0.041), ('opnre', 0.041), ('rozenfeld', 0.041), ('wydyf', 0.041), ('ysj', 0.041), ('context', 0.041), ('nn', 0.041), ('bad', 0.04), ('error', 0.04), ('pearson', 0.04), ('colors', 0.039), ('ioft', 0.038), ('dev', 0.038), ('nj', 0.038), ('binary', 0.037), ('yi', 0.037), ('insertion', 0.037), ('raybaud', 0.036), ('arabicenglish', 0.036), ('zs', 0.036), ('prototype', 0.035), ('estimation', 0.035), ('stephan', 0.035), ('good', 0.034), ('phrase', 0.034), ('specia', 0.033), ('cloud', 0.033), ('errors', 0.033), ('meanwhile', 0.033), ('classifiers', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999839 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
2 0.2017062 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned ChineseEnglish corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions. 1
3 0.18409163 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
Author: Chi-kiu Lo ; Dekai Wu
Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacyjudgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1
5 0.14771135 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
Author: Yanjun Ma ; Yifan He ; Andy Way ; Josef van Genabith
Abstract: We present a discriminative learning method to improve the consistency of translations in phrase-based Statistical Machine Translation (SMT) systems. Our method is inspired by Translation Memory (TM) systems which are widely used by human translators in industrial settings. We constrain the translation of an input sentence using the most similar ‘translation example’ retrieved from the TM. Differently from previous research which used simple fuzzy match thresholds, these constraints are imposed using discriminative learning to optimise the translation performance. We observe that using this method can benefit the SMT system by not only producing consistent translations, but also improved translation outputs. We report a 0.9 point improvement in terms of BLEU score on English–Chinese technical documents.
6 0.13419886 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
7 0.13057037 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
8 0.12997852 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
9 0.12815906 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
10 0.12110689 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
11 0.12000754 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
12 0.11855932 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
13 0.11637684 313 acl-2011-Two Easy Improvements to Lexical Weighting
14 0.11461445 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
15 0.11399439 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
16 0.1132296 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
17 0.1083511 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
18 0.10758726 44 acl-2011-An exponential translation model for target language morphology
19 0.10530087 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
20 0.10501881 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
topicId topicWeight
[(0, 0.268), (1, -0.158), (2, 0.096), (3, 0.083), (4, 0.036), (5, 0.032), (6, 0.078), (7, -0.025), (8, 0.069), (9, 0.07), (10, 0.028), (11, -0.057), (12, 0.001), (13, -0.076), (14, -0.057), (15, 0.064), (16, -0.016), (17, -0.067), (18, -0.048), (19, -0.05), (20, -0.045), (21, -0.019), (22, 0.0), (23, 0.051), (24, 0.008), (25, -0.022), (26, -0.039), (27, -0.006), (28, -0.013), (29, 0.023), (30, -0.054), (31, 0.028), (32, -0.036), (33, 0.02), (34, -0.024), (35, 0.058), (36, 0.033), (37, 0.007), (38, 0.072), (39, -0.012), (40, 0.018), (41, 0.027), (42, 0.004), (43, -0.025), (44, -0.014), (45, -0.005), (46, 0.112), (47, -0.045), (48, -0.034), (49, -0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.95518452 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
2 0.83950716 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
Author: Yanjun Ma ; Yifan He ; Andy Way ; Josef van Genabith
Abstract: We present a discriminative learning method to improve the consistency of translations in phrase-based Statistical Machine Translation (SMT) systems. Our method is inspired by Translation Memory (TM) systems which are widely used by human translators in industrial settings. We constrain the translation of an input sentence using the most similar ‘translation example’ retrieved from the TM. Differently from previous research which used simple fuzzy match thresholds, these constraints are imposed using discriminative learning to optimise the translation performance. We observe that using this method can benefit the SMT system by not only producing consistent translations, but also improved translation outputs. We report a 0.9 point improvement in terms of BLEU score on English–Chinese technical documents.
3 0.83142775 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Author: Jonathan H. Clark ; Chris Dyer ; Alon Lavie ; Noah A. Smith
Abstract: In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.
4 0.81975698 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination ofmultiple mixture components. Each component contains a large set of features trained in a maximumentropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
5 0.79131275 313 acl-2011-Two Easy Improvements to Lexical Weighting
Author: David Chiang ; Steve DeNeefe ; Michael Pust
Abstract: We introduce two simple improvements to the lexical weighting features of Koehn, Och, and Marcu (2003) for machine translation: one which smooths the probability of translating word f to word e by simplifying English morphology, and one which conditions it on the kind of training data that f and e co-occurred in. These new variations lead to improvements of up to +0.8 BLEU, with an average improvement of +0.6 BLEU across two language pairs, two genres, and two translation systems.
6 0.76042438 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
7 0.75260586 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
8 0.73407167 220 acl-2011-Minimum Bayes-risk System Combination
9 0.72400254 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
10 0.72044271 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
11 0.71898109 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
12 0.7045216 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
13 0.70230377 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
14 0.69182634 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
15 0.68940604 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
16 0.68392962 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
18 0.68060398 264 acl-2011-Reordering Metrics for MT
19 0.66004843 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
20 0.65663242 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
topicId topicWeight
[(1, 0.011), (5, 0.024), (17, 0.031), (26, 0.022), (28, 0.012), (31, 0.014), (37, 0.101), (39, 0.052), (41, 0.04), (55, 0.035), (59, 0.032), (72, 0.045), (75, 0.269), (91, 0.06), (96, 0.171), (97, 0.018)]
simIndex simValue paperId paperTitle
1 0.96219021 303 acl-2011-Tier-based Strictly Local Constraints for Phonology
Author: Jeffrey Heinz ; Chetan Rawal ; Herbert G. Tanner
Abstract: Beginning with Goldsmith (1976), the phonological tier has a long history in phonological theory to describe non-local phenomena. This paper defines a class of formal languages, the Tier-based Strictly Local languages, which begin to describe such phenomena. Then this class is located within the Subregular Hierarchy (McNaughton and Papert, 1971). It is found that these languages contain the Strictly Local languages, are star-free, are incomparable with other known sub-star-free classes, and have other interesting properties.
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which have dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
3 0.82252431 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.
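Reservoir Counting itself is a memory-saving trick inside online LSH; the underlying signatures it maintains can be sketched with the standard random-hyperplane scheme: one bit per hyperplane, set by the sign of the dot product, so that Hamming similarity between signatures approximates cosine similarity. This sketch shows the plain scheme only, not the paper's streaming refinement, and the dimensions and seed are arbitrary.

```python
import random

def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: the sign of the dot product."""
    return tuple(
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming_sim(a, b):
    """Fraction of matching bits between two signatures."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

planes = random_hyperplanes(dim=5, n_bits=16)
sig = lsh_signature([1.0, 2.0, 3.0, 4.0, 5.0], planes)
```

Note that positively scaling a vector never flips any sign bit, so the signature depends only on direction, which is why bit agreement tracks cosine similarity.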
same-paper 4 0.77134627 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan
Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.
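The 0.6 figure quoted in this abstract is a Pearson correlation between sentence-level confidence scores and HTER. The coefficient itself is standard and easy to compute; the toy score lists below are invented, not the paper's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A confidence measure that is useful for reranking or post-editing should correlate (in absolute value) with HTER: sentences scored as low-confidence should need more edits.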
5 0.72746128 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
Author: David Chen ; William Dolan
Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
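The "simple n-gram comparisons" mentioned in this abstract can be sketched as n-gram precision between a paraphrase candidate and a reference: high unigram overlap suggests semantic adequacy, while low higher-order overlap suggests lexical dissimilarity. The scoring function and example sentences below are illustrative assumptions, not the paper's exact metrics.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(cand, ref, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand_grams = ngrams(cand, n)
    if not cand_grams:
        return 0.0
    ref_grams = set(ngrams(ref, n))
    return sum(g in ref_grams for g in cand_grams) / len(cand_grams)

a = "a man is slicing an onion".split()
b = "a man is cutting an onion".split()
```

Here unigram overlap is high (the pair shares most content) while bigram overlap drops around the substituted verb, which is exactly the adequacy-versus-dissimilarity trade-off the metric is meant to capture.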
6 0.63837302 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
7 0.637582 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
9 0.63445693 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
10 0.63346946 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
11 0.63281333 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
12 0.63262963 133 acl-2011-Extracting Social Power Relationships from Natural Language
13 0.63129056 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment
14 0.63069057 28 acl-2011-A Statistical Tree Annotator and Its Applications
15 0.63054669 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
16 0.6304909 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
17 0.63033104 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
18 0.63024783 117 acl-2011-Entity Set Expansion using Topic information
19 0.6299811 44 acl-2011-An exponential translation model for target language morphology
20 0.62988126 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters