acl acl2011 acl2011-2 knowledge-graph by maker-knowledge-mining

2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment


Source: pdf

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. [sent-5, score-0.342]

2 Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. [sent-6, score-0.178]

3 The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. [sent-7, score-0.665]

4 Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. [sent-8, score-0.722]

5 Finally, the main pros and cons of the proposed framework are discussed along with future research directions. [sent-9, score-0.101]

6 1 Introduction: Evaluation has always been one of the major issues in Machine Translation research, as both human and automatic evaluation methods exhibit very important limitations. [sent-10, score-0.175]

7 On the one hand, although highly reliable, in addition to being expensive and time-consuming, human evaluation suffers from inconsistency problems due to inter- and intra-annotator agreement issues. [sent-11, score-0.122]

8 On the other hand, while being consistent, fast and cheap, automatic evaluation has the major disadvantage of requiring reference translations. [sent-12, score-0.053] [sent-15, score-0.178]

10 This makes automatic evaluation unreliable in the sense that good translations that do not match the available references are evaluated as poor or bad translations. [sent-16, score-0.261]

11 The main objective of this work is to propose and evaluate AM-FM, a semantic framework for assessing translation quality without the need for reference translations. [sent-17, score-0.58]

12 The proposed framework is theoretically grounded on the classical concepts of adequacy and fluency, and it is designed to account for these two components of translation quality in an independent manner. [sent-18, score-0.706]

13 First, a cross-language latent semantic indexing model is used for assessing the adequacy component by directly comparing the output translation with the input sentence it was generated from. [sent-19, score-0.895]

14 Second, an n-gram based language model of the target language is used for assessing the fluency component. [sent-20, score-0.355]

15 Both components of the metric are evaluated at the sentence level, providing the means for defining and implementing a sentence-based evaluation metric. [sent-21, score-0.424]

16 Finally, the two components are combined into a single measure by implementing a weighted harmonic mean, for which the weighting factor can be adjusted for optimizing the metric performance. [sent-22, score-0.549]
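
As a rough illustration of this combination step, the sketch below computes the weighted harmonic mean of the two components in Python (the paper's equation (1) is not reproduced in this summary, so the exact placement of the weighting factor alpha on the AM side is an assumption):

```python
def am_fm(am: float, fm: float, alpha: float = 0.5) -> float:
    """Combine adequacy (AM) and fluency (FM), both assumed to lie in [0, 1],
    into one score via a weighted harmonic mean (sketch, not the paper's code)."""
    if am == 0.0 or fm == 0.0:
        return 0.0  # the harmonic mean collapses to 0 if either component is 0
    # Weighted harmonic mean: 1 / (alpha / AM + (1 - alpha) / FM)
    return 1.0 / (alpha / am + (1.0 - alpha) / fm)

# Example: adequacy 0.8, fluency 0.6, weighting factor alpha = 0.30
print(am_fm(0.8, 0.6, alpha=0.30))  # ~0.649
```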

17 Section 2 presents some background work and the specific dataset that has been used in the experimental work. [sent-24, score-0.08]

18 Section 3 provides details on the proposed AM-FM framework and the specific metric implementation. [sent-25, score-0.289]

19 Section 4 presents the results of the conducted comparative evaluations. [sent-26, score-0.185]

20 Although BLEU (Papineni et al., 2002) has become a de facto standard for machine translation evaluation, other metrics such as NIST (Doddington, 2002) and, more recently, Meteor (Banerjee and Lavie, 2005), are commonly used too. [sent-31, score-0.391]

21 The dataset used in this work corresponds to WMT-07. [sent-36, score-0.08]

22 This dataset is used, instead of a more recent one, because no human judgments on adequacy and fluency have been conducted in WMT after year 2007, and human evaluation data is not freely available from MetricsMATR. [sent-37, score-0.748]

23 In this dataset, translation outputs are available for fourteen tasks involving five European languages: English (EN), Spanish (ES), German (DE), French (FR) and Czech (CZ); and two domains: News Commentaries (News) and European Parliament Debates (EPPS). [sent-38, score-0.469]

24 A complete description of the WMT-07 evaluation campaign and dataset is available in Callison-Burch et al. (2007). [sent-39, score-0.153]

25 System outputs for fourteen of the fifteen systems that participated in the evaluation are available. [sent-41, score-0.309]

26 This accounts for 86 independent system outputs with a total of 172,315 individual sentence translations, of which only 10,754 were rated for both adequacy and fluency by human judges. [sent-42, score-0.706]

27 A normalization procedure (2003) was applied to all adequacy and fluency scores for removing individual voting patterns and averaging votes. [sent-45, score-0.465]

28 Table 1 provides information on the corresponding domain, and source and target languages for each of the fourteen translation tasks, along with their corresponding number of system outputs and the amount of sentence translations for which human evaluations are available. [sent-46, score-0.048] [sent-51, score-0.718]

30 Different from other approaches not using reference translations, we rely on a cross-language version of latent semantic indexing (Dumais et al., 1997) for creating a semantic space where translation outputs and inputs can be directly compared. [sent-53, score-0.312] [sent-54, score-0.36]

32 A two-component evaluation metric, based on the concepts of adequacy and fluency (White et al., 1994), is proposed. [sent-55, score-0.539]

33 While adequacy accounts for the amount of source meaning being preserved by the translation (5:all, 4:most, 3:much, 2:little, 1:none), fluency accounts for the quality of the target language in the translation (5:flawless, 4:good, 3:nonnative, 2:disfluent, 1:incomprehensible). [sent-57, score-1.043]

34 3.1 Metric Definition: For implementing the adequacy-oriented component (AM) of the metric, the cross-language latent semantic indexing approach is used (Dumais et al., 1997), in which the source sentence originating the translation is used as evaluation reference. [sent-59, score-0.39] [sent-60, score-0.303]

36 Accordingly, the AM component can be regarded as mainly adequacy-oriented, as it is computed in a cross-language semantic space. [sent-61, score-0.243]

37 For implementing the fluency-oriented component (FM) of the proposed metric, an n-gram based language model approach is used (Manning and Schutze, 1999). [sent-62, score-0.221]

38 This component can be regarded as mainly fluency-oriented, as it is computed on the target-language side in a manner that is totally independent of the source language. [sent-63, score-0.234]

39 3.2 Implementation Details: The adequacy-oriented component of the metric (AM) was implemented by following the procedure proposed by Dumais et al. (1997), where a bilingual collection of data is used to generate a cross-language projection matrix for a vector-space representation of texts (Salton et al., 1975) by using singular value decomposition, SVD (Golub and Kahan, 1965). [sent-66, score-0.342] [sent-67, score-0.147] [sent-68, score-0.064]

42 From the singular value decomposition depicted in (2), a low-dimensional representation for any sentence vector xa or xb, in language a or b, can be computed as follows: ya^T = [xa; 0]^T Uab^(M*L) (3.a) and yb^T = [0; xb]^T Uab^(M*L) (3.b), [sent-70, score-0.534]

43 where ya and yb represent the L-dimensional vectors corresponding to the projections of the full-dimensional sentence vectors xa and xb, respectively; and Uab^(M*L) is a cross-language projection matrix composed of the first L column vectors of the unitary matrix Uab obtained in (2). [sent-72, score-0.522]

44 Notice, from (3.a) and (3.b), how both sentence vectors xa and xb are padded with zeros at the corresponding other-language vocabulary positions for performing the cross-language projections. [sent-73, score-0.403]

45 For computing the projection matrices, random sets of 10,000 parallel sentences were drawn from the available training datasets. [sent-78, score-0.076]

46 Seven projection matrices were constructed in total, one for each different combination of domain and language pair. [sent-80, score-0.176]

47 TF-IDF weighting was applied to the constructed term-document matrices while maintaining all words in the vocabularies. [sent-81, score-0.222]

48 All computations related to SVD, sentence projections and cosine similarities were conducted with MATLAB. [sent-84, score-0.157]

49 Although this accounts for a small proportion of the datasets (20% of News and 1% of European Parliament), it allowed for maintaining computational requirements under control while still providing good vocabulary coverage. [sent-85, score-0.157]
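
To make the AM computation above more concrete, here is a small numpy sketch of the cross-language projection (an illustrative simplification, not the authors' MATLAB code; the TF-IDF weighting and the 10,000-sentence training sets described above are omitted):

```python
import numpy as np

def build_projection(X_src: np.ndarray, X_tgt: np.ndarray, L: int) -> np.ndarray:
    """X_src (V_src x D) and X_tgt (V_tgt x D) are term-document matrices built
    from the same D parallel sentence pairs.  Returns the cross-language
    projection matrix Uab^(M*L), i.e. the first L left singular vectors."""
    X_ab = np.vstack([X_src, X_tgt])                  # stacked bilingual matrix
    U, s, Vt = np.linalg.svd(X_ab, full_matrices=False)
    return U[:, :L]

def project(x: np.ndarray, U_L: np.ndarray, side: str, V_src: int, V_tgt: int) -> np.ndarray:
    """Zero-pad a monolingual sentence vector at the other language's vocabulary
    positions (as in (3.a)/(3.b)) and project it into the shared L-dim space."""
    if side == "src":
        padded = np.concatenate([x, np.zeros(V_tgt)])
    else:
        padded = np.concatenate([np.zeros(V_src), x])
    return padded @ U_L

def am_score(x_src, x_tgt, U_L, V_src, V_tgt) -> float:
    """Adequacy component (AM): cosine similarity between the source sentence
    and its translation in the cross-language latent space."""
    y_a = project(x_src, U_L, "src", V_src, V_tgt)
    y_b = project(x_tgt, U_L, "tgt", V_src, V_tgt)
    denom = np.linalg.norm(y_a) * np.linalg.norm(y_b)
    return float(y_a @ y_b / denom) if denom > 0 else 0.0
```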

50 The fluency-oriented component FM is implemented by using an n-gram language model. [sent-86, score-0.116]

51 According to this, the FM component is computed as FM = exp( (1/N) Σ_{n=1..N} log p(wn | wn-1, …) ) (5), where p(wn | wn-1, …) represents the target-language n-gram probability and N is the total number of words in the target sentence being evaluated. [sent-88, score-0.326]

52 By construction, the values of FM are also restricted to the interval [0,1]; so, both component values range within the same interval. [sent-89, score-0.154]

53 The models were computed with the SRILM toolbox (Stolcke, 2002). [sent-91, score-0.07]
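
A minimal sketch of equation (5) follows; it plugs in a stand-in log-probability function in place of a real n-gram model (such as the SRILM-trained models mentioned above), simply to show that FM is the geometric mean of the per-word n-gram probabilities and therefore stays in [0, 1]:

```python
import math
from typing import Callable, Sequence

def fm_score(words: Sequence[str],
             logprob: Callable[[Sequence[str], str], float]) -> float:
    """FM = exp((1/N) * sum_n log p(w_n | history)), i.e. the geometric mean
    of the n-gram probabilities of the N words in the target sentence."""
    if not words:
        return 0.0
    total = 0.0
    for n, word in enumerate(words):
        total += logprob(words[:n], word)   # log p(w_n | w_1 .. w_{n-1})
    return math.exp(total / len(words))

# Toy stand-in model (hypothetical): every word gets probability 0.1,
# so FM evaluates to 0.1 regardless of sentence length.
print(fm_score("this is a test".split(), lambda hist, w: math.log(0.1)))
```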

54 4 Comparative Evaluations: In order to evaluate the AM-FM framework, two comparative evaluations with standard metrics were conducted. [sent-93, score-0.352]

55 More specifically, BLEU, NIST and Meteor were considered, as they are the metrics most frequently used in machine translation evaluation campaigns. [sent-94, score-0.295]

56 4.1 Correlation with Human Scores: In this first evaluation, AM-FM is compared with standard evaluation metrics in terms of their correlations with human-generated scores. [sent-96, score-0.242]

57 Three parameters should be adjusted for the AM-FM implementation described in (1): the dimensionality of the reduced space for AM, the order of the n-gram model for FM, and the harmonic mean weighting parameter α. [sent-99, score-0.283]

58 Such parameters can be adjusted for maximizing the correlation coefficient between the AM-FM metric and human-generated scores. [sent-100, score-0.573]

59 After exploring the solution space, the following values were selected: dimensionality for AM: 1,000; order of the n-gram model for FM: 3; and weighting parameter α: 0.30. [sent-101, score-0.112]

60 In the comparative evaluation presented here, correlation coefficients between the automatic metrics and human-generated scores were computed at the system level (i.e., the units of analysis were system outputs), by considering all 86 available system outputs (see Table 1). [sent-102, score-0.974] [sent-104, score-0.117]

62 For computing human scores and AM-FM at the system level, average values of sentence-based scores for each system output were considered. [sent-105, score-0.125]
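
The system-level correlation computation described above can be sketched in a few lines (illustrative names; Pearson's r is taken from scipy):

```python
from collections import defaultdict
from scipy.stats import pearsonr

def system_level_correlation(sentence_scores):
    """sentence_scores: iterable of (system_id, metric_score, human_score)
    tuples, one per human-evaluated sentence.  Scores are averaged per system
    output and Pearson's r is computed over the resulting system-level pairs."""
    by_system = defaultdict(lambda: ([], []))
    for system_id, metric, human in sentence_scores:
        by_system[system_id][0].append(metric)
        by_system[system_id][1].append(human)
    metric_avgs = [sum(m) / len(m) for m, _ in by_system.values()]
    human_avgs = [sum(h) / len(h) for _, h in by_system.values()]
    return pearsonr(metric_avgs, human_avgs)   # (r, p-value)
```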

63 Table 2 presents the Pearson's correlation coefficients computed between the automatic metrics (BLEU, NIST, Meteor and our proposed AM-FM) and the human-generated scores (adequacy, fluency and the harmonic mean of both). [sent-106, score-1.063]

64 All correlation coefficients presented in the table are statistically significant with p<0.01, where p is the probability of obtaining the same correlation coefficient by chance with the same number of samples (86). [sent-109, score-0.419] [sent-110, score-0.175]

66 Recall that our proposed AM-FM metric is not using reference translations for assessing translation quality, while the other three metrics are. [sent-115, score-0.932]

67 In a similar exercise, the correlation coefficients were also computed at the sentence level (i.e., using individual sentence translations as the units of analysis). [sent-116, score-0.57]

68 As no development dataset was available for this particular task, a subset of the same evaluation dataset had to be used. [sent-120, score-0.472]

69 Again, all correlation coefficients presented in the table are statistically significant with p<0.01. [sent-122, score-0.419]

70 Table 3: Pearson's correlation coefficients (computed at the sentence level) between automatic metrics and human-generated scores. As seen from the table, in this case, BLEU and Meteor are the metrics exhibiting the largest correlation coefficients, followed by AM-FM and NIST. [sent-127, score-1.178]

71 4.2 Reproducing Rankings: In addition to adequacy and fluency, the WMT-07 dataset includes rankings of sentence translations. [sent-129, score-0.439]

72 To evaluate the usefulness of AM-FM and its components in a different evaluation setting, we also conducted a comparative evaluation on their capacity for predicting human-generated rankings. [sent-130, score-0.383]

73 As ranking evaluations allowed for ties among sentence translations, we restricted our analysis to evaluate whether automatic metrics were able to predict the best, the worst and both sentence translations for each of the 4,060 available rankings. [sent-131, score-0.512]

74 The number of items per ranking varies from 2 to 5, with an average of 4. [sent-132, score-0.083]
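
A sketch of this ranking-prediction check (illustrative code, not the authors' scripts): for each human ranking, the metric's top- and bottom-scored items are compared against the human best and worst.

```python
def ranking_prediction_rates(rankings):
    """rankings: list of lists of (metric_score, human_rank) pairs, one inner
    list per human-generated ranking (lower human_rank = better translation).
    Returns the fraction of rankings where the metric identifies the human
    best, the human worst, and both."""
    best_hits = worst_hits = both_hits = 0
    for items in rankings:
        metric_best = max(items, key=lambda x: x[0])
        metric_worst = min(items, key=lambda x: x[0])
        human_best = min(rank for _, rank in items)
        human_worst = max(rank for _, rank in items)
        hit_best = metric_best[1] == human_best
        hit_worst = metric_worst[1] == human_worst
        best_hits += hit_best
        worst_hits += hit_worst
        both_hits += hit_best and hit_worst
    n = len(rankings)
    return best_hits / n, worst_hits / n, both_hits / n
```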

75 Table 4 presents the results of the comparative evaluation on predicting rankings. [sent-134, score-0.188]

76 As seen from the table, Meteor is the automatic metric exhibiting the largest ranking prediction capability, followed by BLEU and NIST, while our proposed AM-FM metric exhibits the lowest ranking prediction capability. [sent-135, score-0.797]

77 However, it still performs well above random chance predictions, which, for the given average of 4 items per ranking, is about 25% for best and worst ranking predictions, and about 8.3% for predicting both. [sent-136, score-0.134]
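
As a quick back-of-the-envelope check of these chance levels (a derivation, not taken from the paper): with 4 items per ranking, a random guess identifies the best (or the worst) item with probability 1/4 = 25%, and identifies both correctly with probability 1/4 × 1/3 = 1/12 ≈ 8.3%.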

78 Again, recall that the AM-FM metric is not using reference translations, while the other three metrics are. [sent-138, score-0.462]

79 Also, it is worth mentioning that human rankings were conducted by looking at the reference translations and not at the source. [sent-139, score-1.009]

80 We discarded those rankings involving the translation system for which translation outputs were not available and that, consequently, only had one translation output left. [sent-140, score-0.24]

81 Table 4: Percentage of cases in which each automatic metric is able to predict the best, the worst, and both ranked sentence translations. Additionally, results for the individual components, AM and FM, are also presented in the table. [sent-146, score-0.42]

82 Notice how the AM component exhibits a better ranking capability than the FM component. [sent-147, score-0.29]

83 5 Conclusions and Future Work: This work presented AM-FM, a semantic framework for translation quality assessment. [sent-148, score-0.364]

84 Two comparative evaluations with standard metrics have been conducted over a large collection of human-generated scores involving different languages. [sent-149, score-0.58]

85 Although the obtained performance is below that of standard metrics, the proposed method has the main advantage of not requiring reference translations. [sent-150, score-0.143]

86 Notice that a monolingual version of AM-FM is also possible by using monolingual latent semantic indexing (Landauer et al.). [sent-151, score-0.331]

87 A detailed evaluation of a monolingual implementation of AM-FM can be found in Banchs and Li (2011). [sent-153, score-0.135]

88 As future research, we plan to study the impact of different dataset sizes and vector space model parameters for improving the performance of the AM component of the metric. [sent-154, score-0.196]

89 This will include the study of learning curves based on the amount of training data used, and the evaluation of different vector model construction strategies, such as removing stop-words and considering bigrams and word categories in addition to individual words. [sent-155, score-0.073]

90 Finally, we also plan to study alternative uses of AM-FM within the context of statistical machine translation as, for example, a metric for MERT optimization, or using the AM component alone as an additional feature for decoding, rescoring and/or confidence estimation. [sent-156, score-0.569]

91 Regression for sentence-level MT evaluation with pseudo references. [sent-160, score-0.073]

92 METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. [sent-169, score-0.538]

93 Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. [sent-180, score-0.353]

94 Sentence-level MT evaluation without reference translations: beyond language modeling. [sent-190, score-0.178]

95 Calculating the singular values and pseudo-inverse of a matrix. [sent-197, score-0.064]

96 Orange: a method for evaluating automatic evaluation metrics for machine translation. [sent-207, score-0.331]

97 BLEU: a method for automatic evaluation of machine translation. [sent-216, score-0.162]

98 The back-translation score: automatic MT evaluation at the sentences level without reference translations. [sent-225, score-0.268]

99 Improving the confidence of machine translation quality estimates. [sent-241, score-0.323]

100 The ARPA MT evaluation methodologies: evolution, lessons and future approaches. [sent-251, score-0.073]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fm', 0.246), ('coefficients', 0.244), ('adequacy', 0.231), ('fluency', 0.196), ('metric', 0.188), ('translation', 0.186), ('xb', 0.178), ('correlation', 0.175), ('metrics', 0.169), ('uabm', 0.147), ('meteor', 0.144), ('xa', 0.144), ('translations', 0.135), ('fourteen', 0.119), ('outputs', 0.117), ('component', 0.116), ('comparative', 0.115), ('assessing', 0.111), ('reference', 0.105), ('matrices', 0.1), ('uab', 0.097), ('indexing', 0.092), ('adjusted', 0.091), ('banchs', 0.089), ('rankings', 0.084), ('tp', 0.084), ('ranking', 0.083), ('nist', 0.081), ('bleu', 0.08), ('harmonic', 0.08), ('dataset', 0.08), ('dumais', 0.077), ('projection', 0.076), ('exhibiting', 0.076), ('albrecht', 0.073), ('golub', 0.073), ('humangenerated', 0.073), ('unitary', 0.073), ('vab', 0.073), ('xab', 0.073), ('evaluation', 0.073), ('pearson', 0.071), ('matrix', 0.071), ('weighting', 0.071), ('conducted', 0.07), ('computed', 0.07), ('accounts', 0.069), ('mt', 0.068), ('evaluations', 0.068), ('implementing', 0.067), ('landauer', 0.065), ('blatz', 0.065), ('singular', 0.064), ('framework', 0.063), ('monolingual', 0.062), ('infocomm', 0.06), ('specia', 0.06), ('yb', 0.06), ('latent', 0.058), ('quality', 0.058), ('semantic', 0.057), ('rafael', 0.056), ('automatic', 0.053), ('exhibits', 0.053), ('components', 0.052), ('maintaining', 0.051), ('worst', 0.051), ('mb', 0.051), ('parliament', 0.051), ('salton', 0.051), ('dimensions', 0.05), ('human', 0.049), ('exercise', 0.049), ('target', 0.048), ('involving', 0.047), ('coefficient', 0.046), ('ya', 0.046), ('european', 0.044), ('svd', 0.044), ('sentence', 0.044), ('notice', 0.044), ('banerjee', 0.043), ('projections', 0.043), ('confidence', 0.043), ('pure', 0.042), ('dimensionality', 0.041), ('haizhou', 0.04), ('gamon', 0.039), ('theoretically', 0.039), ('wmt', 0.039), ('concepts', 0.039), ('interval', 0.038), ('proposed', 0.038), ('scores', 0.038), ('capability', 0.038), ('vocabulary', 0.037), ('white', 0.037), ('level', 0.037), ('machine', 0.036), ('largest', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

2 0.27506888 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

Author: Chi-kiu Lo ; Dekai Wu

Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1

3 0.25295511 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng

Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with characterlevel metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding. 1

4 0.19121476 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

5 0.16818984 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrase-based translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

6 0.16603106 264 acl-2011-Reordering Metrics for MT

7 0.13101551 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

8 0.12997852 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

9 0.12960242 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

10 0.11435229 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

11 0.10486341 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages

12 0.10333616 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

13 0.10293627 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

14 0.10219058 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

15 0.10205292 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

16 0.1015885 313 acl-2011-Two Easy Improvements to Lexical Weighting

17 0.10010875 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

18 0.099790245 76 acl-2011-Comparative News Summarization Using Linear Programming

19 0.099730402 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

20 0.098638356 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.253), (1, -0.095), (2, 0.089), (3, 0.177), (4, 0.016), (5, 0.042), (6, 0.047), (7, 0.036), (8, 0.104), (9, -0.026), (10, -0.013), (11, -0.154), (12, 0.006), (13, -0.187), (14, -0.07), (15, 0.009), (16, -0.069), (17, -0.058), (18, 0.046), (19, -0.017), (20, 0.072), (21, 0.034), (22, 0.03), (23, 0.027), (24, -0.078), (25, -0.003), (26, -0.1), (27, -0.123), (28, 0.046), (29, 0.033), (30, -0.094), (31, -0.034), (32, -0.073), (33, 0.16), (34, 0.007), (35, 0.099), (36, 0.052), (37, 0.055), (38, 0.047), (39, -0.013), (40, 0.059), (41, -0.112), (42, 0.044), (43, 0.029), (44, -0.007), (45, -0.013), (46, 0.062), (47, 0.088), (48, 0.149), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9663533 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

2 0.8084814 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

3 0.80572706 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

Author: Chi-kiu Lo ; Dekai Wu

Abstract: We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the non-automatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER. 1

4 0.7684235 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Author: Jonathan H. Clark ; Chris Dyer ; Alon Lavie ; Noah A. Smith

Abstract: In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.

5 0.75760812 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng

Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with characterlevel metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding. 1

6 0.73399824 264 acl-2011-Reordering Metrics for MT

7 0.70743299 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

8 0.66164559 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

9 0.62221003 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

10 0.61689997 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

11 0.60431015 313 acl-2011-Two Easy Improvements to Lexical Weighting

12 0.58718657 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

13 0.5854398 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

14 0.57885647 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages

15 0.57868063 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

16 0.57036841 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

17 0.55951625 151 acl-2011-Hindi to Punjabi Machine Translation System

18 0.5497067 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

19 0.54523861 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

20 0.52122551 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.254), (5, 0.036), (17, 0.032), (26, 0.017), (31, 0.01), (37, 0.091), (39, 0.035), (41, 0.076), (55, 0.029), (59, 0.036), (72, 0.045), (91, 0.032), (96, 0.243)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96703202 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

Author: Kenneth Hild ; Umut Orhan ; Deniz Erdogmus ; Brian Roark ; Barry Oken ; Shalini Purwar ; Hooman Nezamfar ; Melanie Fried-Oken

Abstract: Event related potentials (ERP) corresponding to stimuli in electroencephalography (EEG) can be used to detect the intent of a person for brain computer interfaces (BCI). This paradigm is widely used to build letter-byletter text input systems using BCI. Nevertheless using a BCI-typewriter depending only on EEG responses will not be sufficiently accurate for single-trial operation in general, and existing systems utilize many-trial schemes to achieve accuracy at the cost of speed. Hence incorporation of a language model based prior or additional evidence is vital to improve accuracy and speed. In this demonstration we will present a BCI system for typing that integrates a stochastic language model with ERP classification to achieve speedups, via the rapid serial visual presentation (RSVP) paradigm.

2 0.92576766 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions

Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe

Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009). 1

3 0.86950207 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition

Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow

Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

same-paper 4 0.84801745 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions. 1

5 0.84286284 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

Author: Awais Athar

Abstract: Sentiment analysis of citations in scientific papers and articles is a new and interesting problem due to the many linguistic differences between scientific texts and other genres. In this paper, we focus on the problem of automatic identification of positive and negative sentiment polarity in citations to scientific papers. Using a newly constructed annotated citation sentiment corpus, we explore the effectiveness of existing and novel features, including n-grams, specialised science-specific lexical features, dependency relations, sentence splitting and negation features. Our results show that 3-grams and dependencies perform best in this task; they outperform the sentence splitting, science lexicon and negation based features.

6 0.83423364 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

7 0.7357589 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

8 0.73530412 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

9 0.73506641 220 acl-2011-Minimum Bayes-risk System Combination

10 0.73493344 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers

11 0.73369515 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations

12 0.7335366 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

13 0.73352933 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

14 0.73269975 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

15 0.73250395 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

16 0.73250365 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

17 0.73243129 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

18 0.73217362 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation

19 0.73211366 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

20 0.73203957 76 acl-2011-Comparative News Summarization Using Linear Programming