emnlp emnlp2010 emnlp2010-22 knowledge-graph by maker-knowledge-mining

22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs


Source: pdf

Author: Hideki Isozaki ; Tsutomu Hirao ; Kevin Duh ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Automatic evaluation of Machine Translation (MT) quality is essential to developing highquality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. It is well known that Japanese and English have completely different word orders, and special care must be paid to word order in translation. Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. For instance, SMT-based Japanese-to-English translators tend to translate ‘A because B’ as ‘B because A.’ Thus, word order is the most important problem for distant language translation. However, conventional evaluation metrics do not significantly penalize such word order mistakes. Therefore, locally optimizing these metrics leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. [sent-6, score-0.268]

2 However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e. [sent-7, score-0.537]

3 Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. [sent-11, score-0.094]

4 ’ Thus, word order is the most important problem for distant language translation. [sent-13, score-0.14]

5 However, conventional evaluation metrics do not significantly penalize such word order mistakes. [sent-14, score-0.305]

6 Therefore, locally optimizing these metrics leads to inadequate translations. [sent-15, score-0.214]

7 In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. [sent-16, score-0.508]

8 Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics. [sent-17, score-0.153]

9 1 Introduction Automatic evaluation of machine translation (MT) quality is essential to developing high-quality machine translation systems because human evaluation is time consuming, expensive, and irreproducible. [sent-18, score-0.499]

10 If we have a perfect automatic evaluation metric, we can tune our translation system for the metric. [sent-19, score-0.276]

11 , 2002a) showed high correlation with human judgments and is still used as the de facto standard automatic evaluation metric. [sent-22, score-0.374]

12 In these studies, Pearson’s correlation coefficient and Spearman’s rank correlation ρ with human evaluation scores are used to measure how closely an automatic evaluation method correlates with human evaluation. [sent-29, score-0.783]

13 This evaluation of automatic evaluation methods is called meta-evaluation. [sent-30, score-0.135]

14 In human evaluation, people judge the adequacy and the fluency of each translation. [sent-31, score-0.294]

15 (2008) In JE translation, most Statistical Machine Translation (SMT) systems translate the Japanese sentence (J0) kare wa sono hon wo yonda node sekaishi ni kyoumi ga atta which means [sent-36, score-0.196]

16 Consequently, the global word order is essential for translation between distant language pairs, and wrong word order can easily lead to misunderstanding or incomprehensibility. [sent-46, score-0.388]

17 Another example is: Japanese: bobu wa meari ni yubiwa wo kau tameni, jon no mise ni itta. [sent-57, score-0.234]

18 945 In this way, this SMT service usually gives incomprehensible or misleading translations, and thus people prefer RBMT services. [sent-60, score-0.137]

19 Other SMT systems also tend to make similar word order mistakes, and special care should be paid to the translation between distant language pairs such as Japanese and English. [sent-61, score-0.367]

20 From this point of view, conventional automatic evaluation metrics of translation quality disregard word order mistakes too much. [sent-63, score-0.617]

21 Therefore, BLEU gives a very good score to this inadequate translation because it checks only n-grams and does not regard global word order. [sent-70, score-0.23]

22 Since (R0) and (H0) look similar in terms of fluency, adequacy is more important than fluency in the translation between distant language pairs. [sent-71, score-0.583]

23 Therefore, these standard metrics are not optimal for evaluating translation between distant language pairs. [sent-78, score-0.496]

24 In this paper, we propose an alternative automatic evaluation metric appropriate for distant language pairs. [sent-79, score-0.293]

25 We use them to compare the word ranks in the reference with those in the hypothesis. [sent-81, score-0.107]

26 There are two popular rank correlation coefficients: Spearman’s ρ and Kendall’s τ (Kendall, 1975). [sent-82, score-0.3]

27 (2010), we used Kendall’s τ to measure the effectiveness of our Head Finalization rule as a preprocessor for English-to-Japanese translation, but we measured the quality of translation by using conventional metrics. [sent-84, score-0.276]

28 It is not clear how well τ works as an automatic evaluation metric of translation quality. [sent-85, score-0.339]

29 As we discuss later, τ considers only the direction of the rank change, whereas ρ considers the distance of the change. [sent-87, score-0.106]

30 The first objective of this paper is to examine which is the better metric for distant language pairs. [sent-88, score-0.203]

31 The second objective is to find improvements of these rank correlation-metrics. [sent-89, score-0.106]

32 Since Pearson’s correlation metric assumes linearity, nonlinear monotonic functions can change its score. [sent-108, score-0.301]

33 On the other hand, Spearman’s ρ and Kendall’s τ use ranks instead of raw evaluation scores, and simple application of monotonic functions cannot change them (use of other operations such as averaging sentence scores can change them). [sent-109, score-0.185]

34 1 Word alignment for rank correlations We have to determine word ranks to obtain rank correlation coefficients. [sent-111, score-0.458]

35 Suppose we have: (R1) John hit Bob yesterday (H1) Bob hit John yesterday. The 1st word “John” in R1 becomes the 3rd word in H1. [sent-112, score-0.228]

36 Spearman’s ρ as follows: “John” moved by d1 = 2 words, “hit” moved by d2 = 0 words, “Bob” moved by d3 = 2 words, and “yesterday” moved by d4 = 0 words. [sent-123, score-0.236]
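
A minimal sketch of this computation (illustrative code, not the paper's; it assumes the standard formula ρ = 1 − 6 Σ d_i² / (n(n² − 1)) over the displacement distances d_i described above):

```python
def spearman_rho(hypothesis_positions):
    """Spearman's rho from word displacements.

    hypothesis_positions[i] is the (0-based) position that the i-th reference
    word occupies in the hypothesis.  For (R1) "John hit Bob yesterday" and
    (H1) "Bob hit John yesterday" this is [2, 1, 0, 3]: "John" moved by 2,
    "hit" by 0, "Bob" by 2, and "yesterday" by 0.
    """
    n = len(hypothesis_positions)
    d_squared = sum((pos - i) ** 2 for i, pos in enumerate(hypothesis_positions))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

print(spearman_rho([2, 1, 0, 3]))  # 1 - 6*8/(4*15) = 0.2
```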

37 We have to consider the limitation of the rank correlation metrics. [sent-131, score-0.3]

38 Suppose we have: (R2) the boy read the book (H2) the book was read by the boy. By removing non-aligned words by one-to-one correspondence, we get: (R3) boy read book (H3) book read boy. Thus, we lost “the.” [sent-140, score-1.196]

39 (R5) he1 was2 interested3 in4 world5 history6 because7 he8 read9 the10 book11 (H5) he8 read9 the10 book11 because7 he1 was2 interested3 in4 world5 history6. H5’s word order is [8, 9, 10, 11, 7, 1, 2, 3, 4, 5, 6]. [sent-147, score-0.272]

40 The number of increasing pairs is: 4C2 = 6 pairs in [8, 9, 10, 11] and 6C2 = 15 pairs in [1, 2, 3, 4, 5, 6]. [sent-148, score-0.123]

41 Therefore, both Spearman’s ρ and Kendall’s τ give very bad scores to the misleading translation H0. [sent-154, score-0.285]
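
For concreteness, a small illustrative function (not the paper's code) that computes Kendall's τ over such an integer word-order list by counting increasing pairs, applied to H5's word order above:

```python
from itertools import combinations

def kendall_tau(word_order):
    """Kendall's tau over an integer word-order list: the fraction of
    increasing (concordant) position pairs, rescaled to [-1, 1]."""
    n = len(word_order)
    increasing = sum(1 for a, b in combinations(word_order, 2) if a < b)
    total = n * (n - 1) // 2
    return 2.0 * increasing / total - 1.0

# 6 + 15 = 21 increasing pairs out of 11*10/2 = 55 pairs in total:
print(kendall_tau([8, 9, 10, 11, 7, 1, 2, 3, 4, 5, 6]))  # 2*21/55 - 1 ≈ -0.24
```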

42 This fact implies they are much better metrics than BLEU, which gave a good score to it. [sent-155, score-0.222]

43 Since some hypothesis words do not have corresponding reference words, the output integer list worder is sometimes shorter than the evaluated sentence. [sent-159, score-0.117]

44 For each word hi in h: if hi appears only once each in h and r, append j (its position in r) to worder. [sent-170, score-0.253]

45 Figure 1: Word alignment algorithm for rank correlation. [sent-180, score-0.3]
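
A rough sketch of the unique-word case described above (the paper's full Figure 1 algorithm also disambiguates repeated words using surrounding context, which this illustration omits):

```python
def align_unique_words(hypothesis, reference):
    """Build `worder`: for each hypothesis word that occurs exactly once in
    both the hypothesis and the reference, record the (1-based) position of
    the matching reference word.  Repeated or unmatched words are skipped,
    so the list can be shorter than the hypothesis sentence."""
    worder = []
    for h_word in hypothesis:
        if hypothesis.count(h_word) == 1 and reference.count(h_word) == 1:
            worder.append(reference.index(h_word) + 1)
    return worder

ref = "the boy read the book".split()
hyp = "the book was read by the boy".split()
print(align_unique_words(hyp, ref))  # [5, 3, 2]: "book", "read", "boy"; "the" is lost
```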

46 2 Word order metrics and meta-evaluation metrics These rank correlation metrics sometimes have negative values. [sent-181, score-0.81]

47 These metrics are defined only when the number of aligned words is two or more. [sent-188, score-0.17]

48 Consequently, these normalized metrics have the same range [0, 1]. [sent-190, score-0.226]
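
A sketch under the assumption that the normalization is the affine rescaling from [−1, 1] to [0, 1], i.e. NKT = (τ + 1)/2 and NSR = (ρ + 1)/2:

```python
def nkt(tau):
    """Normalized Kendall's tau (assumed to be (tau + 1) / 2, mapping [-1, 1] to [0, 1])."""
    return (tau + 1.0) / 2.0

def nsr(rho):
    """Normalized Spearman's rho (assumed to be (rho + 1) / 2, mapping [-1, 1] to [0, 1])."""
    return (rho + 1.0) / 2.0

print(nkt(-13 / 55))  # Kendall's tau of the H5 example above: ≈ 0.38
```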

49 In order to avoid confusion, we use these abbreviations (NKT and NSR) when we use rank correlations as word order metrics, because these correlation metrics are also used in the machine translation community for meta-evaluation. [sent-191, score-0.656]

50 For metaevaluation, we use Spearman’s ρ and Pearson’s correlation coefficient and call them “Spearman” and “Pearson,” respectively. [sent-192, score-0.23]

51 3 Overestimation problem Since we measure the rank correlation of only corresponding words, these metrics will overestimate the correlation. [sent-194, score-0.47]

52 For instance, a hypothesis sentence might have only two corresponding words. (Figure 2: Scatter plots of normalized average adequacy with brevity penalty (left) and precision (right).) [sent-195, score-0.369]

53 Solving this overestimation problem is the second objective of this paper. [sent-201, score-0.103]

54 We can combine the above word order metrics with BP, e. [sent-203, score-0.17]

55 In the NTCIR-7 data, three human judges gave five-point scores (1, 2, 3, 4, 5) for “adequacy” and “fluency” of each translated sentence. [sent-211, score-0.133]

56 For each translated sentence, we averaged three judges’ adequacy scores and normalized this average x by (x −1)/4. [sent-213, score-0.292]

57 Therefore, we have to consider other modifiers for this overestimation problem. [sent-219, score-0.103]

58 We can use other common metrics such as precision, recall, and F-measure to reduce the overestimation of NSR and NKT. [sent-220, score-0.273]

59 Our preliminary experiments with NTCIR-7 data showed that precision correlated best with adequacy among these three metrics (P, R, and Fβ=1). [sent-228, score-0.362]

60 The right graph of Figure 2 shows a scatter plot of precision and normalized average adequacy. [sent-231, score-0.104]

61 The graph shows that precision has more correlation with adequacy than BP. [sent-232, score-0.386]

62 We can observe that sentences with very small P values usually obtain very low adequacy scores but those with mediocre P values often obtain good adequacy scores. [sent-233, score-0.428]
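
A hypothetical sketch of modifying the word-order score with precision along these lines (both the clipped unigram precision and the 1/4 exponent below are illustrative assumptions, not necessarily the paper's exact formulation):

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision: hypothesis words matched in the reference,
    divided by hypothesis length (a rough stand-in for the paper's P)."""
    ref_counts = Counter(reference)
    matched = 0
    for w in hypothesis:
        if ref_counts[w] > 0:
            matched += 1
            ref_counts[w] -= 1
    return matched / len(hypothesis) if hypothesis else 0.0

def order_metric_with_precision(nsr_score, p, alpha=0.25):
    """Word-order metric modified with precision, e.g. NSR * P**(1/4)."""
    return nsr_score * (p ** alpha)
```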

63 1 Meta-evaluation with NTCIR-7 data In order to compare automatic translation evaluation methods, we use submissions to the NTCIR-7 Patent Translation (PATMT) task (Fujii et al. [sent-242, score-0.276]

64 ’ For automatic evaluation, we used a single reference sentence for each of these 100 manually evaluated sentences. [sent-251, score-0.1]

65 For this meta-evaluation, we measured the corpus-level correlation between the human evaluation scores and the automatic evaluation scores. [sent-254, score-0.41]
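
A minimal sketch of this meta-evaluation step, assuming SciPy is available; variable names are illustrative, and the adequacy averages are normalized by (x − 1)/4 as described above:

```python
from scipy.stats import pearsonr, spearmanr

def meta_evaluate(metric_scores, adequacy_averages):
    """Correlate automatic metric scores with human adequacy judgments.

    adequacy_averages are 1-5 averages over the three judges; they are
    normalized to [0, 1] by (x - 1) / 4 before correlating.
    Returns (Pearson, Spearman) correlation coefficients.
    """
    normalized = [(x - 1.0) / 4.0 for x in adequacy_averages]
    pearson, _ = pearsonr(metric_scores, normalized)
    spearman, _ = spearmanr(metric_scores, normalized)
    return pearson, spearman
```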

66 For existing metrics such as BLEU, we followed their definitions for corpus-level evaluation instead of simple averages of sentence-level scores. [sent-256, score-0.215]

67 2 for METEOR (Banerjee and Lavie, 2005). Meta-evaluation with WMT-07 data: We developed our metric mainly for automatic evaluation of translation quality for distant language pairs such as Japanese-English, but we also want to know how well the metric works for similar language pairs. [sent-265, score-0.583]

68 (2007) tried different human evaluation methods and showed detailed evaluation scores. [sent-269, score-0.127]

69 Error metrics, WER, PER, and TER, have negative correlation coefficients, but we did not show their minus signs here. [sent-277, score-0.194]

70 Both NSR-based metrics and NKT-based metrics perform better than conventional metrics for this NTCIR PATMT JE translation data. [sent-278, score-0.786]

71 Thus, we think Spearman is a better metaevaluation metric than Pearson. [sent-293, score-0.125]

72 Table 1: NTCIR-7 Meta-evaluation: correlation with human judgments (Spm = Spearman, Prs = Pearson).

73 The right part of Table 1 shows correlation with fluency, but adequacy is more important, because our motivation is to provide a metric that is useful to reduce incomprehensible or misleading outputs of MT systems. [sent-295, score-0.593]

74 Again, the correlation-based metrics gave better scores than conventional metrics, and BP performed badly. [sent-296, score-0.356]

75 NSR-based metrics proved to be as good as NKT-based metrics. [sent-297, score-0.17]

76 Meta-evaluation scores of the de facto standard BLEU are much lower than those of other metrics. [sent-298, score-0.097]

77 (2007) have performed different human evaluation methods for different language pairs and different corpora. [sent-311, score-0.123]

78 (The “constituent” methods obtained the best inter-annotator agreement, but these methods focus on local translation quality and have nothing to do with global word order, which we are discussing here. [sent-316, score-0.186]

79 ) Table 2 shows that our metrics designed for distant language pairs are comparable to conventional methods even for similar language pairs, but ROUGE-L and ROUGE-S performed better than ours for French News Corpus and German Europarl. [sent-317, score-0.441]

80 We can extend our metric by Fβ, weighted harmonic mean of P and R, or any other interpolation, but the introduction of new parameters into our metric makes it difficult to control. [sent-321, score-0.126]

81 Table 3 shows the performance of these metrics for NTCIR-7 data. [sent-329, score-0.17]

82 Pearson’s correlation coefficient with adequacy was improved by 1 − √(1 − NKT), but other scores were degraded in this experiment. [sent-330, score-0.466]

83 (2010)’s method comes from the fact that we used Japanese-English translation data and Spearman’s correlation for meta-evaluation, whereas they used Chinese-English translation data and only Pearson’s correlation for meta-evaluation. [sent-332, score-0.76]

84 In spite of these differences, the two groups independently recognized the usefulness of rank correlations for automatic evaluation of translation quality for distant language pairs. [sent-340, score-0.522]

85 In their WMT-2010 paper (Birch and Osborne, 2010), they multiplied NKT with the brevity penalty and interpolated it with BLEU for the WMT-2010 shared task. [sent-341, score-0.121]

86 This fact implies that incomprehensible or misleading word order mistakes are rare in translation among European languages. [sent-342, score-0.369]

87 6 Conclusions When Statistical Machine Translation is applied to distant language pairs such as Japanese and English, word order becomes an important problem. [sent-343, score-0.181]

88 SMT systems often fail to find an appropriate translation because of a large search space. [sent-344, score-0.186]

89 Therefore, they often output misleading or incomprehensible sentences such as “A because B” vs. “B because A.” [sent-345, score-0.137]

90 To penalize such inadequate translations, we presented an automatic evaluation method based on rank correlation. [sent-347, score-0.24]

91 First, which correlation coefficient should we use: Spearman’s ρ or Kendall’s τ? [sent-349, score-0.23]

92 Second, how should we solve the overestimation problem caused by the nature of one-to-one correspondence? [sent-350, score-0.103]

93 We answered these questions through our experiments using the NTCIR-7 PATMT JE translation data. [sent-351, score-0.186]

94 For similar language pairs, our method was comparable to conventional evaluation methods. [sent-356, score-0.135]

95 Meteor: An automatic metric for MT evaluation with improved correlation with human judgements. [sent-360, score-0.384]

96 Re-evaluating the role of Bleu in machine translation research. [sent-373, score-0.186]

97 Automatic evaluation of machine translation based on recursive acquisition of an intuitive common parts continuum. [sent-387, score-0.231]

98 Metaevaluation of automatic evaluation methods for machine translation using patent translation data in NTCIR-7. [sent-391, score-0.585]

99 Overview of the patent translation task at the NTCIR-7 workshop. [sent-395, score-0.309]

100 A study of translation edit rate with targeted human annotation. [sent-429, score-0.223]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nkt', 0.349), ('spearman', 0.274), ('kendall', 0.264), ('nsr', 0.205), ('correlation', 0.194), ('adequacy', 0.192), ('translation', 0.186), ('pearson', 0.181), ('bp', 0.178), ('metrics', 0.17), ('bleu', 0.15), ('distant', 0.14), ('patent', 0.123), ('patmt', 0.123), ('je', 0.123), ('smt', 0.122), ('japanese', 0.121), ('bob', 0.117), ('hi', 0.107), ('rank', 0.106), ('book', 0.103), ('boy', 0.103), ('overestimation', 0.103), ('mt', 0.098), ('read', 0.093), ('conventional', 0.09), ('fujii', 0.088), ('incomprehensible', 0.082), ('fluency', 0.065), ('metric', 0.063), ('penalty', 0.063), ('metaevaluation', 0.062), ('misunderstanding', 0.062), ('ntcir', 0.062), ('worde', 0.062), ('worder', 0.062), ('hit', 0.059), ('moved', 0.059), ('brevity', 0.058), ('wa', 0.058), ('normalized', 0.056), ('reference', 0.055), ('coefficients', 0.055), ('wo', 0.055), ('misleading', 0.055), ('yesterday', 0.055), ('facto', 0.053), ('sudoh', 0.053), ('gave', 0.052), ('ranks', 0.052), ('scatter', 0.048), ('correspondence', 0.046), ('mistakes', 0.046), ('automatic', 0.045), ('evaluation', 0.045), ('birch', 0.045), ('john', 0.044), ('scores', 0.044), ('wer', 0.044), ('inadequate', 0.044), ('isozaki', 0.044), ('monotonic', 0.044), ('translate', 0.042), ('pairs', 0.041), ('atsushi', 0.041), ('bobu', 0.041), ('denoual', 0.041), ('eoatchhe', 0.041), ('hihi', 0.041), ('hon', 0.041), ('meari', 0.041), ('mikio', 0.041), ('nktp', 0.041), ('nsrp', 0.041), ('ntt', 0.041), ('rbmt', 0.041), ('rinw', 0.041), ('takehito', 0.041), ('square', 0.04), ('append', 0.039), ('jon', 0.039), ('ter', 0.038), ('human', 0.037), ('coefficient', 0.036), ('mary', 0.036), ('nist', 0.036), ('bought', 0.035), ('disregard', 0.035), ('finalization', 0.035), ('hiroshi', 0.035), ('katsuhito', 0.035), ('killed', 0.035), ('linearity', 0.035), ('metricsmatr', 0.035), ('sov', 0.035), ('papineni', 0.034), ('miles', 0.034), ('osborne', 0.034), ('snover', 0.034), ('translations', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

Author: Hideki Isozaki ; Tsutomu Hirao ; Kevin Duh ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Automatic evaluation of Machine Translation (MT) quality is essential to developing highquality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. It is well known that Japanese and English have completely different word orders, and special care must be paid to word order in translation. Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. For instance, SMT-based Japanese-to-English translators tend to translate ‘A because B’ as ‘B because A.’ Thus, word order is the most important problem for distant language translation. However, conventional evaluation metrics do not significantly penalize such word order mistakes. Therefore, locally optimizing these metrics leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.

2 0.24496259 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization

Author: Dominic Espinosa ; Rajakrishnan Rajkumar ; Michael White ; Shoshana Berleant

Abstract: We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations. These outputs were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics, and analyzed correlations obtained between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further result of our research is a corpus of post-edited realizations, which will be made available to the research community.

3 0.17048401 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

4 0.1596207 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.
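The pivot-language idea summarized above can be illustrated with a small sketch: if a candidate paraphrase and its reference are both mapped into bags of pivot-language N-grams, adequacy can be approximated by the overlap of those bags rather than by surface word overlap. Everything below (the word-level lexicon, the function names, the toy sentences) is invented for illustration and is not PEM's actual phrase-table machinery.

```python
from collections import Counter

def pivot_ngrams(tokens, lexicon, n=2):
    """Map tokens into a pivot language via a toy word-level lexicon,
    then collect pivot n-grams as a multiset."""
    pivot = [lexicon.get(tok, tok) for tok in tokens]  # unknown words pass through
    grams = [tuple(pivot[i:i + n]) for i in range(len(pivot) - n + 1)]
    return Counter(grams)

def pivot_overlap_f1(candidate, reference, lexicon, n=2):
    """Harmonic mean of pivot n-gram precision and recall: a rough adequacy
    proxy that is insensitive to lexical choice in the original language."""
    c = pivot_ngrams(candidate, lexicon, n)
    r = pivot_ngrams(reference, lexicon, n)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Invented English-to-French word lexicon, purely for illustration.
lexicon = {"car": "voiture", "automobile": "voiture", "red": "rouge", "the": "la"}
print(pivot_overlap_f1("the red car".split(), "the red automobile".split(), lexicon))
```

Although "car" and "automobile" share no surface form, both map to the same pivot token, so the overlap score treats the two strings as fully adequate paraphrases of each other.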

5 0.12871756 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

6 0.12617785 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

7 0.12187896 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

8 0.12051705 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

9 0.11863945 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.11013303 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

11 0.10889747 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

12 0.10830522 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

13 0.092793755 39 emnlp-2010-EMNLP 044

14 0.092115439 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

15 0.085766539 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

16 0.069080092 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

17 0.064861402 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

18 0.059345983 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

19 0.058464974 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

20 0.05475362 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.237), (1, -0.245), (2, -0.123), (3, 0.041), (4, -0.062), (5, 0.112), (6, -0.054), (7, -0.004), (8, 0.16), (9, -0.075), (10, -0.043), (11, -0.023), (12, 0.045), (13, 0.025), (14, -0.032), (15, -0.039), (16, 0.096), (17, -0.084), (18, -0.076), (19, -0.304), (20, -0.076), (21, -0.097), (22, 0.139), (23, 0.004), (24, 0.151), (25, -0.15), (26, 0.071), (27, 0.183), (28, -0.138), (29, -0.002), (30, 0.02), (31, 0.022), (32, -0.023), (33, 0.096), (34, 0.044), (35, -0.064), (36, -0.008), (37, 0.014), (38, 0.061), (39, -0.039), (40, -0.056), (41, -0.151), (42, 0.014), (43, -0.026), (44, 0.065), (45, 0.027), (46, -0.05), (47, -0.007), (48, -0.008), (49, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94426179 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

Author: Hideki Isozaki ; Tsutomu Hirao ; Kevin Duh ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Automatic evaluation of Machine Translation (MT) quality is essential to developing highquality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. It is well known that Japanese and English have completely different word orders, and special care must be paid to word order in translation. Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. For instance, SMT-based Japanese-to-English translators tend to translate ‘A because B’ as ‘B because A.’ Thus, word order is the most important problem for distant language translation. However, conventional evaluation metrics do not significantly penalize such word order mistakes. Therefore, locally optimizing these metrics leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.

2 0.91286939 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization

Author: Dominic Espinosa ; Rajakrishnan Rajkumar ; Michael White ; Shoshana Berleant

Abstract: We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations. These outputs were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics, and analyzed correlations obtained between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant systemlevel differences, though none are correct in all cases. We conclude with a discussion ofthe implications for the utility of such metrics in evaluating generation in the presence of variation. A further result of our research is a corpus of post-edited realizations, which will be made available to the research community. 1 Introduction and Background In building surface-realization systems for natural language generation, there is a need for reliable automated metrics to evaluate the output. Unlike in parsing, where there is usually a single goldstandard parse for a sentence, in surface realization there are usually many grammatically-acceptable ways to express the same concept. This parallels the task of evaluating machine-translation (MT) systems: for a given segment in the source language, 564 there are usually several acceptable translations into the target language. As human evaluation of translation quality is time-consuming and expensive, a number of automated metrics have been developed to evaluate the quality of MT outputs. In this study, we investigate whether the metrics developed for MT evaluation tasks can be used to reliably evaluate the outputs of surface realizers, and which of these metrics are best suited to this task. A number of surface realizers have been developed using the Penn Treebank (PTB), and BLEU scores are often reported in the evaluations of these systems. But how useful is BLEU in this context? The original BLEU study (Papineni et al., 2001) scored MT outputs, which are of generally lower quality than grammar-based surface realizations. Furthermore, even for MT systems, the usefulness of BLEU has been called into question (Callison-Burch et al., 2006). BLEU is designed to work with multiple reference sentences, but in treebank realization, there is only a single reference sentence available for comparison. A few other studies have investigated the use of such metrics in evaluating the output of NLG systems, notably (Reiter and Belz, 2009) and (Stent et al., 2005). The former examined the performance of BLEU and ROUGE with computer-generated weather reports, finding a moderate correlation with human fluency judgments. The latter study applied several MT metrics to paraphrase data from Barzilay and Lee’s corpus-based system (Barzilay and Lee, 2003), and found moderate correlations with human adequacy judgments, but little correlation with fluency judgments. 
Cahill (2009) examined the performance of six MT metrics (including BLEU) in evaluating the output of a LFG-based surface realizer for ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e1 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastoucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag eusis 5t6ic4s–574, German, also finding only weak correlations with the human judgments. To study the usefulness of evaluation metrics such as BLEU on the output of grammar-based surface realizers used with the PTB, we assembled a corpus of surface realizations from three different realizers operating on Section 00 of the PTB. Two human judges evaluated the adequacy and fluency of each of the realizations with respect to the reference sentence. The realizations were then scored with a number of automated evaluation metrics developed for machine translation. In order to investigate the correlation of targeted metrics with human evaluations, and gather other acceptable realizations for future evaluations, the judges manually repaired each unacceptable realization during the rating task. In contrast to previous NLG meta-evaluations, we found that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best. However, when looking at statistically significant system-level differences in human judgments, we found that some of the metrics get some of the rankings correct, but none get them all correct, with different metrics making different ranking errors. This suggests that multiple metrics should be routinely consulted when comparing realizer systems. Overall, our methodology is similar to that of previous MT meta-evaluations, in that we collected human judgments of system outputs, and compared these scores with those assigned by automatic metrics. A recent alternative approach to paraphrase evaluation is ParaMetric (Callison-Burch et al., 2008); however, it requires a corpus of annotated (aligned) paraphrases (which does not yet exist for PTB data), and is arguably focused more on paraphrase analysis than paraphrase generation. The plan of the paper is as follows: Section 2 discusses the preparation of the corpus of surface realizations. Section 3 describes the human evaluation task and the automated metrics applied. Sections 4 and 5 present and discuss the results of these evaluations. We conclude with some general observations about automatic evaluation of surface realizers, and some directions for further research. 565 2 Data Preparation We collected realizations of the sentences in Section 00 of the WSJ corpus from the following three sources: 1. OpenCCG, a CCG-based chart realizer (White, 2006) 2. The XLE Generator, a LFG-based system developed by Xerox PARC (Crouch et al., 2008) 3. WordNet synonym substitutions, to investigate how differences in lexical choice compare to grammar-based variation.1 Although all three systems used Section 00 of the PTB, they were applied with various parameters (e.g., language models, multiple-output versus single-output) and on different input structures. Accordingly, our study does not compare OpenCCG to XLE, or either of these to the WordNet system. 2.1 OpenCCG realizations OpenCCG is an open source parsing/realization library with multimodal extensions to CCG (Baldridge, 2002). The OpenCCG chart realizer takes logical forms as input and produces strings by combining signs for lexical items. Alternative realizations are scored using integrated n-gram and perceptron models. 
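The n-gram ranking step described above can be sketched as follows; this is not the OpenCCG implementation (which interpolates several language models with a perceptron), just a toy add-one-smoothed trigram model, with illustrative names, used to order alternative realizations of the same input:

```python
import math
from collections import Counter

def train_trigram(corpus_sents):
    """Add-one-smoothed trigram counts estimated from a toy corpus."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for sent in corpus_sents:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        vocab.update(toks[2:])
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi, len(vocab)

def logprob(sent, model):
    """Log probability of a sentence under the smoothed trigram model."""
    tri, bi, v = model
    toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
    lp = 0.0
    for i in range(2, len(toks)):
        num = tri[(toks[i - 2], toks[i - 1], toks[i])] + 1
        den = bi[(toks[i - 2], toks[i - 1])] + v
        lp += math.log(num / den)
    return lp

def rank_realizations(candidates, model):
    """Return candidate realizations sorted best-first by language-model score."""
    return sorted(candidates, key=lambda c: logprob(c, model), reverse=True)

model = train_trigram(["the executives gave the chefs a standing ovation"])
print(rank_realizations(["the executives gave a standing ovation to the chefs",
                         "the executives gave the chefs a standing ovation"], model))
```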
For robustness, fragments are greedily assembled when necessary. Realizations were generated from 1,895 gold standard logical forms, created by constrained parsing of development-section derivations. The following OpenCCG models (which differ essentially in the way the output is ranked) were used: 1. Baseline 1: Output ranked by a trigram word model 2. Baseline 2: Output ranked using three language models (3-gram words 3-gram words with named entity class replacement factored language model of words, POS tags and CCG supertags) + + 1Not strictly surface realizations, since they do not involve an abstract input specification, but for simplicity we refer to them as realizations throughout. 3. Baseline 3: Perceptron with syntax features and the three LMs mentioned above 4. Perceptron full-model: n-best realizations ranked using perceptron with syntax features and the three n-gram models, as well as discriminative n-grams The perceptron model was trained on sections 0221 of the CCGbank, while a grammar extracted from section 00-21 was used for realization. In addition, oracle supertags were inserted into the chart during realization. The purpose of such a non-blind testing strategy was to evaluate the quality of the output produced by the statistical ranking models in isolation, rather than focusing on grammar coverage, and avoid the problems associated with lexical smoothing, i.e. lexical categories in the development section not being present in the training section. To enrich the variation in the generated realizations, dative-alternation was enforced during realization by ensuring alternate lexical categories of the verb in question, as in the following example: (1) the executives gave [the chefs] [a standing ovation] (2) the executives gave [a standing ovation] [to the chefs] 2.2 XLE realizations The corpus of realizations generated by the XLE system contained 42,527 surface realizations of approximately 1,421 section 00 sentences (an average of 30 per sentence), initially unranked. The LFG f-structures used as input to the XLE generator were derived from automatic parses, as described in (Riezler et al., 2002). The realizations were first tokenized using Penn Treebank conventions, then ranked using perplexities calculated from the same trigram word model used with OpenCCG. For each sentence, the top 4 realizations were selected. The XLE generator provides an interesting point of comparison to OpenCCG as it uses a manuallydeveloped grammar with inputs that are less abstract but potentially noisier, as they are derived from automatic parses rather than gold-standard ones. 566 2.3 WordNet synonymizer To produce an additional source of variation, the nouns and verbs of the sentences in section 00 of the PTB were replaced with all of their WordNet synonyms. Verb forms were generated using verb stems, part-of-speech tags, and the morphg tool.2 These substituted outputs were then filtered using the n-gram data which Google Inc. has made available.3 Those without any 5-gram matches centered on the substituted word (or 3-gram matches, in the case of short sentences) were eliminated. 3 Evaluation From the data sources described in the previous sec- tion, a corpus of realizations to be evaluated by the human judges was constructed by randomly choosing 305 sentences from section 00, then selecting surface realizations of these sentences using the following algorithm: 1. Add OpenCCG’s best-scored realization. 2. Add other OpenCCG realizations until all four models are represented, to a maximum of 4. 
3. Add up to 4 realizations from either the XLE system or the WordNet pool, chosen randomly. The intent was to give reasonable coverage of all realizer systems discussed in Section 2 without overloading the human judges. “System” here means any instantiation that emits surface realizations, including various configurations of OpenCCG (using different language models or ranking systems), and these can be multiple-output, such as an n-best list, or single-output (best-only, worst-only, etc.). Accordingly, more realizations were selected from the OpenCCG realizer because 5 different systems were being represented. Realizations were chosen randomly, rather than according to sentence types or other criteria, in order to produce a representative sample of the corpus. In total, 2,114 realizations were selected for evaluation. 2http : //www. informatics . sussex. ac .uk/ re search/ groups / nlp / carro l /morph .html l 3http : //www . ldc . upenn .edu/Catalog/docs/ LDC2 0 0 6T 13 / readme .txt 3.1 Human judgments Two human judges evaluated each surface realization on two criteria: adequacy, which represents the extent to which the output conveys all and only the meaning of the reference sentence; and fluency, the extent to which it is grammatically acceptable. The realizations were presented to the judges in sets containing a reference sentence and the 1-8 outputs selected for that sentence. To aid in the evaluation of adequacy, one sentence each of leading and trailing context were displayed. Judges used the guidelines given in Figure 1, based on the scales developed by the NIST Machine Translation Evaluation Workshop. In addition to rating each realization on the two five-point scales, each judge also repaired each output which he or she did not judge to be fully adequate and fluent. An example is shown in Figure 2. These repairs resulted in new reference sentences for a substantial number of sentences. These repaired realizations were later used to calculate targeted versions of the evaluation metrics, i.e., using the repaired sentence as the reference sentence. Although targeted metrics are not fully automatic, they are of interest because they allow the evaluation algorithm to focus on what is actually wrong with the input, rather than all textual differences. Notably, targeted TER (HTER) has been shown to be more consistent with human judgments than human annotators are with one another (Snover et al., 2006). 
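The "targeted" variants mentioned above simply substitute the judge's repaired output for the original reference before a metric is computed. A minimal sketch, using a plain word-level edit rate as a stand-in for real TER (which additionally models block shifts); the sentences are adapted from the Georgia Gulf repair example (Figure 2) and the function names are illustrative:

```python
def word_edit_rate(hypothesis, reference):
    """Levenshtein word edits divided by reference length: a simplified,
    shift-free stand-in for TER."""
    h, r = hypothesis.split(), reference.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / max(len(r), 1)

def targeted_edit_rate(hypothesis, reference, repair=None):
    """Use the human-repaired sentence as the reference when one exists."""
    return word_edit_rate(hypothesis, repair if repair else reference)

hyp = "it weren't clear how NL would respond if Georgia Gulf again spurns them"
ref = "it wasn't clear how NL would respond if Georgia Gulf spurns them again"
rep = "it wasn't clear how NL would respond if Georgia Gulf again spurns them"
print(word_edit_rate(hyp, ref), targeted_edit_rate(hyp, ref, rep))
```

Against the original reference the hypothesis is charged for the acceptable word-order difference as well as the genuine agreement error; against the repaired reference only the agreement error remains, which is the behaviour that makes targeted metrics track human judgments more closely.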
3.2 Automatic evaluation

The realizations were also evaluated using seven automatic metrics:

• IBM’s BLEU, which scores a hypothesis by counting n-gram matches with the reference sentence (Papineni et al., 2001), with smoothing as described in (Lin and Och, 2004)
• The NIST n-gram evaluation metric, similar to BLEU, but rewarding rarer n-gram matches, and using a different length penalty
• METEOR, which measures the harmonic mean of unigram precision and recall, with a higher weight for recall (Banerjee and Lavie, 2005)
• TER (Translation Edit Rate), a measure of the number of edits required to transform a hypothesis sentence into the reference sentence (Snover et al., 2006)
• TERP, an augmented version of TER which performs phrasal substitutions, stemming, and checks for synonyms, among other improvements (Snover et al., 2009)
• TERPA, an instantiation of TERP with edit weights optimized for correlation with adequacy in MT evaluations
• GTM (General Text Matcher), a generalization of the F-measure that rewards contiguous matching spans (Turian et al., 2003)

Additionally, targeted versions of BLEU, METEOR, TER, and GTM were computed by using the human-repaired outputs as the reference set. The human repair was different from the reference sentence in 193 cases (about 9% of the total), and we expected this to result in better scores and correlations with the human judgments overall.

4 Results

4.1 Human judgments

Table 1 summarizes the dataset, as well as the mean adequacy and fluency scores garnered from the human evaluation. Overall adequacy and fluency judgments were high (4.16, 3.63) for the realizer systems on average, and the best-rated realizer systems achieved mean fluency scores above 4.

4.2 Inter-annotator agreement

Inter-annotator agreement was measured using the κ-coefficient, which is commonly used to measure the extent to which annotators agree in category judgment tasks. κ is defined as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement between annotators and P(E) is the probability of agreement due to chance (Carletta, 1996). Chance agreement for this data is calculated by the method discussed in Carletta’s squib. However, in previous work in MT meta-evaluation, Callison-Burch et al. (2007) assume the less strict criterion of uniform chance agreement, i.e. 1/5 for a five-point scale. They also introduce the notion of “relative” κ, which measures how often two or more judges agreed that A > B, A = B, or A < B for two outputs A and B, irrespective of the specific values given on the five-point scale; here, uniform chance agreement is taken to be 1/3. We report both absolute and relative κ in Table 2, using actual chance agreement rather than uniform chance agreement.

Score  Adequacy                          Fluency
5      All the meaning of the reference  Perfectly grammatical
4      Most of the meaning               Awkward or non-native; punctuation errors
3      Much of the meaning               Agreement errors or minor syntactic problems
2      Meaning substantially different   Major syntactic problems, such as missing words
1      Meaning completely different      Completely ungrammatical
Figure 1: Rating scale and guidelines

Ref.:    It wasn’t clear how NL and Mr. Simmons would respond if Georgia Gulf spurns them again
Realiz.: It weren’t clear how NL and Mr. Simmons would respond if Georgia Gulf again spurns them
Repair:  It wasn’t clear how NL and Mr. Simmons would respond if Georgia Gulf again spurns them
Figure 2: Example of repair
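A small sketch of the κ computation just described: absolute κ over the two judges' five-point ratings, with chance agreement estimated from their observed category distributions, plus the pairwise >, =, < comparisons used for "relative" κ. The ratings and helper names below are illustrative, not the paper's data.

```python
from collections import Counter

def kappa(ratings_a, ratings_b):
    """Kappa for two annotators over the same items, with chance agreement
    estimated from each annotator's empirical category distribution."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    pa, pb = Counter(ratings_a), Counter(ratings_b)
    chance = sum((pa[c] / n) * (pb[c] / n) for c in set(pa) | set(pb))
    return (observed - chance) / (1 - chance)

def relative_agreements(scores_a, scores_b):
    """Turn two judges' scores for the realizations of one reference sentence
    into pairwise comparison labels (-1, 0, 1) for a 'relative' kappa."""
    labels_a, labels_b = [], []
    for i in range(len(scores_a)):
        for j in range(i + 1, len(scores_a)):
            labels_a.append((scores_a[i] > scores_a[j]) - (scores_a[i] < scores_a[j]))
            labels_b.append((scores_b[i] > scores_b[j]) - (scores_b[i] < scores_b[j]))
    return labels_a, labels_b

judge1 = [5, 4, 4, 3, 5, 2]   # invented ratings for six realizations
judge2 = [5, 4, 3, 3, 5, 2]
print(kappa(judge1, judge2))                      # absolute kappa
print(kappa(*relative_agreements(judge1, judge2)))  # relative kappa
```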
The κ scores of 0.60 for adequacy and 0.63 for fluency across the entire dataset represent “substantial” agreement, according to the guidelines discussed in (Landis and Koch, 1977), better than is typically reported for machine translation evaluation tasks; for example, Callison-Burch et al. (2007) reported “fair” agreement, with κ = 0.281 for fluency and κ = 0.307 for adequacy (relative). Assuming the uniform chance agreement that the previously cited work adopts, our inter-annotator agreements (both absolute and relative) are still higher. This is likely due to the generally high quality of the realizations evaluated, leading to easier judgments.

4.3 Correlation with automatic evaluation

To determine how well the automatic evaluation methods described in Section 3 correlate with the human judgments, we averaged the human judgments for adequacy and fluency, respectively, for each of the rated realizations, and then computed both Pearson’s correlation coefficient and Spearman’s rank correlation coefficient between these scores and each of the metrics. Spearman’s correlation makes fewer assumptions about the distribution of the data, but may not reflect a linear relationship that is actually present. Both are frequently reported in the literature. Due to space constraints, we show only Spearman’s correlation, although the TER family scored slightly better on Pearson’s coefficient, relatively. The results for Spearman’s correlation are given in Table 3. Additionally, the average scores for adequacy and fluency were themselves averaged into a single score, following (Snover et al., 2009), and the Spearman’s correlation of each of the automatic metrics with these scores is given in Table 4. All reported correlations are significant at p < 0.001.

4.4 Bootstrap sampling of correlations

For each of the sub-corpora shown in Table 1, we computed confidence intervals for the correlations between adequacy and fluency human scores with selected automatic metrics (BLEU, HBLEU, TER, TERP, and HTER) as described in (Koehn, 2004). We sampled each sub-corpus 1000 times with replacement, and calculated correlations between the rankings induced by the human scores and those induced by the metrics for each reference sentence. We then used these coefficients to estimate the confidence interval, after excluding the top 25 and bottom 25 coefficients, following (Lin and Och, 2004). The results of this for the BLEU metric are shown in Table 5. We determined which correlations lay within the 95% confidence interval of the best performing metric in each row of Table 3; these figures are italicized.

5 Discussion

5.1 Human judgments of systems

The results for the four OpenCCG perceptron models mostly confirm those reported in (White and Rajkumar, 2009), with one exception: the B-3 model was below B-2, though the P-B (perceptron-best) model still scored highest. This may have been due to differences in the testing scenario. None of the differences in adequacy scores among the individual systems are significant, with the exception of the WordNet system. In this case, the lack of word-sense disambiguation for the substituted words results in a poor overall adequacy score (e.g., wage floor → wage story). Conversely, it scores highest for fluency, as substituting a noun or verb with a synonym does not usually introduce ungrammaticality.
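A sketch of the bootstrap procedure of Section 4.4, assuming the averaged human scores and a metric's scores for one sub-corpus are available as parallel lists; scipy's spearmanr provides the rank correlation, and the 1000 resamples and top/bottom-25 trimming follow the description above. The numeric values are invented, and the function name is illustrative.

```python
import random
from scipy.stats import spearmanr

def bootstrap_spearman_ci(human_scores, metric_scores, samples=1000, trim=25):
    """Resample (human, metric) pairs with replacement, collect Spearman's rho,
    and report the interval left after dropping the top and bottom `trim` values."""
    pairs = list(zip(human_scores, metric_scores))
    rhos = []
    for _ in range(samples):
        resample = [random.choice(pairs) for _ in pairs]
        h, m = zip(*resample)
        rho, _ = spearmanr(h, m)
        rhos.append(rho)
    rhos.sort()
    return rhos[trim], rhos[-trim - 1]

# Toy scores standing in for averaged human judgments and BLEU on the same sentences.
human = [4.5, 3.0, 5.0, 2.5, 4.0, 3.5, 4.5, 2.0, 3.0, 5.0]
bleu = [0.55, 0.31, 0.68, 0.22, 0.47, 0.40, 0.52, 0.18, 0.35, 0.71]
low, high = bootstrap_spearman_ci(human, bleu)
print(f"approx. 95% interval: [{low:.2f}, {high:.2f}]")
```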
5.2 Correlations of human judgments with MT metrics Of the non-human-targeted metrics evaluated, BLEU and TER/TERP demonstrate the highest correlations with the human judgments of fluency (r = 0.62, 0.64). The TER family of evaluation metrics have been observed to perform very well in MTevaluation tasks, and although the data evaluated here differs from typical MT data in some important ways, the correlation of TERP with the human judgments is substantial. In contrast with previous MT evaluations where TERP performs considerably better than TER, these scored close to equal on our data, possibly because TERP’s stem, synonym, and paraphrase matching are less useful when most of the variation is syntactic. The correlations with BLEU and METEOR are lower than those reported in (Callison-Burch et al., 2007); in that study, BLEU achieved adequacy and fluency correlations of 0.690 and 0.722, respectively, and METEOR achieved 0.701 and 0.719. The correlations for these metrics might be expected to be lower for our data, since overall quality is higher, making the metrics’ task more difficult as the outputs involve subtler differences between acceptable and unacceptable variation. The human-targeted metrics (represented by the prefixed H in the data tables) correlated even more strongly with the human judgments, compared to the non-targeted versions. HTER demonstrated the best 569 correlation with realizer fluency (r = 0.75). For several kinds of acceptable variation involving the rearrangement of constituents (such as dative shift), TERP gives a more reasonable score than BLEU, due to its ability to directly evaluate phrasal shifts. The following realization was rated 4.5 for fluency, and was more correctly ranked by TERP than BLEU: (3) Ref: The deal also gave Mitsui access to a high-tech medical product. (4) Realiz.: The deal also gave access to a high-tech medical product to Mitsui. For each reference sentence, we compared the ranking of its realizations induced from the human scores to the ranking induced from the TERP score, and counted the rank errors by the latter, informally categorizing them by error type (see Table 7). In the 50 sentences with the highest numbers of rank errors, 17 were affected by punctuation differences, typically involving variation in comma placement. Human fluency judgments of outputs with only punctuation problems were generally high, and many realizations with commas inserted or removed were rated fully fluent by the annotators. However, TERP penalizes such insertions or deletions. Agreement errors are another frequent source of ranking errors for TERP. The human judges tended to harshly penalize sentences with number-agreement or tense errors, whereas TERP applies only a single substitution penalty for each such error. We expect that with suitable optimization of edit weights to avoid over-penalizing punctuation shifts and underpenalizing agreement errors, TERP would exhibit an even stronger correlation with human fluency judgments. None of the evaluation metrics can distinguish an acceptable movement of a word or constituent from an unacceptable movement, with only one reference sentence. A substantial source of error for both TERP and BLEU is variation in adverbial placement, as shown in (7). Similar errors are seen with prepositional phrases and some commonly-occurring temporal adverbs, which typically admit a number of variations in placement. 
Another important example of acceptable variation which these metrics do not generally rank correctly is dative alternation: Ref. We need to clarify what exactly is wrong with it. Realiz. Flu. TERP BLEU We need to clarify exactly what is wrong with it.50.10.5555 We need to clarify exactly what ’s wrong with it. 5 0.2 0.4046 (7) We need to clarify what , exactly , is wrong with it. 5 0.2 0.5452 We need to clarify what is wrong with it exactly. 4.5 0.1 0.6756 We need to clarify what exactly , is wrong with it. 4 0.1 0.7017 We need to clarify what , exactly is wrong with it. 4 0.1 0.7017 We needs to clarify exactly what is wrong with it. (5) Ref. When test booklets were passed out 48 hours ahead of time, she says she copied questions in the social studies section and gave the answers to students. (6) Realiz. When test booklets were passed out 48 hours ahead of time , she says she copied questions in the social studies section and gave students the answers. The correlations of each of the metrics with the human judgments of fluency for the realizer systems indicate at least a moderate relationship, in contrast with the results reported in (Stent et al., 2005) for paraphrase data, which found an inverse correlation for fluency, and (Cahill, 2009) for the output ofa surface realizer for German, which found only a weak correlation. However, the former study employed a corpus-based paraphrase generation system rather than grammar-driven surface realizers, and the resulting paraphrases exhibited much broader variation. In Cahill’s study, the outputs of the realizer were almost always grammatically correct, and the automated evaluation metrics were ranking markedness instead of grammatical acceptability. 5.3 System-level comparisons In order to investigate the efficacy of the metrics in ranking different realizer systems, or competing realizations from the same system generated using different ranking models, we considered seven different “systems” from the whole dataset of realizations. These consisted of five OpenCCG-based realizations (the best realization from three baseline models, and the best and the worst realization from the full perceptron model), and two XLE-based sys- tems (the best and the worst realization, after ranking the outputs of the XLE realizer with an n-gram model). The mean of the combined adequacy and 570 3 0.103 0.346 fluency scores of each of these seven systems was compared with that of every other system, resulting in 21 pairwise comparisons. Then Tukey’s HSD test was performed to determine the systems which differed significantly in terms of the average adequacy and fluency rating they received.4 The test revealed five pairwise comparisons where the scores were significantly different. Subsequently, for each of these systems, an overall system-level score for each of the MT metrics was calculated. For the five pairwise comparisons where the adequacy-fluency group means differed significantly, we checked whether the metric ranked the systems correctly. Table 8 shows the results of a pairwise comparison between the ranking induced by each evaluation metric, and the ranking induced by the human judgments. Five of the seven non- targeted metrics correctly rank more than half of the systems. NIST, METEOR, and GTM get the most comparisons right, but neither NIST nor GTM correctly rank the OpenCCG-baseline model 1 with respect to the XLE-best model. 
TER and TERP get two of the five comparisons correct, and they incorrectly rank two of the five OpenCCG model comparisons, as well as the comparison between the XLE-worst and OpenCCG-best systems. For the targeted metrics, HNIST is correct for all five comparisons, while neither HBLEU nor HMETEOR correctly rank all the OpenCCG models. On the other hand, HTER and HGTM incorrectly rank the XLE-best system versus OpenCCG-based models. In summary, some of the metrics get some of the rankings correct, but none of the non-targeted metrics get all of them correct. Moreover, different metrics make different ranking errors. This argues for 4This particular test was chosen since it corrects for multiple post-hoc analyses conducted on the same data-set. the use of multiple metrics in comparing realizer systems. 6 Conclusion Our study suggests that although the task of evaluating the output from realizer systems differs from the task of evaluating machine translations, the automatic metrics used to evaluate MT outputs deliver moderate correlations with combined human fluency and adequacy scores when used on surface realizations. We also found that the MT-evaluation metrics are useful in evaluating different versions of the same realizer system (e.g., the various OpenCCG realization ranking models), and finding cases where a system is performing poorly. As in MT-evaluation tasks, human-targeted metrics have the highest correlations with human judgments overall. These results suggest that the MT-evaluation metrics are useful for developing surface realizers. However, the correlations are lower than those reported for MT data, suggesting that they should be used with caution, especially for cross-system evaluation, where consulting multiple metrics may yield more reliable comparisons. In our study, the targeted version of TERP correlated most strongly with human judgments of fluency. In future work, the performance of the TER family of metrics on this data might be improved by opti- mizing the edit weights used in computing its scores, so as to avoid over-penalizing punctuation movements or under-penalizing agreement errors, both of which were significant sources of ranking errors. Multiple reference sentences may also help mitigate these problems, and the corpus of human-repaired realizations that has resulted from our study is a step in this direction, as it provides multiple references for some cases. We expect the corpus to also prove useful for feature engineering and error analysis in developing better realization models.5 Acknowledgements We thank Aoife Cahill and Tracy King for providing us with the output of the XLE generator. We also thank Chris Callison-Burch and the anonymous reviewers for their helpful comments and suggestions. 5The corpus can be downloaded from http : / /www . l ing .ohio-st ate . edu / ˜mwhite / dat a / emnlp 10 / . 571 This material is based upon work supported by the National Science Foundation under Grant No. 0812297. References Jason Baldridge. 2002. Lexically Specified Derivational Control in Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh. S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72. R. Barzilay and L. Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. 
In Proceedings of HLT-NAACL, volume 2003, pages 16–23.
Aoife Cahill. 2009. Correlating human and automatic evaluation of a German surface realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 97–100, Suntec, Singapore, August. Association for Computational Linguistics.
C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL, volume 2006, pages 249–256.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In StatMT ’07: Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Morristown, NJ, USA. Association for Computational Linguistics.
C. Callison-Burch, T. Cohn, and M. Lapata. 2008. ParaMetric: An automatic evaluation metric for paraphrasing. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 97–104. Association for Computational Linguistics.
J. Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.
Dick Crouch, Mary Dalrymple, Ron Kaplan, Tracy King, John Maxwell, and Paula Newman. 2008. XLE documentation. Technical report, Palo Alto Research Center.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.
Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 501, Morristown, NJ, USA. Association for Computational Linguistics.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical report, IBM Research.
E. Reiter and A. Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell III, and Mark Johnson. 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 271–278, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.
M. Snover, N. Madnani, B.J. Dorr, and R. Schwartz. 2009. Fluency, adequacy, or HTER?: Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268. Association for Computational Linguistics.
Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Proceedings of CICLing.
J.P. Turian, L. Shen, and I.D. Melamed. 2003. Evaluation of machine translation and its evaluation.
Michael White and Rajakrishnan Rajkumar. 2009. Perceptron reranking for CCG realization.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 410–419, Singapore, August. Association for Computational Linguistics.
Michael White. 2006. Efficient Realization of Coordinate Structures in Combinatory Categorial Grammar. Research on Language and Computation, 4(1):39–75.

[Tables 1–6 and 8 could not be recovered from the text dump; only their captions are kept below. Table 7 is rebuilt from the surviving counts.]

Table 1: Descriptive statistics
Table 2: Corpora-wise inter-annotator agreement (absolute and relative κ values shown)
Table 3: Spearman’s correlations among NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), TER (T), human variants (HN, HB, HM, HT, HG) and human judgments (-Adq: adequacy and -Flu: fluency); scores which fall within the 95% CI of the best are italicized
Table 4: Spearman’s correlations among NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), TER (T), human variants (HN, HB, HM, HT, HG) and human judgments (combined adequacy and fluency scores)
Table 5: Spearman’s correlation analysis (bootstrap sampling) of the BLEU scores of various systems with human adequacy and fluency scores
Table 6: Spearman’s correlations of NIST (N), BLEU (B), METEOR (M), GTM (G), TERp (TP), TERpa (TA), human variants (HT, HN, HB, HM, HG), and individual human judgments (combined adq. and flu. scores)

Table 7: Factors influencing TERP ranking errors for 50 worst-ranked realization groups
Factor                   Count
Punctuation              17
Adverbial shift          16
Agreement                14
Other shifts             8
Conjunct rearrangement   8
Complementizer ins/del   5
PP shift                 4

Table 8: Metric-wise ranking performance in terms of agreement with a ranking induced by combined adequacy and fluency scores; each metric gets a score out of 5 (i.e. the number of system-level comparisons that emerged significant as per Tukey’s HSD test). Legend: Perceptron Best (PB); Perceptron Worst (PW); XLE Best (XB); XLE Worst (XW); OpenCCG baseline models 1 to 3 (C1 ... C3)

3 0.58573073 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

4 0.49345914 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

5 0.44795424 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

Author: Jinxi Xu ; Antti-Veikko Rosti

Abstract: Word alignment plays a central role in statistical MT (SMT) since almost all SMT systems extract translation rules from word aligned parallel training data. While most SMT systems use unsupervised algorithms (e.g. GIZA++) for training word alignment, supervised methods, which exploit a small amount of human-aligned data, have become increasingly popular recently. This work empirically studies the performance of these two classes of alignment algorithms and explores strategies to combine them to improve overall system performance. We used two unsupervised aligners, GIZA++ and HMM, and one supervised aligner, ITG, in this study. To avoid language and genre specific conclusions, we ran experiments on test sets consisting of two language pairs (Chinese-to-English and Arabicto-English) and two genres (newswire and weblog). Results show that the two classes of algorithms achieve the same level of MT perfor- mance. Modest improvements were achieved by taking the union of the translation grammars extracted from different alignments. Significant improvements (around 1.0 in BLEU) were achieved by combining outputs of different systems trained with different alignments. The improvements are consistent across languages and genres.

6 0.428583 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

7 0.3637872 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

8 0.36190924 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

9 0.3584561 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

10 0.35393581 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

11 0.28834033 39 emnlp-2010-EMNLP 044

12 0.24952789 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

13 0.24608524 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

14 0.23720104 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

15 0.22879609 40 emnlp-2010-Effects of Empty Categories on Machine Translation

16 0.22296581 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

17 0.22282676 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

18 0.21158595 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

19 0.21074401 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

20 0.20061049 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.02), (12, 0.043), (29, 0.155), (30, 0.017), (32, 0.016), (52, 0.034), (56, 0.045), (62, 0.012), (66, 0.098), (72, 0.046), (76, 0.019), (83, 0.398)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80399096 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

same-paper 2 0.72753835 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

Author: Hideki Isozaki ; Tsutomu Hirao ; Kevin Duh ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Automatic evaluation of Machine Translation (MT) quality is essential to developing highquality MT systems. Various evaluation metrics have been proposed, and BLEU is now used as the de facto standard metric. However, when we consider translation between distant language pairs such as Japanese and English, most popular metrics (e.g., BLEU, NIST, PER, and TER) do not work well. It is well known that Japanese and English have completely different word orders, and special care must be paid to word order in translation. Otherwise, translations with wrong word order often lead to misunderstanding and incomprehensibility. For instance, SMT-based Japanese-to-English translators tend to translate ‘A because B’ as ‘B because A.’ Thus, word order is the most important problem for distant language translation. However, conventional evaluation metrics do not significantly penalize such word order mistakes. Therefore, locally optimizing these metrics leads to inadequate translations. In this paper, we propose an automatic evaluation metric based on rank correlation coefficients modified with precision. Our meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.

3 0.57722044 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language’s data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment. Sentiment analysis (Pang and Lee, 2008) offers the promise of automatically discerning how people feel about a product, person, organization, or issue based on what they write online, which is potentially of great value to businesses and other organizations. However, the vast majority of sentiment resources and algorithms are limited to a single language, usually English (Wilson, 2008; Baccianella and Sebastiani, 2010). Since no single language captures a majority of the content online, adopting such a limited approach in an increasingly global community risks missing important details and trends that might only be available when text in multiple languages is taken into account. 45 Philip Resnik Department of Linguistics and UMIACS University of Maryland College Park, MD re snik@umd .edu Up to this point, multiple languages have been addressed in sentiment analysis primarily by transferring knowledge from a resource-rich language to a less rich language (Banea et al., 2008), or by ignoring differences in languages via translation into English (Denecke, 2008). These approaches are limited to a view of sentiment that takes place through an English-centric lens, and they ignore the potential to share information between languages. Ideally, learning sentiment cues holistically, across languages, would result in a richer and more globally consistent picture. In this paper, we introduce Multilingual Supervised Latent Dirichlet Allocation (MLSLDA), a model for sentiment analysis on a multilingual corpus. MLSLDA discovers a consistent, unified picture of sentiment across multiple languages by learning “topics,” probabilistic partitions of the vocabulary that are consistent in terms of both meaning and relevance to observed sentiment. Our approach makes few assumptions about available resources, requiring neither parallel corpora nor machine translation. The rest of the paper proceeds as follows. In Section 1, we describe the probabilistic tools that we use to create consistent topics bridging across languages and the MLSLDA model. In Section 2, we present the inference process. We discuss our set of semantic bridges between languages in Section 3, and our experiments in Section 4 demonstrate that this approach functions as an effective multilingual topic model, discovers sentiment-biased topics, and uses multilingual corpora to make better sentiment predictions across languages. Sections 5 and 6 discuss related research and discusses future work, respectively. 
1 Predictions from Multilingual Topics

As its name suggests, MLSLDA is an extension of Latent Dirichlet allocation (LDA) (Blei et al., 2003), a modeling approach that takes a corpus of unannotated documents as input and produces two outputs, a set of “topics” and assignments of documents to topics. Both the topics and the assignments are probabilistic: a topic is represented as a probability distribution over words in the corpus, and each document is assigned a probability distribution over all the topics. Topic models built on the foundations of LDA are appealing for sentiment analysis because the learned topics can cluster together sentiment-bearing words, and because topic distributions are a parsimonious way to represent a document (the latter property has also made LDA popular for information retrieval (Wei and Croft, 2006)). LDA has been used to discover latent structure in text (e.g. for discourse segmentation (Purver et al., 2006) and authorship (Rosen-Zvi et al., 2004)). MLSLDA extends the approach by ensuring that this latent structure, the underlying topics, is consistent across languages. We discuss multilingual topic modeling in Section 1.1, and in Section 1.2 we show how this enables supervised regression regardless of a document’s language.

1.1 Capturing Semantic Correlations

Topic models posit a straightforward generative process that creates an observed corpus. For each document d, some distribution θd over unobserved topics is chosen. Then, for each word position in the document, a topic z is selected. Finally, the word for that position is generated by selecting from the topic indexed by z. (Recall that in LDA, a “topic” is a distribution over words). In monolingual topic models, the topic distribution is usually drawn from a Dirichlet distribution. Using Dirichlet distributions makes it easy to specify sparse priors, and it also simplifies posterior inference because Dirichlet distributions are conjugate to multinomial distributions. However, drawing topics from Dirichlet distributions will not suffice if our vocabulary includes multiple languages. If we are working with English, German, and Chinese at the same time, a Dirichlet prior has no way to favor distributions z such that p(good|z), p(gut|z), and
For concreteness in this section, we will use WordNet (Miller, 1990) as the representation of this multilingual semantic bridge, since it is well known, offers convenient and intuitive terminology, and demonstrates the full flexibility of our approach. However, the model we describe generalizes to any tree-structured rep- resentation of multilingual knowledge; we discuss some alternatives in Section 3. WordNet organizes a vocabulary into a rooted, directed acyclic graph of nodes called synsets, short for “synonym sets.” A synset is a child of another synset if it satisfies a hyponomy relationship; each child “is a” more specific instantiation of its parent concept (thus, hyponomy is often called an “isa” relationship). For example, a “dog” is a “canine” is an “animal” is a “living thing,” etc. As an approximation, it is not unreasonable to assume that WordNet’s structure of meaning is language independent, i.e. the concept encoded by a synset can be realized using terms in different languages that share the same meaning. In practice, this organization has been used to create many alignments of international WordNets to the original English WordNet (Ordan and Wintner, 2007; Sagot and Fiˇ ser, 2008; Isahara et al., 2008). Using the structure of WordNet, we can now describe a generative process that produces a distribution over a multilingual vocabulary, which encourages correlations between words with similar meanings regardless of what language each word is in. For each synset h, we create a multilingual word distribution for that synset as follows: 1. Draw transition probabilities βh ∼ Dir (τh) 2. Draw stop probabilities ωh ∼ Dir∼ (κ Dhi)r 3. For each language l, draw emission probabilities for that synset φh,l ∼ Dir (πh,l) . For conciseness in the rest of the paper, we will refer to this generative process as multilingual Dirichlet hierarchy, or MULTDIRHIER(τ, κ, π) .2 Each observed token can be viewed as the end result of a sequence of visited synsets λ. At each node in the tree, the path can end at node iwith probability ωi,1, or it can continue to a child synset with probability ωi,0. If the path continues to another child synset, it visits child j with probability βi,j. If the path ends at a synset, it generates word k with probability φi,l,k.3 The probability of a word being emitted from a path with visited synsets r and final synset h in language lis therefore p(w, λ = r, h|l, β, ω, φ) = (iY,j)∈rβi,jωi,0(1 − ωh,1)φh,l,w. Note that the stop probability ωh (1) is independent of language, but the emission φh,l is dependent on the language. This is done to prevent the following scenario: while synset A is highly probable in a topic and words in language 1attached to that synset have high probability, words in language 2 have low probability. If this could happen for many synsets in a topic, an entire language would be effectively silenced, which would lead to inconsistent topics (e.g. 2Variables τh, πh,l, and κh are hyperparameters. Their mean is fixed, but their magnitude is sampled during inference (i.e. Pkτhτ,ih,k is constant, but τh,i is not). For the bushier bridges, (Pe.g. dictionary and flat), their mean is uniform. For GermaNet, we took frequencies from two balanced corpora of German and English: the British National Corpus (University of Oxford, 2006) and the Kern Corpus of the Digitales Wo¨rterbuch der Deutschen Sprache des 20. Jahrhunderts project (Geyken, 2007). 
Having defined topic distributions in a way that can preserve cross-language correspondences, we now use this distribution within a larger model that can discover cross-language patterns of use that predict sentiment.

1.2 The MLSLDA Model

We will view sentiment analysis as a regression problem: given an input document, we want to predict a real-valued observation y that represents the sentiment of a document. Specifically, we build on supervised latent Dirichlet allocation (SLDA; Blei and McAuliffe, 2007), which makes predictions based on the topics expressed in a document; this can be thought of as projecting the words in a document to a low-dimensional space whose dimension equals the number of topics. Blei et al. showed that using this latent topic structure can offer improved predictions over regressions based on words alone, and the approach fits well with our current goals, since word-level cues are unlikely to be identical across languages. In addition to text, SLDA has been successfully applied to other domains such as social networks (Chang and Blei, 2009) and image classification (Wang et al., 2009).

The key innovation in this paper is to extend SLDA by creating topics that are globally consistent across languages, using the bridging approach above. We express our model in the form of a probabilistic generative latent-variable model that generates documents in multiple languages and assigns a real-valued score to each document. The score comes from a normal distribution whose mean is the dot product between a regression parameter η, which encodes the influence of each topic on the observation, and the document's empirical topic proportions, and whose variance is σ². With this model in hand, we use statistical inference to determine the distribution over latent variables that, given the model, best explains observed data.

The generative model is as follows:

1. For each topic i = 1 . . . K, draw a topic distribution {βi, ωi, φi} from MULTDIRHIER(τ, κ, π).
2. For each document d = 1 . . . M with language ld:
   (a) Choose a distribution over topics θd ∼ Dir(α).
   (b) For each word in the document n = 1 . . . Nd, choose a topic assignment zd,n ∼ Mult(θd) and a path λd,n ending at word wd,n according to Equation 1, using {βzd,n, ωzd,n, φzd,n}.
3. Choose a response variable y ∼ Norm(η⊤ z̄d, σ²), where z̄d ≡ (1/Nd) Σn=1..Nd zd,n.

Crucially, note that the topics are not independent of the sentiment task; the regression encourages terms with similar effects on the observation y to be in the same topic. The consistency of topics described above allows the same regression to be done for the entire corpus regardless of the language of the underlying document.

2 Inference

Finding the model parameters most likely to explain the data is a problem of statistical inference. We employ stochastic EM (Diebolt and Ip, 1996), using a Gibbs sampler for the E-step to assign words to paths and topics.
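The document-level part of this generative story can be sketched as follows, with the synset hierarchy abstracted behind a generic emit(topic, language) callable (for example, the emit_word sketch above). The regression weights, hyperparameters, and dummy emitter are invented for illustration and are not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_mlslda_document(emit, lang, K, alpha, eta_reg, sigma, n_words):
    """One MLSLDA-style document: topic-driven words in `lang`, plus a response y.

    `emit(k, lang)` stands in for drawing a word from topic k's multilingual
    Dirichlet hierarchy; here it is simply a parameter of the sketch.
    """
    theta = rng.dirichlet(alpha * np.ones(K))                # theta_d ~ Dir(alpha)
    z = rng.choice(K, p=theta, size=n_words)                 # z_{d,n} ~ Mult(theta_d)
    words = [emit(k, lang) for k in z]
    z_bar = np.bincount(z, minlength=K) / n_words            # empirical topic proportions
    y = rng.normal(eta_reg @ z_bar, sigma)                   # y ~ Norm(eta^T z_bar, sigma^2)
    return words, y

def dummy_emit(k, lang):
    # Hypothetical stand-in for a real multilingual hierarchy emitter.
    return f"{lang}_topic{k}_word"

eta_reg = np.array([2.0, -1.0])                              # invented: topic 0 positive, topic 1 negative
words, y = generate_mlslda_document(dummy_emit, "en", K=2, alpha=0.5,
                                    eta_reg=eta_reg, sigma=0.3, n_words=20)
print(words[:5], round(float(y), 2))
```

Given many (words, y) pairs produced this way, inference must invert the process: recover the topic and path assignments together with the regression weights η from the observed words and responses, which is what the Gibbs-within-EM procedure below does.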
After randomly initializing the topics, we alternate between sampling the topic and path of a word (zd,n, λd,n) and finding the regression parameters η that maximize the likelihood. We jointly sample the topic and path, conditioning on all of the other path and document assignments in the corpus, selecting a path and topic with probability

p(zn = k, λn = r | z−n, λ−n, wn, η, σ, Θ) = p(yd | z, η, σ) · p(λn = r | zn = k, λ−n, wn, τ, κ, π) · p(zn = k | z−n, α).   (2)

Each of these three terms reflects a different influence on the topics: from the vocabulary structure, the document's topics, and the response variable. In the next paragraphs, we will expand each of them to derive the full conditional topic distribution.

As discussed in Section 1.1, the structure of the topic distribution encourages terms with the same meaning to be in the same topic, even across languages. During inference, we marginalize over possible multinomial distributions β, ω, and φ, using the observed transitions from i to j in topic k, Tk,i,j; stop counts in synset i in topic k, Ok,i,0; continue counts in synset i in topic k, Ok,i,1; and emission counts in synset i in language l in topic k, Fk,i,l.

[Figure 1: Graphical model representing MLSLDA. Shaded nodes represent observations, plates denote replication, and lines show probabilistic dependencies.]

The probability of taking a path r is then

p(λn = r | zn = k, λ−n) = [ ∏(i,j)∈r (Tk,i,j + τi,j) / (Σj′ (Tk,i,j′ + τi,j′)) · (Ok,i,1 + ωi,1) / (Σs∈{0,1} (Ok,i,s + ωi,s)) ]  (transition)
  × [ (Ok,rend,0 + ωrend,0) / (Σs∈{0,1} (Ok,rend,s + ωrend,s)) · (Fk,rend,wn + πrend,l,wn) / (Σw′ (Fk,rend,w′ + πrend,l,w′)) ]  (emission).   (3)

Equation 3 reflects the multilingual aspect of this model. The conditional topic distribution for SLDA (Blei and McAuliffe, 2007) replaces this term with the standard Multinomial-Dirichlet. However, we believe this is the first published SLDA-style model using MCMC inference, as prior work has used variational inference (Blei and McAuliffe, 2007; Chang and Blei, 2009; Wang et al., 2009).

Because the observed response variable depends on the topic assignments of a document, the conditional topic distribution is shifted toward topics that explain the observed response. Topics that move the predicted response ŷd toward the true yd will be favored. We drop terms that are constant across all topics; for the effect of the response variable,

p(yd | z, η, σ) ∝ exp[ (1/σ²) (yd − (1/Nd) Σk′ Nd,k′ ηk′) ⋯ ],

where the parenthesized difference captures the other words' influence on the prediction.
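The sketch below shows how the three factors of Equation 2 could be combined into an unnormalized score for one candidate (topic, path) assignment of a token, using collapsed counts named after those in the text (T, O, F) plus a standard document-topic count N. The nested-dict data layout, the symmetric scalar hyperparameters, the fanout table, and the exact form of the response factor are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def conditional_score(k, path, w, lang, d, counts, hyper, y, eta, sigma, N_d):
    """Unnormalized score for assigning topic k and synset path `path` to one token.

    Combines the three factors of Eq. 2: the path/emission term (cf. Eq. 3), the
    collapsed document-topic term, and the response term that favors topics
    explaining y_d. Count names follow the text: T (transitions), O (stop/continue,
    index 0 = stop, 1 = continue, as in the count definitions above), F (emissions),
    plus a document-topic count N. The layout and hyperparameters are assumed.
    """
    T, O, F, N = counts["T"], counts["O"], counts["F"], counts["N"]
    tau, omega, pi = hyper["tau"], hyper["omega"], hyper["pi"]
    alpha, K, V = hyper["alpha"], hyper["K"], hyper["V"]

    score = 1.0
    # Transition and "continue" factors along the interior of the path.
    for i, j in zip(path[:-1], path[1:]):
        trans = T.get(k, {}).get(i, {})
        score *= (trans.get(j, 0) + tau) / (sum(trans.values()) + tau * hyper["fanout"][i])
        stop_cont = O.get(k, {}).get(i, (0, 0))
        score *= (stop_cont[1] + omega) / (stop_cont[0] + stop_cont[1] + 2 * omega)

    # "Stop" factor and language-specific emission at the final synset.
    end = path[-1]
    stop_cont = O.get(k, {}).get(end, (0, 0))
    score *= (stop_cont[0] + omega) / (stop_cont[0] + stop_cont[1] + 2 * omega)
    emit = F.get(k, {}).get(end, {})
    score *= (emit.get((lang, w), 0) + pi) / (sum(emit.values()) + pi * V)

    # Collapsed document-topic factor p(z_n = k | z_-n, alpha).
    doc = N.get(d, {})
    score *= (doc.get(k, 0) + alpha) / (sum(doc.values()) + alpha * K)

    # Response factor p(y_d | z, eta, sigma): include the candidate assignment in z_bar.
    z_bar = np.array([doc.get(t, 0) for t in range(K)], dtype=float)
    z_bar[k] += 1.0
    z_bar /= N_d
    score *= float(np.exp(-0.5 * ((y - eta @ z_bar) / sigma) ** 2))
    return score
```

In a full stochastic-EM loop, this score would be evaluated for every candidate topic and path of the token, normalized, and sampled from in the E-step; the M-step then refits η to maximize the likelihood of the observed responses given the current z̄ vectors.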

4 0.43986264 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

5 0.43753639 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-the-art unsupervised methods by a significant margin.

6 0.43605411 77 emnlp-2010-Measuring Distributional Similarity in Context

7 0.43577138 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

8 0.43437091 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

9 0.43307799 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

10 0.43270603 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

11 0.4318108 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

12 0.43148243 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

13 0.43015599 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

14 0.42906642 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

15 0.42890614 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

16 0.42799312 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization

17 0.42703387 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

18 0.42240271 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

19 0.42217603 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

20 0.42126656 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding