emnlp emnlp2011 emnlp2011-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable), the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
Reference: text
sentIndex sentText sentNum sentScore
1 translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. [sent-7, score-0.56]
2 After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. [sent-9, score-0.224]
3 These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. [sent-10, score-1.1]
4 In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. [sent-11, score-0.585]
5 These results support the use of heterogeneous measures in order to consolidate text evaluation results. [sent-12, score-0.472]
6 However, original measures based on lexical matching, such as BLEU (Papineni et al. [sent-17, score-0.262]
7 Second, the advantages of novel measures are not easy to demonstrate in terms of correlation with human judgements. [sent-21, score-0.454]
8 Instead, we first analyze the state of the art in depth, concluding that it is not easy to determine the reliability of a measure. [sent-23, score-0.535]
9 Second, we formalize and check empirically two intrinsic properties that any evaluation measure based on similarity to human-produced references satisfies. [sent-25, score-0.224]
10 Assuming that a measure satisfies a set of basic formal constraints, these properties imply that corroborating a system comparison with additional measures always increases the overall reliability of the evaluation process, even when the added measures have a low correlation with human judgements. [sent-26, score-1.654]
11 In most papers, evaluation results are corroborated with similar n-gram based measures (e.g. [sent-27, score-0.298]
12 However, according to our second property, the greater the heterogeneity of the measures (which is measurable), the higher their reliability. [sent-29, score-0.323]
14 The practical implication is that corroborating evaluation results with measures based on higher linguistic levels increases the heterogeneity, and therefore the reliability, of evaluation results. [sent-32, score-1.165]
15 1 Individual measures. Among NLP disciplines, MT probably has the widest set of automatic evaluation measures. [sent-34, score-0.342]
16 , 2003a), the fact that it is not necessary to increase BLEU to improve systems (Callison-Burch and Osborne, 2006), the overscoring of statistical MT systems (Le and Przybocki, 2005), the low reliability on morphologically rich languages (Homola et al. [sent-58, score-0.535]
17 The reaction to these criticisms has focused on the development of more sophisticated measures in which candidate and reference translations are automatically annotated and compared at different linguistic levels. [sent-60, score-0.384]
18 , 2005) which reported some reliability improvement over ROUGE in terms of correlation with human judgements. [sent-71, score-0.727]
19 2 Combined measures. Several researchers have suggested integrating heterogeneous measures. [sent-80, score-0.436]
20 For instance, Albrecht and Hwa included syntax-based measures together with lexical measures, outperforming other combination schemes (Albrecht and Hwa, 2007a; Albrecht and Hwa, 2007b). [sent-93, score-0.262]
21 In addition, they showed that this mixed combination improved over the combination of linguistic or n-gram based measures alone (Corston-Oliver et al. [sent-100, score-0.262]
22 , 2009) reported a reliability improvement by including measures based on textual entailment in the set. [sent-103, score-0.797]
23 In (Giménez and Màrquez, 2008), a simple arithmetic mean of scores for combining measures at different linguistic levels was applied with remarkable results in recent shared evaluation tasks (Callison-Burch et al. [sent-104, score-0.397]
24 (2001b) evaluated the reliability of the BLEU metric according to its ability to emulate human assessors, as measured in terms of Pearson correlation with human assessments of adequacy and fluency at the document level. [sent-109, score-0.948]
25 The measure NIST (Doddington, 2002) was also meta-evaluated in terms of correlation with human assessments, but over different document sources and for a varying number of references and segment sizes. [sent-110, score-0.324]
26 Banerjee and Lavie (2005) argued that the reliability of metrics at the document level can be due to averaging effects but might not be robust across sentence translations. [sent-116, score-0.593]
27 In order to address this issue, they computed the translation-by-translation correlation with human assessments (i. [sent-117, score-0.273]
28 However, correlation with human judgements is not enough to determine the reliability of measures. [sent-120, score-0.8]
29 First, correlation at sentence level (unlike correlation at system level) tends to be low and difficult to interpret. [sent-121, score-0.282]
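As a rough illustration of these two meta-evaluation granularities, the following Python sketch computes a metric's Pearson correlation with human assessments at the segment level and at the system level. The toy scores, the record layout and the hand-rolled `pearson` helper are illustrative assumptions, not tied to any of the cited test suites:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy meta-evaluation data: metric scores and human assessments for segments of 3 systems.
segments = [
    {"system": "A", "metric": 0.41, "human": 3.0},
    {"system": "A", "metric": 0.35, "human": 2.0},
    {"system": "B", "metric": 0.30, "human": 2.5},
    {"system": "B", "metric": 0.44, "human": 3.5},
    {"system": "C", "metric": 0.28, "human": 2.0},
    {"system": "C", "metric": 0.33, "human": 3.0},
]

# Segment-level correlation: one point per segment.
seg_r = pearson([s["metric"] for s in segments], [s["human"] for s in segments])

# System-level correlation: one point per system, averaging over its segments.
systems = sorted({s["system"] for s in segments})
sys_r = pearson(
    [statistics.fmean(s["metric"] for s in segments if s["system"] == n) for n in systems],
    [statistics.fmean(s["human"] for s in segments if s["system"] == n) for n in systems],
)
print(round(seg_r, 3), round(sys_r, 3))
```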
30 , 2009), it is observed that higher linguistic levels in measures increase the correlation with human judgements at the system level at the cost of correlation at the segment level. [sent-124, score-0.832]
31 Culy and Riehemann observed that, although BLEU can achieve a high correlation at system level in some test suites, it over-scores a poor automatic translation of "Tom Sawyer" against a human-produced translation (Culy and Riehemann, 2003). [sent-127, score-0.416]
32 , 2006), in contrast to correlation with human judgements which is referred to as human acceptability. [sent-129, score-0.316]
33 The main advantage of meta-evaluation based on human likeness is that, since human assessments are not required, metrics can be evaluated over larger test beds. [sent-136, score-0.301]
34 4 The use of evaluation measures. In general, the state of the art includes a wide set of results that show the drawbacks of n-gram based measures such as BLEU, and a wide set of proposals for new single and combined measures which are meta-evaluated in terms of human acceptability (i. [sent-139, score-0.95]
35 , their ability to emulate human judges, typically measured in terms of correlation with human judgements) or human-likeness (i. [sent-141, score-0.288]
36 However, the original measures BLEU and ROUGE are still preferred. [sent-145, score-0.262]
37 We believe that one of the reasons is the lack of an in-depth study of the extent to which providing additional evaluation results with other metrics contributes to the reliability of such results. [sent-146, score-0.629]
38 The state of the art suggests that the use of heterogeneous measures can improve the evaluation reliability. [sent-147, score-0.51]
39 However, as far as we know, there is no comprehensive analysis of the contribution of novel measures when corroborating evaluation results with additional measures. [sent-148, score-0.462]
40 3 Similarity Based Evaluation Measures. In general, automatic evaluation measures applied in tasks like MT or AS (automatic summarization) are similarity measures between system outputs and human references. [sent-149, score-0.704]
41 These measures are related to precision, recall or overlap over specific types of linguistic units. [sent-150, score-0.318]
42 Other measures that work at higher linguistic levels apply precision, recall or overlap of linguistic components such as dependency relations, grammatical categories, semantic roles, etc. [sent-152, score-0.417]
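As a concrete toy example of such overlap-based similarity, the sketch below computes precision, recall and F-measure over word n-grams for a candidate against a single reference. The `ngram_prf` helper and the example strings are illustrative assumptions, not the exact formulation of any specific metric mentioned above:

```python
from collections import Counter

def ngram_prf(candidate, reference, n=2):
    """Toy precision/recall/F1 over word n-grams of a candidate against one reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(ngram_prf("the cat sat on the mat", "the cat is on the mat"))
```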
43 In fact, measures such as ROUGE or BLEU are not proper "metrics", because they do not satisfy the symmetry and triangle inequality properties. [sent-156, score-0.298]
44 2 Measures. As for evaluation measures, for MT we have used a rich set of 64 measures provided within the ASIYA Toolkit (Giménez and Màrquez). This includes measures operating at different linguistic levels: lexical, syntactic, and semantic. [sent-200, score-0.616]
45 At the lexical level this set includes variants of 8 measures employed in the state of the art: BLEU, NIST, GTM, METEOR, ROUGE, WER, PER and TER. [sent-201, score-0.262]
46 According to our computations, our measures cover high and low correlations at both levels. [sent-209, score-0.262]
47 Notice that the original ROUGE measures are oriented to recall. [sent-232, score-0.262]
48 In total, we have 21 measures for the summarization task. [sent-233, score-0.313]
49 5 Additive reliability. As discussed in Section 2, a number of recent publications address the problem of measure combination with successful results, especially when heterogeneous measures are combined. [sent-235, score-1.071]
50 The following property clarifies this issue and justifies the use of heterogeneous measures when corroborating evaluation results. [sent-236, score-0.7]
51 It asserts that the reliability of system improvements always increases when the evaluation result is corroborated by an additional similarity measure, regardless of the correlation achieved by the additional measure in isolation. [sent-237, score-0.894]
52 Let us define the reliability R(X) of a measure set as the probability of a real improvement (as measured by human judges) when a score improvement is observed simultaneously for all measures in the set X. [sent-240, score-0.948]
53 a translation) with a highly reliable measure set, but we can ensure a system improvement when all measures corroborate the result. [sent-243, score-0.391]
54 Then the additive reliability property can be stated as: R(X ∪ {x}) ≥ R(X). We could think of violating this property by adding, for instance, a measure consisting of a random function (x0(s) = rand(0. [sent-244, score-0.816]
55 Although our test suites include measures with low correlation at segment and system level, we can confirm empirically that all of them satisfy this property. [sent-249, score-0.536]
56 We have developed the following experiment: taking all possible measure pairs in the test suites, we have compared their reliability as a set versus the maximal reliability of any of them (by computing the difference R(X) − max(R(x1), R(x2))). [sent-250, score-1.17]
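A minimal sketch of this pairwise experiment follows; the record layout, the measure names and the `reliability` helper are illustrative assumptions rather than the authors' actual implementation. Each record compares two outputs (s, s2) of the same source text, storing every measure's score pair and whether human judges preferred s; R(X) is estimated as the fraction of records where a unanimous score improvement is confirmed by the human judgement:

```python
from itertools import combinations

# Toy data: each record holds, for one pair of outputs (s, s2) of the same source,
# every measure's scores (score_s, score_s2) and whether humans judged s at least as good.
comparisons = [
    {"scores": {"BLEU": (0.31, 0.27), "METEOR": (0.52, 0.49)}, "human_prefers_s": True},
    {"scores": {"BLEU": (0.30, 0.28), "METEOR": (0.44, 0.50)}, "human_prefers_s": False},
    {"scores": {"BLEU": (0.22, 0.29), "METEOR": (0.40, 0.47)}, "human_prefers_s": False},
    {"scores": {"BLEU": (0.35, 0.25), "METEOR": (0.58, 0.41)}, "human_prefers_s": True},
]

def reliability(measures, comps):
    """R(X): P(real human-judged improvement | all measures in X score s above s2)."""
    unanimous = [c["human_prefers_s"] for c in comps
                 if all(c["scores"][m][0] > c["scores"][m][1] for m in measures)]
    return sum(unanimous) / len(unanimous) if unanimous else float("nan")

# Pairwise experiment from the text: R({x1, x2}) against the best individual reliability.
for x1, x2 in combinations(["BLEU", "METEOR"], 2):
    gain = reliability({x1, x2}, comparisons) - max(reliability({x1}, comparisons),
                                                    reliability({x2}, comparisons))
    print(x1, x2, round(gain, 3))
```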
57 This result has a key implication: Corroborating evaluation results with a new measure, even when it has lower correlation with human judgements, increases the reliability of results. [sent-253, score-0.796]
58 According to the following property, this factor is the heterogeneity of measures. [sent-255, score-0.323]
59 6 Heterogeneity. This property states that the reliability of any measure combination is lower-bounded by the heterogeneity of the measure set. [sent-256, score-1.122]
60 In other words, a single measure can be more or less reliable, but a system improvement according to all measures in a heterogeneous set is reliable. [sent-257, score-0.536]
61 Let us define the heterogeneity H(X) of a set of measures X as, given two system outputs s and s′ such that s ≠ g, s′ ≠ g and s ≠ s′ (g is the reference text), the probability that there exist two measures that contradict each other. [sent-258, score-0.615]
62 H(X) = P(∃ x, x′ ∈ X : x(s) > x(s′) ∧ x′(s) < x′(s′)). Figure 1: Additive reliability for metric pairs. [sent-260, score-0.579]
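Using the same toy record layout as in the previous sketch (again an assumption, not the paper's code), H(X) can be estimated as the fraction of output pairs on which at least two measures disagree about which output scores higher; for brevity this sketch omits the paper's restriction to pairs where both outputs differ from the reference:

```python
from itertools import combinations

def heterogeneity(measures, comps):
    """H(X): P(two measures in X contradict each other on a pair of outputs)."""
    def contradict(c):
        return any(
            (c["scores"][x][0] > c["scores"][x][1] and c["scores"][y][0] < c["scores"][y][1]) or
            (c["scores"][x][0] < c["scores"][x][1] and c["scores"][y][0] > c["scores"][y][1])
            for x, y in combinations(measures, 2)
        )
    return sum(contradict(c) for c in comps) / len(comps)
```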
63 Clearly, the harder it is for measures to agree, the more meaningful it is when they do. [sent-262, score-0.262]
64 Increasing the heterogeneity implies joining measures or measure sets progressively. [sent-265, score-0.685]
65 According to the Additive Reliability property, this joining implies a reliability increase. [sent-266, score-0.535]
66 If H(X) = 1, then for any distinct pair of outputs that differ from the reference, there exist at least two measures in the set contradicting each other. [sent-269, score-0.262]
67 g = s → Q(s) ≥ Q(s′). Therefore, the reliability of the measure set is maximal. [sent-277, score-0.635]
68 In summary, if H(X) = 1, then: R(X) = P(Q(s) ≥ Q(s′) | x(s) ≥ x(s′) ∀x ∈ X) = P(Q(s) ≥ Q(s′) | s = g) = 1. Figures 2 and 3 show the relationship between the heterogeneity of randomly selected measure sets and their reliability for the MT and summarization test suites. [sent-278, score-1.009]
69 As the figures show, the higher the heterogeneity, the higher the reliability of the measure set. [sent-279, score-0.635]
70 Notice that the heterogeneity property does not necessarily imply a high correlation between reliability and heterogeneity. [sent-281, score-1.094]
71 For instance, an ideal single measure would have zero heterogeneity and maximal reliability. Figure 3: Heterogeneity vs. reliability. [sent-282, score-0.423]
72 Rather, the property brings us to the following situation: let us suppose that we have a set of single measures available which achieve a certain range of reliability. [sent-285, score-0.326]
73 But if we combine them, increasing the heterogeneity, the minimal reliability of the selected measures will be higher. [sent-288, score-0.797]
74 at high linguistic levels) that do not achieve high correlation in isolation, is better than corroborating results with any individual measure alone, such as ROUGE and BLEU, which is the common practice in the state of the art. [sent-291, score-0.461]
75 The main drawback of this property is that increasing the heterogeneity implies a sensitivity reduction. [sent-292, score-0.479]
76 In other words, unanimous evaluation results from heterogeneous measures are reliable but harder to achieve for the system developer. [sent-295, score-0.531]
77 Finally, Figure 4 shows that linguistic measures increase the heterogeneity of measure sets. [sent-297, score-0.741]
78 Figure 4: Heterogeneity of lexical measures vs. [sent-300, score-0.262]
79 Additive Reliability. According to the previous properties, corroborating evaluation results with several measures increases the reliability of evaluation results at the cost of sensitivity. [sent-303, score-1.066]
80 More specifically, we have found that the reliability of a measure set is higher than the reliability of each of the individual measures at a similar level of sensitivity. [sent-307, score-1.432]
81 human assessed) quality improvement: S(X) = P(x(s) ≥ x(s′) ∀x ∈ X | Q(s) ≥ Q(s′)). Let Rth(x) and Sth(x) be the reliability and sensitivity of a single measure x for a certain score-increase threshold th. Figure 5: Heterogeneity vs. [sent-310, score-0.778]
82 Rth(x) = P(Q(s) ≥ Q(s′) | x(s) − x(s′) ≥ th) and Sth(x) = P(x(s) − x(s′) ≥ th | Q(s) ≥ Q(s′)). The property that we want to check is that, at the same sensitivity level, combining measures is more reliable than increasing the score threshold of single measures: S(X) = Sth(x), [sent-312, score-0.447]
83 x ∈ X → R(X) ≥ Rth(x). Note that if we had a perfect measure xp such that R(xp) = S(xp) = 1, then combining this measure with a low-reliability measure xl would produce a lower sensitivity, but the maximal reliability would be preserved. [sent-313, score-1.37]
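Continuing with the same illustrative record layout, the set-level sensitivity S(X) and the thresholded reliability Rth(x) and sensitivity Sth(x) of a single measure can be estimated as below; these helpers are a sketch under the stated assumptions, not the authors' implementation:

```python
def sensitivity(measures, comps):
    """S(X): P(all measures in X score s above s2 | humans judge s better)."""
    positives = [c for c in comps if c["human_prefers_s"]]
    hits = [all(c["scores"][m][0] > c["scores"][m][1] for m in measures) for c in positives]
    return sum(hits) / len(hits) if hits else float("nan")

def reliability_at_threshold(measure, th, comps):
    """Rth(x): P(humans judge s better | x(s) - x(s2) >= th)."""
    passed = [c["human_prefers_s"] for c in comps
              if c["scores"][measure][0] - c["scores"][measure][1] >= th]
    return sum(passed) / len(passed) if passed else float("nan")

def sensitivity_at_threshold(measure, th, comps):
    """Sth(x): P(x(s) - x(s2) >= th | humans judge s better)."""
    positives = [c for c in comps if c["human_prefers_s"]]
    hits = [c["scores"][measure][0] - c["scores"][measure][1] >= th for c in positives]
    return sum(hits) / len(hits) if hits else float("nan")
```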
84 In order to confirm this property empirically, we have developed the following experiment: (i) we compute the reliability and sensitivity of randomly chosen measure sets over single text pairs. [sent-314, score-0.727]
85 Reliability Gain = R(X) − max{Rth(x) | x ∈ X ∧ Sth(x) = S(X)}. If there are several reliability values with the same sensitivity for a given single measure, we choose the highest reliability value for that measure. [sent-320, score-1.162]
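Putting the previous helpers together, the reliability gain of a measure set can be sketched as follows; searching a fixed threshold grid and matching sensitivities exactly are illustrative simplifications, and `reliability`, `sensitivity`, `sensitivity_at_threshold` and `reliability_at_threshold` are the helpers defined in the earlier sketches:

```python
def reliability_gain(measures, comps, thresholds=(0.0, 0.01, 0.02, 0.05, 0.1)):
    """R(X) minus the best single-measure Rth(x) whose Sth(x) matches S(X)."""
    target_s = sensitivity(measures, comps)
    best_single = float("-inf")
    for x in measures:
        for th in thresholds:
            # keep the highest Rth(x) among thresholds whose sensitivity matches S(X)
            if abs(sensitivity_at_threshold(x, th, comps) - target_s) < 1e-9:
                best_single = max(best_single, reliability_at_threshold(x, th, comps))
    if best_single == float("-inf"):
        return float("nan")  # no single measure reaches the same sensitivity on this grid
    return reliability(measures, comps) - best_single
```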
86 The horizontal axis represents the Heterogeneity of measure sets, while the vertical axis represents the reliability gain. [sent-322, score-0.635]
87 Remarkably, the reliability gain is positive for all cases in our test suites. [sent-323, score-0.535]
88 08 for AS (note that summarization measures are more redundant in our corpora). [sent-326, score-0.313]
89 In both test suites, the largest information gains are obtained with highly heterogeneous measure sets. [sent-327, score-0.274]
90 In summary, given comparable measures in terms of reliability, corroborating evaluation results with several measures is more effective than optimizing systems according to the best measure in the set. [sent-328, score-0.824]
91 This empirical property provides additional evidence in favour of the use of heterogeneous measures and, in particular, of the use of linguistic measures in combination with standard lexical measures. [sent-329, score-0.818]
92 8 Conclusions. In this paper, we have analyzed the state of the art in order to clarify why novel text evaluation measures are not exploited by the community. [sent-330, score-0.366]
93 Our first conclusion is that it is not easy to determine the reliability of measures, which is highly corpus-dependent and often contradictory when comparing correlation with human judgements at segment vs. system level. [sent-331, score-0.832]
94 In order to tackle this issue, we have studied a number of properties that suggest the convenience of using heterogeneous measures to corroborate evaluation results. [sent-333, score-0.511]
95 According to these properties, we can ensure that, even if we cannot determine the reliability of individual measures, corroborating a system improvement with additional measures always increases the reliability of the results. [sent-334, score-1.529]
96 In addition, the more heterogeneous the measures employed (which is measurable), the higher the reliability of the results. [sent-335, score-0.971]
97 But perhaps the most important practical finding is that, at similar sensitivity levels, the reliability obtained by corroborating evaluation results with several measures is always higher than that of any of the combined measures in isolation. [sent-336, score-1.394]
98 These properties point to the practical advantages of considering linguistic knowledge (beyond lexical information) in measures, even if they do not achieve a high correlation with human judgements. [sent-337, score-0.287]
99 Our experiments show that linguistic knowledge increases the heterogeneity of measure sets, which in turn increases the reliability of evaluation results when corroborating system comparisons with several measures. [sent-338, score-1.28]
100 Reevaluating the role of BLEU in machine translation research. [sent-372, score-0.228]
wordName wordTfidf (topN-words)
[('reliability', 0.535), ('heterogeneity', 0.323), ('measures', 0.262), ('gim', 0.222), ('nez', 0.184), ('rouge', 0.183), ('heterogeneous', 0.174), ('corroborating', 0.164), ('correlation', 0.141), ('mt', 0.134), ('amig', 0.116), ('bleu', 0.106), ('albrecht', 0.105), ('measure', 0.1), ('sensitivity', 0.092), ('arquez', 0.092), ('jes', 0.09), ('translation', 0.09), ('owczarzak', 0.082), ('assessments', 0.081), ('akiba', 0.075), ('judgements', 0.073), ('suites', 0.065), ('property', 0.064), ('culy', 0.06), ('gtm', 0.06), ('likeness', 0.06), ('metrics', 0.058), ('linguistic', 0.056), ('meteor', 0.054), ('gildea', 0.053), ('additive', 0.053), ('julio', 0.052), ('enrique', 0.051), ('hwa', 0.051), ('rth', 0.051), ('human', 0.051), ('summarization', 0.051), ('similarity', 0.049), ('gonzalo', 0.047), ('karolina', 0.047), ('asiya', 0.045), ('emulate', 0.045), ('felisa', 0.045), ('kahn', 0.045), ('mehay', 0.045), ('popovic', 0.045), ('tmi', 0.045), ('automatic', 0.044), ('metric', 0.044), ('llu', 0.043), ('levels', 0.043), ('nist', 0.042), ('decomposition', 0.041), ('gamon', 0.041), ('turian', 0.041), ('kulesza', 0.039), ('leusch', 0.039), ('overlapped', 0.039), ('proposals', 0.039), ('properties', 0.039), ('banerjee', 0.039), ('art', 0.038), ('wer', 0.037), ('satisfy', 0.036), ('translations', 0.036), ('evaluation', 0.036), ('melamed', 0.035), ('summit', 0.035), ('liu', 0.034), ('chan', 0.034), ('josef', 0.034), ('increases', 0.033), ('papineni', 0.033), ('andy', 0.033), ('monz', 0.033), ('duc', 0.033), ('editing', 0.033), ('measurable', 0.033), ('nie', 0.033), ('machine', 0.032), ('chris', 0.032), ('proceedings', 0.032), ('segment', 0.032), ('imply', 0.031), ('christof', 0.031), ('tillmann', 0.031), ('genabith', 0.031), ('cder', 0.03), ('clarify', 0.03), ('homola', 0.03), ('lita', 0.03), ('reeder', 0.03), ('riehemann', 0.03), ('unanimous', 0.03), ('upc', 0.03), ('viii', 0.03), ('reference', 0.03), ('edit', 0.029), ('reliable', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
2 0.17145585 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
3 0.11297207 125 emnlp-2011-Statistical Machine Translation with Local Language Models
Author: Christof Monz
Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.
4 0.08269985 110 emnlp-2011-Ranking Human and Machine Summarization Systems
Author: Peter Rankel ; John Conroy ; Eric Slud ; Dianne O'Leary
Abstract: The Text Analysis Conference (TAC) ranks summarization systems by their average score over a collection of document sets. We investigate the statistical appropriateness of this score and propose an alternative that better distinguishes between human and machine evaluation systems.
5 0.079481848 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao
Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
6 0.07023064 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
7 0.062880471 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
8 0.062559903 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
9 0.062462814 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization
10 0.061228402 136 emnlp-2011-Training a Parser for Machine Translation Reordering
11 0.057676837 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation
12 0.055273771 38 emnlp-2011-Data-Driven Response Generation in Social Media
13 0.055019893 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
14 0.054922231 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures
15 0.054174282 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
16 0.05381887 3 emnlp-2011-A Correction Model for Word Alignments
17 0.052724145 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
18 0.05268579 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
19 0.052564785 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
20 0.051854983 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
topicId topicWeight
[(0, 0.182), (1, 0.05), (2, 0.041), (3, -0.147), (4, 0.018), (5, -0.051), (6, 0.056), (7, 0.001), (8, -0.003), (9, -0.044), (10, 0.038), (11, -0.135), (12, -0.01), (13, 0.03), (14, 0.038), (15, 0.199), (16, -0.084), (17, 0.105), (18, 0.028), (19, -0.008), (20, -0.082), (21, -0.082), (22, 0.122), (23, 0.029), (24, -0.064), (25, 0.037), (26, -0.057), (27, 0.18), (28, -0.129), (29, -0.176), (30, -0.015), (31, 0.054), (32, 0.095), (33, -0.07), (34, -0.032), (35, -0.048), (36, 0.05), (37, 0.013), (38, -0.228), (39, -0.116), (40, 0.051), (41, -0.025), (42, 0.068), (43, -0.035), (44, 0.033), (45, 0.031), (46, -0.035), (47, -0.11), (48, 0.037), (49, -0.127)]
simIndex simValue paperId paperTitle
same-paper 1 0.95716637 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
2 0.72508961 110 emnlp-2011-Ranking Human and Machine Summarization Systems
Author: Peter Rankel ; John Conroy ; Eric Slud ; Dianne O'Leary
Abstract: The Text Analysis Conference (TAC) ranks summarization systems by their average score over a collection of document sets. We investigate the statistical appropriateness of this score and propose an alternative that better distinguishes between human and machine evaluation systems.
3 0.68345231 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
4 0.51863158 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman
Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
5 0.50275165 125 emnlp-2011-Statistical Machine Translation with Local Language Models
Author: Christof Monz
Abstract: Part-of-speech language modeling is commonly used as a component in statistical machine translation systems, but there is mixed evidence that its usage leads to significant improvements. We argue that its limited effectiveness is due to the lack of lexicalization. We introduce a new approach that builds a separate local language model for each word and part-of-speech pair. The resulting models lead to more context-sensitive probability distributions and we also exploit the fact that different local models are used to estimate the language model probability of each word during decoding. Our approach is evaluated for Arabic- and Chinese-to-English translation. We show that it leads to statistically significant improvements for multiple test sets and also across different genres, when compared against a competitive baseline and a system using a part-of-speech model.
6 0.48978662 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures
8 0.38763964 38 emnlp-2011-Data-Driven Response Generation in Social Media
9 0.36866713 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP
10 0.33849397 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization
11 0.32749549 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
12 0.31169093 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts
13 0.3043825 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
14 0.29547599 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
15 0.28475723 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
16 0.27953905 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
17 0.27234426 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
18 0.26686311 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
19 0.24524999 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
20 0.24338359 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
topicId topicWeight
[(23, 0.099), (28, 0.012), (36, 0.015), (37, 0.022), (45, 0.064), (53, 0.043), (54, 0.039), (57, 0.015), (62, 0.018), (64, 0.015), (66, 0.014), (69, 0.011), (79, 0.466), (82, 0.014), (90, 0.012), (96, 0.037), (98, 0.019)]
simIndex simValue paperId paperTitle
1 0.99199736 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
Author: Emily Thomforde ; Mark Steedman
Abstract: This paper introduces Chart Inference (CI), an algorithm for deriving a CCG category for an unknown word from a partial parse chart. It is shown to be faster and more precise than a baseline brute-force method, and to achieve wider coverage than a rule-based system. In addition, we show the application of CI to a domain adaptation task for question words, which are largely missing in the Penn Treebank. When used in combination with self-training, CI increases the precision of the baseline StatCCG parser over subjectextraction questions by 50%. An error analysis shows that CI contributes to the increase by expanding the number of category types available to the parser, while self-training adjusts the counts.
2 0.94721365 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax
Author: Wenbin Jiang ; Qun Liu ; Yajuan Lv
Abstract: We propose a relaxed correspondence assumption for cross-lingual projection of constituent syntax, which allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. Experiments show that, the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers.
same-paper 3 0.92595834 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
4 0.91034424 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images
Author: Yezhou Yang ; Ching Teo ; Hal Daume III ; Yiannis Aloimonos
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The input are initial noisy estimates of the objects and scenes detected in the image using state of the art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained from the English Gigaword corpus to obtain their estimates; together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters on a HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and de- , scriptive sentences compared to naive strategies that use vision alone.
5 0.6301167 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
Author: Tom Kwiatkowski ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman
Abstract: We consider the problem of learning factored probabilistic CCG grammars for semantic parsing from data containing sentences paired with logical-form meaning representations. Traditional CCG lexicons list lexical items that pair words and phrases with syntactic and semantic content. Such lexicons can be inefficient when words appear repeatedly with closely related lexical content. In this paper, we introduce factored lexicons, which include both lexemes to model word meaning and templates to model systematic variation in word usage. We also present an algorithm for learning factored CCG lexicons, along with a probabilistic parse-selection model. Evaluations on benchmark datasets demonstrate that the approach learns highly accurate parsers, whose generalization performance greatly from the lexical factoring. benefits
6 0.58134758 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
7 0.57674325 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
8 0.57139242 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
9 0.56512552 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
10 0.56023836 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
11 0.5582152 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
12 0.54383612 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
13 0.52775294 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
14 0.52650917 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
15 0.52510029 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
16 0.52359307 38 emnlp-2011-Data-Driven Response Generation in Social Media
17 0.52355295 70 emnlp-2011-Identifying Relations for Open Information Extraction
18 0.52338982 136 emnlp-2011-Training a Parser for Machine Translation Reordering
19 0.51917875 147 emnlp-2011-Using Syntactic and Semantic Structural Kernels for Classifying Definition Questions in Jeopardy!
20 0.51809925 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars