acl acl2013 acl2013-59 knowledge-graph by maker-knowledge-mining

59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics


Source: pdf

Author: Rebecca J. Passonneau ; Emily Chen ; Weiwei Guo ; Dolores Perin

Abstract: The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. This motivates the development of a more accurate automated method to compute pyramid scores. Of three methods tested here, the one that performs best relies on latent semantics.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. [sent-6, score-0.912]

2 This motivates the development of a more accurate automated method to compute pyramid scores. [sent-7, score-0.569]

3 Of three methods tested here, the one that performs best relies on latent semantics. [sent-8, score-0.098]

4 1 Introduction The pyramid method is an annotation and scoring procedure to assess semantic content of summaries in which the content units emerge from the annotation. [sent-9, score-0.912]

5 Each content unit is weighted by its frequency in human reference summaries. [sent-10, score-0.057]

6 It has been shown to produce reliable rankings of automated summarization systems, based on performance across multiple summarization tasks (Nenkova and Passonneau, 2004; Passonneau, 2010). [sent-11, score-0.273]

7 It has also been applied to assessment of oral narrative skills of children (Passonneau et al. [sent-12, score-0.165]

8 Here we show its potential for assessment of the reading comprehension of community college students. [sent-14, score-0.318]

9 We then present a method to automate pyramid scores based on latent semantics. [sent-15, score-0.558]

10 The pyramid method depends on two phases of manual annotation, one to identify weighted content units in model summaries written by proficient humans, and one to score target summaries against the models. [sent-16, score-1.231]

11 Each SCU is weighted by the number of model summaries it occurs in. [sent-18, score-0.305]

12 Figure 1 illustrates a Summary Content Unit taken from pyramid annotation of five model summaries of an elementary physics text. [sent-19, score-0.792]

13 The elements of an SCU are its index; a label, created by the annotator; its contributors (Ctr.); and its weight. [sent-20, score-0.083]

14 The weight corresponds to the number of contributors from distinct model summaries. [sent-22, score-0.083]

15 Four of the five model summaries contribute to SCU 105 shown here. [sent-23, score-0.34]

16 The four contributors have lexical items in common (matter, objects, substances), and many differences (makes up, being present). [sent-24, score-0.083]

17 SCU weights, which range from 1 to the number of model summaries M, induce a partition on the set of SCUs in all summaries into subsets Tw, w ∈ {1, . . . , M}. [sent-25, score-0.64]

18 The resulting partition is referred to as a pyramid because, starting with the subset for SCUs with weight 1, each next subset has fewer SCUs. [sent-29, score-0.425]
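
As a small illustration (the SCU record layout below is assumed, not taken from the paper), the weight-induced partition can be built by grouping SCUs into tiers keyed by weight:

```python
from collections import defaultdict

def build_pyramid(scus):
    """Group SCUs into tiers T_w keyed by weight w (1..M).

    `scus` is assumed to be a list of dicts like
    {"index": 105, "label": "...", "contributors": [...], "weight": 4}.
    """
    tiers = defaultdict(list)
    for scu in scus:
        tiers[scu["weight"]].append(scu)
    # Lower-weight tiers normally contain more SCUs than higher-weight ones,
    # which is what gives the partition its pyramid shape.
    return dict(tiers)
```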

19 New target summaries are scored by first annotating them to identify which SCUs they express. [sent-30, score-0.07]

20 Application of the pyramid method to assessment of student reading comprehension is impractical without an automated method to annotate target summaries. [sent-31, score-0.918]

21 Previous work on automated pyramid scores of automated summarizers performs well at ranking systems on many document sets, but is not precise enough to score human summaries of a single text. [sent-32, score-1.182]

22 We test three automated pyramid scoring procedures, and find that one based on distributional semantics correlates best with manual pyramid scores, and has higher precision and recall for content units in students’ summaries than methods that depend on string matching. [sent-33, score-1.576]

23 2 Related Work The most prominent NLP technique applied to reading comprehension is LSA (Landauer and Dumais, 1997), an early approach to latent semantic analysis claimed to correlate with reading comprehension (Foltz et al. [sent-34, score-0.529]

24 LSA has been incorporated with a suite of NLP metrics to assess students’ strategies for reading comprehension using think-aloud protocols (Boonthum-Denecke et al. [sent-38, score-0.18]

25 The resulting tool, and similar assessment tools such as Coh-Metrix, assess aspects of readability of texts, such as coherence, but do not assess students’ comprehension through their writing (Graesser et al. [sent-40, score-0.182]

26 E-rater is an automated essay scorer for standardized tests such as GMAT that also relies on a suite of NLP techniques (Burstein et al. [sent-43, score-0.31]

27 The pyramid method (Nenkova and Passonneau, 2004) was inspired in part by work in reading comprehension that scores content using human annotation (Beck et al. [sent-45, score-0.754]

28 An alternate line of research attempts to replicate human reading comprehension. [sent-47, score-0.109]

29 An automated tool to read and answer questions relies on abductive reasoning over logical forms extracted from text (Wellner et al. [sent-48, score-0.201]

30 The most widely used automated content evaluation is ROUGE (Lin, 2004; Lin and Hovy, 2003). [sent-51, score-0.231]

31 It relies on model summaries, and depends on n-gram overlap measures of different types. [sent-52, score-0.078]

32 In contrast to ROUGE, pyramid scoring is robust with as few as four or five model summaries (Nenkova and Passonneau, 2004). [sent-54, score-0.79]

33 A fully automated approach to evaluation for ranking systems that requires no model summaries incorporates latent semantic distributional similarities across words (Louis and Nenkova, 2009). [sent-55, score-0.61]

34 3 Criteria for Automated Scoring Pyramid scores of students’ summaries correlate well with a manual main ideas score developed for an intervention study with community college freshmen who attended remedial classes (Perin et al. [sent-57, score-0.685]

35 Twenty summaries by students who attended the same college and took the same remedial course were selected from a larger set of 322 that summarized an elementary physics text. [sent-59, score-0.667]

36 All were native speakers of English, and scored within 5 points of the mean reading score for the larger sample. [sent-60, score-0.215]

37 For the intervention study, student summaries had been assigned a score to represent how many main ideas from the source text were covered (Perin et al. [sent-61, score-0.456]

38 Interrater reliability of the main ideas score, as given by the Pearson correlation coefficient, was 0. [sent-63, score-0.087]

39 One of the co-authors created a model pyramid from summaries written by proficient Masters of Education students, annotated 20 target summaries against this pyramid, and scored the result. [sent-65, score-1.071]

40 The raw score of a target summary is the sum of its SCU weights. [sent-66, score-0.157]

41 Pyramid scores have been normalized by the number of SCUs in the summary (analogous to precision), or the average number of SCUs in model summaries (analogous to recall). [sent-67, score-0.452]

42 We normalized raw scores as the average of the two previous normalizations (analogous to F-measure). [sent-68, score-0.06]
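
The two normalizations and their average can be written down directly. The sketch below is one common way to make the denominators concrete (dividing the raw sum of matched SCU weights by the best sum attainable with the stated number of SCUs); the exact denominators are an assumption, since the sentences above only name the normalizing quantities informally.

```python
def max_attainable(tier_sizes, n):
    """Largest possible sum of weights using n SCUs drawn greedily
    from the highest tiers of the pyramid (tier_sizes: weight -> count)."""
    total, remaining = 0, n
    for w in sorted(tier_sizes, reverse=True):
        take = min(remaining, tier_sizes[w])
        total += take * w
        remaining -= take
        if remaining == 0:
            break
    return total

def pyramid_scores(matched_weights, n_model_avg, tier_sizes):
    """Raw, precision-like, recall-like, and F-like pyramid scores.

    matched_weights: weights of the SCUs found in the target summary
    n_model_avg:     average number of SCUs per model summary
    tier_sizes:      dict mapping weight -> number of SCUs with that weight
    """
    raw = sum(matched_weights)
    denom_p = max_attainable(tier_sizes, len(matched_weights))
    denom_r = max_attainable(tier_sizes, round(n_model_avg))
    prec_like = raw / denom_p if denom_p else 0.0
    rec_like = raw / denom_r if denom_r else 0.0
    f_like = (prec_like + rec_like) / 2.0   # the average used in the paper
    return raw, prec_like, rec_like, f_like
```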

43 The resulting scores have a high Pearson’s correlation of 0. [sent-69, score-0.117]

44 To be pedagogically useful, an automated method to assign pyramid scores to students’ summaries should meet the following criteria: 1) reliably rank students’ summaries of a source text, 2) assign correct pyramid scores, and 3) identify the correct SCUs. [sent-72, score-1.437]

45 A method could do well on criterion 1 but not 2, through scores that have uniform differences from corresponding manual pyramid scores. [sent-73, score-0.537]

46 Also, since each weight partition will have more than one SCU, it is possible to produce the correct numeric score by matching incorrect SCUs that have the correct weights. [sent-74, score-0.069]

47 4 Approach: Dynamic Programming Previous work observed that assignment of SCUs to a target summary can be cast as a dynamic programming problem (Harnly et al. [sent-76, score-0.118]

48 The method presented there relied on unigram overlap to score the closeness of the match of each eligible substring in a summary against each SCU in the pyramid. [sent-78, score-0.218]

49 It produced good rankings across summarization tasks, but assigned scores much lower than those assigned by humans. [sent-80, score-0.123]
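
The paper does not reproduce the dynamic program of Harnly et al.; the sketch below shows one way such an assignment could be structured for a single sentence, with `similarity` and `threshold` as placeholders and no claim to match the original algorithm exactly.

```python
def assign_scus(words, scus, similarity, threshold):
    """Toy dynamic program over one summary sentence.

    dp[i] = best total matched SCU weight for the prefix words[:i];
    each span words[j:i] is either skipped or matched to its best SCU
    whose similarity clears the threshold.  Assumes `scus` is non-empty.
    """
    n = len(words)
    dp = [0.0] * (n + 1)
    choice = [None] * (n + 1)              # (j, scu) that produced dp[i]
    for i in range(1, n + 1):
        dp[i], choice[i] = dp[i - 1], (i - 1, None)   # skip word i-1
        for j in range(i):
            span = " ".join(words[j:i])
            best = max(scus, key=lambda s: similarity(span, s))
            if similarity(span, best) >= threshold and dp[j] + best["weight"] > dp[i]:
                dp[i], choice[i] = dp[j] + best["weight"], (j, best)
    # Recover the matched SCUs by walking back through the choices.
    matched, i = [], n
    while i > 0:
        j, scu = choice[i]
        if scu is not None:
            matched.append(scu)
        i = j
    return dp[n], matched
```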

50 We test two new semantic text similarities, a string comparison method and a distributional semantic method, and we present a general mechanism to set a threshold value for an arbitrary computation of text similarity. [sent-82, score-0.073]

51 Unigram overlap ignores word order, and cannot consider the latent semantic content of a string, only the observed unigram tokens. [sent-83, score-0.22]

52 To take the underlying semantics into account, we use cosine similarity of 100-dimensional latent vectors of the candidate substrings and of the textual components of the SCU (label and contributors). [sent-85, score-0.194]

53 Because the algorithm optimizes for the total sum of all SCUs, when there is no threshold similarity to count as a match, it favors matching shorter substrings to SCUs with higher weights. [sent-86, score-0.129]

54 Because each similarity metric has different properties and distributions, a single absolute-value threshold is not comparable across metrics. [sent-88, score-0.068]

55 We present a method to set comparable thresholds across metrics. [sent-89, score-0.103]

56 1 Latent Vector Representations To represent the semantics of SCUs and candidate substrings of target summaries, we applied the latent vector model of Guo and Diab (2012). [sent-91, score-0.184]

57 Guo and Diab find that it is very hard to learn a 100-dimensional latent vector based only on the limited observed words in a short text. [sent-92, score-0.071]

58 Weighted matrix factorization (WMF) assigns a small weight to missing words so that the latent semantics depends largely on the observed words. [sent-95, score-0.105]

59 A 100-dimensional latent vector representation was learned for every span of contiguous words within sentence bounds in a target summary, for each of the 20 summaries. [sent-96, score-0.102]

60 The training data was selected to be domain independent, so that our model could be used for summaries across domains. [sent-97, score-0.332]

61 Similarly, the contributors to and the label for an SCU were given a 100-dimensional latent vector representation. [sent-103, score-0.154]

62 These representations were then used to compare candidates from a summary to SCUs in the pyramid. [sent-104, score-0.087]

63 As in Harnly et al. (2005), we use three similarity comparisons scusim(X, SCU), where X is the target summary string. [sent-113, score-0.159]

64 When the comparison parameter is set to min (max or mean), the similarity of X to each SCU contributor and the label is computed in turn, and the minimum (max, or mean) is returned. [sent-114, score-0.095]
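
A minimal sketch of scusim as described in the surrounding sentences, assuming an `embed` function that maps a string to its latent vector (e.g., the 100-dimensional WMF representation); the dictionary layout of an SCU is illustrative, not the paper's data format.

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def scusim(x, scu, embed, comparison="max"):
    """Similarity of candidate string `x` to an SCU.

    The SCU's textual components are its contributors plus its label;
    min, mean, or max over the per-component cosine similarities is returned.
    """
    vx = embed(x)
    components = scu["contributors"] + [scu["label"]]
    sims = [cosine(vx, embed(c)) for c in components]
    if comparison == "min":
        return min(sims)
    if comparison == "mean":
        return sum(sims) / len(sims)
    return max(sims)
```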

65 3 Similarity Thresholds We define a threshold parameter for a target SCU to match a pyramid SCU based on the distributions of scores each similarity method gives to the target SCUs identified by the human annotator. [sent-116, score-0.598]

66 The similarity score being a continuous random variable, the empirical sample of 204 scores is very sparse. [sent-118, score-0.14]

67 Hence, we use a Gaussian kernel density estimator to provide a non-parametric estimation of the probability densities of scores assigned by each of the similarity methods to the manually identified SCUs. [sent-119, score-0.101]

68 We then select five threshold values corresponding to those for which the inverse cumulative density function (icdf) is equal to 0. [sent-120, score-0.075]
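
A sketch of this thresholding step using SciPy's Gaussian kernel density estimator: fit a KDE to the similarity scores of the manually identified SCUs, then numerically invert its cumulative distribution to read off thresholds at chosen icdf levels. The probability levels in the usage comment are placeholders, since the sentence above is truncated.

```python
import numpy as np
from scipy.stats import gaussian_kde

def icdf_thresholds(scores, probs):
    """Similarity thresholds at the given icdf probabilities.

    `scores` are the similarities a method assigns to the manually
    identified SCUs; `probs` are cumulative-probability levels.
    """
    kde = gaussian_kde(scores)
    grid = np.linspace(min(scores) - 1.0, max(scores) + 1.0, 2000)
    # CDF at each grid point: KDE probability mass below that point.
    cdf = np.array([kde.integrate_box_1d(-np.inf, g) for g in grid])
    return [float(grid[min(np.searchsorted(cdf, p), len(grid) - 1)])
            for p in probs]

# Example (probability levels are illustrative, not the paper's):
# thresholds = icdf_thresholds(annotated_scores, [0.05, 0.1, 0.2, 0.3, 0.4])
```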

69 5 Experiment The three similarity computations, three methods to compare against SCUs, and five icdf thresholds yield 45 variants, as shown in Figure 2. [sent-127, score-0.247]
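
The 45 variants are simply the cross product of the three design choices; a tiny illustration (the five icdf probability levels shown are placeholders, since the truncated text above does not give them):

```python
from itertools import product

similarities = ["Uni", "R/O", "LVc"]            # as named in Figure 2
comparisons = ["min", "mean", "max"]
icdf_levels = [0.05, 0.10, 0.20, 0.30, 0.40]    # placeholder levels

variants = list(product(similarities, comparisons, icdf_levels))
assert len(variants) == 45
```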

70 Each variant was evaluated by comparing the unnormalized automated variant, e. [sent-128, score-0.261]

71 To assess the 45 variants, we compared their scores to the manual scores. [sent-134, score-0.137]

72 By our criterion 1), an automated score that correlates well with manual scores for summaries of a given text could be used [Figure 2: (3 similarities: Uni, R/O, LVc) × (3 comparisons: min, mean, max) × (5 icdf thresholds) = 45 variants]

73 Table 1: Five variants from the top twelve of all correlations, with confidence interval and rank (P=Pearson’s, S=Spearman, K=Kendall’s tau), mean summed SCU weight, difference of mean from mean gold score, and T-test p-value. [sent-215, score-0.299]

74 to indicate how well students rank against other students. [sent-216, score-0.202]

75 Pearson’s correlation tests the strength of a linear relationship between the two sets of scores; it will be high if the same order is produced, with the same distances between pairs of scores. [sent-218, score-0.092]

76 The Spearman rank correlation is said to be preferable for ordinal comparisons, that is, where the unit interval is less relevant. [sent-219, score-0.117]

77 Kendall’s tau, an alternative rank correlation, is less sensitive to outliers and more intuitive. [sent-220, score-0.06]

78 Since correlations can be high when differences are uniform, we use Student’s T to test whether differences in score means are statistically significant. [sent-222, score-0.094]

79 Criterion 2) is met if the correlations are high and the means are not significantly different. [sent-223, score-0.055]
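
Criteria 1 and 2, as evaluated here, map onto standard SciPy routines; a sketch assuming the automated and manual scores are paired per summary (whether the paper used a paired or unpaired t-test is not stated):

```python
from scipy.stats import pearsonr, spearmanr, kendalltau, ttest_rel

def evaluate_variant(auto_scores, manual_scores):
    """Criterion 1: correlation with manual scores.
       Criterion 2: means not significantly different (paired t-test assumed)."""
    p_r, _ = pearsonr(auto_scores, manual_scores)
    s_r, _ = spearmanr(auto_scores, manual_scores)
    k_t, _ = kendalltau(auto_scores, manual_scores)
    _, t_pval = ttest_rel(auto_scores, manual_scores)
    return {"pearson": p_r, "spearman": s_r, "kendall": k_t,
            "t_test_p": t_pval}
```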

80 6 Results The correlation tests indicate that several variants achieve sufficiently high correlations to rank students’ summaries (criterion 2). [sent-224, score-0.55]

81 On all correlation tests, the highest ranking automated method is LVc, max, 0. [sent-225, score-0.231]

82 As shown in Table 1, the Pearson correlation is 0. [sent-228, score-0.057]

83 40 did not rank as highly for Spearman and Kendall’s tau correlations, but the Student’s T result in column 3 of Table 1 shows that this is the only variant in the table that yields absolute scores that are not significantly different from the human-annotated scores. [sent-232, score-0.228]

84 Thus this variant best balances criteria 1 and 2. [sent-233, score-0.113]

85 The differences between the unnormalized scores computed by the automated systems and the scores assigned by human annotation are consistently positive. [sent-234, score-0.312]

86 Inspection of the SCUs retrieved by each automated variant reveals that the automated systems tend to identify false positives. [sent-235, score-0.406]

87 To get a measure of the degree of overlap between the SCUs that were selected automatically versus manually (criterion 3), we computed recall and precision for the various methods. [sent-237, score-0.115]

88 Table 2 shows the mean recall and precision (with standard deviations) across all five thresholds for each combination of similarity method and method of comparison to the SCU. [sent-238, score-0.31]

89 The low standard deviations show that the recall and precision are relatively similar across thresholds for each variant. [sent-239, score-0.194]
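
Criterion 3 reduces to set precision and recall over SCU identities; a minimal sketch:

```python
def scu_precision_recall(auto_scus, gold_scus):
    """Precision and recall of automatically selected SCUs against the
    manually annotated (gold) SCUs, compared by SCU index."""
    auto_ids, gold_ids = set(auto_scus), set(gold_scus)
    tp = len(auto_ids & gold_ids)
    precision = tp / len(auto_ids) if auto_ids else 0.0
    recall = tp / len(gold_ids) if gold_ids else 0.0
    return precision, recall
```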

90 The LVc methods outperform the R/O and unigram overlap methods, particularly for the precision of SCUs retrieved, indicating that distributional semantics is a superior approach to pyramid summary scoring compared with methods based on string matching. [sent-240, score-0.731]

91 The unigram overlap and R/O methods show the least variation across comparison methods (min, mean, max). [sent-241, score-0.119]

92 Meeting all three criteria is difficult, and the LVc method is clearly superior. [sent-243, score-0.055]

93 Improvements resulted from principled thresholds for similarity, and from a vector representation (LVc) to capture the latent semantics of short spans of text (Guo and Diab, 2012). [sent-246, score-0.181]

94 The LVc methods perform best at all three criteria for a pedagogically useful automatic metric. [sent-247, score-0.103]

95 Future work will address how to improve precision and recall of the gold SCUs. [sent-248, score-0.064]

96 Automatic natural language processing and the detection of reading skills and reading comprehension. [sent-310, score-0.266]

97 A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. [sent-363, score-0.071]

98 Annotation of children’s oral narrations: Modeling emergent narrative skills for computational applications. [sent-386, score-0.107]

99 Formal and functional assessment of the Pyramid Method for summary content evaluation. [sent-391, score-0.202]

100 A contextualized curricular supplement for developmental reading and writing. [sent-402, score-0.109]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pyramid', 0.395), ('scus', 0.379), ('scu', 0.35), ('lvc', 0.316), ('summaries', 0.305), ('automated', 0.174), ('perin', 0.143), ('students', 0.142), ('reading', 0.109), ('comprehension', 0.102), ('passonneau', 0.097), ('harnly', 0.095), ('icdf', 0.095), ('summary', 0.087), ('contributors', 0.083), ('nenkova', 0.078), ('graesser', 0.078), ('max', 0.077), ('thresholds', 0.076), ('burstein', 0.073), ('latent', 0.071), ('mean', 0.067), ('guo', 0.065), ('danielle', 0.063), ('scores', 0.06), ('rank', 0.06), ('assessment', 0.058), ('variant', 0.058), ('content', 0.057), ('correlation', 0.057), ('uni', 0.056), ('scoring', 0.055), ('correlations', 0.055), ('criteria', 0.055), ('min', 0.054), ('rebecca', 0.053), ('columbia', 0.052), ('pearson', 0.051), ('overlap', 0.051), ('kendall', 0.05), ('tau', 0.05), ('college', 0.049), ('student', 0.049), ('substrings', 0.048), ('skills', 0.048), ('attended', 0.048), ('pedagogically', 0.048), ('ratcliff', 0.048), ('remedial', 0.048), ('criterion', 0.045), ('essay', 0.043), ('mcnamara', 0.042), ('rouge', 0.042), ('unigram', 0.041), ('similarity', 0.041), ('threshold', 0.04), ('assess', 0.04), ('score', 0.039), ('ani', 0.038), ('variants', 0.038), ('manual', 0.037), ('beck', 0.037), ('std', 0.037), ('wellner', 0.037), ('correlate', 0.036), ('summarization', 0.036), ('tests', 0.035), ('five', 0.035), ('precision', 0.035), ('summarizers', 0.035), ('proficient', 0.035), ('efron', 0.035), ('jill', 0.035), ('lisa', 0.035), ('diab', 0.034), ('semantics', 0.034), ('intervention', 0.033), ('foltz', 0.033), ('distributional', 0.033), ('oral', 0.032), ('arthur', 0.032), ('automate', 0.032), ('annotation', 0.031), ('target', 0.031), ('suite', 0.031), ('florida', 0.031), ('analogous', 0.03), ('ideas', 0.03), ('partition', 0.03), ('louis', 0.029), ('recall', 0.029), ('unnormalized', 0.029), ('weiwei', 0.028), ('units', 0.027), ('across', 0.027), ('relies', 0.027), ('deviations', 0.027), ('narrative', 0.027), ('spearman', 0.027), ('physics', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

Author: Rebecca J. Passonneau ; Emily Chen ; Weiwei Guo ; Dolores Perin

Abstract: The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. This motivates the development of a more accurate automated method to compute pyramid scores. Of three methods tested here, the one that performs best relies on latent semantics.

2 0.21723567 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

Author: Peter A. Rankel ; John M. Conroy ; Hoa Trang Dang ; Ani Nenkova

Abstract: How good are automatic content metrics for news summary evaluation? Here we provide a detailed answer to this question, with a particular focus on assessing the ability of automatic evaluations to identify statistically significant differences present in manual evaluation of content. Using four years of data from the Text Analysis Conference, we analyze the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems. Our experiments show that some of the neglected variants of ROUGE, based on higher order n-grams and syntactic dependencies, are most accurate across the years; the commonly used ROUGE-1 scores find too many significant differences between systems which manual evaluation would deem comparable. We also test combinations of ROUGE variants and find that they considerably improve the accuracy of automatic prediction.

3 0.14611532 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: In automatic summarization, centrality is the notion that a summary should contain the core parts of the source text. Current systems use centrality, along with redundancy avoidance and some sentence compression, to produce mostly extractive summaries. In this paper, we investigate how summarization can advance past this paradigm towards robust abstraction by making greater use of the domain of the source text. We conduct a series of studies comparing human-written model summaries to system summaries at the semantic level of caseframes. We show that model summaries (1) are more abstractive and make use of more sentence aggregation, (2) do not contain as many topical caseframes as system summaries, and (3) cannot be reconstructed solely from the source text, but can be if texts from in-domain documents are added. These results suggest that substantial improvements are unlikely to result from better optimizing centrality-based criteria, but rather more domain knowledge is needed.

4 0.14115976 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization

Author: Lu Wang ; Claire Cardie

Abstract: We address the challenge of generating natural language abstractive summaries for spoken meetings in a domain-independent fashion. We apply Multiple-Sequence Alignment to induce abstract generation templates that can be used for different domains. An Overgenerateand-Rank strategy is utilized to produce and rank candidate abstracts. Experiments using in-domain and out-of-domain training on disparate corpora show that our system uniformly outperforms state-of-the-art supervised extract-based approaches. In addition, human judges rate our system summaries significantly higher than compared systems in fluency and overall quality.

5 0.12091849 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

Author: Lu Wang ; Hema Raghavan ; Vittorio Castelli ; Radu Florian ; Claire Cardie

Abstract: We consider the problem of using sentence compression techniques to facilitate query-focused multi-document summarization. We present a sentence-compression-based framework for the task, and design a series of learning-based compression models built on parse trees. An innovative beam search decoder is proposed to efficiently find highly probable compressions. Under this framework, we show how to integrate various indicative metrics such as linguistic motivation and query relevance into the compression process by deriving a novel formulation of a compression scoring function. Our best model achieves statistically significant improvement over the state-of-the-art systems on several metrics (e.g. 8.0% and 5.4% improvements in ROUGE-2 respectively) for the DUC 2006 and 2007 summarization task.

6 0.10935772 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

7 0.099923015 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

8 0.09751007 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

9 0.087243326 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach

10 0.082853891 250 acl-2013-Models of Translation Competitions

11 0.082142949 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

12 0.074798793 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

13 0.069092467 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

14 0.065725408 246 acl-2013-Modeling Thesis Clarity in Student Essays

15 0.064039864 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

16 0.063910529 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

17 0.058162317 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

18 0.056038644 333 acl-2013-Summarization Through Submodularity and Dispersion

19 0.055857047 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

20 0.055812065 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.135), (1, 0.048), (2, 0.023), (3, -0.101), (4, 0.008), (5, -0.044), (6, 0.084), (7, -0.016), (8, -0.135), (9, -0.044), (10, -0.046), (11, 0.059), (12, -0.103), (13, -0.004), (14, -0.103), (15, 0.14), (16, 0.109), (17, -0.126), (18, -0.019), (19, 0.049), (20, 0.021), (21, -0.079), (22, 0.011), (23, -0.056), (24, -0.018), (25, -0.01), (26, -0.019), (27, 0.014), (28, 0.016), (29, 0.004), (30, -0.07), (31, -0.015), (32, 0.092), (33, -0.073), (34, -0.027), (35, -0.023), (36, -0.06), (37, 0.04), (38, -0.004), (39, -0.004), (40, 0.075), (41, 0.027), (42, -0.052), (43, -0.052), (44, 0.069), (45, 0.012), (46, -0.014), (47, -0.003), (48, 0.054), (49, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92349291 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

Author: Rebecca J. Passonneau ; Emily Chen ; Weiwei Guo ; Dolores Perin

Abstract: The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. This motivates the development of a more accurate automated method to compute pyramid scores. Of three methods tested here, the one that performs best relies on latent semantics.

2 0.90281326 5 acl-2013-A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

Author: Peter A. Rankel ; John M. Conroy ; Hoa Trang Dang ; Ani Nenkova

Abstract: How good are automatic content metrics for news summary evaluation? Here we provide a detailed answer to this question, with a particular focus on assessing the ability of automatic evaluations to identify statistically significant differences present in manual evaluation of content. Using four years of data from the Text Analysis Conference, we analyze the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems. Our experiments show that some of the neglected variants of ROUGE, based on higher order n-grams and syntactic dependencies, are most accurate across the years; the commonly used ROUGE-1 scores find too many significant differences between systems which manual evaluation would deem comparable. We also test combinations of ROUGE variants and find that they considerably improve the accuracy of automatic prediction.

3 0.851282 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: In automatic summarization, centrality is the notion that a summary should contain the core parts of the source text. Current systems use centrality, along with redundancy avoidance and some sentence compression, to produce mostly extractive summaries. In this paper, we investigate how summarization can advance past this paradigm towards robust abstraction by making greater use of the domain of the source text. We conduct a series of studies comparing human-written model summaries to system summaries at the semantic level of caseframes. We show that model summaries (1) are more abstractive and make use of more sentence aggregation, (2) do not contain as many topical caseframes as system summaries, and (3) cannot be reconstructed solely from the source text, but can be if texts from in-domain documents are added. These results suggest that substantial improvements are unlikely to result from better optimizing centrality-based criteria, but rather more domain knowledge is needed.

4 0.72497976 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

Author: Chen Li ; Xian Qian ; Yang Liu

Abstract: In this paper, we propose a bigram based supervised method for extractive document summarization in the integer linear programming (ILP) framework. For each bigram, a regression model is used to estimate its frequency in the reference summary. The regression model uses a variety of indicative features and is trained discriminatively to minimize the distance between the estimated and the ground truth bigram frequency in the reference summary. During testing, the sentence selection problem is formulated as an ILP problem to maximize the bigram gains. We demonstrate that our system consistently outperforms the previous ILP method on different TAC data sets, and performs competitively compared to the best results in the TAC evaluations. We also conducted various analysis to show the impact of bigram selection, weight estimation, and ILP setup.

5 0.71155894 333 acl-2013-Summarization Through Submodularity and Dispersion

Author: Anirban Dasgupta ; Ravi Kumar ; Sujith Ravi

Abstract: We propose a new optimization framework for summarization by generalizing the submodular framework of (Lin and Bilmes, 2011). In our framework the summarization desideratum is expressed as a sum of a submodular function and a nonsubmodular function, which we call dispersion; the latter uses inter-sentence dissimilarities in different ways in order to ensure non-redundancy of the summary. We consider three natural dispersion functions and show that a greedy algorithm can obtain an approximately optimal summary in all three cases. We conduct experiments on two corpora—DUC 2004 and user comments on news articles—and show that the performance of our algorithm outperforms those that rely only on submodularity.

6 0.692514 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization

7 0.6509496 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

8 0.59848619 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization

9 0.5574404 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

10 0.54524845 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

11 0.53716236 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

12 0.51867598 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

13 0.5186668 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

14 0.51611114 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

15 0.49307248 225 acl-2013-Learning to Order Natural Language Texts

16 0.47507185 246 acl-2013-Modeling Thesis Clarity in Student Essays

17 0.46901709 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse

18 0.45186052 375 acl-2013-Using Integer Linear Programming in Concept-to-Text Generation to Produce More Compact Texts

19 0.4460969 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

20 0.43659484 178 acl-2013-HEADY: News headline abstraction through event pattern clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.073), (6, 0.081), (11, 0.052), (14, 0.011), (15, 0.042), (24, 0.038), (26, 0.038), (28, 0.01), (35, 0.079), (42, 0.042), (48, 0.036), (61, 0.247), (70, 0.053), (88, 0.032), (90, 0.05), (95, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78964919 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

Author: Rebecca J. Passonneau ; Emily Chen ; Weiwei Guo ; Dolores Perin

Abstract: The pyramid method for content evaluation of automated summarizers produces scores that are shown to correlate well with manual scores used in educational assessment of students’ summaries. This motivates the development of a more accurate automated method to compute pyramid scores. Of three methods tested here, the one that performs best relies on latent semantics.

2 0.73133504 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

Author: Ryan McDonald ; Joakim Nivre ; Yvonne Quirmbach-Brundage ; Yoav Goldberg ; Dipanjan Das ; Kuzman Ganchev ; Keith Hall ; Slav Petrov ; Hao Zhang ; Oscar Tackstrom ; Claudia Bedini ; Nuria Bertomeu Castello ; Jungmee Lee

Abstract: We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean. To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before. This ‘universal’ treebank is made freely available in order to facilitate research on multilingual dependency parsing.

3 0.70417726 225 acl-2013-Learning to Order Natural Language Texts

Author: Jiwei Tan ; Xiaojun Wan ; Jianguo Xiao

Abstract: Ordering texts is an important task for many NLP applications. Most previous works on summary sentence ordering rely on the contextual information (e.g. adjacent sentences) of each sentence in the source document. In this paper, we investigate a more challenging task of ordering a set of unordered sentences without any contextual information. We introduce a set of features to characterize the order and coherence of natural language texts, and use the learning to rank technique to determine the order of any two sentences. We also propose to use the genetic algorithm to determine the total order of all sentences. Evaluation results on a news corpus show the effectiveness of our proposed method.

4 0.57070702 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

Author: Ulle Endriss ; Raquel Fernandez

Abstract: Crowdsourcing, which offers new ways of cheaply and quickly gathering large amounts of information contributed by volunteers online, has revolutionised the collection of labelled data. Yet, to create annotated linguistic resources from this data, we face the challenge of having to combine the judgements of a potentially large group of annotators. In this paper we investigate how to aggregate individual annotations into a single collective annotation, taking inspiration from the field of social choice theory. We formulate a general formal model for collective annotation and propose several aggregation methods that go beyond the commonly used majority rule. We test some of our methods on data from a crowdsourcing experiment on textual entailment annotation.

5 0.56080925 250 acl-2013-Models of Translation Competitions

Author: Mark Hopkins ; Jonathan May

Abstract: What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).
This vulnerability prompted (Bojar et al., 2011) to offer the following revision: BOJAR(s) =win(sw)in +(s lo)ss(s) The following year, it was BOJAR’s turn to be criticized, this time by (Lopez, 2012): Superficially, this appears to be an improvement....couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors? On the other hand, couldn’t a system be rewarded simply by being compared against a bad system more frequently than its competitors? Lopez’s concern, while reasonable, is less obviously damning than (Bojar et al., 2011)’s criticism of ORIGWMT. It depends on whether the collected set of comparisons is small enough or biased enough to make the variance in competition significant. While this hypothesis is plausible, Lopez makes no attempt to verify it. Instead, he offers a ranking heuristic of his own, based on a Minimum Feedback Arc solver. The proliferation of ranking heuristics continued from there. The WMT 2012 organizers (Callison-Burch et al., 2012) took Lopez’s ranking scheme and provided a variant called Most Proba- ble Ranking. Then, noting some potential pitfalls with that, they created two more, called Monte Carlo Playoffs and Expected Wins. While one could raise philosophical objections about each of these, where would it end? Ultimately, the WMT 2012 findings presented five different rankings for the English-German competition track, with no guidance about which ranking we should pay attention to. How can we know whether one ranking is better than other? Or is this even the right question to ask? 3 A Problem with Rankings Suppose four systems participate in a translation competition. Three of these systems are extremely close in quality. We’ll call these close1, close2, and close3. Nevertheless, close1 is very slightly better3 than close2, and close2 is very slightly better than close3. The fourth system, called terrific, is a really terrific system that far exceeds the other three. Now which is the better ranking? terrific, close3, close1, close2 close1, terrific, close2, close3 (1) (2) Spearman’s rho4 would favor the second ranking, since it is a less disruptive permutation of the gold ranking. But intuition favors the first. While its mistakes are minor, the second ranking makes the hard-to-forgive mistake of placing close1 ahead of the terrific system. The problem is not with Spearman’s rho. The problem is the disconnnect between the knowledge that we want a ranking to reflect and the knowledge that a ranking actually contains. Without this additional knowledge, we cannot determine whether one ranking is better than another, even if we know the gold ranking. We need to determine what information they lack, and define more rigorously what we hope to learn from a translation competition. 4 From Rankings to Relative Ability Ostensibly the purpose of a translation competition is to determine the relative ability of a set of translation systems. Let S be the space of all otrfan trsalnatsiloanti systems. Hereafter, we hwei lslp raecfeer o tfo Sll as nthslea space ostfe smtus.de Hntesr. a Wftee c,h woeos wei ltlh ires teerrm to t So evoke the metaphor of a translation competition as a standardized test, which shares the same goal: to assess the relative abilities of a set of participants. But what exactly do we mean by “ability”? Before formally defining this term, first recognize that it means little without context, namely: 3What does “better” mean? We’ll return to this question. 
4Or Pearson’s correlation coefficient. 1417 1. What kind of source text do we want the systems to translate well? Say system A is great at translating travel-related documents, but terrible at translating newswire. Meanwhile, system B is pretty good at both. The question “which system is better?” requires us to state how much we care about travel versus newswire documents otherwise the question is underspecified. – 2. Who are we trying to impress? While it’s tempting to think that translation quality is a universal notion, the 50-60% interannotator agreement in WMT evaluations (CallisonBurch et al., 2012) suggests otherwise. It’s also easy to imagine reasons why one group of judges might have different priorities than another. Think a Fortune 500 company versus web forum users. Lawyers versus laymen. Non-native versus native speakers. Posteditors versus Google Translate users. Different groups have different uses for translation, and therefore different definitions of what “better” means. With this in mind, let’s define some additional elements of a translation competition. Let X be the space osf o afll a possible segments toitfi source text, J h bee tshpea space lolf p paolls possible judges, fa snodu rΠc = {0, 1, 2} bthee tshpea space ol fp pairwise d pgreesf,e arenndc Πes=. 5 W0,e1 assume all spaces are countable. Unless stated otherwise, variables s1 and s2 represent students from S, variable x represents a segment from X, variaSb,l ev j represents a judge af sroemgm J, ta fnrod mva Xria,b vlea π represents a preference fero fmro mΠ. J Moreover, adbelfein πe the negation ˆπ of preference π such that ˆπ = 2 (if π = 1), ˆπ = 1(if π = 2), and ˆπ = 0 (if π = 0). Now assume a joint distribution P(s1, s2, x, j,π) specifying the probability that we ask judge j to evaluate students s1 and s2’s respective translations of source text x, and that judge j’s preference is π. We will further assume that the choice of student pair, source text, and judge are marginally independent of one another. In other words: P(s1, s2, x, j,π) = P(π|s1, s2, x,j) · P(x|s1, s2, j) = ·P(j|s1,s2) · P(s1,s2) P(π|s1, s2, x, j) · P(x) · P(j) · P(s1, s2) = PX(x) · PJ(j) · P(s1, s2) · P(π|s1, s2, x,j) X(x) 5As a reminder, 0 indicates no preference. It will be useful to reserve notation PX and PJ for the marginal distributions over source text and judges. We can marginalize over the source segments and judges to obtain a useful quantity: P(π|s1, s2) = X XPX(x) · PJ(j) · P(π|s1,s2,x,j) Xx∈X Xj∈J We refer to this as the hPX, PJi-relative ability of Wstued reenftesr s1 hanisd a s2. By using d-rifeflearteinvet marginal distributions PX, we can specify what kinds of source text interest us (for instance, PX could focus most of its probability mass on German tweets). Similarly, by using different marginal distributions PJ, we can specify what judges we want to impress (for instance, PJ could focus all of its mass on one important corporate customer or evenly among all fluent bilingual speakers of a language pair). With this machinery, we can express the purpose of a translation competition more clearly: to estimate the hPX, PJi-relative ability of a set toof eststuidmenattes. Ien h Pthe case orefl WMT, PJ presumably6 defines a space of competent source-totarget bilingual speakers, while PX defines a space of newswire documents. We’ll refer to an estimate of P(π|s1 , s2) as a preference rm toode anl. Istni moattheer o words, a prefer- ence model is a distribution Q(π|s1 , s2). 
Given a cseet moofd pairwise comparisons (e.g., Table 2), the challenge is to estimate a preference model Q(π|s1 , s2) such that Q is “close” to P. For measuring distributional proximity, a natural choice is KL-divergence (Kullback and Leibler, 195 1), but we cannot use it here because P is unknown. Fortunately, ifwe have i.i.d. data drawn from P, then we can do the next best thing and compute the perplexity of preference model Q on this heldout test data. Let D be a sequence of triples hs1, s2, πi wteshter dea tah.e L preferences π are i o.if.d t.r samples fr,oπmi P(π|s1 , s2). The perplexity of preference model Q on stest data D is: perplexity(Q|D) = 2−Phs1,s2,πi∈D |D1|log2Q(π|s1,s2) How do we obtain such a test set from competition data? Recall that a WMT competition produces pairwise comparisons like those in Table 2. 6One could argue that it specifies a space of machine translation specialists, but likely these individuals are thought to be a representative sample of a broader community. 1418 Let C be the set of comparisons hs1, s2, x, j,πi Lobettai Cne bde f trhoem s a t orfan csolamtipoanr competition. ,Cjo,mπipetition data C is not necessarily7 sampled i.i.d. fpreotmiti P(s1, s2, x, j,π) n beeccaeusssaer we may intentionally8 bias data collection towards certain students, judges or source text. Also, because WMT elicits its data in batches (see Table 1), every segment x of source text appears in at least ten comparisons. To create an appropriately-sized test set that closely resembles i.i.d. data, we isolate the subset C0 of comparisons whose source text appears isne ta tC most k comparisons, where k is the smallest positive integer such that |C0| >= 2000. We then cporesaitteiv teh ien tteegste sre stu uDch hfr thomat |CC0: D = {hs1, s2, πi|hs1, s2, x,j, πi ∈ C0} We reserve the remaining comparisons for training preference models. Table 3 shows the resulting dataset sizes for each competition track. Unlike with raw rankings, the claim that one preference model is better than another has testable implications. Given two competing models, we can train them on the same comparisons, and compare their perplexities on the test set. This gives us a quantitative9 answer to the question of which is the better model. We can then publish a system ranking based on the most trustworthy preference model. 5 Baselines Let’s begin then, and create some simple preference models to serve as baselines. 5.1 Uniform The simplest preference model is a uniform distribution over preferences, for any choice of students s1 s2: , Q(π|s1,s2) =31 ∀π ∈ Π This will be our only model that does not require training data, and its perplexity on any test set will be 3 (i.e. equal to number of possible preferences). 5.2 Adjusted Uniform Now suppose we have a set C of comparisons aNvoawilab sluep pfoors training. L aet s Cπ ⊆ fC c odmenpoatreis otnhes subset of comparisons wLiteht preference π, oatned hleet 7In WMT, it certainly is not. 8To collect judge agreement statistics, for instance. 9As opposed to philosophical. C(s1 , s2) denote the subset comparing students s1 aCn(ds s2. Perhaps the simplest thing we can do with the training data is to estimate the probability of ties (i.e. preference 0). We can then distribute the remaining probability mass uniformly among the other two preferences: 6SQim(pπ|lse1B,sa2y)e=sia   n1M−o2d|C Ce0| lsiofthπer=wi0se 6.1 Independent Pairs Another simple model is the direct estimation of each relative ability P(π|s1 , s2) independently. 
In oetahcher words, f aobri eliatych P pair sof students s1 and s2, we estimate a separate preference distribution. The maximum likelihood estimate of each distribution would be: Q(π|s1,s2) =|C|Cπ((ss11,,ss22))|| ++ | CC πˆ(s(2s,2s,1s)1|)| However the maximum likelihood estimate would test poorly, since any zero probability estimates for test set preferences would result in infinite perplexity. To make this model practical, we assume a symmetric Dirichlet prior with strength α for each preference distribution. This gives us the following Bayesian estimate: Q(π|s1,s2) =α3α + + |C |πC((ss11,,ss22))|| + + | |CC πˆ((ss22,,ss11))|| We call this the Independent model. Pairs preference 6.2 Independent Students The Independent Pairs model makes a strong inde- pendence assumption. It assumes that even if we know that student A is much better than student B, and that student B is much better than student C, we can infer nothing about how student A will fare versus student C. Instead of directly estimating the relative ability P(π|s1 , s2) of students s1 and s2, we ctoivueld a binilsittyead P Ptry tso estimate the universal ability P(π|s1) Ps2∈S P(π|s1, s2) · P(s2|s1) of ietaych P i(nπd|sividual sPtud∈enSt s1 πa|nsd the)n try tso reconstruct the relativeP abilities from these estimates. For the same reasons as before, we assume a symmetric Dirichlet prior with strength α for each = 1419 preference distribution, which gives us the following Bayesian estimate: Q(π|s1) =α3α + +PPs2s∈2S∈|SC|πC( s 1 , s 2 ) | + + | CCˆ π( s 2 , s 1 ) | The estimates Q(π|Ps1) do not yet constitute a preference mimoadteesl. QA( dπo|swnside of this approach is that there is no principled way to reconstruct a preference model from the universal ability estimates. We experiment with three ad-hoc reconstructions. The asymmetric reconstruction simply ignores any information we have about student s2: Q(π|s1, s2) = Q(π|s1) The arithmetic and geometric reconstructions compute an arithmetic/geometric average of the two universal abilities: Q(π|s1,s2) Q(π|s1, s2) = Q(π|s1) +2 Q( πˆ|s2) = [Q(π|s1) ∗ Q(ˆ π|s2)]21 We respectively call these the (Asymmetric/Arithmetic/Geometric) Independent Students preference models. Notice the similarities between the universal ability estimates Q(π|s1) and ttwhee eBnO tJhAeR u ranking h aebuilritiysti ecs. iTmhaetsees t Qhr(eπe| smodels are our attempt to render the BOJAR heuristic as preference models. 7 Item-Response Theoretic (IRT) Models Let’s revisit (Lopez, 2012)’s objection to the BO- JAR ranking heuristic: “...couldn’t a system still be penalized simply by being compared to [good systems] more frequently than its competitors?” The official WMT 2012 findings (Callison-Burch et al., 2012) echoes this concern in justifying the exclusion of reference translations from the 2012 competition: [W]orkers have a very clear preference for reference translations, so including them unduly penalized systems that, through (un)luck of the draw, were pitted against the references more often. Presuming the students are paired uniformly at random, this issue diminishes as more comparisons are elicited. But preference elicitation is expensive, so it makes sense to assess the relative ability of the students with as few elicitations as possible. Still, WMT 2012’s decision to eliminate references entirely is a bit of a draconian measure, a treatment of the symptom rather than the (perceived) disease. 
If our models cannot function in the presence of training data variation, then we should change the models, not the data. A model that only works when the students are all about the same level is not one we should rely on. We experiment with a simple model that relaxes some independence assumptions made by previous models, in order to allow training data variation (e.g. who a student has been paired with) to influence the estimation of the student abilities. Figure 1(left) shows plate notation (Koller and Friedman, 2009) for the model’s independence structure. First, each student’s ability distribution is drawn from a common prior distribution. Then a number of translation items are generated. Each item is authored by a student and has a quality drawn from the student’s ability distribution. Then a number of pairwise comparisons are generated. Each comparison has two options, each a translation item. The quality of each item is observed by a judge (possibly noisily) and then the judge states a preference by comparing the two observations. We investigate two parameterizations of this model: Gaussian and categorical. Figure 1(right) shows an example of the Gaussian parameterization. The student ability distributions are Gaussians with a known standard deviation σa, drawn from a zero-mean Gaussian prior with known standard deviation σ0. In the example, we show the ability distributions for students 6 (an aboveaverage student, whose mean is 0.4) and 14 (a poor student, whose mean is -0.6). We also show an item authored by each student. Item 43 has a somewhat low quality of -0.3 (drawn from student 14’s ability distribution), while item 205 is not student 6’s best work (he produces a mean quality of 0.4), but still has a decent quality at 0.2. Comparison 1pits these items against one another. A judge draws noise from a zero-mean Gaussian with known standard deviation σobs, then adds this to the item’s actual quality to get an observed quality. For the first option (item 43), the judge draws a noise of -0.12 to observe a quality of -0.42 (worse than it actually is). For the second option (item 205), the judge draws a noise of 0.15 to observe a quality of 0.35 (better than it actually is). Finally, the judge compares the two observed qualities. If the absolute difference is lower than his decision 1420 Figure 1: Plate notation (left) showing the independence tiated subnetwork structure of the IRT Models. Example instan- (right) for the Gaussian parameterization. Shaded rectangles are hyperparameters. Shaded ellipses are variables observable from a set of comparisons. radius (which here is 0.5), then he states no preference (i.e. a preference of 0). Otherwise he prefers the item with the higher observed quality. The categorical parameterization is similar to the Gaussian parameterization, with the following differences. Item quality is not continuous, but rather a member of the discrete set {1, 2, ..., Λ}. rTahteh srtau d menetm ability tdhiest rdiibsuctrieotens are categorical distributions over {1, 2, ..., Λ}, and the student ability prior sis o a symmetric ,DΛir}ic,h alnetd dw tihthe strength αa. Finally, the observed quality is the item quality λ plus an integer-valued noise ν ∈ {1 − λ, ..., Λ λ}. Noise ν is drawn from a di∈scre {ti1ze −d zero-mean λG}a.u Nssoiisaen wν i sth d srtaawndna frrdo mdev ai daitsiocnre σobs. Specifically, Pr(ν) is proportional to the value of the probability density function of the zero-mean Gaussian N(0, σobs). 
We estimated the model parameters with Gibbs sampling (Geman and Geman, 1984). We found that Gibbs sampling converged quickly and consistently for both parameterizations (we ran 200 iterations with a burn-in of 50). Given the parameter estimates, we obtain a preference model Q(π|s1, s2) through the inference query:

Pr(comp.c′.pref = π | item.i′.author = s1, item.i″.author = s2, comp.c′.opt1 = i′, comp.c′.opt2 = i″)

where c′, i′, and i″ are new comparison and item ids that do not appear in the training data.

We call these models Item-Response Theoretic (IRT) models, to acknowledge their roots in the psychometrics (Thurstone, 1927; Bradley and Terry, 1952; Luce, 1959) and item-response theory (Hambleton, 1991; van der Linden and Hambleton, 1996; Baker, 2001) literature. Item-response theory is the basis of modern testing theory and drives adaptive standardized tests like the Graduate Record Exam (GRE). In particular, the Gaussian parameterization of our IRT models strongly resembles the Thurstone (1927) and Bradley-Terry-Luce (Bradley and Terry, 1952; Luce, 1959) models of paired comparison and the 1PL normal-ogive and Rasch (1960) models of student testing. (These models are not traditionally expressed as graphical models, although doing so is not unprecedented; see Mislevy and Almond, 1997; Mislevy et al., 1999.) From the testing perspective, we can view each comparison as two students simultaneously posing a test question to the other: "Give me a translation of the source text which is better than mine." The students can answer the question correctly, incorrectly, or they can provide a translation of analogous quality. An extra dimension of our models is judge noise, which is not a factor when modeling multiple-choice tests, for which the right answer is not subject to opinion.

8 Experiments

We organized the competition data as described at the end of Section 4. To compare the preference models, we did the following:

• Randomly chose a subset of k comparisons from the training set, for k ∈ {100, 200, 400, 800, 1600, 3200} (if k was greater than the total number of training comparisons, we took the entire set).
• Trained the preference model on these comparisons.
• Evaluated the perplexity of the trained model on the test preferences, as described in Section 4.

For each model and training size, we averaged the perplexities from 5 trials of each competition track. We then plotted average perplexity as a function of training size (a code sketch of this loop is given below). These graphs are shown in Figure 2 (WMT10), Figure 3 (WMT11), and Figure 4 (WMT12). Results for WMT10 exclude the German-English and English-German tracks, since we used these to tune our model hyperparameters. These were set as follows: the Dirichlet strength for each baseline was 1; for IRT-Gaussian, σ0 = 1.0, σobs = 1.0, σa = 0.5, and the decision radius was 0.4; for IRT-Categorical, Λ = 8, σobs = 1.0, αa = 0.5, and the decision radius was 0.

Figure 2: WMT10 model perplexities as a function of the number of comparisons. The perplexity of the uniform preference model is 3.0 for all training sizes.

Figure 3: WMT11 model perplexities.

Figure 4: WMT12 model perplexities.

For WMT10 and WMT11, the best models were the IRT models, with the Gaussian parameterization converging the most rapidly and reaching the lowest perplexity. For WMT12, in which reference translations were excluded from the competition, four models were nearly indistinguishable: the two IRT models and the two averaged Independent Students models. This somewhat validates the organizers' decision to exclude the references, particularly given WMT's use of the BOJAR ranking heuristic (the nucleus of the Independent Students models) for its official rankings.
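The train-and-evaluate loop above can be sketched as follows. This is an illustration, not the authors' code: the exact perplexity definition lives in Section 4 (not reproduced in this excerpt), so the standard exponentiated average negative log-likelihood is used as a stand-in, and `train_fn`, the (s1, s2, π) triple format, and the small probability floor are assumptions of this sketch.

```python
import math
import random

def perplexity(q_model, test_comparisons):
    """Perplexity of a preference model on held-out comparisons.

    `q_model(pi, s1, s2)` is the modeled probability of preference pi when
    s1's item is option 1 and s2's item is option 2; each test comparison
    is an (s1, s2, pi) triple.
    """
    nll = 0.0
    for s1, s2, pi in test_comparisons:
        nll -= math.log(max(q_model(pi, s1, s2), 1e-12))  # guard against log(0)
    return math.exp(nll / len(test_comparisons))

def perplexity_curve(train_comparisons, test_comparisons, train_fn,
                     sizes=(100, 200, 400, 800, 1600, 3200), trials=5):
    """Average test perplexity as a function of training-set size.

    `train_fn(comparisons) -> q_model` can be any of the preference models
    discussed above (Independent Pairs, Independent Students, or IRT).
    """
    curve = {}
    for k in sizes:
        scores = []
        for _ in range(trials):
            if k >= len(train_comparisons):
                subset = list(train_comparisons)  # take the entire training set
            else:
                subset = random.sample(train_comparisons, k)
            scores.append(perplexity(train_fn(subset), test_comparisons))
        curve[k] = sum(scores) / len(scores)
    return curve
```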
The IRT models proved the most robust at handling judge noise. We repeated the WMT10 experiment using the same test sets, but training on the unfiltered crowdsourced comparisons rather than the "expert" comparisons (i.e., those made by machine translation specialists). Figure 5 shows the results. Whereas the crowdsourced noise considerably degraded the Geometric Independent Students model, the IRT models were remarkably robust. IRT-Gaussian in particular came close to replicating the performance of Geometric Independent Students trained on the much cleaner expert data. This is rather impressive, since the crowdsourced judges agree only 46.6% of the time, compared to a 65.8% agreement rate among expert judges (Callison-Burch et al., 2010).

Figure 5: WMT10 model perplexities (crowdsourced versus expert training).

Another nice property of the IRT models is that they explicitly model student ability, so they yield a natural ranking. For training size 1600 of the WMT11 English-Czech track, Figure 6 (left) shows the mean student abilities learned by the IRT-Gaussian model. The error bars show one standard deviation of the ability means (recall that we performed 5 trials, each with a random training subset of size 1600). These results provide further insight into a case analyzed by Lopez (2012), which raised concern about the relative ordering of online-B, cu-bojar, and cu-marecek. According to IRT-Gaussian's analysis of the data, these three students are so close in ability that any ordering is essentially arbitrary. Short of a full ranking, the analysis does suggest four strata. Viewing one of IRT-Gaussian's induced preference models as a heatmap (Figure 6, right), four bands are discernible. First, the reference sentences are clearly the darkest (best). Next come students 2-7, followed by the slightly lighter (weaker) students 8-10, followed by the lightest (weakest) student 11.

Figure 6: English-Czech WMT11 results (average of 5 trainings on 1600 comparisons). Error bars (left) indicate one standard deviation of the estimated ability means. In the heatmap (right), cell (s1, s2) is darker if the preference model Q(π|s1, s2) skews in favor of student s1, and lighter if it skews in favor of student s2.

9 Conclusion

WMT has faced a crisis of confidence lately, with researchers raising (real and conjectured) issues with its analytical methodology. In this paper, we showed how WMT can restore confidence in its conclusions by shifting the focus from rankings to relative ability. Estimates of relative ability (the expected head-to-head performance of system pairs over a probability space of judges and source text) can be empirically compared, granting substance to previously nebulous questions like:

1. Is my analysis better than your analysis? Rather than the current anecdotal approach to comparing competition analyses (e.g. presenting example rankings that seem somehow wrong), we can empirically compare the predictive power of the models on test data.

2. How much of an impact does judge noise have on my conclusions? We showed that judge noise can have a significant impact on the quality of our conclusions, if we use the wrong models.
However, IRT-Gaussian appears to be quite noise-tolerant, giving similar-quality conclusions on both expert and crowdsourced comparisons.

3. How many comparisons should I elicit? Many of our preference models (including IRT-Gaussian and Geometric Independent Students) are close to convergence at around 1000 comparisons. This suggests that we can elicit far fewer comparisons and still derive confident conclusions. This is the first time a concrete answer to this question has been provided.

References

F.B. Baker. 2001. The basics of item response theory. ERIC.

Ondřej Bojar, Miloš Ercegovčević, Martin Popel, and Omar Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O.F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

R.K. Hambleton. 1991. Fundamentals of item response theory, volume 2. Sage Publications, Incorporated.

D. Koller and N. Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT Press.

S. Kullback and R.A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Adam Lopez. 2012. Putting human assessments of machine translation systems in order. In Proceedings of WMT.

R. Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons.

R.J. Mislevy and R.G. Almond. 1997. Graphical models and computerized adaptive testing. UCLA CSE Technical Report 434.

R.J. Mislevy, R.G. Almond, D. Yan, and L.S. Steinberg. 1999. Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 437–446. Morgan Kaufmann Publishers Inc.

G. Rasch. 1960. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests.

Louis L. Thurstone. 1927. A law of comparative judgment. Psychological Review, 34(4):273–286.

W.J. van der Linden and R.K. Hambleton. 1996. Handbook of modern item response theory. Springer.

6 0.56042832 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

7 0.5577057 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

8 0.55476087 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

9 0.55336469 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

10 0.54762459 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

11 0.54713732 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

12 0.54611826 382 acl-2013-Variational Inference for Structured NLP Models

13 0.54525566 246 acl-2013-Modeling Thesis Clarity in Student Essays

14 0.54478222 275 acl-2013-Parsing with Compositional Vector Grammars

15 0.5444665 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

16 0.54407841 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

17 0.54307139 333 acl-2013-Summarization Through Submodularity and Dispersion

18 0.54236805 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

19 0.54231918 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

20 0.54058236 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms