acl acl2010 acl2010-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Emily Pitler ; Annie Louis ; Ani Nenkova
Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.
Reference: text
sentIndex sentText sentNum sentScore
1 To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. [sent-3, score-0.347]
2 Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. [sent-7, score-1.302]
3 Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input. [sent-8, score-0.491]
4 1 Introduction. Efforts for the development of automatic text summarizers have focused almost exclusively on improving content selection capabilities of systems, ignoring the linguistic quality of the system output. [sent-9, score-0.318]
5 Few metrics, however, have been proposed for evaluating linguistic quality.1 http://duc. [sent-11, score-0.340]
6 In this work, we focus on linguistic quality evaluation for automatic systems only. [sent-22, score-0.342]
7 We begin in Section 2 by reviewing the various aspects of linguistic quality that are relevant for machine-produced summaries and currently used in manual evaluations. [sent-25, score-0.686]
8 In Section 3, we introduce and motivate diverse classes of features to capture vocabulary, sentence fluency, and local coherence properties of summaries. [sent-26, score-0.544]
9 Results are presented in Section 6, showing the robustness of each class and its ability to reproduce human rankings of systems and summaries with high accuracy. [sent-31, score-0.546]
10 2 Aspects of linguistic quality. We focus on the five aspects of linguistic quality that were used to evaluate summaries in DUC: grammaticality, non-redundancy, referential clarity, focus, and structure/coherence. [sent-32, score-0.987]
11 For each of the questions, all summaries were manually rated on a scale from 1 to 5, in which 5 is the best. [sent-33, score-0.407]
12 Referential clarity: It should be easy to identify who or what the pronouns and noun phrases in the summary are referring to. [sent-42, score-0.328]
13 The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic. [sent-47, score-0.338]
14 3 Indicators of linguistic quality. Multiple factors influence the linguistic quality of text in general, including word choice, the reference form of entities, and local coherence. [sent-50, score-0.576]
15 In addition, we investigate some models of grammaticality (Chae and Nenkova, 2009) and coherence (Graesser et al., 2004). [sent-54, score-0.437]
16 To extract these, all summaries were parsed by the Stanford parser (Klein and Manning, 2003). [sent-63, score-0.407]
17 We focus on named entities because they appear often in summaries of news documents and are often not known to the reader beforehand. [sent-75, score-0.534]
18 In addition, first mentions of entities in text introduce the entity into the discourse and so must be informative and properly descriptive (Prince, 1981; Fraurud, 1990; Elsner and Charniak, 2008). [sent-76, score-0.396]
19 First mentions to people. Feature exploration on our development set found that under-specified references to people are much more disruptive to a summary than short references to organizations or locations. [sent-79, score-0.345]
20 In this class, we include features that reflect the modification properties of noun phrases (NPs) in the summary that are first mentions to people. [sent-81, score-0.424]
21 Summarization specific. Most summarization systems today are extractive and create summaries using complete sentences from the source documents. [sent-85, score-0.613]
22 A subsequent mention of an entity in a source document which is extracted to be the first mention of the entity in the summary is probably not informative enough. [sent-86, score-0.459]
23 For each type of named entity (PERSON, ORGANIZATION, LOCATION), we separately record the number of instances which appear as first mentions in the summary but correspond to non-first mentions in the source documents. [sent-87, score-0.497]
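As a sketch of how this count can be computed, assume named-entity mentions have already been extracted and linked across the summary and its source documents (the paper does not spell out this preprocessing); the tuple layout, names, and example below are illustrative only.

```python
from collections import Counter

def first_mention_mismatch_counts(summary_mentions):
    """Count, per entity type, entities whose first mention in the
    summary corresponds to a non-first mention in the source.

    summary_mentions: mentions in summary order, each a tuple
    (entity_type, entity_id, source_rank), where source_rank is the
    0-based rank of the corresponding mention in the source documents
    (0 = it was also the entity's first mention in the source).
    """
    counts = Counter()
    seen = set()
    for etype, eid, source_rank in summary_mentions:
        if eid in seen:
            continue  # only first mentions in the summary matter
        seen.add(eid)
        if source_rank > 0:  # first in summary, but non-first in source
            counts[etype] += 1
    return counts

# "e1" opens the summary with what was the source's third mention
# (e.g., bare "Obama"), a likely under-specified first mention.
mentions = [("PERSON", "e1", 2), ("ORGANIZATION", "e2", 0), ("PERSON", "e1", 3)]
print(first_mention_mismatch_counts(mentions))  # Counter({'PERSON': 1})
```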
24 Reference form (NP syntax). Some summaries might not include people and other named entities at all. [sent-89, score-0.537]
25 Local coherence (Continuity). This class of linguistic quality indicators is a combination of factors related to coreference, adjacent sentence similarity, and summary-specific context of surface cohesive devices. [sent-100, score-0.549]
26 Summarization specific. Extractive multi-document summaries often lack appropriate antecedents for pronouns and proper context for the use of discourse connectives. [sent-101, score-0.596]
27 A manual analysis of automatic summaries (Otterbacher et al. [sent-103, score-0.451]
28 , 2002) also revealed that anaphoric references that cannot be resolved and unclear discourse relations constitute more than 30% of all revisions required to manually rewrite summaries into a more coherent form. [sent-104, score-0.558]
29 To identify these potential problems, we adapt the features for surface cohesive devices to indicate whether referring expressions and discourse connectives appear in the summary with the same context as in the input documents. [sent-105, score-0.617]
30 For each of the four types of cohesive devices (demonstratives, pronouns, definite descriptions, and sentence-initial discourse connectives), we compare the previous sentence in the summary with the previous sentence in the input article. [sent-107, score-0.402]
31 Two features are computed for each type of cohesive device: (1) the number of times the preceding sentence in the summary is the same as the preceding sentence in the input and (2) the number of times the preceding sentence in the summary is different from that in the input. [sent-108, score-0.841]
32 Since the previous sentence in the input text often contains the antecedent of pronouns in the current sentence, if the previous sentence from the input is also included in the summary, the pronoun is highly likely to have a proper antecedent. [sent-109, score-0.379]
33 We also compute the proportion of adjacent sentences in the summary that were extracted from the same input document. [sent-110, score-0.375]
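A sketch of these continuity features, assuming each extracted sentence is stored with its source document id and position (trivially available in an extractive system); the keyword-based device detection below is a crude stand-in for the paper's detection of demonstratives, definite descriptions, and connectives, and all names are illustrative. The sketch also computes the same-document adjacency proportion just described.

```python
# Placeholder device detectors; real detection would be richer.
DEVICES = {
    "pronoun": {"he", "she", "it", "they"},
    "connective": {"however", "therefore", "moreover", "but"},
}

def continuity_context_features(summary):
    """summary: list of dicts {"text": str, "doc": str, "idx": int},
    where (doc, idx) locate the extracted sentence in its source."""
    feats = {f"{d}_{k}": 0 for d in DEVICES for k in ("same_prev", "diff_prev")}
    same_doc_adjacent = 0
    for k in range(1, len(summary)):
        cur, prev = summary[k], summary[k - 1]
        # Is the summary's preceding sentence the very sentence that
        # preceded cur in its source document?
        same = prev["doc"] == cur["doc"] and prev["idx"] == cur["idx"] - 1
        if prev["doc"] == cur["doc"]:
            same_doc_adjacent += 1
        first_word = cur["text"].split()[0].strip(",").lower()
        for dev, words in DEVICES.items():
            if first_word in words:
                feats[f"{dev}_same_prev" if same else f"{dev}_diff_prev"] += 1
    if len(summary) > 1:
        feats["same_doc_proportion"] = same_doc_adjacent / (len(summary) - 1)
    return feats

summary = [
    {"text": "The levee failed on Monday.", "doc": "d1", "idx": 4},
    {"text": "However, officials had warned residents.", "doc": "d1", "idx": 5},
    {"text": "They began evacuations at dawn.", "doc": "d2", "idx": 0},
]
print(continuity_context_features(summary))
```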
34 (2007) compare the coreference chains in input documents and in summaries in order to locate potential problems. [sent-112, score-0.524]
35 Our features check the existence of proper antecedents for pronouns in the summary without reference to the text of the input documents. [sent-114, score-0.471]
36 However, the predictions and confidence scores still reflect whether or not possible antecedents exist in previous sentences that match in gender/number, and so may still be useful for coherence evaluation. [sent-119, score-0.439]
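The agreement check behind such features can be sketched as below; the tiny gender/number lexicon is a toy stand-in for the coreference system's actual knowledge, and everything here is an assumption for illustration.

```python
# Toy pronoun agreement table: pronoun -> (gender, number);
# None means the pronoun does not constrain that attribute.
PRONOUN_AGREEMENT = {
    "he": ("masc", "sg"), "she": ("fem", "sg"),
    "it": ("neut", "sg"), "they": (None, "pl"),
}

def has_matching_antecedent(pronoun, candidates):
    """candidates: (gender, number) tuples for NPs in the preceding
    sentences of the summary, e.g. from a parser plus a name lexicon."""
    gender, number = PRONOUN_AGREEMENT[pronoun.lower()]
    return any(n == number and (gender is None or g == gender)
               for g, n in candidates)

print(has_matching_antecedent("she", [("masc", "sg"), ("fem", "sg")]))  # True
print(has_matching_antecedent("they", [("neut", "sg")]))                # False
```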
37 Cosine similarity. We use cosine similarity to compute the overlap of words in adjacent sentences s_i and s_{i+1} as a measure of continuity. [sent-120, score-0.340]
38 Cosine similarity is thus indicative of both continuity and redundancy. [sent-127, score-0.464]
39 Since all of these features are calculated over individual sentences, we use the average value over all the sentences in a summary in our experiments. [sent-133, score-0.331]
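A minimal sketch of the averaged adjacent-sentence cosine feature; the paper does not specify its tokenization or term weighting, so the raw word-count vectors below are an assumption.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(v * b[w] for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def avg_adjacent_cosine(sentences):
    """Mean cosine similarity between word-count vectors of each
    adjacent pair s_i, s_{i+1}, averaged over the whole summary."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    if len(vecs) < 2:
        return 0.0
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sum(sims) / len(sims)

summary = ["The storm hit the coast on Monday.",
           "The storm caused severe flooding in coastal towns.",
           "Officials ordered evacuations."]
print(round(avg_adjacent_cosine(summary), 3))
```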
40 Coh-Metrix (Graesser et al., 2004). The Coh-Metrix tool provides an implementation of 54 features known in the psycholinguistic literature to correlate with the coherence of human-written texts (Graesser et al., 2004). [sent-136, score-0.375]
41 Given the heterogeneity of features in this class, we expect that they will provide reasonable accuracies for all the linguistic quality measures. [sent-143, score-0.483]
42 Word coherence (Soricut and Marcu, 2006). Word co-occurrence patterns across adjacent sentences provide a way of measuring local coherence that is not linguistically informed but which can be easily computed using large amounts of unannotated text (Lapata, 2003; Soricut and Marcu, 2006). [sent-148, score-0.548]
43 Soricut and Marcu (2006) make an analogy to machine translation: two words are likely to be translations of each other if they often appear in parallel sentences; in texts, two words are likely to signal local coherence if they often appear in adjacent sentences. [sent-151, score-0.456]
44 The two features we computed are forward likelihood, the likelihood of observing the words in sentence s_i conditioned on s_{i-1}, and backward likelihood, the likelihood of observing the words in sentence s_i conditioned on sentence s_{i+1}. [sent-152, score-0.361]
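A sketch of these two features in the IBM-Model-1 style that Soricut and Marcu's machine-translation analogy suggests; the co-occurrence table t is assumed to have been estimated from adjacent sentence pairs in a large corpus, and the toy values below are illustrative only.

```python
import math

def model1_log_likelihood(target_words, source_words, t):
    """Model-1-style log-likelihood of target_words given source_words;
    t maps (target, source) word pairs to probabilities, with "NULL"
    standing in for the empty alignment."""
    sources = list(source_words) + ["NULL"]
    ll = 0.0
    for w in target_words:
        p = sum(t.get((w, s), 1e-9) for s in sources) / len(sources)
        ll += math.log(p)
    return ll

def coherence_likelihoods(sentences, t):
    """Average forward (s_i given s_{i-1}) and backward (s_i given
    s_{i+1}) per-sentence log-likelihoods over a summary."""
    toks = [s.lower().split() for s in sentences]
    fwd = [model1_log_likelihood(toks[i], toks[i - 1], t)
           for i in range(1, len(toks))]
    bwd = [model1_log_likelihood(toks[i], toks[i + 1], t)
           for i in range(len(toks) - 1)]
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(fwd), avg(bwd)

# Toy table; a real t would be trained on unannotated text.
t = {("flooding", "storm"): 0.3, ("storm", "flooding"): 0.2}
print(coherence_likelihoods(["The storm hit.", "Flooding followed."], t))
```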
45 The actual entity coherence features are the fraction of each type of these transitions in the entire entity grid for the text. [sent-162, score-0.681]
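A sketch of computing these transition fractions from an entity grid in the Barzilay and Lapata style (roles S = subject, O = object, X = other, "-" = absent); constructing the grid itself (parsing, role extraction) is assumed done upstream.

```python
from collections import Counter
from itertools import product

def transition_fractions(grid, length=2):
    """grid: entity -> list of roles per sentence, e.g.
    {"storm": ["S", "S", "-"]}. Returns the fraction of each
    possible length-2 role transition over the whole grid."""
    counts, total = Counter(), 0
    for roles in grid.values():
        for i in range(len(roles) - length + 1):
            counts[tuple(roles[i:i + length])] += 1
            total += 1
    return {t: counts[t] / total if total else 0.0
            for t in product("SOX-", repeat=length)}

grid = {"storm": ["S", "S", "-"], "coast": ["X", "-", "O"]}
fracs = transition_fractions(grid)
print(fracs[("S", "S")], fracs[("X", "-")])  # 0.25 0.25
```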
46 Entity coherence features are the only ones that have been previously applied with success for predicting summary coherence. [sent-171, score-0.564]
47 These were the most recent years in which the summaries were evaluated according to specific linguistic quality questions. [sent-176, score-0.639]
48 Each input consists of a set of 25 related documents on a topic and the target length of summaries is 250 words. [sent-177, score-0.44]
49 System performance on linguistic quality. Each summary was evaluated according to the five linguistic quality questions introduced in Section 2: grammaticality, non-redundancy, referential clarity, focus, and structure. [sent-184, score-0.873]
50 For each of these questions, all summaries were manually rated on a scale from 1 to 5, in which 5 is the best. [sent-185, score-0.407]
51 Figure 1: Distribution of system scores on the five linguistic quality questions. [sent-191, score-0.333]
52 Some of the linguistic quality ratings are significantly correlated with each other, particularly referential clarity, focus, and structure (Table 1). [sent-202, score-0.445]
53 More importantly, the systems that produce summaries with good content8 are not necessarily the systems producing the most readable summaries. [sent-203, score-0.407]
54 Notice from the first row of Table 1 that none of the system rankings based on these measures of linguistic quality are significantly positively correlated with system rankings of content. [sent-204, score-0.4]
55 The development of automatic linguistic quality measurements will allow researchers to optimize both content and linguistic quality. [sent-205, score-0.379]
56 8 As measured by summary responsiveness ratings on a 1 to 5 scale, without regard to linguistic quality. 5 Experimental setup. We use the summaries from DUC 2006 for training and feature development, while DUC 2007 served as the test set. [sent-206, score-0.888]
57 We use a Ranking SVM (SVMlight; Joachims, 2002) to score summaries using our features. [sent-209, score-0.407]
58 Combining predictions. To combine information from the different feature classes, we train a meta ranker using the predictions from each class as features. [sent-214, score-0.403]
59 We then apply these rankers to the summaries produced for the held-out input. [sent-217, score-0.438]
60 To test on a new summary pair in 2007, we first apply each individual ranker to get its predictions, and then apply the meta ranker. [sent-220, score-0.394]
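The pipeline can be sketched as follows. The paper uses SVMlight's ranking mode and applies base rankers to held-out inputs when producing the meta ranker's training features; this sketch substitutes scikit-learn's LinearSVC trained on pairwise difference vectors and omits the held-out scheme, so it shows the shape of the pipeline rather than the exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, groups):
    """Turn per-summary feature vectors X with quality ratings y into
    pairwise difference examples, pairing only summaries of the same
    input (group), as a ranking SVM does."""
    X = np.asarray(X)
    Xp, yp = [], []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        for a in idx:
            for b in idx:
                if y[a] > y[b]:
                    Xp.append(X[a] - X[b]); yp.append(1)
                    Xp.append(X[b] - X[a]); yp.append(-1)
    return np.array(Xp), np.array(yp)

def train_stacked(Xs_by_class, y, groups, C=1.0):
    """One base ranker per feature class; a meta ranker is then
    trained on the base rankers' real-valued scores."""
    base, meta_feats = [], []
    for X in Xs_by_class:
        X = np.asarray(X)
        Xp, yp = pairwise_transform(X, y, groups)
        clf = LinearSVC(C=C, fit_intercept=False).fit(Xp, yp)
        base.append(clf)
        meta_feats.append(X @ clf.coef_.ravel())  # ranker score per summary
    M = np.column_stack(meta_feats)
    Mp, yp = pairwise_transform(M, y, groups)
    meta = LinearSVC(C=C, fit_intercept=False).fit(Mp, yp)
    return base, meta

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(6, 3)), rng.normal(size=(6, 2))
y, groups = [5, 3, 1, 4, 2, 2], ["in1", "in1", "in1", "in2", "in2", "in2"]
base_rankers, meta_ranker = train_stacked([X1, X2], y, groups)
```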
61 Evaluation of rankings. We examine the predictive power of our features for each of the five linguistic quality questions in two settings. [sent-223, score-0.527]
62 In input-level evaluation, we would like to rank all summaries produced for a single given input. [sent-225, score-0.446]
63 For input-level evaluation, the pairs are formed from summaries of the same input. [sent-226, score-0.407]
64 For system-level evaluation, we treat the real-valued output of the SVM ranker for each summary as the linguistic quality score. [sent-234, score-0.532]
65 The 45 individual scores for summaries produced by a given system are averaged to obtain an overall score for the system. [sent-235, score-0.407]
66 The gold-standard system-level quality rating is equal to the average human ratings for the system’s summaries over the 45 inputs. [sent-238, score-0.596]
67 For both evaluation settings, a random baseline which ranked the summaries in a random order would have an expected pairwise accuracy of 50%. [sent-240, score-0.475]
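A sketch of the pairwise accuracy metric; the paper does not state how tied human ratings are handled, so skipping tied pairs below is an assumption.

```python
def pairwise_accuracy(pred, gold):
    """Fraction of summary pairs with distinct gold ratings whose
    order the predicted scores reproduce; a random ranking scores
    0.5 in expectation."""
    correct = total = 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if gold[i] == gold[j]:
                continue  # tied pairs carry no ranking information
            total += 1
            if (pred[i] - pred[j]) * (gold[i] - gold[j]) > 0:
                correct += 1
    return correct / total if total else 0.0

print(pairwise_accuracy([0.9, 0.2, 0.5], [5, 1, 3]))  # 1.0
```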
68 For each of the linguistic quality questions, the corresponding best class of features gives prediction accuracies around 90%. [sent-244, score-0.553]
69 The state-of-the-art entity coherence features perform well but are not the best for any of the five aspects of linguistic quality. [sent-247, score-0.584]
70 For all four other questions, the best feature set is Continuity, which is a combination of summarization specific features, coreference features and cosine similarity of adjacent sentences. [sent-249, score-0.476]
71 Continuity features outperform entity coherence by 3 to 4% absolute difference on referential quality, focus, and coherence. [sent-250, score-0.614]
72 Accuracies from the language model features are within 1% of entity coherence for these three aspects of summary quality. [sent-251, score-0.731]
73 Coh-Metrix, which has been proposed as a comprehensive characterization of text, does not perform as well as the language model and the entity coherence classes, which contain considerably fewer features related to only one aspect of text. [sent-252, score-0.527]
74 It is apparent from the results that continuity, entity coherence, sentence fluency, and language models are the most powerful classes of features, which should be used in automating evaluation and against which novel predictors of text quality should be compared. [sent-254, score-0.651]
75 For instance, entity coherence and continuity features predict grammaticality with very high accuracy of around 90%, and are surpassed only by the sentence fluency features. [sent-258, score-1.233]
76 These findings warrant further investigation because we would not expect characteristics of local transitions indicative of text structure to have anything to do with sentence grammaticality or fluency. [sent-259, score-0.432]
77 While for system-level predictions the meta ranker was only useful for grammaticality, at the input level it outperforms every individual feature class for each of the five questions, obtaining accuracies around 70%. [sent-264, score-0.415]
78 Word co-occurrence, which obtained good accuracies at the system level, is the least useful class at the input level, with accuracies just above chance in all cases. [sent-274, score-0.341]
79 Components of continuity. The class of features capturing sentence-to-sentence continuity in the summary (Section 3.5). [sent-276, score-1.135]
80 Results obtained after excluding each of the components of continuity are shown in Table 4; each line in the table represents Continuity minus a feature subclass. [sent-279, score-0.391]
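The leave-one-subclass-out protocol behind Table 4 can be sketched as below; the subclass names, column names, and evaluate callable are all hypothetical placeholders for the actual ranker-training-and-scoring loop.

```python
def ablation(feature_subclasses, evaluate):
    """feature_subclasses: subclass name -> list of feature columns;
    evaluate: callable taking a list of columns and returning an
    accuracy (e.g., pairwise accuracy of a ranker trained on them)."""
    full = [c for cols in feature_subclasses.values() for c in cols]
    results = {"all": evaluate(full)}
    for name, cols in feature_subclasses.items():
        kept = [c for c in full if c not in cols]
        results[f"minus_{name}"] = evaluate(kept)
    return results

subclasses = {"coreference": ["coref_conf"], "cosine": ["avg_cos"],
              "summ_specific": ["same_prev_ctx", "same_doc_frac"]}
print(ablation(subclasses, evaluate=lambda cols: len(cols) / 5))
```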
81 Summary specific features, which compare the context of a sentence in the summary with the context in the original document where it appeared, also contribute substantially to the success of the Continuity class in predicting structure and referential clarity. [sent-281, score-0.477]
82 However, the coreference features do not seem to contribute much towards predicting summary linguistic quality. [sent-283, score-0.512]
83 The accuracies of the Continuity class are not affected at all when these coreference features are not included. [sent-284, score-0.366]
84 The summarization specific continuity features reward systems that include the necessary preceding context from the original document. [sent-288, score-0.615]
85 Therefore, there is a tension between strategies for optimizing linguistic quality and for optimizing content, which warrants the development of abstractive methods. [sent-291, score-0.32]
86 As the field moves towards more abstractive summaries, we expect to see differences in both a) summary linguistic quality and b) the features predictive of linguistic aspects. [sent-292, score-0.78]
87 Results on human-written abstracts. Since abstractive summaries would have markedly different properties from extracts, it would be interesting to know how well these sets of features would work for predicting the quality of machine-produced abstracts. [sent-308, score-0.808]
88 In both DUC 2006 and DUC 2007, ten NIST assessors wrote summaries for the various inputs. [sent-311, score-0.451]
89 There are four human-written summaries for each input and these summaries were judged on the same five linguistic quality aspects as the machine-written summaries. [sent-312, score-1.168]
90 We train on the human-written summaries from DUC 2006 and test on the human-written summaries from DUC 2007, using the same set-up as in Section 5. [sent-313, score-0.814]
91 This result is promising, as it shows that similar features for evaluating linguistic quality will be valid for abstractive summaries as well. [sent-317, score-0.819]
92 While for the machines the Continuity feature class is the best predictor of referential clarity, focus, and structure (Table 3), for humans, language models and sentence fluency are best for these three aspects of linguistic quality. [sent-319, score-0.941]
93 A possible explanation for this difference could be that in system-produced extracts, incoherent organization influences human perception of linguistic quality to a great extent, and so local coherence features turned out to be very predictive. [sent-320, score-0.677]
94 But in human summaries, sentences are clearly well organized, and here continuity features appear less useful. [sent-321, score-0.533]
95 Sentence level fluency seems to be more predictive of the linguistic quality of these summaries. [sent-322, score-0.407]
96 7 Conclusion. We have presented an analysis of a wide variety of features for assessing the linguistic quality of summaries. [sent-323, score-0.324]
97 Continuity between adjacent sentences was consistently indicative of the quality of machine generated summaries. [sent-324, score-0.318]
98 Language model and entity coherence features also performed well and should be considered in future endeavors for automatic linguistic quality evaluation. [sent-326, score-0.771]
99 The high prediction accuracies for input-level evaluation and the even higher accuracies for system-level evaluation confirm that questions regarding the linguistic quality of summaries can be answered reasonably using existing computational techniques. [sent-327, score-1.031]
100 Centering: a framework for modelling the local coherence of discourse. [sent-420, score-0.353]
wordName wordTfidf (topN-words)
[('summaries', 0.407), ('continuity', 0.391), ('coherence', 0.283), ('duc', 0.237), ('summary', 0.189), ('grammaticality', 0.154), ('fluency', 0.14), ('quality', 0.129), ('entity', 0.12), ('referential', 0.119), ('accuracies', 0.118), ('ranker', 0.111), ('nenkova', 0.106), ('adjacent', 0.103), ('soricut', 0.103), ('linguistic', 0.103), ('clarity', 0.102), ('summarization', 0.102), ('meta', 0.094), ('features', 0.092), ('cohesive', 0.089), ('abstractive', 0.088), ('coreference', 0.084), ('connectives', 0.084), ('graesser', 0.079), ('mentions', 0.076), ('discourse', 0.074), ('barzilay', 0.074), ('pronouns', 0.072), ('class', 0.072), ('local', 0.07), ('chae', 0.069), ('rankings', 0.067), ('lapata', 0.065), ('elsner', 0.064), ('predictions', 0.063), ('ratings', 0.06), ('questions', 0.059), ('cosine', 0.058), ('devices', 0.056), ('cohesion', 0.056), ('si', 0.055), ('extractive', 0.054), ('entities', 0.054), ('sentence', 0.053), ('pronoun', 0.051), ('sentences', 0.05), ('abstracts', 0.048), ('repetition', 0.047), ('aspects', 0.047), ('classes', 0.046), ('inputs', 0.045), ('marcu', 0.045), ('automatic', 0.044), ('lms', 0.044), ('assessors', 0.044), ('predicting', 0.044), ('antecedents', 0.043), ('coherent', 0.043), ('antecedent', 0.042), ('five', 0.042), ('text', 0.042), ('centering', 0.042), ('nist', 0.041), ('expect', 0.041), ('people', 0.04), ('pairwise', 0.039), ('demonstratives', 0.039), ('haberlandt', 0.039), ('inputlevel', 0.039), ('vsi', 0.039), ('metrics', 0.039), ('prediction', 0.039), ('similarity', 0.037), ('occurring', 0.037), ('focus', 0.037), ('indicative', 0.036), ('named', 0.036), ('transitions', 0.036), ('predictive', 0.035), ('correlated', 0.034), ('revisions', 0.034), ('paice', 0.034), ('phrases', 0.034), ('noun', 0.033), ('input', 0.033), ('descriptions', 0.033), ('aspect', 0.032), ('halliday', 0.031), ('otterbacher', 0.031), ('rankers', 0.031), ('grid', 0.03), ('preceding', 0.03), ('informative', 0.03), ('conroy', 0.029), ('steinberger', 0.029), ('deerwester', 0.029), ('grosz', 0.029), ('nyt', 0.029), ('evaluation', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Author: Emily Pitler ; Annie Louis ; Ani Nenkova
Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.
2 0.28834331 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Author: Xiaojun Wan ; Huiying Li ; Jianguo Xiao
Abstract: Cross-language document summarization is a task of producing a summary in one language for a document set in a different language. Existing methods simply use machine translation for document translation or summary translation. However, current machine translation services are far from satisfactory, with the result that the quality of the cross-language summary is usually very poor, both in readability and content. In this paper, we propose to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process. First, the translation quality of each English sentence in the document set is predicted with the SVM regression method, and then the quality score of each sentence is incorporated into the summarization process. Finally, the English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary. Experimental results demonstrate the effectiveness and usefulness of the proposed approach.
3 0.24952902 264 acl-2010-Wrapping up a Summary: From Representation to Generation
Author: Josef Steinberger ; Marco Turchi ; Mijail Kabadjov ; Ralf Steinberger ; Nello Cristianini
Abstract: The main focus of this work is to investigate robust ways for generating summaries from summary representations without recurring to simple sentence extraction and aiming at more human-like summaries. This is motivated by empirical evidence from TAC 2009 data showing that human summaries contain on average more and shorter sentences than the system summaries. We report encouraging preliminary results comparable to those attained by participating systems at TAC 2009.
4 0.2050204 124 acl-2010-Generating Image Descriptions Using Dependency Relational Patterns
Author: Ahmet Aker ; Robert Gaizauskas
Abstract: This paper presents a novel approach to automatic captioning of geo-tagged images by summarizing multiple web documents that contain information related to an image’s location. The summarizer is biased by dependency pattern models towards sentences which contain features typically provided for different scene types such as those of churches, bridges, etc. Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. Summaries generated using dependency patterns also lead to more readable summaries than those generated without dependency patterns.
5 0.1893803 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
Author: Jackie Chi Kit Cheung ; Gerald Penn
Abstract: One goal of natural language generation is to produce coherent text that presents information in a logical order. In this paper, we show that topological fields, which model high-level clausal structure, are an important component of local coherence in German. First, we show in a sentence ordering experiment that topological field information improves the entity grid model of Barzilay and Lapata (2008) more than grammatical role and simple clausal order information do, particularly when manual annotations of this information are not available. Then, we incorporate the model enhanced with topological fields into a natural language generation system that generates constituent orders for German text, and show that the added coherence component improves performance slightly, though not statistically significantly.
6 0.18828186 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
7 0.18244798 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
8 0.17475234 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining
9 0.17301978 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization
10 0.1654475 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
11 0.15827371 39 acl-2010-Automatic Generation of Story Highlights
12 0.13749973 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years
13 0.1261656 233 acl-2010-The Same-Head Heuristic for Coreference
14 0.11866649 171 acl-2010-Metadata-Aware Measures for Answer Summarization in Community Question Answering
15 0.11627862 122 acl-2010-Generating Fine-Grained Reviews of Songs from Album Reviews
16 0.11506322 28 acl-2010-An Entity-Level Approach to Information Extraction
17 0.11048301 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information
18 0.094655789 33 acl-2010-Assessing the Role of Discourse References in Entailment Inference
19 0.093390383 149 acl-2010-Incorporating Extra-Linguistic Information into Reference Resolution in Collaborative Task Dialogue
20 0.091566004 59 acl-2010-Cognitively Plausible Models of Human Language Processing
topicId topicWeight
[(0, -0.256), (1, 0.098), (2, -0.13), (3, -0.116), (4, -0.076), (5, 0.126), (6, 0.001), (7, -0.427), (8, -0.02), (9, 0.013), (10, 0.056), (11, -0.063), (12, -0.087), (13, 0.044), (14, -0.0), (15, -0.02), (16, 0.04), (17, 0.016), (18, 0.037), (19, 0.03), (20, 0.074), (21, 0.069), (22, -0.02), (23, 0.063), (24, -0.007), (25, 0.057), (26, -0.042), (27, 0.009), (28, 0.048), (29, 0.065), (30, 0.024), (31, -0.009), (32, 0.0), (33, -0.046), (34, 0.047), (35, 0.0), (36, -0.03), (37, 0.006), (38, 0.006), (39, -0.016), (40, 0.042), (41, 0.048), (42, -0.106), (43, -0.002), (44, -0.037), (45, 0.014), (46, -0.012), (47, -0.022), (48, 0.033), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.94781852 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Author: Emily Pitler ; Annie Louis ; Ani Nenkova
Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.
2 0.82844198 264 acl-2010-Wrapping up a Summary: From Representation to Generation
Author: Josef Steinberger ; Marco Turchi ; Mijail Kabadjov ; Ralf Steinberger ; Nello Cristianini
Abstract: The main focus of this work is to investigate robust ways for generating summaries from summary representations without recurring to simple sentence extraction and aiming at more human-like summaries. This is motivated by empirical evidence from TAC 2009 data showing that human summaries contain on average more and shorter sentences than the system summaries. We report encouraging preliminary results comparable to those attained by participating systems at TAC 2009.
3 0.77847719 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
Author: Shih-Hsiang Lin ; Berlin Chen
Abstract: In this paper, we formulate extractive summarization as a risk minimization problem and propose a unified probabilistic framework that naturally combines supervised and unsupervised summarization models to inherit their individual merits as well as to overcome their inherent limitations. In addition, the introduction of various loss functions also provides the summarization framework with a flexible but systematic way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively. Experiments on speech summarization show that the methods deduced from our framework are very competitive with existing summarization approaches.
4 0.73444456 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Scoring sentences in documents given abstract summaries created by humans is important in extractive multi-document summarization. In this paper, we formulate extractive summarization as a two-step learning problem building a generative model for pattern discovery and a regression model for inference. We calculate scores for sentences in document clusters based on their latent characteristics using a hierarchical topic model. Then, using these scores, we train a regression model based on the lexical and structural characteristics of the sentences, and use the model to score sentences of new documents to form a summary. Our system advances current state-of-the-art, improving ROUGE scores by ∼7%. Generated summaries are less redundant and more coherent based upon manual quality evaluations.
5 0.73413062 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Author: Xiaojun Wan ; Huiying Li ; Jianguo Xiao
Abstract: Cross-language document summarization is a task of producing a summary in one language for a document set in a different language. Existing methods simply use machine translation for document translation or summary translation. However, current machine translation services are far from satisfactory, with the result that the quality of the cross-language summary is usually very poor, both in readability and content. In this paper, we propose to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process. First, the translation quality of each English sentence in the document set is predicted with the SVM regression method, and then the quality score of each sentence is incorporated into the summarization process. Finally, the English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary. Experimental results demonstrate the effectiveness and usefulness of the proposed approach.
6 0.71977317 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
7 0.71328962 124 acl-2010-Generating Image Descriptions Using Dependency Relational Patterns
8 0.70638609 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining
9 0.68490958 39 acl-2010-Automatic Generation of Story Highlights
10 0.65369689 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization
11 0.6439032 122 acl-2010-Generating Fine-Grained Reviews of Songs from Album Reviews
12 0.6396054 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
13 0.58863634 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.
14 0.53189093 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images
15 0.51089036 196 acl-2010-Plot Induction and Evolutionary Search for Story Generation
16 0.45840541 28 acl-2010-An Entity-Level Approach to Information Extraction
17 0.43141434 157 acl-2010-Last but Definitely Not Least: On the Role of the Last Sentence in Automatic Polarity-Classification
18 0.43075842 171 acl-2010-Metadata-Aware Measures for Answer Summarization in Community Question Answering
19 0.40864334 149 acl-2010-Incorporating Extra-Linguistic Information into Reference Resolution in Collaborative Task Dialogue
20 0.40199539 229 acl-2010-The Influence of Discourse on Syntax: A Psycholinguistic Model of Sentence Processing
topicId topicWeight
[(4, 0.01), (25, 0.037), (42, 0.019), (44, 0.01), (59, 0.064), (73, 0.035), (76, 0.01), (78, 0.024), (83, 0.522), (84, 0.032), (98, 0.12)]
simIndex simValue paperId paperTitle
1 0.99308473 256 acl-2010-Vocabulary Choice as an Indicator of Perspective
Author: Beata Beigman Klebanov ; Eyal Beigman ; Daniel Diermeier
Abstract: We establish the following characteristics of the task of perspective classification: (a) using term frequencies in a document does not improve classification achieved with absence/presence features; (b) for datasets allowing the relevant comparisons, a small number of top features is found to be as effective as the full feature set and indispensable for the best achieved performance, testifying to the existence of perspective-specific keywords. We relate our findings to research on word frequency distributions and to discourse analytic studies of perspective.
2 0.98736888 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data
Author: Katrin Tomanek ; Udo Hahn ; Steffen Lohmann ; Jurgen Ziegler
Abstract: We report on an experiment to track complex decision points in linguistic meta-data annotation where the decision behavior of annotators is observed with an eye-tracking device. As experimental conditions we investigate different forms of textual context and linguistic complexity classes relative to syntax and semantics. Our data renders evidence that annotation performance depends on the semantic and syntactic complexity of the decision points and, more interestingly, indicates that full-scale context is mostly negligible, with the exception of semantic high-complexity cases. We then induce from this observational data a cognitively grounded cost model of linguistic meta-data annotations and compare it with existing non-cognitive models. Our data reveals that the cognitively founded model explains annotation costs (expressed in annotation time) more adequately than non-cognitive ones.
3 0.98586208 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information
Author: Marta Recasens ; Eduard Hovy
Abstract: This paper explores the effect that different corpus configurations have on the performance of a coreference resolution system, as measured by MUC, B3, and CEAF. By varying separately three parameters (language, annotation scheme, and preprocessing information) and applying the same coreference resolution system, the strong bonds between system and corpus are demonstrated. The experiments reveal problems in coreference resolution evaluation relating to task definition, coding schemes, and features. They also expose systematic biases in the coreference evaluation metrics. We show that system comparison is only possible when corpus parameters are in exact agreement.
same-paper 4 0.97821271 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Author: Emily Pitler ; Annie Louis ; Ani Nenkova
Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.
Author: Jenny Rose Finkel ; Christopher D. Manning
Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.
6 0.88798207 31 acl-2010-Annotation
7 0.8780241 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."
8 0.86777419 73 acl-2010-Coreference Resolution with Reconcile
9 0.79671276 81 acl-2010-Decision Detection Using Hierarchical Graphical Models
10 0.79019421 33 acl-2010-Assessing the Role of Discourse References in Entailment Inference
11 0.78654468 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years
12 0.78130609 112 acl-2010-Extracting Social Networks from Literary Fiction
13 0.77786005 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data
14 0.77212965 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields
15 0.76638389 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information
16 0.76617426 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes
17 0.76138198 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People
18 0.7590915 122 acl-2010-Generating Fine-Grained Reviews of Songs from Album Reviews
19 0.75672883 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese
20 0.75474268 197 acl-2010-Practical Very Large Scale CRFs