acl acl2011 acl2011-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samuel Brody ; Paul Kantor
Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. [sent-2, score-0.298]
2 For many real-world applications, deeper notions of quality are needed. [sent-3, score-0.16]
3 This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. [sent-4, score-0.416]
4 We present an automated system for ranking intelligence reports with regard to coverage of relevant material. [sent-5, score-0.877]
5 The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources. [sent-6, score-0.097]
6 1 Introduction Distinguishing between high- and low-quality documents is an important skill for humans, and a challenging task for machines. [sent-7, score-0.081]
7 The majority of previous research on the subject has focused on low-level measures of quality, such as spelling, vocabulary and grammar. [sent-8, score-0.045]
8 However, in many real-world situations, it is necessary to employ deeper criteria, which look at the content of the document and the structure of argumentation. [sent-9, score-0.233]
9 One example where such criteria are essential is decision-making in the intelligence community. [sent-10, score-0.217]
10 This is also a domain where computational methods can play an important role. [sent-11, score-0.091]
11 In a typical situation, an intelligence officer faced with an important decision receives reports from a team of analysts on a specific topic of interest. [sent-12, score-0.845]
12 Each decision may involve several areas of interest, resulting in several collections of reports. [sent-13, score-0.075]
13 Additionally, the officer may be engaged in many decision processes within a small window of time. [sent-16, score-0.157]
15 Our project aims to provide a system that will assist intelligence officers in the decision making process by quickly and accurately ranking reports according to the most important criteria for the task. [sent-20, score-0.772]
16 Coverage is perhaps the most important element in a time-sensitive scenario, where an intelligence officer may need to choose among several reports while ensuring no relevant and important topics are overlooked. [sent-23, score-0.728]
17 2 Related Work Much of the work on automatic assessment of document quality has focused on student essays (e. [sent-24, score-0.281]
18 2004) , for the purpose of grading or assisting the writers (e. [sent-27, score-0.17]
19 This research looks primarily at issues of grammar, lexical selection, etc. [sent-30, score-0.04]
20 For the purpose of judging the quality of intelligence reports, these aspects are relatively peripheral, and relevant mostly through their effect on the overall readability of the document. [sent-31, score-0.352]
21 The criteria judged most important for determining the quality of an intelligence report (see Sec. [sent-32, score-0.464]
22 1) are more complex and deal with a deeper level of representation. [sent-34, score-0.06]
23 In this work, we chose to start with criteria related to content choice. [sent-35, score-0.166]
24 We propose that the most closely related prior research is that on automatic summarization, specifically multi-document extractive summarization. [sent-38, score-0.05]
25 Extractive summarization works along the following lines (Goldstein et al. [sent-39, score-0.184]
26 , 2000) : (1) analyze the input document(s) for important themes; (2) select the best sentences to include in the summary, taking into account the summarization aspects (coverage, relevance, redundancy) and generation aspects (grammaticality, sentence flow, etc. [sent-40, score-0.345]
27 Since we are interested in content choice, we focus on the summarization aspects, starting with coverage. [sent-42, score-0.241]
28 Effective ways of representing content and ensuring coverage are the subject of ongoing research in the field (e. [sent-43, score-0.339]
29 However, they must be adapted to our task of quality assessment and must take into account the specific characteristics of our domain of intelligence reports. [sent-48, score-0.376]
30 1 The ARDA Challenge Workshop Given the nature of our domain, real-world data and gold standard evaluations are difficult to obtain. [sent-53, score-0.089]
31 We were fortunate to gain access to the reports and evaluations from the ARDA workshop (Morse et al. [sent-54, score-0.323]
32 The workshop was designed to demonstrate the feasibility of assessing the effectiveness of information retrieval systems. [sent-56, score-0.135]
33 During the workshop, seven intelligence analysts were each asked to use one of several IR systems to obtain information about eight different scenarios and write a report about each. [sent-57, score-0.616]
34 The same seven analysts were then asked to judge each of the 56 reports (including their own) on several criteria on a scale of 0 (worst) to 5 (best) . [sent-59, score-0.853]
35 These criteria, listed in Table 1, were chosen by the researchers as desirable in a “high-quality” intelligence report. [sent-60, score-0.108]
36 From an NLP perspective they can be divided into three broad categories: content selection, structure, and readability. [sent-61, score-0.057]
37 The written reports, along with their associated human quality judgments, form the dataset used in our experiments. [sent-62, score-0.16]
38 When assessing coverage, it is only meaningful to compare reports on the same scenario. [sent-65, score-0.323]
39 Therefore, we regard our dataset as 8 collections (Scenario A to Scenario H) , each containing 7 reports. [sent-66, score-0.095]
40 1 Methodology In the ARDA workshop, the analysts were tasked to extract and present the information which was relevant to the query subject. [sent-68, score-0.263]
41 In fact, a high quality report shares many of the characteristics of a good document summary. [sent-70, score-0.23]
42 In particular, it seeks to cover as much of the impor- tant information as possible, while avoiding redundancy and irrelevant information. [sent-71, score-0.042]
43 When seeking to assess these qualities, we can treat the analysts’ reports as output from (human) summarization systems, and employ methods from automatic summarization to evaluate how well they did. [sent-72, score-0.792]
44 This limitation is inherent to the domain, and will necessarily impact the assessment of coverage, since we have no means of determining whether an analyst has included all the relevant information to which she, in particular, had access. [sent-74, score-0.243]
45 We can only assess coverage with respect to what was included in the other analysts’ reports. [sent-75, score-0.235]
46 For our task, however, this is sufficient, since our purpose is to identify, for the person who must choose among them, the report which is most comprehensive in its coverage, or indicate a subset of reports which cover all topics discussed in the collection as a whole1 . [sent-76, score-0.464]
47 1The absence of the sources also means the system is only able to compare reports on the same subject, as opposed to humans, who might rank the coverage quality of two reports on completely different subjects, based on external knowledge. As a first step in modeling relevant concepts we employ a word-gram representation, and use frequency as a measure of relevance. [sent-77, score-0.888]
48 Examination of high-quality human summaries has shown that frequency is an important factor (Nenkova et al. [sent-78, score-0.101]
49 , 2006) , and word-gram representations are employed in many summarization systems (e. [sent-79, score-0.184]
50 Following Gillick and Favre (2009) , we use a bigram representation of concepts2 . [sent-83, score-0.056]
51 For each document collection D, we calculate the average prevalence of every bigram concept in the collection: prev_D(c) = (1/|D|) Σ_r Count_r(c) (1) where r labels a report in the collection, and Count_r(c) is the number of times the concept c appears in report r. [sent-84, score-0.382]
52 This scoring function gives higher weight to concepts which many reports mentioned many times. [sent-85, score-0.464]
53 These are, presumably, the terms considered important to the subject of interest. [sent-86, score-0.086]
54 We ignore concepts (bigrams) composed entirely of stop words. [sent-87, score-0.141]
55 To model the coverage of a report, we calculate a weighted sum of the concepts it mentions (multiple mentions do not increase this score) , using the prevalence score as the weight, as shown in Equation 2. [sent-88, score-0.42]
56 CoverScore(r ∈ D) = Σ_{c ∈ Concepts(r)} prev_D(c) (2) Here, Concepts(r) is the set of concepts appearing at least once in report r. [sent-89, score-0.208]
57 The system produces a ranking of the reports in order of their coverage score (where highest is considered best) . [sent-90, score-0.622]
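For readers who want the scoring made concrete, here is a minimal sketch of Equations 1 and 2 and the resulting ranking, assuming reports are already tokenized into word lists. The tokenization, the stop-word list, and all function names are illustrative choices of ours, not part of the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Illustrative stop-word list; the paper ignores bigrams composed entirely
# of stop words but does not publish the exact list it used.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
              "was", "were", "by", "for", "on", "with", "that", "this"}


def bigram_concepts(tokens: List[str]) -> Counter:
    """Count bigram concepts in a tokenized report, skipping bigrams
    composed entirely of stop words."""
    counts: Counter = Counter()
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1.lower() in STOP_WORDS and w2.lower() in STOP_WORDS:
            continue
        counts[(w1.lower(), w2.lower())] += 1
    return counts


def prevalence(collection: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Equation 1: prev_D(c) = (1/|D|) * sum over reports r of Count_r(c)."""
    totals: Counter = Counter()
    for report in collection:
        totals.update(bigram_concepts(report))
    return {c: n / len(collection) for c, n in totals.items()}


def coverage_score(report: List[str], prev: Dict[Tuple[str, str], float]) -> float:
    """Equation 2: sum of prevalence weights over the set of concepts that
    appear at least once in the report; repeated mentions add nothing."""
    return sum(prev.get(c, 0.0) for c in set(bigram_concepts(report)))


def rank_by_coverage(collection: List[List[str]]) -> List[int]:
    """Return report indices ordered from highest to lowest coverage score."""
    prev = prevalence(collection)
    scores = [coverage_score(r, prev) for r in collection]
    return sorted(range(len(collection)), key=lambda i: scores[i], reverse=True)
```

Using the set of concepts in the coverage sum (rather than their counts) mirrors the statement above that multiple mentions do not increase a report's score.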
58 2 Evaluation As a gold standard, we use the average of the scores given to each report by the human judges. [sent-92, score-0.539]
59 2We also experimented with unigram and trigram representations, which did not do as well as the bigram representation (as suggested by Gillick and Favre 2009) . [sent-94, score-0.056]
60 Since we are interested in ranking reports by coverage, we convert the scores from the original numerical scale to a ranked list. [sent-96, score-0.435]
61 We evaluate the performance of the algorithms (and of the individual judges) using Kendall’s Tau to measure concordance with the gold standard. [sent-97, score-0.201]
62 , Jijkoun and Hofmann 2009) to compare rankings, and looks at the number of pairs of ranked items that agree or disagree with the ordering in the gold standard. [sent-100, score-0.129]
63 Let T = {(a_i, a_j) : a_i ≺_g a_j} denote the set of pairs ordered in the gold standard (a_i precedes a_j). [sent-101, score-0.155]
64 Let R = {(a_l, a_m) : a_l ≺_r a_m} denote the set of pairs ordered by a ranking algorithm. [sent-102, score-0.112]
65 C = T ∩ R is the set of concordant pairs, i.e. [sent-103, score-0.078]
66 pairs ordered the same way in the gold standard and in the ranking, and D = T \ R is the set of discordant pairs. [sent-105, score-0.089]
67 Kendall's rank correlation coefficient τ_k is defined as follows: τ_k = (|C| − |D|) / |T| (3) The value of τ_k ranges from -1 (reversed ranking) to 1 (perfect agreement), with 0 being equivalent to a random ranking (50% agreement). [sent-106, score-0.177]
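A small sketch of Equation 3, computed directly from two score lists. How ties in the predicted scores are handled is not stated in the text above, so treating them as neither concordant nor discordant is an assumption, as are the toy numbers in the usage example.

```python
from itertools import combinations
from typing import Sequence


def kendalls_tau(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Equation 3: tau_k = (|C| - |D|) / |T|.

    T is the set of pairs ordered by the gold standard, C the pairs the
    predicted scores order the same way, D the pairs they reverse."""
    concordant = discordant = total = 0
    for i, j in combinations(range(len(gold)), 2):
        g = gold[i] - gold[j]
        if g == 0:            # pair not ordered by the gold standard: not in T
            continue
        total += 1
        p = predicted[i] - predicted[j]
        if g * p > 0:
            concordant += 1
        elif g * p < 0:
            discordant += 1
        # a tie in the predicted scores counts as neither (an assumption)
    return (concordant - discordant) / total if total else 0.0


# Toy usage: invented judge averages vs. invented system coverage scores
# for a 7-report scenario.
gold_scores = [4.2, 3.8, 4.5, 2.9, 3.1, 4.0, 3.5]
system_scores = [0.61, 0.55, 0.70, 0.32, 0.40, 0.58, 0.47]
print(kendalls_tau(gold_scores, system_scores))   # prints 1.0 for this toy data
```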
68 As a simple baseline system, we rank the reports according to their length in words, which asserts that a longer document has “more coverage” . [sent-107, score-0.425]
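The length baseline amounts to a one-liner; a sketch follows, reusing the hypothetical tokenized-report representation from the earlier snippet.

```python
from typing import List


def rank_by_length(collection: List[List[str]]) -> List[int]:
    """Baseline: order report indices by word count, longest first."""
    return sorted(range(len(collection)),
                  key=lambda i: len(collection[i]), reverse=True)
```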
69 For comparison, we also examine agreement between individual human judges and the gold standard. [sent-108, score-0.593]
70 In each scenario, we calculate the average agreement (Tau value) between an individual judge and the gold standard, and also look at the highest and lowest Tau value from among the individual judges. [sent-109, score-0.556]
71 3 Results Figure 1 presents the results of our ranking experiments on each of the eight scenarios. [sent-111, score-0.194]
72 Human Performance There is a relatively wide range of performance among the human judges. 3Since the judges in the NIST experiment were also the writers of the documents, and the workshop report (Morse et al. [sent-112, score-0.45]
73 , 2004) identified a bias of the individual judges when evaluating their own reports, we did not include the score given by the report’s author in this average. [sent-113, score-0.364]
74 I.e., the gold standard score was the average of the scores given by the 6 judges who were not the author. [sent-115, score-0.341]
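A sketch of the gold-standard construction described above: each report's score is averaged over the judges who were not its author, and the numerical scores are then converted to a ranked list (as mentioned earlier in the evaluation description). The dictionary-based data structures (report id -> judge id -> score, report id -> author id) are hypothetical; the workshop data is not distributed in this form.

```python
from typing import Dict, List


def gold_standard_scores(judge_scores: Dict[str, Dict[str, float]],
                         authors: Dict[str, str]) -> Dict[str, float]:
    """Average each report's 0-5 scores over the judges who did not write it."""
    gold = {}
    for report, scores in judge_scores.items():
        others = [s for judge, s in scores.items() if judge != authors[report]]
        gold[report] = sum(others) / len(others)
    return gold


def to_ranking(scores: Dict[str, float]) -> List[str]:
    """Convert numerical scores into a ranked list of report ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)
```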
75 Scores for the individual human judges (Judges) are given as a range from lowest to highest individual agreement score, with ‘x’ indicating the average. [sent-117, score-0.616]
76 This is indicative of the cognitive complexity of the notion of coverage. [sent-119, score-0.038]
77 We can see that some human judges are better than others at assessing this quality (as represented by the gold standard) . [sent-120, score-0.636]
78 It is interesting to note that there was not a single individual judge who was worst or best across all cases. [sent-121, score-0.319]
79 A system that outperforms some individual human judge on this task can be considered successful, and one that surpasses the average individual agreement even more so. [sent-122, score-0.489]
80 The number of words in a document is significantly correlated with its gold-standard coverage rank. [sent-124, score-0.25]
81 This simple baseline is surprisingly effective, outperforming the worst human judge in seven out of eight scenarios, and doing better than the average individual in two of them. [sent-125, score-0.588]
82 System Performance Our concept-based ranking system exhibits very strong performance4. [sent-126, score-0.112]
83 It outperforms the worst individual human judge in seven of the eight cases, and does better than the average individual agreement in four. [sent-128, score-0.733]
84 sources of information available to the writers (and judges) of the reports. [sent-130, score-0.071]
85 When calculating the overall agreement with the gold-standard over all the scenarios, our concept-based system came in second, outperforming all but one of the human judges. [sent-131, score-0.187]
86 The word-count baseline was in the last place, close behind a human judge. [sent-132, score-0.06]
87 A unigram-based system (which was our first attempt at modeling concepts) tied for third place with two human judges. [sent-133, score-0.06]
88 4 Discussion and Future Work We have presented a system for assessing the relative quality of intelligence reports with regard to their coverage. [sent-135, score-0.725]
89 Our method makes use of ideas from the summarization literature designed to capture the notion of content units and relevance. [sent-136, score-0.279]
90 Our system is as accurate as individual human judges for this concept. [sent-137, score-0.424]
91 The bigram representation we employ is only a rough approximation of actual concepts or themes. [sent-138, score-0.25]
92 We are in the process of obtaining more documents in the domain, which will allow the use of more complex models and more sophisticated representations. [sent-139, score-0.04]
93 This work represents a first step in the complex task of assessing the quality of intelligence reports. [sent-143, score-0.343]
94 In this paper we focused on coverage, perhaps the most important aspect in determining which single report to read among several. [sent-144, score-0.334]
95 There are many other important factors in assessing quality, as described in Section 2. [sent-145, score-0.176]
96 We will address these in future stages of the quality assessment project. [sent-147, score-0.218]
97 Emile Morse of NIST for her generosity in providing the documents and set of judgments from the ARDA Challenge Workshop project, and Prof. [sent-151, score-0.079]
98 of the 2000 NAACL-ANLP Workshop on Automatic summarization - Volume 4 . [sent-181, score-0.184]
wordName wordTfidf (topN-words)
[('reports', 0.323), ('judges', 0.252), ('analysts', 0.216), ('coverage', 0.187), ('gillick', 0.185), ('summarization', 0.184), ('arda', 0.178), ('morse', 0.178), ('concepts', 0.141), ('tau', 0.136), ('assessing', 0.135), ('judge', 0.125), ('favre', 0.119), ('assessment', 0.118), ('officer', 0.118), ('burstein', 0.114), ('individual', 0.112), ('ranking', 0.112), ('criteria', 0.109), ('intelligence', 0.108), ('rutgers', 0.102), ('quality', 0.1), ('scenario', 0.09), ('gold', 0.089), ('jijkoun', 0.089), ('kantor', 0.089), ('prevd', 0.089), ('worst', 0.082), ('eight', 0.082), ('agreement', 0.08), ('seven', 0.08), ('shermis', 0.078), ('larkey', 0.078), ('aj', 0.071), ('essay', 0.071), ('writers', 0.071), ('radev', 0.069), ('emile', 0.068), ('report', 0.067), ('nist', 0.065), ('document', 0.063), ('scenarios', 0.063), ('lucy', 0.062), ('grading', 0.062), ('jill', 0.062), ('benoit', 0.062), ('aspects', 0.06), ('human', 0.06), ('deeper', 0.06), ('regard', 0.059), ('content', 0.057), ('bigram', 0.056), ('vanderwende', 0.055), ('goldstein', 0.054), ('prevalence', 0.054), ('kendall', 0.054), ('employ', 0.053), ('extractive', 0.05), ('dragomir', 0.05), ('ensuring', 0.05), ('domain', 0.05), ('assess', 0.048), ('relevant', 0.047), ('outperforming', 0.047), ('nenkova', 0.047), ('subject', 0.045), ('ai', 0.045), ('paul', 0.044), ('stroudsburg', 0.043), ('blei', 0.042), ('redundancy', 0.042), ('automated', 0.041), ('important', 0.041), ('looks', 0.04), ('cl', 0.04), ('project', 0.04), ('documents', 0.04), ('usa', 0.04), ('determining', 0.039), ('vibhu', 0.039), ('crossdisciplinary', 0.039), ('tionally', 0.039), ('analyst', 0.039), ('aquaint', 0.039), ('asserts', 0.039), ('dco', 0.039), ('hakkanitur', 0.039), ('leah', 0.039), ('oefs', 0.039), ('rofe', 0.039), ('tord', 0.039), ('decision', 0.039), ('judgments', 0.039), ('calculate', 0.038), ('haghighi', 0.038), ('notion', 0.038), ('purpose', 0.037), ('absence', 0.037), ('collection', 0.037), ('collections', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
Author: Samuel Brody ; Paul Kantor
Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.
2 0.14059241 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
Author: Dong Wang ; Yang Liu
Abstract: This paper presents a pilot study of opinion summarization on conversations. We create a corpus containing extractive and abstractive summaries of speaker’s opinion towards a given topic using 88 telephone conversations. We adopt two methods to perform extractive summarization. The first one is a sentence-ranking method that linearly combines scores measured from different aspects including topic relevance, subjectivity, and sentence importance. The second one is a graph-based method, which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted based on dialogue structure. Our evaluation results show that both methods significantly outperform the baseline approach that extracts the longest utterances. In particular, we find that incorporating dialogue structure in the graph-based method contributes to the improved system performance.
3 0.13762736 76 acl-2011-Comparative News Summarization Using Linear Programming
Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao
Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.
4 0.13211684 187 acl-2011-Jointly Learning to Extract and Compress
Author: Taylor Berg-Kirkpatrick ; Dan Gillick ; Dan Klein
Abstract: We learn a joint model of sentence extraction and compression for multi-document summarization. Our model scores candidate summaries according to a combined linear model whose features factor over (1) the n-gram types in the summary and (2) the compressions used. We train the model using a marginbased objective whose loss captures end summary quality. Because of the exponentially large set of candidate summaries, we use a cutting-plane algorithm to incrementally detect and add active constraints efficiently. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance. Our jointly extracted and compressed summaries outperform both unlearned baselines and our learned extraction-only system on both ROUGE and Pyramid, without a drop in judged linguistic quality. We achieve the highest published ROUGE results to date on the TAC 2008 data set.
5 0.12147462 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.
6 0.11769835 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
7 0.11641768 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
8 0.11201226 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
9 0.11132432 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
10 0.098283909 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
11 0.09720964 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
12 0.095149361 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
13 0.091820903 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
14 0.091098897 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
15 0.088385694 4 acl-2011-A Class of Submodular Functions for Document Summarization
16 0.079674549 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
17 0.078495026 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice
18 0.075790502 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
19 0.074044272 52 acl-2011-Automatic Labelling of Topic Models
20 0.069594309 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
topicId topicWeight
[(0, 0.198), (1, 0.081), (2, -0.053), (3, 0.103), (4, -0.084), (5, -0.032), (6, -0.044), (7, 0.129), (8, 0.021), (9, -0.051), (10, -0.081), (11, -0.021), (12, -0.093), (13, -0.017), (14, -0.138), (15, 0.013), (16, 0.013), (17, -0.004), (18, -0.021), (19, 0.029), (20, -0.01), (21, -0.007), (22, -0.002), (23, -0.015), (24, -0.061), (25, -0.02), (26, 0.001), (27, -0.141), (28, 0.058), (29, 0.041), (30, -0.023), (31, -0.039), (32, -0.037), (33, 0.016), (34, -0.006), (35, 0.069), (36, 0.003), (37, -0.009), (38, 0.06), (39, 0.043), (40, 0.067), (41, -0.056), (42, -0.028), (43, -0.045), (44, -0.041), (45, -0.04), (46, 0.057), (47, 0.0), (48, 0.133), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.96495509 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
Author: Samuel Brody ; Paul Kantor
Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.
2 0.78740805 76 acl-2011-Comparative News Summarization Using Linear Programming
Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao
Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.
3 0.76631337 187 acl-2011-Jointly Learning to Extract and Compress
Author: Taylor Berg-Kirkpatrick ; Dan Gillick ; Dan Klein
Abstract: We learn a joint model of sentence extraction and compression for multi-document summarization. Our model scores candidate summaries according to a combined linear model whose features factor over (1) the n-gram types in the summary and (2) the compressions used. We train the model using a marginbased objective whose loss captures end summary quality. Because of the exponentially large set of candidate summaries, we use a cutting-plane algorithm to incrementally detect and add active constraints efficiently. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance. Our jointly extracted and compressed summaries outperform both unlearned baselines and our learned extraction-only system on both ROUGE and Pyramid, without a drop in judged linguistic quality. We achieve the highest published ROUGE results to date on the TAC 2008 data set.
4 0.75101817 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
Author: Charles Greenbacker
Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.
5 0.66431177 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP
Author: Anja Belz ; Eric Kow
Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computer-generated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales.
6 0.65807027 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
7 0.63593423 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
8 0.63510603 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice
9 0.62994218 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
10 0.61624038 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
11 0.61563861 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
12 0.60117292 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
13 0.58830464 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
14 0.56310391 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
15 0.5594514 4 acl-2011-A Class of Submodular Functions for Document Summarization
16 0.55876482 150 acl-2011-Hierarchical Text Classification with Latent Concepts
17 0.53092831 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
18 0.5280301 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
19 0.51850927 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
20 0.5151245 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
topicId topicWeight
[(5, 0.033), (17, 0.045), (26, 0.016), (37, 0.091), (39, 0.051), (41, 0.055), (49, 0.204), (55, 0.041), (59, 0.033), (72, 0.077), (88, 0.011), (91, 0.049), (96, 0.212)]
simIndex simValue paperId paperTitle
same-paper 1 0.86575401 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
Author: Samuel Brody ; Paul Kantor
Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.
2 0.85653061 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
Author: Nadir Durrani ; Helmut Schmid ; Alexander Fraser
Abstract: We present a novel machine translation model which models translation by a linear sequence of operations. In contrast to the “N-gram” model, this sequence includes not only translation but also reordering operations. Key ideas of our model are (i) a new reordering approach which better restricts the position to which a word or phrase can be moved, and is able to handle short and long distance reorderings in a unified way, and (ii) a joint sequence model for the translation and reordering probabilities which is more flexible than standard phrase-based MT. We observe statistically significant improvements in BLEU over Moses for German-to-English and Spanish-to-English tasks, and comparable results for a French-to-English task.
3 0.78748739 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
Author: Jason Naradowsky ; Kristina Toutanova
Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.
4 0.77874976 175 acl-2011-Integrating history-length interpolation and classes in language modeling
Author: Hinrich Schutze
Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.
5 0.77646351 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty
Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.
6 0.77613825 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
7 0.77595741 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
8 0.7756058 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
9 0.77516186 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
10 0.77380216 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
11 0.77297181 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
12 0.7727288 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework
13 0.7725727 187 acl-2011-Jointly Learning to Extract and Compress
14 0.77216566 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
15 0.77210706 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
16 0.77209246 76 acl-2011-Comparative News Summarization Using Linear Programming
17 0.77197933 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
18 0.7719599 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
19 0.77189934 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
20 0.77182817 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation