acl acl2011 acl2011-99 knowledge-graph by maker-knowledge-mining

99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP


Source: pdf

Author: Anja Belz ; Eric Kow

Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computer-generated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales.

1 Background and Introduction

Rating scales have been used for measuring human perception of various stimuli for a long time, at least since the early 20th century (Freyd, 1923). First used in psychology and psychophysics, they are now also common in a variety of other disciplines, including NLP. Discrete scales are the only type of scale commonly used for qualitative assessments of computer-generated language in NLP (e.g. in the DUC/TAC evaluation competitions). Continuous scales are commonly used in psychology and related fields, but are virtually unknown in NLP. While studies assessing the quality of individual scales and comparing different types of rating scales are common in psychology and related fields, such studies hardly exist in NLP, and so at present little is known about whether discrete scales are a suitable rating tool for NLP evaluation tasks, or whether continuous scales might provide a better alternative.

A range of studies from sociology, psychophysiology, biometrics and other fields have compared discrete and continuous scales. Results tend to differ for different types of data. E.g., results from pain measurement show a continuous scale to outperform a discrete scale (ten Klooster et al., 2006). Other results (Svensson, 2000) from measuring students' ease of following lectures show a discrete scale to outperform a continuous scale. When measuring dyspnea, Lansing et al. (2003) found a hybrid scale to perform on a par with a discrete scale.

Another consideration is the types of data produced by discrete and continuous scales. Parametric methods of statistical analysis, which are far more sensitive than non-parametric ones, are commonly applied to both discrete and continuous data. However, parametric methods make very strong assumptions about data, including that it is numerical and normally distributed (Siegel, 1957). If these assumptions are violated, then the significance of results is overestimated. Clearly, the numerical assumption does not hold for the categorial data produced by discrete scales, and such data is unlikely to be normally distributed. Many researchers are happier to apply parametric methods to data from continuous scales, and some simply take it as read that such data is normally distributed (Lansing et al., 2003).

Our aim in the present study was to systematically assess and compare discrete and continuous scales when used for the qualitative assessment of computer-generated language. We start with an overview of assessment scale types (Section 2). We then describe the experiments we conducted (Section 4), the data we used in them (Section 3), and the properties we examined in our inter-scale comparisons (Section 5), before presenting our results (Section 6) and some conclusions (Section 7).
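As a concrete illustration of the parametric-assumptions point made above, the sketch below (not from the paper; all ratings are simulated and their distributions are assumed) contrasts a normality check and a parametric test on continuous 0–100 ratings with a rank-based, non-parametric test on discrete 7-point ratings, using standard scipy routines.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical ratings of two systems by 22 raters each.
vds_sys1 = rng.integers(4, 8, size=22).astype(float)     # discrete 7-point categories
vds_sys2 = rng.integers(3, 7, size=22).astype(float)
vas_sys1 = np.clip(rng.normal(72, 12, size=22), 0, 100)  # continuous 0-100 ratings
vas_sys2 = np.clip(rng.normal(63, 12, size=22), 0, 100)

# Normality check (Shapiro-Wilk): the assumption behind parametric tests can at
# least be examined for continuous data; 7-point category labels are not really
# numerical measurements in the first place.
for name, sample in [("VDS-7 system 1", vds_sys1), ("VAS system 1", vas_sys1)]:
    _, p = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Parametric comparison on the continuous ratings (assumes normality) ...
t_stat, p_t = stats.ttest_ind(vas_sys1, vas_sys2)
# ... vs. a rank-based, non-parametric comparison on the discrete ratings.
u_stat, p_u = stats.mannwhitneyu(vds_sys1, vds_sys2)

print(f"t-test on VAS ratings:         p = {p_t:.3f}")
print(f"Mann-Whitney U on VDS ratings: p = {p_u:.3f}")
```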
Figure 1: Evaluation of Readability in DUC'06, comprising 5 evaluation criteria, including Grammaticality. Q1: Grammaticality. The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. Response categories: 1. Very Poor, 2. Poor, 3. Barely Acceptable, 4. Good, 5. Very Good. Evaluation task for each summary text: evaluator selects one of the options (1–5) to represent quality of the summary in terms of the criterion.

2 Rating Scales

With Verbal Descriptor Scales (VDSs), participants give responses on ordered lists of verbally described and/or numerically labelled response categories, typically varying in number from 2 to 11 (Svensson, 2000). An example of a VDS used in NLP is shown in Figure 1. VDSs are used very widely in contexts where computationally generated language is evaluated, including in dialogue, summarisation, MT and data-to-text generation.

Visual analogue scales (VASs) are far less common outside psychology and related areas than VDSs. Responses are given by selecting a point on a typically horizontal line (although vertical lines have also been used (Scott and Huskisson, 2003)), on which the two end points represent the extreme values of the variable to be measured. Such lines can be mono-polar or bi-polar, and the end points are labelled with an image (smiling/frowning face), or a brief verbal descriptor, to indicate which end of the line corresponds to which extreme of the variable. The labels are commonly chosen to represent a point beyond any response actually likely to be chosen by raters. There is only one example of a VAS in NLP system evaluation that we are aware of (Gatt et al., 2009).

Hybrid scales, known as graphic rating scales, combine the features of VDSs and VASs, and are also used in psychology. Here, the verbal descriptors are aligned along the line of a VAS and the endpoints are typically unmarked (Svensson, 2000). We are aware of one example in NLP (Williams and Reiter, 2008); we did not investigate this scale in our study.

Figure 2: Evaluation of Grammaticality with an alternative VAS scale (cf. Figure 1), a horizontal line with end points labelled "extremely bad" and "excellent". Q1: Grammaticality. The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. Evaluation task for each summary text: evaluator selects a place on the line to represent quality of the summary in terms of the criterion.

We used the following two specific scale designs in our experiments:

VDS-7: 7 response categories, numbered (7 = best) and verbally described (e.g. 7 = "perfectly fluent" for Fluency, and 7 = "perfectly clear" for Clarity). Response categories were presented in a vertical list, with the best category at the bottom. Each category had a tick-box placed next to it; the rater's task was to tick the box by their chosen rating.

VAS: a horizontal, bi-polar line, with no ticks on it, mapping to 0–100.
In the image description tests, statements identified the left end as negative, the right end as positive; in the weather forecast tests, the positive end had a smiling face and the label "statement couldn't be clearer/read better"; the negative end had a frowning face and the label "statement couldn't be more unclear/read worse". The raters' task was to move a pointer (initially in the middle of the line) to the place corresponding to their rating.

3 Data

Weather forecast texts: In one half of our evaluation experiments we used human-written and automatically generated weather forecasts for the same weather data. The data in our evaluations was for 22 different forecast dates and included outputs from 10 generator systems and one set of human forecasts. This data has also been used for comparative system evaluation in previous research (Langner, 2010; Angeli et al., 2010; Belz and Kow, 2009). The following are examples of weather forecast texts from the data:

1: SSE 28-32 INCREASING 36-40 BY MID AFTERNOON
2: S'LY 26-32 BACKING SSE 30-35 BY AFTERNOON INCREASING 35-40 GUSTS 50 BY MID EVENING

Image descriptions: In the other half of our evaluations, we used human-written and automatically generated image descriptions for the same images. The data in our evaluations was for 112 different image sets and included outputs from 6 generator systems and 2 sets of human-authored descriptions. This data was originally created in the TUNA Project (van Deemter et al., 2006). The following is an example of an item from the corpus, consisting of a set of images and a description for the entity in the red frame: the small blue fan.

4 Experimental Set-up

4.1 Evaluation criteria

Fluency/Readability: Both the weather forecast and image description evaluation experiments used a quality criterion intended to capture 'how well a piece of text reads', called Fluency in the latter, Readability in the former.

Adequacy/Clarity: In the image description experiments, the second quality criterion was Adequacy, explained as "how clear the description is", and "how easy it would be to identify the image from the description". This criterion was called Clarity in the weather forecast experiments, explained as "how easy is it to understand what is being described".

4.2 Raters

In the image experiments we used 8 raters (native speakers) in each experiment, from cohorts of 3rd-year undergraduate and postgraduate students doing a degree in a linguistics-related subject. They were paid and spent about 1 hour doing the experiment. In the weather forecast experiments, we used 22 raters in each experiment, from among academic staff at our own university. They were not paid and spent about 15 minutes doing the experiment.

4.3 Summary overview of experiments

Weather VDS-7 (A): VDS-7 scale; weather forecast data; criteria: Readability and Clarity; 22 raters (university staff) each assessing 22 forecasts.
Weather VDS-7 (B): exact repeat of Weather VDS-7 (A), including same raters.
Weather VAS: VAS scale; 22 raters (university staff), no overlap with raters in the Weather VDS-7 experiments; other details same as in Weather VDS-7.
Image VDS-7: VDS-7 scale; image description data; 8 raters (linguistics students) each rating 112 descriptions; criteria: Fluency and Adequacy.
Image VAS (A): VAS scale; 8 raters (linguistics students), no overlap with raters in Image VDS-7; other details same as in the Image VDS-7 experiment.
Image VAS (B): exact repeat of Image VAS (A), including same raters.
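Since the two response formats described in Section 2 produce values on different ranges (a ticked category from 1–7 vs. a pointer position mapped to 0–100), each of the experiments above has to record responses as numbers in some way. The sketch below is only an illustration of one such convention, not the authors' implementation; the record field names, example values, and the pixel geometry of the VAS line are all assumptions.

```python
# Illustrative only: recording VDS-7 and VAS responses as numeric scores.
from dataclasses import dataclass

@dataclass
class Rating:
    rater: str
    system: str
    item: str        # forecast date or image set (hypothetical identifiers)
    criterion: str   # e.g. "Fluency", "Adequacy", "Clarity", "Readability"
    score: float     # 1-7 for VDS-7, 0-100 for VAS

def vds7_score(tick: int) -> float:
    """A ticked VDS-7 category, 1-7 with 7 = best, is recorded directly."""
    if not 1 <= tick <= 7:
        raise ValueError("VDS-7 response must be a category from 1 to 7")
    return float(tick)

def vas_score(pointer_px: float, left_px: float, right_px: float) -> float:
    """Map the pointer position on the on-screen line onto the 0-100 range."""
    fraction = (pointer_px - left_px) / (right_px - left_px)
    return 100.0 * min(1.0, max(0.0, fraction))

r1 = Rating("rater01", "system03", "forecast_12", "Clarity", vds7_score(5))
r2 = Rating("rater09", "system03", "forecast_12", "Clarity", vas_score(430.0, 100.0, 700.0))
print(r1.score, r2.score)   # 5.0 55.0
```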
4.4 Design features common to all experiments

In all our experiments we used a Repeated Latin Squares design to ensure that each rater sees the same number of outputs from each system and for each text type (forecast date/image set). Following detailed instructions, raters first did a small number of practice examples, followed by the texts to be rated, in an order randomised for each rater. Evaluations were carried out via a web interface. Raters were allowed to interrupt the experiment, and in the case of the 1-hour-long image description evaluation they were encouraged to take breaks.

5 Comparison and Assessment of Scales

Validity is the extent to which an assessment method measures what it is intended to measure (Svensson, 2000). Validity is often impossible to assess objectively, as is the case for all our criteria except Adequacy, the validity of which we can directly test by looking at correlations with the accuracy with which participants in a separate experiment identify the intended images given their descriptions.

A standard method for assessing Reliability is Kendall's W, a coefficient of concordance, measuring the degree to which different raters agree in their ratings. We report W for all 6 experiments.

Stability refers to the extent to which the results of an experiment run on one occasion agree with the results of the same experiment (with the same raters) run on a different occasion. In the present study, we assess stability in an intra-rater, test-retest design, assessing the agreement between the same participant's responses in the first and second runs of the test with Pearson's product-moment correlation coefficient. We report these measures between ratings given in Image VAS (A) vs. those given in Image VAS (B), and between ratings given in Weather VDS-7 (A) vs. those given in Weather VDS-7 (B).

We assess Interchangeability, that is, the extent to which our VDS and VAS scales agree, by computing Pearson's and Spearman's coefficients between results. We report these measures for all pairs of weather forecast/image description evaluations.

We assess the Sensitivity of our scales by determining the number of significant differences between different systems and human authors detected by each scale. We also look at the relative effect of the different experimental factors by computing the F-Ratio for System (the main factor under investigation, so its relative effect should be high), Rater and Text Type (their effect should be low). F-ratios were determined by a one-way ANOVA with the evaluation criterion in question as the dependent variable and System, Rater or Text Type as grouping factors.

6 Results

6.1 Interchangeability and Reliability for system/human authored image descriptions

Interchangeability: Pearson's r between the means per system/human in the three image description evaluation experiments was as follows (Spearman's ρ shown in brackets): [Table: Pearson's r (Spearman's ρ) between per-system means of Image VDS-7, Image VAS (A) and Image VAS (B), for Adequacy and Fluency; the numeric values are not legible in the source.] For both Adequacy and Fluency, correlations between Image VDS-7 and Image VAS (A) (the main VAS experiment) are extremely high, meaning that they could substitute for each other here.

Reliability: Inter-rater agreement in terms of Kendall's W in each of the experiments: [Table: Kendall's W for Adequacy and Fluency in Image VDS-7, Image VAS (A) and Image VAS (B); the numeric values are not legible in the source.] W was higher in the VAS data in the case of Fluency, whereas for Adequacy, W was the same for the VDS data and VAS (B), and higher in the VDS data than in the VAS (A) data.
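For reference, the following sketch shows how the measures defined in Section 5 can be computed with numpy/scipy on a raters-by-items score matrix: Kendall's W for inter-rater reliability (simple form, no tie correction), Pearson's r on individual scores for test-retest stability, and Pearson/Spearman coefficients on per-system means for interchangeability. All data here is simulated and the code is not the authors' implementation.

```python
import numpy as np
from scipy import stats

def kendalls_w(ratings: np.ndarray) -> float:
    """ratings: raters x items matrix; returns the coefficient of concordance."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(stats.rankdata, 1, ratings)  # rank items per rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

rng = np.random.default_rng(1)
run_a = rng.integers(1, 8, size=(8, 14)).astype(float)          # 8 raters, 14 texts
run_b = np.clip(run_a + rng.normal(0, 1, size=(8, 14)), 1, 7)   # simulated re-run

# Reliability: agreement among raters within one run.
print("Kendall's W (run A):", round(kendalls_w(run_a), 3))

# Stability: same raters, same items, two occasions, on individual scores.
r_stab = stats.pearsonr(run_a.ravel(), run_b.ravel())[0]
print("test-retest Pearson's r:", round(r_stab, 3))

# Interchangeability: per-system means obtained with two different scales.
means_vds = np.array([4.1, 5.3, 3.8, 6.0, 4.9])      # hypothetical values
means_vas = np.array([55.0, 71.2, 49.5, 83.1, 64.0])
print("Pearson: ", round(stats.pearsonr(means_vds, means_vas)[0], 3))
print("Spearman:", round(stats.spearmanr(means_vds, means_vas)[0], 3))
```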
6.2 Interchangeability and Reliability for system/human authored weather forecasts

Interchangeability: The correlation coefficients (Pearson's r with Spearman's ρ in brackets) between the means per system/human in the weather forecast experiments were as follows: [Table: Pearson's r (Spearman's ρ) between per-system means of Weather VDS-7 (A), Weather VDS-7 (B) and Weather VAS, for Clarity and Readability; the numeric values are not legible in the source.] For both Clarity and Readability, correlations between Weather VDS-7 (A) (the main VDS-7 experiment) and Weather VAS are again very high, although rank-correlation is somewhat lower.

Reliability: Inter-rater agreement in terms of Kendall's W was as follows: [Table: Kendall's W for Clarity and Readability in Weather VDS-7 (A), Weather VDS-7 (B) and Weather VAS; the numeric values are not legible in the source.] This time the highest agreement for both Clarity and Readability was in the VDS-7 data.

6.3 Stability tests for image and weather data

Pearson's r between ratings given by the same raters first in Image VAS (A) and then in Image VAS (B) was .666 for Adequacy, .593 for Fluency. Between ratings given by the same raters first in Weather VDS-7 (A) and then in Weather VDS-7 (B), Pearson's r was .656 for Clarity, .704 for Readability. (All significant at p < .01.) Note that these are computed on individual scores (rather than means as in the correlation figures given in previous sections).

6.4 F-ratios and post-hoc analysis for image data

The table below shows F-ratios determined by a one-way ANOVA with the evaluation criterion in question (Adequacy/Fluency) as the dependent variable and System/Rater/Text Type as the grouping factor. Note that for System a high F-ratio is desirable, but a low F-ratio is desirable for other factors. [Table of F-ratios for System, Rater and Text Type; the numeric values are not legible in the source.] For System, the main factor under investigation, VDS-7 found 8 significant differences for Adequacy and 14 for Fluency; VAS (A) found 7 for Adequacy and 15 for Fluency.

6.5 F-ratios and post-hoc analysis for weather data

The table below shows F-ratios analogous to the previous section (for Clarity/Readability). [Table of F-ratios for System, Rater and Text Type; the numeric values are not legible in the source.] For System, VDS-7 (A) found 24 significant differences for Clarity and 23 for Readability; VAS found 25 for Clarity and 26 for Readability.

6.6 Scale validity test for image data

Our final table of results shows Pearson's correlation coefficients (calculated on means per system) between the Adequacy data from the three image description evaluation experiments on the one hand, and the data from an extrinsic experiment in which we measured the accuracy with which participants identified the intended image described by a description: [Table: Pearson's r between Adequacy and ID Accuracy for Image VDS-7, Image VAS (A) and Image VAS (B); the numeric values are not legible in the source.] The correlation between Adequacy and ID Accuracy was strong and highly significant in all three image description evaluation experiments, but strongest in VAS (B), and weakest in VAS (A). For comparison, Pearson's r between Fluency and ID Accuracy ranged between .3 and .5, whereas Pearson's r between Adequacy and ID Speed (also measured in the same image identification experiment) ranged between -.35 and -.29.
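The F-ratio comparison in Sections 6.4 and 6.5 can be reproduced in outline as below: a one-way ANOVA with the criterion score as dependent variable and System (or Rater) as the grouping factor, where a useful scale should yield a high F for System and low F-ratios for the nuisance factors. The scores here are simulated with an assumed effect structure; the sketch is illustrative only, not the paper's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_systems, n_raters, n_items = 6, 8, 14

# Hypothetical scores indexed by (system, rater, item): a sizeable system
# effect, a smaller rater effect, and noise.
system_effect = rng.normal(0, 10, n_systems)[:, None, None]
rater_effect = rng.normal(0, 3, n_raters)[None, :, None]
scores = 60 + system_effect + rater_effect + rng.normal(0, 8, (n_systems, n_raters, n_items))

# F-ratio for System: group all scores by which system produced the text.
f_system, p_system = stats.f_oneway(*[scores[s].ravel() for s in range(n_systems)])
# F-ratio for Rater: group the same scores by who rated them.
f_rater, p_rater = stats.f_oneway(*[scores[:, r, :].ravel() for r in range(n_raters)])

print(f"System: F = {f_system:.1f} (p = {p_system:.3g})  <- should be high")
print(f"Rater:  F = {f_rater:.1f} (p = {p_rater:.3g})  <- should be low")
```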
7 Discussion and Conclusions

Our interchangeability results (Sections 6.1 and 6.2) indicate that the VAS and VDS-7 scales we have tested can substitute for each other in our present evaluation tasks in terms of the mean system scores they produce. Where we were able to measure validity (Section 6.6), both scales were shown to be similarly valid, predicting image identification accuracy figures from a separate experiment equally well. Stability (Section 6.3) was marginally better for VDS-7 data, and Reliability (Sections 6.1 and 6.2) was better for VAS data in the image description evaluations, but (mostly) better for VDS-7 data in the weather forecast evaluations. Finally, the VAS experiments found greater numbers of statistically significant differences between systems in 3 out of 4 cases (Section 6.5).

Our own raters strongly prefer working with VAS scales over VDSs. This has also long been clear from the psychology literature (Svensson, 2000), where raters are typically found to prefer VAS scales over VDSs, which can be a "constant source of vexation to the conscientious rater when he finds his judgments falling between the defined points" (Champney, 1941). Moreover, if a rater's judgment falls between two points on a VDS then they must make the false choice between the two points just above and just below their actual judgment. In this case we know that the point they end up selecting is not an accurate measure of their judgment but rather just one of two equally accurate ones (one of which goes unrecorded).

Our results establish (for our evaluation tasks) that VAS scales, so far unproven for use in NLP, are at least as good as VDSs, currently virtually the only scale in use in NLP. Combined with the fact that raters strongly prefer VASs and that they are regarded as more amenable to parametric means of statistical analysis, this indicates that VAS scales should be used more widely for NLP evaluation tasks.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 15th Conference on Empirical Methods in Natural Language Processing (EMNLP'10).

Anja Belz and Eric Kow. 2009. System building cost vs. output quality in data-to-text generation. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 16–24.

H. Champney. 1941. The measurement of parent behavior. Child Development, 12(2):131.

M. Freyd. 1923. The graphic rating scale. Biometrical Journal, 42:83–102.

A. Gatt, A. Belz, and E. Kow. 2009. The TUNA Challenge 2009: Overview and evaluation results. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG'09), pages 198–206.

Brian Langner. 2010. Data-driven Natural Language Generation: Making Machines Talk Like Humans Using Natural Corpora. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

Robert W. Lansing, Shakeeb H. Moosavi, and Robert B. Banzett. 2003. Measurement of dyspnea: word labeled visual analog scale vs. verbal ordinal scale. Respiratory Physiology & Neurobiology, 134(2):77–83.

J. Scott and E. C. Huskisson. 2003. Vertical or horizontal visual analogue scales. Annals of the Rheumatic Diseases, 38:560.

Sidney Siegel. 1957. Non-parametric statistics. The American Statistician, 11(3):13–19.

Elisabeth Svensson. 2000. Comparison of the quality of assessments using continuous and discrete ordinal rating scales. Biometrical Journal, 42(4):417–434.

P. M. ten Klooster, A. P. Klaar, E. Taal, R. E. Gheith, J. J. Rasker, A. K. El-Garf, and M. A. van de Laar. 2006. The validity and reliability of the graphic rating scale and verbal rating scale for measuring pain across cultures: A study in Egyptian and Dutch women with rheumatoid arthritis. The Clinical Journal of Pain, 22(9):827–830.

Kees van Deemter, Ielka van der Sluis, and Albert Gatt. 2006. Building a semantically transparent corpus for the generation of referring expressions. In Proceedings of the 4th International Conference on Natural Language Generation, pages 130–132, Sydney, Australia, July.

S. Williams and E. Reiter. 2008. Generating basic skills reports for low-skilled readers.
Natural Language Engineering, 14(4):495–525.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. [sent-6, score-0.499]

2 In this paper we assess discrete and continuous scales used for measuring quality assessments of computergenerated language. [sent-7, score-0.653]

3 We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. [sent-8, score-0.293]

4 We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales. [sent-10, score-0.544]

5 1 Background and Introduction Rating scales have been used for measuring human perception of various stimuli for a long time, at least since the early 20th century (Freyd, 1923). [sent-11, score-0.317]

6 First used in psychology and psychophysics, they are now also common in a variety of other disciplines, including NLP. [sent-12, score-0.048]

7 Discrete scales are the only type of scale commonly used for qualitative assessments of computer-generated language in NLP (e. [sent-13, score-0.41]

8 Continuous scales are commonly used in psychology and related fields, but are virtually unknown in NLP. [sent-16, score-0.359]

9 A range of studies from sociology, psychophysiology, biometrics and other fields have compared discrete and continuous scales. [sent-18, score-0.026]

10 , results from pain measurement show a continuous scale to outperform a discrete scale (ten Klooster et al. [sent-24, score-0.474]

11 Other results (Svensson, 2000) from measuring students’ ease of following lectures show a discrete scale to outperform a continuous scale. [sent-26, score-0.359]

12 (2003) found a hybrid scale to perform on a par with a discrete scale. [sent-28, score-0.216]

13 Another consideration is the types of data produced by discrete and continuous scales. [sent-29, score-0.254]

14 Parametric methods of statistical analysis, which are far more sensitive than non-parametric ones, are commonly applied to both discrete and continuous data. [sent-30, score-0.254]

15 However, parametric methods make very strong assumptions about data, including that it is numerical and normally distributed (Siegel, 1957). [sent-31, score-0.07]

16 Clearly, the numerical assumption does not hold for the categorial data produced by discrete scales, and it is unlikely to be normally distributed. [sent-33, score-0.163]

17 Many researchers are happier to apply parametric methods to data from continuous scales, and some simply take it as read that such data is normally distributed (Lansing et al. [sent-34, score-0.186]

18 Our aim in the present study was to systematically assess and compare discrete and continuous scales when used for the qualitative assessment of computer-generated language. [sent-36, score-0.617]

19 We start with an overview of assessment scale types (Section 2). [sent-37, score-0.111]

20 Q1: Grammaticality The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e. [sent-40, score-0.058]

21 Evaluation task for each summary text: evaluator selects one of the options (1–5) to represent quality of the summary in terms of the criterion. [sent-49, score-0.101]

22 2 Rating Scales With Verbal Descriptor Scales (VDSs), participants give responses on ordered lists of verbally described and/or numerically labelled response cate- gories, typically varying in number from 2 to 11 (Svensson, 2000). [sent-51, score-0.09]

23 Visual analogue scales (VASs) are far less common outside psychology and related areas than VDSs. [sent-54, score-0.366]

24 Responses are given by selecting a point on a typically horizontal line (although vertical lines have also been used (Scott and Huskisson, 2003)), on which the two end points represent the extreme values of the variable to be measured. [sent-55, score-0.11]

25 Such lines can be mono-polar or bi-polar, and the end points are labelled with an image (smiling/frowning face), or a brief verbal descriptor, to indicate which end of the line corresponds to which extreme of the variable. [sent-56, score-0.404]

26 The labels are commonly chosen to represent a point beyond any response actually likely to be chosen by raters. [sent-57, score-0.029]

27 Hybrid scales, known as a graphic rating scales, combine the features of VDSs and VASs, and are also used in psychology. [sent-60, score-0.154]

28 Here, the verbal descriptors are aligned along the line of a VAS and the endpoints are typically unmarked (Svensson, 2000). [sent-61, score-0.062]

29 We are aware of one example in NLP (Williams and Reiter, 2008); 231 Q1: Grammaticality The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e. [sent-62, score-0.058]

30 Figure 2: Evaluation of Grammaticality with alternative VAS scale, end points labelled "extremely bad" and "excellent" (cf. [sent-65, score-0.078]

31 Evaluation task for each summary text: evaluator selects a place on the line to represent quality of the summary in terms of the criterion. [sent-67, score-0.126]

32 We used the following two specific scale designs in our experiments: VDS-7: 7 response categories, numbered (7 = best) and verbally described (e. [sent-69, score-0.14]

33 Response categories were presented in a vertical list, with the best category at the bottom. [sent-72, score-0.033]

34 3 Data Weather forecast texts: In one half of our evaluation experiments we used human-written and automatically generated weather forecasts for the same weather data. [sent-77, score-0.934]

35 The data in our evaluations was for 22 different forecast dates and included outputs from 10 generator systems and one set of human forecasts. [sent-78, score-0.212]

36 The data in our evaluations was for 112 different image sets and included outputs from 6 generator systems and 2 sets of human-authored descriptions. [sent-82, score-0.349]

37 The following is an example of an item from the corpus, consisting of a set of images and a description for the entity in the red frame: the small blue fan 4 Experimental Set-up 4. [sent-85, score-0.061]

38 1 Evaluation criteria Fluency/Readability: Both the weather forecast and image description evaluation experiments used a quality criterion intended to capture ‘how well a piece of text reads’ , called Fluency in the latter, Readability in the former. [sent-86, score-0.988]

39 Adequacy/Clarity: In the image description experiments, the second quality criterion was Adequacy, explained as “how clear the description is”, and “how easy it would be to identify the image from the description”. [sent-87, score-0.758]

40 This criterion was called Clarity in the weather forecast experiments, explained as “how easy is it to understand what is being described”. [sent-88, score-0.566]

41 2 Raters In the image experiments we used 8 raters (native speakers) in each experiment, from cohorts of 3rdyear undergraduate and postgraduate students doing a degree in a linguistics-related subject. [sent-90, score-0.552]

42 They were paid and spent about 1hour doing the experiment. [sent-91, score-0.045]

43 In the weather forecast experiments, we used 22 raters in each experiment, from among academic staff at our own university. [sent-92, score-0.792]

44 They were not paid and spent about 15 minutes doing the experiment. [sent-93, score-0.045]

45 3 Summary overview of experiments Weather VDS-7 (A): VDS-7 scale; weather forecast data; criteria: Readability and Clarity; 22 raters (university staff) each assessing 22 forecasts. [sent-95, score-0.799]

46 Weather VAS: VAS scale; 22 raters (university staff), no overlap with raters in Weather VDS-7 experiments; other details same as in Weather VDS-7. [sent-97, score-0.444]

47 Image VDS-7: VDS-7 scale; image description data; 8 raters (linguistics students) each rating 112 descriptions; criteria: Fluency and Adequacy. [sent-98, score-0.697]

48 Image VAS (A): VAS scale; 8 raters (linguistics students), no overlap with raters in Image VAS-7; other details same as in Image VDS-7 experiment. [sent-99, score-0.444]

49 4 Design features common to all experiments In all our experiments we used a Repeated Latin Squares design to ensure that each rater sees the same number of outputs from each system and for each text type (forecast date/image set). [sent-102, score-0.085]

50 Following detailed instructions, raters first did a small number of practice examples, followed by the texts to be rated, in an order randomised for each rater. [sent-103, score-0.222]

51 They were allowed to interrupt the experiment, and in the case of the 1hour long image description evaluation they were encouraged to take breaks. [sent-105, score-0.361]

52 5 Comparison and Assessment of Scales Validity is the extent to which an assessment method measures what it is intended to measure (Svensson, 2000). [sent-106, score-0.061]

53 Validity is often impossible to assess objectively, as is the case of all our criteria except Adequacy, the validity of which we can directly test by looking at correlations with the accuracy with which participants in a separate experiment identify the intended images given their descriptions. [sent-107, score-0.202]

54 A standard method for assessing Reliability is Kendall’s W, a coefficient of concordance, measuring the degree to which different raters agree in their ratings. [sent-108, score-0.317]

55 Stability refers to the extent to which the results of an experiment run on one occasion agree with the results of the same experiment (with the same raters) run on a different occasion. [sent-110, score-0.089]

56 In the present study, we assess stability in an intra-rater, test-retest design, assessing the agreement between the same participant’s responses in the first and second runs of the test with Pearson’s product-moment correlation coefficient. [sent-111, score-0.195]

57 We report these measures between ratings given in Image VAS (A) vs. [sent-112, score-0.037]

58 those given in Image VAS (B), and between ratings given in Weather VDS-7 (A) vs. [sent-113, score-0.037]

59 We assess Interchangeability, that is, the extent to which our VDS and VAS scales agree, by computing Pearson’s and Spearman’s coefficients between results. [sent-115, score-0.364]

60 We report these measures for all pairs of weather forecast/image description evaluations. [sent-116, score-0.428]

61 We assess the Sensitivity of our scales by determining the number of significant differences between different systems and human authors detected by each scale. [sent-117, score-0.33]

62 F-ratios were de- termined by a one-way ANOVA with the evaluation criterion in question as the dependent variable and System, Rater or Text Type as grouping factors. [sent-119, score-0.036]

63 1 Interchangeability and Reliability for system/human authored image descriptions Interchangeability: Pearson’s r between the means per system/human in the three image description evaluation experiments were as follows (Spearman’s ρ shown in brackets). [sent-121, score-0.724]

64 2 Interchangeability and Reliability for system/human authored weather forecasts Interchangeability: The correlation coefficients (Pearson’s r with Spearman’s ρ in brackets) between the means per system/human in the weather forecast experiments were as follows. [sent-134, score-0.848]

65 3 Stability tests for image and weather data Pearson’s r between ratings given by the same raters first in Image VAS (A) and then in Image VAS (B) was . [sent-146, score-0.926]

66 Between ratings given by the same raters first in Weather VDS-7 (A) and then in Weather VDS-7 (B), Pearson’s r was . [sent-149, score-0.259]

67 ) Note that these are computed on individual scores (rather than means as in the correlation figures given in previous sections). [sent-154, score-0.022]

68 4 F-ratios and post-hoc analysis for image data The table below shows F-ratios determined by a oneway ANOVA with the evaluation criterion in question (Adequacy/Fluency) as the dependent variable and System/Rater/Text Type as the grouping factor. [sent-156, score-0.336]

69 5 F-ratios and post-hoc analysis for weather data The table below shows F-ratios analogous to the previous section (for Clarity/Readability). [sent-160, score-0.367]

70 The correlation between Adequacy and ID Accuracy was strong and highly significant in all three image description evaluation experiments, but strongest in VAS (B), and weakest in VAS (A). [sent-165, score-0.361]

71 For comparison, 234 Pearson’s between Fluency and ID Accuracy ranged between . [sent-166, score-0.026]

72 5, whereas Pearson’s between Adequacy and ID Speed (also measured in the same image identfication experiment) ranged between -. [sent-168, score-0.326]

73 7 Discussion and Conclusions Our interchangeability results (Sections 6. [sent-171, score-0.13]

74 2) indicate that the VAS and VDS-7 scales we have tested can substitute for each other in our present evaluation tasks in terms of the mean system scores they produce. [sent-173, score-0.29]

75 Where we were able to measure validity (Section 6. [sent-174, score-0.067]

76 6), both scales were shown to be similarly valid, predicting image identification accuracy figures from a separate experiment equally well. [sent-175, score-0.624]

77 2) was better for VAS data in the image descrip- tion evaluations, but (mostly) better for VDS-7 data in the weather forecast evaluations. [sent-179, score-0.83]

78 Our own raters strongly prefer working with VAS scales over VDSs. [sent-182, score-0.537]

79 This has also long been clear from the psychology literature (Svensson, 2000), where raters are typically found to prefer VAS scales over VDSs, which can be a “constant source of vexation to the conscientious rater when he finds his judgments falling between the defined points” (Champney, 1941). [sent-183, score-0.67]

80 In this case we know that the point they end up selecting is not an accurate measure of their judgment but rather just one of two equally accurate ones (one of which goes unrecorded). [sent-185, score-0.021]

81 Our results establish (for our evaluation tasks) that VAS scales, so far unproven for use in NLP, are at least as good as VDSs, currently virtually the only scale in use in NLP. [sent-186, score-0.099]

82 Combined with the fact that raters strongly prefer VASs and that they are regarded as more amenable to parametric means of statistical analysis, this indicates that VAS scales should be used more widely for NLP evaluation tasks. [sent-187, score-0.582]

83 Measurement of dyspnea: word labeled visual analog scale vs. [sent-225, score-0.102]

84 Comparison of the quality of assessments using continuous and discrete ordinal rating scales. [sent-241, score-0.433]

85 The validity and reliability of the graphic rating scale and verbal rating scale for measuring pain across cultures: A study in Egyptian and Dutch women with rheumatoid arthritis. [sent-258, score-0.63]

86 Kees van Deemter, Ielka van der Sluis, and Albert Gatt. [sent-260, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vas', 0.592), ('weather', 0.367), ('image', 0.3), ('scales', 0.29), ('raters', 0.222), ('forecast', 0.163), ('discrete', 0.138), ('interchangeability', 0.13), ('continuous', 0.116), ('rating', 0.114), ('adequacy', 0.096), ('svensson', 0.093), ('vds', 0.093), ('vdss', 0.093), ('rater', 0.085), ('pearson', 0.081), ('scale', 0.078), ('validity', 0.067), ('reliability', 0.064), ('clarity', 0.062), ('description', 0.061), ('stability', 0.058), ('belz', 0.056), ('lansing', 0.056), ('vass', 0.056), ('fluency', 0.054), ('kow', 0.049), ('psychology', 0.048), ('assessing', 0.047), ('parametric', 0.045), ('readability', 0.044), ('assessments', 0.042), ('assess', 0.04), ('staff', 0.04), ('graphic', 0.04), ('pain', 0.038), ('ratings', 0.037), ('verbal', 0.037), ('biometrical', 0.037), ('datelines', 0.037), ('descriptor', 0.037), ('dyspnea', 0.037), ('forecasts', 0.037), ('increas', 0.037), ('klooster', 0.037), ('summary', 0.037), ('criterion', 0.036), ('descriptions', 0.036), ('experiment', 0.034), ('coefficients', 0.034), ('spearman', 0.034), ('kendall', 0.034), ('vertical', 0.033), ('assessment', 0.033), ('angeli', 0.033), ('gatt', 0.033), ('tuna', 0.033), ('verbally', 0.033), ('criteria', 0.033), ('horizontal', 0.031), ('students', 0.03), ('deemter', 0.03), ('couldn', 0.03), ('mid', 0.03), ('grammaticality', 0.029), ('response', 0.029), ('evaluations', 0.028), ('analogue', 0.028), ('anja', 0.028), ('intended', 0.028), ('responses', 0.028), ('measuring', 0.027), ('authored', 0.027), ('anova', 0.027), ('evaluator', 0.027), ('formatting', 0.027), ('face', 0.027), ('nlp', 0.026), ('fields', 0.026), ('measurement', 0.026), ('williams', 0.026), ('ranged', 0.026), ('normally', 0.025), ('prefer', 0.025), ('sensitivity', 0.025), ('brighton', 0.025), ('tem', 0.025), ('line', 0.025), ('visual', 0.024), ('van', 0.024), ('paid', 0.023), ('ordinal', 0.023), ('perfectly', 0.022), ('spent', 0.022), ('correlation', 0.022), ('agree', 0.021), ('virtually', 0.021), ('end', 0.021), ('ungrammatical', 0.021), ('generator', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

Author: Anja Belz ; Eric Kow

Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computer-generated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales.

2 0.088616431 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

Author: Charles Greenbacker

Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.

3 0.076795571 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions.

4 0.054535788 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

5 0.05198827 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben

Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.

6 0.051592529 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

7 0.05030822 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

8 0.049463265 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

9 0.046561453 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

10 0.041909866 76 acl-2011-Comparative News Summarization Using Linear Programming

11 0.040882122 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

12 0.038228538 52 acl-2011-Automatic Labelling of Topic Models

13 0.034242846 105 acl-2011-Dr Sentiment Knows Everything!

14 0.034164082 253 acl-2011-PsychoSentiWordNet

15 0.032816183 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

16 0.031885713 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

17 0.031785827 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

18 0.03134523 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation

19 0.030578747 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

20 0.030196447 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.075), (1, 0.018), (2, -0.012), (3, 0.027), (4, -0.042), (5, 0.023), (6, 0.006), (7, 0.02), (8, -0.002), (9, -0.04), (10, -0.031), (11, -0.024), (12, -0.029), (13, -0.039), (14, -0.047), (15, 0.009), (16, -0.005), (17, 0.005), (18, -0.013), (19, -0.002), (20, 0.075), (21, 0.036), (22, -0.034), (23, -0.006), (24, -0.025), (25, -0.018), (26, -0.009), (27, -0.044), (28, 0.001), (29, -0.029), (30, -0.059), (31, -0.019), (32, -0.002), (33, 0.069), (34, 0.008), (35, 0.067), (36, 0.024), (37, -0.02), (38, 0.059), (39, 0.035), (40, 0.015), (41, -0.034), (42, -0.022), (43, -0.002), (44, 0.027), (45, -0.005), (46, 0.069), (47, 0.094), (48, -0.007), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92658484 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

Author: Anja Belz ; Eric Kow


2 0.63462198 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

3 0.6010412 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

Author: Kirill Kireyev ; Thomas K Landauer

Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like “dog” is familiar to even a first-grader, whereas a more advanced word like “focal” does not usually enter learners’ vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that “children [… ] acquire word meanings in roughly the same sequence” (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. … 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008). Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated to corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w.

4 0.5929808 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

Author: Samuel Brody ; Paul Kantor

Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.

5 0.58151358 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

Author: Saif Mohammad

Abstract: Colour is a key component in the successful dissemination of information. Since many real-world concepts are associated with colour, for example danger with red, linguistic information is often complemented with the use of appropriate colours in information visualization and product marketing. Yet, there is no comprehensive resource that captures concept–colour associations. We present a method to create a large word–colour association lexicon by crowdsourcing. A wordchoice question was used to obtain sense-level annotations and to ensure data quality. We focus especially on abstract concepts and emotions to show that even they tend to have strong colour associations. Thus, using the right colours can not only improve semantic coherence, but also inspire the desired emotional response.

6 0.55054849 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

7 0.54563469 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

8 0.53498298 252 acl-2011-Prototyping virtual instructors from human-human corpora

9 0.50224018 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

10 0.47870961 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

11 0.44760177 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

12 0.44201982 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

13 0.44097659 25 acl-2011-A Simple Measure to Assess Non-response

14 0.43234593 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

15 0.41515106 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

16 0.41264635 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

17 0.40974399 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

18 0.39764449 74 acl-2011-Combining Indicators of Allophony

19 0.38899457 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

20 0.38716641 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.035), (5, 0.022), (17, 0.03), (26, 0.026), (31, 0.011), (37, 0.047), (39, 0.053), (41, 0.054), (55, 0.031), (59, 0.058), (72, 0.041), (85, 0.332), (91, 0.047), (96, 0.119)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73860693 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

Author: Anja Belz ; Eric Kow


2 0.6271652 154 acl-2011-How to train your multi bottom-up tree transducer

Author: Andreas Maletti

Abstract: The local multi bottom-up tree transducer is introduced and related to the (non-contiguous) synchronous tree sequence substitution grammar. It is then shown how to obtain a weighted local multi bottom-up tree transducer from a bilingual and biparsed corpus. Finally, the problem of non-preservation of regularity is addressed. Three properties that ensure preservation are introduced, and it is discussed how to adjust the rule extraction process such that they are automatically fulfilled.

3 0.56573832 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.

4 0.50144798 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Author: Y. Albert Park ; Roger Levy

Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpora of corrected sentences. We also present a cheap and efficient way to provide automated evaluation re- sults for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.

5 0.45417422 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions

Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe

Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).

6 0.45384201 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

7 0.45286596 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

8 0.45223069 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

9 0.45222276 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

10 0.44840971 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

11 0.4473705 178 acl-2011-Interactive Topic Modeling

12 0.44690841 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

13 0.44673297 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

14 0.44648707 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

15 0.44604558 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

16 0.44449812 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

17 0.44412011 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons

18 0.44177547 187 acl-2011-Jointly Learning to Extract and Compress

19 0.44175816 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

20 0.44166628 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing