acl acl2011 acl2011-20 knowledge-graph by maker-knowledge-mining

20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts


Source: pdf

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publicly available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. [sent-6, score-0.245]

2 In particular, we use rank preference learning to explicitly model the grade relationships between scripts. [sent-7, score-0.381]

3 A comparison between regression and rank preference models further supports our method. [sent-9, score-0.309]

4 Experimental results on the first publicly available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. [sent-10, score-0.374]

5 1 Introduction The task of automated assessment of free text focuses on automatically analysing and assessing the quality of writing competence. [sent-12, score-0.474]

6 Automated assessment systems exploit textual features in order to measure the overall quality and assign a score to a text. [sent-13, score-0.236]

7 More recent systems have used more sophisticated automated text processing techniques to measure grammaticality, textual coherence, prespecified errors, and so forth. [sent-15, score-0.168]

8 Deployment of automated assessment systems gives a number of advantages, such as the reduced workload in marking texts, especially when applied to large-scale assessments. [sent-22, score-0.46]

9 Additionally, automated systems guarantee the application of the same marking criteria, thus reducing inconsistency, which may arise when more than one human examiner is employed. [sent-23, score-0.451]

10 A recent review identifies twelve different automated free-text scoring systems (Williamson, 2009). [sent-28, score-0.251]

11 Several of these are now deployed in high-stakes assessment of examination scripts. [sent-32, score-0.32]

12 We make such a dataset of ESOL examination scripts available1 (see Section 2 for more details), describe our novel approach to the task, and provide results for our system on this dataset. [sent-37, score-0.269]

13 We address automated assessment as a supervised discriminative machine learning problem and particularly as a rank preference problem (Joachims, 2002). [sent-38, score-0.585]

14 Additionally, rank preference techniques (Joachims, 2002) allow us to explicitly learn an optimal ranking model of text quality. [sent-40, score-0.214]

15 Learning a ranking directly, rather than fitting a classifier score to a grade point scale after training, is both a more generic approach to the task and one which exploits the labelling information in the training data efficiently and directly. [sent-41, score-0.167]

16 However, although our corpus of manually-marked texts was produced by learners of English in response to prompts eliciting free-text answers, the marking criteria are primarily based on the accurate use of a range of different linguistic constructions. [sent-43, score-0.422]

17 For this reason, we believe that an approach which directly measures linguistic competence will be better suited to ESOL text assessment, and will have the additional advantage that it may not require retraining for new prompts or tasks. [sent-44, score-0.156]

18 As far as we know, this is the first application of a rank preference model to automated assessment (hereafter AA). [sent-45, score-0.585]

19 We report a consistent, comparable and replicable set of results based entirely on the new dataset and on public-domain tools and data, whilst also experimentally motivating some novel feature types for the AA task, thus extending the work described in (Briscoe et al., 2010). [sent-49, score-0.141]

20 In the following sections we describe in more detail the dataset used for training and testing, the system developed, the evaluation methodology, as well as ablation experiments aimed at studying the contribution of different feature types to the AA task. [sent-51, score-0.194]

21 We show experimentally that discriminative models with appropriate feature types can achieve performance close to the upper bound, as defined by the agreement between human examiners on the same test corpus. [sent-52, score-0.275]

22 For the purpose of this work, we extracted scripts produced by learners taking the First Certificate in English (FCE) exam, which assesses English at an upper-intermediate level. [sent-54, score-0.257]

23 Our data consists of 1141 scripts from the year 2000 for training, written by 1141 distinct learners, and 97 scripts from the year 2001 for testing, written by 97 distinct learners. [sent-69, score-0.3]

24 The prompts eliciting the free text are provided with the dataset. [sent-71, score-0.144]

25 3 Approach We treat automated assessment of ESOL text (see Section 2) as a rank preference learning problem (see Section 1). [sent-77, score-0.585]

26 1 Rank preference model SVMs have been extensively used for learning classification, regression and ranking functions. [sent-81, score-0.223]

27 In rank preference SVMs, the goal is to learn a ranking function which outputs a score for each data point, from which a global ordering of the data is constructed. [sent-86, score-0.214]

28 A rank preference model is not trained directly on this set of data objects {(x⃗n, rn)} and their labels; rather, a set of pair-wise difference vectors is created. [sent-93, score-0.214]
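The pair-wise construction can be sketched as follows. This is an illustrative reimplementation, not the authors' code: a toy feature matrix stands in for the extracted script features, and any linear classifier trained on the resulting difference vectors would act as the ranking function (the paper uses the ranking SVM of Joachims, 2002).

```python
def pairwise_differences(X, ranks):
    """Build pair-wise difference vectors from (feature vector, grade) pairs.

    For every pair of scripts with different grades, emit x_i - x_j
    labelled +1 when script i outranks script j, plus the mirrored
    pair labelled -1, so a classifier trained on these vectors
    learns a ranking function.
    """
    diffs, labels = [], []
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[i] == ranks[j]:
                continue  # ties carry no ordering information
            d = [a - b for a, b in zip(X[i], X[j])]
            sign = 1 if ranks[i] > ranks[j] else -1
            diffs.append(d)
            labels.append(sign)
            diffs.append([-v for v in d])
            labels.append(-sign)
    return diffs, labels

# Toy scripts: three feature vectors with grades 3 > 2 > 1.
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
grades = [3, 2, 1]
D, y = pairwise_differences(X, grades)
```

With three orderable scripts, each of the three pairs is emitted in both directions, giving six difference vectors with balanced labels.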

29 The principal advantage of applying rank preference learning to the AA task is that we explicitly model the grade relationships between scripts and do not need to apply a further regression step to fit the classifier output to the scoring scheme. [sent-99, score-0.709]

30 i. Lexical ngrams: (a) word unigrams; (b) word bigrams. [sent-106, score-0.143]

31 ii. Part-of-speech (PoS) ngrams: (a) PoS unigrams; (b) PoS bigrams; (c) PoS trigrams. [sent-107, score-0.195]
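A minimal sketch of these two feature groups (not the authors' extraction pipeline; the PoS tags below are illustrative placeholders rather than output of a real tagger):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(words, pos_tags):
    """Count the listed feature types: word uni/bigrams and
    PoS uni/bi/trigrams, keyed with a 'w'/'p' prefix so lexical
    and PoS features cannot collide."""
    feats = Counter()
    for n in (1, 2):
        feats.update(('w',) + g for g in ngrams(words, n))
    for n in (1, 2, 3):
        feats.update(('p',) + g for g in ngrams(pos_tags, n))
    return feats

words = ['the', 'cat', 'sat']
tags = ['DT', 'NN', 'VBD']
f = ngram_features(words, tags)
```

For the three-token example this yields 3 word unigrams, 2 word bigrams, 3 PoS unigrams, 2 PoS bigrams and 1 PoS trigram.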

32 Therefore, we believe that many (longer-distance) grammatical constructions and errors found in texts can be (implicitly) captured by this feature type. [sent-127, score-0.288]

33 However, we also use an error-rate calculated from the CLC error tags to obtain an upper bound for the performance of an automated error estimator (true CLC error-rate). [sent-134, score-0.319]

34 Next, we extend our language model with trigrams extracted from a subset of the texts contained in the CLC. [sent-137, score-0.267]

35 As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model. [sent-139, score-0.328]

36 The script length is based on the number of words and is mainly added to balance the effect the length of a script has on other features. [sent-144, score-0.176]

37 4 Evaluation In order to evaluate our AA system, we use two correlation measures, Pearson’s product-moment correlation coefficient and Spearman’s rank correlation coefficient (hereafter Pearson’s and Spearman’s correlation respectively). [sent-146, score-0.587]

38 Pearson’s correlation determines the degree to which two linearly dependent variables are related. [sent-147, score-0.167]

39 As Pearson’s correlation is sensitive to the distribution of data and, due to outliers, its value can be misleading, we also report Spearman’s correlation. [sent-148, score-0.167]

40 As our data contains some tied values, we calculate Spearman’s correlation by using Pearson’s correlation on the ranks. [sent-151, score-0.334]
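This tie-aware computation can be sketched as follows (an illustrative reimplementation: tied values receive the mean of the ranks they span, and Spearman's correlation is then just Pearson's correlation applied to the ranks):

```python
def average_ranks(values):
    """Rank values (1-based), assigning tied entries the mean of the
    ranks they span - the standard treatment for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman's correlation computed, as described above, as
    Pearson's correlation on the tie-adjusted ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the final computation, any monotone transformation of the scores leaves Spearman's correlation unchanged, which is why it is less sensitive to outliers than Pearson's.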

41 Table 1 presents the Pearson’s and Spearman’s correlation between the CLC scores and the AA system predicted values, when incrementally adding to the model the feature types described in Section 3. [sent-152, score-0.312]

42 Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0. [sent-155, score-0.219]

43 An evaluation of our best error detection method shows a Pearson correlation of 0. [sent-161, score-0.2]

44 In order to assess the independent as opposed to the order-dependent additive contribution of each feature type to the overall performance of the system, we run a number of ablation tests. [sent-165, score-0.15]

45 An ablation test consists of removing one feature of the system at a time and re-evaluating the model on the test set. [sent-166, score-0.158]
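The ablation loop can be sketched as below. This is not the authors' code: `train_and_evaluate` and the feature-type names are hypothetical stand-ins for retraining the ranking model and computing its test-set correlation.

```python
def ablation_results(feature_types, train_and_evaluate):
    """Leave-one-feature-type-out ablation: re-score the model with
    each feature type removed in turn and report the drop relative
    to the full model.

    `train_and_evaluate` is a hypothetical callable mapping a list
    of feature types to a test-set correlation score.
    """
    full = train_and_evaluate(feature_types)
    drops = {}
    for ft in feature_types:
        reduced = [f for f in feature_types if f != ft]
        drops[ft] = full - train_and_evaluate(reduced)  # contribution of ft
    return drops

# Toy stand-in scorer: each feature type contributes a fixed amount.
weights = {'word_ngrams': 0.3, 'pos_ngrams': 0.1, 'error_rate': 0.2}
score = lambda feats: sum(weights[f] for f in feats)
drops = ablation_results(list(weights), score)
```

Unlike the incremental additions of Table 1, each drop here measures a feature type's contribution independently of the order in which features were added.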

46 Table 2 presents Pearson’s and Spearman’s correlation between the CLC and our system, when removing one feature at a time. [sent-167, score-0.223]

47 One of the main approaches adopted by previous systems involves the identification of features that measure writing skill, and then the application of linear or stepwise regression to find optimal feature weights so that the correlation with manually assigned scores is maximised. [sent-177, score-0.416]

48 We trained a SVM regression model with our full set of feature types and compared it to the SVM rank preference model. [sent-178, score-0.365]

49 The rank preference model improves Pearson’s and Spearman’s correlation by 0. [sent-180, score-0.381]

50 067 respectively, and these differences are significant, suggesting that rank preference is a more appropriate model for the AA task. [sent-182, score-0.214]

51 Four senior and experienced ESOL examiners re-marked the 97 FCE test scripts drawn from 2001 exams, using the marking scheme from that year (see Section 2). [sent-183, score-0.516]

52 In order to obtain a ceiling for the performance of our system, we calculate the average correlation between the CLC and the examiners’ scores, and find an upper bound of 0. [sent-184, score-0.252]

53 In order to evaluate the overall performance of our system, we calculate its correlation with the four senior examiners in addition to the RASCH-adjusted CLC scores. [sent-187, score-0.437]

54 The average correlation of the AA system with the CLC and the examiner scores shows that it is close to the upper bound. Table 4: Pearson’s correlation of the AA system predicted values with the CLC and the examiners’ scores, where E1 refers to the first examiner, E2 to the second etc. [sent-189, score-0.658]

55 Table 5: Spearman’s correlation of the AA system predicted values with the CLC and the examiners’ scores, where E1 refers to the first examiner, E2 to the second etc. [sent-190, score-0.256]

56 Human–machine agreement is comparable to that of human–human agreement, with the exception of Pearson’s correlation with examiner E4 and Spearman’s correlation with examiners E1 and E4, where the discrepancies are higher. [sent-192, score-0.738]

57 However, our system is not measuring some properties of the scripts, such as discourse cohesion or relevance to the prompt eliciting the text, that examiners will take into account. [sent-194, score-0.445]

58 Powers et al. (2002) invited writing experts to trick the scoring capabilities of an earlier version of e-Rater (Burstein et al., 1998). [sent-198, score-0.146]

59 Our goal here is to determine the extent to which knowledge of the feature types deployed poses a threat to the validity of our system, where certain text generation strategies may give rise to large positive discrepancies. [sent-203, score-0.228]

60 As mentioned in Section 2, the marking criteria for FCE scripts are primarily based on the accurate use of a range of different grammatical constructions relevant to specific communicative goals, but our system assesses this indirectly. [sent-204, score-0.43]

61 We extracted 6 high-scoring FCE scripts from the CLC that do not overlap with our training and test data. [sent-205, score-0.15]

62 i. Randomly order: (a) word unigrams within a sentence; (b) word bigrams within a sentence; (c) word trigrams within a sentence; (d) sentences within a script. [sent-207, score-0.233]

63 ii. Swap words that have the same PoS within a sentence. Although the above modifications do not exhaust the potential challenges a deployed AA system might face, they represent a threat to the validity of our system since we are using a highly related feature set. [sent-208, score-0.254]
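One reading of the modifications of type i can be sketched as follows (the chunking of a sentence into consecutive n-grams is our assumption; the paper does not spell out the exact procedure):

```python
import random

def shuffle_ngrams(sentence, n, rng):
    """Modification type i(a)-(c): cut a tokenised sentence into
    consecutive n-token chunks and randomly reorder them (n=1
    shuffles single words; larger n leaves more local structure
    intact, which matches the smaller score discrepancies observed
    for larger n)."""
    chunks = [sentence[i:i + n] for i in range(0, len(sentence), n)]
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]

def shuffle_sentences(script, rng):
    """Modification type i(d): randomly reorder whole sentences."""
    out = list(script)
    rng.shuffle(out)
    return out

rng = random.Random(0)
sent = ['the', 'cat', 'sat', 'on', 'the', 'mat']
scrambled = shuffle_ngrams(sent, 2, rng)
shuffled_script = shuffle_sentences([['a'], ['b']], rng)
```

Both operations preserve the token multiset, so ngram-count features over short contexts change far less than the text's actual quality does, which is exactly the threat being probed.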

64 In total, we create 30 such ‘outlier’ texts, which were given to an ESOL examiner for marking. [sent-209, score-0.194]

65 Using the ‘outlier’ scripts as well as their original/unmodified versions, we ran our system on each modification separately and calculated the correlation between the predicted values and the examiner’s scores. [sent-210, score-0.446]

66 The predicted values of the system have a high correlation with the examiner’s scores when tested on ‘outlier’ texts of modification types i(a), i(b) and i(c). [sent-212, score-0.422]

67 However, as i(c) has a lower correlation compared to i(a) and i(b), it is likely that a random ordering of ngrams with N > 3 will further decrease performance. [sent-214, score-0.217]

68 A modification of type ii, where words with the same PoS within a sentence are swapped, results in a Pearson and Spearman correlation of 0. [sent-215, score-0.207]

69 This can be explained by the fact that texts produced using modification type ii contain a small portion of correct sentences. [sent-219, score-0.166]

70 However, the marking criteria are based on the overall writing quality. [sent-220, score-0.185]

71 As our system is not measuring discourse cohesion, discrepancies are much higher; the system’s predicted scores are high whilst the ones assigned by the examiner are very low. [sent-222, score-0.411]

72 However, for a writer to be able to generate text of this type already requires significant linguistic competence, whilst a number of generic methods for assessing text and/or discourse cohesion have been developed and could be deployed in an extended version of our system. [sent-223, score-0.247]

73 Recent comments in the British media have focussed on this issue, reporting that, for example, one deployed essay marking system assigned Winston Churchill’s speech ‘We Shall Fight on the Beaches’ a low score because of excessive repetition. [sent-225, score-0.398]

74 Linear regression is used to assign optimal feature weights that maximise the correlation with the examiner’s scores. [sent-236, score-0.191]

75 The main issue with this system is that features such as word length and script length are easy to manipulate independently of genuine writing ability, potentially undermining the validity of the system. [sent-237, score-0.254]

76 Additional features, representing stereotypical grammatical errors for example, are extracted using manually-coded task-specific detectors based, in part, on typical marking criteria. [sent-240, score-0.161]

77 Feature weights and/or scores can be fitted to a marking scheme by stepwise or linear regression. [sent-242, score-0.166]

78 However, the system contains manually developed task-specific components and requires retraining or tuning for each new prompt and assessment task. [sent-244, score-0.4]

79 , 2003) uses Latent Semantic Analysis (LSA) (Landauer and Foltz, 1998) to compute the semantic similarity between texts, at a specific grade point, and a test text. [sent-246, score-0.167]

80 The system is trained on topic and/or prompt specific texts while test texts are assigned a score based on the ones in the training set that are most similar. [sent-249, score-0.391]

81 Again, the system requires retraining or tuning for new prompts and assessment tasks. [sent-251, score-0.4]

82 This approach bears some similarities to our use of grammatical complexity and extragrammaticality features, but grammatical features represent only one component of our overall system, and of the task. [sent-255, score-0.177]

83 This system shows that treating AA as a text classification problem is viable, but the feature types are all fairly shallow, and the approach doesn’t make efficient use of the training data as a separate classifier is trained for each grade point. [sent-260, score-0.264]

84 Texts are clustered according to their grade and given an initial Z-score. [sent-263, score-0.167]

85 We have shown experimentally how rank preference models can be effectively deployed for automated assessment of ESOL free-text answers. [sent-271, score-0.66]

86 Based on a range of feature types automatically extracted using generic text processing techniques, our system achieves performance close to the upper bound for the task. [sent-272, score-0.182]

87 Ablation tests highlight the contribution of each feature type to the overall performance, while the significance of the resulting improvements in correlation with human scores has been calculated. [sent-273, score-0.256]

88 A comparison between regression and rank preference models further supports our approach. [sent-274, score-0.309]

89 Preliminary experiments based on a set of ‘outlier’ texts have shown the types of texts for which the system’s scoring capability can be undermined. [sent-275, score-0.335]

90 We plan to experiment with better error detection techniques, since the overall error-rate of a script is one of the most discriminant features. [sent-276, score-0.154]

91 (2010) describe an approach to automatic off-prompt detection which does not require retraining for each new question prompt and which we plan to integrate with our system. [sent-278, score-0.156]

92 It is clear from the ‘outlier’ experiments reported here that our system would benefit from features assessing discourse coherence, and to a lesser extent from features assessing semantic (selectional) coherence over longer bounds than those captured by ngrams. [sent-279, score-0.224]

93 188 Finally, we hope that the release of the training and test dataset described here will facilitate further research on the AA task for ESOL free text and, in particular, precise comparison of different systems, feature types, and grade fitting methods. [sent-281, score-0.259]

94 We are also grateful to Cambridge Assessment for arranging for the test scripts to be re-marked by four of their senior examiners. [sent-283, score-0.25]

95 Burstein, editors, Automated essay scoring: A cross-disciplinary perspective, pages 71–86. [sent-330, score-0.193]

96 Burstein, editors, Automated essay scoring: A cross-disciplinary perspective, pages 87–112. [sent-384, score-0.193]

97 Evaluation of text coherence for electronic essay scoring systems. [sent-396, score-0.333]

98 Burstein, editors, Automated essay scoring: A cross-disciplinary perspective, pages 43–54. [sent-412, score-0.193]

99 Stumping e-rater: challenging the validity of automated essay scoring. [sent-429, score-0.423]

100 An overview of current research on automated essay grading. [sent-463, score-0.361]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clc', 0.353), ('esol', 0.24), ('aa', 0.214), ('assessment', 0.203), ('examiner', 0.194), ('essay', 0.193), ('examiners', 0.177), ('spearman', 0.17), ('automated', 0.168), ('correlation', 0.167), ('grade', 0.167), ('pearson', 0.156), ('scripts', 0.15), ('burstein', 0.129), ('preference', 0.128), ('texts', 0.126), ('outlier', 0.112), ('rasp', 0.106), ('fce', 0.1), ('rudner', 0.1), ('briscoe', 0.098), ('prompt', 0.098), ('prompts', 0.098), ('regression', 0.095), ('marking', 0.089), ('attali', 0.088), ('script', 0.088), ('rank', 0.086), ('scoring', 0.083), ('ukwac', 0.08), ('deployed', 0.075), ('grammatical', 0.072), ('landauer', 0.071), ('cambridge', 0.068), ('learners', 0.063), ('writing', 0.063), ('validity', 0.062), ('miltsakaki', 0.061), ('ablation', 0.061), ('intellimetric', 0.06), ('lonsdale', 0.06), ('senior', 0.06), ('retraining', 0.058), ('svms', 0.057), ('coherence', 0.057), ('gr', 0.056), ('feature', 0.056), ('shermis', 0.053), ('trigrams', 0.052), ('ngrams', 0.05), ('unigrams', 0.049), ('powers', 0.049), ('whilst', 0.049), ('predicted', 0.048), ('ij', 0.047), ('eliciting', 0.046), ('grades', 0.046), ('kingdom', 0.046), ('discourse', 0.046), ('ps', 0.046), ('bigrams', 0.044), ('kukich', 0.044), ('assesses', 0.044), ('bound', 0.043), ('lsa', 0.043), ('upper', 0.042), ('examination', 0.042), ('fitted', 0.042), ('grading', 0.042), ('karen', 0.042), ('system', 0.041), ('assessing', 0.04), ('bloom', 0.04), ('deployment', 0.04), ('ferraresi', 0.04), ('iea', 0.04), ('maximise', 0.04), ('medlock', 0.04), ('proxies', 0.04), ('rasch', 0.04), ('remarked', 0.04), ('valenti', 0.04), ('waken', 0.04), ('modification', 0.04), ('cohesion', 0.037), ('joachims', 0.036), ('dataset', 0.036), ('hereafter', 0.035), ('assessor', 0.035), ('publically', 0.035), ('sleator', 0.035), ('stepwise', 0.035), ('threat', 0.035), ('pos', 0.034), ('constructions', 0.034), ('error', 0.033), ('overall', 0.033), ('committed', 0.033), ('competence', 0.033), ('discrepancies', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publicly available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

2 0.21696123 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

3 0.17382966 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea

Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.

4 0.13467075 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

Author: Miao Chen ; Klaus Zechner

Abstract: This paper focuses on identifying, extracting and evaluating features related to syntactic complexity of spontaneous spoken responses as part of an effort to expand the current feature set of an automated speech scoring system in order to cover additional aspects considered important in the construct of communicative competence. Our goal is to find effective features, selected from a large set of features proposed previously and some new features designed in analogous ways from a syntactic complexity perspective that correlate well with human ratings of the same spoken responses, and to build automatic scoring models based on the most promising features by using machine learning methods. On human transcriptions with manually annotated clause and sentence boundaries, our best scoring model achieves an overall Pearson correlation with human rater scores of r=0.49 on an unseen test set, whereas correlations of models using sentence or clause boundaries from automated classifiers are around r=0.2. 1

5 0.12147462 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

Author: Samuel Brody ; Paul Kantor

Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.

6 0.11849819 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

7 0.10293627 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

8 0.097753681 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations

9 0.085815378 44 acl-2011-An exponential translation model for target language morphology

10 0.084686235 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

11 0.083864771 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

12 0.082703091 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

13 0.071057305 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

14 0.066204555 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

15 0.065524429 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

16 0.065063551 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

17 0.06297221 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

18 0.062580891 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

19 0.062279649 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

20 0.062201042 52 acl-2011-Automatic Labelling of Topic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.192), (1, 0.026), (2, -0.044), (3, 0.017), (4, -0.081), (5, -0.018), (6, 0.035), (7, -0.014), (8, 0.009), (9, -0.032), (10, -0.044), (11, -0.088), (12, -0.015), (13, 0.043), (14, -0.074), (15, 0.081), (16, -0.074), (17, -0.034), (18, -0.022), (19, -0.019), (20, 0.058), (21, 0.011), (22, -0.095), (23, 0.019), (24, -0.039), (25, 0.008), (26, -0.038), (27, -0.062), (28, 0.025), (29, 0.035), (30, -0.044), (31, 0.002), (32, -0.068), (33, 0.157), (34, -0.05), (35, 0.044), (36, -0.041), (37, 0.002), (38, -0.001), (39, 0.145), (40, -0.026), (41, 0.005), (42, 0.006), (43, -0.113), (44, 0.096), (45, -0.058), (46, 0.198), (47, 0.132), (48, 0.099), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91229445 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

2 0.73127127 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

Author: Miao Chen ; Klaus Zechner

Abstract: This paper focuses on identifying, extracting and evaluating features related to syntactic complexity of spontaneous spoken responses as part of an effort to expand the current feature set of an automated speech scoring system in order to cover additional aspects considered important in the construct of communicative competence. Our goal is to find effective features, selected from a large set of features proposed previously and some new features designed in analogous ways from a syntactic complexity perspective that correlate well with human ratings of the same spoken responses, and to build automatic scoring models based on the most promising features by using machine learning methods. On human transcriptions with manually annotated clause and sentence boundaries, our best scoring model achieves an overall Pearson correlation with human rater scores of r=0.49 on an unseen test set, whereas correlations of models using sentence or clause boundaries from automated classifiers are around r=0.2. 1

3 0.7177943 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

Author: Anja Belz ; Eric Kow

Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computergenerated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales. 1 Background and Introduction Rating scales have been used for measuring human perception of various stimuli for a long time, at least since the early 20th century (Freyd, 1923). First used in psychology and psychophysics, they are now also common in a variety of other disciplines, including NLP. Discrete scales are the only type of scale commonly used for qualitative assessments of computer-generated language in NLP (e.g. in the DUC/TAC evaluation competitions). Continuous scales are commonly used in psychology and related fields, but are virtually unknown in NLP. While studies assessing the quality of individual scales and comparing different types of rating scales are common in psychology and related fields, such studies hardly exist in NLP, and so at present little is known about whether discrete scales are a suitable rating tool for NLP evaluation tasks, or whether continuous scales might provide a better alternative. A range of studies from sociology, psychophysiology, biometrics and other fields have compared 230 Kow} @bright on .ac .uk discrete and continuous scales. Results tend to differ for different types of data. E.g., results from pain measurement show a continuous scale to outperform a discrete scale (ten Klooster et al., 2006). Other results (Svensson, 2000) from measuring students’ ease of following lectures show a discrete scale to outperform a continuous scale. When measuring dyspnea, Lansing et al. 
(2003) found a hybrid scale to perform on a par with a discrete scale.

Another consideration is the types of data produced by discrete and continuous scales. Parametric methods of statistical analysis, which are far more sensitive than non-parametric ones, are commonly applied to both discrete and continuous data. However, parametric methods make very strong assumptions about data, including that it is numerical and normally distributed (Siegel, 1957). If these assumptions are violated, then the significance of results is overestimated. Clearly, the numerical assumption does not hold for the categorial data produced by discrete scales, and it is unlikely to be normally distributed. Many researchers are happier to apply parametric methods to data from continuous scales, and some simply take it as read that such data is normally distributed (Lansing et al., 2003).

Our aim in the present study was to systematically assess and compare discrete and continuous scales when used for the qualitative assessment of computer-generated language. We start with an overview of assessment scale types (Section 2). We describe the experiments we conducted (Section 4), the data we used in them (Section 3), and the properties we examined in our inter-scale comparisons (Section 5), before presenting our results (Section 6), and some conclusions (Section 7).

[Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: shortpapers, pages 230–235, Portland, Oregon, June 19-24, 2011.]

Figure 1: Evaluation of Readability in DUC'06, comprising 5 evaluation criteria, including Grammaticality. Q1: Grammaticality: "The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read." Options: 1. Very Poor; 2. Poor; 3. Barely Acceptable; 4. Good; 5. Very Good. Evaluation task for each summary text: evaluator selects one of the options (1–5) to represent quality of the summary in terms of the criterion.

2 Rating Scales

With Verbal Descriptor Scales (VDSs), participants give responses on ordered lists of verbally described and/or numerically labelled response categories, typically varying in number from 2 to 11 (Svensson, 2000). An example of a VDS used in NLP is shown in Figure 1. VDSs are used very widely in contexts where computationally generated language is evaluated, including in dialogue, summarisation, MT and data-to-text generation.

Visual analogue scales (VASs) are far less common outside psychology and related areas than VDSs. Responses are given by selecting a point on a typically horizontal line (although vertical lines have also been used (Scott and Huskisson, 2003)), on which the two end points represent the extreme values of the variable to be measured. Such lines can be mono-polar or bi-polar, and the end points are labelled with an image (smiling/frowning face), or a brief verbal descriptor, to indicate which end of the line corresponds to which extreme of the variable. The labels are commonly chosen to represent a point beyond any response actually likely to be chosen by raters. There is only one example of a VAS in NLP system evaluation that we are aware of (Gatt et al., 2009).

Hybrid scales, known as graphic rating scales, combine the features of VDSs and VASs, and are also used in psychology. Here, the verbal descriptors are aligned along the line of a VAS and the endpoints are typically unmarked (Svensson, 2000). We are aware of one example in NLP (Williams and Reiter, 2008); we did not investigate this scale in our study.

Figure 2: Evaluation of Grammaticality with alternative VAS scale (cf. Figure 1): a horizontal line with end-point labels (garbled in extraction). Evaluation task for each summary text: evaluator selects a place on the line to represent quality of the summary in terms of the criterion.

We used the following two specific scale designs in our experiments:

VDS-7: 7 response categories, numbered (7 = best) and verbally described (e.g. 7 = "perfectly fluent" for Fluency, and 7 = "perfectly clear" for Clarity). Response categories were presented in a vertical list, with the best category at the bottom. Each category had a tick-box placed next to it; the rater's task was to tick the box by their chosen rating.

VAS: a horizontal, bi-polar line, with no ticks on it, mapping to 0–100. In the image description tests, statements identified the left end as negative, the right end as positive; in the weather forecast tests, the positive end had a smiling face and the label "statement couldn't be clearer/read better"; the negative end had a frowning face and the label "statement couldn't be more unclear/read worse". The raters' task was to move a pointer (initially in the middle of the line) to the place corresponding to their rating.

3 Data

Weather forecast texts: In one half of our evaluation experiments we used human-written and automatically generated weather forecasts for the same weather data. The data in our evaluations was for 22 different forecast dates and included outputs from 10 generator systems and one set of human forecasts. This data has also been used for comparative system evaluation in previous research (Langner, 2010; Angeli et al., 2010; Belz and Kow, 2009).
The following are examples of weather forecast texts from the data:

1: SSE 28-32 INCREASING 36-40 BY MID AFTERNOON
2: S'LY 26-32 BACKING SSE 30-35 BY AFTERNOON INCREASING 35-40 GUSTS 50 BY MID EVENING

Image descriptions: In the other half of our evaluations, we used human-written and automatically generated image descriptions for the same images. The data in our evaluations was for 112 different image sets and included outputs from 6 generator systems and 2 sets of human-authored descriptions. This data was originally created in the TUNA Project (van Deemter et al., 2006). The following is an example of an item from the corpus, consisting of a set of images and a description for the entity in the red frame: "the small blue fan".

4 Experimental Set-up

4.1 Evaluation criteria

Fluency/Readability: Both the weather forecast and image description evaluation experiments used a quality criterion intended to capture 'how well a piece of text reads', called Fluency in the latter, Readability in the former.

Adequacy/Clarity: In the image description experiments, the second quality criterion was Adequacy, explained as "how clear the description is", and "how easy it would be to identify the image from the description". This criterion was called Clarity in the weather forecast experiments, explained as "how easy is it to understand what is being described".

4.2 Raters

In the image experiments we used 8 raters (native speakers) in each experiment, from cohorts of 3rd-year undergraduate and postgraduate students doing a degree in a linguistics-related subject. They were paid and spent about 1 hour doing the experiment. In the weather forecast experiments, we used 22 raters in each experiment, from among academic staff at our own university. They were not paid and spent about 15 minutes doing the experiment.
4.3 Summary overview of experiments

Weather VDS-7 (A): VDS-7 scale; weather forecast data; criteria: Readability and Clarity; 22 raters (university staff) each assessing 22 forecasts.

Weather VDS-7 (B): exact repeat of Weather VDS-7 (A), including same raters.

Weather VAS: VAS scale; 22 raters (university staff), no overlap with raters in Weather VDS-7 experiments; other details same as in Weather VDS-7.

Image VDS-7: VDS-7 scale; image description data; 8 raters (linguistics students) each rating 112 descriptions; criteria: Fluency and Adequacy.

Image VAS (A): VAS scale; 8 raters (linguistics students), no overlap with raters in Image VDS-7; other details same as in Image VDS-7 experiment.

Image VAS (B): exact repeat of Image VAS (A), including same raters.

4.4 Design features common to all experiments

In all our experiments we used a Repeated Latin Squares design to ensure that each rater sees the same number of outputs from each system and for each text type (forecast date/image set). Following detailed instructions, raters first did a small number of practice examples, followed by the texts to be rated, in an order randomised for each rater. Evaluations were carried out via a web interface. Raters were allowed to interrupt the experiment, and in the case of the 1-hour-long image description evaluation they were encouraged to take breaks.

5 Comparison and Assessment of Scales

Validity is the extent to which an assessment method measures what it is intended to measure (Svensson, 2000). Validity is often impossible to assess objectively, as is the case for all our criteria except Adequacy, the validity of which we can directly test by looking at correlations with the accuracy with which participants in a separate experiment identify the intended images given their descriptions. A standard method for assessing Reliability is Kendall's W, a coefficient of concordance, measuring the degree to which different raters agree in their ratings.
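Kendall's W itself is simple to compute from a matrix of ratings. The sketch below is a minimal pure-Python illustration on made-up ratings (not the paper's data), and it omits the usual correction for tied ranks:

```python
def kendalls_w(ratings):
    """Kendall's coefficient of concordance W for m raters rating
    the same n items. ratings: list of m lists, each of length n.
    Returns a value in [0, 1]; 1 means perfect agreement.
    (No tie correction, for simplicity.)"""
    m, n = len(ratings), len(ratings[0])

    def to_ranks(scores):
        # Convert one rater's raw scores to ranks (1 = lowest score).
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        ranks = [0] * len(scores)
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        return ranks

    rank_matrix = [to_ranks(r) for r in ratings]
    # Sum of ranks per item, then squared deviations from the mean sum.
    rank_sums = [sum(col) for col in zip(*rank_matrix)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Two raters who order four items identically agree perfectly:
print(kendalls_w([[1, 2, 3, 4], [10, 20, 30, 40]]))  # → 1.0
```

In practice a tie-corrected implementation (or a library routine) would be applied to the real rating matrices.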
We report W for all 6 experiments.

Stability refers to the extent to which the results of an experiment run on one occasion agree with the results of the same experiment (with the same raters) run on a different occasion. In the present study, we assess stability in an intra-rater, test-retest design, assessing the agreement between the same participant's responses in the first and second runs of the test with Pearson's product-moment correlation coefficient. We report these measures between ratings given in Image VAS (A) vs. those given in Image VAS (B), and between ratings given in Weather VDS-7 (A) vs. those given in Weather VDS-7 (B).

We assess Interchangeability, that is, the extent to which our VDS and VAS scales agree, by computing Pearson's and Spearman's coefficients between results. We report these measures for all pairs of weather forecast/image description evaluations.

We assess the Sensitivity of our scales by determining the number of significant differences between different systems and human authors detected by each scale. We also look at the relative effect of the different experimental factors by computing the F-Ratio for System (the main factor under investigation, so its relative effect should be high), Rater and Text Type (their effect should be low). F-ratios were determined by a one-way ANOVA with the evaluation criterion in question as the dependent variable and System, Rater or Text Type as grouping factors.

6 Results

6.1 Interchangeability and Reliability for system/human authored image descriptions

Interchangeability: Pearson's r between the means per system/human in the three image description evaluation experiments were as follows (Spearman's ρ shown in brackets): [correlation table garbled in extraction]. For both Adequacy and Fluency, correlations between Image VDS-7 and Image VAS (A) (the main VAS experiment) are extremely high, meaning that they could substitute for each other here.
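The two coefficients used in these interchangeability and stability checks can be sketched in a few lines of Python; the per-system mean scores below are hypothetical, not taken from the paper:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on ranks
    (no tie handling, for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

vds = [4.1, 3.2, 5.6, 2.8]      # hypothetical mean score per system, VDS-7
vas = [62.0, 48.5, 81.0, 40.2]  # hypothetical mean score per system, VAS
print(round(pearson_r(vds, vas), 3))    # linear agreement between the scales
print(round(spearman_rho(vds, vas), 3)) # agreement on the system ranking
```

Spearman's ρ asks only whether the two scales rank the systems the same way, which is why it can differ from Pearson's r on the same data (as the weather forecast results below show).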
Reliability: Inter-rater agreement in terms of Kendall's W in each of the experiments: [table of W values per criterion and experiment garbled in extraction]. W was higher in the VAS data in the case of Fluency, whereas for Adequacy, W was the same for the VDS data and VAS (B), and higher in the VDS data than in the VAS (A) data.

6.2 Interchangeability and Reliability for system/human authored weather forecasts

Interchangeability: The correlation coefficients (Pearson's r with Spearman's ρ in brackets) between the means per system/human in the weather forecast experiments were as follows: [correlation table garbled in extraction]. For both Clarity and Readability, correlations between Weather VDS-7 (A) (the main VDS-7 experiment) and Weather VAS (A) are again very high, although rank-correlation is somewhat lower.

Reliability: Inter-rater agreement in terms of Kendall's W was as follows: [table garbled in extraction]. This time the highest agreement for both Clarity and Readability was in the VDS-7 data.

6.3 Stability tests for image and weather data

Pearson's r between ratings given by the same raters first in Image VAS (A) and then in Image VAS (B) was .666 for Adequacy, .593 for Fluency. Between ratings given by the same raters first in Weather VDS-7 (A) and then in Weather VDS-7 (B), Pearson's r was .656 for Clarity, .704 for Readability. (All significant at p < .01.) Note that these are computed on individual scores (rather than means as in the correlation figures given in previous sections).

6.4 F-ratios and post-hoc analysis for image data

The table below shows F-ratios determined by a one-way ANOVA with the evaluation criterion in question (Adequacy/Fluency) as the dependent variable and System/Rater/Text Type as the grouping factor. Note that for System a high F-ratio is desirable, but a low F-ratio is desirable for other factors.
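The F-ratio from a one-way ANOVA is the between-group mean square divided by the within-group mean square. A minimal sketch with invented ratings (two systems, four ratings each; not the paper's data):

```python
def f_ratio(groups):
    """One-way ANOVA F-ratio. groups: one list of scores per level of
    the grouping factor (e.g. per system). A large F means the factor
    explains far more variance than remains within its levels."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Ratings for two systems: clearly separated groups give a large F.
print(round(f_ratio([[5, 6, 5, 6], [2, 1, 2, 1]]), 6))  # → 96.0
```

Run once with System as the grouping factor (high F expected) and again with Rater or Text Type (low F expected), mirroring the comparison described above.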
[F-ratio table garbled in extraction.] For System, the main factor under investigation, VDS-7 found 8 for Adequacy and 14 for Fluency; VAS (A) found 7 for Adequacy and 15 for Fluency.

6.5 F-ratios and post-hoc analysis for weather data

The table below shows F-ratios analogous to the previous section (for Clarity/Readability). [Table garbled in extraction.] For System, VDS-7 (A) found 24 for Clarity, 23 for Readability; VAS found 25 for Clarity, 26 for Readability.

6.6 Scale validity test for image data

Our final table of results shows Pearson's correlation coefficients (calculated on means per system) between the Adequacy data from the three image description evaluation experiments on the one hand, and the data from an extrinsic experiment in which we measured the accuracy with which participants identified the intended image described by a description: [correlation table garbled in extraction]. The correlation between Adequacy and ID Accuracy was strong and highly significant in all three image description evaluation experiments, but strongest in VAS (B), and weakest in VAS (A). For comparison, Pearson's between Fluency and ID Accuracy ranged between .3 and .5, whereas Pearson's between Adequacy and ID Speed (also measured in the same image identification experiment) ranged between -.35 and -.29.

7 Discussion and Conclusions

Our interchangeability results (Sections 6.1 and 6.2) indicate that the VAS and VDS-7 scales we have tested can substitute for each other in our present evaluation tasks in terms of the mean system scores they produce. Where we were able to measure validity (Section 6.6), both scales were shown to be similarly valid, predicting image identification accuracy figures from a separate experiment equally well. Stability (Section 6.3) was marginally better for VDS-7 data, and Reliability (Sections 6.1 and 6.2) was better for VAS data in the image description evaluations, but (mostly) better for VDS-7 data in the weather forecast evaluations.
Finally, the VAS experiments found greater numbers of statistically significant differences between systems in 3 out of 4 cases (Section 6.5). Our own raters strongly prefer working with VAS scales over VDSs. This has also long been clear from the psychology literature (Svensson, 2000), where raters are typically found to prefer VAS scales over VDSs, which can be a "constant source of vexation to the conscientious rater when he finds his judgments falling between the defined points" (Champney, 1941). Moreover, if a rater's judgment falls between two points on a VDS then they must make the false choice between the two points just above and just below their actual judgment. In this case we know that the point they end up selecting is not an accurate measure of their judgment but rather just one of two equally inaccurate ones (one of which goes unrecorded).

Our results establish (for our evaluation tasks) that VAS scales, so far unproven for use in NLP, are at least as good as VDSs, currently virtually the only scale in use in NLP. Combined with the fact that raters strongly prefer VASs and that they are regarded as more amenable to parametric means of statistical analysis, this indicates that VAS scales should be used more widely for NLP evaluation tasks.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 15th Conference on Empirical Methods in Natural Language Processing (EMNLP'10).

Anja Belz and Eric Kow. 2009. System building cost vs. output quality in data-to-text generation. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 16–24.

H. Champney. 1941. The measurement of parent behavior. Child Development, 12(2):131.

M. Freyd. 1923. The graphic rating scale. Journal of Educational Psychology, 14:83–102.

A. Gatt, A. Belz, and E. Kow. 2009. The TUNA Challenge 2009: Overview and evaluation results.
In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG'09), pages 198–206.

Brian Langner. 2010. Data-driven Natural Language Generation: Making Machines Talk Like Humans Using Natural Corpora. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

Robert W. Lansing, Shakeeb H. Moosavi, and Robert B. Banzett. 2003. Measurement of dyspnea: word labeled visual analog scale vs. verbal ordinal scale. Respiratory Physiology & Neurobiology, 134(2):77–83.

J. Scott and E. C. Huskisson. 2003. Vertical or horizontal visual analogue scales. Annals of the Rheumatic Diseases, 38:560.

Sidney Siegel. 1957. Non-parametric statistics. The American Statistician, 11(3):13–19.

Elisabeth Svensson. 2000. Comparison of the quality of assessments using continuous and discrete ordinal rating scales. Biometrical Journal, 42(4):417–434.

P. M. ten Klooster, A. P. Klaar, E. Taal, R. E. Gheith, J. J. Rasker, A. K. El-Garf, and M. A. van de Laar. 2006. The validity and reliability of the graphic rating scale and verbal rating scale for measuring pain across cultures: A study in Egyptian and Dutch women with rheumatoid arthritis. The Clinical Journal of Pain, 22(9):827–830.

Kees van Deemter, Ielka van der Sluis, and Albert Gatt. 2006. Building a semantically transparent corpus for the generation of referring expressions. In Proceedings of the 4th International Conference on Natural Language Generation, pages 130–132, Sydney, Australia, July.

S. Williams and E. Reiter. 2008. Generating basic skills reports for low-skilled readers. Natural Language Engineering, 14(4):495–525.

4 0.70896631 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge

Author: Kirill Kireyev ; Thomas K Landauer

Abstract: While computational estimation of difficulty of words in the lexicon is useful in many educational and assessment applications, the concept of scalar word difficulty and current corpus-based methods for its estimation are inadequate. We propose a new paradigm called word meaning maturity which tracks the degree of knowledge of each word at different stages of language learning. We present a computational algorithm for estimating word maturity, based on modeling language acquisition with Latent Semantic Analysis. We demonstrate that the resulting metric not only correlates well with external indicators, but captures deeper semantic effects in language. 1 Motivation It is no surprise that through stages of language learning, different words are learned at different times and are known to different extents. For example, a common word like "dog" is familiar to even a first-grader, whereas a more advanced word like "focal" does not usually enter learners' vocabulary until much later. Although individual rates of learning words may vary between high- and low-performing students, it has been observed that "children […] acquire word meanings in roughly the same sequence" (Biemiller, 2008). The aim of this work is to model the degree of knowledge of words at different learning stages. Such a metric would have extremely useful applications in personalized educational technologies, for the purposes of accurate assessment and personalized vocabulary instruction. … 2 Rethinking Word Difficulty Previously, related work in education and psychometrics has been concerned with measuring word difficulty or classifying words into different difficulty categories. Examples of such approaches include creation of word lists for targeted vocabulary instruction at various grade levels that were compiled by educational experts, such as Nation (1993) or Biemiller (2008).
Such word difficulty assignments are also implicitly present in some readability formulas that estimate difficulty of texts, such as Lexiles (Stenner, 1996), which include a lexical difficulty component based on the frequency of occurrence of words in a representative corpus, on the assumption that word difficulty is inversely correlated to corpus frequency. Additionally, research in psycholinguistics has attempted to outline and measure psycholinguistic dimensions of words such as age-of-acquisition and familiarity, which aim to track when certain words become known and how familiar they appear to an average person. Importantly, all such word difficulty measures can be thought of as functions that assign a single scalar value to each word w: !

5 0.67945248 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

Author: Wenting Xiong ; Diane Litman

Abstract: Identifying peer-review helpfulness is an important task for improving the quality of feedback that students receive from their peers. As a first step towards enhancing existing peerreview systems with new functionality based on helpfulness detection, we examine whether standard product review analysis techniques also apply to our new context of peer reviews. In addition, we investigate the utility of incorporating additional specialized features tailored to peer review. Our preliminary results show that the structural features, review unigrams and meta-data combined are useful in modeling the helpfulness of both peer reviews and product reviews, while peer-review specific auxiliary features can further improve helpfulness prediction.

6 0.65226167 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

7 0.60951507 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

8 0.59455341 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

9 0.54395443 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

10 0.52611822 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

11 0.52162272 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

12 0.50093287 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

13 0.49555004 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

14 0.49326479 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations

15 0.48015112 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining

16 0.47667286 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

17 0.47183239 74 acl-2011-Combining Indicators of Allophony

18 0.46858013 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

19 0.46505037 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

20 0.46055934 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.014), (2, 0.237), (5, 0.038), (17, 0.035), (26, 0.025), (31, 0.012), (37, 0.086), (39, 0.062), (41, 0.054), (55, 0.014), (59, 0.055), (72, 0.057), (91, 0.033), (96, 0.183), (98, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90226823 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life

Author: Sharon Gower Small ; Jennifer Strommer-Galley ; Tomek Strzalkowski

Abstract: We describe an annotation tool developed to assist in the creation of multi-modal action-communication corpora from on-line massively multi-player games, or MMGs. MMGs typically involve groups of players (5-30) who control their avatars and perform various activities (questing, competing, fighting, etc.) and communicate via chat or speech using assumed screen names. We collected a corpus of 48 group quests in Second Life that jointly involved 206 players who generated over 30,000 messages in quasi-synchronous chat during approximately 140 hours of recorded action. Multiple levels of coordinated annotation of this corpus (dialogue, movements, touch, gaze, wear, etc) are required in order to support development of automated predictors of selected real-life social and demographic characteristics of the players. The annotation tool presented in this paper was developed to enable efficient and accurate annotation of all dimensions simultaneously.

same-paper 2 0.80262631 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock

Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from that of a human examiner.

3 0.69646525 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

4 0.69334871 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

Author: Jason Naradowsky ; Kristina Toutanova

Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

5 0.69298381 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

6 0.69144648 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

7 0.69130337 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

8 0.68997121 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

9 0.68994141 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

10 0.68946666 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

11 0.68888098 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

12 0.68878329 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

13 0.68819606 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

14 0.6879462 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

15 0.68779451 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

16 0.68702614 178 acl-2011-Interactive Topic Modeling

17 0.68698657 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

18 0.68630821 117 acl-2011-Entity Set Expansion using Topic information

19 0.68620306 187 acl-2011-Jointly Learning to Extract and Compress

20 0.68609762 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates