acl acl2011 acl2011-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present a novel approach to grammatical error correction based on Alternating Structure Optimization. As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. We conduct an extensive evaluation for article and preposition errors using various feature sets. Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. Our approach also outperforms two commercial grammar checking software packages.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a novel approach to grammatical error correction based on Alternating Structure Optimization. [sent-4, score-0.364]
2 As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. [sent-5, score-0.38]
3 We conduct an extensive evaluation for article and preposition errors using various feature sets. [sent-6, score-0.531]
4 Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. [sent-7, score-0.414]
5 Our approach also outperforms two commercial grammar checking software packages. [sent-8, score-0.187]
6 1 Introduction Grammatical error correction (GEC) has been recognized as an interesting as well as commercially attractive problem in natural language processing (NLP), in particular for learners of English as a foreign or second language (EFL/ESL). [sent-9, score-0.31]
7 Despite the growing interest, research has been hindered by the lack of a large annotated corpus of learner text that is available for research purposes. [sent-10, score-0.375]
8 Learning GEC models directly from annotated learner corpora is not well explored, nor are methods that combine learner and non-learner text. [sent-12, score-0.642]
9 Previous work has either evaluated on artificial test instances as a substitute for real learner errors or on proprietary data that is not available to other researchers. [sent-14, score-0.453]
10 Our approach is able to train models on annotated learner corpora while still taking advantage of large non-learner corpora. [sent-18, score-0.34]
11 Second, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. [sent-19, score-0.38]
12 We conduct an extensive evaluation for article and preposition errors using six different feature sets proposed in previous work. [sent-20, score-0.531]
13 We compare our proposed ASO method with two baselines trained on non-learner text and learner text, respectively. [sent-21, score-0.393]
14 To the best of our knowledge, this is the first extensive comparison of different feature sets on real learner text, which is another contribution of our work. [sent-22, score-0.399]
15 It also outperforms two commercial grammar checking software packages in a manual evaluation. [sent-24, score-0.279]
16 2 Related Work In this section, we give a brief overview of related work on article and preposition errors. [sent-35, score-0.395]
17 The seminal work on grammatical error correction was done by Knight and Chander (1994) on article errors. [sent-38, score-0.524]
18 Work on preposition errors has used a similar classification approach and mainly differs in terms of the features employed (Chodorow et al. [sent-44, score-0.332]
19 Recent work has shown that training on annotated learner text can give better performance (Han et al. [sent-49, score-0.406]
20 , 2010) and that the observed word used by the writer is an important feature (Rozovskaya and Roth, 2010b). [sent-50, score-0.213]
21 Almost no work has investigated ways to combine learner and non-learner text for training. [sent-54, score-0.337]
22 The only exception is Gamon (2010), who combined features from the output of logistic-regression classifiers and language models trained on non-learner text in a meta-classifier trained on learner text. [sent-55, score-0.446]
23 In this work, we show a more direct way to combine learner and non-learner text in a single model. [sent-56, score-0.337]
24 3 Task Description In this work, we focus on article and preposition errors, as they are among the most frequent types of errors made by EFL learners. [sent-61, score-0.469]
25 Correction Task There is an important difference between training on annotated learner text and training on non-learner text, namely whether the observed word can be used as a feature or not. [sent-64, score-0.539]
26 The word choice of the writer is “blanked out” from the text and serves as the correct class. [sent-66, score-0.216]
27 This selection task formulation is convenient as training examples can be created “for free” from any text that is assumed to be free of grammatical errors. [sent-69, score-0.196]
28 We define the more realistic correction task as follows: given a particular word and its context, propose an appropriate correction. [sent-70, score-0.19]
29 The proposed correction can be identical to the observed word, i.e., no correction is needed. [sent-71, score-0.251]
30 The main difference is that the word choice of the writer can be encoded as part of the features. [sent-74, score-0.152]
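To make the distinction concrete, the following minimal Python sketch shows how a training instance could be built under either formulation. The window size, feature names, and helper function are illustrative assumptions, not the paper's actual feature templates.

```python
def extract_features(tokens, position, include_observed):
    """Build a sparse feature dict for the confusable word at `position`.

    Selection task (non-learner text): include_observed=False; the word
    itself is "blanked out" and serves only as the class label.
    Correction task (learner text): include_observed=True; the writer's
    observed choice becomes an additional feature.
    """
    features = {}
    # Simple lexical context window (hypothetical template).
    for offset in (-2, -1, 1, 2):
        i = position + offset
        word = tokens[i] if 0 <= i < len(tokens) else "<PAD>"
        features["w[%+d]=%s" % (offset, word.lower())] = 1.0
    if include_observed:
        features["obs=%s" % tokens[position].lower()] = 1.0
    return features


# Selection task: the class is the observed word, which is never a feature.
tokens = "He bought a car yesterday".split()
x_sel, y_sel = extract_features(tokens, 2, include_observed=False), tokens[2]

# Correction task: the class is the annotator's correction (hypothetical "a").
tokens = "He bought the car yesterday".split()
x_cor, y_cor = extract_features(tokens, 2, include_observed=True), "a"
```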
31 2 Article Errors For article errors, the classes are the three articles a, the, and the zero-article. [sent-76, score-0.245]
32 When training on learner text, the correct class is the article provided by the human annotator. [sent-79, score-0.549]
33 The correct class is the article provided by the human annotator when testing on learner text or the observed article when testing on non-learner text. [sent-83, score-0.802]
34 3 Preposition Errors The approach to preposition errors is similar to that for articles but typically focuses on preposition substitution errors. [sent-85, score-0.629]
35 Every prepositional phrase (PP) that is governed by one of the 36 prepositions is one training or test example. [sent-87, score-0.211]
36 1 Linear Classifiers We use classifiers to approximate the unknown relation between articles or prepositions and their contexts in learner text, and their valid corrections. [sent-91, score-0.568]
37 The articles or prepositions and their contexts are represented as feature vectors X ∈ X. [sent-92, score-0.275]
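Since the exact training procedure is not spelled out in this summary, the sketch below uses scikit-learn's one-vs-rest linear SGD classifier over sparse feature dicts (such as those from the earlier sketch) as a stand-in; the specific loss and hyperparameters are assumptions, not necessarily the paper's learner.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier


def train_selection_model(feature_dicts, labels):
    """Fit a multiclass linear model mapping context features to the
    article/preposition class (one-vs-rest binary classifiers internally)."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)  # sparse design matrix
    clf = SGDClassifier(loss="modified_huber", alpha=1e-5, max_iter=10)
    clf.fit(X, labels)
    return vectorizer, clf
```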
38 Preposition Errors DeFelice The system in (De Felice, 2008) for preposition errors uses a similar set of syntactic and semantic features as the system for article errors. [sent-122, score-0.469]
39 For each of the above feature sets, we add the observed article or preposition as an additional feature when training on learner text. [sent-127, score-0.871]
40 5 Alternating Structure Optimization This section describes the ASO algorithm and shows how it can be used for grammatical error correction. [sent-128, score-0.174]
41 Instead, we can automatically create auxiliary problems for the sole purpose of learning a better Θ. [sent-144, score-0.172]
42 Let us assume that we have k target problems and m auxiliary problems. [sent-145, score-0.172]
43 2 ASO for Grammatical Error Correction The key observation in our work is that the selection task on non-learner text is a highly informative auxiliary problem for the correction task on learner text. [sent-165, score-0.589]
44 For example, a classifier that can predict the presence or absence of the preposition on can be helpful for correcting wrong uses of on in learner text, e.g. [sent-166, score-0.586]
45 , if the classifier’s confidence for on is low but the writer used the preposition on, the writer might have made a mistake. [sent-168, score-0.457]
46 As the auxiliary problems can be created automatically, we can leverage the power of very large corpora of non-learner text. [sent-169, score-0.172]
47 Let us assume a grammatical error correction task with m classes. [sent-170, score-0.364]
48 The feature space of the auxiliary problems is a restriction of the original feature space X to all features except the observed word: X \ {Xobs}. [sent-172, score-0.315]
49 The weight vectors of the auxiliary problems form the matrix U. [sent-173, score-0.199]
50 We then learn the target problems i = 1, . . . , k from the annotated learner text using the complete feature space X. [sent-178, score-0.416]
51 This is an instance of transfer learning (Pan and Yang, 2010), as the auxiliary problems are trained on data from a different domain (non-learner text) and have a slightly different feature space (X \ {Xobs}). [sent-180, score-0.241]
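A condensed sketch of this recipe is given below, under the common simplification that the shared structure Θ is obtained from a single SVD of the auxiliary weight vectors rather than the full alternating optimization. The use of numpy/scipy/scikit-learn, the dimensionality h, and the helper names are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import SGDClassifier


def learn_theta(X_aux, aux_label_vectors, h=50):
    """Train one binary (selection) classifier per auxiliary problem on
    non-learner data whose features exclude the observed word, stack the
    weight vectors, and keep the top-h left singular vectors as Theta."""
    weights = []
    for y in aux_label_vectors:                  # one 0/1 label vector per problem
        clf = SGDClassifier(loss="modified_huber", max_iter=5)
        clf.fit(X_aux, y)
        weights.append(clf.coef_.ravel())
    U = np.vstack(weights).T                     # d_aux x m matrix of weight vectors
    left, _, _ = np.linalg.svd(U, full_matrices=False)
    return left[:, :h].T                         # Theta: h x d_aux


def aso_features(X_full, X_aux, theta):
    """Augment the learner-text features (which include the observed word)
    with the low-dimensional projection Theta * x of the auxiliary view."""
    projected = csr_matrix(X_aux @ theta.T)      # n x h block of shared features
    return hstack([X_full, projected])


# The target problems are then ordinary classifiers trained on learner text,
# e.g. SGDClassifier().fit(aso_features(X_full, X_aux, theta), y).
```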
52 It contains over one million words which are completely annotated with error tags and corrections. [sent-185, score-0.162]
53 This figure is considerably lower compared to other learner corpora (Leacock et al. [sent-193, score-0.302]
54 3 Selection Task Experiments on WSJ Test Data The first set of experiments investigates predicting articles and prepositions in non-learner text. [sent-205, score-0.234]
55 This primarily serves as a reference point for the correction task described in the next section. [sent-206, score-0.19]
56 We train with up to 10 million training instances, which corresponds to about 37 million words of text for articles and 112 million words of text for prepositions. [sent-208, score-0.306]
57 The observed article or preposition choice of the writer is the class. [sent-210, score-0.635]
58 Therefore, the article or preposition cannot be part of the input features. [sent-215, score-0.395]
59 Our proposed ASO method is not included in these experiments, as it uses the observed article or preposition as a feature which is only applicable when testing on learner text. [sent-216, score-0.799]
60 4 Correction Task Experiments on NUCLE Test Data The second set of experiments investigates the primary goal of this work: to automatically correct grammatical errors in learner text. [sent-218, score-0.516]
61 In contrast to the previous selection task, the observed word choice of the writer can be different from the correct class and the observed word is available during testing. [sent-220, score-0.37]
62 The system only flags an error if the difference between the classifier’s confidence for its first choice and the confidence for the observed word is higher than a threshold t. [sent-224, score-0.231]
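This decision rule can be written in a few lines; the dictionary representation of classifier confidences and the function name below are hypothetical.

```python
def propose_correction(confidences, observed, t):
    """Flag an error only if the classifier's top choice beats the writer's
    observed word by more than the threshold t (tuned on development data).

    confidences: dict mapping each candidate class to a confidence score.
    """
    best = max(confidences, key=confidences.get)
    if best != observed and confidences[best] - confidences[observed] > t:
        return best        # propose a correction
    return observed        # keep the writer's choice


# Example: the writer wrote "on" but the classifier strongly prefers "in".
print(propose_correction({"in": 0.81, "on": 0.12, "at": 0.07}, "on", t=0.5))  # -> "in"
```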
63 The classifier is trained in the same way as the Gigaword model, except that the observed word choice of the writer is included as a feature. [sent-230, score-0.29]
64 The correct class during training is the correction provided by the human annotator. [sent-231, score-0.277]
65 During training, the instances that do not contain an error greatly outnumber the instances that do contain an error. [sent-234, score-0.186]
66 To reduce this imbalance, we keep all instances that contain an error and retain a random sample of q percent of the instances that do not contain an error. [sent-235, score-0.186]
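A minimal sketch of this undersampling step follows, assuming each training instance carries a boolean has_error flag; the data layout and q being expressed in percent are assumptions.

```python
import random


def undersample(instances, q, seed=0):
    """Keep every instance that contains an error and a random q-percent
    sample of the error-free instances (q is tuned on development data)."""
    rng = random.Random(seed)
    return [inst for inst in instances
            if inst["has_error"] or rng.random() < q / 100.0]
```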
67 We create binary auxiliary problems for articles or prepositions, i. [sent-239, score-0.281]
68 , there are 3 auxiliary problems for articles and 36 auxiliary problems for prepositions. [sent-241, score-0.429]
69 We train the classifiers for the auxiliary problems on the complete 10 million instances from Gigaword in the same ways as in the selection task experiment. [sent-242, score-0.356]
70 The weight vectors of the auxiliary problems form the matrix U. [sent-243, score-0.22]
71 The target problems are again binary classification problems for each article or preposition, but this time trained on NUCLE. [sent-246, score-0.357]
72 The observed word choice of the writer is included as a feature for the target problems. [sent-247, score-0.254]
73 We again undersample the instances that do not contain an error and tune the parameter q on the NUCLE development data. [sent-248, score-0.175]
74 The learning curves of the correction task experiments on NUCLE test data are shown in Figure 2 and 3. [sent-263, score-0.252]
75 Each sub-plot shows the curves of three models as described in the last section: ASO trained on NUCLE and Gigaword, the baseline classifier trained on NUCLE, and the baseline classifier trained on Gigaword. [sent-264, score-0.262]
76 The first observation is that high accuracy for the selection task on non-learner text does not automatically entail high F1-measure on learner text. [sent-266, score-0.399]
77 We also note that feature sets with similar performance on non-learner text can show very different performance on learner text. Figure 1: Accuracy for the selection task on WSJ test data; panels (a) Articles and (b) Prepositions (x-axis: number of training examples). [sent-267, score-0.244]
78 The second observation is that training on annotated learner text can significantly improve performance. [sent-269, score-0.428]
79 In three experiments (articles: DeFelice, Han; prepositions: DeFelice), the NUCLE model outperforms the Gigaword model trained on 10 million instances. [sent-270, score-0.217]
80 [Figure 2, plot area: F1-measure vs. number of training examples; curves for the ASO, NUCLE, and Gigaword models.] [sent-278, score-0.238]
81 Figure 2: F1-measure for the article correction task on NUCLE test data; panels (a) DeFelice, (b) Han, (c) Lee. [sent-280, score-0.852]
82 [Figure 3, plot area: F1-measure vs. number of training examples; curves for the ASO, NUCLE, and Gigaword models.] [sent-283, score-0.238]
83 Figure 3: F1-measure for the preposition correction task on NUCLE test data; panels (a) DeFelice, (b) TetreaultChunk, (c) TetreaultParse. [sent-285, score-0.927]
84 Table 1: Best results for the correction task on NUCLE test data. [sent-303, score-0.216]
85 For example, the preposition in should be on in the sentence “. [sent-310, score-0.235]
86 The ASO model is able to take advantage of both the annotated learner text and the large non-learner text, thus achieving overall high F1-measure. [sent-315, score-0.375]
87 1 Manual Evaluation We carried out a manual evaluation of the best ASO models and compared their output with two commercial grammar checking software packages, which we call System A and System B. [sent-319, score-0.196]
88 We randomly sampled 1000 test instances for articles and 2000 test instances for prepositions and manually categorized each test instance into one of the following categories: (1) Correct means that both human and system flag an error and suggest the same correction. [sent-320, score-0.503]
89 If the system’s correction differs from the human but is equally acceptable, it is considered (2) Both Ok. [sent-321, score-0.19]
90 If the system identifies an error but fails to correct it, we consider it (3) Both Wrong, as both the writer and the system are wrong. [sent-322, score-0.224]
91 (4) Other Error means that the system’s correction does not result in a grammatical sentence because of another grammatical error that is outside the scope of article or preposition errors, e. [sent-323, score-0.849]
92 The manual evaluation shows that even commercial software packages achieve low F1-measure for article and preposition errors, which confirms the difficulty of these tasks. [sent-339, score-0.562]
93 9 Conclusion We have presented a novel approach to grammatical error correction based on Alternating Structure Optimization. [sent-340, score-0.364]
94 We have introduced the NUS Corpus of Learner English (NUCLE), a fully annotated corpus of learner text. [sent-341, score-0.34]
95 Our experiments for article and preposition errors show the advantage of our ASO approach over two baseline methods. [sent-342, score-0.491]
96 Our ASO approach also outperforms two commercial grammar checking software packages in a manual evaluation. [sent-343, score-0.279]
97 Detecting errors in English article usage by non-native speakers. [sent-410, score-0.234]
98 Using an error-annotated learner corpus to develop an ESL/EFL error correction system. [sent-421, score-0.576]
99 The ups and downs of preposition error detection in ESL writing. [sent-529, score-0.342]
100 Using parse features for preposition selection and error detection. [sent-536, score-0.359]
wordName wordTfidf (topN-words)
[('nucle', 0.456), ('aso', 0.451), ('learner', 0.302), ('preposition', 0.235), ('correction', 0.19), ('gec', 0.179), ('article', 0.16), ('defelice', 0.159), ('han', 0.133), ('prepositions', 0.128), ('exampels', 0.119), ('traninig', 0.119), ('tetreault', 0.111), ('writer', 0.111), ('auxiliary', 0.111), ('tetreaultparse', 0.099), ('grammatical', 0.09), ('articles', 0.085), ('chodorow', 0.084), ('error', 0.084), ('alternating', 0.074), ('errors', 0.074), ('rozovskaya', 0.072), ('packages', 0.07), ('gigaword', 0.064), ('commercial', 0.062), ('problems', 0.061), ('observed', 0.061), ('tetreaultchunk', 0.06), ('nus', 0.057), ('ando', 0.057), ('felice', 0.055), ('gamon', 0.053), ('classifiers', 0.053), ('instances', 0.051), ('thresholding', 0.051), ('wj', 0.049), ('classifier', 0.049), ('corrections', 0.048), ('lee', 0.046), ('flags', 0.045), ('checking', 0.044), ('constituency', 0.041), ('feature', 0.041), ('choice', 0.041), ('million', 0.04), ('selection', 0.04), ('nonlearner', 0.04), ('undersample', 0.04), ('xobs', 0.04), ('wsj', 0.039), ('annotated', 0.038), ('leacock', 0.036), ('curves', 0.036), ('learners', 0.036), ('software', 0.035), ('ccg', 0.035), ('cbo', 0.035), ('text', 0.035), ('training', 0.031), ('minnen', 0.03), ('ui', 0.03), ('plot', 0.029), ('izumi', 0.029), ('correct', 0.029), ('trained', 0.028), ('baselines', 0.028), ('annotator', 0.028), ('loss', 0.028), ('weight', 0.027), ('class', 0.027), ('hypernyms', 0.026), ('nagata', 0.026), ('flag', 0.026), ('test', 0.026), ('governed', 0.026), ('grammar', 0.025), ('vj', 0.025), ('svd', 0.024), ('wl', 0.024), ('dividing', 0.024), ('esl', 0.024), ('binary', 0.024), ('roth', 0.024), ('classification', 0.023), ('kudo', 0.023), ('detection', 0.023), ('baseline', 0.022), ('manual', 0.022), ('essays', 0.022), ('bergsma', 0.022), ('pan', 0.022), ('klein', 0.022), ('optimization', 0.022), ('observation', 0.022), ('extensive', 0.021), ('yi', 0.021), ('vectors', 0.021), ('outperforms', 0.021), ('investigates', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
Author: Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present a novel approach to grammatical error correction based on Alternating Structure Optimization. As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. We conduct an extensive evaluation for article and preposition errors using various feature sets. Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. Our approach also outperforms two commercial grammar checking software packages.
2 0.30894551 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
Author: Ryo Nagata ; Edward Whittaker ; Vera Sheinman
Abstract: The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallowparsed. This corpus is available for research and educational purposes on the web. In this paper, we describe it in detail together with its data-collection method and annotation schemes. Another contribution of this paper is that we take the first step toward evaluating the performance of existing POStagging/chunking techniques on learner corpora using the created corpus. These contributions will facilitate further research in related areas such as grammatical error detection and automated essay scoring.
3 0.2922157 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
Author: Alla Rozovskaya ; Dan Roth
Abstract: We consider the problem of correcting errors made by English as a Second Language (ESL) writers and address two issues that are essential to making progress in ESL error correction - algorithm selection and model adaptation to the first language of the ESL learner. A variety of learning algorithms have been applied to correct ESL mistakes, but often comparisons were made between incomparable data sets. We conduct an extensive, fair comparison of four popular learning methods for the task, reversing conclusions from earlier evaluations. Our results hold for different training sets, genres, and feature sets. A second key issue in ESL error correction is the adaptation of a model to the first language ofthe writer. Errors made by non-native speakers exhibit certain regularities and, as we show, models perform much better when they use knowledge about error patterns of the nonnative writers. We propose a novel way to adapt a learned algorithm to the first language of the writer that is both cheaper to implement and performs better than other adaptation methods.
4 0.2106474 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
Author: Y. Albert Park ; Roger Levy
Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpora of corrected sentences. We also present a cheap and efficient way to provide automated evaluation re- sults for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.
5 0.20134687 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya
Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or for- eign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers 508 to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourc1There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 508–513, ing to both address the lack ofappropriate evaluation metrics and to make system comparison easier. 
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL?R). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each tri- gram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bi509 gram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subse- quent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • • 3 LM: A 4-gram language model trained on tLhMe Google Wme lba1nTg 5-gram Corpus dw oithn SRILM (Stolcke, 2002). 
PERC: An averaged Perceptron (Freund and Schapire, 1999) calgaessdif Pieerr—ce as implemented nind the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008). Crowdsourcing Recently,we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality con- × trol on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions 2Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems. & extraneous prepositions). To impose such quality control, one has to provide “gold” instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers’ judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts. More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences.3 We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87 with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75 respectively). The extraneous preposition annotation cost only $325 (923 judgments 20 Turkers) and was com- pleted 9in2 a single day. T 2h0e only rres)st arnicdtio wna on tmheTurkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower. The 3 expert judgments are not used any further in this analysis. 4 Revamping System Evaluation In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§ 4. 1), and (b) by easily enabling system comparison (§ 4.2). 4.1 Crowd-informed Evaluation Measures When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all “bad” uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or 3We found 2 duplicate sentences and removed them. 510 incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence “In addition classmates play with some game and enjoy” is an example). The regular measures of precision and recall would be fairer if they reflected this reality. 
Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (Hw), misses (Mw) and false positives (FPw): XN Hw = X(csiys ∗ picrowd) (1) Xi XN Mw = X((1 − csiys) ∗ picrowd) (2) Xi XN FPw = X(csiys ∗ (1 − picrowd)) (3) Xi In the above equations, N is the total number of instances, csiys is the class (1 or 0) , and picrowd indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, picrowd would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precisionw = Hw/(Hw Hw/(Hw + FPw) and weighted + Mw). recall as Recallw = agreement Figure 1: Histogram of Turker agreements for all 923 instances on whether a preposition is extraneous. UWnwei gihg tede Pr0 e.c9 i5s0i70onR0 .e3 c78al14l Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM. To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for this system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether 511 [n=93] [n=1 14] Agreement Bin [n=71 6] Figure 2: Unweighted precision/recall by agreement bins for LM & PERC. a case is an error then a system’s output cannot be considered entirely right or entirely wrong.4 As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. However, weighted precision is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem. 
4.2 Enabling System Comparison In this section, we show how to easily compare different systems both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets. Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown 4The difference between unweighted and weighted measures can vary depending on the distribution of agreement. since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily on different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have similar performance as LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier’s decision process. When it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins—as is the norm—this important qualification would have been harder to discover. While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots. However, this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here. 5 Conclusions Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems and we do so by leveraging crowdsourcing. For system development, we rec- ommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall to 512 yield fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots. 
As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the com- munity.5 Acknowledgments We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback. References Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 195–203. Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of EMNLP, pages 286– 295. Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24. 5http : / /bit . ly/ crowdgrammar Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG. Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 53–56. Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176. Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 80–88. Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296. Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP. Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners’ Writing. In Proceedings of NAACL, pages 163–171 . Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1. Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 108–1 13. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool. Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. 
thesis, Department of Computer Science, University of Maryland College Park. Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215. Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of 513 the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September. Alla Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Alla Rozovskaya and D. Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP. Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286. Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872. Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48. Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 163–167. Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.
6 0.18967223 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
7 0.15065932 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
8 0.074407712 333 acl-2011-Web-Scale Features for Full-Scale Parsing
9 0.071447439 11 acl-2011-A Fast and Accurate Method for Approximate String Search
10 0.069860153 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
11 0.067761891 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
12 0.062407944 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
13 0.057144467 109 acl-2011-Effective Measures of Domain Similarity for Parsing
14 0.057013288 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
15 0.055672493 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
16 0.054476958 282 acl-2011-Shift-Reduce CCG Parsing
17 0.054015577 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
18 0.053915184 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
19 0.053685278 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search
20 0.053391423 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
topicId topicWeight
[(0, 0.165), (1, 0.005), (2, -0.047), (3, -0.047), (4, -0.082), (5, -0.02), (6, 0.113), (7, -0.048), (8, -0.023), (9, -0.008), (10, -0.187), (11, -0.188), (12, 0.01), (13, 0.171), (14, -0.035), (15, 0.383), (16, 0.038), (17, -0.064), (18, -0.093), (19, 0.082), (20, -0.111), (21, -0.057), (22, 0.015), (23, -0.001), (24, -0.002), (25, 0.024), (26, -0.043), (27, 0.035), (28, 0.063), (29, 0.07), (30, 0.038), (31, -0.036), (32, -0.015), (33, -0.058), (34, 0.023), (35, 0.006), (36, -0.029), (37, -0.003), (38, -0.044), (39, 0.009), (40, -0.103), (41, 0.04), (42, 0.006), (43, 0.004), (44, 0.016), (45, 0.032), (46, 0.011), (47, 0.05), (48, -0.015), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.93043429 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
2 0.87437057 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
3 0.85971743 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
4 0.82463115 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya
Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or for- eign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers 508 to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourc1There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 508–513, ing to both address the lack ofappropriate evaluation metrics and to make system comparison easier. 
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL?R). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each tri- gram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bi509 gram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subse- quent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • • 3 LM: A 4-gram language model trained on tLhMe Google Wme lba1nTg 5-gram Corpus dw oithn SRILM (Stolcke, 2002). 
3 Crowdsourcing

Recently, we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality control on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions and extraneous prepositions). To impose such quality control, one has to provide "gold" instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers' judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts.

More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences (we found 2 duplicate sentences and removed them). We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87, with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75, respectively). The extraneous preposition annotation cost only $325 (923 judgments × 20 Turkers) and was completed in a single day. The only restriction on the Turkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower; the 3 expert judgments are not used any further in this analysis. (A sketch of the majority-vote agreement computation follows below.)
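The reported agreement and kappa between the 3-Turker majority vote and an expert can be reproduced in spirit with the sketch below. It is not the authors' evaluation code: the data layout, the random sampling of 3 Turkers per instance and the use of Cohen's kappa for a single rater pair are our own assumptions for illustration.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a small, odd-sized sample of judgments."""
    return Counter(labels).most_common(1)[0][0]

def agreement_and_kappa(turker_judgments, expert_labels, sample_size=3, seed=0):
    """Raw agreement and Cohen's kappa between a sampled-Turker majority vote
    and a single expert rater.

    turker_judgments: list of lists of 'Error'/'OK' labels (20 per instance here)
    expert_labels:    list of 'Error'/'OK' labels from one expert
    """
    rng = random.Random(seed)
    votes = [majority(rng.sample(js, sample_size)) for js in turker_judgments]

    n = len(expert_labels)
    observed = sum(v == e for v, e in zip(votes, expert_labels)) / n

    # Chance agreement from the two raters' marginal label distributions.
    expected = sum(
        (sum(v == lab for v in votes) / n) *
        (sum(e == lab for e in expert_labels) / n)
        for lab in ("Error", "OK")
    )
    return observed, (observed - expected) / (1 - expected)
```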
4 Revamping System Evaluation

In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§4.1), and (b) by easily enabling system comparison (§4.2).

4.1 Crowd-informed Evaluation Measures

When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all "bad" uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence "In addition classmates play with some game and enjoy" is an example). The regular measures of precision and recall would be fairer if they reflected this reality.

Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We therefore modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (H_w), misses (M_w) and false positives (FP_w):

H_w = \sum_{i}^{N} (c_i^{sys} \cdot p_i^{crowd})    (1)

M_w = \sum_{i}^{N} ((1 - c_i^{sys}) \cdot p_i^{crowd})    (2)

FP_w = \sum_{i}^{N} (c_i^{sys} \cdot (1 - p_i^{crowd}))    (3)

In the above equations, N is the total number of instances, c_i^{sys} is the class (1 or 0) that the system assigns to instance i, and p_i^{crowd} indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, p_i^{crowd} would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precision_w = H_w / (H_w + FP_w) and weighted recall as Recall_w = H_w / (H_w + M_w).

Figure 1: Histogram of Turker agreement for all 923 instances on whether a preposition is extraneous.

Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM.

To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for the LM system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated, simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation, since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether a case is an error, then a system's output cannot be considered entirely right or entirely wrong. (The difference between unweighted and weighted measures can vary depending on the distribution of agreement.)

As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. Weighted precision, in contrast, is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem. (A short code sketch of these weighted measures follows below.)
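Equations (1)-(3) translate directly into code. The sketch below is ours, not the authors' released implementation; `sys_labels` and `crowd_props` are assumed input lists of per-instance system decisions and crowd proportions.

```python
def weighted_precision_recall(sys_labels, crowd_props):
    """Weighted precision and recall from equations (1)-(3).

    sys_labels:  0/1 system decisions per instance (1 = Error, 0 = OK)
    crowd_props: p_i^crowd, the proportion of the crowd labelling instance i Error
    """
    hw = sum(c * p for c, p in zip(sys_labels, crowd_props))         # weighted hits
    mw = sum((1 - c) * p for c, p in zip(sys_labels, crowd_props))   # weighted misses
    fpw = sum(c * (1 - p) for c, p in zip(sys_labels, crowd_props))  # weighted false positives
    return hw / (hw + fpw), hw / (hw + mw)

# Setting every p_i^crowd to 0 or 1 (the majority judgment) makes the same code
# compute ordinary unweighted precision and recall, as noted in the text.
```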
4.2 Enabling System Comparison

In this section, we show how to easily compare different systems, both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets.

Figure 2: Unweighted precision/recall by agreement bins for LM & PERC (bin sizes n=93, n=114 and n=716).

Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case, since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily at different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have performance similar to LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier's decision process: when it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins, as is the norm, this important qualification would have been harder to discover.

While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots; this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here. (A sketch of the per-bin computation behind such plots is given below.)
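A precision-agreement plot of the kind in Figure 2 can be produced by bucketing instances by Turker agreement and computing ordinary precision and recall inside each bucket. The bin boundaries below follow the figure; the function name, data layout and half-open bin convention are illustrative assumptions, and a kappa-agreement plot would substitute per-bin kappa for these per-bin precision/recall values.

```python
def binned_precision_recall(sys_labels, gold_labels, agreements,
                            bins=((0.50, 0.75), (0.75, 0.90), (0.90, 1.00))):
    """Unweighted precision/recall per Turker-agreement bin.

    sys_labels:  0/1 system decisions (1 = Error)
    gold_labels: 0/1 majority Turker judgments
    agreements:  per-instance agreement with the majority label, in [0.5, 1.0]
    """
    results = {}
    for lo, hi in bins:
        idx = [i for i, a in enumerate(agreements)
               if lo <= a < hi or (hi == 1.00 and a == hi)]  # top bin is inclusive
        tp = sum(1 for i in idx if sys_labels[i] == 1 and gold_labels[i] == 1)
        fp = sum(1 for i in idx if sys_labels[i] == 1 and gold_labels[i] == 0)
        fn = sum(1 for i in idx if sys_labels[i] == 0 and gold_labels[i] == 1)
        precision = tp / (tp + fp) if (tp + fp) else None
        recall = tp / (tp + fn) if (tp + fn) else None
        results[(lo, hi)] = (precision, recall, len(idx))
    return results
```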
5 Conclusions

Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems, and we do so by leveraging crowdsourcing. For system development, we recommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall, yielding fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots. As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the community at http://bit.ly/crowdgrammar.

Acknowledgments

We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback.

References

Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 195–203.

Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proceedings of EMNLP, pages 286–295.

Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24.

Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG.

Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 53–56.

Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176.

Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 80–88.

Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296.

Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP.

Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners' Writing. In Proceedings of NAACL, pages 163–171.

Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1.

Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 108–113.

Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool.

Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. thesis, Department of Computer Science, University of Maryland, College Park.
Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215.

Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September.

Alla Rozovskaya and Dan Roth. 2010a. Annotating ESL Errors: Challenges and Rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications.

Alla Rozovskaya and Dan Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP.

Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286.

Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872.

Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications.

Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48.

Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 163–167.

Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.
5 0.80145264 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
Author: Chung-chi Huang ; Mei-hua Chen ; Shih-ting Huang ; Jason S. Chang
Abstract: We introduce a new method for learning to detect grammatical errors in learner’s writing and provide suggestions. The method involves parsing a reference corpus and inferring grammar patterns in the form of a sequence of content words, function words, and parts-of-speech (e.g., “play ~ role in Ving” and “look forward to Ving”). At runtime, the given passage submitted by the learner is matched using an extended Levenshtein algorithm against the set of pattern rules in order to detect errors and provide suggestions. We present a prototype implementation of the proposed method, EdIt, that can handle a broad range of errors. Promising results are illustrated with three common types of errors in nonnative writing. 1
6 0.74203694 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
7 0.47162443 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
8 0.46675941 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
9 0.42250046 11 acl-2011-A Fast and Accurate Method for Approximate String Search
10 0.41653776 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
11 0.39756161 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search
12 0.37868622 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method
13 0.37289664 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
14 0.3662594 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
15 0.3647913 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
16 0.35884142 297 acl-2011-That's What She Said: Double Entendre Identification
17 0.34709266 239 acl-2011-P11-5002 k2opt.pdf
18 0.34162223 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
19 0.34093589 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
20 0.33536214 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
topicId topicWeight
[(5, 0.021), (17, 0.043), (26, 0.029), (37, 0.109), (39, 0.032), (41, 0.053), (53, 0.012), (55, 0.024), (59, 0.044), (61, 0.209), (72, 0.105), (91, 0.069), (96, 0.134), (97, 0.026)]
simIndex simValue paperId paperTitle
1 0.85999954 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
Author: Youngjun Kim ; Ellen Riloff ; Stephane Meystre
Abstract: We present an NLP system that classifies the assertion type of medical problems in clinical notes used for the Fourth i2b2/VA Challenge. Our classifier uses a variety of linguistic features, including lexical, syntactic, lexicosyntactic, and contextual features. To overcome an extremely unbalanced distribution of assertion types in the data set, we focused our efforts on adding features specifically to improve the performance of minority classes. As a result, our system reached 94. 17% micro-averaged and 79.76% macro-averaged F1-measures, and showed substantial recall gains on the minority classes. 1
2 0.82424659 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
Author: Bo Pang ; Ravi Kumar
Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent. —
same-paper 3 0.79208416 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
Author: Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present a novel approach to grammatical error correction based on Alternating Structure Optimization. As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. We conduct an extensive evaluation for article and preposition errors using various feature sets. Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. Our approach also outperforms two commercial grammar checking software packages.
4 0.78846949 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
Author: Stefan Rud ; Massimiliano Ciaramita ; Jens Muller ; Hinrich Schutze
Abstract: We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.
5 0.76739603 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
6 0.71748936 261 acl-2011-Recognizing Named Entities in Tweets
7 0.70710713 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
8 0.69992763 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
9 0.69843692 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
10 0.69682986 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
11 0.69411099 252 acl-2011-Prototyping virtual instructors from human-human corpora
12 0.6879046 292 acl-2011-Target-dependent Twitter Sentiment Classification
13 0.68769217 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
14 0.68754852 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
15 0.68237621 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
16 0.68206942 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
17 0.68140256 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
18 0.68021107 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
19 0.67916983 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD
20 0.67857313 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation