acl acl2011 acl2011-119 knowledge-graph by maker-knowledge-mining

119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning


Source: pdf

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when applied to noisy data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 rehbein@coli.uni-sb.de Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. [sent-3, score-0.113]

2 This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. [sent-5, score-1.262]

3 , 2006; Zhu and Hovy, 2007; Chan and Ng, 2007), text classification (Tong and Koller, 1998) or statistical machine translation (Haffari and Sarkar, 2009), and has been shown to reduce the amount of annotated data needed to achieve a certain classifier performance, sometimes by as much as half. [sent-12, score-0.123]

4 Most of these studies, however, have only simulated the active learning process using goldstandard data. [sent-13, score-0.317]

5 This setting is crucially different from a real-world scenario where we have to deal with erroneous data and inconsistent annotation decisions made by the human coders. [sent-14, score-0.164]

6 In this paper we present a thorough evaluation of the impact of annotation noise on AL. [sent-16, score-0.486]

7 We simulate different types of coder errors and assess the effect on the learning process. [sent-17, score-0.295]

8 We propose a method to detect inconsistencies and remove them from the training data, and show that our method does alleviate the problem of annotation noise in our experiments. [sent-18, score-0.565]

9 Section 2 reports on recent research on the impact of annotation noise in the context of supervised classification. [sent-20, score-0.519]

10 In Section 4 we present our filtering approach and show its impact on AL performance. [sent-22, score-0.103]

11 Section 5 concludes. (2 Related Work) We are interested in the question of whether or not AL can be successfully applied to a supervised classification task where we have to deal with a considerable amount of inconsistencies and noise in the data, which is the case for many NLP tasks (e. [sent-24, score-0.447]

12 Therefore we do not consider part-of-speech tagging or syntactic parsing, where coders are expected to agree on most annotation decisions. [sent-27, score-0.213]

13 Instead, we focus on work on AL for WSD, where intercoder agreement (at least for fine-grained annotation schemes) usually is much lower than for the former tasks. [sent-28, score-0.117]

14 1 Annotation Noise Studies on active learning for WSD have been limited to running simulations of AL using gold standard data and a coarse-grained annotation scheme (Chen et al. [sent-30, score-0.46]

15 A possible reason for this failure is the amount of annotation noise in the training data which might mislead the classifier during the AL process. [sent-34, score-0.572]

16 , 2008; Beigman Klebanov and Beigman, 2009) has been studying annotation noise in a multi-annotator setting, distinguishing between hard cases (unreliably annotated due to genuine ambiguity) and easy cases (reliably annotated data). [sent-38, score-0.567]

17 Following this assumption, the authors propose a measure to estimate the amount of annotation noise in the data after removing all hard cases. [sent-40, score-0.539]

18 (2008; 2009) show that, according to their model, high inter-annotator agreement (κ) achieved in an annotation scenario with two annotators is no guarantee for a high-quality data set. [sent-42, score-0.162]

19 Their model, however, assumes that a) all instances where annotators disagreed are in fact hard cases, and b) that for the hard cases the annotators' decisions are obtained by coin-flips. [sent-43, score-0.412]

20 Further problems arise in the AL scenario where the instances to be annotated are selected as a function of the sampling method and the annotation judgements made before. [sent-45, score-0.511]

21 Therefore, Beigman Klebanov and Beigman (2009)’s approach of identifying unreliably annotated instances by disagreement is not applicable to AL, as most instances are annotated only once. [sent-46, score-0.509]

22 2 Annotation Noise and Active Learning For AL to be successful, we need to remove systematic noise in the training data. [sent-48, score-0.409]

23 The challenge we face is that we only have a small set of seed data and no information about the reliability of the annotations assigned by the human coders. [sent-49, score-0.506]

24 (2008) present a method for detecting outliers in the pool of unannotated data to prevent these instances from becoming part of the training data. [sent-51, score-0.463]

25 This approach is different from ours, where we focus on detecting annotation noise in the manually labelled training data produced by the human coders. [sent-52, score-0.555]

26 Schein and Ungar (2007) provide a systematic investigation of 8 different sampling methods for AL and their ability to handle different types of noise in the data. [sent-53, score-0.547]

27 The types of noise investigated are a) prediction residual error (the portion of squared error that is independent of training set size), and b) different levels of confusion among the categories. [sent-54, score-0.561]

28 Type a) models the presence of unknown features that influence the true probabilities of an outcome: a form of noise that will increase residual error. [sent-55, score-0.374]

29 Therefore, type b) errors are of greater interest to us, as it is safe to assume that intrinsically ambiguous categories will lead to biased coder decisions and result in the systematic annotation noise we are interested in. [sent-57, score-0.938]

30 Schein and Ungar observe that none of the 8 sampling methods investigated in their experiment achieved a significant improvement over the random sampling baseline on type b) errors. [sent-58, score-0.402]

31 In fact, entropy sampling and margin sampling even showed a decrease in performance compared to random sampling. [sent-59, score-0.532]

32 For AL to work well on noisy data, we need to identify and remove this type of annotation noise during the AL process. [sent-60, score-0.45]

33 To the best of our knowledge, there is no work on detecting and removing annotation noise by human coders during AL. [sent-61, score-0.584]

34 3 Experimental Setup To make sure that the data we use in our simulation is as close to real-world data as possible, we do not create an artificial data set as done in (Schein and Ungar, 2007; Reidsma and Carletta, 2008) but use real data from a WSD task for the German verb drohen (threaten). [sent-62, score-0.091]

35 To control the amount of noise in the data, we need to be sure that the initial data set is noise-free. [sent-65, score-0.377]

36 3 Our sampling method is uncertainty sampling (Lewis and Gale, 1994), a standard sampling heuristic for AL where new instances are selected based on the confidence of the classifier for predicting the appropriate label. [sent-67, score-0.815]

37 As a measure of uncertainty we use Shannon entropy (1) (Zhang and Chen, 2002) and the margin metric (2) (Schein and Ungar, 2007). [sent-68, score-0.198]

38 The first measure considers the model’s predictions q for each class c and selects those instances from the pool where the Shannon entropy is highest. [sent-69, score-0.43]

39 Mn = |P(c|xn) − P(c′|xn) | (2) The features we use for WSD are a combination of context features (word token with window size 11 and POS context with window size 7), syntactic features based on the output of a dependency parser4 and semantic features based on GermaNet hyperonyms. [sent-72, score-0.162]
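To make the two uncertainty measures concrete, the sketch below (our own illustration, not code from the paper) scores pool instances by Shannon entropy over the predicted class distribution and by the margin between the two most probable classes; the function and variable names are ours, and the classifier is assumed to expose class probabilities.

```python
import numpy as np

def entropy_scores(probs):
    """Shannon entropy of the predicted class distribution for each pool instance.
    probs has shape (n_instances, n_classes) and holds P(c|x_n).
    Higher entropy means more uncertainty, so such instances are queried first."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def margin_scores(probs):
    """Margin M_n = |P(c|x_n) - P(c'|x_n)| between the two most probable classes.
    Lower margin means more uncertainty, so such instances are queried first."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return np.abs(top2[:, 1] - top2[:, 0])

def select_next(probs, criterion="margin"):
    """Index of the pool instance to query next under the chosen criterion."""
    if criterion == "entropy":
        return int(np.argmax(entropy_scores(probs)))
    return int(np.argmin(margin_scores(probs)))

# Toy usage with three pool instances and three word senses:
probs = np.array([[0.36, 0.33, 0.31],   # nearly uniform -> highest entropy
                  [0.80, 0.15, 0.05],   # confident prediction
                  [0.48, 0.47, 0.05]])  # smallest gap between the top two classes
print(select_next(probs, "entropy"))  # -> 0
print(select_next(probs, "margin"))   # -> 2
```

Entropy sampling looks at the whole predicted distribution, while margin sampling only considers the competition between the two top-ranked classes; the curves reported below are for margin sampling.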

40 1 Simulating Coder Errors in AL Before starting the AL trials we automatically separate the 2,500 sentences into test set (498 sentences) and pool (2,002 sentences),5 retaining the overall distribution of word senses in the data set. [sent-77, score-0.274]

41 We insert a varying amount of noise into the pool data. (Footnote 2: In a pilot study where two human coders assigned labels to a set of 100 sentences, the coders agreed on 99% of the data.) [sent-78, score-0.765]

42 We assess the impact of annotation noise on active learning in three different settings. [sent-87, score-0.762]

43 In the first setting, we randomly select new instances from the pool (random sampling; rand). [sent-88, score-0.347]

44 In the second setting, we randomly replace n percent of all labels (from 0 to 30) in the pool by another label before starting the active learning trial, but retain the distribution of the different labels in the pool data (active learning with random errors); (Table 1, ALrand, 30%). [sent-89, score-0.725]

45 In the third setting we simulate biased decisions by a human annotator. [sent-90, score-0.148]

46 For a certain fraction (0 to 30%) of instances of a particular non-majority class, we substitute the majority class label for the gold label, thereby producing a more skewed distribution than in the original pool (active learning with biased errors); (Table 1, ALbias, 30%). [sent-91, score-0.556]
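As a rough sketch of how the two noise settings could be injected (our own illustration; the function names, the weighting scheme, and the toy pool are assumptions, not the paper's code), ALrand-style noise replaces a fraction of pool labels with a different label drawn from the pool's own label distribution, while ALbias-style noise relabels a fraction of the instances of one non-majority class with the majority class label.

```python
import random
from collections import Counter

def insert_random_noise(labels, fraction, seed=0):
    """ALrand-style noise: replace `fraction` of the labels with a different label,
    drawn in proportion to the pool's label counts so that the overall label
    distribution stays roughly unchanged."""
    rng = random.Random(seed)
    labels = list(labels)
    counts = Counter(labels)
    for i in rng.sample(range(len(labels)), int(fraction * len(labels))):
        others = [l for l in counts if l != labels[i]]
        weights = [counts[l] for l in others]
        labels[i] = rng.choices(others, weights=weights, k=1)[0]
    return labels

def insert_biased_noise(labels, fraction, target_class, seed=0):
    """ALbias-style noise: for `fraction` of the instances of one particular
    non-majority class, substitute the majority class label, producing a more
    skewed distribution that simulates a systematically biased coder."""
    rng = random.Random(seed)
    labels = list(labels)
    majority = Counter(labels).most_common(1)[0][0]
    idx = [i for i, l in enumerate(labels) if l == target_class]
    for i in rng.sample(idx, int(fraction * len(idx))):
        labels[i] = majority
    return labels

# Toy pool using the sense labels mentioned later in the paper, for illustration only.
pool = ["Run_risk"] * 60 + ["Commitment"] * 25 + ["drohen1-salsa"] * 15
print(Counter(insert_random_noise(pool, 0.30)))
print(Counter(insert_biased_noise(pool, 0.30, "Commitment")))
```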

47 2 Results Figure 1 shows active learning curves for the different settings and varying degrees of noise. [sent-95, score-0.467]

48 For all degrees of randomly inserted noise, active learning (ALrand) outperforms random sampling (rand); for biased errors (ALbias), we see a different picture. [sent-98, score-0.785]

49 When inserting more noise, performance for ALbias decreases, and with around 20% of biased errors in the pool AL performs worse than our random sampling baseline. [sent-100, score-0.601]

50 In the random noise setting (ALrand), even after inserting 30% of errors AL clearly outperforms random sampling. [sent-101, score-0.554]

51 Increasing the size of the seed data reduces the effect slightly, but does not prevent it (not shown here due to space limitations). [sent-102, score-0.208]

52 This confirms the findings that under certain circumstances AL performs worse than random sampling (Dang, 2004; Schein and Ungar, 2007; Rehbein et al. [sent-103, score-0.229]

53 We could also confirm Schein and Ungar (2007)’s observation that margin sampling is less sensitive to certain types of noise than entropy sampling (Table 2). [sent-105, score-0.809]

54 Because of space limitations we only show curves for margin sampling. [sent-106, score-0.158]

55 For entropy sampling, the general trend is the same, with results being slightly lower than for margin sampling. [sent-107, score-0.13]

56 4 Detecting Annotation Noise Uncertainty sampling using the margin metric selects instances for which the difference between classifier predictions for the two most probable classes c, c′ is very small (Section 3, Equation 2). [sent-108, score-0.491]

57 When selecting unlabelled instances from the pool, this metric picks examples which represent regions of uncertainty between classes which have yet to be learned by the classifier and thus will advance the learning process. [sent-109, score-0.329]

58 The filter approach has two objectives: a) to detect incorrect labels assigned by human coders, and b) to prevent the hard cases (following the terminology of Klebanov et al. [sent-111, score-0.168]

59 Our approach makes use of the limited set of seed data S and uses heuristics to detect unreliably annotated instances. [sent-114, score-0.266]

60 We assume that the instances in S have been validated thoroughly. [sent-115, score-0.185]

61 We train an ensemble of classifiers E on subsets of S, and use E to decide whether or not a newly annotated instance should be added to the seed data at this early stage in the learning process. [sent-116, score-0.222]

62 Therefore, using classifier predictions at this stage to accept or reject new instances could result in poor choices that might harm the learning proceess. [sent-121, score-0.34]

63 To avoid this and to generalise over S to prevent overfitting, we do not directly train our ensemble on instances from S. [sent-122, score-0.334]

64 In the next step we train n = 5 maximum entropy classifiers on subsets of Fgen, excluding the instances last annotated by the oracle. [sent-127, score-0.308]

65 We use the ensemble to predict the labels for the new instances and, based on the predictions, accept or reject these, following the two heuristics below (also see Figure 2). [sent-129, score-0.368]

66 If all n ensemble classifiers agree on one label but disagree with the oracle ⇒ reject. [sent-131, score-0.153]

67 If the sum of the margins predicted by the ensemble classifiers is below a particular threshold tmargin ⇒ reject. The threshold tmargin was set to 0.01. [sent-133, score-0.229]

68 Figure 2: Heuristics for filtering unreliable data points (parameters used: initial seed size: 9 sentences, c = 10, tmargin = 0.01). [sent-135, score-0.263]

69 In each iteration of the AL process, one new instance is selected using margin sampling. [sent-137, score-0.09]

70 After 10 new instances have been added, we apply the filter technique which finally decides whether the newly added instances will remain in the seed data or will be removed. [sent-140, score-0.491]
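A schematic re-implementation of the two filter heuristics might look as follows. This is only a sketch under our own assumptions: scikit-learn's LogisticRegression stands in for the maximum entropy classifiers, the function names are ours, and the paper's feature resampling step (deriving Fgen from the seed data) is glossed over by training directly on random subsets of the validated seed data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ensemble(X_seed, y_seed, n_classifiers=5, seed=0):
    """Train n maximum-entropy-style classifiers on random subsets of the
    validated seed data (the paper trains on resampled feature vectors Fgen;
    we also assume each subset still contains more than one class)."""
    rng = np.random.RandomState(seed)
    subset_size = max(2, int(0.8 * len(y_seed)))
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(y_seed), size=subset_size, replace=False)
        ensemble.append(LogisticRegression(max_iter=1000).fit(X_seed[idx], y_seed[idx]))
    return ensemble

def should_reject(ensemble, x, oracle_label, t_margin=0.01):
    """Apply the two heuristics to one newly annotated instance x."""
    x = np.asarray(x).reshape(1, -1)
    votes = [clf.predict(x)[0] for clf in ensemble]
    # H1: all ensemble members agree on one label but disagree with the oracle.
    if len(set(votes)) == 1 and votes[0] != oracle_label:
        return True
    # H2: the summed margins between the two most probable classes fall below
    # t_margin, i.e. the instance looks like a genuinely hard (ambiguous) case.
    margin_sum = 0.0
    for clf in ensemble:
        p = np.sort(clf.predict_proba(x)[0])
        margin_sum += p[-1] - p[-2]
    return margin_sum < t_margin
```

In the AL loop described above, this check would run over the 10 most recently annotated instances after each batch, removing those for which should_reject returns True.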

71 Figure 3 shows learning curves for the filter approach. [sent-141, score-0.138]

72 With an increasing amount of errors in the pool, a clear pattern emerges. [sent-142, score-0.122]

73 For both sampling methods (ALrand, ALbias), the filtering step clearly improves results. [sent-143, score-0.276]

74 Even for the noisier data sets with up to 26% of errors, ALbias with filtering performs at least as well as random sampling. [sent-144, score-0.194]

75 We want to know whether the approach is able to detect the errors previously inserted into the data, and whether it manages to identify hard cases representing true ambiguities. [sent-147, score-0.21]

76 In 1,200 AL iterations the system rejected 116 instances (Table 3). [sent-149, score-0.348]

77 The major part of the rejections was due to the majority vote of the ensemble classifiers (first heuristic, H1) which rejects all instances where the ensemble classifiers agree with each other but disagree with the human judgement. [sent-150, score-0.523]

78 Out of the 105 instances rejected by H1, 41 were labelled incorrectly. [sent-151, score-0.38]

79 11 instances were filtered out by the margin threshold (H2). [sent-153, score-0.275]

80 Table 3: Error analysis of the instances rejected by the filtering approach. Instances selected by AL: 93; instances rejected by H1+H2: 116; instances rejected by H1: 105; true errors rejected by H1: 41; instances rejected by H2: 11; true errors rejected by H2: 0. [sent-155, score-0.092]

81 At first glance, H2 seems to be more lenient than H1, considering the number of rejected sentences. [sent-156, score-0.163]

82 The different word senses are evenly distributed over the rejected instances (H1: Commitment 30, drohen1-salsa 38, Run risk 36; H2: Commitment 3, drohen1-salsa 4, Run risk 4). [sent-158, score-0.457]

83 This shows that there is less uncertainty about the majority word sense, Run risk. [sent-159, score-0.1]

84 It is hard to decide whether the correctly labelled instances rejected by the filtering method would have helped or hurt the learning process. [sent-160, score-0.561]

85 5 Conclusions This paper shows that certain types of annotation noise cause serious problems for active learning approaches. [sent-162, score-0.726]

86 We showed how biased coder decisions can result in an accuracy for AL approaches that is below that of random sampling. [sent-163, score-0.388]

87 In this case, it is necessary to apply an additional filtering step to remove the noisy data from the training set. [sent-164, score-0.138]

88 We presented an approach based on a resampling of the features in the seed data and guided by an ensemble of classifiers trained on the resampled feature vectors. [sent-165, score-0.237]

89 We showed that our approach is able to detect a certain amount of noise in the data. [sent-166, score-0.42]

90 Future work should focus on finding optimal parameter settings to make the filtering method more robust even for noisier data sets. [sent-167, score-0.17]

91 We also plan to improve the filtering heuristics and to explore further ways of detecting human coder errors. [sent-168, score-0.361]

92 Finally, we plan to test our method in a real-world annotation scenario. [sent-169, score-0.117]

93 Domain adaptation with active learning for word sense disambiguation. [sent-183, score-0.322]

94 An empirical study of the behavior of active learning for word sense disambiguation. [sent-187, score-0.322]

95 Stopping criteria for active learning of named entity recognition. [sent-205, score-0.276]

96 Majo - a toolkit for supervised word sense disambiguation and active learning. [sent-224, score-0.322]

97 Reducing class imbalance during active learning for named entity annotation. [sent-250, score-0.319]

98 Support vector machine active learning with applications to text classification. [sent-254, score-0.276]

99 Active learning for word sense disambiguation with methods for addressing the class imbalance problem. [sent-262, score-0.122]

100 Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. [sent-267, score-0.32]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('noise', 0.333), ('albias', 0.252), ('al', 0.246), ('active', 0.243), ('alrand', 0.202), ('beigman', 0.202), ('instances', 0.185), ('coder', 0.184), ('klebanov', 0.177), ('sampling', 0.173), ('schein', 0.164), ('rejected', 0.163), ('pool', 0.162), ('ungar', 0.156), ('rehbein', 0.133), ('annotation', 0.117), ('ensemble', 0.106), ('filtering', 0.103), ('biased', 0.101), ('fgen', 0.101), ('wsd', 0.097), ('coders', 0.096), ('margin', 0.09), ('reidsma', 0.089), ('seed', 0.084), ('size', 0.081), ('errors', 0.078), ('error', 0.076), ('rand', 0.076), ('tmargin', 0.076), ('curves', 0.068), ('uncertainty', 0.068), ('simulations', 0.067), ('unreliably', 0.067), ('ruppenhofer', 0.061), ('circle', 0.058), ('degrees', 0.057), ('random', 0.056), ('zhu', 0.053), ('beata', 0.05), ('drohen', 0.05), ('fseed', 0.05), ('carletta', 0.05), ('classifiers', 0.047), ('decisions', 0.047), ('sense', 0.046), ('annotators', 0.045), ('hard', 0.045), ('senses', 0.045), ('haffari', 0.044), ('ines', 0.044), ('tomanek', 0.044), ('triangle', 0.044), ('inserted', 0.044), ('black', 0.044), ('amount', 0.044), ('prevent', 0.043), ('detect', 0.043), ('class', 0.043), ('classifier', 0.043), ('chan', 0.042), ('simulation', 0.041), ('systematic', 0.041), ('goldstandard', 0.041), ('laws', 0.041), ('lyle', 0.041), ('reject', 0.041), ('residual', 0.041), ('ringger', 0.041), ('entropy', 0.04), ('harm', 0.038), ('jingbo', 0.038), ('commitment', 0.038), ('detecting', 0.038), ('filter', 0.037), ('simulating', 0.037), ('inconsistencies', 0.037), ('intrinsically', 0.037), ('heuristics', 0.036), ('impact', 0.036), ('annotated', 0.036), ('xn', 0.036), ('starting', 0.036), ('shannon', 0.035), ('eyal', 0.035), ('noisier', 0.035), ('training', 0.035), ('varying', 0.034), ('saarland', 0.034), ('tong', 0.034), ('learning', 0.033), ('supervised', 0.033), ('ngai', 0.032), ('seriously', 0.032), ('labelled', 0.032), ('settings', 0.032), ('majority', 0.032), ('risk', 0.032), ('retaining', 0.031), ('inserting', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when applied to noisy data.

2 0.28788769 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

3 0.14946471 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Author: Y. Albert Park ; Roger Levy

Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpora of corrected sentences. We also present a cheap and efficient way to provide automated evaluation results for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.

4 0.11979862 238 acl-2011-P11-2093 k2opt.pdf

Author: empty-author

Abstract: We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging. Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set. We also find that the method is both robust to outof-domain data, and can be easily adapted through the use of a combination of partial annotation and active learning.

5 0.10141724 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

Author: Els Lefever ; Veronique Hoste ; Martine De Cock

Abstract: This paper describes a set of exploratory experiments for a multilingual classificationbased approach to Word Sense Disambiguation. Instead of using a predefined monolingual sense-inventory such as WordNet, we use a language-independent framework where the word senses are derived automatically from word alignments on a parallel corpus. We built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both the more traditional local context features, as well as binary bag-of-words features that are extracted from the aligned translations. Our results show that the ParaSense multilingual WSD system shows very competitive results compared to the best systems that were evaluated on the SemEval-2010 Cross-Lingual Word Sense Disambiguation task for all five target languages.

6 0.086817652 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

7 0.085057795 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

8 0.075400151 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

9 0.073771231 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

10 0.073111065 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

11 0.073034093 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

12 0.072085761 330 acl-2011-Using Derivation Trees for Treebank Error Detection

13 0.070639335 258 acl-2011-Ranking Class Labels Using Query Sessions

14 0.070336707 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations

15 0.068576276 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

16 0.062984452 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

17 0.062071253 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

18 0.061642032 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

19 0.059997465 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

20 0.059472449 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.18), (1, 0.029), (2, -0.049), (3, -0.019), (4, -0.031), (5, -0.004), (6, 0.112), (7, -0.033), (8, -0.016), (9, 0.018), (10, -0.042), (11, -0.131), (12, 0.091), (13, 0.086), (14, -0.015), (15, 0.011), (16, 0.028), (17, 0.028), (18, 0.045), (19, 0.006), (20, 0.0), (21, -0.012), (22, 0.023), (23, 0.058), (24, -0.025), (25, 0.056), (26, 0.068), (27, 0.074), (28, 0.014), (29, -0.035), (30, 0.01), (31, 0.01), (32, -0.025), (33, -0.064), (34, 0.096), (35, 0.084), (36, 0.051), (37, -0.039), (38, 0.048), (39, 0.003), (40, 0.144), (41, -0.055), (42, -0.216), (43, -0.068), (44, -0.156), (45, 0.006), (46, 0.107), (47, -0.066), (48, 0.005), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95946622 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when ap- plied to noisy data.

2 0.90493995 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

3 0.67922294 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

Author: Tetsuo Kiso ; Masashi Shimbo ; Mamoru Komachi ; Yuji Matsumoto

Abstract: In bootstrapping (seed set expansion), selecting good seeds and creating stop lists are two effective ways to reduce semantic drift, but these methods generally need human supervision. In this paper, we propose a graphbased approach to helping editors choose effective seeds and stop list instances, applicable to Pantel and Pennacchiotti’s Espresso bootstrapping algorithm. The idea is to select seeds and create a stop list using the rankings of instances and patterns computed by Kleinberg’s HITS algorithm. Experimental results on a variation of the lexical sample task show the effectiveness of our method.

4 0.62129033 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

Author: Mitesh M. Khapra ; Salil Joshi ; Arindam Chatterjee ; Pushpak Bhattacharyya

Abstract: Recent work on bilingual Word Sense Disambiguation (WSD) has shown that a resource deprived language (L1) can benefit from the annotation work done in a resource rich language (L2) via parameter projection. However, this method assumes the presence of sufficient annotated data in one resource rich language which may not always be possible. Instead, we focus on the situation where there are two resource deprived languages, both having a very small amount of seed annotated data and a large amount of untagged data. We then use bilingual bootstrapping, wherein, a model trained using the seed annotated data of L1 is used to annotate the untagged data of L2 and vice versa using parameter projection. The untagged instances of L1 and L2 which get annotated with high confidence are then added to the seed data of the respective languages and the above process is repeated. Our experiments show that such a bilingual bootstrapping algorithm when evaluated on two different domains with small seed sizes using Hindi (L1) and Marathi (L2) as the language pair performs better than monolingual bootstrapping and significantly reduces annotation cost.

5 0.5653432 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

Author: Clifton McFate ; Kenneth Forbus

Abstract: Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX’s shortcomings primarily fell into two categories, suggesting future research directions. 1

6 0.55648613 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

7 0.5349102 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

8 0.52161694 238 acl-2011-P11-2093 k2opt.pdf

9 0.51109487 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

10 0.48085392 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing

11 0.46923876 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

12 0.44350111 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

13 0.44025132 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

14 0.43467742 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

15 0.43305376 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations

16 0.4296141 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes

17 0.42568398 267 acl-2011-Reversible Stochastic Attribute-Value Grammars

18 0.42511117 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

19 0.42465919 194 acl-2011-Language Use: What can it tell us?

20 0.42313799 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.047), (17, 0.039), (26, 0.026), (31, 0.016), (34, 0.224), (37, 0.075), (39, 0.065), (41, 0.068), (46, 0.015), (55, 0.063), (59, 0.029), (72, 0.06), (91, 0.058), (96, 0.111), (97, 0.011), (98, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7828933 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when ap- plied to noisy data.

2 0.76412565 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

Author: Colin Cherry ; Shane Bergsma

Abstract: Graph-based dependency parsing can be sped up significantly if implausible arcs are eliminated from the search-space before parsing begins. State-of-the-art methods for arc filtering use separate classifiers to make pointwise decisions about the tree; they label tokens with roles such as root, leaf, or attaches-tothe-left, and then filter arcs accordingly. Because these classifiers overlap substantially in their filtering consequences, we propose to train them jointly, so that each classifier can focus on the gaps of the others. We integrate the various pointwise decisions as latent variables in a single arc-level SVM classifier. This novel framework allows us to combine nine pointwise filters, and adjust their sensitivity using a shared threshold based on arc length. Our system filters 32% more arcs than the independently-trained classifiers, without reducing filtering speed. This leads to faster parsing with no reduction in accuracy.

3 0.71117806 174 acl-2011-Insights from Network Structure for Text Mining

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: Text mining and data harvesting algorithms have become popular in the computational linguistics community. They employ patterns that specify the kind of information to be harvested, and usually bootstrap either the pattern learning or the term harvesting process (or both) in a recursive cycle, using data learned in one step to generate more seeds for the next. They therefore treat the source text corpus as a network, in which words are the nodes and relations linking them are the edges. The results of computational network analysis, especially from the world wide web, are thus applicable. Surprisingly, these results have not yet been broadly introduced into the computational linguistics community. In this paper we show how various results apply to text mining, how they explain some previously observed phenomena, and how they can be helpful for computational linguistics applications.

4 0.63196892 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

Author: Svetlana Kiritchenko ; Colin Cherry

Abstract: The automatic coding of clinical documents is an important task for today’s healthcare providers. Though it can be viewed as multi-label document classification, the coding problem has the interesting property that most code assignments can be supported by a single phrase found in the input document. We propose a Lexically-Triggered Hidden Markov Model (LT-HMM) that leverages these phrases to improve coding accuracy. The LT-HMM works in two stages: first, a lexical match is performed against a term dictionary to collect a set of candidate codes for a document. Next, a discriminative HMM selects the best subset of codes to assign to the document by tagging candidates as present or absent. By confirming codes proposed by a dictionary, the LT-HMM can share features across codes, enabling strong performance even on rare codes. In fact, we are able to recover codes that do not occur in the training set at all. Our approach achieves the best ever performance on the 2007 Medical NLP Challenge test set, with an F-measure of 89.84.

5 0.62439489 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

6 0.61515546 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

7 0.61420566 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

8 0.61322367 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

9 0.6121164 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

10 0.61131251 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

11 0.60963166 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

12 0.60787421 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

13 0.60672152 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

14 0.60543633 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

15 0.60536534 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

16 0.6050173 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

17 0.60487735 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

18 0.60456669 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

19 0.60444993 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

20 0.60344058 121 acl-2011-Event Discovery in Social Media Feeds