acl acl2011 acl2011-261 knowledge-graph by maker-knowledge-mining

261 acl-2011-Recognizing Named Entities in Tweets


Source: pdf

Author: Xiaohua LIU ; Shaodian ZHANG ; Furu WEI ; Ming ZHOU

Abstract: The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semi-supervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of KNN and semisupervised learning.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. [sent-2, score-0.63]

2 We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. [sent-3, score-0.077]

3 The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. [sent-4, score-0.602]

4 The semi-supervised learning plus the gazetteers alleviate the lack of training data. [sent-5, score-0.161]

5 Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of KNN and semisupervised learning. [sent-6, score-0.059]

6 Exceptions include studies on informal text such as emails, blogs, clinical notes (Wang, 2009). [sent-18, score-0.096]

7 Because of the domain mismatch, current systems trained on non-tweets perform poorly on tweets, a new genre of text, which are short, informal, ungrammatical and noise prone. [sent-19, score-0.052]

8 Thus, building a domain specific NER for tweets is necessary, which requires a lot of annotated tweets or rules. [sent-24, score-0.694]

9 Proposed solutions to alleviate this issue include: 1) Domain adaption, which aims to reuse the knowledge of the source domain in a target domain. [sent-26, score-0.077]

10 (2009), which uses data that is informative about the target domain and also easy to label, to bridge the two domains, and Chiticariu et al. [sent-28, score-0.071]

11 (2010), which introduces a high-level rule language, called NERL, to build the general and do- main specific NER systems; and 2) semi-supervised learning, which aims to use the abundant unlabeled data to compensate for the lack of annotated data. [sent-29, score-0.065]

12 Another challenge is the tweet’s informal nature, making conventional features such as part-of-speech (POS) and capitalization not reliable. [sent-35, score-0.057]

13 Tackling this challenge, ideally, requires adapting related NLP tools to fit tweets, or normalizing tweets to accommodate existing tools, both of which are hard tasks. [sent-40, score-0.332]

14 Firstly, a K-Nearest Neighbors (KNN) based classifier is adopted to conduct word level classification, leveraging the similar and recently labeled tweets. [sent-42, score-0.118]

15 Furthermore, the KNN and CRF model are repeatedly retrained with an incrementally augmented training set, into which tweets labeled with high confidence are added. [sent-45, score-0.428]

16 Finally, following Lev Ratinov and Dan Roth (2009), 30 gazetteers are used, which cover common names, countries, locations, temporal expressions, etc. [sent-47, score-0.137]

17 The underlying idea of our method is to combine global evidence from KNN and the gazetteers with local contextual information, and to use common knowledge and unlabeled tweets to make up for the lack of training data. [sent-49, score-0.531]

18 12,245 tweets are manually annotated as the test data set. [sent-50, score-0.332]

19 We propose a novel method that combines a KNN classifier with a conventional CRF-based labeler under a semi-supervised learning framework to combat the lack of information in tweets and the unavailability of training data. [sent-56, score-0.48]

20 (2010) use Amazon’s Mechanical Turk service and CrowdFlower to annotate named entities in tweets and train a CRF model to evaluate the effectiveness of human labeling. [sent-70, score-0.519]

21 In contrast, our work aims to build a system that can automatically identify named entities in tweets. [sent-71, score-0.187]

22 To achieve this, a KNN classifier is combined with a CRF model to leverage cross-tweet information, and semi-supervised learning is adopted to leverage unlabeled tweets. [sent-72, score-0.471]

23 For example, Krupka and Hausman (1998) use manual rules to extract entities of predefined types; Zhou and Ju (2002) adopt Hidden Markov Models (HMM) while Finkel et al. [sent-75, score-0.09]

24 A state-of-the-art biomedical NER system (Yoshida and Tsujii, 2007) uses lexical features, orthographic features, semantic features and syntactic features, such as part-of-speech (POS) and shallow parsing. [sent-87, score-0.05]

25 It proves useful when labeled data is scarce and hard to construct while unlabeled data is abundant and easy to access. [sent-99, score-0.106]

26 It iteratively adds data that has been 361 confidently labeled but is also informative to its training set, which is used to re-train its model. [sent-101, score-0.096]

27 (2009) propose another bootstrapping algorithm that selects bridging instances from an unlabeled target domain, which are informative about the target domain and also easy to label correctly. [sent-105, score-0.09]

28 We adopt bootstrapping as well, but use human labeled tweets as seeds. [sent-106, score-0.399]

29 1 The Tweets A tweet is a short text message of no more than 140 characters on Twitter, the biggest micro-blogging service. [sent-117, score-0.27]

30 Words beginning with the “#” character, like “#Win”, “#Contest” and “#Giveaway”, are hash tags, usually indicating the topics of the tweet; words starting with “@”, like “@office” and “@momtobedby8”, represent user names, and “http://bit. [sent-120, score-0.046]

31 Twitter users are interested in named entities, such … (Figure 1: Portion of different types of named entities in tweets.) [sent-122, score-0.284]

32 as person names, organization names and product names, as evidenced by the abundant named entities in tweets. [sent-124, score-0.285]

33 According to our investigation on 12,245 randomly sampled tweets that are manually labeled, about 46. [sent-125, score-0.332]

34 Figure 1 shows the portion of named entities of different types. [sent-127, score-0.187]

35 2 The Task Given a tweet as input, our task is to identify both the boundary and the class of each mention of entities of predefined types. [sent-129, score-0.36]

36 We focus on four types of entities in our study, i. [sent-130, score-0.09]

37 Me without you is like an iphone without apps, Justin Bieber without his hair, Lady gaga without her telephone, it just wouldn. [sent-138, score-0.075]

38 Me without you is like an iphone without apps, Justin Bieber without his hair, Lady gaga without her telephone, it just wouldn. [sent-144, score-0.075]

39 Following common practice, we adopt a sequential labeling approach to jointly resolve these sub-tasks, i.e. [sent-154, score-0.069]

40 , for each word in the input tweet, a label is assigned to it, indicating both the boundary and entity type. [sent-156, score-0.079]
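
A quick illustration of sentence 40: each word receives one label that jointly encodes the entity boundary and the entity type. The sketch below assumes a BILOU tag set (BILOU appears in this page's tf-idf word list, but the exact scheme is not spelled out in this summary), and the example span is hypothetical.

```python
# Minimal sketch: encode entity spans as per-word BILOU tags
# (Begin, Inside, Last, Unit-length, Outside). The scheme and the
# example span are illustrative assumptions, not the paper's data.

def bilou_encode(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"U-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"L-{etype}"
    return tags

tokens = "Justin Bieber without his hair".split()
print(list(zip(tokens, bilou_encode(tokens, [(0, 2, "PERSON")]))))
# [('Justin', 'B-PERSON'), ('Bieber', 'L-PERSON'), ('without', 'O'), ...]
```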

41 Our method, as illustrated in Algorithm 1, repeatedly adds newly and confidently labeled tweets to the training set and retrains itself once the number of newly accumulated training items exceeds the threshold N. [sent-161, score-0.428]

42 Algorithm 1 also demonstrates one striking characteristic of our method: A KNN classifier is applied to determine the label of the current word before the CRF model. [sent-162, score-0.077]

43 The labels of the words that are confidently assigned by the KNN classifier are treated as visible variables for the CRF model. [sent-163, score-0.132]

44 … describes how the KNN classifier is trained. (Footnote 4: The training set ts has a maximum allowable number of items, which is 10,000 in our work.) [sent-167, score-0.129]

45 1: Initialize ls, the CRF labeler: ls = trains (ts). [sent-172, score-0.073]

46 2: Initialize lk, the KNN classifier: lk = traink (ts). [sent-173, score-0.163]

47 4: while Pop a tweet t from i and t ≠ null do 5: for each word w ∈ t do 6: Get the word feature vector w⃗: w⃗ = reprw (w, t). [sent-175, score-0.341]

48 8: if cf > τ then 9: Pre-label: t = update(t, w, c). [sent-177, score-0.059]

49 15: if cf > γ then 16: Add labeled result t to ts, n = n + 1. [sent-181, score-0.152]

50 17: end if 18: if n > N then 19: Retrain ls: ls = trains (ts). [sent-182, score-0.114]
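
Sentences 41–50 outline Algorithm 1: KNN pre-labels confident words, the CRF labels the rest, confidently labeled tweets augment the training set, and both models are retrained once more than N new items accumulate. The sketch below mirrors only that control flow; every component function (train_crf, train_knn, knn_predict, crf_decode, featurize) and the default threshold values are placeholders, not the authors' implementation.

```python
# Control-flow sketch of Algorithm 1 (semi-supervised loop). All model
# components are injected callables; tau, gamma, N and the training-set
# cap are the thresholds named in the surrounding sentences.

def semi_supervised_ner(stream, ts, train_crf, train_knn, knn_predict,
                        crf_decode, featurize, tau=0.9, gamma=0.9,
                        N=1000, max_ts=10_000):
    crf, knn = train_crf(ts), train_knn(ts)
    n = 0
    for tweet in stream:                      # unlabeled tweets, one by one
        pre_labels = {}
        for i, word in enumerate(tweet):
            label, cf = knn_predict(knn, featurize(i, tweet))
            if cf > tau:                      # confident KNN labels become
                pre_labels[i] = label         # observed variables for the CRF
        labeled, cf = crf_decode(crf, tweet, pre_labels)
        if cf > gamma:                        # confident tweets augment ts
            ts.append(labeled)
            del ts[:-max_ts]                  # keep at most max_ts items
            n += 1
        if n > N:                             # periodic retraining
            crf, knn, n = train_crf(ts), train_knn(ts), 0
        yield labeled
```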

51 2: for each tweet t ∈ ts do 3: for each word, label pair (w, c) ∈ t do 4: Get the feature vector w⃗: w⃗ = reprw (w, t). [sent-189, score-0.393]

52 5: Add the w⃗ and c pair to the classifier: lk = lk ∪ {( w⃗, c)}. [sent-190, score-0.242]

53 6: end for 7: end for 8: return KNN classifier lk. [sent-191, score-0.077]
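
Algorithm 2 reduces KNN "training" to storing one (feature vector, label) pair per word of every labeled tweet, which is what makes incremental retraining cheap. A minimal sketch, assuming tweets arrive as (word, label) sequences and featurize is any callable returning a sparse vector (for instance the window bag-of-words sketched after sentence 58):

```python
# Sketch of Algorithm 2: the KNN "model" is simply the stored
# (feature vector, label) pairs collected from the labeled tweets.

def train_knn(labeled_tweets, featurize):
    store = []
    for tweet in labeled_tweets:              # tweet: list of (word, label)
        words = [w for w, _ in tweet]
        for i, (_, label) in enumerate(tweet):
            store.append((featurize(i, words), label))
    return store
```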

54 Two desirable properties of KNN make it stand out from its alternatives: 1) It can straightforwardly incorporate evidence from new labeled tweets and retraining is fast; and 2) combining with a CRF … (Algorithm 3: KNN prediction.) [sent-194, score-0.435]

55 1: Initialize nb, the neighbors of w⃗: nb = neighbors(lk, w⃗). [sent-196, score-0.058]

56 3: Calculate the labeling confidence cf: cf = ∑_{( w⃗′,c′)∈nb} δ(c, c′) · cos( w⃗, w⃗′) / ∑_{( w⃗′,c′)∈nb} cos( w⃗, w⃗′). 4: return the predicted label c∗ and its confidence cf. [sent-198, score-0.092]
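
As reconstructed above, step 3 is a cosine-weighted vote: the summed similarity of the neighbors that carry the predicted label, normalized by the summed similarity of all neighbors. A self-contained sketch using sparse dict vectors follows; the value of k is an assumption.

```python
import math
from collections import defaultdict

def cos(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(store, w_vec, k=5):
    """Algorithm 3: predicted label for w_vec and the confidence cf of step 3."""
    sims = sorted(((cos(w_vec, v), c) for v, c in store), reverse=True)[:k]
    total = sum(s for s, _ in sims) or 1e-12
    score = defaultdict(float)
    for s, c in sims:                          # delta(c, c') folded into the sum
        score[c] += s
    label, best = max(score.items(), key=lambda kv: kv[1])
    return label, best / total                 # cf lies in [0, 1]
```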

57 We have written a Viterbi decoder that can incorporate partially observed labels to implement the crf function in Algorithm 1. [sent-203, score-0.302]
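
Sentence 57 mentions a Viterbi decoder that honors partially observed labels (the confident KNN pre-labels). One standard way to do this is to mask every label except the observed one at the constrained positions before the usual max-product recursion; the sketch below works on log-scores and is an illustration of the idea, not the authors' decoder.

```python
import numpy as np

def constrained_viterbi(emit, trans, observed=None):
    """emit: (T, L) log-scores; trans: (L, L) log-transition scores;
    observed: {position: forced_label_index} from confident pre-labels."""
    T, L = emit.shape
    NEG = -1e9
    emit = emit.copy()
    for t, y in (observed or {}).items():      # force the observed label by
        mask = np.full(L, NEG)                 # masking every other state
        mask[y] = 0.0
        emit[t] += mask
    delta = np.empty((T, L))
    back = np.zeros((T, L), dtype=int)
    delta[0] = emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans + emit[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):              # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```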

58 3 Features Given a word in a tweet, the KNN classifier considers a text window of size 5 with the word in the middle (Zhang and Johnson, 2003), and extracts bag-of-words features from the window. [sent-205, score-0.077]
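
A minimal sketch of the bag-of-words window features described in sentence 58, i.e. a window of size 5 with the target word in the middle (two words of context on each side):

```python
from collections import Counter

def window_bow(i, words, half_window=2):
    """Bag-of-words counts from a size-5 window centred on position i."""
    lo, hi = max(0, i - half_window), min(len(words), i + half_window + 1)
    return Counter(words[lo:hi])

# e.g. window_bow(2, "i love Cali so much".split())
# Counter({'i': 1, 'love': 1, 'Cali': 1, 'so': 1, 'much': 1})
```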

59 For each word, our CRF model extracts similar features as Wang (2009) and Ratinov and Roth (2009), namely, orthographic features, lexical features and gazetteers related features. [sent-206, score-0.162]

60 In our work, we use the gazetteers provided by Ratinov and Roth (2009). [sent-207, score-0.137]

61 The stop words used here are mainly from a set of frequently-used words. The other is that tweet meta data is normalized, that is, every link becomes *LINK* and every … [sent-210, score-0.27]
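
Sentence 61 states that tweet meta-data is normalized so that every link becomes *LINK*; the rest of the rule is truncated in this extraction. The sketch below is a hedged illustration: the *LINK* token comes from the text, while the *ACCOUNT* placeholder for @-mentions and the hashtag handling are assumptions, not the paper's rules.

```python
import re

LINK_RE = re.compile(r"https?://\S+|\bbit\.ly/\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def normalize_tweet(text):
    text = LINK_RE.sub("*LINK*", text)          # stated in the text
    text = MENTION_RE.sub("*ACCOUNT*", text)    # assumption, not in the summary
    text = HASHTAG_RE.sub(r"\1", text)          # assumption: keep the tag word
    return text

print(normalize_tweet("RT @office: #Win a prize http://bit.ly/xyz"))
# RT *ACCOUNT*: Win a prize *LINK*
```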

62 The ineffectiveness of these features is linked to the noisy and informal nature of tweets. [sent-224, score-0.057]

63 We are interested in exploring other tweet representations, which may fit our NER task, for example the LSA models (Guo et al. [sent-227, score-0.27]

64 In our work, gazetteers prove to be substantially useful, which is consistent with the observation of Ratinov and Roth (2009). [sent-230, score-0.137]

65 However, the gazetteers used in our work contain noise, which hurts the performance. [sent-231, score-0.137]

66 Moreover, they are static, taken directly from Ratinov and Roth (2009), and thus have relatively low coverage, especially for person names and product names in tweets. [sent-232, score-0.111]

67 In the future, we plan to feed fresh entities correctly identified from tweets back into the gazetteers. [sent-234, score-0.422]

68 The correctness of an entity can rely on its frequency or other evidence. [sent-235, score-0.079]

69 Similarly, to study the effectiveness of the CRF model, it is replaced by its alternatives, such as the HMM labeler and a beam search plus a maximum entropy based classifier. [sent-239, score-0.105]

70 Figure 1 shows the portion of named entities of different types. [sent-250, score-0.187]

71 2 Evaluation Metrics For every type of named entity, Precision (Pre. [sent-255, score-0.097]

72 For the overall performance, we use the average Precision, Recall and F1, where the weight of each named entity type is proportional to the number of entities of that type. [sent-259, score-0.191]
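
Sentence 72 defines the overall scores as per-type Precision/Recall/F1 averaged with weights proportional to each type's entity count. A small sketch of that aggregation, assuming the weight is each type's share of gold entities and that per-type counts are already available; the example numbers are made up.

```python
def overall_prf(per_type):
    """per_type: {entity_type: (true_positives, n_predicted, n_gold)}."""
    total_gold = sum(g for _, _, g in per_type.values()) or 1
    P = R = F = 0.0
    for tp, pred, gold in per_type.values():
        p = tp / pred if pred else 0.0
        r = tp / gold if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        w = gold / total_gold                  # weight: share of gold entities
        P, R, F = P + w * p, R + w * r, F + w * f
    return P, R, F

print(overall_prf({"PERSON": (80, 100, 120), "LOCATION": (40, 60, 80)}))
# ≈ (0.747, 0.600, 0.665)
```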

73 4 Basic Results Table 1 shows the overall results for the baselines and for our system, named NERCB. [sent-291, score-0.053]

74 Here our system is trained as described in Algorithm 1, combining a KNN classifier and a CRF labeler, with semisupervised learning enabled. [sent-292, score-0.105]

75 Tables 2-5 report the results on each entity type, indicating that our method consistently yields better results on all entity types. [sent-295, score-0.158]

76 We further check the confidently predicted labels of the KNN classifier, which account for about 22. [sent-299, score-0.055]

77 This largely explains why the KNN classifier helps the CRF labeler. [sent-303, score-0.077]

78 The KNN classifier is replaced with its competitors, and only a slight difference in performance is observed. [sent-304, score-0.077]

79 Table 7 shows the overall performance of the CRF labeler with various feature set combinations, where Fo, Fl and Fg denote the orthographic features, the lexical features and the gazetteers related features, respectively. [sent-349, score-0.267]

80 5% of all errors, is largely related to slang expressions and informal abbreviations. [sent-358, score-0.081]

81 For example, our method identifies “Cali”, which actually means “California”, as a PERSON in the tweet “i love Cali so much”. [sent-359, score-0.27]

82 Table 7: Overall performance of the CRF labeler (combined with KNN) with different feature sets. [sent-383, score-0.105]

83 component to handle such slang expressions and informal abbreviations. [sent-384, score-0.081]

84 For example, for this tweet “come to see jaxon someday”, our method mistakenly labels “jaxon” as a LOCATION, which actually denotes a PERSON. [sent-387, score-0.317]

85 This error is somewhat understandable, since this tweet is one of the earliest tweets that mention “jaxon”, and at that time there was no strong evidence supporting that it represents a person. [sent-388, score-0.63]

86 Possible solutions to these errors include continually enriching the gazetteers and aggregating additional external knowledge from other channels such as traditional news. [sent-389, score-0.183]

87 Consider this tweet “wesley snipes ws cought 4 nt payin tax coz ths celebz dnt take it cirus. [sent-392, score-0.302]

88 ”, in which “wesley snipes” is not identified as a PERSON but simply ignored by our method, because this tweet is too noisy to provide effective features. [sent-393, score-0.27]

89 Figure 2: F1 score on 10 test data sets sequentially fed into the system, each with 600 instances. [sent-404, score-0.052]

90 6 Conclusions and Future work We propose a novel NER system for tweets, which combines a KNN classifier with a CRF labeler under a semi-supervised learning framework. [sent-406, score-0.182]

91 The KNN classifier collects global information across recently labeled tweets while the CRF labeler exploits information from a single tweet and from the gazetteers. [sent-407, score-0.825]

92 Firstly, we hope to develop tweet normalization technology to make tweets friendlier to the NER task. [sent-410, score-0.602]

93 Secondly, we are interested in integrating new entities from tweets or other channels into the gazetteers. [sent-411, score-0.445]

94 Domain adaptation with latent semantic association for named entity recognition. [sent-459, score-0.176]

95 An effective two-stage model for exploiting non-local dependencies in named entity recognition. [sent-478, score-0.176]

96 Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. [sent-494, score-0.201]

97 Extracting personal names from email: applying named entity recognition to informal text. [sent-504, score-0.302]

98 Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. [sent-520, score-0.103]

99 Introduction to the CoNLL-2003 shared task: languageindependent named entity recognition. [sent-525, score-0.176]

100 A robust risk minimization based named entity recognition system. [sent-541, score-0.201]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('knn', 0.629), ('tweets', 0.332), ('crf', 0.302), ('ner', 0.279), ('tweet', 0.27), ('gazetteers', 0.137), ('lk', 0.121), ('labeler', 0.105), ('ratinov', 0.102), ('named', 0.097), ('entities', 0.09), ('entity', 0.079), ('classifier', 0.077), ('roth', 0.063), ('cf', 0.059), ('informal', 0.057), ('confidently', 0.055), ('ts', 0.052), ('conducts', 0.048), ('jaxon', 0.047), ('reprw', 0.047), ('finkel', 0.045), ('names', 0.044), ('gaga', 0.042), ('krupka', 0.042), ('traink', 0.042), ('ls', 0.041), ('labeled', 0.041), ('clinical', 0.039), ('chiticariu', 0.038), ('bieber', 0.038), ('lady', 0.038), ('guo', 0.038), ('initialize', 0.037), ('sequential', 0.036), ('nb', 0.035), ('justin', 0.035), ('retraining', 0.034), ('unlabeled', 0.034), ('downey', 0.033), ('iphone', 0.033), ('labeling', 0.033), ('bilou', 0.032), ('gachet', 0.032), ('giveaway', 0.032), ('hausman', 0.032), ('krishnan', 0.032), ('nadeau', 0.032), ('nercb', 0.032), ('nerl', 0.032), ('reprt', 0.032), ('snipes', 0.032), ('trains', 0.032), ('baselines', 0.031), ('abundant', 0.031), ('domain', 0.03), ('sequentially', 0.028), ('semisupervised', 0.028), ('hair', 0.028), ('cali', 0.028), ('unavailability', 0.028), ('evidence', 0.028), ('bootstrapping', 0.026), ('harbin', 0.026), ('minkov', 0.026), ('wesley', 0.026), ('apps', 0.026), ('jansche', 0.026), ('orthographic', 0.025), ('recognition', 0.025), ('beginning', 0.025), ('locations', 0.025), ('biomedical', 0.025), ('alleviate', 0.024), ('fed', 0.024), ('slang', 0.024), ('contest', 0.024), ('tohred', 0.024), ('office', 0.023), ('twitter', 0.023), ('person', 0.023), ('neighbors', 0.023), ('solutions', 0.023), ('yoshida', 0.023), ('channels', 0.023), ('shanghai', 0.023), ('somehow', 0.023), ('fields', 0.022), ('noise', 0.022), ('singh', 0.022), ('bio', 0.022), ('doug', 0.022), ('sang', 0.022), ('name', 0.022), ('win', 0.021), ('hash', 0.021), ('remarkably', 0.021), ('tong', 0.021), ('tjong', 0.021), ('opennlp', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 261 acl-2011-Recognizing Named Entities in Tweets

Author: Xiaohua LIU ; Shaodian ZHANG ; Furu WEI ; Ming ZHOU

Abstract: The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semi-supervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of KNN and semisupervised learning.

2 0.31505099 292 acl-2011-Target-dependent Twitter Sentiment Classification

Author: Long Jiang ; Mo Yu ; Ming Zhou ; Xiaohua Liu ; Tiejun Zhao

Abstract: Sentiment analysis on Twitter data has attracted much attention recently. In this paper, we focus on target-dependent Twitter sentiment classification; namely, given a query, we classify the sentiments of the tweets as positive, negative or neutral according to whether they contain positive, negative or neutral sentiments about that query. Here the query serves as the target of the sentiments. The state-ofthe-art approaches for solving this problem always adopt the target-independent strategy, which may assign irrelevant sentiments to the given target. Moreover, the state-of-the-art approaches only take the tweet to be classified into consideration when classifying the sentiment; they ignore its context (i.e., related tweets). However, because tweets are usually short and more ambiguous, sometimes it is not enough to consider only the current tweet for sentiment classification. In this paper, we propose to improve target-dependent Twitter sentiment classification by 1) incorporating target-dependent features; and 2) taking related tweets into consideration. According to the experimental results, our approach greatly improves the performance of target-dependent sentiment classification. 1

3 0.29880935 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty

Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.

4 0.18680175 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith

Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

5 0.17497832 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

Author: Roberto Gonzalez-Ibanez ; Smaranda Muresan ; Nina Wacholder

Abstract: Sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. We report on a method for constructing a corpus of sarcastic Twitter messages in which determination of the sarcasm of each message has been made by its author. We use this reliable corpus to compare sarcastic utterances in Twitter to utterances that express positive or negative attitudes without sarcasm. We investigate the impact of lexical and pragmatic factors on machine learning effectiveness for identifying sarcastic utterances and we compare the performance of machine learning techniques and human judges on this task. Perhaps unsurprisingly, neither the human judges nor the machine learning techniques perform very well. 1

6 0.16912684 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

7 0.14835617 177 acl-2011-Interactive Group Suggesting for Twitter

8 0.1146281 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System

9 0.084915504 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

10 0.083992422 121 acl-2011-Event Discovery in Social Media Feeds

11 0.073864333 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

12 0.068971835 305 acl-2011-Topical Keyphrase Extraction from Twitter

13 0.066145457 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

14 0.065071531 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

15 0.06462013 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

16 0.060385063 174 acl-2011-Insights from Network Structure for Text Mining

17 0.060339455 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

18 0.057034615 117 acl-2011-Entity Set Expansion using Topic information

19 0.054383278 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

20 0.053900052 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.155), (1, 0.129), (2, 0.026), (3, -0.03), (4, 0.041), (5, -0.003), (6, 0.04), (7, -0.21), (8, -0.081), (9, 0.13), (10, -0.184), (11, 0.159), (12, 0.126), (13, -0.11), (14, -0.092), (15, -0.071), (16, -0.053), (17, -0.005), (18, 0.01), (19, 0.021), (20, -0.093), (21, -0.098), (22, 0.025), (23, 0.0), (24, -0.017), (25, 0.057), (26, -0.037), (27, -0.075), (28, 0.023), (29, 0.09), (30, -0.031), (31, 0.076), (32, 0.026), (33, 0.001), (34, 0.041), (35, 0.11), (36, 0.117), (37, -0.124), (38, -0.032), (39, 0.024), (40, -0.073), (41, -0.032), (42, -0.036), (43, -0.091), (44, -0.02), (45, -0.062), (46, -0.073), (47, 0.054), (48, -0.016), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92230433 261 acl-2011-Recognizing Named Entities in Tweets

Author: Xiaohua LIU ; Shaodian ZHANG ; Furu WEI ; Ming ZHOU

Abstract: The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semi-supervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of KNN and semisupervised learning.

2 0.80683571 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

Author: Roberto Gonzalez-Ibanez ; Smaranda Muresan ; Nina Wacholder

Abstract: Sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. We report on a method for constructing a corpus of sarcastic Twitter messages in which determination of the sarcasm of each message has been made by its author. We use this reliable corpus to compare sarcastic utterances in Twitter to utterances that express positive or negative attitudes without sarcasm. We investigate the impact of lexical and pragmatic factors on machine learning effectiveness for identifying sarcastic utterances and we compare the performance of machine learning techniques and human judges on this task. Perhaps unsurprisingly, neither the human judges nor the machine learning techniques perform very well. 1

3 0.75290287 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith

Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

4 0.65928364 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty

Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.

5 0.62217331 292 acl-2011-Target-dependent Twitter Sentiment Classification

Author: Long Jiang ; Mo Yu ; Ming Zhou ; Xiaohua Liu ; Tiejun Zhao

Abstract: Sentiment analysis on Twitter data has attracted much attention recently. In this paper, we focus on target-dependent Twitter sentiment classification; namely, given a query, we classify the sentiments of the tweets as positive, negative or neutral according to whether they contain positive, negative or neutral sentiments about that query. Here the query serves as the target of the sentiments. The state-ofthe-art approaches for solving this problem always adopt the target-independent strategy, which may assign irrelevant sentiments to the given target. Moreover, the state-of-the-art approaches only take the tweet to be classified into consideration when classifying the sentiment; they ignore its context (i.e., related tweets). However, because tweets are usually short and more ambiguous, sometimes it is not enough to consider only the current tweet for sentiment classification. In this paper, we propose to improve target-dependent Twitter sentiment classification by 1) incorporating target-dependent features; and 2) taking related tweets into consideration. According to the experimental results, our approach greatly improves the performance of target-dependent sentiment classification. 1

6 0.54547375 177 acl-2011-Interactive Group Suggesting for Twitter

7 0.42384568 121 acl-2011-Event Discovery in Social Media Feeds

8 0.41540956 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

9 0.41158557 305 acl-2011-Topical Keyphrase Extraction from Twitter

10 0.40939847 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System

11 0.37962881 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

12 0.35565752 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

13 0.34280017 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

14 0.33673608 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

15 0.31185651 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

16 0.31105852 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

17 0.31040049 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

18 0.30284059 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

19 0.30237135 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes

20 0.29736957 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.019), (17, 0.048), (26, 0.024), (37, 0.094), (39, 0.049), (41, 0.048), (55, 0.03), (57, 0.013), (59, 0.04), (72, 0.382), (91, 0.037), (96, 0.11), (97, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90899497 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya

Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing. 1 Motivation and Contributions One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), “over a billion people speak English as their second or for- eign language.” This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year’s ACL conference, there are four long papers devoted to this topic. Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers 508 to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.1 For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures. The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and CallisonBurch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourc1There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise. Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 508–513, ing to both address the lack ofappropriate evaluation metrics and to make system comparison easier. 
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups). 2 A Case Study: Extraneous Prepositions We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence “They came to outside”, the preposition to is an extraneous error whereas in the sentence “They arrived to the town” the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors. 2.1 Data and Systems For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for Test of English as a Foreign Language (TOEFL?R). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each tri- gram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bi509 gram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subse- quent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:2 • • 3 LM: A 4-gram language model trained on tLhMe Google Wme lba1nTg 5-gram Corpus dw oithn SRILM (Stolcke, 2002). 
PERC: An averaged Perceptron (Freund and Schapire, 1999) calgaessdif Pieerr—ce as implemented nind the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008). Crowdsourcing Recently,we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality con- × trol on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions 2Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems. & extraneous prepositions). To impose such quality control, one has to provide “gold” instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers’ judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts. More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences.3 We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87 with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75 respectively). The extraneous preposition annotation cost only $325 (923 judgments 20 Turkers) and was com- pleted 9in2 a single day. T 2h0e only rres)st arnicdtio wna on tmheTurkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower. The 3 expert judgments are not used any further in this analysis. 4 Revamping System Evaluation In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§ 4. 1), and (b) by easily enabling system comparison (§ 4.2). 4.1 Crowd-informed Evaluation Measures When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all “bad” uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or 3We found 2 duplicate sentences and removed them. 510 incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence “In addition classmates play with some game and enjoy” is an example). The regular measures of precision and recall would be fairer if they reflected this reality. 
Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (Hw), misses (Mw) and false positives (FPw): XN Hw = X(csiys ∗ picrowd) (1) Xi XN Mw = X((1 − csiys) ∗ picrowd) (2) Xi XN FPw = X(csiys ∗ (1 − picrowd)) (3) Xi In the above equations, N is the total number of instances, csiys is the class (1 or 0) , and picrowd indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, picrowd would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precisionw = Hw/(Hw Hw/(Hw + FPw) and weighted + Mw). recall as Recallw = agreement Figure 1: Histogram of Turker agreements for all 923 instances on whether a preposition is extraneous. UWnwei gihg tede Pr0 e.c9 i5s0i70onR0 .e3 c78al14l Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM. To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for this system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether 511 [n=93] [n=1 14] Agreement Bin [n=71 6] Figure 2: Unweighted precision/recall by agreement bins for LM & PERC. a case is an error then a system’s output cannot be considered entirely right or entirely wrong.4 As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. However, weighted precision is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem. 
4.2 Enabling System Comparison In this section, we show how to easily compare different systems both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets. Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown 4The difference between unweighted and weighted measures can vary depending on the distribution of agreement. since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily on different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have similar performance as LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier’s decision process. When it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins—as is the norm—this important qualification would have been harder to discover. While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots. However, this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here. 5 Conclusions Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems and we do so by leveraging crowdsourcing. For system development, we rec- ommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall to 512 yield fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots. 
As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the com- munity.5 Acknowledgments We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback. References Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 195–203. Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of EMNLP, pages 286– 295. Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24. 5http : / /bit . ly/ crowdgrammar Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG. Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 53–56. Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176. Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 80–88. Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296. Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP. Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners’ Writing. In Proceedings of NAACL, pages 163–171 . Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1. Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 108–1 13. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool. Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. 
thesis, Department of Computer Science, University of Maryland College Park. Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215. Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of 513 the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September. Alla Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Alla Rozovskaya and D. Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP. Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286. Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872. Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48. Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon ’s Mechanical Turk, pages 163–167. Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.

2 0.87938219 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

Author: Paul Piwek ; Svetlana Stoyanchev

Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.

3 0.87102246 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1

4 0.86676878 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

same-paper 5 0.82850218 261 acl-2011-Recognizing Named Entities in Tweets

Author: Xiaohua LIU ; Shaodian ZHANG ; Furu WEI ; Ming ZHOU

Abstract: The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN based classifier conducts pre-labeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semi-supervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of KNN and semisupervised learning.

6 0.81897378 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

7 0.80966747 252 acl-2011-Prototyping virtual instructors from human-human corpora

8 0.74543417 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

9 0.68215644 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

10 0.62619692 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

11 0.60639054 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

12 0.60520363 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

13 0.60473663 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization

14 0.60294658 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

15 0.58780664 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

16 0.58356851 292 acl-2011-Target-dependent Twitter Sentiment Classification

17 0.58299088 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

18 0.58199084 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

19 0.57923645 141 acl-2011-Gappy Phrasal Alignment By Agreement

20 0.57777143 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments