emnlp emnlp2011 emnlp2011-17 knowledge-graph by maker-knowledge-mining

17 emnlp-2011-Active Learning with Amazon Mechanical Turk


Source: pdf

Author: Florian Laws ; Christian Scheible ; Hinrich Schutze

Abstract: Supervised classification needs large amounts of annotated training data that is expensive to create. Two approaches that reduce the cost of annotation are active learning and crowdsourcing. However, these two approaches have not been combined successfully to date. We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Two approaches that reduce the cost of annotation are active learning and crowdsourcing. [sent-4, score-0.479]

2 We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario. [sent-6, score-1.636]

3 Recently, crowdsourcing services like Amazon Mechanical Turk (MTurk) have become available as an alternative that offers acquisition of non-expert annotations at low cost. [sent-9, score-0.328]

4 The cost of MTurk annotation is low, but a consequence of using non-expert annotators is much lower annotation quality. [sent-11, score-0.583]

5 AL reduces annotation effort by setting up an annotation loop where, starting from a small seed set, only the maximally informative examples are chosen for annotation. [sent-14, score-0.541]
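
The loop below is a minimal sketch of this idea (pool-based uncertainty sampling from a small seed set). The synthetic data, the logistic-regression learner, and the simulated annotator are illustrative stand-ins for the paper's actual classifiers and MTurk workers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(500, 20)                               # toy feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)        # simulated annotator answers

# small seed set containing both classes; everything else is the unlabeled pool
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(50):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)            # least-confident scoring
    query = pool.pop(int(np.argmax(uncertainty)))    # "send" it to an annotator
    labeled.append(query)                            # record the new label

print("accuracy on remaining pool:", clf.score(X[pool], y[pool]))
```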

6 Until recently, most AL studies focused on simulating the annotation process by using already available gold standard data. [sent-21, score-0.276]

7 For this reason, some authors have questioned the applicability of AL to noisy annotation scenarios such as MTurk (Baldridge and Palmer, 2009; Rehbein et al. [sent-23, score-0.315]

8 AL and crowdsourcing are complementary approaches: AL reduces the number of annotations used while crowdsourcing reduces the cost per annotation. [sent-25, score-0.68]

9 Our main contribution in this paper is that we show for the first time that AL is significantly better than randomly selected annotation examples in a real crowdsourcing annotation scenario. [sent-27, score-0.701]

10 Our experiments directly address two tasks, named entity recognition and sentiment detection, but our ... [sent-28, score-0.351]

11 We also show that the effectiveness of MTurk annotation with AL can be further enhanced by using two techniques that increase label quality: adaptive voting and fragment recovery. [sent-31, score-0.679]

12 (2010) choose an annotation interface where annotators have to drag the mouse to select entities. [sent-41, score-0.349]

13 Carpenter and Poesio (2010) argue that dragging is less convenient for workers than marking tokens. [sent-42, score-0.269]

14 Another important difference is that previous studies on NER have used data sets for which no “linguistic” gold annotation is available. [sent-44, score-0.276]

15 (2005) were among the first to investigate the effect of actively sampled instances on agreement of labels and annotation time. [sent-49, score-0.312]

16 (2010) investigate AL with human expert annotators for word sense disambiguation, but do not find convincing evidence that AL reduces annotation cost in a realistic (non-simulated) annotation scenario. [sent-55, score-0.645]

17 (2010) carried out experiments on sentiment active learning through crowdsourcing. [sent-57, score-0.451]

18 However, in a crowdsourcing scenario, it is not possible to ask specific annotators for a label, as crowdsourcing workers join and leave the site. [sent-60, score-0.799]

19 We are not aware of any study that shows that AL is significantly better than a simple baseline of having annotators annotate randomly selected examples in a highly noisy annotation setting like crowdsourcing. [sent-63, score-0.389]

20 While AL generally is superior to this baseline in simulated experiments, it is not clear that this result carries over to crowdsourcing annotation. [sent-64, score-0.287]

21 3 Annotation System One fundamental design criterion for our annotation system was the ability to select examples in real time to support, e. [sent-66, score-0.277]

22 First, the administrator can manage annotation experiments using a web interface and publish annotation tasks associated with an experiment on MTurk. [sent-73, score-0.472]

23 Second, the frontend web application presents annotation tasks to MTurk workers. [sent-75, score-0.324]

24 An external question contains a URL to our frontend web application, which is queried when a worker views an annotation task. [sent-78, score-0.55]

25 The backend component is responsible for selection of an example to be annotated in response to a worker’s request for an annotation task. [sent-80, score-0.387]

26 The backend implements a diverse choice of random and active selection strategies as well as the multilabeling strategies described in section 3. [sent-81, score-0.513]

27 Lowercase tokens are prelabeled with “O” (no named entity), but annotators are encouraged to change this label if the token is in fact part of an entity phrase. [sent-88, score-0.335]

28 For sentiment annotation, we found in preliminary experiments that using simple radio button selection for the choice of the document label (positive or negative) leads to a very high amount of spam submissions, taking the overall classification accuracy down to around 55%. [sent-89, score-0.422]

29 1 Concurrent example selection AL works by setting up an interactive annotation loop where at each iteration, the most informative example is selected for annotation. [sent-93, score-0.33]

30 However, batch selection might not give the optimum selection (examples in a batch are likely to be redundant, see Brinker (2003)) and wait times can still occur between one batch and the next. [sent-103, score-0.349]

31 When performing annotation with MTurk, wait times are unacceptable. [sent-104, score-0.294]

32 Thus, we perform the retraining and uncertainty rescoring concurrently with the annotation user interface. [sent-105, score-0.314]

33 The annotation user interface takes the most informative example from the pool and presents it to the annotator. [sent-107, score-0.364]

34 In this way, annotation and example selection can run in parallel. [sent-110, score-0.287]
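
A rough sketch of this concurrency pattern, under simplifying assumptions (an in-memory pool, a toy informativeness score, and a sleep standing in for retraining): one background thread keeps rescoring the pool while the request handler hands out the currently best example, so a worker's request never waits for retraining to finish.

```python
import threading
import time
import random

pool = {i: random.random() for i in range(100)}   # example id -> informativeness
lock = threading.Lock()
stop = threading.Event()

def rescorer():
    """Background loop: 'retrain' and refresh informativeness scores."""
    while not stop.is_set():
        time.sleep(0.05)                          # stands in for model retraining
        with lock:
            for ex in pool:
                pool[ex] = random.random()        # rescoring with the new model

def next_example():
    """Called on a worker's request: hand out the currently best example."""
    with lock:
        best = max(pool, key=pool.get)
        del pool[best]
        return best

threading.Thread(target=rescorer, daemon=True).start()
served = [next_example() for _ in range(5)]
stop.set()
print("served examples:", served)
```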

35 2 Adaptive voting and fragment recovery MTurk labels often have a high error rate. [sent-114, score-0.446]

36 A common strategy for improving label quality is to acquire multiple labels by different workers for each example and then consolidate the annotations into a single label of higher quality. [sent-115, score-0.677]
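
A minimal sketch of this consolidation step by simple majority vote; the worker answers below are hypothetical:

```python
from collections import Counter

worker_labels = {                     # example id -> labels from different workers
    "doc1": ["pos", "pos", "neg"],
    "doc2": ["neg", "neg", "neg"],
    "doc3": ["pos", "neg", "pos", "pos"],
}

def consolidate(labels):
    """Return the most frequent label among the workers' answers."""
    return Counter(labels).most_common(1)[0][0]

consolidated = {doc: consolidate(labels) for doc, labels in worker_labels.items()}
print(consolidated)                   # {'doc1': 'pos', 'doc2': 'neg', 'doc3': 'pos'}
```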

37 , +f = using fragments; sentiment budget 1130 for run 1, sentiment budget 1756 averaged over 2 runs. [sent-119, score-0.66]

38 voting and is adaptive in the number of repeated annotations. [sent-120, score-0.345]

39 Then majority voting is performed for each token individually. [sent-122, score-0.319]
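
A hedged sketch of what per-token adaptive voting could look like: labels for a token are collected until a fixed number of workers agree or a cap is reached (the paper uses at most d = 5 repetitions; the agreement rule of 2 below is an illustrative assumption):

```python
from collections import Counter

def adaptive_vote(label_stream, max_labels=5, agree=2):
    """Collect labels for one token until `agree` workers match or the cap is hit."""
    seen = []
    for label in label_stream:
        seen.append(label)
        top_label, top_count = Counter(seen).most_common(1)[0]
        if top_count >= agree or len(seen) >= max_labels:
            return top_label, len(seen)
    return Counter(seen).most_common(1)[0][0], len(seen)

labels_for_token = iter(["O", "B-PER", "B-PER", "O", "B-PER"])  # hypothetical answers
final, used = adaptive_vote(labels_for_token)
print(final, used)   # stops after 3 labels, as soon as two workers agree on B-PER
```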

40 1 Experiments In our NER experiments, we have workers reannotate the English corpus of the CoNLL-2003 NER shared task. [sent-136, score-0.304]

41 We chose this corpus to be able to compare crowdsourced annotations with the gold standard. (Footnote: It can take a while in this scheme for annotators to agree on a final annotation for a sentence.) [sent-137, score-0.474]

42 We make tentative labels of a sentence available to the classifier immediately and replace them with the final labels once voting is completed. [sent-138, score-0.473]

43 The sentiment detection task was modeled after a well-known document analysis setup for sentiment classification, introduced by Pang et al. [sent-146, score-0.526]

44 We use their corpus of 1000 positive and 1000 negative movie reviews and the Stanford maximum entropy classifier (Manning and Klein, 2003) to predict the sentiment label of each document d from a unigram representation of d. [sent-148, score-0.367]
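
A toy version of this setup, with scikit-learn's logistic regression standing in for the Stanford maximum entropy classifier and made-up documents instead of the movie-review corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["a wonderful, moving film", "dull plot and terrible acting",
        "great performances throughout", "boring and predictable"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()                       # unigram (bag-of-words) features
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

print(clf.predict(vec.transform(["a wonderful plot and great acting"])))
```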

45 We compare random sampling (RS) and AL in combination with the proposed voting and fragment strategies with different parameters. [sent-152, score-0.48]

46 We chose voting with at most d = 5 repetitions as our main reannotation strategy for both random and active sampling for NER annotation. [sent-156, score-0.591]

47 We always compare two strategies for the same annotation budget. [sent-168, score-0.281]

48 For example, the number of training sentences in Table 1 differs in the two relevant columns, but all strategies compared use exactly the same annotation budget (5820, 6931, 1130, and 1756, respectively). [sent-169, score-0.348]

49 2000 sentences or 450 documents would not have been meaningful; therefore we chose to run an extra experiment with the single annotation strategy to match this up with the budgets of the voting strategies. [sent-172, score-0.588]

50 2 Results For sentiment detection, worker accuracy or label quality (the percentage of correctly annotated documents) is 74. [sent-175, score-0.637]

51 In contrast, for NER, worker accuracy (the percentage of non-O tokens annotated correctly) is only 51. [sent-177, score-0.297]
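
For concreteness, a small sketch of this NER accuracy measure, which scores a worker only on tokens whose gold label is not "O":

```python
gold   = ["O", "B-PER", "I-PER", "O", "B-LOC"]   # gold labels for one sentence
worker = ["O", "B-PER", "O",     "O", "B-LOC"]   # one worker's answers

non_o = [(g, w) for g, w in zip(gold, worker) if g != "O"]
accuracy = sum(g == w for g, w in non_o) / len(non_o)
print(accuracy)   # 2 of 3 non-O tokens correct -> 0.667
```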

52 Adaptive voting and fragment recovery manage to recover a small part of the lost performance (lines 2–4); each of the three F1 scores is significantly better than the one above it, as indicated by † (Approximate Randomization Test (Noreen, 1989; Chinchor et al.

53 Adaptive voting and fragment recovery again increase worker accuracy (lines 6–8) although total improvement of 3. [sent-191, score-0.553]

54 We carried out two runs of the same experiment for sentiment to validate our first positive result, since the difference between the two conditions is not as large as in NER (Figure 1, top right). [sent-199, score-0.291]

55 It is likely that some of them can be learned well through random sampling at first; however, active learning can gain accuracy over time because it selects examples with more difficult clues. [sent-206, score-0.34]

56 In Figure 1 (bottom), we compare single annotation with adaptive voting. [sent-207, score-0.321]

57 Adaptive voting trades quantity of sampled sentences for quality of labels and thus incurs higher net costs per sentence. [sent-209, score-0.396]

58 For NER (Figure 1, bottom left), the single annotation strategy has a faster start; so for small budgets, covering a somewhat larger portion of the sample space is beneficial. [sent-211, score-0.286]

59 For sentiment (Figure 1, bottom right), results are similar: voting has no benefit initially, but as finding maximally informative examples to annotate becomes harder in later stages of learning, adaptive voting gains an advantage over single annotations. [sent-217, score-0.954]

60 6% accuracy for sentiment (averaged over two runs at budget 1756). [sent-219, score-0.358]

61 3 Annotation time per token Most AL work assumes constant cost per annotation unit. [sent-222, score-0.34]

62 In annotation with MTurk, cost is not a function of annotation time because workers are paid a fixed amount per HIT. [sent-226, score-0.78]

63 Nevertheless, annotation time plays a part in whether workers are willing to work on a given task for the offered reward. [sent-227, score-0.489]

64 This is particularly problematic for NER since workers have to examine each token individually. [sent-228, score-0.318]

65 We therefore investigate for NER whether the time MTurk workers spend on annotating sentences differs for random vs. [sent-229, score-0.366]

66 We first compute median and mean annotation times and the number of tokens per sentence for the random and AL strategies. [sent-231, score-0.326]

67 ... as well as sentences with slightly more uppercase tokens that require annotation. [sent-238, score-0.376]

68 (ii) The noisy labels result in bad intermediate models that then select suboptimal examples to be annotated next. [sent-257, score-0.271]

69 First, we preserve the sequence of sentences chosen by our AL experiments on MTurk, with 5-voting for NER and 4-voting for sentiment, but replace the noisy worker-provided labels by gold labels. [sent-260, score-0.476]

70 The performance of classifiers trained on this sequence is the dashed line “MTurk selection, gold labels” in Figure 3 for NER (left) and sentiment (right). [sent-261, score-0.319]

71 Here, the selection too is controlled by gold labels, so the selection has a noiseless classifier available for scoring and can perform optimal uncertainty selection. [sent-263, score-0.282]

72 For a fair comparison, we adjust the batch size to be equal to the average staleness of a selected example in concurrent MTurk active learning. [sent-269, score-0.317]

73 For our concurrent NER system, the average staleness of an example was about 12 (min: 1, max: 40), for sentiment it was about 2. [sent-272, score-0.392]
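
One plausible way to bookkeep staleness (an assumption, not necessarily the paper's exact procedure) is to count, for each served example, how many labels arrived after the model that scored it was last retrained:

```python
# event log: model retrainings, incoming labels, and examples served to workers
events = [
    (0, "retrain"), (1, "label"), (2, "label"), (3, "serve"),
    (4, "label"), (5, "serve"), (6, "retrain"), (7, "serve"),
]

labels_since_retrain = 0
staleness = []
for _, kind in events:
    if kind == "retrain":
        labels_since_retrain = 0
    elif kind == "label":
        labels_since_retrain += 1
    elif kind == "serve":                     # staleness of the served example
        staleness.append(labels_since_retrain)

print("average staleness:", sum(staleness) / len(staleness))   # (2 + 3 + 0) / 3
```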

74 (2010) because there are more annotators accessing our system at the same time via MTurk but not as high for sentiment since documents are longer and retraining the sentiment classifier is faster. [sent-274, score-0.715]

75 We cannot compare on cost here since we do not know what the per-sentence cost of a “gold” expert annotation is. [sent-280, score-0.398]

76 We attribute this to the fact that the quality of the labels is higher in sentiment than in NER. [sent-291, score-0.415]

77 Our initial experiments on sentiment were all negative (showing no improvement of AL compared to random) because label quality was too low. [sent-292, score-0.382]

78 5 Worker Quality So far we have assumed that all workers provide annotations of the same quality. [sent-296, score-0.368]

79 Figure 4 shows plots of worker accuracy as a function of worker productivity (number of annotated examples). [sent-298, score-0.429]

80 Some workers submit only one or two HITs just to try out the task. [sent-299, score-0.319]

81 For NER, the majority of workers submit between 5 and 10 sentences, with label qualities between 0. [sent-300, score-0.404]

82 For sentiment, most workers submit 1 to 5 documents, with label qualities between 0. [sent-305, score-0.378]

83 While quality for highly productive workers is mediocre in our experiments, other researchers have found extremely bad quality for their most prolific workers (Callison-Burch, 2009). [sent-309, score-0.684]

84 Some of these workers might be spammers who try to submit answers with automatic scripts. [sent-310, score-0.349]

85 We encountered some spammers that our heuristics did not detect (shown in the bottom-right areas of Figure 4, left), but the voting mechanism was able to mitigate their negative influence. [sent-311, score-0.274]

86 Given the large variation in Figure 4, using worker quality in crowdsourcing for improved training set creation seems promising. [sent-312, score-0.488]

87 1 Blocking low-quality workers A simple approach is to refuse annotations from workers that have been determined to provide low quality answers. [sent-315, score-0.697]
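
A simple sketch of such a blocking rule; the cutoff, the minimum-evidence threshold, and the bookkeeping structure are illustrative assumptions:

```python
QUALITY_CUTOFF = 0.5          # illustrative cutoff on estimated worker accuracy

worker_history = {            # worker id -> (answers judged correct, total answers)
    "w1": (40, 50),
    "w2": (8, 30),
}

def accepts_work(worker_id, min_answers=10):
    """Refuse further HITs from workers whose estimated quality is too low."""
    correct, total = worker_history.get(worker_id, (0, 0))
    if total < min_answers:
        return True           # not enough evidence yet, let the worker continue
    return correct / total >= QUALITY_CUTOFF

print(accepts_work("w1"), accepts_work("w2"))   # True False
```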

88 While the voting strategy prevented a performance decrease with bad annotations, it needed to expend many extra annotations for correction. [sent-320, score-0.408]

89 When low-quality workers are less active, as in the AL dataset, we find no meaningful performance increase for low cutoffs up to 0. [sent-322, score-0.314]

90 2 Trusting high-quality workers The complementary approach is to take annotations from highly rated workers at face value and immediately accept them as the correct label, bypassing the voting procedure. [sent-328, score-0.941]
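
A corresponding sketch of the bypass rule: labels from workers whose estimated quality exceeds a trust threshold are accepted immediately, while all others still enter the voting queue (threshold and data structures are assumed for illustration):

```python
TRUST_THRESHOLD = 0.9         # illustrative trust threshold on worker quality

def handle_label(worker_id, example_id, label, worker_quality, vote_queue):
    """Accept trusted workers' labels directly; queue the rest for voting."""
    if worker_quality.get(worker_id, 0.0) >= TRUST_THRESHOLD:
        return ("final", label)                       # bypass repeated annotation
    vote_queue.setdefault(example_id, []).append(label)
    return ("pending", vote_queue[example_id])        # wait for more votes

votes = {}
print(handle_label("w1", "s17", "B-LOC", {"w1": 0.95, "w2": 0.6}, votes))
print(handle_label("w2", "s17", "O", {"w1": 0.95, "w2": 0.6}, votes))
```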

91 Bypassing saves the cost of repeated annotation of the same sentence. [sent-329, score-0.291]

92 Figure 5 shows learning curves for two bypass thresholds on worker quality (measured as proportion of correct non-O tokens) for random (c) and AL (d). [sent-330, score-0.356]

93 While our method of sample selection for AL proved to be quite robust even in the presence of noise, higher quality labels do have an influence on the sample selection (see section 4. [sent-342, score-0.286]

94 6 Conclusion We have investigated the use of AL in a real-life annotation experiment with human annotators instead of traditional simulations with gold labels. (Figure 5 caption: Blocking low-quality workers: (a) random, (b) AL.) [sent-352, score-0.485]

95 The annotation was performed using MTurk in an AL framework that features concurrent example selection without wait times. [sent-355, score-0.421]

96 We also evaluated two strategies, adaptive voting and fragment recovery, to improve label quality at low additional cost. [sent-356, score-0.519]

97 This is clear evidence that active learning and crowdsourcing are complementary methods for lowering annotation cost and should be used together in training set creation for natural language processing tasks. [sent-362, score-0.708]

98 We have also conducted oracle experiments that show that further performance gains and cost savings can be achieved by using information about worker quality. [sent-363, score-0.296]

99 Using crowdsourcing and active learning to track sentiment in online media. [sent-374, score-0.68]

100 Investigating the effects of selective sampling on the annotation task. [sent-409, score-0.273]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mturk', 0.342), ('ner', 0.299), ('al', 0.296), ('workers', 0.269), ('sentiment', 0.263), ('voting', 0.244), ('crowdsourcing', 0.229), ('annotation', 0.22), ('worker', 0.199), ('active', 0.188), ('uppercase', 0.121), ('frontend', 0.104), ('adaptive', 0.101), ('annotations', 0.099), ('labels', 0.092), ('haertel', 0.086), ('wait', 0.074), ('annotators', 0.072), ('cost', 0.071), ('pool', 0.069), ('backend', 0.069), ('donmez', 0.069), ('staleness', 0.069), ('tokens', 0.067), ('budget', 0.067), ('random', 0.067), ('selection', 0.067), ('noisy', 0.065), ('strategies', 0.061), ('quality', 0.06), ('budgets', 0.06), ('bypassing', 0.06), ('concurrent', 0.06), ('label', 0.059), ('simulated', 0.058), ('gold', 0.056), ('recovery', 0.055), ('fragment', 0.055), ('mechanical', 0.055), ('sampling', 0.053), ('ringger', 0.052), ('submit', 0.05), ('token', 0.049), ('uncertainty', 0.047), ('amazon', 0.047), ('retraining', 0.047), ('batch', 0.047), ('classifier', 0.045), ('simulations', 0.045), ('informativeness', 0.045), ('cutoffs', 0.045), ('rehbein', 0.045), ('named', 0.044), ('lewis', 0.044), ('entity', 0.044), ('informative', 0.043), ('hit', 0.042), ('hachey', 0.04), ('strategy', 0.039), ('brew', 0.037), ('expert', 0.036), ('turk', 0.036), ('agreeing', 0.035), ('blocking', 0.035), ('lawson', 0.035), ('pinar', 0.035), ('reannotate', 0.035), ('robbie', 0.035), ('scheible', 0.035), ('schein', 0.035), ('secs', 0.035), ('tomanek', 0.035), ('voyer', 0.035), ('noise', 0.035), ('ive', 0.033), ('laws', 0.033), ('lines', 0.032), ('examples', 0.032), ('interface', 0.032), ('annotated', 0.031), ('queue', 0.03), ('baldridge', 0.03), ('annotating', 0.03), ('bypass', 0.03), ('retrained', 0.03), ('ipeirotis', 0.03), ('questioned', 0.03), ('spammers', 0.03), ('runs', 0.028), ('bottom', 0.027), ('alexis', 0.027), ('carpenter', 0.027), ('crowdsourced', 0.027), ('queried', 0.027), ('reduces', 0.026), ('oracle', 0.026), ('bad', 0.026), ('majority', 0.026), ('documents', 0.025), ('select', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

Author: Florian Laws ; Christian Scheible ; Hinrich Schutze

Abstract: Supervised classification needs large amounts of annotated training data that is expensive to create. Two approaches that reduce the cost of annotation are active learning and crowdsourcing. However, these two approaches have not been combined successfully to date. We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario.

2 0.18384168 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora

Author: Matteo Negri ; Luisa Bentivogli ; Yashar Mehdad ; Danilo Giampiccolo ; Alessandro Marchetti

Abstract: We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need of large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.

3 0.16918598 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

Author: Burr Settles

Abstract: This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels. Human annotators in user studies were able to produce near-stateof-the-art classifiers—on several corpora in a variety of application domains—with only a few minutes of effort.

4 0.16030636 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis

Author: Ainur Yessenalina ; Claire Cardie

Abstract: We present a general learning-based approach for phrase-level sentiment analysis that adopts an ordinal sentiment scale and is explicitly compositional in nature. Thus, we can model the compositional effects required for accurate assignment of phrase-level sentiment. For example, combining an adverb (e.g., “very”) with a positive polar adjective (e.g., “good”) produces a phrase (“very good”) with increased polarity over the adjective alone. Inspired by recent work on distributional approaches to compositionality, we model each word as a matrix and combine words using iterated matrix multiplication, which allows for the modeling of both additive and multiplicative semantic effects. Although the multiplication-based matrix-space framework has been shown to be a theoretically elegant way to model composition (Rudolph and Giesbrecht, 2010), training such models has to be done carefully: the optimization is nonconvex and requires a good initial starting point. This paper presents the first such algorithm for learning a matrix-space model for semantic composition. In the context of the phrase-level sentiment analysis task, our experimental results show statistically significant improvements in performance over a bagof-words model.

5 0.15638943 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

Author: Richard Socher ; Jeffrey Pennington ; Eric H. Huang ; Andrew Y. Ng ; Christopher D. Manning

Abstract: We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model’s ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

6 0.13997267 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification

7 0.13902332 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

8 0.13347371 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

9 0.12118008 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

10 0.10540193 41 emnlp-2011-Discriminating Gender on Twitter

11 0.080482766 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

12 0.078936078 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

13 0.0775778 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms

14 0.075945392 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

15 0.075363614 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

16 0.059451479 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week

17 0.057562198 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

18 0.05625641 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

19 0.053048141 71 emnlp-2011-Identifying and Following Expert Investors in Stock Microblogs

20 0.052235518 141 emnlp-2011-Unsupervised Dependency Parsing without Gold Part-of-Speech Tags


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, -0.217), (2, 0.134), (3, 0.064), (4, 0.221), (5, 0.024), (6, 0.042), (7, -0.023), (8, -0.064), (9, 0.098), (10, 0.04), (11, -0.048), (12, -0.05), (13, 0.086), (14, -0.022), (15, 0.024), (16, 0.181), (17, -0.157), (18, -0.188), (19, 0.123), (20, 0.046), (21, 0.196), (22, 0.067), (23, -0.077), (24, -0.158), (25, -0.098), (26, -0.139), (27, 0.161), (28, 0.112), (29, -0.12), (30, 0.086), (31, 0.086), (32, 0.095), (33, 0.039), (34, 0.127), (35, 0.075), (36, 0.008), (37, -0.066), (38, -0.029), (39, -0.002), (40, -0.134), (41, -0.06), (42, -0.094), (43, 0.018), (44, -0.057), (45, -0.01), (46, 0.003), (47, 0.009), (48, -0.003), (49, -0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96633321 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

Author: Florian Laws ; Christian Scheible ; Hinrich Schutze

Abstract: Supervised classification needs large amounts of annotated training data that is expensive to create. Two approaches that reduce the cost of annotation are active learning and crowdsourcing. However, these two approaches have not been combined successfully to date. We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario.

2 0.69780844 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora

Author: Matteo Negri ; Luisa Bentivogli ; Yashar Mehdad ; Danilo Giampiccolo ; Alessandro Marchetti

Abstract: We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need of large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.

3 0.63732564 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources

Author: Keith Vertanen ; Per Ola Kristensson

Abstract: Augmented and alternative communication (AAC) devices enable users with certain communication disabilities to participate in everyday conversations. Such devices often rely on statistical language models to improve text entry by offering word predictions. These predictions can be improved if the language model is trained on data that closely reflects the style of the users’ intended communications. Unfortunately, there is no large dataset consisting of genuine AAC messages. In this paper we demonstrate how we can crowdsource the creation of a large set of fictional AAC messages. We show that these messages model conversational AAC better than the currently used datasets based on telephone conversations or newswire text. We leverage our crowdsourced messages to intelligently select sentences from much larger sets of Twitter, blog and Usenet data. Compared to a model trained only on telephone transcripts, our best performing model reduced perplexity on three test sets of AAC-like communications by 60–82% relative. This translated to a potential keystroke savings in a predictive keyboard interface of 5–11%.

4 0.49744704 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

Author: Burr Settles

Abstract: This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels. Human annotators in user studies were able to produce near-stateof-the-art classifiers—on several corpora in a variety of application domains—with only a few minutes of effort.

5 0.47023326 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

Author: Richard Socher ; Jeffrey Pennington ; Eric H. Huang ; Andrew Y. Ng ; Christopher D. Manning

Abstract: We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model’s ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

6 0.45391488 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

7 0.42560789 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis

8 0.41475981 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

9 0.38482147 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

10 0.37977913 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification

11 0.33309886 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms

12 0.31667405 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

13 0.30972379 41 emnlp-2011-Discriminating Gender on Twitter

14 0.22469746 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

15 0.2192378 96 emnlp-2011-Multilayer Sequence Labeling

16 0.21783306 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

17 0.21556534 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

18 0.21297054 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

19 0.20672032 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

20 0.19974899 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(15, 0.013), (23, 0.165), (32, 0.281), (36, 0.025), (37, 0.049), (45, 0.072), (53, 0.014), (54, 0.027), (57, 0.015), (62, 0.014), (64, 0.012), (66, 0.029), (69, 0.021), (79, 0.039), (82, 0.03), (87, 0.014), (90, 0.016), (96, 0.053), (98, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90002769 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation

Author: Anna Kazantseva ; Stan Szpakowicz

Abstract: This paper presents a new algorithm for linear text segmentation. It is an adaptation of Affinity Propagation, a state-of-the-art clustering algorithm in the framework of factor graphs. Affinity Propagation for Segmentation, or APS, receives a set of pairwise similarities between data points and produces segment boundaries and segment centres data points which best describe all other data points within the segment. APS iteratively passes messages in a cyclic factor graph, until convergence. Each iteration works with information on all available similarities, resulting in highquality results. APS scales linearly for realistic segmentation tasks. We derive the algorithm from the original Affinity Propagation formu– lation, and evaluate its performance on topical text segmentation in comparison with two state-of-the art segmenters. The results suggest that APS performs on par with or outperforms these two very competitive baselines.

same-paper 2 0.78874326 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

Author: Florian Laws ; Christian Scheible ; Hinrich Schutze

Abstract: Supervised classification needs large amounts of annotated training data that is expensive to create. Two approaches that reduce the cost of annotation are active learning and crowdsourcing. However, these two approaches have not been combined successfully to date. We evaluate the utility of active learning in crowdsourcing on two tasks, named entity recognition and sentiment detection, and show that active learning outperforms random selection of annotation examples in a noisy crowdsourcing scenario.

3 0.58431381 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

4 0.57611525 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

Author: Keith Hall ; Ryan McDonald ; Jason Katz-Brown ; Michael Ringgaard

Abstract: We present an online learning algorithm for training parsers which allows for the inclusion of multiple objective functions. The primary example is the extension of a standard supervised parsing objective function with additional loss-functions, either based on intrinsic parsing quality or task-specific extrinsic measures of quality. Our empirical results show how this approach performs for two dependency parsing algorithms (graph-based and transition-based parsing) and how it achieves increased performance on multiple target tasks including reordering for machine translation and parser adaptation.

5 0.57587349 136 emnlp-2011-Training a Parser for Machine Translation Reordering

Author: Jason Katz-Brown ; Slav Petrov ; Ryan McDonald ; Franz Och ; David Talbot ; Hiroshi Ichikawa ; Masakazu Seno ; Hideto Kazawa

Abstract: We propose a simple training regime that can improve the extrinsic performance of a parser, given only a corpus of sentences and a way to automatically evaluate the extrinsic quality of a candidate parse. We apply our method to train parsers that excel when used as part of a reordering component in a statistical machine translation system. We use a corpus of weakly-labeled reference reorderings to guide parser training. Our best parsers contribute significant improvements in subjective translation quality while their intrinsic attachment scores typically regress.

6 0.57410836 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

7 0.57022744 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

8 0.56914103 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

9 0.56807685 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

10 0.56747234 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

11 0.56717068 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

12 0.56631821 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction

13 0.56625825 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

14 0.56472903 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

15 0.56343442 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

16 0.56290901 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

17 0.56220686 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

18 0.5597797 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

19 0.5592801 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

20 0.55837137 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction