acl acl2013 acl2013-300 knowledge-graph by maker-knowledge-mining

300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning


Source: pdf

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus the annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Reducing Annotation Effort for Quality Estimation via Active Learning Daniel Beck, Lucia Specia and Trevor Cohn, Department of Computer Science, University of Sheffield, Sheffield, United Kingdom [sent-1, score-0.286]

2 Abstract Quality estimation models provide feedback on the quality of machine translated texts. [sent-2, score-0.257]

3 They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. [sent-3, score-0.073]

4 We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. [sent-4, score-0.453]

5 Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. [sent-5, score-0.315]

6 In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors. [sent-6, score-0.767]

7 1 Introduction The purpose of machine translation (MT) quality estimation (QE) is to provide a quality prediction for new, unseen machine translated texts, without relying on reference translations (Blatz et al., 2004). [sent-8, score-0.467]

8 This task is usually addressed with machine learning models trained on datasets composed of source sentences, their machine translations, and a quality label assigned by humans. [sent-12, score-0.335]

9 A common use of quality predictions is the decision between post-editing a given machine translated sentence and translating its source from scratch, based on whether its post-editing effort is estimated to be lower than the effort of translating the source sentence. [sent-13, score-0.428]

10 Since quality scores for the training of QE models are given by human experts, the annotation process is costly and subject to inconsistencies due to the subjectivity of the task. [sent-14, score-0.356]

11 To avoid inconsistencies because of disagreements among annotators, it is often recommended that a QE model is trained [sent-15, score-0.048]

12 for each translator, based on labels given by such a translator (Specia, 2011). [sent-17, score-0.088]

13 This further increases the annotation costs because different datasets are needed for different tasks. [sent-18, score-0.215]

14 Therefore, strategies to reduce the demand for annotated data are needed. [sent-19, score-0.125]

15 Such strategies can also bring the possibility of selecting data that is less prone to inconsistent annotations, resulting in more robust and accurate predictions. [sent-20, score-0.188]

16 In this paper we investigate Active Learning (AL) techniques to reduce the size of the dataset while keeping the performance of the resulting QE models. [sent-21, score-0.126]

17 AL provides methods to select informative data points from a large pool which, if labelled, can potentially improve the performance of a machine learning algorithm (Settles, 2010). [sent-22, score-0.147]

18 The rationale behind these methods is to help the learning algorithm achieve satisfactory results from only a subset of the available data, thus incurring less annotation effort. [sent-23, score-0.117]

19 2 Related Work Most research work on QE for machine translation is focused on feature engineering and feature selection, with some recent work on devising more reliable and less subjective quality labels. [sent-24, score-0.225]

20 Quirk (2004) showed that small datasets manually annotated by humans for quality can result in models that outperform those trained on much larger, automatically labelled sets. [sent-29, score-0.257]

21 Since quality labels are subjective to the annotators’ judgements, Specia and Farzindar (2010) evaluated the performance of QE models using HTER (Snover et al., 2006). [sent-30, score-0.199]

22 Specia (2011) compared the performance of models based on labels for [sent-34, score-0.046]

23 For an overview on various feature sets and machine learning algorithms, we refer the reader to a recent shared task on the topic (Callison-Burch et al. [sent-38, score-0.073]

24 3.1 Datasets We perform experiments using four MT datasets manually annotated for quality: English-Spanish (en-es): 2,254 sentences translated by Moses (Koehn et al., 2007). [sent-44, score-0.225]

25 Effort scores range from 1 (too bad to be post-edited) to 5 (no post-editing needed). [sent-47, score-0.037]

26 Three expert post-editors evaluated each sentence and the final score was obtained as a weighted average of the three scores. [sent-48, score-0.045]

27 We use the default split given in the shared task: 1,832 sentences for training and 432 for test. [sent-49, score-0.087]

28 French-English (fr-en): 2,525 sentences translated by Moses as provided in Specia (2011), annotated by a single translator. [sent-50, score-0.109]

29 Human labels indicate post-editing effort ranging from 1 (too bad to be post-edited) to 4 (little or no post-editing needed). [sent-51, score-0.137]

30 We use a random split of 90% sentences for training and 10% for test. [sent-52, score-0.122]

31 Arabic-English (ar-en): 2,585 sentences translated by two state-of-the-art SMT systems (denoted ar-en-1 and ar-en-2), as provided in (Specia et al. [sent-53, score-0.109]

32 A random split of 90% sentences for training and 10% for test is used. [sent-55, score-0.122]

33 Human labels indicate the adequacy of the translation, ranging from 1 (completely inadequate) to 4 (adequate). [sent-56, score-0.11]

34 3.2 Query Methods The core of an AL setting is how the learner will gather new instances to add to its training data. [sent-59, score-0.258]

35 In our setting, we use a pool-based strategy, where the learner queries an instance pool and selects the best instance according to an informativeness measure. [sent-60, score-0.294]

36 The learner then asks an “oracle” (in this case, the human expert) for the true label of the instance and adds it to the training data. [sent-61, score-0.161]
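
The query-label-add loop just described can be written down compactly. The sketch below is only an illustration of pool-based active learning, not the authors' code; `query_score` and `label_oracle` are hypothetical placeholders for the query methods and the (simulated) human annotator discussed in this section.

```python
import numpy as np

def active_learning_loop(X_pool, seed_X, seed_y, query_score, label_oracle, n_iterations):
    """Generic pool-based active learning loop (illustrative sketch).

    query_score(i, X_train): informativeness of pool instance i given the
        current training set (e.g. the US or ID scores discussed below).
    label_oracle(i): returns the label of pool instance i; in a simulation
        this simply reveals the already-annotated score.
    """
    X_train, y_train = list(seed_X), list(seed_y)
    pool_idx = list(range(len(X_pool)))
    for _ in range(n_iterations):
        scores = [query_score(i, X_train) for i in pool_idx]
        best = pool_idx[int(np.argmax(scores))]   # most informative instance
        y_train.append(label_oracle(best))        # ask the (simulated) annotator
        X_train.append(X_pool[best])
        pool_idx.remove(best)                     # one instance added per iteration
    return X_train, y_train
```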

37 Query methods use different criteria to predict how informative an instance is. [sent-62, score-0.06]

38 In the following, we denote by M(x) the query score of an instance x with respect to method M. [sent-64, score-0.196]

39 The β parameter controls the relative importance of the density term. [sent-66, score-0.058]

40 In our experiments, we set it to 1, giving equal weights to variance and density. [sent-67, score-0.092]

41 The U term is the number of instances in the query pool. [sent-68, score-0.353]

42 With each method, we choose the instance that maximises its respective equation. [sent-70, score-0.093]
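
The extract does not reproduce the US and ID equations themselves, so the sketch below follows the standard information-density formulation of Settles (2010), with the prediction variance as the uncertainty term; β = 1 recovers the equal weighting of variance and density mentioned above. It is an assumption-laden illustration, not the paper's exact equations.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def us_score(variance):
    """Uncertainty sampling (US): the prediction variance is the query score."""
    return variance

def id_score(x, variance, query_pool, beta=1.0):
    """Information density (ID), in the Settles (2010) formulation: weight the
    uncertainty of x by its average similarity to the U instances in the
    query pool, raised to the power beta."""
    density = cosine_similarity(x.reshape(1, -1), query_pool).mean()  # (1/U) * sum of similarities
    return variance * (density ** beta)
```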

43 For each dataset and each query method, we performed 20 active learning simulation experiments and averaged the results. [sent-78, score-0.431]

44 Each simulation started with 50 randomly selected sentences from the training set and used all the remaining training sentences as our query pool, adding one new sentence to the training set at each iteration. [sent-81, score-0.412]

45 Results were evaluated by measuring Mean Absolute Error (MAE) scores on the test set. [sent-82, score-0.037]

46 We also performed an “oracle” experiment: at each iteration, it selects the instance that minimises the MAE on the test set. [sent-83, score-0.134]

47 The oracle results give an upper bound in performance for each test set. [sent-84, score-0.224]
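
A sketch of the oracle strategy and the MAE measure it optimises; `retrain_and_predict` is a hypothetical helper that retrains the QE model with one extra candidate instance and returns its predictions on the test set.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, the evaluation measure used throughout."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def oracle_select(pool_idx, retrain_and_predict, y_test):
    """Greedy oracle: pick the pool instance whose addition to the training
    set minimises MAE on the test set (an upper bound on AL performance)."""
    return min(pool_idx, key=lambda i: mae(y_test, retrain_and_predict(i)))
```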

48 Since an SVR does not supply variance values for its predictions, we employ a technique known as query-by-bagging (Abe and Mamitsuka, 1998). [sent-85, score-0.092]

49 The idea is to build an ensemble of N SVRs trained on sub-samples of the training data. [sent-86, score-0.1]

50 When selecting a new query, the ensemble is able to return N predictions for each instance, from which a variance value can be inferred. [sent-87, score-0.186]

51 We used 20 SVRs as our ensemble and 20 as the size of each training sub-sample. [sent-88, score-0.1]

52 The variance values are then used as-is in the case of the US strategy and combined with query densities in the case of the ID strategy. [sent-89, score-0.288]
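
A minimal sketch of query-by-bagging with an SVR ensemble, assuming scikit-learn; the ensemble size and sub-sample size of 20 follow the setting above, and the per-instance variance it returns is the quantity plugged into the US and ID scores.

```python
import numpy as np
from sklearn.svm import SVR

def bagged_variance(X_train, y_train, X_pool, n_models=20, subsample=20, seed=0):
    """Query-by-bagging (Abe and Mamitsuka, 1998): train an ensemble of SVRs
    on random sub-samples of the training data and use the variance of their
    predictions on the query pool as a proxy for prediction uncertainty."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(X_train), size=min(subsample, len(X_train)), replace=True)
        preds.append(SVR().fit(X_train[idx], y_train[idx]).predict(X_pool))
    return np.var(np.stack(preds), axis=0)  # one variance value per pool instance
```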

53 4 Results and Discussion Figure 1 shows the learning curves for all query methods and all datasets. [sent-90, score-0.442]

54 The “random” curves are our baseline since they are equivalent to passive learning (with various numbers of instances). [sent-91, score-0.289]

55 We first evaluated our methods in terms of how many instances they needed to achieve 99% of the MAE score on the full dataset. [sent-92, score-0.206]
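
The exact threshold definition is not given in the extract; the sketch below simply reads off the first point of a learning curve whose MAE comes within a chosen tolerance (here 1%) of the MAE obtained with the full training set.

```python
def instances_to_match_full(learning_curve, full_mae, tolerance=0.01):
    """learning_curve: (n_instances, mae) points in increasing order of
    n_instances. Returns the first training-set size whose MAE is within
    `tolerance` of the full-dataset MAE, or None if the curve never gets there."""
    for n_instances, mae in learning_curve:
        if mae <= full_mae * (1.0 + tolerance):
            return n_instances
    return None
```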

56 For three datasets, the AL methods significantly outperformed the random selection baseline, while no improvement was observed on the ar-en-1 dataset. [sent-93, score-0.073]

57 The learning curves in Figure 1 show an interesting behaviour for most AL methods: some of them were able to yield lower MAE scores than models trained on the full dataset. [sent-95, score-0.396]

58 This is particularly interesting in the fr-en case, where both methods were able to obtain better scores using only ∼25% of the available instances, with the US method resulting in a 0. [sent-96, score-0.037]

59 Figure 1: Learning curves for different query selection strategies in the four datasets. [sent-99, score-0.488]

60 The horizontal axis shows the number of instances in the training set and the vertical axis shows MAE scores. [sent-100, score-0.594]

61 The first column shows the number (proportion) of instances used to obtain the best MAE, the second column shows the MAE score obtained and the third column shows the MAE score for random instance selection at the same number of instances. [sent-113, score-0.356]

62 The last column shows the MAE obtained using the full dataset. [sent-114, score-0.075]

63 Best scores are shown in bold and are significantly better (paired t-test, p < 0.05). [sent-115, score-0.037]

64 Smaller datasets than those used currently can be sufficient for machine translation QE. [sent-117, score-0.219]

65 The best MAE scores achieved for each dataset are shown in Table 2. [sent-118, score-0.074]

66 The lower bounds in MAE given by the oracle curves show that AL methods can indeed improve the performance of QE models: an ideal query method would achieve a very large improvement in MAE using fewer than 200 instances in all datasets. [sent-120, score-0.789]

67 The fact that different datasets present similar oracle curves suggests that this is not specific to a particular dataset but actually a common behaviour in QE. [sent-121, score-0.702]

68 5 Further analysis on the oracle behaviour By analysing the oracle curves we can observe another interesting phenomenon: the rapid increase in error when reaching the last ∼200 instances of the training data. [sent-123, score-0.91]

69 A possible explanation (footnote: we took the average of the MAE scores obtained from the 20 runs with each query method) [sent-124, score-0.233]

70 for this behaviour is the existence of erroneous, inconsistent or contradictory labels in the datasets. [sent-125, score-0.199]

71 Quality annotation is a subjective task by nature, and it is thus subject to noise, e. [sent-126, score-0.096]

72 Our hypothesis is that these last sentences are the most difficult to annotate and therefore more prone to disagreements. [sent-129, score-0.166]

73 To investigate this phenomenon, we performed an additional experiment with the en-es dataset, the only dataset for which multiple annotations are available (from three judges). [sent-130, score-0.081]

74 We measure the Kappa agreement index (Cohen, 1960) between all pairs of judges in the subset containing the first 300 instances (the 50 initial random instances plus 250 instances chosen by the oracle). [sent-131, score-0.609]

75 We then measured Kappa in windows of 300 instances until the last instance of the training set is selected by the oracle method. [sent-132, score-0.562]
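
A sketch of the windowed agreement analysis, assuming scikit-learn's implementation of Cohen's kappa; whether the paper slides the 300-instance window by one instance or by a full window is not clear from the extract, so the step size is left as a parameter.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

def windowed_kappa(labels_by_judge, selection_order, window=300, step=300):
    """Average pairwise Cohen's kappa over successive windows of instances,
    taken in the order in which the oracle selected them.

    labels_by_judge: one array of labels per judge, indexed by instance id.
    selection_order: instance ids in oracle selection order."""
    labels = [np.asarray(l) for l in labels_by_judge]
    order = np.asarray(selection_order)
    kappas = []
    for start in range(0, len(order) - window + 1, step):
        ids = order[start:start + window]
        pairwise = [cohen_kappa_score(a[ids], b[ids]) for a, b in combinations(labels, 2)]
        kappas.append(float(np.mean(pairwise)))
    return kappas
```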

76 The idea of this experiment is to test whether sentences that are more difficult to annotate (because of their length or subjectivity, generating more disagreement between the judges) add noise to the dataset. [sent-134, score-0.126]

77 The resulting Kappa curves are shown in Figure 2: the agreement between judges is high for the initial set of sentences selected, tends to decrease until it reaches ∼1000 instances, and then starts to increase again. [sent-135, score-0.325]

78 Figure 2: Kappa curves for the en-es dataset. [sent-137, score-0.212]

79 The horizontal axis shows the number of instances and the vertical axis shows the kappa values. [sent-138, score-0.653]

80 Each point in the curves shows the kappa index for a window containing the last 300 sentences chosen by the oracle. [sent-139, score-0.435]

81 Contrary to our hypothesis, these results suggest that the most difficult sentences chosen by the oracle are those in the middle range instead of the last ones. [sent-140, score-0.346]

82 If we compare this trend against the oracle curve in Figure 1, we can see that those middle instances are the ones that do not change the performance of the oracle. [sent-141, score-0.381]

83 The resulting trends are interesting because they give evidence that sentences that are difficult to annotate do not contribute much to QE performance (although not hurting it either). [sent-142, score-0.153]

84 However, they do not confirm our hypothesis about the oracle behaviour. [sent-143, score-0.224]

85 Another possible source of disagreement is the feature set: the features may not be discriminative enough to distinguish among different instances, i.e. [sent-144, score-0.038]

86 instances with very similar features but different labels might be genuinely different, but the current features are not sufficient to indicate that. [sent-146, score-0.236]

87 In future work we plan to further investigate this hypothesis by using other feature sets and analysing their behaviour. [sent-147, score-0.097]

88 6 Conclusions and Future Work We have presented the first known experiments using active learning for the task of estimating machine translation quality. [sent-148, score-0.301]

89 The results are promising: we were able to reduce the number of instances needed to train the models in three of the four datasets. [sent-149, score-0.251]

90 In addition, in some of the datasets active learning yielded significantly better models using only a small subset of the training instances. [sent-150, score-0.356]

91 The horizontal axis shows the number of instances and the vertical axis shows the length values. [sent-152, score-0.552]

92 Each point in the curves shows the average length for a window containing the last 300 sentences chosen by the oracle. [sent-153, score-0.334]

93 The oracle results give evidence that it is possible to go beyond these encouraging results by employing better selection strategies in active learning. [sent-154, score-0.538]

94 In future work we will investigate more advanced query techniques that consider features other than variance and density of the data points. [sent-155, score-0.39]

95 We also plan to further investigate the behaviour of the oracle curves using not only different feature sets but also different quality scores such as HTER and post-editing time. [sent-156, score-0.737]

96 We believe that a better understanding of this behaviour can guide further developments not only for instance selection techniques but also for the design of better quality features and quality annotation schemes. [sent-157, score-0.475]

97 Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. [sent-183, score-0.21]

98 A literature survey of active machine learning in the context of natural language processing. [sent-197, score-0.237]

99 An analysis of active learning strategies for sequence labeling tasks. [sent-209, score-0.278]

100 A study of translation edit rate with targeted human annotation. [sent-217, score-0.064]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mae', 0.406), ('qe', 0.358), ('specia', 0.286), ('oracle', 0.224), ('curves', 0.212), ('query', 0.196), ('active', 0.164), ('instances', 0.157), ('lucia', 0.139), ('axis', 0.129), ('al', 0.124), ('datasets', 0.116), ('behaviour', 0.113), ('quality', 0.107), ('kappa', 0.101), ('variance', 0.092), ('effort', 0.091), ('blatz', 0.087), ('strategies', 0.08), ('matthieu', 0.075), ('svrs', 0.075), ('pool', 0.074), ('settles', 0.072), ('horizontal', 0.072), ('judges', 0.068), ('vertical', 0.065), ('mt', 0.065), ('translated', 0.064), ('translation', 0.064), ('lewis', 0.063), ('hter', 0.062), ('instance', 0.06), ('learner', 0.059), ('ensemble', 0.058), ('density', 0.058), ('beck', 0.058), ('pedregosa', 0.055), ('id', 0.055), ('svr', 0.053), ('analysing', 0.053), ('baldridge', 0.051), ('burr', 0.051), ('annotation', 0.05), ('abe', 0.049), ('needed', 0.049), ('inconsistencies', 0.048), ('estimation', 0.047), ('sheffield', 0.046), ('labels', 0.046), ('subjective', 0.046), ('reduce', 0.045), ('expert', 0.045), ('sentences', 0.045), ('investigate', 0.044), ('snover', 0.043), ('passive', 0.043), ('annotate', 0.043), ('translator', 0.042), ('last', 0.042), ('training', 0.042), ('selects', 0.041), ('gale', 0.041), ('moses', 0.041), ('costly', 0.04), ('inconsistent', 0.04), ('machine', 0.039), ('disagreement', 0.038), ('selection', 0.038), ('alexandre', 0.037), ('windows', 0.037), ('scores', 0.037), ('dataset', 0.037), ('prone', 0.036), ('predictions', 0.036), ('chris', 0.035), ('cohn', 0.035), ('random', 0.035), ('chosen', 0.035), ('labelled', 0.034), ('learning', 0.034), ('fredrik', 0.033), ('olsson', 0.033), ('pothesis', 0.033), ('maximises', 0.033), ('minimises', 0.033), ('humanannotated', 0.033), ('hajlaoui', 0.033), ('najeh', 0.033), ('incurring', 0.033), ('devising', 0.033), ('qtlaunchpad', 0.033), ('bertrand', 0.033), ('genuinely', 0.033), ('hurting', 0.033), ('column', 0.033), ('subjectivity', 0.032), ('bring', 0.032), ('koehn', 0.032), ('summit', 0.032), ('evidence', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus the annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.

2 0.29989696 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

Author: Trevor Cohn ; Lucia Specia

Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.

3 0.2817525 289 acl-2013-QuEst - A translation quality estimation framework

Author: Lucia Specia ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn

Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.

4 0.14112939 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

Author: Guillaume Wisniewski

Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decision based on a single rating of each example.

5 0.1372693 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

Author: Maria Skeppstedt

Abstract: For expanding a corpus of clinical text, annotated for named entities, a method that combines pre-tagging with a version of active learning is proposed. In order to facilitate annotation and to avoid bias, two alternative automatic pre-taggings are presented to the annotator, without revealing which of them is given a higher confidence by the pre-tagging system. The task of the annotator is to select the correct version among these two alternatives. To minimise the instances in which none of the presented pre-taggings is correct, the texts presented to the annotator are actively selected from a pool of unlabelled text, with the selection criterion that one of the presented pre-taggings should have a high probability of being correct, while still being useful for improving the result of an automatic classifier.

6 0.13114782 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

7 0.086774945 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

8 0.082523182 135 acl-2013-English-to-Russian MT evaluation campaign

9 0.08131659 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

10 0.078968391 273 acl-2013-Paraphrasing Adaptation for Web Search Ranking

11 0.078591846 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

12 0.077439941 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

13 0.075761512 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

14 0.073834442 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

15 0.070948139 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

16 0.070885547 328 acl-2013-Stacking for Statistical Machine Translation

17 0.070447266 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

18 0.067315117 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.

19 0.065506138 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

20 0.065082565 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.2), (1, -0.021), (2, 0.09), (3, -0.0), (4, 0.018), (5, -0.012), (6, 0.051), (7, -0.062), (8, 0.048), (9, 0.057), (10, -0.034), (11, 0.097), (12, -0.125), (13, 0.068), (14, -0.148), (15, -0.037), (16, -0.124), (17, 0.04), (18, 0.042), (19, -0.015), (20, 0.104), (21, 0.039), (22, -0.177), (23, 0.081), (24, -0.077), (25, -0.034), (26, -0.071), (27, 0.058), (28, -0.073), (29, 0.079), (30, -0.103), (31, -0.029), (32, 0.03), (33, 0.069), (34, -0.168), (35, -0.041), (36, 0.067), (37, 0.142), (38, -0.021), (39, 0.066), (40, -0.148), (41, -0.127), (42, 0.119), (43, -0.014), (44, 0.102), (45, -0.079), (46, -0.006), (47, 0.057), (48, -0.043), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93065548 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus the annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.

2 0.81952858 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

Author: Trevor Cohn ; Lucia Specia

Abstract: Annotating linguistic data is often a complex, time consuming and expensive endeavour. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretations of the task and levels of consistency. We present novel techniques for learning from the outputs of multiple annotators while accounting for annotator specific behaviour. These techniques use multi-task Gaussian Processes to learn jointly a series of annotator and metadata specific models, while explicitly representing correlations between models which can be learned directly from data. Our experiments on two machine translation quality estimation datasets show uniform significant accuracy gains from multi-task learning, and consistently outperform strong baselines.

3 0.79579943 289 acl-2013-QuEst - A translation quality estimation framework

Author: Lucia Specia ; Kashif Shah ; Jose G.C. de Souza ; Trevor Cohn

Abstract: We describe QUEST, an open source framework for machine translation quality estimation. The framework allows the extraction of several quality indicators from source segments, their translations, external resources (corpora, language models, topic models, etc.), as well as language tools (parsers, part-of-speech tags, etc.). It also provides machine learning algorithms to build quality estimation models. We benchmark the framework on a number of datasets and discuss the efficacy of features and algorithms.

4 0.68404502 263 acl-2013-On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation

Author: Guillaume Wisniewski

Abstract: This paper tackles the problem of collecting reliable human assessments. We show that knowing multiple scores for each example instead of a single score results in a more reliable estimation of a system quality. To reduce the cost of collecting these multiple ratings, we propose to use matrix completion techniques to predict some scores knowing only scores of other judges and some common ratings. Even if prediction performance is pretty low, decisions made using the predicted score proved to be more reliable than decision based on a single rating of each example.

5 0.65168774 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

Author: Maria Skeppstedt

Abstract: For expanding a corpus of clinical text, annotated for named entities, a method that combines pre-tagging with a version of active learning is proposed. In order to facilitate annotation and to avoid bias, two alternative automatic pre-taggings are presented to the annotator, without revealing which of them is given a higher confidence by the pre-tagging system. The task of the annotator is to select the correct version among these two alternatives. To minimise the instances in which none of the presented pre-taggings is correct, the texts presented to the annotator are actively selected from a pool of unlabelled text, with the selection criterion that one of the presented pre-taggings should have a high probability of being correct, while still being useful for improving the result of an automatic classifier.

6 0.58578938 135 acl-2013-English-to-Russian MT evaluation campaign

7 0.56666023 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

8 0.51265901 250 acl-2013-Models of Translation Competitions

9 0.51112378 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

10 0.46831509 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

11 0.46180519 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

12 0.4528034 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

13 0.44779047 322 acl-2013-Simple, readable sub-sentences

14 0.44356379 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

15 0.44039267 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

16 0.43676442 273 acl-2013-Paraphrasing Adaptation for Web Search Ranking

17 0.43276578 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

18 0.41211432 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

19 0.40667075 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

20 0.40467793 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (6, 0.463), (11, 0.038), (15, 0.01), (24, 0.024), (26, 0.042), (35, 0.065), (42, 0.049), (48, 0.028), (52, 0.015), (70, 0.027), (88, 0.018), (90, 0.059), (95, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91128844 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus the annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.

2 0.90430117 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

3 0.89350688 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

Author: Dehong Gao ; Wenjie Li ; Renxian Zhang

Abstract: The growth of the Web 2.0 technologies has led to an explosion of social networking media sites. Among them, Twitter is the most popular service by far due to its ease for realtime sharing of information. It collects millions of tweets per day and monitors what people are talking about in the trending topics updated timely. Then the question is how users can understand a topic in a short time when they are frustrated with the overwhelming and unorganized tweets. In this paper, this problem is approached by sequential summarization which aims to produce a sequential summary, i.e., a series of chronologically ordered short subsummaries that collectively provide a full story about topic development. Both the number and the content of sub-summaries are automatically identified by the proposed stream-based and semantic-based approaches. These approaches are evaluated in terms of sequence coverage, sequence novelty and sequence correlation and the effectiveness of their combination is demonstrated.

4 0.87933934 145 acl-2013-Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks

Author: Jose G.C. de Souza ; Miquel Espla-Gomis ; Marco Turchi ; Matteo Negri

Abstract: The use of automatic word alignment to capture sentence-level semantic relations is common to a number of cross-lingual NLP applications. Despite its proved usefulness, however, word alignment information is typically considered from a quantitative point of view (e.g. the number of alignments), disregarding qualitative aspects (the importance of aligned terms). In this paper we demonstrate that integrating qualitative information can bring significant performance improvements with negligible impact on system complexity. Focusing on the cross-lingual textual en- tailment task, we contribute with a novel method that: i) significantly outperforms the state of the art, and ii) is portable, with limited loss in performance, to language pairs where training data are not available.

5 0.83965898 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

6 0.81083298 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

7 0.79697794 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

8 0.57499111 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

9 0.56204683 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

10 0.56099147 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

11 0.54906887 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

12 0.5480938 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

13 0.54562104 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

14 0.54025805 333 acl-2013-Summarization Through Submodularity and Dispersion

15 0.53436446 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

16 0.52727669 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

17 0.52587515 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

18 0.5212962 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

19 0.51315874 176 acl-2013-Grounded Unsupervised Semantic Parsing

20 0.50563824 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization