emnlp emnlp2011 emnlp2011-82 knowledge-graph by maker-knowledge-mining

82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

Source: pdf

Author: Richard Farkas

Abstract: Information-oriented document labeling is a special document multi-labeling task where the target labels refer to a specific information instead of the topic of the whole document. These kind oftasks are usually solved by looking up indicator phrases and analyzing their local context to filter false positive matches. Here, we introduce an approach for machine learning local content shifters which detects irrelevant local contexts using just the original document-level training labels. We handle content shifters in general, instead of learning a particular language phenomenon detector (e.g. negation or hedging) and form a single system for document labeling and content shift detection. Our empirical results achieved 24% error reduction compared to supervised baseline methods – on three document label– ing tasks.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 These kind oftasks are usually solved by looking up indicator phrases and analyzing their local context to filter false positive matches. [sent-2, score-0.493]

2 Here, we introduce an approach for machine learning local content shifters which detects irrelevant local contexts using just the original document-level training labels. [sent-3, score-0.559]

3 We handle content shifters in general, instead of learning a particular language phenomenon detector (e. [sent-4, score-0.468]

4 negation or hedging) and form a single system for document labeling and content shift detection. [sent-6, score-0.706]

5 1 Introduction There are special document multi-labeling tasks where the target labels refer to a specific piece of information extractable from the document instead of the overall topic of the document. [sent-8, score-0.392]

6 In these kinds of tasks the target information is usually an attribute or relation related to the target entity (usually a person or an organisation) of the document in question, but the task is to assign class labels at the document (entity) level. [sent-9, score-0.428]

7 For example, the smoking habits of the patients are frequently discussed in the textual parts of clinical notes (Uzuner et al. [sent-10, score-0.351]

8 Similarly, the soccer club names where a sportsman played for are document(sportman)-level labels in Wikipedia articles expressed by the Wikipedia cat– egories. [sent-17, score-0.472]

9 The target information in these tasks is usually just mentioned in the document and much of the document is irrelevant for this information request in contrast to standard document classification tasks where the goal is to identify the topics of the whole document. [sent-18, score-0.54]

10 There are several application areas where information-oriented document labels are naturally present in an enormous amount like clinical records, Wikipedia categories and usergenerated tags of news. [sent-22, score-0.447]

11 , 2007; Uzuner, 2009) demonstrated that information-oriented document labeling can be effectively performed by looking up indicator phrases which can be gathered by hand, by corpus statistics or in a hybrid way. [sent-25, score-0.518]

12 However these campaigns also highlighted that the analysis of the local context of the indicator phrases is crucial. [sent-26, score-0.372]

13 For in- stance, in the smoking habit detection task there are a few indicator words (e. [sent-27, score-0.389]

14 We propose a simple but efficient approach for information-oriented document labeling tasks by addressing the automatic detection of language phenomena for a particular task which alters the sense or information content of the indicator phrase’s occurrences. [sent-36, score-0.689]

15 clinical notes usually contain information about the family history of the patient); or the semantic content of the shifter may change the role of the target span of a text (e. [sent-43, score-0.512]

16 We call these phenomena content shifters and the task of identifying them content shift detection (CSD). [sent-46, score-0.788]

17 , 2007) or a supervised learning approach that exploits corpora manually annotated at the token-level for a particular type of content shifter (Morante et al. [sent-48, score-0.291]

18 We argue that the nature of content shifters is domain and task dependent, so training corpora (at the token-level) are required for content shifters which are important for a particular task but the construction of such training corpora is expensive. [sent-51, score-0.754]

19 a clinical dataset consisting clinical notes and meta-data about patients). [sent-55, score-0.491]

20 Our approach extracts indicator phrases and trains a CSD jointly. [sent-56, score-0.297]

21 We focus on local content 760 shifters and we analyse just the sentences of indicator phrase occurrences. [sent-57, score-0.715]

22 Our chief assumption is that CSD can be learnt by exploiting the false positive occurrences of indicator phrases in the training dataset. [sent-58, score-0.598]

23 2 Related Work Information-oriented document classification tasks were first highlighted in the clinical domain where medical reports contain useful information about the patient in question, but labels are only available at the document (patient) level. [sent-61, score-0.783]

24 These challenges were dominated by entirely or partly rulebased systems that solved the tasks using indicator phrase lookup and incorporated explicit mechanisms for detecting speculation and negation. [sent-69, score-0.362]

25 Existing content shift detection approaches focus on a particular class of language phenomena, especially on negation and hedge recognitions. [sent-73, score-0.648]

26 content shifted text spans detectors for negation and speculation following a supervised sequence labeling approach, while O¨zg u¨r and Radev (2009) developed a rule-based system that exploits syntactic patterns. [sent-90, score-0.548]

27 We deal with the two tasks (information-oriented document classification and 761 content shift detection) together and introduce a colearning approach for them. [sent-99, score-0.563]

28 Our approach handles content shifters in a data-driven and generalized way i. [sent-100, score-0.377]

29 Several systems of the challenge employed a negation and speculation detection submodule. [sent-126, score-0.285]

30 The challenge in 2008 focused on analyzing clinical discharge summary texts and addressed the following question: ”Who is obese and what comorbidities do they have? [sent-130, score-0.343]

31 The top performing systems of the shared task employed mostly hand-crafted rules for indicator selection and for negation and uncertainty detection as well. [sent-138, score-0.523]

32 The categories assigned to Wikipedia articles can be regarded as labels (for example, the labels of David Beckham in the Wikipedia are English people, expatriate soccer player, male model and A. [sent-143, score-0.355]

33 For a case study we focused on learning English soccer clubs that a given sportsman played for. [sent-149, score-0.375]

34 Note that this task is an information-oriented document labeling task as the clubs for which a sportsman played are usually just mentioned (especially for smaller clubs) in the article of a player. [sent-150, score-0.389]

35 4 Document-labeling with CSD We introduce here an iterative solution which selects indicator phrases and trains a content shift detector at the same time. [sent-155, score-0.743]

36 Our focus will be on multilabel document classification tasks where multiple class labels can be assigned to a single document. [sent-156, score-0.304]

37 Our resulting multi-label model is then a set of binary classifiers ”assign a label” classifiers for each class label and the final prediction on a document is simply the union of the labels forecasted by the individual classifiers. [sent-158, score-0.378]

38 Our key assumption in the multi-label environment is that while indicator phrases have to be selected on a per class basis, the content shifters can be learnt in a class-independent (aggregated) way i. [sent-159, score-0.854]

39 we can assume that within one task, each class label belongs to a given semantic domain (determined by the task), thus the content shifters for their indicator – – phrases are the same. [sent-161, score-0.756]

40 This approach provides an adequate amount of training samples for content shift detector learning. [sent-162, score-0.446]

41 1 Learning Content Shift Detectors The key idea behind our approach is that a training corpus for task-specific content shifter learning can be automatically generated by exploiting the occurrences of indicators in various contexts. [sent-166, score-0.389]

42 The local context of an indicator is assumed to have altered if it yields a false positive document-level prediction. [sent-167, score-0.54]

43 More precisely, a training dataset can be constructed for learning a content shift detector in a way that the instances are the local contexts ofeach occurrence of indicator phrases in the training document set. [sent-168, score-1.065]

44 The instances of this content shifter training dataset are then labeled as non-altered when the indicated label is among the gold-standard labels of the document in question or is labeled as altered other- wise. [sent-169, score-0.722]

45 As a feature representation of a local context of an indicator phrase, the bag-of-words of the sentence instance (excluding the indicator phrase itself) was used at the beginning. [sent-171, score-0.572]

46 Our preliminary experiments showed that the tokens of the sentence after the indicator played a negligible role, hence we represented contexts just by tokens before the indicator. [sent-172, score-0.304]

47 First, the deepest noun phrase which includes the indicator phrase was identified, then all 3We parse only the sentences which contain indicator phrase which makes these features computable in reasonable time even on bigger document sets. [sent-175, score-0.749]

48 From the dependency parse, the lemmas and dependency labels on the directed path from the indicator to the root node (main path) were extracted. [sent-177, score-0.327]

49 – – Table 2 exemplifies the feature representation of local contexts of the Newcastle and Arsenal indicators for the Wikipedia soccer task. [sent-181, score-0.376]

50 We want to learn content shifters from them along with the true positive match of Arsenal in sentence 2. [sent-183, score-0.421]

51 2 Co-learning of Indicator Selection and CSD If document labels are available at training time, an iterative approach can be used to learn the local content shift detector and the indicator phrases as well. [sent-188, score-1.044]

52 The training phase of this procedure (see Algorithm 1) has two outputs, namely the set of indicator phrases for each label I the content shift deand tector S which is a binary function for determining whether the sense of an indicator in a particular local context is being altered. [sent-189, score-1.007]

53 Good indicator phrases are those that identify the class label in question when they are present. [sent-190, score-0.379]

54 In each step of the iteration we select indicator phrases I[l] for each label lbased on the actual state of the document set D′. [sent-191, score-0.509]

55 The better the selected indicators are, the better the content shift detectors can be learnt. [sent-195, score-0.503]

56 By applying the content shift detector to each token of the documents, each part of the texts lying within the scope of a content shifter can be removed4. [sent-196, score-0.766]

57 A non-altered indicator directly assigns a class label without any global consistency check on assigned labels. [sent-203, score-0.316]

58 There are several possible ways of developing indicator selection algorithms. [sent-210, score-0.286]

59 However indicator selection is not the focus of this 5A derivation is more complicated example-based classifiers like SVMs. [sent-213, score-0.321]

60 or unfeasible for Table 3: Results obtained for local content shift detection in a precision/recall/F-measure format. [sent-214, score-0.486]

61 The aim of the the indicator selection here is to cover each positive documents while introducing a relatively small amount of false positives. [sent-226, score-0.443]

62 The greedy algorithm iteratively selects the 1-best phrase according to a feature evaluation metric based on the actual state of covered documents and adds it to the indicator phrase set. [sent-227, score-0.328]

63 5 Experiments Experiments were carried out on the three datasets introduced in Section 3 with local content shift detection as an individual task and also to investigate its added value to information-oriented document labeling. [sent-231, score-0.691]

64 1 Content Shifter Learning Results In order to evaluate content shift detection as an individual task, a set ofindicator phrases have to be fixed as an input to the CSD. [sent-236, score-0.502]

65 We used manually collected indicator phrases for each label for each dataset. [sent-237, score-0.343]

66 Based on the occurrences of these fixed indicator phrases, CSD training datasets were built from the local contexts of the three datasets and binary classification was carried out by using MaxEnt. [sent-241, score-0.497]

67 Here, the precision/recall/F-measure values measure how many false positive matches of the indicator phrases can be recognized (the F-measure of the altered class), i. [sent-243, score-0.528]

68 here, the true positives are local contexts of an indicator phrase which do not indicate a document label in the evaluation set and the local content shift detector predicted it to be altered. [sent-245, score-1.147]

69 Row 3 of Table 3 (we refer to it as content shifted sentence detection baseline (CSSDB) later on) shows the results archived by the method which predicts every sentence to be alt e red which contain any cue phrases for negation, modality and different experiencer. [sent-247, score-0.412]

70 We collected cue phrases for such a content shifted sentence detection from the dataset adapters can be found as the supplementary material works of Chapman et al. [sent-249, score-0.346]

71 On the CMC dataset, our machine learning approach identified mostly negation and speculation expressions as content shifters; the top weighted features for the positive class of the MaxEnt model were no, without, may and vs. [sent-260, score-0.487]

72 On the Obesity dataset, similar content shifters were learnt along with references to family members (like the terms mother and uncle, and the family history header). [sent-263, score-0.521]

73 The significance of these types of content shifters may be illustrated by the following sentence: History of hypertension in mother and sister. [sent-264, score-0.377]

74 The soccer task highlighted totally different content shifters which is also the reason for the poor performance of CSSDB. [sent-265, score-0.584]

75 The mention of a club name which the person in question did not play for (false positives) is usually a rival club, club of an unsuccessful negotiation or club which was managed by the footballer after his retirement. [sent-266, score-0.33]

76 Table 3 shows that learnt 766 CSDs were able to eliminate a significant amount of false positive indicator phrase matches on each of the three datasets. [sent-270, score-0.528]

77 5) on the Soccer dataset as content shifters different from negation, hedge and experiencer are useful there. [sent-272, score-0.497]

78 On the other hand, the content shifters could be learnt on this dataset by our CSD approach (achieving F-score of 79. [sent-273, score-0.57]

79 This score can be regarded as an upper bound for the amount of false positive indicator matches that can be fixed by local speculation and negation detectors. [sent-287, score-0.687]

80 introducing more soccer clubs in the Soccer task) would also directly increase the size of training dataset as we use the occurrences of the indicator phrases belong- ing to each of the labels for training a CSD. [sent-300, score-0.722]

81 The first two learners are popular choices for document classification, while the third is similar to our simple indicator selection procedure. [sent-306, score-0.452]

82 Second, as our indicator 767 selection phrase can be regarded as a special feature selection method, we carried out an Information Gain-based feature selection (keeping the 500 bestrated features proved to be the best solution) on the bag-of-word representation of the documents. [sent-310, score-0.447]

83 The indicator selection results presented in the rows 4-8 of Table 4 made use of the p(+ |f)-based irnowdicsa 4t-o8r osefl Tecatbolre 4w mitha a ufisvee o-ffo tldhe-c pro(s+s-|vfa)l-idbaatseedd stopping threshold t (introduced in Section 4. [sent-313, score-0.286]

84 For training the CSD, we employed MaxEnt as a binary classifier for detecting altered local contexts and we used the basic BoW feature representation for the clinical tasks while the extended (BoW+syntactic) one for the Soccer dataset. [sent-318, score-0.438]

85 In the final experiment (the last row of Table 4)) we investigated whether the learnt content shift detector can be applied as a general ”document cleaner” tool. [sent-319, score-0.63]

86 For this, we trained the baseline MaxEnt document classifier with feature selection on documents from which the text spans predicted to be altered by the learnt CSD in the tenth iteration were removed. [sent-320, score-0.508]

87 This means that the systems used in row 2 and row 9 differ only in the applied document cleaner pre-processing steps (the first one applied the CSSDB while the latter one employed the the learnt CSD). [sent-321, score-0.39]

88 The difference between the best baseline and the indicator selector with learnt CSD and between the best baseline and the document classifier with learnt CSD were statistically significant8 on each dataset. [sent-322, score-0.688]

89 Our co-learning method which integrated the document-labeling and CSD tasks significantly outperformed the baseline approaches which use separate document cleaning and document labeling steps on the three datasets. [sent-325, score-0.387]

90 On the soccer domain club names, synonyms (like The Saints) and stadium names (e. [sent-330, score-0.317]

91 Note that in these information-oriented document multi-labeling tasks simple indicator selectionbased document labelers alone achieved results comparable to the bag-of-words-based classifiers. [sent-334, score-0.566]

92 The learnt content shift detectors led to an average improvement of 3. [sent-335, score-0.585]

93 After a few iterations the set of indicator phrases and the content shift detector did not change substantially. [sent-341, score-0.743]

94 001 768 6 Conclusions In this paper, we dealt with information-oriented document labeling tasks and investigated machine learning approaches for local content shift detectors from document-level labels. [sent-345, score-0.737]

95 We demonstrated experimentally that a significant amount of false positive matches of indicator phrases can be recognized by trained content shift detectors. [sent-346, score-0.773]

96 A co-learning framework for training local content shift detectors and indicator selection was introduced as well. [sent-352, score-0.802]

97 However, the proposed content shift detector learning approach is tailored for informationoriented document labeling tasks, i. [sent-355, score-0.667]

98 it performs – – well when not too many and reliable indicator phrases are present. [sent-357, score-0.297]

99 In the future, we plan to investigate and extend the framework for the general document classification task where many indicators with complex relationships among them determine the labels of a document but local content shifters can play an important role. [sent-358, score-0.948]

100 A shared task involving multi-label classification of clinical free text. [sent-455, score-0.314]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('csd', 0.453), ('indicator', 0.234), ('clinical', 0.221), ('farkas', 0.212), ('soccer', 0.207), ('shifters', 0.199), ('content', 0.178), ('shift', 0.177), ('document', 0.166), ('cmc', 0.156), ('learnt', 0.144), ('csds', 0.142), ('negation', 0.13), ('obesity', 0.127), ('uzuner', 0.122), ('shifter', 0.113), ('club', 0.11), ('altered', 0.11), ('speculation', 0.099), ('orgy', 0.099), ('smoking', 0.099), ('szarvas', 0.099), ('detector', 0.091), ('detectors', 0.086), ('bionlp', 0.085), ('gy', 0.085), ('icd', 0.085), ('pestian', 0.085), ('vincze', 0.085), ('false', 0.077), ('local', 0.075), ('clubs', 0.073), ('patient', 0.073), ('wikipedia', 0.072), ('hedge', 0.071), ('discharge', 0.066), ('phrases', 0.063), ('indicators', 0.062), ('coding', 0.061), ('labels', 0.06), ('bioscope', 0.057), ('cssdb', 0.057), ('pred', 0.057), ('sportsman', 0.057), ('detection', 0.056), ('labeling', 0.055), ('player', 0.055), ('medical', 0.055), ('selection', 0.052), ('shared', 0.051), ('dataset', 0.049), ('biological', 0.048), ('label', 0.046), ('positive', 0.044), ('positives', 0.044), ('arsenal', 0.042), ('medlock', 0.042), ('morante', 0.042), ('newcastle', 0.042), ('sauri', 0.042), ('classification', 0.042), ('row', 0.04), ('datasets', 0.039), ('rd', 0.039), ('played', 0.038), ('coach', 0.037), ('documents', 0.036), ('occurrences', 0.036), ('class', 0.036), ('classifiers', 0.035), ('diseases', 0.035), ('maxent', 0.034), ('veronika', 0.033), ('lemmas', 0.033), ('contexts', 0.032), ('chapman', 0.032), ('alt', 0.031), ('patients', 0.031), ('phrase', 0.029), ('scope', 0.029), ('regarded', 0.028), ('red', 0.028), ('archived', 0.028), ('cleaned', 0.028), ('comorbidities', 0.028), ('deepest', 0.028), ('ganter', 0.028), ('hedges', 0.028), ('hedging', 0.028), ('kilicoglu', 0.028), ('larkey', 0.028), ('modality', 0.028), ('nhofen', 0.028), ('obert', 0.028), ('obese', 0.028), ('ofindicator', 0.028), ('recognise', 0.028), ('roser', 0.028), ('smoker', 0.028), ('solt', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

Author: Richard Farkas

2 0.074181758 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts

Author: Bryan Rink ; Sanda Harabagiu

Abstract: This paper presents a generative model for the automatic discovery of relations between entities in electronic medical records. The model discovers relation instances and their types by determining which context tokens express the relation. Additionally, the valid semantic classes for each type of relation are determined. We show that the model produces clusters of relation trigger words which better correspond with manually annotated relations than several existing clustering techniques. The discovered relations reveal some of the implicit semantic structure present in patient records.

3 0.069530688 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

Author: Sebastian Riedel ; Andrew McCallum

Abstract: Extracting biomedical events from literature has attracted much recent attention. The bestperforming systems so far have been pipelines of simple subtask-specific local classifiers. A natural drawback of such approaches are cascading errors introduced in early stages of the pipeline. We present three joint models of increasing complexity designed to overcome this problem. The first model performs joint trigger and argument extraction, and lends itself to a simple, efficient and exact inference algorithm. The second model captures correlations between events, while the third model ensures consistency between arguments of the same event. Inference in these models is kept tractable through dual decomposition. The first two models outperform the previous best joint approaches and are very competitive with respect to the current state-of-theart. The third model yields the best results reported so far on the BioNLP 2009 shared task, the BioNLP 2011 Genia task and the BioNLP 2011Infectious Diseases task.

4 0.061375361 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

Author: Peng Li ; Yinglin Wang ; Wei Gao ; Jing Jiang

Abstract: In this paper, we propose a novel approach to automatic generation of aspect-oriented summaries from multiple documents. We first develop an event-aspect LDA model to cluster sentences into aspects. We then use extended LexRank algorithm to rank the sentences in each cluster. We use Integer Linear Programming for sentence selection. Key features of our method include automatic grouping of semantically related sentences and sentence ranking based on extension of random walk model. Also, we implement a new sentence compression algorithm which use dependency tree instead of parser tree. We compare our method with four baseline methods. Quantitative evaluation based on Rouge metric demonstrates the effectiveness and advantages of our method.

5 0.060927112 114 emnlp-2011-Relation Extraction with Relation Topics

Author: Chang Wang ; James Fan ; Aditya Kalyanpur ; David Gondek

Abstract: This paper describes a novel approach to the semantic relation detection problem. Instead of relying only on the training instances for a new relation, we leverage the knowledge learned from previously trained relation detectors. Specifically, we detect a new semantic relation by projecting the new relation’s training instances onto a lower dimension topic space constructed from existing relation detectors through a three step process. First, we construct a large relation repository of more than 7,000 relations from Wikipedia. Second, we construct a set of non-redundant relation topics defined at multiple scales from the relation repository to characterize the existing relations. Similar to the topics defined over words, each relation topic is an interpretable multinomial distribution over the existing relations. Third, we integrate the relation topics in a kernel function, and use it together with SVM to construct detectors for new relations. The experimental results on Wikipedia and ACE data have confirmed that backgroundknowledge-based topics generated from the Wikipedia relation repository can significantly improve the performance over the state-of-theart relation detection approaches.

6 0.055491947 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

7 0.051285584 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

8 0.050670598 128 emnlp-2011-Structured Relation Discovery using Generative Models

9 0.047687508 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

10 0.047604956 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

11 0.04677207 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

12 0.046744123 125 emnlp-2011-Statistical Machine Translation with Local Language Models

13 0.046642799 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

14 0.045344941 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling

15 0.044664562 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

16 0.044613723 116 emnlp-2011-Robust Disambiguation of Named Entities in Text

17 0.044199649 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification

18 0.043482676 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

19 0.039614853 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use

20 0.038694281 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.157), (1, -0.09), (2, -0.046), (3, -0.026), (4, -0.011), (5, -0.008), (6, 0.026), (7, -0.028), (8, -0.01), (9, 0.014), (10, 0.01), (11, -0.048), (12, -0.011), (13, 0.012), (14, 0.045), (15, 0.033), (16, -0.017), (17, -0.011), (18, -0.048), (19, -0.015), (20, -0.041), (21, -0.005), (22, 0.037), (23, -0.078), (24, -0.051), (25, -0.054), (26, 0.077), (27, 0.105), (28, 0.05), (29, -0.034), (30, 0.016), (31, 0.024), (32, -0.116), (33, -0.094), (34, -0.04), (35, 0.055), (36, 0.321), (37, 0.208), (38, 0.195), (39, 0.064), (40, -0.064), (41, 0.135), (42, 0.189), (43, 0.017), (44, 0.02), (45, -0.082), (46, 0.126), (47, 0.098), (48, -0.152), (49, 0.105)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93766797 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

Author: Richard Farkas

2 0.46798453 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

Author: Sebastian Riedel ; Andrew McCallum

3 0.43206069 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

Author: Eiji ARAMAKI ; Sachiko MASKAWA ; Mizuki MORITA

Abstract: Sachiko MASKAWA The University of Tokyo Tokyo, Japan s achi ko . mas kawa @ gma i . com l Mizuki MORITA National Institute of Biomedical Innovation Osaka, Japan mori ta . mi zuki @ gmai l com . posts more than 5.5 million messages (tweets) every day (reported by Twitter.com in March 201 1). With the recent rise in popularity and scale of social media, a growing need exists for systems that can extract useful information from huge amounts of data. We address the issue of detecting influenza epidemics. First, the proposed system extracts influenza related tweets using Twitter API. Then, only tweets that mention actual influenza patients are extracted by the support vector machine (SVM) based classifier. The experiment results demonstrate the feasibility of the proposed approach (0.89 correlation to the gold standard). Especially at the outbreak and early spread (early epidemic stage), the proposed method shows high correlation (0.97 correlation), which outperforms the state-of-the-art methods. This paper describes that Twitter texts reflect the real world, and that NLP techniques can be applied to extract only tweets that contain useful information. 1

4 0.3982982 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

Author: Ashish Venugopal ; Jakob Uszkoreit ; David Talbot ; Franz Och ; Juri Ganitkevitch

Abstract: We propose a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method is robust to local editing operations and provides well defined trade-offs between the ability to identify algorithm outputs and the quality of the watermarked output. Unlike previous work in the field, our approach does not rely on controlling the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm. We present an application in statistical machine translation, where machine translated output is watermarked at minimal loss in translation quality and detected with high recall. 1 Motivation Machine learning algorithms provide structured results to input queries by simulating human behavior. Examples include automatic machine translation (Brown et al. , 1993) or automatic text and rich media summarization (Goldstein et al. , 1999) . These algorithms often estimate some portion of their models from publicly available human generated data. As new services that output structured results are made available to the public and the results disseminated on the web, we face a daunting new challenge: Machine generated structured results contaminate the pool of naturally generated human data. For example, machine translated output 1363 2Center for Language and Speech Processing Johns Hopkins University Baltimore, MD 21218, USA juri@cs.jhu.edu and human generated translations are currently both found extensively on the web, with no automatic way of distinguishing between them. Algorithms that mine data from the web (Uszkoreit et al. , 2010) , with the goal of learning to simulate human behavior, will now learn models from this contaminated and potentially selfgenerated data, reinforcing the errors committed by earlier versions of the algorithm. It is beneficial to be able to identify a set of encountered structured results as having been generated by one’s own algorithm, with the purpose of filtering such results when building new models. Problem Statement: We define a structured result of a query q as r = {z1 · · · zL} where tthuree odr rdeesru latn dof identity qof a sele rm =en {tzs zi are important to the quality of the result r. The structural aspect of the result implies the existence of alternative results (across both the order of elements and the elements themselves) that might vary in their quality. Given a collection of N results, CN = r1 · · · rN, where each result ri has k rankedC alterna·t·iv·ers Dk (qi) of relatively similar quality and queries q1 · · · qN are arbitrary and not controlled by the watermarking algorithm, we define the watermarking task as: Task. Replace ri with ri0 ∈ Dk (qi) for some subset of results in CN to produce a watermarked sceoltle ocfti orens CN0 slleucchti otnha Ct: • CN0 is probabilistically identifiable as having bCeen generated by one’s own algorithm. Proce dEindgisnb oufr tgh e, 2 S0c1o1tl Canodn,f eUrKen,c Jeuol yn 2 E7m–3p1ir,ic 2a0l1 M1.e ?tc ho2d0s1 in A Nsasotucira tlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinag uesis 1ti3c6s3–1372, • • • 2 the degradation in quality from CN to the wthaete dremgarrakdeadt CN0 isnho quulda bitye analytically controllable, trading quality for detection performance. CN0 should not be detectable as watermCarked content without access to the generating algorithms. the detection of CN0 should be robust to simple eddeitte operations performed on individual results r ∈ CN0. Impact on Statistical Machine Translation Recent work(Resnik and Smith, 2003; Munteanu and Marcu, 2005; Uszkoreit et al. , 2010) has shown that multilingual parallel documents can be efficiently identified on the web and used as training data to improve the quality of statistical machine translation. The availability of free translation services (Google Translate, Bing Translate) and tools (Moses, Joshua) , increase the risk that the content found by parallel data mining is in fact generated by a machine, rather than by humans. In this work, we focus on statistical machine translation as an application for watermarking, with the goal of discarding documents from training if they have been generated by one’s own algorithms. To estimate the magnitude of the problem, we used parallel document mining (Uszkoreit et al. , 2010) to generate a collection of bilingual document pairs across several languages. For each document, we inspected the page content for source code that indicates the use of translation modules/plug-ins that translate and publish the translated content. We computed the proportion of the content within our corpus that uses these modules. We find that a significant proportion of the mined parallel data for some language pairs is generated via one of these translation modules. The top 3 languages pairs, each with parallel translations into English, are Tagalog (50.6%) , Hindi (44.5%) and Galician (41.9%) . While these proportions do not reflect impact on each language’s monolingual web, they are certainly high 1364 enough to affect machine translations systems that train on mined parallel data. In this work, we develop a general approach to watermark structured outputs and apply it to the outputs of a statistical machine translation system with the goal of identifying these same outputs on the web. In the context of the watermarking task defined above, we output selecting alternative translations for input source sentences. These translations often undergo simple edit and formatting operations such as case changes, sentence and word deletion or post editing, prior to publishing on the web. We want to ensure that we can still detect watermarked translations despite these edit operations. Given the rapid pace of development within machine translation, it is also important that the watermark be robust to improvements in underlying translation quality. Results from several iterations of the system within a single collection of documents should be identifiable under probabilistic bounds. While we present evaluation results for statistical machine translation, our proposed approach and associated requirements are applicable to any algorithm that produces structured results with several plausible alternatives. The alternative results can arise as a result of inherent task ambiguity (for example, there are multiple correct translations for a given input source sentence) or modeling uncertainty (for example, a model assigning equal probability to two competing results) . 3 Watermark Structured Results Selecting an alternative r0 from the space of alternatives Dk (q) can be stated as: r0= arr∈gDmk(aqx)w(r,Dk(q),h) (1) where w ranks r ∈ Dk (q) based on r’s presentwahtieorne owf a watermarking signal computed by a hashing operation h. In this approach, w and its component operation h are the only secrets held by the watermarker. This selection criterion is applied to all system outputs, ensuring that watermarked and non-watermarked version of a collection will never be available for comparison. A specific implementation of w within our watermarking approach can be evaluated by the following metrics: • • • False Positive Rate: how often nonFwaaltseermarked collections are falsely identified as watermarked. Recall Rate: how often watermarked collRecectiaolnls R are correctly inde wntaitfeierdm as wdat ceorl-marked. Quality Degradation: how significantly dQoueasl CN0 d Dieffegrr fdraotmio CN when evaluated by tdaoseks specific quality Cmetrics. While identification is performed at the collection level, we can scale these metrics based on the size of each collection to provide more task sensitive metrics. For example, in machine translation, we count the number of words in the collection towards the false positive and recall rates. In Section 3.1, we define a random hashing operation h and a task independent implementation of the selector function w. Section 3.2 describes how to classify a collection of watermarked results. Section 3.3 and 3.4 describes refinements to the selection and classification criteria that mitigate quality degradation. Following a comparison to related work in Section 4, we present experimental results for several languages in Section 5. 3.1 Watermarking: CN → CN0 We define a random hashing operation h that is applied to result r. It consists of two components: • A hash function applied to a structured re- sAul ht r hto f generate a lbieitd sequence cotfu a dfix reedlength. • An optional mapping that maps a single cAannd oidptaitoen raels umlta r ntog a hsaett mofa spusb -are ssiunlgtsle. Each sub-result is then hashed to generate a concatenated bit sequence for r. A good hash function produces outputs whose bits are independent. This implies that we can treat the bits for any input structured results 1365 as having been generated by a binomial distribution with equal probability of generating 1s vs 0s. This condition also holds when accumulating the bit sequences over a collection of results as long as its elements are selected uniformly from the space of possible results. Therefore, the bits generated from a collection of unwatermarked results will follow a binomial distribution with parameter p = 0.5. This result provides a null hypothesis for a statistical test on a given bit sequence, testing whether it is likely to have been generated from a binomial distribution binomial(n, p) where p = 0.5 and n is the length of the bit sequence. For a collection CN = r1 · · · rN, we can define a Fwaorte arm coalrlekc ranking funct·i·o·nr w to systematically select alternatives ri0 ∈ Dk (q) , such that the resulting CN0 is unlikely ∈to D produce bit sequences ltthinagt f Collow the p = 0.5 binomial distribution. A straightforward biasing criteria would be to select the candidate whose bit sequence exhibits the highest ratio of 1s. w can be defined as: (2) w(r,Dk(q),h) =#(|h1,(rh)(|r)) where h(r) returns the randomized bit sequence for result r, and #(x, y) counts the number of occurrences of x in sequence Selecting alternatives results to exhibit this bias will result in watermarked collections that exhibit this same bias. y. 3.2 Detecting the Watermark To classify a collection CN as watermarked or non-watermarked, we apply the hashing operation h on each element in CN and concatenate ttihoen sequences. eTlhemis sequence is tested against the null hypothesis that it was generated by a binomial distribution with parameter p = 0.5. We can apply a Fisherian test of statistical significance to determine whether the observed distribution of bits is unlikely to have occurred by chance under the null hypothesis (binomial with p = 0.5) . We consider a collection of results that rejects the null hypothesis to be watermarked results generated by our own algorithms. The p-value under the null hypothesis is efficiently computed by: p − value = Pn (X ≥ x) = Xi=nx?ni?pi(1 − p)n−i (3) (4) where x is the number of 1s observed in the collection, and n is the total number of bits in the sequence. Comparing this p-value against a desired significance level α, we reject the null hypothesis for collections that have Pn(X ≥ x) < α, thus deciding that such collections( were gen- erated by our own system. This classification criteria has a fixed false positive rate. Setting α = 0.05, we know that 5% of non-watermarked bit sequences will be falsely labeled as watermarked. This parameter α can be controlled on an application specific basis. By biasing the selection of candidate results to produce more 1s than 0s, we have defined a watermarking approach that exhibits a fixed false positive rate, a probabilistically bounded detection rate and a task independent hashing and selection criteria. In the next sections, we will deal with the question of robustness to edit operations and quality degradation. 3.3 Robustness and Inherent Bias We would like the ability to identify watermarked collections to be robust to simple edit operations. Even slight modifications to the elements within an item r would yield (by construction of the hash function) , completely different bit sequences that no longer preserve the biases introduced by the watermark selection function. To ensure that the distributional biases introduced by the watermark selector are preserved, we can optionally map individual results into a set of sub-results, each one representing some local structure of r. h is then applied to each subresult and the results concatenated to represent r. This mapping is defined as a component of the h operation. While a particular edit operation might affect a small number of sub-results, the majority of the bits in the concatenated bit sequence for r would remain untouched, thereby limiting the damage to the biases selected during watermark1366 ing. This is of course no defense to edit operations that are applied globally across the result; our expectation is that such edits would either significantly degrade the quality of the result or be straightforward to identify directly. For example, a sequence of words r = z1 · · · zL can be mapped into a set of consecutive n-gram sequences. Operations to edit a word zi in r will only affect events that consider the word zi. To account for the fact that alternatives in Dk (q) might now result in bit sequences of different lengths, we can generalize the biasing criteria to directly reflect the expected contribution to the watermark by defining: w(r, Dk(q), h) = Pn(X ≥ #(1, h(r))) (5) where Pn gives probabilities from binomial(n = |h(r) |,p = 0.5) . (Irn)|h,epr =en 0t. 5c)o.llection level biases: Our null hypothesis is based on the assumption that collections of results draw uniformly from the space of possible results. This assumption might not always hold and depends on the type of the results and collection. For example, considering a text document as a collection of sentences, we can expect that some sentences might repeat more frequently than others. This scenario is even more likely when applying a mapping into sub-results. n-gram sequences follow long-tailed or Zipfian distributions, with a small number of n-grams contributing heavily toward the total number of n-grams in a document. A random hash function guarantees that inputs are distributed uniformly at random over the output range. However, the same input will be assigned the same output deterministically. Therefore, if the distribution of inputs is heavily skewed to certain elements of the input space, the output distribution will not be uniformly distributed. The bit sequences resulting from the high frequency sub-results have the potential to generate inherently biased distributions when accumulated at the collection level. We want to choose a mapping that tends towards generating uniformly from the space of sub-results. We can empirically measure the quality of a sub-result mapping for a specific task by computing the false positive rate on non-watermarked collections. For a given significance level α, an ideal mapping would result in false positive rates close to α as well. Figure 1 shows false positive rates from 4 alternative mappings, computed on a large corpus of French documents (see Table 1for statistics) . Classification decisions are made at the collection level (documents) but the contribution to the false positive rate is based on the number of words in the classified document. We consider mappings from a result (sentence) into its 1-grams, 1 − 5-grams and 3 − 5 grams as well as trahem non-mapping case, w 3h −ere 5 tghrea mfusll a sres wuelltl is hashed. Figure 1 shows that the 1-grams and 1 − 5gram generate wsusb t-hraetsul tthse t 1h-agtr rmessu latn idn 1h −eav 5-ily biased false positive rates. The 3 − 5 gram mapping yields pfaolsseit positive r.a Ttesh ecl 3os −e t 5o gthraemir theoretically expected values. 1 Small deviations are expected since documents make different contributions to the false positive rate as a function of the number of words that they represent. For the remainder of this work, we use the 3-5 gram mapping and the full sentence mapping, since the alternatives generate inherently distributions with very high false positive rates. 3.4 Considering Quality The watermarking described in Equation 3 chooses alternative results on a per result basis, with the goal of influencing collection level bit sequences. The selection criteria as described will choose the most biased candidates available in Dk (q) . The parameter k determines the extent to which lesser quality alternatives can be chosen. If all the alternatives in each Dk (q) are of relatively similar quality, we expect minimal degradation due to watermarking. Specific tasks however can be particularly sensitive to choosing alternative results. Discriminative approaches that optimize for arg max selection like (Och, 2003; Liang et al. , 2006; Chiang et al. , 2009) train model parameters such 1In the final version of this paper we will perform sampling to create a more reliable estimate of the false positive rate that is not overly influenced by document length distributions. 1367 that the top-ranked result is well separated from its competing alternatives. Different queries also differ in the inherent ambiguity expected from their results; sometimes there really is just one correct result for a query, while for other queries, several alternatives might be equally good. By generalizing the definition of the w function to interpolate the estimated loss in quality and the gain in the watermarking signal, we can trade-off the ability to identify the watermarked collections against quality degradation: w(r,Dk(q),fw)− =(1 λ − ∗ λ g)ai ∗nl( or,s D(rk,(Dq)k,(fqw)) (6) Loss: The loss(r, Dk (q)) function reflects the quality degradation that results from selecting alternative r as opposed to the best ranked candidate in Dk (q)) . We will experiment with two variants: lossrank (r, Dk (q)) = (rank(r) − k)/k losscost(r, Dk(q)) = (cost(r)−cost(r1))/ cost(r1) where: • • • rank(r) : returns the rank of r within Dk (q) . cost(r) : a weighted sum of features (not cnoosrtm(ra)li:ze ad over httheed sse uarmch o space) rine a loglinear model such as those mentioned in (Och, 2003). r1: the highest ranked alternative in Dk (q) . lossrank provides a generally applicable criteria to select alternatives, penalizing selection from deep within Dk (q) . This estimate of the quality degradation does not reflect the generating model’s opinion on relative quality. losscost considers the relative increase in the generating model’s cost assigned to the alternative translation. Gain: The gain(r, Dk (q) , fw) function reflects the gain in the watermarking signal by selecting candidate r. We simply define the gain as the Pn(X ≥ #(1, h(r))) from Equation 5. ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (a) 1-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (c) 3 − 5-grams mapping ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 ptendcuo mfrsi 0 . 204186eoxbpsecrvted0.510.25 p-value threshold (b) 1− 5-grams mapping p-value threshold (d) Full result hashing Figure 1 Comparison : of expected false positive rates against observed false positive rates for different sub-result mappings. 4 Related Work Using watermarks with the goal of transmitting a hidden message within images, video, audio and monolingual text media is common. For structured text content, linguistic approaches like (Chapman et al. , 2001; Gupta et al., 2006) use language specific linguistic and semantic expansions to introduce hidden watermarks. These expansions provide alternative candidates within which messages can be encoded. Recent publications have extended this idea to machine translation, using multiple systems and expansions to generate alternative translations. (Stutsman et al. , 2006) uses a hashing function to select alternatives that encode the hidden message in the lower order bits of the translation. In each of these approaches, the watermarker has control over the collection of results into which the watermark is to be embedded. These approaches seek to embed a hidden message into a collection of results that is selected by the watermarker. In contrast, we address the condition where the input queries are not in the watermarker’s control. 1368 The goal is therefore to introduce the watermark into all generated results, with the goal of probabilistically identifying such outputs. Our approach is also task independent, avoiding the need for templates to generate additional alternatives. By addressing the problem directly within the search space of a dynamic programming algorithm, we have access to high quality alternatives with well defined models of quality loss. Finally, our approach is robust to local word editing. By using a sub-result mapping, we increase the level of editing required to obscure the watermark signal; at high levels of editing, the quality of the results themselves would be significantly degraded. 5 Experiments We evaluate our watermarking approach applied to the outputs of statistical machine translation under the following experimental setup. A repository of parallel (aligned source and target language) web documents is sampled to produce a large corpus on which to evaluate the watermarking classification performance. The corpora represent translations into 4 diverse target languages, using English as the source language. Each document in this corpus can be considered a collection of un-watermarked structured results, where source sentences are queries and each target sentence represents a structured result. Using a state-of-the-art phrase-based statistical machine translation system (Och and Ney, 2004) trained on parallel documents identified by (Uszkoreit et al. , 2010) , we generate a set of 100 alternative translations for each source sentence. We apply the proposed watermarking approach, along with the proposed refinements that address task specific loss (Section 3.4) and robustness to edit operations (Section 3.3) to generate watermarked corpora. Each method is controlled via a single parameter (like k or λ) which is varied to generate alternative watermarked collections. For each parameter value, we evaluate the Recall Rate and Quality Degradation with the goal of finding a setting that yields a high recall rate, minimal quality degradation. False positive rates are evaluated based on a fixed classification significance level of α = 0.05. The false positive and recall rates are evaluated on the word level; a document that is misclassified or correctly identified contributes its length in words towards the error calculation. In this work, we use α = 0.05 during classification corresponding to an expected 5% false positive rate. The false positive rate is a function of h and the significance level α and therefore constant across the parameter values k and λ. We evaluate quality degradation on human translated test corpora that are more typical for machine translation evaluation. Each test corpus consists of 5000 source sentences randomly selected from the web and translated into each respective language. We chose to evaluate quality on test corpora to ensure that degradations are not hidden by imperfectly matched web corpora and are consistent with the kind of results often reported for machine translation systems. As with the classification corpora, we create watermarked versions at each parameter value. For a given pa1369 recall Figure 2: BLEU loss against recall of watermarked content for the baseline approach (max K-best) , rank and cost interpolation. rameter value, we measure false positive and re- call rates on the classification corpora and quality degradation on the evaluation corpora. Table 1 shows corpus statistics for the classification and test corpora and non-watermarked BLEU scores for each target language. All source texts are in English. 5.1 Loss Interpolated Experiments Our first set of experiments demonstrates baseline performance using the watermarking criteria in Equation 5 versus the refinements suggested in Section 3.4 to mitigate quality degradation. The h function is computed on the full sentence result r with no sub-event mapping. The following methods are evaluated in Figure 2. • • Baseline method (labeled “max K-best” ): sBealescetlsin er0 purely (blaasbedel on gain Kin- bweastte”r):marking signal (Equation 5) and is parameterized by k: the number of alternatives considered for each result. Rank interpolation: incorporates rank into w, varying ptholea interpolation parameter nλ.t • Cost interpolation: incorporates cost into w, varying tohlea interpolation parameter nλ.t The observed false positive rate on the French classification corpora is 1.9%. ClassificationQuality Ta AbFHularei ankg1bdic:sehitCon#t12e08n7w39t1065o s40r7tda617tsi c#sfo1e r85n37c2018tl2a5e4s 5n0sicfeastion#adno1c68 q3u09 06ma70lietynsdegr#ad7 aw3 t534io9 0rn279dcsorp# as.e5 nN54 t08 oe369n-wceatsrmBaL21 kEe6320d. U462579 B%LEU scores are reported for the quality corpora. We consider 0.2% BLEU loss as a threshold for acceptable quality degradation. Each method is judged by its ability to achieve high recall below this quality degradation threshold. Applying cost interpolation yields the best results in Figure 2, achieving a recall of 85% at 0.2% BLEU loss, while rank interpolation achieves a recall of 76%. The baseline approach of selecting the highest gain candidate within a depth of k candidates does not provide sufficient parameterization to yield low quality degradation. At k = 2, this method yields almost 90% recall, but with approximately 0.4% BLEU loss. 5.2 Robustness Experiments In Section 5.2, we proposed mapping results into sub-events or features. We considered alternative feature mappings in Figure 1, finding that mapping sentence results into a collection of 35 grams yields acceptable false positive rates at varied levels of α. Figure 3 presents results that compare moving from the result level hashing to the 3-5 gram sub-result mapping. We show the impact of the mapping on the baseline max K-best method as well as for cost interpolation. There are substantial reductions in recall rate at the 0.2% BLEU loss level when applying sub-result mappings in cases. The cost interpolation method recall drops from 85% to 77% when using the 3-5 grams event mapping. The observed false positive rate of the 3-5 gram mapping is 4.7%. By using the 3-5 gram mapping, we expect to increase robustness against local word edit operations, but we have sacrificed recall rate due to the inherent distributional bias discussed in Section 3.3. 1370 recall Figure 3: BLEU loss against recall of watermarked content for the baseline and cost interpolation methods using both result level and 3-5 gram mapped events. 5.3 Multilingual Experiments The watermarking approach proposed here introduces no language specific watermarking operations and it is thus broadly applicable to translating into all languages. In Figure 4, we report results for the baseline and cost interpolation methods, considering both the result level and 3-5 gram mapping. We set α = 0.05 and measure recall at 0.2% BLEU degradation for translation from English into Arabic, French, Hindi and Turkish. The observed false positive rates for full sentence hashing are: Arabic: 2.4%, French: 1.8%, Hindi: 5.6% and Turkish: 5.5%, while for the 3-5 gram mapping, they are: Arabic: 5.8%, French: 7.5%, Hindi:3.5% and Turkish: 6.2%. Underlying translation quality plays an important role in translation quality degradation when watermarking. Without a sub-result mapping, French (BLEU: 26.45%) Figure 4: Loss of recall when using 3-5 gram mapping vs sentence level mapping for Arabic, French, Hindi and Turkish translations. achieves recall of 85% at 0.2% BLEU loss, while the other languages achieve over 90% recall at the same BLEU loss threshold. Using a subresult mapping degrades quality for each language pair, but changes the relative performance. Turkish experiences the highest relative drop in recall, unlike French and Arabic, where results are relatively more robust to using sub-sentence mappings. This is likely a result of differences in n-gram distributions across these languages. The languages considered here all use space separated words. For languages that do not, like Chinese or Thai, our approach can be applied at the character level. 6 Conclusions In this work we proposed a general method to watermark and probabilistically identify the structured outputs of machine learning algorithms. Our method provides probabilistic bounds on detection ability, analytic control on quality degradation and is robust to local edit- ing operations. Our method is applicable to any task where structured outputs are generated with ambiguities or ties in the results. We applied this method to the outputs of statistical machine translation, evaluating each refinement to our approach with false positive and recall rates against BLEU score quality degradation. Our results show that it is possible, across several language pairs, to achieve high recall rates (over 80%) with low false positive rates (between 5 and 8%) at minimal quality degradation (0.2% 1371 BLEU) , while still allowing for local edit operations on the translated output. In future work we will continue to investigate methods to mitigate quality loss. References Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Minimum error rate training in statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Peter F. Brown, Vincent J.Della Pietra, Stephen A. Della Pietra, and Robert. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311 . Mark Chapman, George Davida, and Marc Rennhardway. 2001. A practical and effective approach to large-scale automated linguistic steganography. In Proceedings of the Information Security Conference. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT). Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval, pages 121–128. Gaurav Gupta, Josef Pieprzyk, and Hua Xiong Wang. 2006. An attack-localizing watermarking scheme for natural language documents. In Proceedings of the 2006 A CM Symposium on Information, computer and communications security, ASIACCS ’06, pages 157–165, New York, NY, USA. ACM. Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint International Conference on Computational Linguistics and Association of Computational Linguistics (COLING/A CL, pages 761–768. Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics. Franz Josef Och and Hermann Ney. alignment template approach to statistical machine translation. Computational Linguistics. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 2003 Meeting of the Asssociation of Computational Linguistics. Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. computational linguistics. Computational Linguistics. Ryan Stutsman, Mikhail Atallah, Christian Grothoff, and Krista Grothoff. 2006. Lost in just the translation. In Proceedings of the 2006 A CM Symposium on Applied Computing. Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 2010 COLING. 1372 2004. The

5 0.38791528 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

Author: Lukas Michelbacher ; Alok Kothari ; Martin Forst ; Christina Lioma ; Hinrich Schutze

Abstract: Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because unlike other work on MWUs – tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system. – 1

6 0.36952823 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

7 0.34079373 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model

8 0.30911091 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

9 0.30196545 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts

10 0.2983669 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

11 0.29104033 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

12 0.2769627 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus

13 0.26916412 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images

14 0.2586216 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

15 0.25754896 110 emnlp-2011-Ranking Human and Machine Summarization Systems

16 0.25441566 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

17 0.2541827 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

18 0.24973795 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

19 0.2468482 114 emnlp-2011-Relation Extraction with Relation Topics

20 0.24102467 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.121), (31, 0.382), (36, 0.027), (37, 0.031), (45, 0.068), (53, 0.014), (54, 0.015), (57, 0.019), (62, 0.031), (64, 0.016), (66, 0.026), (69, 0.02), (79, 0.032), (82, 0.018), (96, 0.043), (98, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69640756 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

Author: Richard Farkas

2 0.3963494 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter

Author: Eiji ARAMAKI ; Sachiko MASKAWA ; Mizuki MORITA

3 0.39494383 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

4 0.38789165 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

Author: Burr Settles

Abstract: This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels. Human annotators in user studies were able to produce near-stateof-the-art classifiers—on several corpora in a variety of application domains—with only a few minutes of effort.

5 0.38454607 128 emnlp-2011-Structured Relation Discovery using Generative Models

Author: Limin Yao ; Aria Haghighi ; Sebastian Riedel ; Andrew McCallum

Abstract: We explore unsupervised approaches to relation extraction between two named entities; for instance, the semantic bornIn relation between a person and location entity. Concretely, we propose a series of generative probabilistic models, broadly similar to topic models, each which generates a corpus of observed triples of entity mention pairs and the surface syntactic dependency path between them. The output of each model is a clustering of observed relation tuples and their associated textual expressions to underlying semantic relation types. Our proposed models exploit entity type constraints within a relation as well as features on the dependency path between entity mentions. We examine effectiveness of our approach via multiple evaluations and demonstrate 12% error reduction in precision over a state-of-the-art weakly supervised baseline.

6 0.38263053 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

7 0.38238195 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

8 0.38198885 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

9 0.38162249 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

10 0.38159978 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

11 0.38157982 136 emnlp-2011-Training a Parser for Machine Translation Reordering

12 0.38119948 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

13 0.38113037 70 emnlp-2011-Identifying Relations for Open Information Extraction

14 0.37999669 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

15 0.37963232 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

16 0.37870669 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

17 0.37819481 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

18 0.37727952 17 emnlp-2011-Active Learning with Amazon Mechanical Turk

19 0.37709638 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

20 0.37566304 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning