acl acl2011 acl2011-244 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ruihong Huang ; Ellen Riloff
Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. [sent-3, score-2.105]
2 We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. [sent-4, score-0.962]
3 Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. [sent-5, score-1.373]
4 For example, the Message Understanding Conferences (MUCs) challenged NLP researchers to create event extraction systems for domains such as terrorism (e. [sent-9, score-0.914]
5 Most event extraction systems use either a learning-based classifier to label words as role fillers, or lexico-syntactic patterns to extract role fillers from pattern contexts. [sent-14, score-1.514]
6 Both approaches, however, generally tackle event recognition and role filler extraction at the same time. [sent-15, score-1.151]
7 In other words, most event extraction systems primarily recognize contexts that explicitly refer to a relevant event. [sent-16, score-1.064]
8 The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. [sent-25, score-2.105]
9 We create a set of classifiers to recognize role-specific contexts that suggest the presence of a likely role filler regardless of whether a relevant event is mentioned or not. [sent-26, score-1.344]
10 To address this, we adopt a two-pronged strategy for event extraction that handles event narrative documents differently from other documents. [sent-31, score-1.853]
11 We define an event narrative as an article whose main purpose is to report the details of an event. [sent-32, score-0.91]
12 We apply the role-specific sentence classifiers only to event narratives to aggressively search for role fillers in these stories. [sent-33, score-1.51]
13 We will refer to these documents as fleeting reference texts because they mention a relevant event somewhere in the document, albeit briefly. [sent-36, score-1.187]
14 To ensure that relevant information is extracted from all documents, we also apply a conservative extraction process to every document to extract facts from explicit event sentences. [sent-37, score-1.051]
15 Our complete event extraction model, called TIER, incorporates both document genre and role-specific context recognition into 3 layers of analysis: document analysis, sentence analysis, and noun phrase (NP) analysis. [sent-38, score-1.146]
16 At the top level, we train a text genre classifier to identify event narrative documents. [sent-39, score-1.082]
17 Event sentence classifiers identify sentences that are associated with relevant events, and role-specific context classifiers identify sentences that contain possible role fillers irrespective of whether an event is mentioned. [sent-41, score-1.593]
18 All documents pass through the event sentence classifier, and event sentences are given to the role filler extractors. [sent-44, score-1.969]
19 Documents identified as event narratives additionally pass through role-specific sentence classifiers, and the role-specific sentences are also given to the role filler extractors. [sent-45, score-1.296]
20 This multi-layered approach creates an event extraction system that can discover role fillers in a variety of different contexts, while maintaining good precision. [sent-46, score-1.258]
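To make the two-pronged routing just described concrete, here is a minimal sketch of the pipeline logic. The function and argument names (is_narrative, event_sent_clf, role_sent_clfs, np_extractors) are illustrative stand-ins for the trained components, not the authors' implementation.

    # Hypothetical sketch of TIER's routing: every document takes the
    # conservative event-sentence path; documents classified as event
    # narratives additionally take the aggressive role-specific path.
    def tier_extract(doc_sentences, is_narrative, event_sent_clf,
                     role_sent_clfs, np_extractors):
        # role_sent_clfs / np_extractors map each event role (e.g. 'victim')
        # to its role-specific sentence classifier / NP-level extractor.
        extractions = {role: set() for role in np_extractors}
        for sent in doc_sentences:
            from_event_sent = event_sent_clf(sent)            # all documents
            for role, extract_nps in np_extractors.items():
                from_role_sent = is_narrative and role_sent_clfs[role](sent)
                if from_event_sent or from_role_sent:
                    extractions[role].update(extract_nps(sent))
        return extractions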
21 In the following sections, we position our research with respect to related work, present the details of our multi-layered event extraction model, and show experimental results for five event roles using the MUC-4 data set. [sent-47, score-1.612]
22 2 Related Work Some event extraction data sets only include documents that describe relevant events (e. [sent-48, score-1.072]
23 But many IE data sets present a more realistic task where the IE system must determine whether a relevant event is present in the document, and if so, extract its role fillers. [sent-51, score-0.989]
24 Most of the Message Understanding Conference data sets represent this type of event extraction task, containing (roughly) a 50/50 mix of relevant and irrelevant documents (e. [sent-52, score-1.069]
25 Our research focuses on this setting where the event extraction system is not assured of getting only relevant documents to process. [sent-55, score-1.04]
26 In addition, many classifiers have been created to sequentially label event role fillers in a sentence (e. [sent-71, score-1.26]
27 , 2003; Bunescu and Mooney, 2007)), but that task is different from event extraction because it focuses on isolated relations rather than template-based event analysis. [sent-79, score-1.574]
28 Most event extraction systems scan a text and search small context windows using patterns or a classifier. [sent-80, score-0.87]
29 Ji and Grishman (2008) enforce event role consistency across different documents. [sent-83, score-0.892]
30 Patwardhan and Riloff (2007) developed a system that learns to recognize event sentences and uses patterns that have a semantic affinity for an event role to extract role fillers. [sent-87, score-1.908]
31 Our event extraction model progressively “zooms in” on relevant information by first identifying the document type, then identifying sentences that are likely to contain relevant information, and finally analyzing individual noun phrases to identify role fillers. [sent-93, score-1.367]
32 Figure 1 shows the multi-layered pipeline of our event extraction system. [sent-96, score-0.859]
33 The event extraction task is to find any description of a relevant event, even if the event is not the topic of the article. [sent-98, score-1.671]
34 Consequently, all documents are given to the event sentence recognizers and their mission is to identify any sentence that mentions a relevant event. [sent-99, score-1.054]
35 This path through the pipeline is conservative because information is ex- tracted only from event sentences, but all documents are processed, including stories that contain only a fleeting reference to a relevant event. [sent-100, score-1.195]
36 The second path through the pipeline performs additional processing for documents that belong to the event narrative text genre. [sent-102, score-1.042]
37 For event narratives, we assume that most of the document discusses a relevant event so we can more aggressively hunt for event-related information in secondary contexts. [sent-103, score-1.796]
38 We will return to the issue of document genre and the event narrative classifier in Section 4. [sent-105, score-1.109]
39 3.1 Sentence Classification We have argued that event role fillers commonly occur in two types of contexts: event contexts and role-specific secondary contexts. [sent-107, score-2.074]
40 A secondary context is a sentence that provides information related to an event but in the context of other activities that precede or follow the event. [sent-110, score-0.885]
41 Each document that describes a relevant event has answer key templates with the role fillers (answer key strings) for each event. [sent-113, score-1.424]
42 To train the event sentence recognizer, we consider a sentence to be a positive training instance if it contains one or more answer key strings from any of the event roles. [sent-114, score-1.643]
43 There is no guarantee that a classifier trained in this way will identify event sentences, but our hypothesis was that training across all of the event roles together would produce a classifier that learns to recognize general event contexts. [sent-118, score-2.48]
44 This approach was also used to train GLACIER’s sentential event recognizer (Patwardhan and Riloff, 2009), and they demonstrated that this approach worked reasonably well when compared to training with event sentences labelled by human judges. [sent-119, score-1.538]
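The construction of training instances for the event sentence recognizer can be sketched in a few lines. This is a simplified reconstruction: the lowercase substring match and the example strings are assumptions, not the authors' exact preprocessing.

    def label_event_sentences(sentences, answer_key_strings):
        # Positive if the sentence contains an answer key string from ANY
        # event role; negative otherwise.
        keys = [k.lower() for k in answer_key_strings]
        return [(s, 1 if any(k in s.lower() for k in keys) else 0)
                for s in sentences]

    # Hypothetical example:
    sents = ["Two civilians were killed by guerrillas.",
             "The economy grew last year."]
    print(label_event_sentences(sents, {"guerrillas", "two civilians"}))
    # -> [('Two civilians were killed by guerrillas.', 1),
    #     ('The economy grew last year.', 0)]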
45 In this way, we force each classifier to focus on the contexts specific to its particular event role. [sent-124, score-0.868]
46 We expect the role-specific sentence classifiers to find some secondary contexts that the event sentence classifier will miss, although some sentences may be classified as both. [sent-125, score-1.145]
47 3.2 Role Filler Extractors Our extraction model also includes a set of role filler extractors, one per event role. [sent-132, score-1.151]
48 Each extractor receives a sentence as input and determines which noun phrases (NPs) in the sentence are fillers for the event role. [sent-133, score-1.118]
49 To train an SVM classifier, noun phrases corresponding to answer key strings for the event role are positive instances. [sent-134, score-1.03]
50 We intentionally do not use sentences that contain fillers for competing event roles as negative instances because sentences often contain multiple role fillers of different types (e. [sent-136, score-1.563]
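A rough sketch of how the per-role NP training instances could be assembled, assuming noun phrases are already chunked and that an NP is positive when it matches an answer key string for the target role. Following the note above, sentences whose only fillers belong to competing roles are skipped rather than mined for negatives. All names and the exact matching scheme are illustrative assumptions.

    def np_training_instances(sentences_with_nps, answer_keys_by_role, role):
        # sentences_with_nps: list of (sentence, [noun_phrases])
        # answer_keys_by_role: dict mapping each event role to its key strings
        target = {k.lower() for k in answer_keys_by_role[role]}
        competing = {k.lower() for r, ks in answer_keys_by_role.items()
                     if r != role for k in ks}
        positives, negatives = [], []
        for sent, phrases in sentences_with_nps:
            lowered = [p.lower() for p in phrases]
            has_target = any(p in target for p in lowered)
            has_competing = any(p in competing for p in lowered)
            if has_competing and not has_target:
                continue   # only competing-role fillers here: do not mine negatives
            for p in phrases:
                (positives if p.lower() in target else negatives).append((sent, p))
        return positives, negatives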
51 In this section, we first present an analysis of the MUC-4 data set which reveals the distribution of event narratives in the corpus, and then explain how we train a classifier to automatically identify event narrative stories. [sent-151, score-1.933]
52 4.1 Manual Analysis We define an event narrative as an article whose main focus is on reporting the details of an event. [sent-153, score-0.91]
53 For the purposes of this research, we are only concerned with events that are relevant to the event extraction task (i. [sent-154, score-0.964]
54 In between these extremes is another category of documents that briefly mention a relevant event, but the event is not the focus of the article. [sent-158, score-0.993]
55 Many of the fleeting reference documents in the MUC-4 corpus are transcripts of interviews, speeches, or terrorist propaganda communiques that refer to a terrorist event and mention at least one role filler, but within a discussion about a different topic (e. [sent-160, score-1.521]
56 To gain a better understanding of how we might create a system to automatically distinguish event narrative documents from fleeting reference documents, we manually labelled the 116 relevant documents in our tuning set. [sent-163, score-1.438]
57 The first row of Table 1 shows the distribution of event narratives and fleeting references based on our “gold standard” manual annotations. [sent-167, score-1.084]
58 We see that more than half of the relevant documents (62/116) are not focused on reporting a terrorist event, even though they contain information about a terrorist event somewhere in the document. [sent-168, score-1.198]
59 4.2 Heuristics for Event Narrative Identification Our goal is to train a document classifier to automatically identify event narratives. [sent-170, score-0.918]
60 The MUC-4 answer keys reveal which documents are relevant and irrelevant with respect to the terrorism domain, but they do not tell us which relevant documents are event narratives and which are fleeting reference stories. [sent-171, score-1.655]
61 First, we noticed that event narratives tend to mention relevant information within the first several sentences, whereas fleeting reference texts usually mention relevant information only in the middle or end of the document. [sent-174, score-1.375]
62 Therefore our first heuristic requires that an event narrative mention a role filler within the first 7 sentences. [sent-175, score-1.275]
63 Second, event narratives generally have a higher density of relevant information. [sent-176, score-1.077]
64 Other documents contain a high concentration of role fillers in some parts of the document but no role fillers in other parts. [sent-179, score-1.023]
65 Figure 2 shows histograms for different values of this ratio in the event narrative (a) vs. [sent-184, score-0.966]
66 The histograms clearly show that documents with a high (> 50%) ratio are almost always event narratives. [sent-186, score-0.903]
67 We use these heuristics to label a document as an event narrative if: (1) it has a high density of relevant information, and (2) it mentions a role filler within the first 7 sentences. [sent-204, score-1.511]
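The two heuristics can be written down directly. One open detail is how the density ratio is computed; the sketch below assumes it is the fraction of sentences containing at least one answer key string, which is a plausible reading of the ratio discussed above rather than the authors' exact definition.

    def heuristic_event_narrative(sentences, answer_key_strings,
                                  window=7, density_threshold=0.5):
        keys = [k.lower() for k in answer_key_strings]
        relevant = [any(k in s.lower() for k in keys) for s in sentences]
        high_density = sum(relevant) / max(len(sentences), 1) > density_threshold  # heuristic (1)
        early_filler = any(relevant[:window])                                      # heuristic (2)
        return high_density and early_filler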
68 The heuristics correctly identify most of the event narratives and fleeting reference stories, achieving an overall accuracy of 82%. [sent-206, score-1.177]
69 These heuristic labels provide training data for an event narrative classifier. [sent-209, score-0.91]
70 4.3 Event Narrative Classifier The heuristics above use the answer keys to help determine whether a story belongs to the event narrative genre, but our goal is to create a classifier that can identify event narrative documents without the benefit of answer keys. [sent-211, score-2.236]
71 So we used the heuristics to automatically create training data for a classifier by labelling each relevant document in the training set as an event narrative or a fleeting reference document. [sent-212, score-1.386]
72 We then trained a document classifier using the 292 event narrative documents as positive instances and all irrelevant training documents as negative instances. [sent-214, score-1.29]
73 The 308 relevant documents that were not identified as event narratives were discarded to minimize noise (i. [sent-215, score-1.118]
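Assembling the training set for the document-level classifier then follows the recipe above. A minimal sketch, relying on the heuristic_event_narrative function sketched earlier; the data structures are hypothetical.

    def build_narrative_training_set(training_docs):
        # training_docs: list of (sentences, is_relevant, answer_key_strings),
        # where is_relevant is the MUC-4 relevance judgment for the document.
        positives, negatives = [], []
        for sentences, is_relevant, keys in training_docs:
            if not is_relevant:
                negatives.append(sentences)     # every irrelevant training document
            elif heuristic_event_narrative(sentences, keys):
                positives.append(sentences)     # heuristically labeled event narratives
            # relevant documents not labeled as narratives are discarded as noise
        return positives, negatives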
74 Table 2 shows the performance of the event narrative classifier on the manually labeled tuning set. [sent-219, score-1.003]
75 The classifier identified 69% of the event narratives with 63% precision. [sent-220, score-0.981]
76 However, these results should be interpreted loosely because there is not always a clear dividing line between event narratives and other documents. [sent-226, score-0.913]
77 Fortunately, it is not essential for TIER to have a perfect event narrative classifier since all documents will be processed by the event sentence recognizer anyway. [sent-228, score-1.859]
78 The recall of the event narrative classifier means that nearly 70% of the event narratives will get additional scrutiny, which should help to find additional role fillers. [sent-229, score-2.044]
79 Its precision of 63% means that some documents that are not event narratives will also get additional scrutiny, but information will be extracted only if both the role-specific sentence recognizer and NP extractors believe they have found something relevant. [sent-230, score-1.121]
80 We evaluate our system on the five MUC-4 “string-fill” event roles: perpetrator individuals, perpetrator organizations, physical targets, victims, and weapons. [sent-243, score-0.926]
81 Our results are reported as Precision/Recall/F(1)-score for each event role separately. [sent-253, score-0.892]
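For reference, the per-role scores reported here are standard precision/recall/F1 over extracted versus gold role-filler strings. A minimal sketch, treating both sides as plain string sets and ignoring the MUC-4 answer keys' alternative fillers and duplicate handling:

    def precision_recall_f1(extracted, gold):
        # extracted, gold: sets of role-filler strings for one event role
        true_positives = len(extracted & gold)
        precision = true_positives / len(extracted) if extracted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1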
82 explicitly identifies event sentences and uses patterns that have a semantic affinity for an event role to extract role fillers. [sent-263, score-1.894]
83 The EventSent row shows the performance of our Role Filler Extractors applied only to the event sentences identified by our event sentence classifier. [sent-271, score-1.578]
84 These results are similar to GLACIER’s results on most event roles, which isn’t surprising because GLACIER also incorporates event sentence identification. [sent-274, score-1.512]
85 This result is consistent with our hypothesis that many role fillers exist in role-specific contexts that are not event sentences. [sent-277, score-1.259]
86 The EventSent row reveals that information found in event sentences has the highest precision, even without relying on document classification. [sent-285, score-0.874]
87 We concluded that evidence of an event sentence is probably sufficient to warrant role filler extraction irrespective of the style of the document. [sent-286, score-1.207]
88 As we discussed in Section 4, many documents contain only a fleeting reference to an event, so it is important to be able to extract information from those isolated event descriptions as well. [sent-287, score-1.017]
89 Consequently, we created a system, EventSent+DomDoc/RoleSent, that extracts information from event sentences in all documents, but extracts information from role-specific sentences only if they appear in a domain-relevant document. [sent-288, score-0.855]
90 The last row, EventSent+ENarrDoc/RoleSent, shows the results of our final architecture which extracts information from event sentences in all documents, but extracts information from role-specific sentences only in Event Narrative documents. [sent-291, score-0.885]
91 Overall, TIER’s multi-layered extraction architecture produced higher F1 scores than previous systems on four of the five event roles. [sent-294, score-0.865]
92 The improved precision comes from our two-pronged strategy of treating event narratives differently from other documents. [sent-296, score-0.913]
93 TIER aggressively searches for extractions in event narrative stories but is conservative and extracts information only from event sentences in all other documents. [sent-297, score-1.856]
94 TIER’s role-specific sentence classifiers did correctly identify some sentences containing role fillers that were not classified as event sentences. [sent-300, score-1.335]
95 The first two sentences identify victims, but the terrorist event itself was mentioned earlier in the document. [sent-305, score-0.941]
96 The third sentence contains a perpetrator (the woman), victims (students), and weapons (hand grenades) in the context of a hostage situation after the main event (a bus attack), when the perpetrator escaped. [sent-306, score-0.96]
97 , injury or physical damage) but recognizing that the event is part of a terrorist incident depends on the larger discourse. [sent-312, score-0.866]
98 6 Conclusions We have presented a new approach to event extraction that uses three levels of analysis: document genre classification to identify event narrative stories, two types of sentence classifiers, and noun phrase classifiers. [sent-322, score-1.993]
99 Another important aspect of our approach is a two-pronged strategy that handles event narratives differently from other documents. [sent-324, score-0.913]
100 TIER aggressively hunts for role fillers in event narratives, but is conservative about extracting information from other documents. [sent-325, score-1.232]
wordName wordTfidf (topN-words)
[('event', 0.739), ('fillers', 0.27), ('narratives', 0.174), ('narrative', 0.171), ('filler', 0.163), ('role', 0.153), ('fleeting', 0.138), ('terrorist', 0.127), ('secondary', 0.112), ('documents', 0.108), ('riloff', 0.102), ('tier', 0.098), ('relevant', 0.097), ('extraction', 0.096), ('glacier', 0.084), ('document', 0.069), ('classifier', 0.068), ('density', 0.067), ('extractors', 0.066), ('classifiers', 0.064), ('freitag', 0.064), ('perpetrator', 0.064), ('patwardhan', 0.062), ('genre', 0.062), ('contexts', 0.061), ('eventsent', 0.06), ('terrorism', 0.059), ('victims', 0.059), ('extractions', 0.052), ('heuristics', 0.052), ('answer', 0.052), ('mention', 0.049), ('finn', 0.048), ('recognize', 0.047), ('affinity', 0.042), ('identify', 0.042), ('noun', 0.041), ('aggressively', 0.04), ('roles', 0.038), ('ie', 0.036), ('califf', 0.036), ('kushmerick', 0.036), ('rolespecific', 0.036), ('sundance', 0.036), ('patterns', 0.035), ('sentence', 0.034), ('row', 0.033), ('sentences', 0.033), ('chieu', 0.032), ('events', 0.032), ('reference', 0.032), ('conservative', 0.03), ('architecture', 0.03), ('histograms', 0.029), ('irrelevant', 0.029), ('negative', 0.027), ('sentential', 0.027), ('stories', 0.027), ('ratio', 0.027), ('extracts', 0.025), ('tuning', 0.025), ('message', 0.025), ('refer', 0.024), ('allsent', 0.024), ('announcements', 0.024), ('autoslog', 0.024), ('ciravegna', 0.024), ('civilian', 0.024), ('lehnert', 0.024), ('looting', 0.024), ('propaganda', 0.024), ('rifles', 0.024), ('rolesent', 0.024), ('seize', 0.024), ('pipeline', 0.024), ('artificial', 0.023), ('strings', 0.023), ('np', 0.023), ('irrespective', 0.022), ('key', 0.022), ('arbor', 0.022), ('keys', 0.022), ('swing', 0.021), ('killed', 0.021), ('speeches', 0.021), ('homes', 0.021), ('cercone', 0.021), ('dayne', 0.021), ('guerrillas', 0.021), ('perpetrators', 0.021), ('zooms', 0.021), ('gu', 0.021), ('facts', 0.02), ('create', 0.02), ('students', 0.02), ('mooney', 0.02), ('phillips', 0.02), ('appelt', 0.02), ('scrutiny', 0.02), ('arrest', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000032 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
Author: Ruihong Huang ; Ellen Riloff
Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.
2 0.63061428 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction
Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu
Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction by well using cross-entity inference. In contrast to previous inference methods, we regard entitytype consistency as key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction. 1
3 0.54074335 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
Author: Shasha Liao ; Ralph Grishman
Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1
4 0.47309569 122 acl-2011-Event Extraction as Dependency Parsing
Author: David McClosky ; Mihai Surdeanu ; Christopher Manning
Abstract: Nested event structures are a common occurrence in both open domain and domain specific extraction tasks, e.g., a “crime” event can cause a “investigation” event, which can lead to an “arrest” event. However, most current approaches address event extraction with highly local models that extract each event and argument independently. We propose a simple approach for the extraction of such structures by taking the tree of event-argument relations and using it directly as the representation in a reranking dependency parser. This provides a simple framework that captures global properties of both nested and flat event structures. We explore a rich feature space that models both the events to be parsed and context from the original supporting text. Our approach obtains competitive results in the extraction of biomedical events from the BioNLP’09 shared task with a F1 score of 53.5% in development and 48.6% in testing.
5 0.27848059 293 acl-2011-Template-Based Information Extraction without the Templates
Author: Nathanael Chambers ; Dan Jurafsky
Abstract: Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to handcreated gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
6 0.19820049 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
8 0.13297006 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
9 0.13058387 121 acl-2011-Event Discovery in Social Media Feeds
10 0.12501891 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
11 0.10648029 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
12 0.069116212 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
13 0.065027155 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
14 0.062790222 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
15 0.059162751 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
16 0.058357041 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
17 0.057888001 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
18 0.056949079 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
19 0.056621876 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
20 0.056151237 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
topicId topicWeight
[(0, 0.175), (1, 0.11), (2, -0.3), (3, 0.036), (4, 0.402), (5, 0.311), (6, -0.144), (7, -0.066), (8, 0.411), (9, 0.012), (10, -0.084), (11, 0.043), (12, 0.0), (13, 0.083), (14, 0.015), (15, 0.036), (16, 0.096), (17, 0.058), (18, -0.016), (19, 0.011), (20, 0.025), (21, -0.022), (22, -0.011), (23, 0.018), (24, -0.019), (25, 0.06), (26, -0.021), (27, 0.009), (28, 0.04), (29, 0.005), (30, 0.017), (31, 0.058), (32, -0.022), (33, -0.003), (34, -0.027), (35, -0.014), (36, -0.004), (37, -0.004), (38, -0.004), (39, -0.004), (40, -0.018), (41, -0.018), (42, -0.003), (43, 0.001), (44, 0.049), (45, -0.007), (46, -0.006), (47, 0.004), (48, 0.042), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.98022306 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
Author: Ruihong Huang ; Ellen Riloff
Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.
2 0.96271741 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction
Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu
Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction by well using cross-entity inference. In contrast to previous inference methods, we regard entitytype consistency as key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction. 1
3 0.92729294 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
Author: Shasha Liao ; Ralph Grishman
Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1
4 0.87412047 122 acl-2011-Event Extraction as Dependency Parsing
Author: David McClosky ; Mihai Surdeanu ; Christopher Manning
Abstract: Nested event structures are a common occurrence in both open domain and domain specific extraction tasks, e.g., a “crime” event can cause a “investigation” event, which can lead to an “arrest” event. However, most current approaches address event extraction with highly local models that extract each event and argument independently. We propose a simple approach for the extraction of such structures by taking the tree of event-argument relations and using it directly as the representation in a reranking dependency parser. This provides a simple framework that captures global properties of both nested and flat event structures. We explore a rich feature space that models both the events to be parsed and context from the original supporting text. Our approach obtains competitive results in the extraction of biomedical events from the BioNLP’09 shared task with a F1 score of 53.5% in development and 48.6% in testing.
5 0.63908339 293 acl-2011-Template-Based Information Extraction without the Templates
Author: Nathanael Chambers ; Dan Jurafsky
Abstract: Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to handcreated gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
6 0.5366112 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
7 0.53371817 121 acl-2011-Event Discovery in Social Media Feeds
8 0.45868617 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
9 0.41079825 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
10 0.38584924 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
11 0.27568844 291 acl-2011-SystemT: A Declarative Information Extraction System
12 0.23951215 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
13 0.23349059 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
14 0.22646487 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
15 0.21669196 68 acl-2011-Classifying arguments by scheme
16 0.21372998 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
17 0.2035754 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
18 0.20057593 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
19 0.19983464 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
20 0.19426267 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
topicId topicWeight
[(5, 0.041), (9, 0.015), (17, 0.052), (26, 0.022), (31, 0.014), (37, 0.085), (39, 0.032), (40, 0.012), (41, 0.111), (55, 0.028), (59, 0.105), (72, 0.023), (85, 0.203), (91, 0.032), (96, 0.117), (97, 0.013)]
simIndex simValue paperId paperTitle
1 0.82799757 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP
Author: Anja Belz ; Eric Kow
Abstract: Studies assessing rating scales are very common in psychology and related fields, but are rare in NLP. In this paper we assess discrete and continuous scales used for measuring quality assessments of computer-generated language. We conducted six separate experiments designed to investigate the validity, reliability, stability, interchangeability and sensitivity of discrete vs. continuous scales. We show that continuous scales are viable for use in language evaluation, and offer distinct advantages over discrete scales.
same-paper 2 0.77707422 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
Author: Ruihong Huang ; Ellen Riloff
Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.
3 0.7496773 154 acl-2011-How to train your multi bottom-up tree transducer
Author: Andreas Maletti
Abstract: The local multi bottom-up tree transducer is introduced and related to the (non-contiguous) synchronous tree sequence substitution grammar. It is then shown how to obtain a weighted local multi bottom-up tree transducer from a bilingual and biparsed corpus. Finally, the problem of non-preservation of regularity is addressed. Three properties that ensure preservation are introduced, and it is discussed how to adjust the rule extraction process such that they are automatically fulfilled.
4 0.696594 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
Author: Shasha Liao ; Ralph Grishman
Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1
5 0.69210029 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
Author: Joel Lang ; Mirella Lapata
Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.
6 0.6898675 293 acl-2011-Template-Based Information Extraction without the Templates
7 0.67752969 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
8 0.67646617 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
9 0.6764406 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
10 0.67512798 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
11 0.67063344 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
12 0.66940409 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
13 0.66722977 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
14 0.66701281 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
15 0.66399485 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation
16 0.66350204 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics
17 0.66136122 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
18 0.66119784 311 acl-2011-Translationese and Its Dialects
19 0.66075128 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
20 0.65798384 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories