acl acl2011 acl2011-40 knowledge-graph by maker-knowledge-mining

40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents


Source: pdf

Author: Gregory Brown

Abstract: Relation extraction in documents allows the detection of how entities being discussed in a document are related to one another (e.g. partof). This paper presents an analysis of a relation extraction system based on prior work but applied to the J.D. Power and Associates Sentiment Corpus to examine how the system works on documents from a range of social media. The results are examined on three different subsets of the JDPA Corpus, showing that the system performs much worse on documents from certain sources. The proposed explanation is that the features used are more appropriate to text with strong editorial standards than the informal writing style of blogs.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 An Error Analysis of Relation Extraction in Social Media Documents Gregory Ichneumon Brown University of Colorado at Boulder Boulder, Colorado browngp@colorado. [sent-1, score-0.019]

2 edu Abstract Relation extraction in documents allows the detection of how entities being discussed in a document are related to one another (e. [sent-2, score-0.263]

3 This paper presents an analysis of a relation extraction system based on prior work but applied to the J. [sent-5, score-0.179]

4 Power and Associates Sentiment Corpus to examine how the system works on documents from a range of social media. [sent-7, score-0.209]

5 The results are examined on three different subsets of the JDPA Corpus, showing that the system performs much worse on documents from certain sources. [sent-8, score-0.177]

6 The proposed explanation is that the features used are more appropriate to text with strong editorial standards than the informal writing style of blogs. [sent-9, score-0.166]

7 1 Introduction To summarize accurately, determine the sentiment, or answer questions about a document it is often nec- essary to be able to determine the relationships between entities being discussed in the document (such as part-of or member-of). [sent-10, score-0.166]

8 Determining the sentiment the author is expressing about the car requires knowing that the engine is a part of the car, so that the positive sentiment being expressed about the engine can also be attributed to the car. [sent-14, score-0.412]

9 In this paper we examine our preliminary results from applying a relation extraction system to the J.D. Power and Associates Sentiment Corpus. [sent-15, score-0.199]

10 Our system uses lexical features from prior work to classify relations, and we examine how the system works on different subsets from the JDPA Sentiment Corpus, breaking the source documents down into professionally written reviews, blog reviews, and social networking re- views. [sent-19, score-0.399]

11 These three document types represent quite different writing styles, and we see significant difference in how the relation extraction system performs on the documents from different sources. [sent-20, score-0.392]

12 The ACE Corpus (2005) is one of the most common corpora for performing relation extraction. [sent-23, score-0.119]

13 In addition to the co-reference annotations, the Corpus is annotated to indicate 23 different relations between real-world entities that are mentioned in the same sentence. [sent-24, score-0.095]

14 The documents consist of broadcast news transcripts and newswire articles from a variety of news organizations. [sent-25, score-0.252]

15 2 JDPA Sentiment Corpus The JDPA Corpus consists of 457 documents containing discussions about cars, and 180 documents discussing cameras (Kessler et al. [sent-27, score-0.31]

16 In this work we only use the automotive documents. [sent-29, score-0.046]

17 The documents are drawn from a variety of sources, and we particularly focus on the 24% of the doc- uments from the JDPA Power Steering blog, 18% from Blogspot, and 18% from LiveJournal. [sent-30, score-0.152]

18 The annotated mentions in the Corpus are single or multi-word expressions which refer to a particular real world or abstract entity. [sent-33, score-0.1]

19 The mentions are annotated to indicate sets of mentions which constitute co-reference groups referring to the same entity. [sent-34, score-0.241]

20 Five relationships are annotated between these entities: PartOf, FeatureOf, Produces, InstanceOf, and MemberOf. [sent-35, score-0.029]

21 One significant difference between these relation annotations and those in the ACE Corpus is that the former are relations between sets of mentions (the co-reference groups) rather than between individual mentions. [sent-36, score-0.292]

22 This means that these relations are not limited to being between mentions in the same sentence. [sent-37, score-0.156]

23 1, “engine” would be marked as a part of “car” in the JDPA Corpus annotations, but there would be no relation annotated in the ACE Corpus. [sent-39, score-0.119]

24 For a more direct comparison to the ACE Corpus results, we restrict ourselves only to mentions within the same sentence (we discuss this decision further in section 5. [sent-40, score-0.1]

25 1 Overview The system extracts all pairs of mentions in a sentence, and then classifies each pair of mentions as either having a relationship, having an inverse relationship, or having no relationship. [sent-43, score-0.217]

26 So for the PartOf relation in the JDPA Sentiment Corpus we consider both the relation “X is part of Y” and “Y is part of X”. [sent-44, score-0.238]
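As a rough sketch of this pairing scheme (not the paper's actual code), the snippet below enumerates the mention pairs in a sentence and assigns each one of three labels: the relation, its inverse, or none. The Mention structure and the gold-relation lookup are hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Mention:
    text: str        # surface string, e.g. "the engine"
    entity_id: int   # co-reference group the mention belongs to

def label_pairs(sentence_mentions, gold_relations, rel="PartOf"):
    """gold_relations holds (entity_x, entity_y) pairs meaning "X is <rel> of Y".
    Every mention pair in the sentence becomes one instance with a three-way label."""
    instances = []
    for m1, m2 in combinations(sentence_mentions, 2):
        if (m1.entity_id, m2.entity_id) in gold_relations:
            label = rel                  # "X is part of Y"
        elif (m2.entity_id, m1.entity_id) in gold_relations:
            label = rel + "-inverse"     # "Y is part of X"
        else:
            label = "none"
        instances.append((m1, m2, label))
    return instances

mentions = [Mention("the engine", 1), Mention("my car", 2)]
print(label_pairs(mentions, {(1, 2)}))   # "the engine" is annotated as PartOf "my car"
```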

27 The classification of each mention pair is performed using a support vector machine implemented using libLinear (Fan et al. [sent-45, score-0.094]

28 To generate the features for each of the mention pairs, a proprietary JDPA Tokenizer is used for parsing the document and the Stanford Parser (Klein and Manning, 2003) is used to generate parse trees and part-of-speech tags for the sentences in the documents. [sent-47, score-0.165]
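A minimal end-to-end sketch of this classification step, under stated substitutions: scikit-learn's LinearSVC (which wraps the same liblinear library) stands in for the libLinear setup, and a toy bag-of-words feature function stands in for the proprietary JDPA Tokenizer and Stanford Parser pipeline. The feature names and example pairs are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC   # backed by the liblinear library

def pair_features(head1, head2, tokens):
    """Tiny stand-in for the Zhou et al. (2005) lexical feature set."""
    i, j = tokens.index(head1), tokens.index(head2)
    lo, hi = min(i, j), max(i, j)
    feats = {f"head1={head1.lower()}": 1,
             f"head2={head2.lower()}": 1,
             "tokens_between": hi - lo - 1}
    for w in tokens[lo + 1:hi]:          # bag of words between the two mentions
        feats[f"between={w.lower()}"] = 1
    return feats

# (head of mention 1, head of mention 2, sentence tokens, label)
train_pairs = [
    ("engine", "car",   "the engine of my car died".split(), "PartOf"),
    ("car",    "seats", "the car has heated seats".split(),  "PartOf-inverse"),
    ("engine", "seats", "the engine and the seats".split(),  "none"),
]
X = [pair_features(h1, h2, toks) for h1, h2, toks, _ in train_pairs]
y = [label for *_, label in train_pairs]

model = make_pipeline(DictVectorizer(), LinearSVC()).fit(X, y)
print(model.predict([pair_features("wheel", "car", "the wheel of the car".split())]))
```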

29 We use the lexical features of Zhou et al. (2005) as the basis for the features of our system, similar to what other researchers have done (Chan and Roth, 2010). [sent-51, score-0.039]

30 Additional work has extended these features (Jiang and Zhai, 2007) or incorporated other data sources (e. [sent-52, score-0.022]

31 WordNet), but in this paper we focus solely on the initial step of applying these same lexical features to the JDPA Corpus. [sent-54, score-0.022]

32 The Mention Level, Overlap, Base Phrase Chunking, Dependency Tree, and Parse Tree features are the same as Zhou et al. [sent-55, score-0.022]

33 Entity Types: Some of the entity types in the JDPA Corpus indicate the type of the relation (e. [sent-59, score-0.072]

34 CarFeature, CarPart) and so we replace those entity types with “Unknown”. [sent-61, score-0.035]

35 Token Class: We added an additional feature (TC12+ET12) indicating tnhe a dTdoiktioenn Cl fleasast roef the head words (e. [sent-62, score-0.02]

36 • Semantic Information: These features are specific to the ACE relations and so are not used. [sent-65, score-0.04]

37 In Zhou et al.'s work, this set of features increases the overall F-Measure by 1. [sent-67, score-0.022]

38 1 ACE Corpus Results We ran our system on the ACE-2004 Corpus as a baseline to prove that the system worked properly and could approximately duplicate Zhou et al. [sent-70, score-0.034]

39 Using 5-fold cross validation on the newswire and broadcast news documents in the dataset we achieved an average overall F-Measure of 50. [sent-72, score-0.223]
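As a hedged sketch of the scoring convention (the extract does not spell it out), relation F-Measure is typically computed over the relation labels only, with the "no relation" class excluded from scoring; a cross-validation run would then average this figure over the five folds. The example labels below are invented.

```python
from sklearn.metrics import precision_recall_fscore_support

def overall_prf(y_true, y_pred, negative_label="none"):
    """Micro-averaged precision/recall/F over the relation labels; pairs
    predicted as 'none' count neither as true nor as false positives."""
    labels = sorted({l for l in y_true + y_pred if l != negative_label})
    return precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average="micro", zero_division=0)[:3]

gold = ["PartOf", "none",   "FeatureOf", "none", "PartOf"]
pred = ["PartOf", "PartOf", "none",      "none", "PartOf"]
p, r, f = overall_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")   # P=0.67 R=0.67 F=0.67
```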

40 Table 1 shows relation extraction results of the system on the test portion of the corpus. [sent-80, score-0.179]

41 The results are further broken out by three different source types to highlight the differences caused [sent-81, score-0.057]

42 Table 1: Relation extraction results on the JDPA Corpus test set, broken down by document source. [sent-99, score-0.114]

43 Table 2: Selected document statistics for three JDPA Corpus document sources (columns: LiveJournal, Blogspot, JDPA, ACE). [sent-100, score-0.098]

44 by the writing styles from different types of media: LiveJournal (livejournal. [sent-101, score-0.058]

45 com), a social media site where users comment and discuss stories with each other; Blogspot (blogspot. [sent-102, score-0.076]

46 com’s Power Steering blog), consisting of reviews of cars written by JDPA professional writers/analysts. [sent-104, score-0.147]

47 These subsets were selected because they provide the extreme (JDPA and LiveJournal) and average (Blogspot) results for the overall dataset. [sent-105, score-0.028]

48 5 Analysis Overall the system is not performing as well as it does on the ACE-2004 dataset. [sent-106, score-0.017]

49 However, there is a 25 point F-Measure difference between the LiveJournal and JDPA authored documents. [sent-107, score-0.117]

50 This suggests that the informal style of the LiveJournal documents may be reducing the effectiveness of the features developed by Zhou et al. [sent-108, score-0.205]

51 , which were developed on newswire and broadcast news transcript documents. [sent-109, score-0.091]

52 In the remainder of this section we look at a statistical analysis of the training portion of the JDPA Corpus, separated by document source, and suggest areas where improved features may be able to aid relation extraction on the JDPA Corpus. [sent-110, score-0.233]

53 1 Document Statistic Effects on Classifier Table 2 summarizes some important statistical differences between the documents from different sources. [sent-112, score-0.15]

54 These differences suggest two reasons why the instances being used to train the classifier could be skewed disproportionately towards the JDPA authored documents. [sent-113, score-0.189]

55 First, the JDPA written documents express a much larger number of relations between entities. [sent-114, score-0.223]

56 When training the classifier, these differences will cause a large share of the instances that have a relation to be from a JDPA written document, skewing the classifier towards any language clues specific to these documents. [sent-115, score-0.222]

57 Second, the number of mention pairs occurring within one sentence is significantly higher in the JDPA authored documents than the other documents. [sent-116, score-0.343]

58 This disparity holds even on a per-sentence or per-document basis. [sent-117, score-0.083]

59 This provides the classifier with significantly more negative examples written in the JDPA style. [sent-118, score-0.1]

60 Table 3: Top 10 phrases in mention pairs whose relation was incorrectly classified, and the total percentage of errors from the top ten (columns: phrase and mention percentage for LiveJournal, Blogspot, and JDPA). [sent-119, score-0.326]

61 2 Common Errors Table 3 shows the mention phrases that occur most commonly in the incorrectly classified mention pairs. [sent-121, score-0.276]

62 For the LiveJournal and Blogspot data, many more of the errors are due to a few specific phrases being classified incorrectly such as “car”, “Maybach”, and various forms of “it”. [sent-122, score-0.114]

63 The top four phrases constitute 17% of the errors for LiveJournal and 14% for Blogspot. [sent-123, score-0.087]

64 The JDPA documents, in contrast, have the errors spread more evenly across mention phrases, with the top 10 phrases constituting 13. [sent-124, score-0.293]
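A small sketch of how such an error breakdown can be produced; the misclassified-pair structure and the toy examples are hypothetical, and "coverage" here counts mention occurrences rather than whole errors.

```python
from collections import Counter

def top_error_phrases(misclassified_pairs, k=10):
    """misclassified_pairs: (mention1_text, mention2_text) tuples whose
    relation label was predicted incorrectly."""
    counts = Counter(phrase.lower()
                     for pair in misclassified_pairs
                     for phrase in pair)
    top = counts.most_common(k)
    coverage = sum(n for _, n in top) / sum(counts.values())
    return top, coverage

errors = [("it", "car"), ("the Maybach", "it"), ("car", "the engine")]
top, share = top_error_phrases(errors, k=4)
print(top)
print(f"{share:.0%} of error mentions come from the top phrases")
```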

65 Furthermore, the phrases causing many of the problems for the LiveJournal and Blogspot relation detection are generic nouns and pronouns such as “car” and “it”. [sent-126, score-0.233]

66 This suggests that the classifier is having difficulty determining relationships when these less descriptive words are involved. [sent-127, score-0.115]

67 Table 4: Frequency of some common words per token. [sent-130, score-0.017]

68 We find that despite all the documents discussing cars, the JDPA reviews use the word “car” much less often, and use proper nouns significantly more often. [sent-132, score-0.213]

69 Although “car” also appears in the top ten errors on the JDPA documents, the total percentage of the errors is one fifth of the error rate on the LiveJournal documents. [sent-133, score-0.07]

70 The JDPA authored documents also tend to have more multi-word mention phrases (Table 2) suggesting that the authors use more descriptive language when referring to an entity. [sent-134, score-0.461]

71 3% of the mentions in LiveJournal documents use only a single word while 61. [sent-136, score-0.232]

72 2% of mentions in JDPA authored documents are a single word. [sent-137, score-0.349]

73 Rather than descriptive noun phrases, the LiveJournal and Blogspot documents make more use of pronouns. [sent-138, score-0.188]

74 LiveJournal especially uses pronouns often, to the point of averaging one per sentence, while JDPA uses only one every five sentences. [sent-139, score-0.044]
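The document statistics discussed above (Tables 2 and 4) are simple per-document counts; a sketch of how they might be recomputed, with a whitespace tokenizer and a hand-picked pronoun list standing in for the real preprocessing:

```python
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "this", "that", "its"}

def doc_stats(sentences, mentions):
    """sentences: list of token lists; mentions: list of mention surface strings."""
    tokens = [t.lower() for sent in sentences for t in sent]
    return {
        "car_per_token": tokens.count("car") / len(tokens),
        "pronouns_per_sentence": sum(t in PRONOUNS for t in tokens) / len(sentences),
        "single_word_mention_pct": 100 * sum(len(m.split()) == 1 for m in mentions) / len(mentions),
    }

sents = [["I", "love", "it"], ["The", "car", "has", "a", "great", "engine"]]
ments = ["it", "The car", "a great engine"]
print(doc_stats(sents, ments))
```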

75 4 Extra-Sentential Relations Many relations in the JDPA Corpus occur between entities which are not mentioned in the same sentence. [sent-141, score-0.095]

76 Our system only detects relations between mentions in the same sentence, causing about 29% of entity relations to never be detected (Table 2). [sent-142, score-0.291]

77 The LiveJournal documents are more likely to contain relationships between entities that are not mentioned in the same sentence. [sent-143, score-0.2]
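A sketch of how the share of undetectable (extra-sentential) relations can be measured: a relation is recoverable by a within-sentence system only if its two entities are ever mentioned in the same sentence. The entity-to-sentence mapping below is a hypothetical stand-in for the corpus data structures.

```python
def extra_sentential_fraction(relations, sentences_of):
    """relations: (entity_x, entity_y) pairs; sentences_of: entity -> set of
    sentence ids in which that entity is mentioned."""
    missed = sum(1 for x, y in relations if not (sentences_of[x] & sentences_of[y]))
    return missed / len(relations)

sentences_of = {"car": {0, 2}, "engine": {1}, "seats": {2}}
relations = [("engine", "car"), ("seats", "car")]
print(extra_sentential_fraction(relations, sentences_of))   # 0.5: one relation never co-occurs in a sentence
```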

78 Improvements in entity relation extraction could likely be made by extending Zhou et al. [sent-145, score-0.197]

79 6 Conclusion The above analysis shows that at least some of the reason for the system performing worse on the JDPA Corpus than on the ACE-2004 Corpus is that many of the documents in the JDPA Corpus have a different writing style from the news articles in the ACE Corpus. [sent-147, score-0.237]

80 Both the ACE news documents and the JDPA authored documents are written by professional writers with stronger editorial standards than the other JDPA Corpus documents, and the relation extraction system performs much better on professionally edited documents. [sent-148, score-0.616]

81 The heavy use of pronouns and less descriptive mention phrases in the other documents seems to be one cause of the reduction in relation extraction performance. [sent-149, score-0.531]

82 There is also some evidence that, because of the greater number of relations in the JDPA authored documents, the classifier training data could be skewed more towards those documents. [sent-150, score-0.323]

83 Future work needs to explore features that can address the differences in language usage across these different types of authors. [sent-151, score-0.022]

84 This work also does not address whether the relation extraction task is being negatively impacted by poor tokenization or parsing of the documents, rather than by the relation classification itself. [sent-152, score-0.45]

85 Further work is also needed to classify extra-sentential relations, as the current methods look only at relations occurring within a single sentence, thus ignoring a large percentage of relations between entities. [sent-153, score-0.13]

86 A systematic exploration of the feature space for relation extraction. [sent-185, score-0.119]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('jdpa', 0.853), ('livejournal', 0.225), ('documents', 0.132), ('blogspot', 0.122), ('relation', 0.119), ('authored', 0.117), ('ace', 0.101), ('mentions', 0.1), ('car', 0.097), ('mention', 0.094), ('zhou', 0.086), ('sentiment', 0.082), ('kessler', 0.081), ('partof', 0.069), ('descriptive', 0.056), ('relations', 0.056), ('document', 0.049), ('cars', 0.048), ('automotive', 0.046), ('steering', 0.046), ('extraction', 0.043), ('phrases', 0.041), ('social', 0.04), ('entities', 0.039), ('editorial', 0.037), ('nicolov', 0.037), ('broadcast', 0.037), ('power', 0.037), ('media', 0.036), ('reviews', 0.036), ('chai', 0.035), ('professionally', 0.035), ('blog', 0.035), ('entity', 0.035), ('written', 0.035), ('writing', 0.032), ('srl', 0.031), ('gerber', 0.031), ('classifier', 0.03), ('associates', 0.03), ('tokenizer', 0.03), ('relationships', 0.029), ('liblinear', 0.029), ('news', 0.029), ('subsets', 0.028), ('incorrectly', 0.028), ('professional', 0.028), ('corpus', 0.027), ('causing', 0.027), ('engine', 0.027), ('pronouns', 0.027), ('style', 0.027), ('discussing', 0.026), ('styles', 0.026), ('errors', 0.026), ('chan', 0.026), ('newswire', 0.025), ('standards', 0.024), ('skewed', 0.024), ('informal', 0.024), ('fan', 0.023), ('broken', 0.022), ('features', 0.022), ('zhai', 0.021), ('referring', 0.021), ('colorado', 0.02), ('blogging', 0.02), ('cameras', 0.02), ('honorific', 0.02), ('impacted', 0.02), ('nielsen', 0.02), ('ofrelations', 0.02), ('rodney', 0.02), ('roef', 0.02), ('skewing', 0.02), ('uments', 0.02), ('constitute', 0.02), ('boulder', 0.02), ('examine', 0.02), ('classified', 0.019), ('nouns', 0.019), ('ations', 0.019), ('duction', 0.019), ('headden', 0.019), ('orado', 0.019), ('toyf', 0.019), ('roth', 0.018), ('jiang', 0.018), ('differences', 0.018), ('percentage', 0.018), ('jim', 0.018), ('tohne', 0.018), ('networking', 0.018), ('ilove', 0.018), ('tco', 0.018), ('system', 0.017), ('per', 0.017), ('annotations', 0.017), ('caused', 0.017), ('mitchell', 0.017)]
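The "similar papers" lists that follow are standard vector-space rankings; a minimal sketch of tfidf-based paper similarity, assuming plain-text titles or abstracts as input (this illustrates the general technique, not the exact pipeline that produced this page):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = {
    "acl2011-40":  "an error analysis of relation extraction in social media documents",
    "acl2011-277": "semi-supervised relation extraction with large-scale word clustering",
    "acl2011-204": "learning word vectors for sentiment analysis",
}

ids = list(papers)
tfidf = TfidfVectorizer().fit_transform(papers.values())
sims = cosine_similarity(tfidf[0], tfidf).ravel()   # similarity of this paper to every paper

for pid, score in sorted(zip(ids, sims), key=lambda x: -x[1]):
    print(f"{score:.3f}  {pid}")
```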

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

Author: Gregory Brown

Abstract: Relation extraction in documents allows the detection of how entities being discussed in a document are related to one another (e.g. partof). This paper presents an analysis of a relation extraction system based on prior work but applied to the J.D. Power and Associates Sentiment Corpus to examine how the system works on documents from a range of social media. The results are examined on three different subsets of the JDPA Corpus, showing that the system performs much worse on documents from certain sources. The proposed explanation is that the features used are more appropriate to text with strong editorial standards than the informal writing style of blogs.

2 0.12691125 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

Author: Ang Sun ; Ralph Grishman ; Satoshi Sekine

Abstract: We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system. 1

3 0.10409205 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

Author: Yee Seng Chan ; Dan Roth

Abstract: In this paper, we observe that there exists a second dimension to the relation extraction (RE) problem that is orthogonal to the relation type dimension. We show that most of these second dimensional structures are relatively constrained and not difficult to identify. We propose a novel algorithmic approach to RE that starts by first identifying these structures and then, within these, identifying the semantic type of the relation. In the real RE problem where relation arguments need to be identified, exploiting these structures also allows reducing pipelined propagated errors. We show that this RE framework provides significant improvement in RE performance.

4 0.10260194 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

Author: Ryan Gabbard ; Marjorie Freedman ; Ralph Weischedel

Abstract: As an alternative to requiring substantial supervised relation training data, many have explored bootstrapping relation extraction from a few seed examples. Most techniques assume that the examples are based on easily spotted anchors, e.g., names or dates. Sentences in a corpus which contain the anchors are then used to induce alternative ways of expressing the relation. We explore whether coreference can improve the learning process. That is, if the algorithm considered examples such as his sister, would accuracy be improved? With coreference, we see on average a 2-fold increase in F-Score. Despite using potentially errorful machine coreference, we see significant increase in recall on all relations. Precision increases in four cases and decreases in six.

5 0.092016958 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

Author: Harr Chen ; Edward Benson ; Tahira Naseem ; Regina Barzilay

Abstract: We present a novel approach to discovering relations and their instantiations from a collection of documents in a single domain. Our approach learns relation types by exploiting meta-constraints that characterize the general qualities of a good relation in any domain. These constraints state that instances of a single relation should exhibit regularities at multiple levels of linguistic structure, including lexicography, syntax, and document-level context. We capture these regularities via the structure of our probabilistic model as well as a set of declaratively-specified constraints enforced during posterior inference. Across two domains our approach successfully recovers hidden relation structure, comparable to or outperforming previous state-of-the-art approaches. Furthermore, we find that a small , set of constraints is applicable across the domains, and that using domain-specific constraints can further improve performance. 1

6 0.086001009 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

7 0.081988238 204 acl-2011-Learning Word Vectors for Sentiment Analysis

8 0.075971924 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

9 0.072139755 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

10 0.067897953 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

11 0.067754611 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

12 0.065721333 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

13 0.063381463 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

14 0.060891535 23 acl-2011-A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models

15 0.060680397 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

16 0.058252051 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis

17 0.057933029 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

18 0.057850808 292 acl-2011-Target-dependent Twitter Sentiment Classification

19 0.05718847 117 acl-2011-Entity Set Expansion using Topic information

20 0.050804619 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.12), (1, 0.1), (2, -0.043), (3, -0.001), (4, 0.086), (5, 0.031), (6, -0.004), (7, -0.035), (8, -0.082), (9, 0.001), (10, 0.025), (11, -0.012), (12, -0.042), (13, -0.045), (14, 0.017), (15, 0.02), (16, -0.012), (17, -0.073), (18, 0.015), (19, 0.025), (20, -0.046), (21, 0.01), (22, 0.022), (23, -0.017), (24, -0.035), (25, -0.015), (26, 0.073), (27, 0.018), (28, 0.073), (29, -0.035), (30, -0.067), (31, -0.004), (32, 0.018), (33, -0.024), (34, -0.02), (35, -0.012), (36, -0.075), (37, 0.005), (38, 0.009), (39, 0.1), (40, -0.082), (41, 0.071), (42, 0.057), (43, -0.014), (44, -0.013), (45, 0.003), (46, 0.018), (47, -0.024), (48, -0.01), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93736577 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

Author: Gregory Brown

Abstract: Relation extraction in documents allows the detection of how entities being discussed in a document are related to one another (e.g. partof). This paper presents an analysis of a relation extraction system based on prior work but applied to the J.D. Power and Associates Sentiment Corpus to examine how the system works on documents from a range of social media. The results are examined on three different subsets of the JDPA Corpus, showing that the system performs much worse on documents from certain sources. The proposed explanation is that the features used are more appropriate to text with strong editorial standards than the informal writing style of blogs.

2 0.76123166 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

Author: Ang Sun ; Ralph Grishman ; Satoshi Sekine

Abstract: We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system. 1

3 0.71079218 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

Author: Yee Seng Chan ; Dan Roth

Abstract: In this paper, we observe that there exists a second dimension to the relation extraction (RE) problem that is orthogonal to the relation type dimension. We show that most of these second dimensional structures are relatively constrained and not difficult to identify. We propose a novel algorithmic approach to RE that starts by first identifying these structures and then, within these, identifying the semantic type of the relation. In the real RE problem where relation arguments need to be identified, exploiting these structures also allows reducing pipelined propagated errors. We show that this RE framework provides significant improvement in RE performance.

4 0.68609405 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

Author: Ryan Gabbard ; Marjorie Freedman ; Ralph Weischedel

Abstract: As an alternative to requiring substantial supervised relation training data, many have explored bootstrapping relation extraction from a few seed examples. Most techniques assume that the examples are based on easily spotted anchors, e.g., names or dates. Sentences in a corpus which contain the anchors are then used to induce alternative ways of expressing the relation. We explore whether coreference can improve the learning process. That is, if the algorithm considered examples such as his sister, would accuracy be improved? With coreference, we see on average a 2-fold increase in F-Score. Despite using potentially errorful machine coreference, we see significant increase in recall on all relations. Precision increases in four cases and decreases in six.

5 0.67919821 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

Author: Truc Vien T. Nguyen ; Alessandro Moschitti

Abstract: In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.

6 0.64333081 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

7 0.62586761 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons

8 0.6195153 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

9 0.51728523 322 acl-2011-Unsupervised Learning of Semantic Relation Composition

10 0.50893986 293 acl-2011-Template-Based Information Extraction without the Templates

11 0.48155177 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

12 0.47820118 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

13 0.45657688 291 acl-2011-SystemT: A Declarative Information Extraction System

14 0.44516647 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations

15 0.43749043 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

16 0.42057586 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

17 0.41910291 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages

18 0.41072863 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis

19 0.40217996 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

20 0.39707294 133 acl-2011-Extracting Social Power Relationships from Natural Language


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.03), (17, 0.042), (26, 0.026), (31, 0.013), (35, 0.239), (37, 0.09), (39, 0.043), (41, 0.109), (55, 0.02), (59, 0.059), (72, 0.048), (91, 0.038), (96, 0.129)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78526986 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

Author: Gregory Brown

Abstract: Relation extraction in documents allows the detection of how entities being discussed in a document are related to one another (e.g. partof). This paper presents an analysis of a relation extraction system based on prior work but applied to the J.D. Power and Associates Sentiment Corpus to examine how the system works on documents from a range of social media. The results are examined on three different subsets of the JDPA Corpus, showing that the system performs much worse on documents from certain sources. The proposed explanation is that the features used are more appropriate to text with strong editorial standards than the informal writing style of blogs.

2 0.75676715 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages

Author: Cheng-Te Li ; Chien-Yuan Wang ; Chien-Lin Tseng ; Shou-De Lin

Abstract: Micro-blogging services provide platforms for users to share their feelings and ideas on the move. In this paper, we present a search-based demonstration system, called MemeTube, to summarize the sentiments of microblog messages in an audiovisual manner. MemeTube provides three main functions: (1) recognizing the sentiments of messages (2) generating music melody automatically based on detected sentiments, and (3) produce an animation of real-time piano playing for audiovisual display. Our MemeTube system can be accessed via: http://mslab.csie.ntu.edu.tw/memetube/ .

3 0.70329833 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

Author: Charles Greenbacker

Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.

4 0.6572063 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

Author: Shasha Liao ; Ralph Grishman

Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1

5 0.64873272 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

6 0.6474489 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

7 0.64613557 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

8 0.6412549 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

9 0.64037365 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

10 0.637429 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

11 0.6350584 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

12 0.6347872 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

13 0.63444251 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

14 0.63301671 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

15 0.63106722 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

16 0.63092136 311 acl-2011-Translationese and Its Dialects

17 0.63061047 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

18 0.63054574 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

19 0.63032687 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

20 0.63003093 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition