acl / acl2012 / acl2012-169 (knowledge-graph by maker-knowledge-mining)

169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction


Source: pdf

Author: Shingo Takamatsu ; Issei Sato ; Hiroshi Nakagawa

Abstract: In relation extraction, distant supervision seeks to extract relations between entities from text by using a knowledge base, such as Freebase, as a source of supervision. When a sentence and a knowledge base refer to the same entity pair, this approach heuristically labels the sentence with the corresponding relation in the knowledge base. However, this heuristic can fail with the result that some sentences are labeled wrongly. This noisy labeled data causes poor extraction performance. In this paper, we propose a method to reduce the number of wrong labels. We present a novel generative model that directly models the heuristic labeling process of distant supervision. The model predicts whether assigned labels are correct or wrong via its hidden variables. Our experimental results show that this model detected wrong labels with higher performance than baseline methods. In the experiment, we also found that our wrong label reduction boosted the performance of relation extraction.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In relation extraction, distant supervision seeks to extract relations between entities from text by using a knowledge base, such as Freebase, as a source of supervision. [sent-4, score-1.253]

2 When a sentence and a knowledge base refer to the same entity pair, this approach heuristically labels the sentence with the corresponding relation in the knowledge base. [sent-5, score-0.917]

3 However, this heuristic can fail with the result that some sentences are labeled wrongly. [sent-6, score-0.331]

4 In this paper, we propose a method to reduce the number of wrong labels. [sent-8, score-0.398]

5 We present a novel generative model that directly models the heuristic labeling process of distant supervision. [sent-9, score-0.63]

6 The model predicts whether assigned labels are correct or wrong via its hidden variables. [sent-10, score-0.683]

7 Our experimental results show that this model detected wrong labels with higher performance than baseline methods. [sent-11, score-0.63]

8 In the experiment, we also found that our wrong label reduction boosted the performance of relation extraction. [sent-12, score-0.73]

9 1 Introduction Machine learning approaches have been developed to address relation extraction, which is the task of extracting semantic relations between entities expressed in text. [sent-13, score-0.565]

10 Supervised approaches are limited in scalability because labeled data is expensive to produce. [sent-14, score-0.336]

11 A particularly attractive approach, called distant supervision (DS), creates labeled data by heuristically aligning entities in text with those in a knowledge base, such as Freebase (Mintz et al. [sent-15, score-1.479]
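The per-sentence scores above come from a tfidf model. A minimal pure-Python sketch of how such scores might be produced, by summing each sentence's tf-idf term weights over the sentence collection (function and variable names are illustrative, not the site's actual pipeline):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document {term: tf-idf weight} dicts (docs: list of token lists)."""
    n = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

def score_sentences(sentences):
    """Rank sentences by the sum of their terms' tf-idf weights."""
    w = tfidf_weights(sentences)
    scores = [sum(ws.values()) for ws in w]
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return order, scores

sents = [
    "distant supervision seeks to extract relations".split(),
    "the model predicts whether labels are correct".split(),
    "distant supervision labels sentences heuristically".split(),
]
order, scores = score_sentences(sents)  # order[0] = highest-scoring sentence
```

Terms that occur in every sentence get an idf of zero, so sentences rich in distinctive terms rise to the top, which matches the kind of ranking shown above.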


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('distant', 0.355), ('shingo', 0.329), ('wrong', 0.325), ('tokyo', 0.252), ('supervision', 0.246), ('heuristically', 0.231), ('freebase', 0.22), ('jp', 0.183), ('relation', 0.156), ('hongo', 0.143), ('sato', 0.143), ('labels', 0.132), ('nakagawa', 0.131), ('seeks', 0.131), ('boosted', 0.131), ('okyo', 0.131), ('base', 0.127), ('entities', 0.125), ('mintz', 0.122), ('labeled', 0.116), ('hiroshi', 0.115), ('laboratories', 0.11), ('extraction', 0.106), ('heuristic', 0.098), ('corporation', 0.097), ('dl', 0.091), ('attractive', 0.089), ('scalability', 0.086), ('fail', 0.086), ('ds', 0.078), ('creates', 0.075), ('detected', 0.075), ('aligning', 0.075), ('causes', 0.073), ('poor', 0.067), ('predicts', 0.067), ('reducing', 0.064), ('relations', 0.063), ('center', 0.062), ('expensive', 0.061), ('su', 0.058), ('com', 0.058), ('knowledge', 0.055), ('reduction', 0.054), ('hidden', 0.048), ('particularly', 0.048), ('noisy', 0.045), ('developed', 0.041), ('reduce', 0.041), ('expressed', 0.041), ('entity', 0.04), ('labeling', 0.04), ('generative', 0.038), ('called', 0.038), ('refer', 0.037), ('approaches', 0.036), ('extracting', 0.034), ('supervised', 0.034), ('address', 0.034), ('assigned', 0.033), ('technologies', 0.03), ('limited', 0.029), ('novel', 0.029), ('sentence', 0.028), ('label', 0.027), ('extract', 0.027), ('pair', 0.024), ('correct', 0.023), ('experimental', 0.022), ('via', 0.022), ('directly', 0.022), ('baseline', 0.021), ('technology', 0.021), ('found', 0.02), ('whether', 0.019), ('semantic', 0.019), ('source', 0.019), ('propose', 0.019), ('higher', 0.019), ('text', 0.018), ('performance', 0.017), ('result', 0.016), ('sentences', 0.015), ('model', 0.014), ('corresponding', 0.014), ('approach', 0.014), ('process', 0.014), ('present', 0.011), ('task', 0.01), ('models', 0.009), ('method', 0.009), ('system', 0.008), ('data', 0.008), ('machine', 0.008), ('university', 0.006), ('learning', 0.006), ('show', 0.005), ('paper', 0.004), ('number', 0.004), ('information', 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction


2 0.39663377 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

Author: Ce Zhang ; Feng Niu ; Christopher Re ; Jude Shavlik

Abstract: Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.

3 0.27539513 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

Author: Enrique Alfonseca ; Katja Filippova ; Jean-Yves Delort ; Guillermo Garrido

Abstract: We describe the use of a hierarchical topic model for automatically identifying syntactic and lexical patterns that explicitly state ontological relations. We leverage distant supervision using relations from the knowledge base FreeBase, but do not require any manual heuristic nor manual seed list selections. Results show that the learned patterns can be used to extract new relations with good precision.

4 0.13362168 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graph-based document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

5 0.11300029 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: To discover relation types from text, most methods cluster shallow or syntactic patterns of relation mentions, but consider only one possible sense per pattern. In practice this assumption is often violated. In this paper we overcome this issue by inducing clusters of pattern senses from feature representations of patterns. In particular, we employ a topic model to partition entity pairs associated with patterns into sense clusters using local and global features. We merge these sense clusters into semantic relations using hierarchical agglomerative clustering. We compare against several baselines: a generative latent-variable model, a clustering method that does not disambiguate between path senses, and our own approach but with only local features. Experimental results show our proposed approach discovers dramatically more accurate clusters than models without sense disambiguation, and that incorporating global features, such as the document theme, is crucial.

6 0.076213293 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction

7 0.074990883 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

8 0.058468461 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

9 0.05609357 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

10 0.055105239 73 acl-2012-Discriminative Learning for Joint Template Filling

11 0.050940618 42 acl-2012-Bootstrapping via Graph Propagation

12 0.042678479 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

13 0.041302089 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums

14 0.041235011 153 acl-2012-Named Entity Disambiguation in Streaming Data

15 0.039804995 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction

16 0.039309494 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures

17 0.036851294 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

18 0.035293169 140 acl-2012-Machine Translation without Words through Substring Alignment

19 0.034681037 176 acl-2012-Sentence Compression with Semantic Role Constraints

20 0.03434537 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction
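The simValue numbers in the list above are, in the typical setup, cosine similarities between papers' tf-idf vectors. A hedged pure-Python sketch (the term weights reuse values from the word list above; the second document and all names are illustrative, not this site's actual data or code):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_a = {"distant": 0.355, "supervision": 0.246, "freebase": 0.22}   # this paper
doc_b = {"distant": 0.30, "crowd": 0.41, "supervision": 0.20}        # hypothetical neighbor
sim = cosine(doc_a, doc_b)  # shared high-weight terms drive the score
```

Only terms present in both vectors contribute to the dot product, which is why papers sharing distinctive vocabulary ("distant", "supervision") rank as most similar.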


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.115), (1, 0.109), (2, -0.049), (3, 0.152), (4, 0.064), (5, 0.016), (6, -0.132), (7, 0.012), (8, -0.007), (9, -0.081), (10, 0.258), (11, -0.036), (12, -0.166), (13, -0.127), (14, 0.084), (15, 0.115), (16, -0.247), (17, -0.227), (18, 0.179), (19, 0.007), (20, 0.118), (21, -0.095), (22, -0.031), (23, -0.039), (24, -0.032), (25, -0.013), (26, 0.061), (27, 0.048), (28, 0.112), (29, -0.109), (30, 0.224), (31, -0.112), (32, -0.117), (33, -0.056), (34, 0.133), (35, -0.04), (36, -0.011), (37, -0.106), (38, -0.031), (39, -0.073), (40, 0.034), (41, -0.004), (42, 0.039), (43, -0.101), (44, 0.047), (45, 0.007), (46, 0.087), (47, -0.023), (48, 0.042), (49, -0.052)]
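The signed weights above are coordinates in a latent semantic space, which is why, unlike the tfidf and lda weights, they can be negative. In the standard LSI formulation (a textbook sketch, not necessarily this site's exact implementation), they come from a rank-k truncated SVD of the term-document tf-idf matrix X:

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

Here the columns of V_k give each document's k-dimensional representation, and a new document's tf-idf vector d is folded into the same space as \hat{d}; similarity is then measured between these low-rank vectors.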

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97269452 169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction


2 0.90435779 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places


3 0.72602493 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model


4 0.36202115 191 acl-2012-Temporally Anchored Relation Extraction


5 0.36112484 129 acl-2012-Learning High-Level Planning from Text

Author: S.R.K. Branavan ; Nate Kushman ; Tao Lei ; Regina Barzilay

Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. The challenge of modeling this connection is to ground language at the level of relations. This type of grounding enables us to create high-level plans based on language abstractions. Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. We implement this idea in the reinforcement learning framework using feedback automatically obtained from plan execution attempts. When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline’s 65%. Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text-unaware baseline, successfully completing 80% of planning tasks as compared to 69% for the baseline.

6 0.34607425 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

7 0.30350864 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs

8 0.29880351 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction

9 0.29025966 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

10 0.28022769 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy

11 0.26237571 73 acl-2012-Discriminative Learning for Joint Template Filling

12 0.23714115 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction

13 0.17419235 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums

14 0.1704285 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

15 0.16911404 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

16 0.1682276 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

17 0.16151616 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions

18 0.15602262 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction

19 0.15482612 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

20 0.15168992 153 acl-2012-Named Entity Disambiguation in Streaming Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(90, 0.056), (92, 0.014), (99, 0.791)]
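The (topicId, topicWeight) pairs above are this paper's sparse LDA topic distribution; papers can then be compared with a distance suited to probability distributions, such as the Hellinger distance (an illustrative sketch: the second paper and all names are hypothetical, and the site may well use a different metric, e.g. cosine):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two sparse topic distributions,
    given as lists of (topic_id, weight) pairs; absent topics count as 0."""
    dp, dq = dict(p), dict(q)
    keys = set(dp) | set(dq)
    s = sum((math.sqrt(dp.get(t, 0.0)) - math.sqrt(dq.get(t, 0.0))) ** 2 for t in keys)
    return math.sqrt(s / 2.0)

this_paper = [(90, 0.056), (92, 0.014), (99, 0.791)]   # topic weights from the lda list above
other_paper = [(90, 0.020), (99, 0.700), (12, 0.100)]  # hypothetical second paper
d = hellinger(this_paper, other_paper)  # 0 = identical distributions, 1 = disjoint support
```

Because topic 99 dominates both distributions in this toy example, the distance stays small; a paper concentrated on entirely different topics would score near 1.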

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97272933 169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction


2 0.94951081 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

3 0.89696938 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

Author: Elena Cabrio ; Serena Villata

Abstract: Blogs and forums are widely adopted by online communities to debate about various issues. However, a user that wants to cut in on a debate may experience some difficulties in extracting the current accepted positions, and can be discouraged from interacting through these applications. In our paper, we combine textual entailment with argumentation theory to automatically extract the arguments from debates and to evaluate their acceptability.

4 0.89409572 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development

Author: Rafael E. Banchs

Abstract: This paper describes Movie-DiC, a Movie Dialogue Corpus recently collected for research and development purposes. The collected dataset comprises 132,229 dialogues containing a total of 764,146 turns that have been extracted from 753 movies. Details on how the data collection has been created and how it is structured are provided along with its main statistics and characteristics.

5 0.85159695 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

6 0.83016437 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

7 0.50513983 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

8 0.49942663 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

9 0.49343613 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

10 0.44771913 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

11 0.44764549 191 acl-2012-Temporally Anchored Relation Extraction

12 0.44020492 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

13 0.43469188 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP

14 0.42264563 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

15 0.42011639 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

16 0.4134582 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

17 0.39421862 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

18 0.39143437 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

19 0.38614297 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

20 0.38108233 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction