acl acl2011 acl2011-213 knowledge-graph by maker-knowledge-mining

213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia


Source: pdf

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. [sent-5, score-0.228]

2 The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. [sent-6, score-0.256]

3 Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. [sent-7, score-0.582]

4 In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. [sent-8, score-0.669]

5 We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat. [sent-9, score-0.765]

6 1 Introduction Wikification is the task of identifying and linking expressions in text to their referent Wikipedia pages. [sent-10, score-0.413]

7 Recently, Wikification has been shown to form a valuable component for numerous natural language processing tasks including text classification (Gabrilovich and Markovitch, 2007b; Chang et al. [sent-11, score-0.207]

, 2008), measuring semantic similarity between texts (Gabrilovich and Markovitch, 2007a), cross-document co-reference resolution (Finin et al. [sent-12, score-0.265]

9 Previous studies on Wikification differ with respect to the corpora they address and the subset of expressions they attempt to link. [sent-16, score-0.409]

10 For example, some studies focus on linking only named entities, whereas others attempt to link all “interesting” expressions, mimicking the link structure found in Wikipedia. [sent-17, score-0.753]

11 Regardless, all Wikification systems are faced with a key Disambiguation to Wikipedia (D2W) task. [sent-18, score-0.094]

12 In the D2W task, we’re given a text along with explicitly identified substrings (called mentions) to disambiguate, and the goal is to output the corresponding Wikipedia page, if any, for each mention. [sent-19, score-0.231]
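The sentence summary above ranks sentences by a tfidf score. The exact scoring function is not given in this listing; a minimal sketch, assuming each sentence is scored by the summed tf-idf weight of its terms (the `corpus_df` and `n_docs` inputs are hypothetical, for illustration only):

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences, corpus_df, n_docs):
    """Score each sentence by the summed tf-idf weight of its terms.

    corpus_df maps a term to its document frequency; n_docs is the
    corpus size. Both are assumed inputs for this sketch.
    """
    scores = []
    for sent in sentences:
        terms = sent.lower().split()
        tf = Counter(terms)
        score = sum(
            (count / len(terms)) * math.log(n_docs / (1 + corpus_df.get(t, 0)))
            for t, count in tf.items()
        )
        scores.append(score)
    return scores
```

Under this scheme, sentences dominated by rare, topical terms (e.g. "Wikification", "disambiguation") outrank sentences of common words, which matches the ordering of the extracted sentences above.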


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wikification', 0.556), ('wikipedia', 0.318), ('disambiguations', 0.278), ('disambiguation', 0.237), ('markovitch', 0.212), ('gabrilovich', 0.201), ('link', 0.133), ('linking', 0.129), ('expressions', 0.126), ('danr', 0.123), ('northwe', 0.123), ('mayfield', 0.106), ('crossdocument', 0.106), ('kulkarni', 0.101), ('mimicking', 0.101), ('rat', 0.101), ('global', 0.097), ('doug', 0.096), ('referent', 0.093), ('illinois', 0.09), ('visiting', 0.087), ('finin', 0.087), ('encyclopedia', 0.087), ('lev', 0.087), ('arrive', 0.084), ('substrings', 0.084), ('ste', 0.084), ('ratinov', 0.082), ('friends', 0.08), ('entities', 0.079), ('local', 0.074), ('mike', 0.073), ('attempt', 0.073), ('traditional', 0.072), ('disambiguating', 0.072), ('faced', 0.069), ('gmai', 0.067), ('rn', 0.065), ('studies', 0.065), ('edu', 0.061), ('disambiguate', 0.061), ('increasingly', 0.061), ('ee', 0.06), ('numerous', 0.058), ('resolution', 0.058), ('fundamental', 0.058), ('coherent', 0.056), ('mentions', 0.055), ('regardless', 0.051), ('concepts', 0.049), ('chang', 0.049), ('utilize', 0.044), ('sensitive', 0.042), ('com', 0.042), ('page', 0.041), ('measuring', 0.041), ('valuable', 0.041), ('popular', 0.04), ('differ', 0.04), ('re', 0.039), ('call', 0.039), ('provides', 0.038), ('tasks', 0.038), ('analyze', 0.037), ('approaches', 0.034), ('texts', 0.034), ('distinct', 0.034), ('others', 0.032), ('explicitly', 0.031), ('hard', 0.031), ('respect', 0.031), ('whereas', 0.03), ('named', 0.03), ('component', 0.029), ('online', 0.029), ('recently', 0.028), ('along', 0.028), ('sense', 0.028), ('identified', 0.028), ('subset', 0.027), ('called', 0.027), ('structure', 0.027), ('interesting', 0.026), ('address', 0.026), ('similarity', 0.026), ('key', 0.025), ('document', 0.025), ('dan', 0.025), ('identifying', 0.024), ('goal', 0.022), ('made', 0.021), ('text', 0.021), ('corpora', 0.021), ('classification', 0.02), ('algorithms', 0.02), ('task', 0.02), ('target', 0.018), ('previous', 0.017), ('improved', 0.017), ('output', 0.017)]
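The `simValue` scores in the lists below are plausibly cosine similarities between sparse tf-idf vectors like the (word, weight) pairs above; this listing does not state the exact measure, so the following is a sketch under that assumption:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as
    {term: weight} dicts, like the (word, tfidf) pairs above."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)
```

A paper compared against itself yields a value of 1.0 (up to floating-point error), consistent with the same-paper rows at the top of each list.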

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia


2 0.19763778 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

3 0.14255044 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

4 0.1253338 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

Author: Sameer Singh ; Amarnag Subramanya ; Fernando Pereira ; Andrew McCallum

Abstract: Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is clearly impractical to consider all possible groupings of mentions into distinct entities. To solve the problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.

5 0.12143553 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

Author: Truc Vien T. Nguyen ; Alessandro Moschitti

Abstract: In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.

6 0.11055654 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

7 0.1032926 285 acl-2011-Simple supervised document geolocation with geodesic grids

8 0.098550647 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

9 0.082242347 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

10 0.07406418 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

11 0.065718263 52 acl-2011-Automatic Labelling of Topic Models

12 0.061252873 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

13 0.060537919 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

14 0.055656519 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

15 0.051108625 224 acl-2011-Models and Training for Unsupervised Preposition Sense Disambiguation

16 0.048634965 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

17 0.04563595 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

18 0.040506884 23 acl-2011-A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models

19 0.039408803 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

20 0.039199457 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.094), (1, 0.047), (2, -0.088), (3, 0.046), (4, 0.029), (5, -0.013), (6, 0.032), (7, -0.049), (8, -0.175), (9, 0.005), (10, 0.016), (11, -0.009), (12, -0.018), (13, -0.097), (14, 0.071), (15, 0.021), (16, 0.216), (17, -0.015), (18, 0.009), (19, -0.06), (20, 0.06), (21, -0.099), (22, -0.021), (23, -0.083), (24, 0.135), (25, -0.029), (26, 0.02), (27, -0.033), (28, 0.019), (29, 0.077), (30, -0.025), (31, 0.055), (32, -0.017), (33, 0.032), (34, 0.067), (35, 0.091), (36, 0.012), (37, -0.06), (38, 0.005), (39, 0.019), (40, 0.012), (41, 0.051), (42, 0.036), (43, 0.099), (44, -0.008), (45, 0.035), (46, -0.105), (47, -0.02), (48, 0.024), (49, -0.044)]
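For the lsi and lda lists, each paper is reduced to a topic-weight vector like the (topicId, topicWeight) pairs above, and the `simValue` scores are again plausibly cosine similarities between those vectors. A sketch under that assumption, working directly on the sparse pair representation:

```python
import math

def topic_similarity(a, b):
    """Cosine similarity between two topic distributions given as
    lists of (topicId, weight) pairs, like the lsi/lda rows above."""
    da, db = dict(a), dict(b)
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    na = math.sqrt(sum(w * w for w in da.values()))
    nb = math.sqrt(sum(w * w for w in db.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)
```

Note that the same-paper lsi and lda scores below (0.978 and 0.668) fall short of 1.0, unlike the tfidf list, which suggests the generator applies some additional normalization or truncation to the topic vectors; the sketch above omits that detail.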

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97842002 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia


2 0.83536261 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

3 0.83285397 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

4 0.69136995 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

Author: Xianpei Han ; Le Sun

Abstract: Linking entities with a knowledge base (entity linking) is a key issue in bridging textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, entity linking decisions are critically dependent on the heterogeneous knowledge of entities. In this paper, we propose a generative probabilistic model, called the entity-mention model, which can leverage heterogeneous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidence from all three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods.

5 0.67553025 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Singh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.

6 0.62323922 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

7 0.52702731 285 acl-2011-Simple supervised document geolocation with geodesic grids

8 0.52259618 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

9 0.47648886 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

10 0.4398469 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

11 0.39173949 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

12 0.38320079 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

13 0.36406338 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

14 0.32046282 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

15 0.32027915 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

16 0.30817813 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

17 0.30464977 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

18 0.29357702 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

19 0.28968406 261 acl-2011-Recognizing Named Entities in Tweets

20 0.2845577 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.062), (17, 0.032), (26, 0.012), (37, 0.089), (41, 0.055), (44, 0.013), (59, 0.077), (72, 0.025), (84, 0.379), (91, 0.025), (96, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.66781926 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia


2 0.52341008 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision, and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks achieving improvements in both precision and recall measured against gold standard word alignments.

3 0.4096708 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

4 0.4083823 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

Author: Shasha Liao ; Ralph Grishman

Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference.

5 0.40770847 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to improve event extraction by learning to identify secondary role filler contexts in the absence of event keywords. We propose a multilayered event extraction architecture that progressively “zooms in” on relevant information. Our extraction model includes a document genre classifier to recognize event narratives, two types of sentence classifiers, and noun phrase classifiers to extract role fillers. These modules are organized as a pipeline to gradually zero in on event-related information. We present results on the MUC-4 event extraction data set and show that this model performs better than previous systems.

6 0.40737373 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

7 0.40644419 293 acl-2011-Template-Based Information Extraction without the Templates

8 0.40640575 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

9 0.40606636 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing

10 0.40537286 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

11 0.40460896 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition

12 0.4045307 311 acl-2011-Translationese and Its Dialects

13 0.40341288 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

14 0.4032065 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation

15 0.40197185 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

16 0.40117759 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

17 0.40057218 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics

18 0.39941007 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

19 0.39912876 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

20 0.39899749 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal