128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation


Source: pdf

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. [sent-3, score-0.798]

2 Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. [sent-4, score-1.34]

3 We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. [sent-5, score-0.492]

4 We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. [sent-6, score-0.498]

5 1 Introduction Identifying the correct real-world referents of named entities (NE) mentioned in text (such as people, organizations, and geographic locations) plays an important role in various natural language processing and information retrieval tasks. [sent-8, score-0.142]

6 The goal of Named Entity Disambiguation (NED) is to label a surface form denoting an NE in text with one of multiple predefined NEs from a knowledge base (KB), or to detect that the surface form refers to an out-of-KB entity, which is known as NIL detection. [sent-9, score-0.824]

7 NED has become a popular research field recently, as the growth of large-scale publicly available encyclopedic knowledge resources such as Wikipedia has stimulated research on linking NEs in text to their entries in these KBs (Bunescu and Pasca, 2006; McNamee and Dang, 2009). [sent-10, score-0.159]

8 The disambiguation of named entities raises several challenges: Surface forms in text can be ambiguous, and the same entity can be referred to by different surface forms. [sent-11, score-1.056]

9 For example, the surface form “George Bush” may denote either of two former U. [sent-12, score-0.367]

10 Thus, a many-to-many mapping between surface forms and entities has to be resolved. [sent-16, score-0.547]

11 In addition, entity mentions may not have a matching entity in the KB, which is often the case for nonpopular entities. [sent-17, score-0.452]

12 Typical approaches to NED combine the use of document context knowledge with entity information stored in the KB in order to disambiguate entities. [sent-18, score-0.399]

13 Many systems represent document context and KB information as word or concept vectors, and rank entities using vector space similarity metrics (Cucerzan, 2007). [sent-19, score-0.357]

14 Other authors employ supervised machine learning algorithms to classify or rank candidate entities (Bunescu and Pasca, 2006; Zhang et al. [sent-20, score-0.311]

15 Common features include popularity metrics based on Wikipedia’s graph structure or on name mention frequency (Dredze et al. [sent-22, score-0.241]

16 , 2010; Han and Zhao, 2009), similarity metrics exploring Wikipedia’s concept relations (Han and Zhao, 2009), and string similarity features. [sent-23, score-0.184]

17 While previous research has largely focused on disambiguating each entity mention in a document [sent-26, score-0.376]

Proceedings of the ACL-HLT 2011 Student Session, pages 18–23, Portland, OR, USA, 19-24 June 2011. © 2011 Association for Computational Linguistics

18 separately (McNamee and Dang, 2009), we explore an approach that is driven by the observation that entities normally co-occur in texts. [sent-28, score-0.142]

19 Documents often discuss several different entities related to each other, e.g. [sent-29, score-0.142]

20 a news article may report on a meeting of political leaders from different countries. [sent-31, score-0.08]

21 Our Contributions In this paper, we evaluate a range of novel disambiguation features that exploit the relations between NEs identified in a document and in the KB. [sent-33, score-0.404]

22 Our goal is to explore the usefulness of Wikipedia’s link structure as source of relations between entities. [sent-34, score-0.094]

23 We propose a method for candidate selection that is based on an inverted index of surface forms and entities (Section 3. [sent-35, score-0.805]

24 Instead of a bag-of-words approach we use co-occurring NEs in text for describing an ambiguous surface form. [sent-37, score-0.408]

25 We introduce several different disambiguation features that exploit the relations between entities derived from the graph structure of Wikipedia (Section 3. [sent-38, score-0.493]

26 Finally, we combine our disambiguation features and achieve state-of-the-art results with a Support Vector Machine (SVM) classifier (Section 4). [sent-40, score-0.289]

27 2 Problem statement The task of NED is to assign a surface form s found in a document d to a target NE t ∈ E(s), where E(s) ⊂ E is a set of candidate NEs from an entity KB that is defined by E = {e1, e2, ..., en}, [sent-41, score-0.703]

28 or to recognize that the found surface form s refers to a missing target entity t ∉ E(s). [sent-44, score-0.298]

29 Since the same surface form s may refer to more than one NE e, the correct target entity t has to be determined from a set of candidates E(s). Name variants: Often, name variants (e.g. [sent-47, score-0.827]

30 abbreviations, acronyms or synonyms) are used in texts to refer to the same NE, which has to be considered for the determination of candidates E(s) for a given surface form s. [sent-49, score-0.481]

31 Figure 1: Ambiguity of Wikipedia surface forms. [sent-51, score-0.322]

32 The distribution follows a power law, as many surface forms have only a single meaning (i.e. [sent-52, score-0.405]

33 refer to a single Wikipedia concept), and some surface forms are highly ambiguous, referring to very many different concepts. [sent-54, score-0.464]

34 Another challenge of NED is therefore to recognize missing NEs where t ∉ E(s), given a surface form s (NIL detection). [sent-55, score-0.367]

35 In this section we describe the construction and structure of the KB and the candidate selection scheme, followed by an overview of disambiguation features and the candidate classification algorithm. [sent-57, score-0.556]

36 1 Knowledge base construction Our approach disambiguates named entities against a KB constructed from Wikipedia. [sent-59, score-0.275]

37 To this end, we process Wikipedia to extract several types of information for each Wikipedia article describing a concept (i.e. [sent-60, score-0.14]

38 any article not being a redirect page, a disambiguation page, or any other kind of meta page). [sent-62, score-0.403]

39 We collect a set of name variants (surface forms) for each concept from article titles, redirect pages, disambiguation pages and the anchor texts of internal Wikipedia links, following Cucerzan (2007). [sent-63, score-0.553]

40 For each concept, we also collect its set of incoming and outgoing links to other Wikipedia pages. [sent-64, score-0.161]

41 We store this information in an inverted index, which allows for very efficient access and search during candidate selection and feature computation. [sent-66, score-0.264]

42 The distribution of surface forms follows a power law, where the majority of surface forms is unambiguous, but some surface forms are very ambiguous (Figure 1). [sent-67, score-1.301]

43 This suggests that for a given set of distinct surface forms found in a document, many of these will unambiguously refer to a single Wikipedia entity. [sent-68, score-0.436]

44 These entities can then be used to disambiguate surface forms referring to multiple entities. [sent-69, score-0.629]
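
The lookup described in sentences 41 to 44 can be illustrated with a minimal sketch. It assumes a toy in-memory KB; the names `TOY_KB`, `build_surface_form_index`, and `unambiguous_entities` are hypothetical and not taken from the paper, which does not specify its index implementation.

```python
from collections import defaultdict

# Toy KB (invented for illustration): concept id -> known surface forms.
TOY_KB = {
    "George_H._W._Bush": {"surface_forms": ["George Bush", "George H. W. Bush"]},
    "George_W._Bush": {"surface_forms": ["George Bush", "George W. Bush"]},
    "Barbara_Bush": {"surface_forms": ["Barbara Bush"]},
}


def build_surface_form_index(concepts):
    """Map each lower-cased surface form to the set of concept ids it may denote."""
    index = defaultdict(set)
    for concept_id, info in concepts.items():
        for form in info["surface_forms"]:
            index[form.lower()].add(concept_id)
    return index


def unambiguous_entities(index, surface_forms):
    """Return {surface form: concept id} for forms with exactly one candidate."""
    resolved = {}
    for form in surface_forms:
        candidates = index.get(form.lower(), set())
        if len(candidates) == 1:
            resolved[form] = next(iter(candidates))
    return resolved


INDEX = build_surface_form_index(TOY_KB)
RESOLVED = unambiguous_entities(INDEX, ["George Bush", "Barbara Bush"])
```

In this toy setup "Barbara Bush" resolves directly, while "George Bush" keeps two candidates and would need the disambiguation features described later.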

45 2 Candidate selection Given a surface form identified in a document, the task of the candidate selection component is to retrieve a set of candidate entities from the KB. [sent-71, score-0.863]

46 To this end, we execute a search on index fields storing article titles, redirect titles, and name variants. [sent-72, score-0.306]

47 We implement a weighted search to give high weights to exact title matches, a lesser emphasis on redirect matches, and finally a low weight for all other name variants. [sent-73, score-0.207]

48 In addition, we implement a fuzzy search on the title and redirect fields to select KB entries with approximate string similarity to the surface form. [sent-74, score-0.544]
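
A rough sketch of the weighted and fuzzy search in sentences 46 to 48 follows. The field weights, the fuzzy threshold, and the use of `difflib.SequenceMatcher` are assumptions made for illustration; the paper only states that exact title matches get high weight, redirects less, and other name variants a low weight.

```python
from difflib import SequenceMatcher

# Assumed weights: title > redirect > other name variants.
FIELD_WEIGHTS = {"title": 3.0, "redirects": 2.0, "variants": 1.0}
FUZZY_FIELDS = ("title", "redirects")  # fuzzy search only on these fields
FUZZY_THRESHOLD = 0.8  # assumed cut-off for approximate string similarity


def field_score(query, values, allow_fuzzy):
    """Return 1.0 for an exact match, else the best fuzzy ratio above the threshold."""
    best = 0.0
    for value in values:
        if value.lower() == query.lower():
            return 1.0
        if allow_fuzzy:
            ratio = SequenceMatcher(None, query.lower(), value.lower()).ratio()
            if ratio >= FUZZY_THRESHOLD:
                best = max(best, ratio)
    return best


def select_candidates(query, kb):
    """Rank KB entries by the weighted sum of their per-field match scores."""
    scored = []
    for entity_id, fields in kb.items():
        score = sum(
            weight * field_score(query, fields.get(name, []), name in FUZZY_FIELDS)
            for name, weight in FIELD_WEIGHTS.items()
        )
        if score > 0:
            scored.append((entity_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy KB entries (invented for illustration).
KB = {
    "George_W._Bush": {
        "title": ["George W. Bush"],
        "redirects": ["George Bush Jr."],
        "variants": ["George Bush", "Bush"],
    },
    "George_H._W._Bush": {
        "title": ["George H. W. Bush"],
        "redirects": ["George Bush"],
        "variants": ["Bush"],
    },
    "Kate_Bush": {"title": ["Kate Bush"], "redirects": [], "variants": ["Bush"]},
}
RANKED = select_candidates("George Bush", KB)
```

The score returned for each candidate plays the role of the candidate selection score (CS) that the paper later reuses as a feature.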

49 3 Disambiguation features In this section, we describe the features that we use in our disambiguation approach. [sent-76, score-0.28]

50 Entity Context (EC) The EC disambiguation feature is calculated as the cosine similarity between the document context d of a surface form s and the Wikipedia article c of each candidate c ∈ E(s). [sent-77, score-0.964]

51 If a surface form is ambiguous, we choose the most popular NE with the popularity metric described below. [sent-81, score-0.445]

52 Analogously, we represent each c as a vector of the incoming and outgoing URIs found on its Wikipedia page. [sent-82, score-0.132]
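
The EC feature can be sketched as a cosine similarity between two sparse URI vectors. This is a minimal illustration that assumes raw URI counts as vector weights; the paper extract does not state the weighting scheme, and the function names and example URIs are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {key: weight} mappings."""
    dot = sum(weight * vec_b.get(key, 0) for key, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def entity_context_score(context_uris, candidate_link_uris):
    """EC-style score: document entity context vs. a candidate's link URIs."""
    return cosine_similarity(Counter(context_uris), Counter(candidate_link_uris))


# Entities resolved in the document vs. links on one candidate's page (toy data).
EC_SCORE = entity_context_score(
    ["Dick_Cheney", "Iraq_War"],
    ["Dick_Cheney", "Iraq_War", "Texas", "Laura_Bush"],
)
```

With two of four candidate links shared with the two-entity context, the score is 2 / (sqrt(2) * 2), roughly 0.707.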

53 Link Context (LC) The link context feature is an extension of the EC feature. [sent-83, score-0.103]

54 Since our observations have shown that the entity context can be very small and consequently the overlap between d and c may be very low, we extend d by all incoming (LC-in) or by all incoming and outgoing (LC-all) Wikipedia URIs of the NEs from the entity context. [sent-84, score-0.654]

55 We assume that Wikipedia pages that refer to other Wikipedia pages contain information on the referenced pages or at least are thematically related to these pages. [sent-85, score-0.081]

56 Candidate Rank (CR) The features described so far disambiguate every surface form s ∈ S from a document d separately, whereas our Candidate Rank feature aims to disambiguate all surface forms S found in a document d at once. [sent-87, score-1.081]

57 We represent d as a graph D = (E(S), L(E(S))) where the nodes E(S) = ∪s∈S E(s) are all candidates of all surface forms in the document and L(E(S)) is the set of links between the candidates, as found in Wikipedia. [sent-88, score-0.547]

58 Then, we compute the PageRank score (Brin and Page, 1998) of all c ∈ E(S) and choose for each s the candidate with the highest PageRank score in the document graph D. [sent-89, score-0.167]
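
The Candidate Rank idea can be sketched with a plain power-iteration PageRank over the candidate graph. The damping factor of 0.85, the fixed iteration count, and the uniform handling of dangling nodes are assumptions, not settings reported in the paper; the function names and the toy graph are hypothetical.

```python
def pagerank(nodes, links, damping=0.85, iterations=50):
    """Power-iteration PageRank with uniform teleport; returns {node: score}."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in links:
        out_links[src].append(dst)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def candidate_rank(candidates_by_form, wikipedia_links):
    """Pick, for every surface form, its candidate with the highest PageRank."""
    nodes = sorted({c for cands in candidates_by_form.values() for c in cands})
    node_set = set(nodes)
    # Keep only links whose endpoints are both candidates in this document.
    links = [(s, d) for s, d in wikipedia_links if s in node_set and d in node_set]
    scores = pagerank(nodes, links)
    choice = {
        form: max(cands, key=lambda c: scores[c])
        for form, cands in candidates_by_form.items()
    }
    return choice, scores


CANDIDATES = {"Bush": ["George_W._Bush", "Kate_Bush"], "Cheney": ["Dick_Cheney"]}
LINKS = [("George_W._Bush", "Dick_Cheney"), ("Dick_Cheney", "George_W._Bush")]
CHOICE, SCORES = candidate_rank(CANDIDATES, LINKS)
```

Candidates that are not linked to any other candidate end up with equal scores, which matches the tie problem the paper reports for this feature.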

59 Standard Features In addition to the previously described features we also implement a set of commonly accepted features. [sent-90, score-0.066]

60 These include a feature based on the cosine similarity between word vector representations of the document and the Wikipedia article of each candidate (BOW) (Bunescu, 2007). [sent-91, score-0.393]

61 Another standard feature we use is the popularity of a surface form (SFP). [sent-94, score-0.498]

62 We calculate how often a surface form s references a candidate c ∈ E(s) in relation to the total number of mentions of s in Wikipedia (Han and Zhao, 2009). [sent-95, score-0.504]
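
The SFP feature is a simple relative frequency. The sketch below assumes anchor-text counts are available as a nested dictionary; the counts shown are invented for illustration and are not Wikipedia statistics, and the function name is hypothetical.

```python
def surface_form_popularity(anchor_counts, surface_form):
    """Return {candidate: share of the surface form's mentions that link to it}."""
    counts = anchor_counts.get(surface_form, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: count / total for entity, count in counts.items()}


# Invented counts: how often the anchor text links to each article.
ANCHOR_COUNTS = {
    "George Bush": {"George_W._Bush": 60, "George_H._W._Bush": 40},
}
SFP = surface_form_popularity(ANCHOR_COUNTS, "George Bush")
```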

63 Since we use an index for selecting candidates (Section 3. [sent-96, score-0.13]

64 2), we also exploit the candidate selection score (CS) returned for each candidate as a disambiguation feature. [sent-97, score-0.553]

65 4 Candidate classifier and NIL detection We cast NED as a supervised classification task and use two binary SVM classifiers (Vapnik, 1995). [sent-99, score-0.075]

66 The first classifier decides for each candidate c ∈ E(s) if it corresponds to the target entity. [sent-100, score-0.184]

67 For training the classifier we label as a positive example at most one x(c) from the set of candidates for a surface form s, and all others as negative. [sent-102, score-0.497]

68 In addition, we train a separate classifier to detect NIL queries, i. [sent-103, score-0.047]

69 The best feature set contains all features except for LC-all and CR. [sent-113, score-0.091]

70 Our system outperforms previously reported results on NIL queries, and compares favorably on all queries. [sent-114, score-0.041]

71 if the similarity values of all candidates c ∈ E(s) are very low. [sent-115, score-0.123]

72 We calculate several different features, such as the maximum, mean and minimum, the difference between maximum and mean, and the difference between maximum and minimum, of all atomic features, using the feature vectors of all candidates in E(s). [sent-116, score-0.277]
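
The aggregate features for the NIL classifier can be sketched as follows. The function name and the ordering of the five statistics in the output vector are assumptions; the statistics themselves are the ones listed in sentence 72.

```python
def nil_features(candidate_vectors):
    """Aggregate each atomic feature over all candidates of one surface form.

    For every feature column the output holds, in order:
    max, mean, min, max - mean, max - min.
    """
    features = []
    if not candidate_vectors:
        return features
    for values in zip(*candidate_vectors):
        maximum = max(values)
        minimum = min(values)
        mean = sum(values) / len(values)
        features.extend([maximum, mean, minimum, maximum - mean, maximum - minimum])
    return features


# Two candidates, two atomic features each (toy values).
NIL_VECTOR = nil_features([[0.2, 0.5], [0.6, 0.1]])
```

If every candidate scores low on all atomic features, the maxima stay low as well, which is the signal the NIL classifier is meant to pick up.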

73 Both classifiers use a radial basis function kernel, with parameter settings of C = 32 and γ = 8. [sent-117, score-0.047]
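
For reference, the radial basis function kernel in its standard form is K(x, y) = exp(-γ ||x - y||²). The sketch below assumes this standard definition and plugs in the γ value quoted above; C is the SVM soft-margin penalty and does not appear in the kernel itself.

```python
import math

GAMMA = 8.0  # value quoted in the text; assumes the standard RBF parameterization


def rbf_kernel(x, y, gamma=GAMMA):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


SAME = rbf_kernel([0.5, 0.5], [0.5, 0.5])
APART = rbf_kernel([0.0, 0.0], [0.5, 0.0])
```

With features scaled to [0, 1], a γ of 8 makes the kernel value drop quickly: two vectors that differ by 0.5 in a single feature already have similarity exp(-2).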

74 A set of 3904 surface form-document pairs (queries) is constructed from these sources, encompassing 560 unique entities. [sent-121, score-0.322]

75 The majority of queries (57%) are NIL queries; of the KB queries, 69% are for organizations and 15% each for persons and geopolitical entities. [sent-122, score-0.225]

76 For each query the surface form appearing in the given document has to be disambiguated against the KB. [sent-123, score-0.45]

77 We randomly split the 3904 queries to perform 10-fold cross-validation, and stratify the resulting folds to ensure a similar distribution of KB and NIL queries in our training data. [sent-124, score-0.351]

78 After normalizing feature values to be in [0, 1], we train a candidate and a NIL classifier on 90% of the queries in each iteration, and test using the remaining 10%. [sent-125, score-0.399]
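
The evaluation setup in sentences 77 and 78 can be sketched with min-max scaling and a simple stratified split. The round-robin assignment, the fixed seed, and the function names are assumptions for illustration; the paper does not describe how its folds were built. In a stricter setup the scaling bounds would be computed on the training folds only.

```python
import random
from collections import defaultdict


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each label round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy query labels mirroring the 57% NIL share mentioned in the text.
LABELS = ["NIL"] * 57 + ["KB"] * 43
FOLDS = stratified_folds(LABELS)
SCALED = min_max_normalize([[1, 10], [3, 10], [2, 10]])
```

Each fold then serves once as the 10% test set while the other nine folds form the training data.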

79 Results reported in this paper are then averaged across the folds. [sent-126, score-0.162]

[Figure 2 plot; panels: All queries, KB, NIL; y-axis: micro-averaged accuracy; series: Baseline features, Best features, Dredze et al.]

80 Figure 2: The micro-averaged accuracy for all types of queries on TAC-KBP 2009 data in comparison to other systems. [sent-128, score-0.193]
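
Micro-averaged accuracy counts every query equally, so it is the share of correctly answered queries. The sketch below assumes gold and predicted labels are entity ids or the string "NIL"; the function names and the toy labels are hypothetical.

```python
def micro_accuracy(gold, predicted):
    """Share of queries whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def accuracy_by_query_type(gold, predicted):
    """Micro-averaged accuracy over all queries, KB queries, and NIL queries."""
    pairs = list(zip(gold, predicted))
    kb = [(g, p) for g, p in pairs if g != "NIL"]
    nil = [(g, p) for g, p in pairs if g == "NIL"]
    return {
        "all": micro_accuracy(gold, predicted),
        "kb": micro_accuracy([g for g, _ in kb], [p for _, p in kb]),
        "nil": micro_accuracy([g for g, _ in nil], [p for _, p in nil]),
    }


GOLD = ["E1", "E2", "NIL", "NIL", "NIL"]
PREDICTED = ["E1", "E7", "NIL", "NIL", "E2"]
ACCURACY = accuracy_by_query_type(GOLD, PREDICTED)
```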

81 Table 1 compares the micro-averaged accuracy of our approach on KB and NIL queries for different feature sets, and lists the results of two other state-of-the-art systems (Dredze et al. [sent-130, score-0.246]

82 As a baseline we use a feature set consisting of the BOW and SFP features. [sent-135, score-0.053]

83 The best feature set in our experiments comprises all features except for the LC-all and CR features. [sent-136, score-0.091]

84 Using the best feature set improves the disambiguation accuracy by 6. [sent-139, score-0.288]

85 2% over the baseline feature set, which is significant at p = 0. [sent-140, score-0.053]

86 For KB queries our system’s accuracy is higher than that of Dredze et al. [sent-142, score-0.193]

87 We can see that the novel entity features contribute to a higher overall accuracy. [sent-146, score-0.264]

88 Including the candidate selection score (CS) improves accuracy by 3. [sent-147, score-0.208]

89 The Wikipedia link-based features provide additional gains, however differences are quite [sent-149, score-0.122]

Figure 3 (x-axis: BOW, BOW + EC, BOW + EC + LC-in, BOW + EC + LC-all): Differences in micro-averaged accuracy for various feature combinations on TAC-KBP 2009 data.

90 Adding Wikipedia link-based features significantly improves performance over the baseline feature set. [sent-150, score-0.091]

91 The Candidate Rank (CR) feature slightly decreases the overall accuracy. [sent-156, score-0.053]

92 A manual inspection of the CR feature shows that often candidates cannot be distinguished by the classifier because they are assigned the same PageRank scores. [sent-157, score-0.183]

93 We assume this results from our use of uniform priors for the edges and vertices of the document graphs. [sent-158, score-0.083]

94 5 Conclusion and Future Work We presented a supervised approach for named entity disambiguation that explores novel features based on Wikipedia’s link structure. [sent-159, score-0.597]

95 These features use NEs co-occurring with an ambiguous surface form in a document and their Wikipedia relations to score the candidates. [sent-160, score-0.618]

96 We find that our features improve disambiguation results by 6. [sent-162, score-0.242]

97 2% over the popularity baseline, and are especially helpful for recognizing entities not contained in the KB. [sent-163, score-0.22]

98 In addition to Wikipedia, we also intend to exploit more dynamical information sources. [sent-167, score-0.062]

99 Named entity disambiguation by leveraging Wikipedia semantic knowledge. [sent-203, score-0.551]

100 Overview of the TAC 2009 knowledge base population track. [sent-208, score-0.176]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wikipedia', 0.325), ('surface', 0.322), ('kb', 0.311), ('nes', 0.261), ('nil', 0.249), ('entity', 0.226), ('disambiguation', 0.204), ('ned', 0.19), ('mcnamee', 0.166), ('queries', 0.162), ('entities', 0.142), ('candidate', 0.137), ('bow', 0.12), ('redirect', 0.119), ('cr', 0.107), ('ne', 0.093), ('ambiguous', 0.086), ('candidates', 0.083), ('dredze', 0.083), ('forms', 0.083), ('document', 0.083), ('ec', 0.082), ('article', 0.08), ('named', 0.079), ('popularity', 0.078), ('dang', 0.075), ('incoming', 0.07), ('bunescu', 0.07), ('han', 0.07), ('kbs', 0.062), ('uris', 0.062), ('outgoing', 0.062), ('name', 0.06), ('concept', 0.06), ('sfp', 0.055), ('disambiguate', 0.054), ('base', 0.054), ('tac', 0.054), ('pagerank', 0.054), ('feature', 0.053), ('referenced', 0.05), ('titles', 0.05), ('link', 0.05), ('index', 0.047), ('classifier', 0.047), ('zheng', 0.046), ('form', 0.045), ('encyclopedic', 0.045), ('pasca', 0.045), ('relations', 0.044), ('linking', 0.043), ('analogously', 0.043), ('bush', 0.043), ('cucerzan', 0.041), ('favorably', 0.041), ('trang', 0.041), ('brin', 0.041), ('hoa', 0.04), ('similarity', 0.04), ('selection', 0.04), ('razvan', 0.039), ('features', 0.038), ('cos', 0.038), ('page', 0.037), ('knowledge', 0.036), ('organizations', 0.036), ('exploit', 0.035), ('entries', 0.035), ('mention', 0.035), ('inverted', 0.034), ('vectors', 0.034), ('organizing', 0.033), ('law', 0.033), ('disambiguating', 0.032), ('population', 0.032), ('rank', 0.032), ('accuracy', 0.031), ('refer', 0.031), ('paul', 0.031), ('variants', 0.03), ('berlin', 0.03), ('finkel', 0.03), ('graph', 0.03), ('multilingual', 0.03), ('links', 0.029), ('implement', 0.028), ('referring', 0.028), ('detection', 0.028), ('xianpei', 0.027), ('geopolitical', 0.027), ('stratify', 0.027), ('rfa', 0.027), ('dynamical', 0.027), ('eom', 0.027), ('asnd', 0.027), ('cinh', 0.027), ('hypertextual', 0.027), ('itnheed', 0.027), ('otef', 0.027), ('sdet', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

2 0.34833068 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

Author: Xianpei Han ; Le Sun

Abstract: Linking entities with knowledge base (entity linking) is a key issue in bridging the textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, the entity linking decisions are critically depending on the heterogeneous knowledge of entities. In this paper, we propose a generative probabilistic model, called entity-mention model, which can leverage heterogeneous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidences from all the three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods.

3 0.29705706 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

Author: Heng Ji ; Ralph Grishman

Abstract: In this paper we give an overview of the Knowledge Base Population (KBP) track at the 2010 Text Analysis Conference. The main goal of KBP is to promote research in discovering facts about entities and augmenting a knowledge base (KB) with these facts. This is done through two tasks: Entity Linking (linking names in context to entities in the KB) and Slot Filling (adding information about an entity to the KB). A large source collection of newswire and web documents is provided from which systems are to discover information. Attributes (“slots”) derived from Wikipedia infoboxes are used to create the reference KB. In this paper we provide an overview of the techniques which can serve as a basis for a good KBP system, lay out the remaining challenges by comparison with traditional Information Extraction (IE) and Question Answering (QA) tasks, and provide some suggestions to address these challenges.

4 0.19763778 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

5 0.17805812 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

Author: Truc Vien T. Nguyen ; Alessandro Moschitti

Abstract: In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.

6 0.16198599 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

7 0.16195562 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

8 0.15884267 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

9 0.13851191 285 acl-2011-Simple supervised document geolocation with geodesic grids

10 0.13602002 52 acl-2011-Automatic Labelling of Topic Models

11 0.12760501 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

12 0.12401186 129 acl-2011-Extending the Entity Grid with Entity-Specific Features

13 0.11962568 117 acl-2011-Entity Set Expansion using Topic information

14 0.11830762 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

15 0.10496747 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

16 0.09929657 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

17 0.098968647 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

18 0.090137616 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation

19 0.088940226 258 acl-2011-Ranking Class Labels Using Query Sessions

20 0.086533651 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.21), (1, 0.099), (2, -0.195), (3, 0.092), (4, 0.042), (5, -0.088), (6, 0.008), (7, -0.166), (8, -0.22), (9, 0.001), (10, 0.049), (11, 0.021), (12, -0.094), (13, -0.155), (14, 0.137), (15, 0.087), (16, 0.26), (17, -0.009), (18, 0.006), (19, -0.069), (20, 0.065), (21, -0.098), (22, 0.009), (23, -0.112), (24, 0.131), (25, -0.017), (26, -0.024), (27, -0.033), (28, 0.01), (29, 0.105), (30, -0.044), (31, 0.085), (32, 0.019), (33, 0.026), (34, 0.168), (35, 0.153), (36, -0.03), (37, -0.042), (38, 0.015), (39, -0.005), (40, -0.011), (41, 0.078), (42, -0.089), (43, 0.091), (44, -0.07), (45, 0.058), (46, -0.106), (47, 0.043), (48, 0.036), (49, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97319084 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

2 0.88373154 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

3 0.85815799 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

Author: Xianpei Han ; Le Sun

Abstract: Linking entities with knowledge base (entity linking) is a key issue in bridging the textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, the entity linking decisions are critically depending on the heterogeneous knowledge of entities. In this paper, we propose a generative probabilistic model, called entity-mention model, which can leverage heterogeneous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidences from all the three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods.

4 0.82388031 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

Author: Heng Ji ; Ralph Grishman

Abstract: In this paper we give an overview of the Knowledge Base Population (KBP) track at the 2010 Text Analysis Conference. The main goal of KBP is to promote research in discovering facts about entities and augmenting a knowledge base (KB) with these facts. This is done through two tasks: Entity Linking (linking names in context to entities in the KB) and Slot Filling (adding information about an entity to the KB). A large source collection of newswire and web documents is provided from which systems are to discover information. Attributes (“slots”) derived from Wikipedia infoboxes are used to create the reference KB. In this paper we provide an overview of the techniques which can serve as a basis for a good KBP system, lay out the remaining challenges by comparison with traditional Information Extraction (IE) and Question Answering (QA) tasks, and provide some suggestions to address these challenges.

5 0.72077495 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

6 0.57275051 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

7 0.56095839 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

8 0.54964519 285 acl-2011-Simple supervised document geolocation with geodesic grids

9 0.52239466 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

10 0.50298429 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

11 0.47276118 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

12 0.46702775 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

13 0.46001315 129 acl-2011-Extending the Entity Grid with Entity-Specific Features

14 0.45722497 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

15 0.41151547 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

16 0.393929 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

17 0.38958725 261 acl-2011-Recognizing Named Entities in Tweets

18 0.37951979 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

19 0.37072766 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

20 0.36919054 298 acl-2011-The ACL Anthology Searchbench


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.024), (8, 0.165), (9, 0.017), (17, 0.028), (26, 0.033), (31, 0.011), (37, 0.111), (39, 0.078), (41, 0.07), (53, 0.033), (55, 0.045), (59, 0.058), (61, 0.012), (72, 0.026), (91, 0.038), (96, 0.134), (97, 0.011), (98, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88467658 296 acl-2011-Terminal-Aware Synchronous Binarization

Author: Licheng Fang ; Tagyoung Chung ; Daniel Gildea

Abstract: We present an SCFG binarization algorithm that combines the strengths of early terminal matching on the source language side and early language model integration on the target language side. We also examine how different strategies of target-side terminal attachment during binarization can significantly affect translation quality.

same-paper 2 0.85480899 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

3 0.80942571 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

Author: Ivan Titov ; Alexandre Klementiev

Abstract: We propose a non-parametric Bayesian model for unsupervised semantic parsing. Following Poon and Domingos (2009), we consider a semantic parsing setting where the goal is to (1) decompose the syntactic dependency tree of a sentence into fragments, (2) assign each of these fragments to a cluster of semantically equivalent syntactic structures, and (3) predict predicate-argument relations between the fragments. We use hierarchical Pitman-Yor processes to model statistical dependencies between meaning representations of predicates and those of their arguments, as well as the clusters of their syntactic realizations. We develop a modification of the Metropolis-Hastings split-merge sampler, resulting in an efficient inference algorithm for the model. The method is experimentally evaluated by using the induced semantic representation for the question answering task in the biomedical domain.

4 0.76196575 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows to integrate linguistic knowledge transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

5 0.76094878 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

Author: Lonneke van der Plas ; Paola Merlo ; James Henderson

Abstract: Broad-coverage semantic annotations for training statistical learners are only available for a handful of languages. Previous approaches to cross-lingual transfer of semantic annotations have addressed this problem with encouraging results on a small scale. In this paper, we scale up previous efforts by using an automatic approach to semantic annotation that does not rely on a semantic ontology for the target language. Moreover, we improve the quality of the transferred semantic annotations by using a joint syntactic-semantic parser that learns the correlations between syntax and semantics of the target language and smooths out the errors from automatic transfer. We reach a labelled F-measure for predicates and arguments of only 4% and 9% points, respectively, lower than the upper bound from manual annotations.

6 0.7584359 282 acl-2011-Shift-Reduce CCG Parsing

7 0.75757122 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

8 0.75752771 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

9 0.75473964 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

10 0.75320011 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

11 0.75252342 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

12 0.75251317 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

13 0.75213134 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.75063741 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

15 0.75050908 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

16 0.75031096 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic

17 0.75030625 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

18 0.75020576 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

19 0.74969167 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

20 0.74823701 182 acl-2011-Joint Annotation of Search Queries