emnlp emnlp2012 emnlp2012-84 knowledge-graph by maker-knowledge-mining

84 emnlp-2012-Linking Named Entities to Any Database


Source: pdf

Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates

Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. [sent-14, score-0.295]

2 We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. [sent-15, score-0.483]

3 Instead of relying solely on Wikipedia, we propose a novel approach to NED, which we refer to as Open-DB NED: the task is to resolve an entity to Wikipedia or to any relational database that meets mild conditions about the format of the data, described below. [sent-27, score-0.485]

4 The first strategy, a distant supervision approach, uses the relational information in a given database and a large corpus of unlabeled text to learn a database-specific model. [sent-33, score-0.738]

5 The second strategy, a domain adaptation approach, assumes a single source database that has accompanying labeled data. [sent-36, score-0.598]

6 Classifiers in this setting must learn a model that transfers from the source database to any new database, without requiring new training data for the new database. [sent-37, score-0.336]

7 Sections 4 and 5 present our distant supervision strategy and domain-adaptation strategy, respectively. [sent-41, score-0.38]

8 Numerous previous studies have considered distant or weak supervision from a single relational database as an alternative to manual supervision for information extraction (Hoffmann et al. [sent-59, score-0.863]

9 In contrast to these systems, our distant supervision NED system provides a meta-algorithm for generating an NED system for any database and any entity type. [sent-65, score-0.714]

10 Existing domain adaptation or transfer learning approaches are inappropriate for the Open-DB NED task, either because they require labeled data in both the source and target domains (Daum ´e III et al. [sent-66, score-0.375]

11 , 2006; Huang and Yates, 2009), which does not apply to the database symbols across the two domains. [sent-69, score-0.349]

12 Instead, our domain adaptation technique uses domain-independent features of relational data, which apply regardless of the actual contents of the database, as explained further below. [sent-70, score-0.32]

13 For a particular database DB, we refer to its components as DB. [sent-91, score-0.296]

14 Given a corpus C, a set of mentions M that occur in C, and a set of databases D, the Open-DB NED task is to produce a function f : M → SD, which identifies an appropriate target symbol from one of the databases in D, or determines that the mention is OOD. [sent-97, score-0.446]
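
To make the signature f : M → SD concrete, here is a minimal sketch in Python; the toy databases, the naive exact-name candidate generation, and all identifiers are hypothetical illustrations rather than the authors' implementation (a real system scores candidates with the models described below).

    OOD = "OOD"  # marker for mentions whose referent is in no database

    # Hypothetical toy databases: each maps a key (unique ID) to a tuple of attributes.
    sports_db = {1: {"name": "Chris Johnson", "team": "Titans", "position": "RB"},
                 2: {"name": "Chris Johnson", "team": "Red Sox", "position": "3B"}}
    movie_db = {5: {"title": "The Room", "year": "2003", "director": "Tommy Wiseau"}}

    def resolve(mention_text, databases):
        """f : M -> S_D: return (db_name, key) for the chosen referent, or OOD."""
        candidates = [(db_name, key)
                      for db_name, db in databases.items()
                      for key, tup in db.items()
                      if mention_text in tup.values()]
        # Placeholder decision rule: first candidate wins; the paper's models
        # instead score every candidate with a log-linear model over context features.
        return candidates[0] if candidates else OOD

    print(resolve("Chris Johnson", {"sports": sports_db, "movies": movie_db}))  # ('sports', 1)
    print(resolve("Casablanca", {"sports": sports_db, "movies": movie_db}))     # 'OOD'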

15 In the domain adaptation section below, we relax this condition somewhat, to allow labeled data for a small number of initial databases; the system must then transfer what it learns from the labeled domains to any new database. [sent-100, score-0.424]

16 In practice, this is a relatively safe assumption as database designers often aim for even stricter normal forms. [sent-112, score-0.296]

17 We will additionally assume that all attributes, including names and nicknames, of entities that are covered by the database are treated as functional dependencies of the entity. [sent-116, score-0.451]

18 Again, in practice, this is a fairly safe assumption, as it is part of good database design; but if a database does not conform, there will be some entities in the database that our algorithms cannot resolve mentions to. [sent-117, score-1.011]

19 Finally, we will assume the existence of a function µ(s, t) which indicates whether the text t is a valid surface form of database symbol s. [sent-119, score-0.335]
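
As a rough illustration (not the authors' definition), µ can be read as a string predicate over a database value and a text span; the two hypothetical variants below, an exact match and a looser token-overlap match, mirror the µexact/µpartial distinction examined in the experiments.

    def mu_exact(symbol_value, text):
        """mu(s, t): is t exactly the surface form of database value s (case-insensitive)?"""
        return str(symbol_value).strip().lower() == text.strip().lower()

    def mu_partial(symbol_value, text):
        """Looser variant: any shared token counts, so "Johnson" matches "Chris Johnson"."""
        return bool(set(str(symbol_value).lower().split()) & set(text.lower().split()))

    print(mu_exact("Chris Johnson", "chris johnson"))  # True
    print(mu_exact("Chris Johnson", "Johnson"))        # False
    print(mu_partial("Chris Johnson", "Johnson"))      # True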

20 A Distant Supervision Strategy for Open-DB NED: Our first approach to the Open-DB NED problem relies on the fact that, while many mentions are indeed ambiguous and difficult to resolve correctly, most mentions have only a very small number of possible referents in a given database. [sent-122, score-0.416]

21 “Chris Johnson” is doubtless the name of thousands of people, but for articles that are reasonably well-aligned with our sports database, most of the time the name will refer to just three different people. [sent-123, score-0.379]

22 Most sports names are in fact less ambiguous still. [sent-124, score-0.316]

23 Thus, taking a corpus of unlabeled sports articles, we use the information in the database to provide (uncertain) labels, and then train a log-linear model from this probabilistically-labeled data. [sent-125, score-0.569]

24 As mentioned above, we only need to consider resolving to database symbols s that are keys, or unique IDs, for some tuple in a database. [sent-136, score-0.349]

25 For an entity in the database with key id, the feature generation algorithm generates two types of feature functions: attribute counts and similar entity counts. [sent-137, score-0.651]

26 Each of these features measures the similarity between the information stored in the database about the entity id, and the information in the text of document d surrounding mention m. [sent-138, score-0.492]

27 An attribute count feature function f^att_{i,j}(m, id) for the jth attribute of relation ri counts how many times the value of that attribute for entity id appears in the context of m. Algorithm: Feature Generation. Input: DB, a database in BCNF. Output: F, a set of feature functions. Initialization: F ← ∅. Attribute Count Feature Functions: for each relation ri ∈ DB. [sent-139, score-0.897]

28 f^att_{i,j}(m, id): count ← 0; identify the tuple t ∈ ri containing id; val ← t_j; count ← count + ContextMatches(val, m); return count. F ← F ∪ {f^att_{i,j}}. Similar-Entity Count Feature Functions: for each relation ri ∈ DB. [sent-144, score-0.510]

29 The ContextMatches(s, m) function counts how many times a string that matches database symbol s appears in the context of m. [sent-153, score-0.414]

30 Matching between strings and database symbols is discussed in Sec. [sent-155, score-0.349]

31 For example, if id is 5 in the movie relation in Figure 1, the feature function for attribute year would count how often 2010 matches the text surrounding mention m. [sent-159, score-0.73]

32 Defining precisely whether a database symbol “matches” a word or phrase is a subtle issue; we explore several possibilities in Section 7. [sent-160, score-0.335]

33 In addition to attribute counts for attributes within a single relation, we also use attributes from relations that have been inner-joined on primary key and foreign key pairs. [sent-162, score-0.327]

34 High values for these attribute count features indicate that the text around m closely matches the information in the database about entity id, and therefore id is a strong candidate for the referent of m. [sent-164, score-0.814]

35 A similar entity count feature function f^sim_{i,j}(m, id) for the jth attribute in relation ri counts how many entities similar to id are mentioned in the neighborhood of m. [sent-166, score-0.632]

36 As an example, consider a mention of “Chris Johnson”, id = 3, and the similar entity feature for the position attribute of the players relation in the sports database. [sent-167, score-0.846]

37 Likewise, the similar entity feature for the team id attribute would count how many teammates of the player with id = 3 appear near “Chris Johnson”. [sent-171, score-0.756]

38 A high count for this teammate feature is a strong clue that id is the correct referent for m, while a high count for players of the same position is a weak but still valuable clue. [sent-172, score-0.477]
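
A compact sketch of the two feature families over a hypothetical players relation; ContextMatches is simplified to a case-insensitive substring count, and all tuples and identifiers are illustrative rather than the paper's exact implementation.

    # Toy "players" relation, keyed by id.
    players = {
        1: {"name": "Chris Johnson", "team_id": "TEN", "position": "RB"},
        3: {"name": "Chris Johnson", "team_id": "BOS", "position": "3B"},
        4: {"name": "Jacoby Ellsbury", "team_id": "BOS", "position": "CF"},
    }

    def context_matches(value, context):
        """ContextMatches(s, m): occurrences of database value s in the text around m
        (here a crude case-insensitive substring count)."""
        return context.lower().count(str(value).lower())

    def attribute_count(context, entity_id, attribute):
        """f^att_{i,j}(m, id): how often id's j-th attribute value appears near m."""
        return context_matches(players[entity_id][attribute], context)

    def similar_entity_count(context, entity_id, attribute):
        """f^sim_{i,j}(m, id): how often entities sharing id's j-th attribute value
        (e.g. teammates, for team_id) are mentioned near m."""
        value = players[entity_id][attribute]
        return sum(context_matches(tup["name"], context)
                   for other_id, tup in players.items()
                   if other_id != entity_id and tup[attribute] == value)

    ctx = "Johnson and teammate Jacoby Ellsbury homered for Boston"
    print(attribute_count(ctx, 3, "position"))      # 0 -- "3B" does not appear in the context
    print(similar_entity_count(ctx, 3, "team_id"))  # 1 -- a teammate (Ellsbury) appears
    print(similar_entity_count(ctx, 1, "team_id"))  # 0 -- no teammates of id=1 nearby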

39 Parameter Estimation via Distant Supervision: Using string similarity, we can heuristically determine that three IDs with name attribute Chris Johnson are highly likely to be the correct target for a mention of “Chris Johnson”. [sent-174, score-0.285]

40 Our distant supervision parameter estimation strategy is to move as much probability mass as possible onto the set of realistic referents obtained via string similarity. [sent-175, score-0.486]

41 Since our features rely on finding attributes and similar entities, the side effect of this strategy is that most of the probability mass for a particular mention is moved onto the one target ID with high attribute count and similar entity count features, thus disambiguating the entity. [sent-176, score-0.597]

42 Although the string-similarity heuristic is typically noisy, the strong information in the database and the fact that many entity mentions are typically not ambiguous allows the technique to learn effectively from unlabeled text. [sent-177, score-0.552]

43 Let φ(m, DB) be a heuristic string-matching function that returns a set of plausible ID values in database DB for mention m. [sent-178, score-0.417]
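
As a rough sketch of how φ and the training signal could fit together (hypothetical code, not the released system), the objective below pushes a log-linear model's probability mass onto the plausible set φ(m, DB).

    import math
    from difflib import SequenceMatcher

    def phi(mention_text, db, threshold=0.85):
        """phi(m, DB): plausible referent IDs via a simple name-similarity heuristic."""
        return [key for key, tup in db.items()
                if SequenceMatcher(None, mention_text.lower(),
                                   tup["name"].lower()).ratio() >= threshold]

    def marginal_log_likelihood(mention_text, context, db, weights, features):
        """Distant-supervision objective for one mention: log of the probability mass
        that the log-linear model places on the plausible set phi(m, DB)."""
        scores = {key: math.exp(sum(w * f(context, key) for w, f in zip(weights, features)))
                  for key in db}
        z = sum(scores.values())
        plausible = phi(mention_text, db)
        return math.log(sum(scores[k] for k in plausible) / z) if plausible else 0.0

    toy_db = {1: {"name": "Chris Johnson", "position": "RB"},
              2: {"name": "Chris Paul", "position": "PG"}}
    feats = [lambda ctx, k: ctx.lower().count(toy_db[k]["position"].lower())]
    print(marginal_log_likelihood("Chris Johnson", "the RB broke loose", toy_db, [1.0], feats))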

44 A Domain-Adaptation Strategy for Open-DB NED: Our domain-adaptation strategy builds an Open-DB NED system by training it on labeled examples from an initial database or a small set of initial databases. [sent-184, score-0.382]

45 Counting how many times the director of a movie appears is highly useful in the movie domain, but worthless in the sports domain. [sent-189, score-0.608]

46 For example, rather than counting how often the director of a movie appears in the context around a movie mention, we create a domain-independent Count Att(m, s) feature function that counts how often any attribute of s appears in the context of m. [sent-191, score-0.575]

47 In the sports domain, Count Att will add together counts for appearances of a player’s height, position, salary, etc. [sent-193, score-0.323]
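
A minimal sketch (with made-up tuples) of how such an aggregate feature stays domain-independent, where the per-attribute features it replaces would not.

    def count_att(context, entity_tuple):
        """Count_Att(m, s): occurrences of *any* attribute value of s near m.
        No attribute is named, so the same feature works for movies, players, etc."""
        return sum(context.lower().count(str(v).lower()) for v in entity_tuple.values())

    movie = {"title": "Primer", "year": "2004", "director": "Shane Carruth"}
    player = {"name": "Chris Johnson", "team": "Titans", "position": "RB"}
    ctx = "Shane Carruth wrote, directed, and starred in Primer (2004)"
    print(count_att(ctx, movie))   # 3 -- title, year, and director all appear
    print(count_att(ctx, player))  # 0 -- same feature, different domain, no retraining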

48 Thus there is hope of training a model with domain-independent features like Count Att on labeled data from one domain, say movies, and producing a model that has high accuracy on the sports domain. [sent-196, score-0.29]

49 We say that a domain consists of a database DB as well as a distribution D(M), where M is the space of mentions. [sent-198, score-0.436]

50 In domain adaptation, a system observes a set of training examples (m, s, g(m, s)), where instances m ∈ M are drawn from a source domain’s distribution DS and referents s are drawn from the source domain’s database DBS. [sent-201, score-0.436]

51 The system must then learn a hypothesis for classifying examples (m, s) drawn from a target domain’s distribution DT and database DBT. [sent-203, score-0.296]

52 The system cannot simply learn to predict the referent s = g(m) for the mention, since the set of possible referents changes from domain to domain, and therefore the output of g would be completely different from one domain to the next. [sent-206, score-0.386]
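
To illustrate this setup with hypothetical code (not the authors' system): the classifier instead scores (mention, candidate-symbol) pairs through domain-independent features, so weights learned from source-domain pairs can be reused on any new database.

    def pair_examples(labeled_mentions, db):
        """Build training rows (m, s, g(m, s)) from a source domain.
        labeled_mentions: list of (context, gold_key); db: key -> attribute tuple."""
        rows = []
        for context, gold_key in labeled_mentions:
            for key, tup in db.items():
                rows.append((context, tup, int(key == gold_key)))  # g(m, s) in {0, 1}
        return rows

    def score_pair(context, entity_tuple, weight):
        """One domain-independent feature (total attribute matches) times its weight;
        nothing here names a specific attribute or a specific database, so the weight
        learned on source-domain rows can be applied unchanged to a target database."""
        count_att = sum(context.lower().count(str(v).lower()) for v in entity_tuple.values())
        return weight * count_att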

53 Table 1: Primary feature functions for a domain adaptation approach to NED. [sent-210, score-0.308]

54 These features made the biggest difference in our experiments, but we also tested variations such as counting unique numeric attribute appearances, counting unique similar entities, counting relation name appearances, counting extended attribute appearances, and others. [sent-211, score-0.397]

55 These features use the attribute counts and similar entity counts from the distant supervision model as subroutines. [sent-213, score-0.595]

56 By aggregating over those domain-dependent feature functions, the domain adaptation system arrives at feature functions that can be defined for any database, rather than for a specific database. [sent-214, score-0.368]

57 Note that there is a tradeoff between the domain adaptation technique and the distant supervision technique. [sent-215, score-0.596]

58 The domain adaptation model has access to labeled data, unlike the distant supervision model. [sent-216, score-0.645]

59 In addition, the domain adaptation model requires no text whatsoever from the target domain, not even an unlabeled corpus, to set weights for the target domain. [sent-217, score-0.285]

60 Once trained, it is ready for NED over any database that meets our assumptions, out of the box. [sent-218, score-0.296]

61 However, because the model needs to be able to transfer to arbitrary new domains, the domain adaptation model is restricted to domain-independent features, which are “coarser-grained.” [sent-219, score-0.286]

62 That is, the distant supervision model has the ability to place more weight on attributes like director rather than genre, or team rather than position, if those attributes are more discriminative. [sent-220, score-0.642]

63 The domain adaptation model cannot place different weights on the different attributes, since those weights would not transfer across databases. [sent-221, score-0.286]

64 As with distant supervision, the domain adaptation strategy uses a log-linear model over these feature functions. [sent-222, score-0.506]

65 To address this question, we design a Hybrid model with features and training strategies from both distant supervision and domain adaptation. [sent-229, score-0.535]

66 The training data consists of a set LS of labeled mentions from a source domain, a source database DBS, a set of unlabeled mentions MT from the target domain, and the target-domain database DBT. [sent-230, score-0.901]

67 The full feature set of the Hybrid model is the union of the distant supervision feature functions for the target domain and the domain-independent domain adaptation feature functions. [sent-231, score-0.881]

68 Note that the distant supervision feature functions are domain-specific, so they almost always will be uniformly zero on LS, but the domain adaptation feature functions will be activated on both LS and MT. [sent-232, score-0.766]

69 The combined training objective for the Hybrid model is LL(LS, MT, w) = CLL(LS, w) + MLL(MT, w). Experiments: Our experiments compare our strategies for Open-DB NED against one another, as well as against a Wikipedia NED system from previous work, on two domains: sports and movies. [sent-233, score-0.293]
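
A hedged sketch of that combined objective, with hypothetical data structures; a real implementation would maximize it over w with a gradient-based optimizer.

    import math

    def _candidate_scores(candidates, features, weights, context):
        return {c: math.exp(sum(w * f(context, c) for w, f in zip(weights, features)))
                for c in candidates}

    def cll(labeled_source, features, weights):
        """CLL(L_S, w): conditional log-likelihood of gold referents on labeled source data.
        labeled_source: list of (context, candidate_ids, gold_id)."""
        total = 0.0
        for context, candidates, gold in labeled_source:
            s = _candidate_scores(candidates, features, weights, context)
            total += math.log(s[gold] / sum(s.values()))
        return total

    def mll(unlabeled_target, features, weights, phi):
        """MLL(M_T, w): marginal log-likelihood of the plausible set phi(m) on unlabeled
        target mentions.  unlabeled_target: list of (context, candidate_ids, mention_text)."""
        total = 0.0
        for context, candidates, mention in unlabeled_target:
            s = _candidate_scores(candidates, features, weights, context)
            plausible = [c for c in phi(mention) if c in s]
            if plausible:
                total += math.log(sum(s[c] for c in plausible) / sum(s.values()))
        return total

    def hybrid_objective(labeled_source, unlabeled_target, features, weights, phi):
        """LL(L_S, M_T, w) = CLL(L_S, w) + MLL(M_T, w)."""
        return (cll(labeled_source, features, weights)
                + mll(unlabeled_target, features, weights, phi))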

70 1 Data For the movie domain, we collected a set of 156 cult movie titles from an online movie site (www. [sent-235, score-0.499]

71 Nearly all top-five results included at least one mention of an entity not found in Wikipedia; overall, only 16% of the mentions could be linked to Wikipedia. [sent-239, score-0.31]

72 To provide labels for these mentions, we use both a movie database and Wikipedia. [sent-243, score-0.453]

73 For the sports domain, we downloaded all player data from Yahoo! [sent-250, score-0.331]

74 ’s sports database for the years 2011-2012 and two American sports leagues, the National Football League (NFL) and Major League Baseball (MLB). [sent-252, score-0.778]

75 From the database, we extracted ambiguous player names and team names, including names like “Philadelphia” which may refer to Philadelphia Eagles in the NFL data, Philadelphia Phillies in the MLB data, or the city of Philadelphia itself (in both types of data). [sent-253, score-0.381]

76 news articles which include a mention that partially matches at least one of these database symbols. [sent-255, score-0.5]

77 We manually labeled a random sample of 564 mentions from this data, including 279 player name mentions and 285 city name mentions. [sent-256, score-0.469]

78 Many player name and place name mentions are ambiguous between the two sports leagues, as well as with teams or players from other leagues. [sent-257, score-0.706]

79 In order to focus on the hardest cases, we specifically exclude mentions like “Philadelphia” from the labeled data if any of their ... [sent-258, score-0.445]

80 5 0% 100% Table 2: Number of mentions, average number of referents per mention, % of mentions that are OOD, and % of mentions that are in Wikipedia in our movie and sports data. [sent-260, score-0.732]

81 As before, the set of possible referents includes the symbol OOD, key values from the sports database, and Wikipedia articles, and a given mention may be labeled with both a sports entity and a Wikipedia article, if appropriate. [sent-262, score-0.872]

82 This is judged correct if it matches the correct label s exactly, or (in cases where both a Wikipedia and a database entity are considered correct) if one of the labels matches exactly. [sent-267, score-0.465]
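
A rough sketch of that scoring rule, with made-up gold structures (each gold entry is the set of acceptable labels, which may contain a database key, a Wikipedia page, or both).

    def is_correct(prediction, gold_labels):
        """Correct if the prediction exactly matches any acceptable gold label."""
        return prediction in gold_labels

    def accuracy(predictions, gold_label_sets):
        return sum(is_correct(p, g)
                   for p, g in zip(predictions, gold_label_sets)) / len(predictions)

    gold = [{("sports", 3), ("wiki", "Chris_Johnson_(running_back)")}, {"OOD"}]
    preds = [("wiki", "Chris_Johnson_(running_back)"), ("sports", 1)]
    print(accuracy(preds, gold))  # 0.5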

83 One important question in the design of our systems is how to determine the “match” between database symbols and text. [sent-271, score-0.349]

84 For instance, the database value Chris Johnson ...

85 On the other hand, if we use µpartial for computing our models’ feature functions, like the Count Att(m, s) in the domain adaptation model, counts varied widely across domains. [sent-290, score-0.315]

86 A simple version of the domain adaptation classifier (only the Count All and Count Unique features) trained on sports data and tested on movies achieved an accuracy of 24% using µpartial, compared with 61% using µexact. [sent-291, score-0.636]

87 The domain-adaptation model is trained on the labeled data for sports when testing on movies, and vice versa. [sent-308, score-0.29]

88 For the distant supervision strategy, we use the entire collection of texts from each domain as input (1300 articles for sports, 770 articles for movies), with the labels removed during training. [sent-310, score-0.555]

89 We also test a hypothetical system, Oracle Wikifier, which is given no information about entities in IMDB, but is assumed to be able to correctly resolve any mention that refers to an entity found in Wikipedia. [sent-316, score-0.319]

90 Finally, we compare against a system that trains the domain adaptation model using distant supervision (“DA Trained with DS”). [sent-320, score-0.596]

91 Encouragingly, the Hybrid model consistently outperforms both distant supervision and domain adaptation, suggesting that the two sources of evidence are partially complementary. [sent-324, score-0.483]

92 Distant supervision performs better on the movies test, whereas domain adaptation has the advantage on sports. [sent-325, score-0.552]

93 The domain adaptation system outperforms DA Trained with DS on both domains, suggesting that labeled data from a separate domain is better evidence for parameter estimates than unlabeled data from the same domain. [sent-327, score-0.474]

94 The distant supervision system also outperforms DA Trained with DS. (Footnote 1: Alternatively, one could make the oracle system predict OOD on all mentions that fall outside of Wikipedia.) [sent-328, score-0.485]

95 1, supplied all of the features from the distant supervision model, and manually set w = 1. [sent-342, score-0.343]

96 For both the movie and sports domains, approximately 80% of the Hybrid model’s errors come from predicting database symbols when the correct referent is a Wikipedia page or OOD. [sent-348, score-0.745]

97 This nearly always occurs because some words in the context of a mention match an attribute of an incorrect database referent. [sent-349, score-0.53]

98 In the movie domain, most of the remaining errors are incorrect OOD predictions for mentions that should resolve to the database, where the article contains no attributes of, or entities similar to, the database entity. [sent-351, score-0.781]

99 In the sports domain, many of the remaining errors were due to predicting incorrect player referents. [sent-352, score-0.331]

100 Quite often, this was because the document discusses a fantasy sports league or team, where players from different professional sports teams are mixed together on a “fantasy team” belonging to a fan of the sport. [sent-353, score-0.674]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ned', 0.519), ('database', 0.296), ('sports', 0.241), ('wikipedia', 0.215), ('distant', 0.186), ('movie', 0.157), ('supervision', 0.157), ('id', 0.152), ('movies', 0.142), ('domain', 0.14), ('bcnf', 0.125), ('mention', 0.121), ('mentions', 0.114), ('adaptation', 0.113), ('attribute', 0.113), ('ood', 0.111), ('wikifier', 0.111), ('referents', 0.106), ('attributes', 0.091), ('player', 0.09), ('databases', 0.086), ('db', 0.084), ('players', 0.084), ('lesk', 0.083), ('count', 0.08), ('entities', 0.076), ('att', 0.076), ('entity', 0.075), ('numeric', 0.071), ('disambiguation', 0.071), ('relational', 0.067), ('door', 0.066), ('team', 0.064), ('johnson', 0.063), ('philadelphia', 0.06), ('functions', 0.055), ('hybrid', 0.055), ('symbols', 0.053), ('catalog', 0.053), ('director', 0.053), ('strategies', 0.052), ('name', 0.051), ('referent', 0.051), ('appearances', 0.05), ('imdb', 0.05), ('labeled', 0.049), ('ki', 0.048), ('phi', 0.048), ('resolve', 0.047), ('matches', 0.047), ('ri', 0.044), ('temple', 0.043), ('chris', 0.043), ('partial', 0.043), ('chri', 0.042), ('contextmatches', 0.042), ('dalvi', 0.042), ('mll', 0.042), ('domains', 0.04), ('requiring', 0.04), ('teams', 0.04), ('names', 0.04), ('symbol', 0.039), ('functional', 0.039), ('zhou', 0.039), ('ls', 0.039), ('yahoo', 0.039), ('bellare', 0.037), ('strategy', 0.037), ('leagues', 0.036), ('cll', 0.036), ('fantasy', 0.036), ('bunescu', 0.036), ('articles', 0.036), ('ambiguous', 0.035), ('sd', 0.034), ('transfer', 0.033), ('linking', 0.033), ('counting', 0.033), ('league', 0.032), ('counts', 0.032), ('unlabeled', 0.032), ('named', 0.03), ('feature', 0.03), ('hoffart', 0.03), ('relation', 0.03), ('genre', 0.03), ('gmai', 0.028), ('oracle', 0.028), ('cronin', 0.028), ('cult', 0.028), ('eeaficnhe', 0.028), ('houston', 0.028), ('innc', 0.028), ('ivdaeln', 0.028), ('jfu', 0.028), ('jis', 0.028), ('lade', 0.028), ('lphia', 0.028), ('margins', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 84 emnlp-2012-Linking Named Entities to Any Database

Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates

Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.

2 0.18623945 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

Author: Mihai Surdeanu ; Julie Tibshirani ; Ramesh Nallapati ; Christopher D. Manning

Abstract: Distant supervision for relation extraction (RE) gathering training data by aligning a database of facts with text – is an efficient approach to scale RE to thousands of different relations. However, this introduces a challenging learning scenario where the relation expressed by a pair of entities found in a sentence is unknown. For example, a sentence containing Balzac and France may express BornIn or Died, an unknown relation, or no relation at all. Because of this, traditional supervised learning, which assumes that each example is explicitly mapped to a label, is not appropriate. We propose a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables. Our model performs competitively on two difficult domains. –

3 0.1698243 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

Author: Lev Ratinov ; Dan Roth

Abstract: We explore the interplay of knowledge and structure in co-reference resolution. To inject knowledge, we use a state-of-the-art system which cross-links (or “grounds”) expressions in free text to Wikipedia. We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. To maximize the utility of the injected knowledge, we deploy a learningbased multi-sieve approach and develop novel entity-based features. Our end system outperforms the state-of-the-art baseline by 2 B3 F1 points on non-transcript portion of the ACE 2004 dataset.

4 0.14424123 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

5 0.12865347 19 emnlp-2012-An Entity-Topic Model for Entity Linking

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention ’s referent entity should be coherent with the document’ ’s main topics”. In this paper, we propose a generative model called entitytopic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can – accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model. 1

6 0.12548713 97 emnlp-2012-Natural Language Questions for the Web of Data

7 0.12248832 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

8 0.11723527 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

9 0.11184468 41 emnlp-2012-Entity based QA Retrieval

10 0.10168789 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

11 0.082303219 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

12 0.079883739 6 emnlp-2012-A New Minimally-Supervised Framework for Domain Word Sense Disambiguation

13 0.075969741 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

14 0.074354082 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

15 0.069172449 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

16 0.067431405 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification

17 0.065537766 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

18 0.064252123 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

19 0.064049527 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic

20 0.062014282 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.224), (1, 0.234), (2, 0.026), (3, -0.121), (4, -0.066), (5, -0.081), (6, 0.055), (7, 0.178), (8, 0.016), (9, -0.11), (10, 0.075), (11, 0.038), (12, -0.018), (13, -0.076), (14, 0.077), (15, 0.12), (16, 0.141), (17, 0.04), (18, -0.104), (19, 0.062), (20, 0.077), (21, -0.11), (22, -0.001), (23, -0.069), (24, -0.174), (25, -0.001), (26, -0.021), (27, 0.068), (28, -0.069), (29, -0.14), (30, 0.095), (31, 0.009), (32, 0.034), (33, 0.067), (34, 0.01), (35, 0.072), (36, -0.035), (37, 0.085), (38, 0.009), (39, -0.073), (40, 0.074), (41, 0.087), (42, -0.035), (43, -0.009), (44, 0.006), (45, 0.022), (46, 0.088), (47, -0.018), (48, 0.033), (49, -0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96925992 84 emnlp-2012-Linking Named Entities to Any Database

Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates

Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.

2 0.71605128 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

3 0.64819521 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

Author: Mihai Surdeanu ; Julie Tibshirani ; Ramesh Nallapati ; Christopher D. Manning

Abstract: Distant supervision for relation extraction (RE) gathering training data by aligning a database of facts with text – is an efficient approach to scale RE to thousands of different relations. However, this introduces a challenging learning scenario where the relation expressed by a pair of entities found in a sentence is unknown. For example, a sentence containing Balzac and France may express BornIn or Died, an unknown relation, or no relation at all. Because of this, traditional supervised learning, which assumes that each example is explicitly mapped to a label, is not appropriate. We propose a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables. Our model performs competitively on two difficult domains. –

4 0.61517435 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

Author: Lev Ratinov ; Dan Roth

Abstract: We explore the interplay of knowledge and structure in co-reference resolution. To inject knowledge, we use a state-of-the-art system which cross-links (or “grounds”) expressions in free text to Wikipedia. We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. To maximize the utility of the injected knowledge, we deploy a learningbased multi-sieve approach and develop novel entity-based features. Our end system outperforms the state-of-the-art baseline by 2 B3 F1 points on non-transcript portion of the ACE 2004 dataset.

5 0.6122421 19 emnlp-2012-An Entity-Topic Model for Entity Linking

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention ’s referent entity should be coherent with the document’ ’s main topics”. In this paper, we propose a generative model called entitytopic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can – accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model. 1

6 0.57058579 41 emnlp-2012-Entity based QA Retrieval

7 0.46208879 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

8 0.44855052 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

9 0.43631673 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

10 0.39508185 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

11 0.38279811 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

12 0.3810491 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

13 0.37302974 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification

14 0.35928828 6 emnlp-2012-A New Minimally-Supervised Framework for Domain Word Sense Disambiguation

15 0.35319895 97 emnlp-2012-Natural Language Questions for the Web of Data

16 0.34802634 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

17 0.3354122 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

18 0.33077434 62 emnlp-2012-Identifying Constant and Unique Relations by using Time-Series Text

19 0.32784301 10 emnlp-2012-A Statistical Relational Learning Approach to Identifying Evidence Based Medicine Categories

20 0.31778258 85 emnlp-2012-Local and Global Context for Supervised and Unsupervised Metonymy Resolution


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.021), (16, 0.024), (25, 0.016), (34, 0.055), (45, 0.01), (60, 0.572), (63, 0.04), (64, 0.021), (65, 0.031), (70, 0.015), (73, 0.016), (74, 0.027), (76, 0.026), (80, 0.011), (86, 0.013), (95, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99113744 84 emnlp-2012-Linking Named Entities to Any Database

Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates

Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.

2 0.99081171 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs

Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat

Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.

3 0.99062037 68 emnlp-2012-Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation

Author: Wenbin Jiang ; Fandong Meng ; Qun Liu ; Yajuan Lu

Abstract: In this paper we first describe the technology of automatic annotation transformation, which is based on the annotation adaptation algorithm (Jiang et al., 2009). It can automatically transform a human-annotated corpus from one annotation guideline to another. We then propose two optimization strategies, iterative training and predict-selfreestimation, to further improve the accuracy of annotation guideline transformation. Experiments on Chinese word segmentation show that, the iterative training strategy together with predictself reestimation brings significant improvement over the simple annotation transformation baseline, and leads to classifiers with significantly higher accuracy and several times faster processing than annotation adaptation does. On the Penn Chinese Treebank 5.0, , it achieves an F-measure of 98.43%, significantly outperforms previous works although using a single classifier with only local features.

4 0.987095 41 emnlp-2012-Entity based QA Retrieval

Author: Amit Singh

Abstract: Bridging the lexical gap between the user’s question and the question-answer pairs in the Q&A; archives has been a major challenge for Q&A; retrieval. State-of-the-art approaches address this issue by implicitly expanding the queries with additional words using statistical translation models. While useful, the effectiveness of these models is highly dependant on the availability of quality corpus in the absence of which they are troubled by noise issues. Moreover these models perform word based expansion in a context agnostic manner resulting in translation that might be mixed and fairly general. This results in degraded retrieval performance. In this work we address the above issues by extending the lexical word based translation model to incorporate semantic concepts (entities). We explore strategies to learn the translation probabilities between words and the concepts using the Q&A; archives and a popular entity catalog. Experiments conducted on a large scale real data show that the proposed techniques are promising.

5 0.96826732 61 emnlp-2012-Grounded Models of Semantic Representation

Author: Carina Silberer ; Mirella Lapata

Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.

6 0.94068813 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

7 0.89193255 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

8 0.86011744 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation

9 0.83930689 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging

10 0.83859402 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

11 0.83479422 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing

12 0.83195388 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

13 0.83080149 19 emnlp-2012-An Entity-Topic Model for Entity Linking

14 0.82783616 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

15 0.81431973 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction

16 0.80847698 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

17 0.80464923 72 emnlp-2012-Joint Inference for Event Timeline Construction

18 0.80021447 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

19 0.79361802 13 emnlp-2012-A Unified Approach to Transliteration-based Text Input with Online Spelling Correction

20 0.7898553 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models