emnlp emnlp2012 emnlp2012-19 knowledge-graph by maker-knowledge-mining

19 emnlp-2012-An Entity-Topic Model for Entity Linking


Source: pdf

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of a mention's context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of a document's topic coherence, assuming that “a mention's referent entity should be coherent with the document's main topics”. In this paper, we propose a generative model called the entity-topic model to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. [sent-6, score-0.684]

2 Given many name mentions in a document, the goal of EL is to predict their referent entities in a given knowledge base (KB), such as the Wikipedia1. [sent-13, score-0.714]

3 As shown in Figure 1, an EL system should identify that the referent entities of the three mentions WWDC, Apple and Lion are, correspondingly, the entities Apple Worldwide Developers Conference, Apple Inc. [sent-16, score-0.716]

4 For instance, in many applications we need to collect all appearances of a specific entity in different documents; EL is an effective way to resolve such an information integration problem. [sent-19, score-0.314]

5 Traditionally, there have been two distinct directions in EL to resolve the name ambiguity problem: one focusing on the effects of the mention's context compatibility and the other dealing with the effects of the document's topic coherence. [sent-29, score-0.904]

6 EL methods based on context compatibility assume that “the referent entity of a mention is reflected by its context” (Mihalcea & Csomai, 2007; Zhang et al. [sent-34, score-1.004]

7 For example, the context compatibility based methods will identify the referent entity of the mention Lion in Figure 1 as the entity Mac OS X Lion, since this entity is more compatible with its context words operating system and release than other candidates such as Lion (big cats) or Lion (band). [sent-38, score-1.766]

8 EL methods based on topic coherence assume that “a mention's referent entity should be coherent with the document's main topics” (Medelyan et al. [sent-39, score-1.111]

9 For example, the topic coherence based methods will link the mention Apple in Figure 1 to the entity Apple Inc. [sent-43, score-0.932]

10 , since it is more coherent with the document’s topic MAC OS X Lion Release than other referent candidates such as Apple (band) or Apple Bank. [sent-44, score-0.495]

11 , the context compatibility and the topic coherence are first separately modeled, and then their EL evidence is combined through an additional model. [sent-51, score-0.88]

12 The main drawback of these hybrid methods, however, is that they model the context compatibility and the topic coherence separately, which makes it difficult to capture the mutual reinforcement effect between the above two directions. [sent-53, score-0.942]

13 That is, the topic coherence and the context compatibility are highly correlated and their evidence can be used to reinforce each other in EL decisions. [sent-54, score-0.905]

14 For example, in Figure 1, if the context compatibility gives a high likelihood that the mention Apple refers to the entity Apple Inc. [sent-55, score-0.857]

15 , then this likelihood will give more evidence that this document's topic is about MAC OS X Lion, and it in turn will reinforce the topic coherence between the entity MAC OS X Lion and the document. [sent-56, score-1.08]

16 In reverse, once we know the topic of this document is about MAC OS X Lion, the context compatibility between the mention Apple and the entity Apple Inc. [sent-57, score-1.263]

17 can be improved as the importance of the context words operating system and release will be increased using the topic knowledge. [sent-58, score-0.348]

18 In this way, we believe that modeling the above two directions jointly, rather than separately, will further improve the EL performance by capturing the mutual reinforcement effect between the context compatibility and the topic coherence. [sent-59, score-0.789]

19 In this paper, we propose a method to jointly model and exploit the context compatibility, the topic coherence and the correlation between them for better EL performance. [sent-60, score-0.588]

20 tends to occur in documents about IT, but the entity Apple Bank is more likely to occur in documents about banking or investment. [sent-63, score-0.37]

21 2) Context compatibility assumption: The context words of a mention should be centered on its referent entity. [sent-64, score-0.757]

22 For example, the words computer, phone and music tend to occur in the context of the entity Apple Inc. [sent-65, score-0.381]

23 , while the words loan, invest and deposit are more likely to occur in the context of the entity Apple Bank. [sent-66, score-0.381]

24 And the EL problem can now be decomposed into the following two inference tasks: 1) Predicting the underlying topics and the underlying entities of a document based on the observed information and the global knowledge. [sent-68, score-0.361]

25 Notice that the topic knowledge, the entity name knowledge and the entity context knowledge are not given in advance; thus we need to estimate them from data. [sent-70, score-1.211]

26 We propose a generative probabilistic model, the entity-topic model, which can jointly model and exploit the context compatibility, the topic coherence and the correlation between them for better EL performance; [sent-75, score-0.618]

27 In the following, we first demonstrate how to capture the context compatibility, the topic coherence and the correlation between them in the document generative process; then we incorporate the global knowledge generation into our model for knowledge estimation from data. [sent-84, score-0.887]

28 1 Document Generative Process As shown in Section 1, we jointly model the context compatibility and the topic coherence as statistical dependencies in the entity-topic model, by assuming that all documents are generated in a topically coherent and context compatible way. [sent-86, score-1.002]

29 Formally, we represent a document as: a document is a collection of M mentions and N words, denoted as d = {m_1, …, m_M; w_1, …, w_N}, with m_i the ith mention and w_j the jth word. [sent-91, score-0.534]
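
As a concrete illustration of this representation, here is a minimal Python sketch; the class and the example values are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    """A document d = {m_1, ..., m_M; w_1, ..., w_N}."""
    mentions: List[str]  # the M name mentions
    words: List[str]     # the N remaining context words

# The document of Figure 1, reduced to this representation:
doc = Document(
    mentions=["WWDC", "Apple", "Lion"],
    words=["conference", "operating", "system", "release"],
)
```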

30 Topic Knowledge φ (The entity distribution of topics): In our model, all entities in a document are generated based on its underlying topics, with each topic being a group of semantically related entities. [sent-94, score-0.905]

31 Statistically, we model each topic as a multinomial distribution over entities, with the probability indicating the likelihood that an entity is extracted from this topic. [sent-95, score-0.639]

32 08, …}, indicating that the likelihood of the entity Steve Jobs being extracted from this topic is 0. [sent-99, score-0.595]

33 Entity Name Knowledge ψ (The name distribution of entities): In our model, all name mentions are generated using the name knowledge of their referent entities. [sent-102, score-0.789]

34 Specifically, we model the name knowledge of an entity as a multinomial distribution over its names, with the probability indicating the likelihood that this entity is mentioned by that name. [sent-103, score-0.84]

35 For example, the name knowledge of the entity Apple Inc. [sent-104, score-0.482]

36 Entity Context Knowledge ξ (The context word distribution of entities): In our model, all context words of an entity's mention are generated using its context knowledge. [sent-114, score-0.368]

37 Concretely, we model the context knowledge of an entity as a multinomial distribution over words, with the probability indicating the likelihood of a word appearing in this entity's context. [sent-115, score-0.52]

38 002, }, indicating that the word computer appearing in the context of the entity Apple Inc. [sent-120, score-0.409]

39 To demonstrate the generation process, we show how the document in Figure 1 can be generated using our model in the following steps: Step 1: The model generates the topic distribution of the document as θ_d = {Apple Inc. [sent-124, score-0.605]

40 According to the topic distribution θ_d, the model generates their topic assignments as z1 = Apple Inc. [sent-128, score-0.775]

41 According to the topic knowledge φ_Apple, φ_OS and the topic assignments z1, z2, z3, the model generates their entity assignments as e1 = Apple Worldwide Developers Conference, e2 = Apple Inc. [sent-131, score-1.251]

42 According to the referent entity set in the document e_d = {Apple Worldwide Developers Conference, Apple Inc. [sent-136, score-0.653]

43 , Mac OS X Lion}, the model generates the target entities they describe as a3 = Apple Worldwide Developers Conference and a4 = Apple Inc. [sent-137, score-0.344]

44 According to their target entities and the context knowledge of these entities, the model generates the context words in the document. [sent-139, score-0.545]

45 For example, according to the context knowledge of the entity Apple Worldwide Developers Conference, the model generates its context word w3 = conference, and according to the context knowledge of the entity Apple Inc. [sent-140, score-0.82]
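
Putting the steps together, the generative story can be simulated end to end. The following NumPy sketch uses toy values for the Dirichlet prior α and for the global knowledge φ (topic → entity), ψ (entity → name) and ξ (entity → word); in the paper these are estimated from data, so everything below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

topics = ["Apple Inc.", "Mac OS X Lion"]
entities = ["Apple Worldwide Developers Conference", "Apple Inc.", "Mac OS X Lion"]
names = ["WWDC", "Apple", "Lion"]
vocab = ["conference", "developer", "operating", "system", "release"]

phi = np.array([[0.5, 0.4, 0.1],    # P(entity | topic), one row per topic
                [0.2, 0.1, 0.7]])
psi = np.array([[0.9, 0.05, 0.05],  # P(name | entity), one row per entity
                [0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])
xi = np.array([[0.6, 0.3, 0.03, 0.03, 0.04],  # P(word | entity)
               [0.1, 0.3, 0.2, 0.2, 0.2],
               [0.05, 0.05, 0.3, 0.3, 0.3]])
alpha = 1.0

def generate_document(n_mentions=3, n_words=4):
    # Step 1: draw the document's topic distribution theta_d.
    theta_d = rng.dirichlet(alpha * np.ones(len(topics)))
    mention_entities, mentions, words = [], [], []
    for _ in range(n_mentions):
        z = rng.choice(len(topics), p=theta_d)   # topic assignment z_i
        e = rng.choice(len(entities), p=phi[z])  # entity assignment e_i ~ phi_z
        m = rng.choice(names, p=psi[e])          # name mention m_i ~ psi_e
        mention_entities.append(e)
        mentions.append(m)
    for _ in range(n_words):
        a = rng.choice(mention_entities)         # word's target entity a_j from e_d
        w = rng.choice(vocab, p=xi[a])           # context word w_j ~ xi_a
        words.append(w)
    return mentions, words

print(generate_document())
```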

46 Furthermore, the generation of topics, entities, mentions and words is highly correlated; thus our model can capture the correlation between the topic coherence and the context compatibility. [sent-143, score-0.722]

47 2 Global Knowledge Generative Process The entity-topic model relies on three types of global knowledge (including the topic knowledge, the entity name knowledge and the entity context knowledge) to generate a document. [sent-145, score-1.248]

48 3 Inference using Gibbs Sampling In this section, we describe how to resolve the entity linking problem using the entity-topic model. [sent-155, score-0.383]

49 Given a document d, predicting its entity assignments (e_d for mentions and a_d for words) and topic assignments (z_d). [sent-157, score-1.184]

50 Notice that here the EL decisions are just the prediction of the per-mention entity assignments (e_d). [sent-158, score-0.453]

51 Given a corpus D = {d1, d2, …, dD}, estimating the global knowledge (including the entity distribution of topics φ, the name distribution of entities ψ, and the context word distribution of entities ξ) from data. [sent-160, score-0.776]

52 In Gibbs sampling, we first construct the posterior distribution P(z, e, a | D); this posterior distribution is then used to: 1) estimate θ, φ, ψ and ξ, and 2) predict the entities and the topics of all documents in D. [sent-165, score-0.315]
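
Written out, and assuming the conditional independencies of the generative story above, this posterior factorizes into the five terms enumerated next (equations 3.1-3.5):

```latex
P(\mathbf{z}, \mathbf{e}, \mathbf{a} \mid D) \;\propto\;
P(\mathbf{z})\, P(\mathbf{e} \mid \mathbf{z})\,
P(\mathbf{m} \mid \mathbf{e})\, P(\mathbf{a} \mid \mathbf{e})\,
P(\mathbf{w} \mid \mathbf{a})
```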

53 (3.1) is the probability of the joint topic assignment z to all mentions m in corpus D, and P(\mathbf{e} \mid \mathbf{z}) = \left( \frac{\Gamma(E\beta)}{\Gamma(\beta)^{E}} \right)^{T} \prod_{t=1}^{T} \frac{\prod_{e} \Gamma(\beta + C^{TE}_{te})}{\Gamma(E\beta + C^{TE}_{t*})} (3.2) [sent-169, score-0.495]

54 (3.2) is the conditional probability of the joint entity assignments e to all mentions m in corpus D given all topic assignments z, and P(\mathbf{m} \mid \mathbf{e}) = \left( \frac{\Gamma(K\gamma)}{\Gamma(\gamma)^{K}} \right)^{E} \prod_{e=1}^{E} \frac{\prod_{m} \Gamma(\gamma + C^{EM}_{em})}{\Gamma(K\gamma + C^{EM}_{e*})} (3.3) [sent-170, score-1.034]

55 (3.3) is the conditional probability of all mentions m given all per-mention entity assignments e, and P(\mathbf{a} \mid \mathbf{e}) = \prod_{d=1}^{D} \prod_{e \in \mathbf{e}_d} \left( \frac{C^{DE}_{de}}{C^{DE}_{d*}} \right)^{C^{DA}_{de}} (3.4) [sent-171, score-0.654]

56 (3.4) is the conditional probability of the joint entity assignments a to all words w in corpus D given all per-mention entity assignments e, and P(\mathbf{w} \mid \mathbf{a}) = \left( \frac{\Gamma(V\delta)}{\Gamma(\delta)^{V}} \right)^{E} \prod_{e=1}^{E} \frac{\prod_{w} \Gamma(\delta + C^{EW}_{ew})}{\Gamma(V\delta + C^{EW}_{e*})} (3.5) [sent-172, score-0.906]

57 (3.5) is the conditional probability of all words w given all per-word entity assignments a. [sent-173, score-0.485]

58 In all the above formulas, Γ(·) is the Gamma function, C^{DT}_{dt} is the number of times topic t has been assigned to mentions in document d, C^{DT}_{d*} = \sum_{t} C^{DT}_{dt} is the total number of topic assignments in document d, and C^{TE}_{te}, C^{EM}_{em}, C^{DE}_{de}, C^{DA}_{de}, C^{EW}_{ew} have similar explanations. [sent-174, score-1.037]

59 For the entity-topic model, each state in the Markov chain is an assignment (including a topic assignment to a mention, an entity assignment to a mention and an entity assignment to a word). [sent-176, score-1.244]

60 Finally, using the above three conditional distributions, we iteratively update all assignments of corpus D until convergence; then the global knowledge is estimated using the final assignments, and the final entity assignments are used as the referents of their corresponding mentions. [sent-182, score-0.696]
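
To make one such update concrete, here is a schematic collapsed-Gibbs step for a single mention's topic assignment, written with the count notation above. The conditional it samples from, proportional to (C^{DT}_{dt} + α)(β + C^{TE}_{te}) / (Eβ + C^{TE}_{t*}), is the standard collapsed form implied by equations (3.1)-(3.2) with an assumed document-topic hyperparameter α; treat it as an illustrative sketch rather than the paper's exact sampler.

```python
import numpy as np

def resample_mention_topic(i, d, e_i, z, CDT, CTE, alpha, beta, rng):
    """One collapsed-Gibbs update for the topic assignment z[i] of mention i in doc d.

    CDT[d, t] -- times topic t is assigned to mentions of document d.
    CTE[t, e] -- times entity e is generated from topic t (corpus-wide).
    """
    t_old = z[i]
    CDT[d, t_old] -= 1            # exclude the current assignment from the counts
    CTE[t_old, e_i] -= 1

    E = CTE.shape[1]
    # Unnormalized conditional P(z_i = t | rest) for every topic t at once.
    p = (CDT[d] + alpha) * (beta + CTE[:, e_i]) / (E * beta + CTE.sum(axis=1))
    p /= p.sum()
    t_new = rng.choice(len(p), p=p)

    CDT[d, t_new] += 1            # restore the counts with the new assignment
    CTE[t_new, e_i] += 1
    z[i] = t_new
    return t_new
```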

61 , we iteratively update the entity assignments and the topic assignments of an unseen document in the same way as the above inference process, but with the previously learned global knowledge fixed. [sent-187, score-1.102]

62 For γ, we notice that Kγ is the number of pseudo names added to each entity; when γ = 0 our model only mentions an entity using its previously used names. [sent-190, score-0.475]

63 Observing that an entity typically has a fixed set of names, we set γ to a small value by setting Kγ = 1.0. [sent-191, score-0.314]

64 As there is typically a relatively loose correlation between an entity and its context words, we set δ to a relatively large value by fixing the total number of smoothing words added to each entity; a typical value is Vδ = 2000. [sent-193, score-0.415]
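
In code, these settings amount to deriving the per-name and per-word Dirichlet parameters from the name-inventory size K and the vocabulary size V; the sizes below are hypothetical placeholders.

```python
K = 200_000         # hypothetical: number of distinct names known to the model
V = 100_000         # hypothetical: context-word vocabulary size

gamma = 1.0 / K     # total pseudo-name mass per entity: K * gamma = 1.0
delta = 2000.0 / V  # total smoothing-word mass per entity: V * delta = 2000
```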

65 ) as entities, so the entity in this paper may not strictly follow its definition. [sent-204, score-0.314]

66 For each document, the name mentions’ referent entities in Wikipedia are manually annotated to be as exhaustive as possible. [sent-215, score-0.456]

67 In total, 17,200 name mentions are annotated, with 161 name mentions per document on average. [sent-216, score-0.649]

68 In our experiments, we use only the name mentions whose referent entities are contained in Wikipedia. [sent-217, score-0.617]

69 This is a context compatibility based EL method using a vector space model (Mihalcea & Csomai, 2007). [sent-231, score-0.42]

70 Wikify! computes the context compatibility using the word overlap between the mention's context and the entity's Wikipedia entry. [sent-233, score-0.487]

71 This is a statistical context compatibility based EL method described in Han & Sun (2011), which computes the compatibility by integrating the evidence from the entity popularity, the entity name knowledge and the context word distribution of entities. [sent-235, score-1.68]

72 This is a relational topic coherence based EL method described in Milne & Witten (2008). [sent-237, score-0.46]

73 M&W measures an entity's topic coherence to a document as its average semantic relatedness to the unambiguous entities in the document. [sent-238, score-0.726]
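
The relatedness underlying M&W is the Wikipedia link-based measure of Milne & Witten (2008), computed from the sets of articles that link to each entity. A sketch of the measure and of the averaged coherence score just described (the inlink sets and article count would come from a Wikipedia dump):

```python
import math

def mw_relatedness(inlinks_a: set, inlinks_b: set, n_articles: int) -> float:
    """Milne & Witten (2008) link-based semantic relatedness of two entities."""
    overlap = len(inlinks_a & inlinks_b)
    small = min(len(inlinks_a), len(inlinks_b))
    if overlap == 0 or small == 0:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    dist = (math.log(big) - math.log(overlap)) / (math.log(n_articles) - math.log(small))
    return max(0.0, 1.0 - dist)

def topic_coherence(candidate_inlinks, unambiguous_inlink_sets, n_articles):
    """Average relatedness of a candidate entity to the document's unambiguous entities."""
    scores = [mw_relatedness(candidate_inlinks, u, n_articles)
              for u in unambiguous_inlink_sets]
    return sum(scores) / len(scores) if scores else 0.0
```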

74 This is an EL method which combines context compatibility and topic coherence using a hybrid method (Kulkarni et al. [sent-240, score-0.915]

75 Except for CSAW and EL-Graph, all other baselines are designed only to link the salient name mentions (i. [sent-245, score-0.326]

76 From the overall results in Table 1, we can see that: 1) By jointly modeling and exploiting the context compatibility and the topic coherence, our method can achieve competitive performance: ① compared with the context compatibility baselines Wikify! [sent-270, score-1.148]

77 and EM-Model, our method correspondingly gets 43% and 19% F1 improvement; ② compared with the topic coherence baseline M&W, our method achieves 28% F1 improvement; ③ compared with the hybrid baselines CSAW and EL-Graph, our method correspondingly achieves 11% and 7% F1 improvement. [sent-271, score-0.613]

78 Generally, we believe the main advantages of our method are: 1) The effects of topic knowledge. [sent-286, score-0.342]

79 One main advantage of our model is that the topic knowledge can provide a document-specific entity prior for EL. [sent-287, score-0.662]

80 Concretely, using the topic knowledge and the topic distribution of documents, the prior for an entity appearing in a document d is highly related to the document's topics: P(e|d) = Σ_z P(z|d) P(e|z). This prior is obviously more reasonable than the “information-less prior” (i. [sent-288, score-1.14]

81 , all entities have an equal prior) or “a global entity popularity prior” (Han & Sun, 2011). [sent-290, score-0.551]
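
The document-specific prior P(e|d) is just a mixture over the document's topics, so it reduces to one matrix-vector product; a minimal sketch, where phi is the topic-to-entity matrix and theta_d the document's inferred topic proportions:

```python
import numpy as np

def entity_prior(theta_d: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """P(e | d) = sum_z P(z | d) * P(e | z): one prior value per entity."""
    return theta_d @ phi  # shapes: (T,) @ (T, E) -> (E,)
```

For a document concentrated on the Mac OS X Lion topic, this puts far more prior mass on Apple Inc. than on Apple (band), matching the example discussed here.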

82 We can see that the topic knowledge can provide a reasonable prior for entities appearing in a document: the Apple Inc. [sent-293, score-0.517]

83 838 correspondingly), we believe this is because: ① the mentions to be linked in the TAC data set are mostly salient mentions; ② the influence of the NIL referent entity problem, i. [sent-315, score-0.745]

84 , the referent entity is not contained in the given knowledge base: Most referent entities (67. [sent-317, score-0.95]

85 5%) on TAC 2009 are NIL entities, and our method has no special handling for this problem, unlike other methods such as the EM-Model, which affects the overall performance of our method. [sent-318, score-0.314]

86 Traditionally, the context compatibility based methods link a mention to the entity which has the largest compatibility with it. [sent-321, score-1.245]

87 Cucerzan (2007) modeled the compatibility as the cosine similarity between the vector space representation of mention’s context and of entity’s Wikipedia entry. [sent-322, score-0.454]
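
A bag-of-words sketch of that cosine compatibility; the whitespace tokenization and raw term counts are simplifications of Cucerzan's actual representation:

```python
from collections import Counter
import math

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def context_compatibility(mention_context: str, entity_entry: str) -> float:
    """Cosine similarity between a mention's context and an entity's Wikipedia entry."""
    return cosine(Counter(mention_context.lower().split()),
                  Counter(entity_entry.lower().split()))
```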

88 (2011) extended the vector space model with more information such as the entity category and the acronym expansion, etc. [sent-326, score-0.314]

89 Han & Sun (2011) proposed a generative model which computes the compatibility using the evidence from an entity's popularity, name distribution and context word distribution. [sent-327, score-0.595]

90 (2011) and Sen (2012) used a latent topic model to learn the context model of entities. [sent-329, score-0.348]

91 On the other side, the topic coherence based methods link a mention to the entity which is most coherent with the document containing it. [sent-335, score-1.057]

92 (2008) measured the topic coherence of an entity to a document as the weighted average of its relatedness to the unambiguous entities in the document. [sent-337, score-1.04]

93 Bhattacharya and Getoor (2006) modeled the topic coherence as the likelihood an entity is generated from the latent topics of a document. [sent-340, score-0.866]

94 Sen (2012) modeled the topic coherence as the groups of co-occurring entities. [sent-341, score-0.494]

95 (2009) modeled the topic coherence as the sum of all pair-wise relatedness between the referent entities of a document. [sent-343, score-0.849]

96 (2011) modeled the topic coherence of an entity as its node importance in a graph which captures all mention-entity and entity-entity relations in a document. [sent-346, score-0.808]

97 6 Conclusions and Future Work This paper proposes a generative model, the entity-topic model, for entity linking. [sent-347, score-0.625]

98 By uniformly modeling the context compatibility, the topic coherence and the correlation between them as statistical dependencies, our model provides an effective way to jointly exploit them for better EL performance. [sent-348, score-0.588]

99 In this paper, the entity-topic model can only link mentions to the previously given entities in a knowledge base. [sent-349, score-0.404]

100 For future work, we want to overcome this limit by incorporating an entity discovery ability into our model, so that it can also discover and learn the knowledge of previously unseen entities from a corpus for linking name mentions to these entities. [sent-350, score-0.853]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('apple', 0.377), ('compatibility', 0.353), ('entity', 0.314), ('topic', 0.281), ('el', 0.264), ('referent', 0.214), ('lion', 0.202), ('coherence', 0.179), ('mentions', 0.161), ('dd', 0.16), ('entities', 0.141), ('pp', 0.139), ('assignments', 0.139), ('document', 0.125), ('mention', 0.123), ('ee', 0.103), ('name', 0.101), ('mac', 0.088), ('han', 0.081), ('tac', 0.081), ('os', 0.081), ('eejjzz', 0.076), ('worldwide', 0.076), ('linking', 0.069), ('context', 0.067), ('knowledge', 0.067), ('developers', 0.065), ('tt', 0.064), ('kulkarni', 0.064), ('correspondingly', 0.059), ('zz', 0.059), ('topics', 0.058), ('jj', 0.057), ('assignment', 0.053), ('wikipedia', 0.052), ('cccc', 0.05), ('ddiirr', 0.05), ('entitytopic', 0.05), ('kataria', 0.05), ('wwdc', 0.05), ('medelyan', 0.049), ('milne', 0.048), ('aa', 0.045), ('wikify', 0.045), ('gibbs', 0.045), ('distribution', 0.044), ('eedd', 0.043), ('sen', 0.043), ('mm', 0.04), ('vv', 0.04), ('mcnamee', 0.039), ('nil', 0.039), ('aajjdd', 0.038), ('band', 0.038), ('csa', 0.038), ('eeii', 0.038), ('iitb', 0.038), ('global', 0.037), ('doc', 0.036), ('sun', 0.036), ('link', 0.035), ('witten', 0.035), ('hybrid', 0.035), ('correlation', 0.034), ('directions', 0.034), ('modeled', 0.034), ('kk', 0.034), ('effects', 0.034), ('sampling', 0.033), ('ww', 0.032), ('cc', 0.03), ('generative', 0.03), ('base', 0.03), ('generates', 0.03), ('hyperparameter', 0.029), ('salient', 0.029), ('ii', 0.028), ('appearing', 0.028), ('documents', 0.028), ('dirichlet', 0.028), ('jointly', 0.027), ('csomai', 0.027), ('believe', 0.027), ('reinforcement', 0.027), ('collective', 0.026), ('aadd', 0.025), ('aapppplleeiinncc', 0.025), ('bhattacharya', 0.025), ('ccdddd', 0.025), ('ccddddeeaa', 0.025), ('ccddddtttt', 0.025), ('cceeee', 0.025), ('eeddjj', 0.025), ('gottipati', 0.025), ('mmdd', 0.025), ('mmddjjeedd', 0.025), ('reinforce', 0.025), ('wwdd', 0.025), ('traditionally', 0.025), ('dang', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 19 emnlp-2012-An Entity-Topic Model for Entity Linking

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of a mention's context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of a document's topic coherence, assuming that “a mention's referent entity should be coherent with the document's main topics”. In this paper, we propose a generative model called the entity-topic model to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.

2 0.20661393 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

3 0.19692859 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

Author: Lev Ratinov ; Dan Roth

Abstract: We explore the interplay of knowledge and structure in co-reference resolution. To inject knowledge, we use a state-of-the-art system which cross-links (or “grounds”) expressions in free text to Wikipedia. We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. To maximize the utility of the injected knowledge, we deploy a learningbased multi-sieve approach and develop novel entity-based features. Our end system outperforms the state-of-the-art baseline by 2 B3 F1 points on non-transcript portion of the ACE 2004 dataset.

4 0.1846831 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao

Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic in- formation provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.

5 0.17434064 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allows for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation ofdocuments and words in a corpus.

6 0.14767028 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

7 0.13935721 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

8 0.13243994 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

9 0.13157649 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

10 0.12865347 84 emnlp-2012-Linking Named Entities to Any Database

11 0.12303295 41 emnlp-2012-Entity based QA Retrieval

12 0.11094415 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

13 0.10480966 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

14 0.088731959 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

15 0.086456582 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

16 0.086022355 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

17 0.08160165 72 emnlp-2012-Joint Inference for Event Timeline Construction

18 0.075573862 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic

19 0.071361557 97 emnlp-2012-Natural Language Questions for the Web of Data

20 0.063266829 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.226), (1, 0.3), (2, 0.054), (3, -0.058), (4, -0.308), (5, 0.027), (6, -0.039), (7, -0.009), (8, -0.165), (9, -0.069), (10, -0.015), (11, -0.063), (12, 0.138), (13, 0.008), (14, 0.081), (15, 0.054), (16, 0.221), (17, 0.046), (18, -0.001), (19, 0.1), (20, 0.031), (21, -0.003), (22, 0.001), (23, 0.034), (24, -0.125), (25, 0.079), (26, -0.001), (27, -0.083), (28, -0.024), (29, -0.146), (30, 0.019), (31, 0.038), (32, 0.051), (33, 0.125), (34, 0.165), (35, 0.056), (36, -0.103), (37, 0.028), (38, -0.021), (39, -0.026), (40, 0.024), (41, 0.053), (42, -0.009), (43, -0.078), (44, 0.04), (45, 0.01), (46, -0.05), (47, -0.002), (48, 0.041), (49, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98436302 19 emnlp-2012-An Entity-Topic Model for Entity Linking

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of a mention's context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of a document's topic coherence, assuming that “a mention's referent entity should be coherent with the document's main topics”. In this paper, we propose a generative model called the entity-topic model to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.

2 0.67842495 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

3 0.61118031 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allows for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation ofdocuments and words in a corpus.

4 0.60341191 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao

Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic in- formation provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.

5 0.59422207 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

Author: Lev Ratinov ; Dan Roth

Abstract: We explore the interplay of knowledge and structure in co-reference resolution. To inject knowledge, we use a state-of-the-art system which cross-links (or “grounds”) expressions in free text to Wikipedia. We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. To maximize the utility of the injected knowledge, we deploy a learningbased multi-sieve approach and develop novel entity-based features. Our end system outperforms the state-of-the-art baseline by 2 B3 F1 points on non-transcript portion of the ACE 2004 dataset.

6 0.58342904 84 emnlp-2012-Linking Named Entities to Any Database

7 0.56890643 41 emnlp-2012-Entity based QA Retrieval

8 0.521402 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

9 0.48339504 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

10 0.46575838 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

11 0.39212003 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

12 0.36132011 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

13 0.35714105 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

14 0.34980989 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

15 0.32920629 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

16 0.31173846 61 emnlp-2012-Grounded Models of Semantic Representation

17 0.29141086 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

18 0.27834719 97 emnlp-2012-Natural Language Questions for the Web of Data

19 0.26118886 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

20 0.24955299 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.017), (16, 0.025), (25, 0.013), (28, 0.31), (34, 0.053), (60, 0.191), (63, 0.092), (64, 0.017), (65, 0.036), (70, 0.027), (73, 0.011), (74, 0.034), (76, 0.021), (80, 0.013), (86, 0.013), (95, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80521262 19 emnlp-2012-An Entity-Topic Model for Entity Linking

Author: Xianpei Han ; Le Sun

Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of a mention's context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of a document's topic coherence, assuming that “a mention's referent entity should be coherent with the document's main topics”. In this paper, we propose a generative model called the entity-topic model to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.

2 0.57953006 61 emnlp-2012-Grounded Models of Semantic Representation

Author: Carina Silberer ; Mirella Lapata

Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.

3 0.57540983 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation

Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn

Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are ofreasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.

4 0.57528061 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson

Abstract: The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model.

5 0.57449955 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

6 0.56515229 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

7 0.56462252 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

8 0.56090474 84 emnlp-2012-Linking Named Entities to Any Database

9 0.56059784 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs

10 0.55991441 41 emnlp-2012-Entity based QA Retrieval

11 0.55833417 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

12 0.55506176 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

13 0.5539313 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

14 0.5530495 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing

15 0.55219406 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

16 0.55053103 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models

17 0.55029941 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

18 0.54966629 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging

19 0.54924518 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

20 0.548406 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules