emnlp emnlp2011 emnlp2011-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Swapna Gottipati ; Jing Jiang
Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. [sent-4, score-0.656]
2 We use both local contexts and global world knowledge to expand query language models. [sent-5, score-0.587]
3 Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. [sent-7, score-0.351]
4 This task of linking mentions of entities within specific contexts to their corresponding entries in an existing knowledge base is called entity linking and has been proposed and studied in the Knowledge Base Population (KBP) track of the Text Analysis Conference (TAC) (McNamee and Dang, 2009). [sent-11, score-0.534]
5 Besides improving an online surfer’s browsing experience, entity linking also has potential uses in other applications. [sent-12, score-0.241]
6 The major challenge of entity linking is to resolve name ambiguities. [sent-15, score-0.437]
7 (2) Synonymy: This type of ambiguity arises when more than one name variation refers to the same entity. [sent-25, score-0.274]
8 Synonymy affects entity linking when the entity mention in the document uses a name variation not covered in the entity’s knowledge base entry. [sent-30, score-0.691]
9 Intuitively, to disambiguate a polysemous entity name, we should make use of the context in which the name occurs, and to address synonymy, external world knowledge is usually needed to expand acronyms or find other name variations. [sent-31, score-0.742]
10 We use the KL-divergence retrieval model (Zhai and Lafferty, 2001) and expand the query language models by considering both the local contexts within the query documents and global world knowledge obtained from the Web. [sent-38, score-1.021]
11 We evaluate our retrieval method with query expansion on the 2010 TAC-KBP data set. [sent-41, score-0.483]
12 We find that our expanded query language models can indeed improve the performance significantly, demonstrating the effectiveness of our principled and yet simple techniques. [sent-42, score-0.531]
13 Each KB entry E represents a unique entity and has three fields: (1) a name string NE, which can be regarded as the official name of the entity, (2) an entity type TE, which is one of {PER, ORG, GPE, UNKNOWN}, and (3) some disambiguation text DE. [sent-50, score-0.974]
14 Given a query Q, which consists of a query name string NQ and a query document DQ where the name occurs, the task is to return a single KB entry to which the query name string refers, or Nil if there is no such KB entry. [sent-51, score-2.572]
15 It is fairly natural to address entity linking by ranking the KB entries given a query. [sent-52, score-0.321]
16 In this section we present an overview of our system, which consists of two major stages: a candidate selection stage to identify a set of candidate KB entries through name matching, and a ranking stage to link the query entity to the most likely KB entry. [sent-53, score-1.071]
17 In both stages, we consider the query’s local context in the query document and world knowledge obtained from the Web. [sent-54, score-0.567]
18 Intuitively, we determine whether two entities are the same by comparing their name strings. [sent-60, score-0.288]
19 We therefore need to compare the query name string NQ with the name string NE of each KB entry. [sent-61, score-0.975]
20 However, because of the name ambiguity problem, we cannot expect the correct KB entry to always have exactly the same name string as the query. [sent-62, score-0.641]
21 To address this problem, we use a set of alternative name strings expanded from NQ and select KB entries whose name strings match at least one of them. [sent-63, score-0.793]
22 These alternative name strings come from two sources: the query document DQ and the Web. [sent-64, score-0.76]
23 First, we observe that some useful alternative name strings come from the query document. [sent-65, score-0.712]
24 For example, a PER query name string may contain only a person’s last name but the query document contains the person’s full name, which is clearly a less ambiguous name string to use. [sent-66, score-1.631]
25 Similarly, a GPE query name string may contain only the name of a city or town but the query document contains the state or province, which also helps disambiguate the query entity. [sent-67, score-1.733]
26 Given query Q, let SQ denote the set of alternative name strings, which is constructed as follows. [sent-69, score-0.393]
27 We first use an off-the-shelf NER tagger to identify named entities from the query document DQ. [sent-73, score-0.549]
28 We denote these alternative name strings as {N^l_{Q,i}} for i = 1, ..., KQ, where l indicates that these name strings come locally from DQ and KQ is the total number of such name strings. [sent-76, score-0.819]
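As a rough sketch of how such local alternative name strings might be assembled (the exact construction rule is not spelled out in this excerpt; the pairing below is inferred from the "Mobile Mount Vernon" and "Mobile Alabama" examples in Figure 1, so treat it as an assumption), assuming the nearby named entities have already been extracted:

```python
def local_name_variants(query_name, nearby_entities):
    """Build local alternative name strings N^l_Q,i by pairing the query name
    string with named entities found in the query document D_Q."""
    variants = []
    for ent in nearby_entities:
        if ent and ent != query_name:
            # e.g. "Mobile" + "Alabama" -> "Mobile Alabama"
            variants.append(query_name + " " + ent)
    return variants
```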
29 Sometimes alternative name strings have to come from external knowledge. [sent-80, score-0.339]
30 For example, one of the queries we have contains the name string “AMPAS,” and the query document also uses only this acronym to refer to this entity. [sent-81, score-0.796]
31 But the full name of the entity, “Academy of Motion Pictures Arts and Sciences,” is needed in order to locate the correct KB entry. [sent-82, score-0.233]
32 Given query name string NQ, we check whether the following link exists: http://en. [sent-84, score-0.71]
33 So if the link exists, we use the title of the Wikipedia page as another alternative name string for NQ. [sent-88, score-0.351]
34 We refer to this name string as NQg to indicate that it is a global name variant. [sent-89, score-0.54]
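A hedged illustration of this lookup, assuming the truncated link above is the English Wikipedia article URL for the query name (the exact URL pattern and the redirect handling below are assumptions, not taken from the text):

```python
import urllib.error
import urllib.parse
import urllib.request

def wikipedia_name_variant(query_name):
    """Return the Wikipedia page title as a global name variant N_Q^g,
    or None if no such page exists."""
    url = ("https://en.wikipedia.org/wiki/"
           + urllib.parse.quote(query_name.replace(" ", "_")))
    req = urllib.request.Request(url, headers={"User-Agent": "entity-linking-sketch"})
    try:
        with urllib.request.urlopen(req) as resp:
            # the last path component of the final URL is used as the title
            # (wiki-internal #REDIRECT pages are not resolved here)
            title = resp.geturl().rsplit("/", 1)[-1]
            return urllib.parse.unquote(title).replace("_", " ")
    except urllib.error.HTTPError:
        return None
```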
35 For each name string N in SQ, we find KB entries whose name strings match N. [sent-92, score-0.64]
36 We take the union of these matched KB entries. Query name string (NQ): Mobile. Query document (DQ): The site is near Mount Vernon in the Calvert community on the Tombigbee River, some 25 miles (40 kilometers) north of Mobile. [sent-93, score-0.339]
37 Alternative query strings (SQ), from local context: Mobile, Mobile Mount Vernon, Mobile Calvert, Mobile River, Mobile Mexico, Mobile Alabama, Mobile Brazil. Figure 1: An example GPE query from TAC 2010. [sent-96, score-0.393]
38 Query name string (NQ): Coppola Query document (DQ): I had no idea of all these semi-obscure connections, felicia! [sent-97, score-0.339]
39 I think I once saw a picture of him sometime ago. Alternative query strings (SQ), from local context: Coppola, Sophia Coppola, Sofia Coppola; from world knowledge (Wikipedia): Sofia Coppola. Figure 2: An example PER query from TAC 2010. [sent-100, score-0.457]
40 These are the candidate KB entries for query Q. [sent-102, score-0.393]
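The candidate selection stage can be pictured with the small sketch below, assuming a hypothetical kb mapping from entry ids to their official name strings NE; exact, case-insensitive matching is used here, although the paper's actual matching rule may be looser:

```python
def select_candidate_entries(alt_name_strings, kb):
    """Return the union of KB entries whose official name string N_E matches
    at least one alternative name string in S_Q.
    kb: dict mapping entry_id -> official name string (assumed structure)."""
    candidates = set()
    for name in alt_name_strings:                     # each N in S_Q
        for entry_id, entry_name in kb.items():
            if entry_name.lower() == name.lower():    # simple exact match
                candidates.add(entry_id)
    return candidates
```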
41 Given a KB entry E and query Q, we score E based on the KL-divergence defined below: s(E, Q) = −Div(θQ ∥ θE) = −Σ_w p(w|θQ) log [ p(w|θQ) / p(w|θE) ]. (1) [sent-106, score-0.528]
42 Here θQ and θE are the query language model and the KB entry language model, respectively. [sent-107, score-0.528]
43 To estimate θQ, typically we can use the empirical query word distribution: p(w|θQ) = c(w, NQ) / |NQ|, (3) where c(w, NQ) is the count of w in NQ and |NQ| is the length of NQ. [sent-117, score-0.393]
44 We call this model the original query language model. [sent-118, score-0.393]
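To make Equations (1) and (3) concrete, here is a minimal sketch with language models represented as word-to-probability dictionaries; smoothing of the KB entry model (needed so that p(w|θE) is non-zero for every query word) is assumed to have been applied already and is not shown:

```python
import math
from collections import Counter

def original_query_lm(query_name_string):
    """Empirical query word distribution p(w|theta_Q) = c(w, N_Q) / |N_Q|  (Eq. 3)."""
    tokens = query_name_string.split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def kl_score(theta_q, theta_e):
    """s(E, Q) = -Div(theta_Q || theta_E)  (Eq. 1); larger is better.
    theta_e is assumed smoothed, so every query word has non-zero probability."""
    return -sum(p_q * math.log(p_q / theta_e[w]) for w, p_q in theta_q.items())
```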
45 After ranking the candidate KB entries in EQ using Equation (1), we perform entity linking as follows. [sent-119, score-0.347]
46 First, using an NER tagger, we determine the entity type of the query name string NQ. [sent-120, score-0.8]
47 The system links the query entity to this KB entry. [sent-123, score-0.509]
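The decision step sketched below is only an approximation of the procedure described here, since the excerpt skips some sentences: candidates are ranked by Equation (1), the query's NER type is used as a filter, and Nil is returned when no suitable candidate remains. The exact Nil rule (e.g. any score threshold) is an assumption.

```python
def link(query_type, ranked_candidates, kb_entry_types):
    """ranked_candidates: list of (entry_id, score), sorted by score descending.
    kb_entry_types: dict entry_id -> T_E in {PER, ORG, GPE, UNKNOWN}."""
    for entry_id, _score in ranked_candidates:
        if kb_entry_types.get(entry_id) in (query_type, "UNKNOWN"):
            return entry_id        # link to the top type-compatible entry
    return "NIL"                   # no suitable candidate -> Nil
```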
48 Recall that using the original query name string NQ itself may not be enough to obtain the correct KB entry, and additional words from both the query document and external knowledge can be useful. [sent-126, score-1.175]
49 In this section, we discuss how to expand the query language model θQ with these additional words in a principled way in order to rank KB entries based on how likely they match the query entity. [sent-128, score-0.942]
50 During the KB entry ranking stage, if we use θQ estimated from NQ, which contains only the word “Coppola,” the retrieval function is unlikely to rank the correct KB entry on the top. [sent-131, score-0.371]
51 But if we include the contextual word “Sophia” from the query document when estimating the query language model, the KL-divergence retrieval model is likely to rank the correct KB entry on the top. [sent-132, score-1.035]
52 This idea of using contextual words to expand the query is very similar to (pseudo) relevance feedback in information retrieval. [sent-133, score-0.534]
53 We can treat the query document DQ as our only feedback document. [sent-134, score-0.513]
54 We then linearly interpolate the feedback language model with the original query language model to form an expanded query language model: p(w|θQL) = αp(w|θQ) + (1 − α)p(w|θDQ ), (5) where α is a parameter between 0 and 1, to control the amount of feedback. [sent-143, score-0.963]
55 L indicates that the query expansion comes from local context. [sent-145, score-0.474]
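Equation (5) is a plain linear interpolation; a minimal sketch follows (α is a tuning parameter, and 0.5 below is only a placeholder value):

```python
def expand_with_local_context(theta_q, theta_dq, alpha=0.5):
    """p(w|theta_Q^L) = alpha * p(w|theta_Q) + (1 - alpha) * p(w|theta_DQ)  (Eq. 5)."""
    vocab = set(theta_q) | set(theta_dq)
    return {w: alpha * theta_q.get(w, 0.0) + (1.0 - alpha) * theta_dq.get(w, 0.0)
            for w in vocab}
```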
56 For entity linking, we suspect that named entities surrounding the query name string in DQ are particularly useful for disambiguation and thus should be emphasized over other words. [sent-148, score-0.956]
57 Positional Model Another observation is that words closer to the query name string in the query document are likely to be more important than words farther away. [sent-151, score-1.125]
58 Intuitively, we can use the distance between a word and the query name string to help weigh the word. [sent-152, score-0.684]
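One way to realize this distance-based weighting, in the spirit of positional language models (Lv and Zhai), is a Gaussian kernel over token distance; the kernel shape and width below are assumptions, since the excerpt does not give the exact form used in the paper:

```python
import math

def positional_weights(doc_tokens, mention_positions, sigma=25.0):
    """Weight for each token in D_Q, decaying with distance to the nearest
    occurrence of the query name string."""
    weights = []
    for i in range(len(doc_tokens)):
        d = min(abs(i - m) for m in mention_positions)
        weights.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return weights

# These weights can then replace raw counts when estimating the feedback
# model p(w|theta_DQ), so that nearby named entities contribute more.
```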
59 2 Using Global World Knowledge Similar to the way we incorporate words from DQ into the query language model, we can also construct a feedback language model using the most likely official name of the query entity obtained from Wikipedia. [sent-158, score-1.242]
60 We can then linearly interpolate θNQg with the original query language model θQ to form an expanded query language model θQG: p(w|θQG) = αp(w|θQ) + (1 − α)p(w|θNQg). [sent-160, score-0.891]
61 (10) Here G indicates that the query expansion comes from global world knowledge. [sent-161, score-0.54]
62 3 Combining Local Context and World Knowledge We can further combine the two kinds of additional words into the query language model as follows: p(w|θLQ+G) = αp(w|θQ) + (1 − α)(βp(w|θDQ) + (1 − β)p(w|θNQg)). [sent-166, score-0.393]
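Assuming the combined model interpolates the original query model with a β-weighted mixture of the local and global feedback models (the α term was missing from the extracted formula above and is reconstructed from Equations (5) and (10), so treat it as an assumption), a sketch:

```python
def expand_with_local_and_global(theta_q, theta_dq, theta_wiki, alpha=0.5, beta=0.5):
    """p(w|theta_Q^{L+G}) = alpha * p(w|theta_Q)
       + (1 - alpha) * (beta * p(w|theta_DQ) + (1 - beta) * p(w|theta_NQg))."""
    vocab = set(theta_q) | set(theta_dq) | set(theta_wiki)
    return {w: alpha * theta_q.get(w, 0.0)
               + (1.0 - alpha) * (beta * theta_dq.get(w, 0.0)
                                  + (1.0 - beta) * theta_wiki.get(w, 0.0))
            for w in vocab}
```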
63 The data set contains 2250 queries, and the query documents come from newswire and Web pages. [sent-172, score-0.457]
64 This piece of text comes from a query document where the query name string is “Jackman. [sent-178, score-1.125]
65 ” We can see that the NER tagger can help locate the full name of the person. [sent-179, score-0.233]
66 Methods to Compare: Recall that our system consists of a KB entry selection stage and a KB entry ranking stage. [sent-192, score-0.4]
67 At the selection stage, a set SQ of alternative name strings is used to select candidate KB entries. [sent-193, score-0.305]
68 We first define a few settings where different alternative name string sets are used to select candidate KB entries: • Q represents the baseline setting which uses only the original query name string NQ to select candidate KB entries. [sent-194, score-1.101]
69 • Q+L represents the setting where alternative name strings obtained from the query document DQ are combined with NQ to select candidate KB entries. [sent-195, score-0.758]
70 • Q+G represents the setting where the alternative name string obtained from Wikipedia is combined with NQ to select candidate KB entries. [sent-196, score-0.337]
71 • Q+L+G represents the setting where alternative name strings from both DQ and Wikipedia are used together with NQ to select candidate KB entries. [sent-198, score-0.331]
72 After selecting candidate KB entries, in the KB entry ranking stage, we have four options for the query language model and two options for the KB entry language model. [sent-199, score-0.724]
73 Before examining the effect of query expansion in ranking, we now compare the effect of using different sets of alternative query name strings in the candidate KB entry selection stage. [sent-207, score-1.335]
74 For this set of experiments, we fix the query language model to θQ and the KB entry language model to θNE in the ranking stage. [sent-208, score-0.563]
75 For the Nil case, the performance of Q, Q+L, Q+G and Q+L+G is very close, indicating that the additional alternative query name strings do not help. [sent-218, score-0.712]
76 It shows that the alternative query name strings are most useful for queries that do have their correct entries in the KB. [sent-219, score-0.84]
77 We now examine the expanded query language models θQL, θQG and θLQ+G. We first analyze the results without using the KB disambiguation text. [sent-221, score-0.441]
78 Table 5 shows the comparison between θQ and other expanded query language models in terms of micro-averaged accuracy. [sent-224, score-0.498]
79 The results reveal that the expanded query language models can indeed improve the overall performance (both the Nil and non-Nil cases) under all settings. [sent-225, score-0.498]
80 This shows the effectiveness of using the principled query expansion technique coupled with the KL-divergence retrieval model to rank KB entries. [sent-226, score-0.541]
81 Table 5: Comparison between the performance of θQ and expanded query language models in terms of micro average accuracy. [sent-299, score-0.534]
82 While in Table 4 the alternative name strings do not affect the performance much for Nil queries, now the expanded query language models actually hurt the performance for Nil queries. [sent-302, score-0.817]
83 When we expand the query language model, we can possibly introduce noise, especially when we use the external knowledge obtained from Wikipedia, which largely depends on what Wikipedia considers to be the most popular official name of a query name string. [sent-304, score-1.353]
84 With noisy terms in the expanded query language model we increase the chance to link the query to a KB entry which is not the correct match. [sent-305, score-1.052]
85 The challenge is that we do not know when additional terms in the expanded query language model are noise and when they are not, because for non-Nil queries we do observe a substantial amount of improvement brought by query expansion, especially with external world knowledge. [sent-306, score-1.039]
86 We now further study the impact of using the KB disambiguation text associated with each entry to estimate the KB entry language model used in the KL-divergence ranking function. [sent-308, score-0.353]
87 Without the KB disambiguation text both the KB entry Mobile Alabama and the entry Mobile River are given the same score, resulting in inaccurate linking in the θNE case. [sent-313, score-0.424]
88 However, we observe that such cases are very rare in the TAC 2010 query list and thus the overall improvement observed is minimal. [sent-315, score-0.393]
89 Table 7: The KL-divergence scores of KB entities for the query Mobile. [sent-320, score-0.466]
90 Table 6: Comparing the performance using KB text and without using KB text for all methods using expanded query models in terms of micro average accuracy on 2250 queries. [sent-407, score-0.534]
91 Recall that all the expanded query language models also have a control parameter α. [sent-426, score-0.498]
92 The local context and the global world knowledge are weighted equally for aiding disambiguation and improving the entity linking performance. [sent-441, score-0.398]
93 In their work, they made the assumption that every entity has a KB entry, and thus Nil entries are not handled. [sent-460, score-0.341]
94 (2010) took the approach that a large number of entities will be unlinkable, as there is a probability that the relevant KB entry is unavailable. [sent-468, score-0.234]
95 But their proposal for handling the alias name or stage name via multiple lists is not scalable. [sent-470, score-0.53]
96 Similarly, for acronyms we use global knowledge to expand the abbreviation, which aids entity disambiguation. [sent-472, score-0.249]
97 We integrated some of their ideas, such as using world knowledge, with our new techniques to improve entity linking accuracy. [sent-483, score-0.316]
98 6 Conclusions In this paper we proposed a novel approach to entity linking based on statistical language model-based information retrieval with query expansion using the local context from the query document as well as world knowledge from the Web. [sent-484, score-1.272]
99 Document language models, query models, and risk minimization for information retrieval. [sent-508, score-0.393]
100 A comparative study of methods for estimating query language models with pseudo feedback. [sent-522, score-0.429]
wordName wordTfidf (topN-words)
[('kb', 0.659), ('query', 0.393), ('name', 0.215), ('dq', 0.212), ('nq', 0.191), ('ql', 0.148), ('entry', 0.135), ('entity', 0.116), ('tac', 0.108), ('linking', 0.106), ('expanded', 0.105), ('mobile', 0.1), ('ne', 0.096), ('nil', 0.087), ('gpe', 0.085), ('string', 0.076), ('stage', 0.075), ('coppola', 0.074), ('entities', 0.073), ('feedback', 0.072), ('wikipedia', 0.071), ('strings', 0.07), ('queries', 0.064), ('entries', 0.064), ('world', 0.064), ('nqg', 0.062), ('sensitivity', 0.06), ('sq', 0.053), ('official', 0.053), ('zhai', 0.05), ('expansion', 0.049), ('alabama', 0.049), ('nnee', 0.049), ('disambiguation', 0.048), ('acronyms', 0.048), ('qg', 0.048), ('document', 0.048), ('river', 0.042), ('kbp', 0.042), ('positional', 0.042), ('retrieval', 0.041), ('base', 0.039), ('eq', 0.038), ('lq', 0.037), ('pseudo', 0.036), ('micro', 0.036), ('relevance', 0.035), ('ranking', 0.035), ('named', 0.035), ('chengxiang', 0.035), ('global', 0.034), ('expand', 0.034), ('alternative', 0.034), ('lv', 0.033), ('principled', 0.033), ('local', 0.032), ('mcnamee', 0.032), ('sophia', 0.032), ('population', 0.032), ('ji', 0.03), ('knowledge', 0.03), ('org', 0.029), ('link', 0.026), ('took', 0.026), ('candidate', 0.026), ('dredze', 0.025), ('alias', 0.025), ('ampas', 0.025), ('brazil', 0.025), ('calvert', 0.025), ('farlotmer', 0.025), ('fqb', 0.025), ('mount', 0.025), ('ngq', 0.025), ('sofia', 0.025), ('vernon', 0.025), ('rank', 0.025), ('zheng', 0.023), ('synonymy', 0.022), ('ner', 0.021), ('tq', 0.021), ('route', 0.021), ('aids', 0.021), ('lehmann', 0.021), ('yuanhua', 0.021), ('affects', 0.021), ('selection', 0.02), ('select', 0.02), ('management', 0.02), ('external', 0.02), ('refers', 0.02), ('lafferty', 0.019), ('ambiguities', 0.019), ('lavrenko', 0.019), ('redirect', 0.019), ('heterogenous', 0.019), ('smu', 0.019), ('careful', 0.019), ('singapore', 0.018), ('locate', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion
Author: Swapna Gottipati ; Jing Jiang
Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.
2 0.31151402 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking
Author: Zheng Chen ; Heng Ji
Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to traditional non-collaborative ranking scheme which solely relies on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths from multiple collaborators of a query and the strengths from multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely, micro collaborative ranking (MiCR), macro collaborative ranking (MaCR) and micro-macro collab- orative ranking (MiMaCR). Experiments on entity linking task show that our proposed scheme is indeed effective and promising.
3 0.13343954 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base
Author: Ni Lao ; Tom Mitchell ; William W. Cohen
Abstract: t om . We consider the problem of performing learning and inference in a large scale knowledge base containing imperfect knowledge with incomplete coverage. We show that a soft inference procedure based on a combination of constrained, weighted, random walks through the knowledge base graph can be used to reliably infer new beliefs for the knowledge base. More specifically, we show that the system can learn to infer different target relations by tuning the weights associated with random walks that follow different paths through the graph, using a version of the Path Ranking Algorithm (Lao and Cohen, 2010b). We apply this approach to a knowledge base of approximately 500,000 beliefs extracted imperfectly from the web by NELL, a never-ending language learner (Carlson et al., 2010). This new system improves significantly over NELL’s earlier Horn-clause learning and inference method: it obtains nearly double the precision at rank 100, and the new learning method is also applicable to many more inference tasks.
4 0.10735326 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
Author: Johannes Hoffart ; Mohamed Amir Yosef ; Ilaria Bordino ; Hagen Furstenau ; Manfred Pinkal ; Marc Spaniol ; Bilyana Taneva ; Stefan Thater ; Gerhard Weikum
Abstract: Disambiguating named entities in naturallanguage text maps mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base such as DBpedia or YAGO. This paper presents a robust method for collective disambiguation, by harnessing context from knowledge bases and using a new form of coherence graph. It unifies prior approaches into a comprehensive framework that combines three measures: the prior probability of an entity being mentioned, the similarity between the contexts of a mention and a candidate entity, as well as the coherence among candidate entities for all mentions together. The method builds a weighted graph of mentions and candidate entities, and computes a dense subgraph that approximates the best joint mention-entity mapping. Experiments show that the new method significantly outperforms prior methods in terms of accuracy, with robust behavior across a variety of inputs.
5 0.078607686 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
6 0.078485265 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels
7 0.058598615 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
8 0.056460064 128 emnlp-2011-Structured Relation Discovery using Generative Models
9 0.046818752 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
10 0.045757275 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
11 0.043700233 114 emnlp-2011-Relation Extraction with Relation Topics
12 0.040485568 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
13 0.039873414 110 emnlp-2011-Ranking Human and Machine Summarization Systems
14 0.038957044 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization
15 0.038933083 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases
16 0.03867586 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
17 0.038366083 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
18 0.037226196 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
19 0.036307465 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
20 0.035870764 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
topicId topicWeight
[(0, 0.124), (1, -0.098), (2, -0.046), (3, -0.054), (4, -0.063), (5, -0.117), (6, 0.011), (7, -0.111), (8, -0.082), (9, 0.242), (10, 0.076), (11, -0.13), (12, -0.193), (13, 0.135), (14, 0.137), (15, -0.194), (16, 0.349), (17, 0.197), (18, 0.134), (19, -0.106), (20, -0.122), (21, -0.053), (22, -0.02), (23, 0.197), (24, -0.099), (25, -0.107), (26, 0.031), (27, -0.127), (28, 0.019), (29, -0.064), (30, -0.159), (31, -0.091), (32, -0.03), (33, -0.0), (34, -0.013), (35, 0.042), (36, -0.012), (37, 0.025), (38, 0.003), (39, -0.109), (40, -0.106), (41, 0.008), (42, 0.002), (43, 0.034), (44, 0.034), (45, -0.092), (46, -0.019), (47, -0.018), (48, -0.072), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.97942513 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion
Author: Swapna Gottipati ; Jing Jiang
Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.
2 0.9177227 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking
Author: Zheng Chen ; Heng Ji
Abstract: In this paper, we present a new ranking scheme, collaborative ranking (CR). In contrast to traditional non-collaborative ranking scheme which solely relies on the strengths of isolated queries and one stand-alone ranking algorithm, the new scheme integrates the strengths from multiple collaborators of a query and the strengths from multiple ranking algorithms. We elaborate three specific forms of collaborative ranking, namely, micro collaborative ranking (MiCR), macro collaborative ranking (MaCR) and micro-macro collab- orative ranking (MiMaCR). Experiments on entity linking task show that our proposed scheme is indeed effective and promising.
3 0.39168605 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base
Author: Ni Lao ; Tom Mitchell ; William W. Cohen
Abstract: t om . We consider the problem of performing learning and inference in a large scale knowledge base containing imperfect knowledge with incomplete coverage. We show that a soft inference procedure based on a combination of constrained, weighted, random walks through the knowledge base graph can be used to reliably infer new beliefs for the knowledge base. More specifically, we show that the system can learn to infer different target relations by tuning the weights associated with random walks that follow different paths through the graph, using a version of the Path Ranking Algorithm (Lao and Cohen, 2010b). We apply this approach to a knowledge base of approximately 500,000 beliefs extracted imperfectly from the web by NELL, a never-ending language learner (Carlson et al., 2010). This new system improves significantly over NELL’s earlier Horn-clause learning and inference method: it obtains nearly double the precision at rank 100, and the new learning method is also applicable to many more inference tasks.
4 0.37733805 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
Author: Fabien Cromieres ; Sadao Kurohashi
Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.
5 0.32316691 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
Author: Johannes Hoffart ; Mohamed Amir Yosef ; Ilaria Bordino ; Hagen Furstenau ; Manfred Pinkal ; Marc Spaniol ; Bilyana Taneva ; Stefan Thater ; Gerhard Weikum
Abstract: Disambiguating named entities in naturallanguage text maps mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base such as DBpedia or YAGO. This paper presents a robust method for collective disambiguation, by harnessing context from knowledge bases and using a new form of coherence graph. It unifies prior approaches into a comprehensive framework that combines three measures: the prior probability of an entity being mentioned, the similarity between the contexts of a mention and a candidate entity, as well as the coherence among candidate entities for all mentions together. The method builds a weighted graph of mentions and candidate entities, and computes a dense subgraph that approximates the best joint mention-entity mapping. Experiments show that the new method significantly outperforms prior methods in terms of accuracy, with robust behavior across a variety of inputs.
6 0.2456287 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
8 0.1950184 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
9 0.18104453 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
10 0.16412456 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization
11 0.15780286 110 emnlp-2011-Ranking Human and Machine Summarization Systems
12 0.15252465 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
13 0.1492468 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
14 0.14739789 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
15 0.14713536 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
16 0.14612645 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
17 0.13649416 32 emnlp-2011-Computing Logical Form on Regulatory Texts
18 0.12634885 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
19 0.12433616 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
20 0.12185169 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model
topicId topicWeight
[(23, 0.089), (36, 0.028), (37, 0.015), (45, 0.521), (53, 0.015), (54, 0.023), (62, 0.019), (64, 0.012), (66, 0.023), (69, 0.021), (79, 0.029), (82, 0.011), (87, 0.012), (96, 0.019), (98, 0.047)]
simIndex simValue paperId paperTitle
1 0.98784077 21 emnlp-2011-Bayesian Checking for Topic Models
Author: David Mimno ; David Blei
Abstract: Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved.
2 0.98332793 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman
Abstract: Om P. Damani Srivatsan Laxman Computer Science and Engg. Microsoft Research India IIT Bombay Bangalore damani @ cse . i . ac . in itb s laxman@mi cro s o ft . com of words that co-occur in a large number of docuLexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. In- stead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
3 0.96621716 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP
Author: Amit Goyal ; Hal Daume III
Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).
same-paper 4 0.93490487 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion
Author: Swapna Gottipati ; Jing Jiang
Abstract: In this paper we present a novel approach to entity linking based on a statistical language model-based information retrieval with query expansion. We use both local contexts and global world knowledge to expand query language models. We place a strong emphasis on named entities in the local contexts and explore a positional language model to weigh them differently based on their distances to the query. Our experiments on the TAC-KBP 2010 data show that incorporating such contextual information indeed aids in disambiguating the named entities and consistently improves the entity linking performance. Compared with the official results from KBP 2010 participants, our system shows competitive performance.
5 0.89184266 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
Author: Emily M. Bender ; Dan Flickinger ; Stephan Oepen ; Yi Zhang
Abstract: In order to obtain a fine-grained evaluation of parser accuracy over naturally occurring text, we study 100 examples each of ten reasonably frequent linguistic phenomena, randomly selected from a parsed version of the English Wikipedia. We construct a corresponding set of gold-standard target dependencies for these 1000 sentences, operationalize mappings to these targets from seven state-of-theart parsers, and evaluate the parsers against this data to measure their level of success in identifying these dependencies.
6 0.78581238 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
7 0.7396906 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models
8 0.73734969 64 emnlp-2011-Harnessing different knowledge sources to measure semantic relatedness under a uniform model
9 0.72134072 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
10 0.69213355 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
11 0.69086909 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
12 0.67574799 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
13 0.67069048 55 emnlp-2011-Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
14 0.63321805 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
15 0.62869787 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
16 0.61782479 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context
17 0.61457998 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
18 0.61325765 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion
19 0.60950094 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
20 0.60413402 107 emnlp-2011-Probabilistic models of similarity in syntactic context