emnlp emnlp2012 emnlp2012-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. [sent-3, score-0.997]
2 We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. [sent-5, score-0.552]
3 We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. [sent-7, score-0.848]
4 The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets. [sent-8, score-0.382]
5 1 Introduction Resolving ambiguity associated with person names found on the Web is a key challenge in many Internet applications, such as information retrieval, question answering, open information extraction, automatic knowledge acquisition (Wu and Weld, 2008), and so on. [sent-9, score-0.273]
6 This motivates an intensive study in automatically resolving person name ambiguity in various web applications. [sent-14, score-0.648]
7 However, resolving web person name ambiguity is not a trivial task. [sent-15, score-0.648]
8 The former refers to the case where two web pages describe the same person but use different words, so the word overlap between them is small. [sent-19, score-0.35]
9 As far as we know, there is less work focusing on exploring person specific information to relieve the lack of clues problem. [sent-32, score-0.343]
10 Beyond bag-of-features, two kinds of features are explored, co-occurrences of entities and Wikipedia based semantic relationship between entities, both of which provide a reasonable relatedness for entity pairs. [sent-35, score-0.368]
11 Han and Zhao try to model both aspects, but their co-occurrence estimation, estimated from held-out resources, fails to capture the person specific importance for a feature, which is crucial to enhance limited clues in a corpus level, e. [sent-39, score-0.39]
12 In this paper, we explore different usages of features and propose an approach which mines cross document information to capture the person specific importance for a feature. [sent-42, score-0.348]
13 By incorporating both the Wikipedia and topic information into our person name similarity, our model exploits both Wikipedia based background knowledge and person specific importance. [sent-44, score-0.984]
14 In the rest of this paper, we first review related work, and in Section 3, show how we exploit the person specific importance in our disambiguation model. [sent-47, score-0.441]
15 2 Related Work Web person name ambiguity resolution can be formally defined as follows: Given a set of web pages {d1, d2 , . [sent-50, score-0.648]
16 , n) contains an ambiguous name N which may correspond to several persons holding this name among these pages. [sent-56, score-0.63]
17 The disambiguation system should group these name observations into j clusters {c1, c2, . [sent-57, score-0.504]
18 Those methods pay more attention to extracting informative features and their co-occurrences, but they usually treat the features locally, and ignore the semantic relatedness of features beyond the current document. [sent-72, score-0.315]
19 By employing Wikipedia, the largest online encyclopedia, rich background knowledge about the semantic relatedness between entities can be leveraged to improve the disambiguation performance, and relieve the coverage problem, to some extent. [sent-80, score-0.483]
20 Han and Zhao adopt Wikipedia semantic relatedness to compute the similarity between name observations. [sent-82, score-0.661]
21 They also combine multiple knowledge sources and capture explicit semantic relatedness between concepts and implicit semantic relationship embedded in a semantic graph simultaneously (Han and Zhao, 2010). [sent-83, score-0.76]
22 Most approaches discussed above explore various features in the current page or rely on external knowledge resources to bridge the vocabulary gap, but pay less attention to the lack of clues since they ignore the person specific evidence in the current corpus level. [sent-84, score-0.568]
23 Our model focuses on solving the data sparsity problem by utilizing other web pages in the same name observation set to provide a robust but person specific weighting for discriminative features beyond the current document alone. [sent-85, score-0.937]
24 The WS model uses Wikipedia to capture the relationship between entities in the local context to bridge the vocabulary gap, but it is incapable of evaluating the importance of a feature with regard to the target name, and hence is unable to make use of limited clues in the current web page. [sent-87, score-0.531]
25 Our method captures person specific evidences by generating topics from all concepts in the current name observation set and weighting a feature accordingly. [sent-88, score-1.258]
26 In this case, discriminative features that are sparse in the current page can be globally weighted so as to provide a more accurate and stable person name similarity. [sent-89, score-0.687]
27 3 The Model Our model consists of three steps: feature extraction, topic generation and name disambiguation. [sent-90, score-0.514]
28 For an ambiguous name, we first extract three types of features and construct a semantic graph from all Wikipedia concepts extracted from the current name observation set. [sent-91, score-0.904]
29 Finally, we incorporate the proposed topic representation into the person name similarity function and adopt the hierarchical agglomerative clustering (HAC) algorithm to group these web pages. [sent-93, score-1.01]
30 1 Feature Extraction We extract features from the contexts of ambiguous names, including Wikipedia concepts, named entities and biographical information, such as email addresses, phone numbers and birth years. [sent-95, score-0.357]
31 Wikipedia Concept Extraction Each concept in Wikipedia is described by an article containing hyperlinks to other concepts which are supposed to be related to the current one. [sent-96, score-0.458]
32 We collect Wikipedia concepts from all web pages in the dataset by comparing all n-grams (up to 8) from the dataset to Wikipedia anchor text dictionary and checking whether it is a Wikipedia concept surface form. [sent-98, score-0.526]
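The n-gram lookup described in sentence 32 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `extract_concepts`, the toy `anchor_dict`, whitespace tokenization, and lowercasing are all hypothetical choices.

```python
def extract_concepts(tokens, anchor_dict, max_n=8):
    """Return (start_index, surface_form) for every n-gram of length
    1..max_n that appears in the anchor text dictionary."""
    found = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            if start + n > len(tokens):
                break  # the n-gram would run past the end of the page
            surface = " ".join(tokens[start:start + n]).lower()
            if surface in anchor_dict:
                found.append((start, surface))
    return found


# Toy anchor dictionary (assumed); a real one would be built from Wikipedia anchors.
anchor_dict = {"pro football weekly", "football", "sports league"}
tokens = "He wrote for Pro Football Weekly".split()
matches = extract_concepts(tokens, anchor_dict)
```

Overlapping matches are all kept here; the keyphraseness pruning mentioned in the next sentence would then decide which surface forms survive.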
33 We further prune the extracted concepts according to their keyphraseness (Mihalcea and Csomai, 2007). [sent-99, score-0.319]
34 Initially, each concept is weighted according to its average semantic relatedness (David and Ian, 2008) with other concepts in the current page. [sent-100, score-0.533]
35 For convenience, we will refer to Wikipedia concept features as concept features and to the other two types as non-concept features in the rest of this paper. [sent-107, score-0.308]
36 2 Topic Generation and Weighting Scheme Now we proceed to describe the key step of our model, topic generation and weighting strategy. [sent-109, score-0.306]
37 The purpose of introducing topics into our model is to exploit the corpus level importance of a feature for a given name so that we will not miss any discriminative features which are few in the current name observation but have shown significant importance over the whole name observation set. [sent-110, score-1.462]
38 Graph Construction In our model, we capture the topic structure through a semantic graph. [sent-111, score-0.291]
39 Specifically, for each name observation set, we connect all Wikipedia concepts appearing in the current observation set by their pairwise semantic relatedness (David and Ian, 2008) to form a semantic graph. [sent-112, score-0.968]
40 The constructed graph is usually very dense since any pair of unrelated concepts would be connected by a small semantic relatedness resulting in many light-weighted or even meaningless edges. [sent-113, score-0.61]
41 The green node Sports League is a hub node, and the yellow node Pro Football Weekly is an outlier. [sent-118, score-0.389]
42 Some general concepts, such as swimming, football, basketball and golf, will be measured highly related with each other by Wikipedia semantic relatedness and thus are very likely to be grouped into one topic, however, they are discriminative on their own when disambiguating different persons. [sent-120, score-0.395]
43 For example, the concept swimming is discriminative enough to distinguish Russian swimmer Popov from basketball player Popov. [sent-121, score-0.285]
44 After the pruning step, for each ambiguous name, we get a semantic graph from all Wikipedia concepts extracted in this name observation set. [sent-126, score-0.933]
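As a rough sketch of graph construction followed by edge pruning, the snippet below connects concepts by pairwise relatedness and drops light edges. It is a simplification under assumptions: the relatedness function `sr`, the toy scores, and the threshold value are hypothetical, and a plain weight threshold stands in for the pruning strategy whose details are not reproduced in this summary.

```python
from itertools import combinations


def build_pruned_graph(concepts, sr, threshold):
    """Connect every concept pair by semantic relatedness and keep only
    edges whose weight reaches the threshold."""
    graph = {c: {} for c in concepts}
    for a, b in combinations(concepts, 2):
        weight = sr(a, b)
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


# Toy relatedness scores (assumed values, for illustration only).
toy_scores = {
    frozenset(("swimming", "freestyle")): 0.8,
    frozenset(("swimming", "basketball")): 0.3,
    frozenset(("freestyle", "basketball")): 0.1,
}


def sr(a, b):
    return toy_scores[frozenset((a, b))]


graph = build_pruned_graph(["swimming", "freestyle", "basketball"], sr, threshold=0.5)
```

With this toy threshold, swimming and basketball end up unconnected, which mirrors the point in sentences 42 and 43 that such general concepts should stay separable.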
45 Graph Clustering Considering the graph construction strategy we use, it is more suitable for us to group the concepts on the graph into several topics using a density-based clustering model. [sent-128, score-0.579]
46 If yes, the algorithm will expand a cluster from this vertex recursively; otherwise the vertex will be labeled either a hub node or an outlier, depending on the number of its neighboring clusters. [sent-134, score-0.53]
47 A hub node connects to more than one cluster, while an outlier connects to one or no cluster. [sent-135, score-0.467]
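The hub/outlier rule in sentences 46 and 47 can be illustrated with a few lines of code. This sketch covers only the final labeling step for vertices left outside every cluster, not the cluster expansion itself; the function name and the neighbor concepts in the toy graph are hypothetical.

```python
def classify_unclustered_vertex(vertex, graph, cluster_of):
    """Label a vertex that was not absorbed into any cluster:
    'hub' if its neighbors fall into more than one cluster,
    'outlier' if they fall into one cluster or none."""
    neighbor_clusters = {cluster_of[n] for n in graph[vertex] if n in cluster_of}
    return "hub" if len(neighbor_clusters) > 1 else "outlier"


# Toy adjacency (assumed); only the two vertex names come from the text.
graph = {
    "Sports League": {"concept_a", "concept_b"},
    "Pro Football Weekly": {"concept_a"},
}
cluster_of = {"concept_a": 0, "concept_b": 1}  # cluster id per clustered concept
labels = {v: classify_unclustered_vertex(v, graph, cluster_of) for v in graph}
```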
48 Take the semantic graph in Figure 1 for example, the node Sports League is a hub node, while the node Pro Football Weekly is an outlier. [sent-136, score-0.549]
49 Finally, all concepts in the graph are grouped into K + 2 parts (K is the number of the clusters, and is determined automatically), including K clusters, the set of hub nodes and the set of outliers. [sent-137, score-0.607]
50 This new similarity function contains two parts: the neighborhood similarity and the semantic relatedness between two concepts. [sent-140, score-0.558]
51 However, we found that hub nodes usually correspond to general concepts which may be related to many topics, but with a loose relatedness. [sent-146, score-0.522]
52 We thus distribute each general concept into its every related topic, but with a lower weight to distinguish from ordinary concepts in this topic. [sent-147, score-0.544]
53 We calculate the average semantic relatedness of an outlier with its neighbor concepts that belong to one topic. [sent-149, score-0.599]
54 We thus distribute each general concept into its every related topic, but with a lower weight to distinguish from ordinary concepts in this topic. [sent-156, score-0.544]
55 Outliers are found to contain concepts which are far away from main topics of the document set and look like noise concepts. [sent-157, score-0.358]
56 We therefore calculate the average semantic relatedness of an outlier node with its neighboring concepts which belong to some topics. [sent-158, score-0.661]
57 Weighting Topics After generating all topics, we should weight each topic according to its importance in the current name observation set as well as the quality of the topic (cluster). [sent-160, score-1.027]
58 Intuitively, if most concepts in the topic are considered to be discriminative in the current name set and they are closely related to each other, this topic should be weighted as important. [sent-161, score-1.078]
59 By properly weighting the generated topics, we can capture the importance of a concept reliably in the corpus level (in the current name observation set) rather than in the current page solely. [sent-162, score-0.887]
60 Suppose a hub node h connects to a topic t with n neighbors, namely c1, c2, · · · , cn. [sent-164, score-0.576]
61 The similarity between this hub node and the topic is computed by averaging the semantic relatedness between this hub node and these n neighbors: $sim(h,t) = \frac{1}{n}\sum_{i=1}^{n} sr(h, c_i)$. [sent-165, score-1.233]
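The averaging step for hub nodes is small enough to show directly. In this sketch the function name, the toy concept names, and the relatedness values are assumptions; `sr` stands for whatever semantic relatedness measure is in use.

```python
def hub_topic_similarity(hub, neighbors_in_topic, sr):
    """Average semantic relatedness between a hub node and its
    n neighbors c_1..c_n inside one topic."""
    if not neighbors_in_topic:
        raise ValueError("the hub must have at least one neighbor in the topic")
    return sum(sr(hub, c) for c in neighbors_in_topic) / len(neighbors_in_topic)


# Toy relatedness values (assumed).
toy_sr = {"c1": 0.2, "c2": 0.4, "c3": 0.6}
similarity = hub_topic_similarity("h", ["c1", "c2", "c3"], lambda h, c: toy_sr[c])
```

Sentence 63 uses this value to scale a hub concept's frequency inside each topic it is distributed into.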
62 Now we proceed to weight the topic t by taking into account the frequencies of its concepts and the coherence between the concepts and their neighborhood in topic t: $w(t) = \frac{\sum_{i=1}^{n} f(c_i)}{n} \times \frac{\sum_{i=1}^{n} n\_coh(c_i, t)}{n}$ (3) where topic t contains n concepts {c1, c2, . [sent-167, score-1.581]
63 , cn}, f(c) is the frequency of concept c over the current name observation set; in particular, when c is a hub node concept, we distribute its frequency according to equation (2), giving $f_t(c) = f(c)\,sim(c, t)$. [sent-170, score-0.967]
64 And n_coh(c, t) is the neighborhood coherence of concept c with topic t, defined as: $n\_coh(c,t) = \frac{\sum_{q \in N(c) \cap t} sr(q,c)}{|N(c) \cap t|}$ (4) where N(c) is the neighboring node set of concept c. [sent-171, score-0.686]
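A possible reading of equations (3) and (4) is sketched below. Two points are assumptions rather than facts from the text: the two averages in equation (3) are combined by multiplication, since the operator is not legible in this extraction, and a concept with no neighbors inside the topic gets coherence 0. Function names and toy values are hypothetical.

```python
def n_coh(c, topic, graph):
    """Neighborhood coherence: mean relatedness between c and those of
    its graph neighbors that belong to the topic (equation 4)."""
    shared = [q for q in graph[c] if q in topic]
    if not shared:
        return 0.0  # assumed convention for an empty intersection
    return sum(graph[c][q] for q in shared) / len(shared)


def topic_weight(topic, freq, graph):
    """Mean concept frequency times mean neighborhood coherence
    (assumed product form of equation 3)."""
    n = len(topic)
    mean_freq = sum(freq[c] for c in topic) / n
    mean_coh = sum(n_coh(c, topic, graph) for c in topic) / n
    return mean_freq * mean_coh


# Toy graph: edge weights are semantic relatedness values (assumed).
graph = {
    "a": {"b": 0.5, "x": 0.9},  # x lies outside the topic and is ignored
    "b": {"a": 0.5},
    "x": {"a": 0.9},
}
freq = {"a": 2, "b": 4, "x": 1}
weight = topic_weight({"a", "b"}, freq, graph)
```

For a hub concept, `freq[c]` would be replaced by the scaled frequency from sentence 63 before calling `topic_weight`.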
65 By incorporating corpus level concept frequencies into topic weighting, discriminative concepts that are sparse in one document and suppressed by conventional models can benefit from their corpus level importance as well as their coherence in related topics. [sent-172, score-0.751]
66 3 Clustering Person Name Observations Now the remaining key step is to compute the similarity between two name observations. [sent-174, score-0.393]
67 However, this similarity bears a shortcoming that the bridge tags shared by the two documents require an exact match of features, which does not take any semantic relatedness into consideration. [sent-180, score-0.464]
68 If two web pages mention the same person but have few features in common, the GRAPE similarity may not work properly. [sent-181, score-0.445]
69 We, therefore, propose a new similarity measure combining topic similarity, topic based connectivity strength and GRAPE’s connectivity strength. [sent-182, score-0.629]
70 Matching Topics to Person Name Observations We first describe how to match the generated topics to different name observations. [sent-183, score-0.399]
71 In order to avoid unreliable estimation, we only match a topic to a name observation when they share at least one concept. [sent-184, score-0.622]
72 The underlying idea of the equation is, if two name observations share more and closer common topics, and also these topics receive higher weights according to the current name observation set, then the two observations should be more related to each other. [sent-187, score-0.93]
73 We consider common topics as the bridge tags and define our topic based connectivity strength between two name observations as: TCS(o1,o2) = 21 X sim(o1 ∩ t,o2 ∩ t) t∈TX(o1,o2) omit the details for brevity). [sent-193, score-0.806]
74 Finally, we linearly combine equations (6), (7) and CS(o1, o2) into the person name similarity function as: $S(o_1, o_2) = \alpha_1\,TSm(o_1, o_2) + \alpha_2\,TCS(o_1, o_2) + (1 - \alpha_1 - \alpha_2)\,CS(o_1, o_2)$ (9) where α1 and α2 are optimized during training. [sent-194, score-0.628]
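The linear combination in equation (9) might be coded as below, reading it as three weighted terms whose weights sum to one. The function name, the range check on the weights, and the sample scores are assumptions made for this sketch.

```python
def person_name_similarity(tsm, tcs, cs, alpha1, alpha2):
    """Combine topic similarity (TSm), topic based connectivity strength
    (TCS) and GRAPE's connectivity strength (CS) with weights that sum to 1."""
    if alpha1 < 0 or alpha2 < 0 or alpha1 + alpha2 > 1:
        # Assumed constraint so that the third weight stays non-negative.
        raise ValueError("alpha1 and alpha2 must be non-negative and sum to at most 1")
    return alpha1 * tsm + alpha2 * tcs + (1 - alpha1 - alpha2) * cs


# Toy component scores and weights (assumed values).
score = person_name_similarity(tsm=0.8, tcs=0.4, cs=0.1, alpha1=0.5, alpha2=0.3)
```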
75 This final similarity function will then be embedded into a normal HAC algorithm to group the web pages into different namesakes where we compute the centroid-based distance between clusters(Mann and Yarowsky, 2003). [sent-195, score-0.286]
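To show how a pairwise similarity function feeds agglomerative clustering, here is a small pure-Python sketch. Note the substitution: it uses average linkage over a precomputed similarity matrix instead of the centroid-based distance mentioned above, because centroids need feature vectors that this toy example does not have. The stopping threshold and the matrix values are assumptions.

```python
def hac(sim, threshold):
    """Greedy agglomerative clustering over a symmetric similarity matrix.
    Repeatedly merges the most similar pair of clusters (average linkage)
    until the best pair falls below the threshold."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (i, j) = max(
            (link(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy similarities between four name observations (assumed values).
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
namesakes = hac(sim, threshold=0.5)
```

Each resulting cluster corresponds to one namesake, that is, one real-world person sharing the ambiguous name.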
76 We identified over 4,000,000 highly connected concepts in this dump; each concept links to 10 other concepts on average. (Cohs(o1,t) + Cohs(o2,t)) (7) [sent-201, score-0.668]
77 Cohs(o, t) is a cohesion measure to capture the relatedness between non-concept features in o and concept features in t, defined as: $Cohs(o,t) = w(t) \times \sum_{c \in o \cap t} \sum_{q \in EB(o)} occ(c,q)\, f_o(c)\, f_o(q)$ (8) where EB(o) contains all non-concept features in o (e. [sent-203, score-0.401]
78 , non-Wikipedia entities and biographical information), occ(c, q) is the co-occurring number of concept c and feature q, fo(q) is the relative frequency of q in observation o. [sent-205, score-0.432]
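One way to read the cohesion measure of equation (8) is as a double sum over shared concepts and non-concept features, scaled by the topic weight. The sketch below follows that reading; the function signature, the dictionary layout for `occ` and `f_o`, and the toy values are all assumptions.

```python
def cohs(obs_concepts, obs_non_concepts, topic, topic_w, occ, f_o):
    """Cohesion between observation o and topic t: the topic weight times
    the sum, over concepts c shared by o and t and non-concept features q
    of o, of occ(c, q) * f_o(c) * f_o(q)."""
    shared = obs_concepts & topic
    total = sum(
        occ.get((c, q), 0) * f_o[c] * f_o[q]
        for c in shared
        for q in obs_non_concepts
    )
    return topic_w * total


# Toy observation (assumed): one shared concept and one biographical feature.
obs_concepts = {"swimming", "russia"}
obs_non_concepts = {"birth_year:1971"}
topic = {"swimming", "freestyle"}
occ = {("swimming", "birth_year:1971"): 2}        # co-occurrence counts
f_o = {"swimming": 0.5, "russia": 0.25, "birth_year:1971": 0.25}  # relative frequencies
value = cohs(obs_concepts, obs_non_concepts, topic, topic_w=2.0, occ=occ, f_o=f_o)
```

As sentence 79 notes, the value grows with the overlap between o and t, with the topic weight, and with the number of co-occurrences.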
79 It is easy to find that a higher cohesion can be achieved by larger overlap between o and t, higher topic weight and more cooccurrences of concept features in t and other features in o. [sent-206, score-0.486]
80 This system uses Wikipedia to enhance the results of name disambiguation. [sent-222, score-0.298]
81 (4) SSR: the Structural Semantic Relatedness model (Han and Zhao, 2010) creates a semantic graph to re-calculate the semantic relatedness between features, and captures both explicit semantic relations and implicit structural semantic knowledge. [sent-223, score-0.578]
82 The semantic graph pruning threshold is set to 0. [sent-231, score-0.278]
83 By introducing the corpus level topic weighting scheme, our model improves in average 1. [sent-243, score-0.306]
84 Recall that our topic weightings are obtained over the whole name observation set beyond local context; this improvement indicates that these corpus level person specific evidences render the person similarity more reasonably than that of a single document. [sent-245, score-1.342]
85 This shows that our co-occurrence based pruning strategy can help render the semantic graph with fewer noisy edges and thus generate more reasonable topics. (Table 1: Web person name disambiguation results on all three WePS datasets.) [sent-248, score-0.928]
86 We notice there are many noisy or short web pages which lead to inaccurate concept extraction, but these cross-document evidences can, to some extent, remedy this. [sent-256, score-0.269]
87 We think the reason may be that some name observation sets are too small to estimate non-concept relatedness via random walk. [sent-259, score-0.599]
88 The former does not use Wikipedia relatedness but only includes local relationship, and performs even slightly better than WS in WePS2, which indicates that non-Wikipedia concepts are important disambiguation features as well. [sent-263, score-0.576]
89 The first one is the edge pruning threshold during graph construction; the second one is the weight α in the SCAN algorithm; the third and the fourth are the combination parameters in the final similarity function. [sent-267, score-0.36]
90 The larger this weight is, the more neighborhood information can influence the similarity between two nodes in the semantic graph. [sent-277, score-0.332]
91 01, which indicates that neighborhood similarity is as considerable as semantic relatedness. [sent-289, score-0.27]
92 Our current work utilizes the topic information shared in one name observation set but is incapable of handling sparse name sets, which need more accurate relation extraction inside the name observations. [sent-293, score-1.322]
93 Jointly modeling entity linking and person (entity) disambiguation tasks will be an interesting direction where the two tasks are closely related and usually need to be considered at the same time. [sent-294, score-0.405]
94 Investigating the person name disambiguation task in different web applications will also be of great importance, e. [sent-295, score-0.774]
95 , disambiguating a name in streaming data or during knowledge base construction. [sent-297, score-0.332]
96 The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task. [sent-308, score-0.346]
97 An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. [sent-332, score-0.476]
98 Person name disambiguation on the web by two-stage clustering. [sent-360, score-0.539]
99 Weakly supervised learning for cross-document person name disambiguation supported by information extraction. [sent-423, score-0.659]
100 JHU1: an unsupervised approach to person name disambiguation using web snippets. [sent-434, score-0.774]
wordName wordTfidf (topN-words)
[('name', 0.298), ('hub', 0.265), ('concepts', 0.257), ('person', 0.235), ('topic', 0.216), ('wikipedia', 0.208), ('relatedness', 0.193), ('weps', 0.189), ('grape', 0.163), ('sim', 0.156), ('concept', 0.154), ('disambiguation', 0.126), ('evidences', 0.122), ('han', 0.121), ('web', 0.115), ('biographical', 0.114), ('kalashnikov', 0.114), ('observation', 0.108), ('scan', 0.103), ('topics', 0.101), ('bridge', 0.101), ('neighborhood', 0.1), ('similarity', 0.095), ('weighting', 0.09), ('graph', 0.085), ('zhao', 0.082), ('hac', 0.081), ('importance', 0.08), ('jiang', 0.079), ('pruning', 0.076), ('cohs', 0.076), ('ikeda', 0.076), ('namesakes', 0.076), ('nutritionist', 0.076), ('semantic', 0.075), ('clues', 0.075), ('outlier', 0.074), ('semeval', 0.066), ('bender', 0.065), ('page', 0.063), ('weight', 0.062), ('node', 0.062), ('prune', 0.062), ('ws', 0.061), ('vsm', 0.059), ('yarowsky', 0.058), ('coh', 0.057), ('incapable', 0.057), ('nutrition', 0.057), ('entities', 0.056), ('cohesion', 0.054), ('stroudsburg', 0.052), ('connectivity', 0.051), ('ba', 0.051), ('clustering', 0.051), ('named', 0.049), ('ssr', 0.049), ('artiles', 0.049), ('basketball', 0.049), ('current', 0.047), ('mann', 0.045), ('javier', 0.044), ('football', 0.044), ('emily', 0.044), ('vertex', 0.044), ('discriminative', 0.044), ('entity', 0.044), ('people', 0.042), ('threshold', 0.042), ('extra', 0.041), ('cluster', 0.041), ('ian', 0.04), ('observations', 0.039), ('email', 0.038), ('ordinary', 0.038), ('abridged', 0.038), ('configure', 0.038), ('iria', 0.038), ('mehrotra', 0.038), ('pilz', 0.038), ('simnb', 0.038), ('swimming', 0.038), ('tsm', 0.038), ('yiming', 0.038), ('names', 0.038), ('tuned', 0.035), ('brevity', 0.034), ('disambiguating', 0.034), ('ambiguous', 0.034), ('distribute', 0.033), ('connects', 0.033), ('phone', 0.033), ('sr', 0.033), ('render', 0.033), ('tcs', 0.033), ('purity', 0.033), ('relieve', 0.033), ('usages', 0.033), ('birth', 0.033), ('occ', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
2 0.1846831 19 emnlp-2012-An Entity-Topic Model for Entity Linking
Author: Xianpei Han ; Le Sun
Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention’s referent entity should be coherent with the document’s main topics”. In this paper, we propose a generative model called entity-topic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.
3 0.13682361 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler
Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.
4 0.13253012 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
5 0.12248832 84 emnlp-2012-Linking Named Entities to Any Database
Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates
Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.
6 0.12199519 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
7 0.12177428 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
8 0.10689532 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model
9 0.10572877 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing
10 0.10518527 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge
11 0.1035409 97 emnlp-2012-Natural Language Questions for the Web of Data
12 0.1029784 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model
13 0.10140036 41 emnlp-2012-Entity based QA Retrieval
14 0.091956809 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
15 0.089141794 85 emnlp-2012-Local and Global Context for Supervised and Unsupervised Metonymy Resolution
16 0.083991341 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques
17 0.071086951 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
18 0.067656912 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
19 0.066800892 50 emnlp-2012-Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
20 0.065904185 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
topicId topicWeight
[(0, 0.252), (1, 0.177), (2, 0.015), (3, 0.044), (4, -0.22), (5, 0.02), (6, 0.055), (7, 0.141), (8, -0.12), (9, -0.094), (10, 0.033), (11, -0.134), (12, 0.107), (13, 0.059), (14, 0.166), (15, -0.059), (16, 0.227), (17, -0.025), (18, -0.001), (19, 0.143), (20, -0.043), (21, 0.074), (22, 0.061), (23, -0.052), (24, 0.004), (25, 0.042), (26, -0.083), (27, 0.067), (28, 0.018), (29, 0.056), (30, -0.029), (31, -0.142), (32, -0.015), (33, -0.06), (34, 0.122), (35, 0.004), (36, 0.049), (37, -0.029), (38, 0.006), (39, 0.007), (40, -0.102), (41, 0.084), (42, -0.063), (43, 0.064), (44, -0.012), (45, -0.048), (46, 0.101), (47, 0.004), (48, -0.054), (49, -0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.97746795 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
2 0.6122874 19 emnlp-2012-An Entity-Topic Model for Entity Linking
Author: Xianpei Han ; Le Sun
Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention’s referent entity should be coherent with the document’s main topics”. In this paper, we propose a generative model called entity-topic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.
3 0.5885821 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing
Author: Hui Yang
Abstract: Taxonomies can serve as browsing tools for document collections. However, given an arbitrary collection, pre-constructed taxonomies could not easily adapt to the specific topic/task present in the collection. This paper explores techniques to quickly derive task-specific taxonomies supporting browsing in arbitrary document collections. The supervised approach directly learns semantic distances from users to propose meaningful task-specific taxonomies. The approach aims to produce globally optimized taxonomy structures by incorporating path consistency control and user-generated task specification into the general learning framework. A comparison to state-of-the-art systems and a user study jointly demonstrate that our techniques are highly effective.
4 0.55434567 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze
Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
5 0.51866388 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler
Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.
6 0.49942151 85 emnlp-2012-Local and Global Context for Supervised and Unsupervised Metonymy Resolution
7 0.49167946 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model
8 0.48394623 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings
9 0.48353973 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
10 0.47664511 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
11 0.45156363 41 emnlp-2012-Entity based QA Retrieval
12 0.44796285 84 emnlp-2012-Linking Named Entities to Any Database
13 0.44060439 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge
14 0.42774788 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model
15 0.40046829 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
16 0.36607847 97 emnlp-2012-Natural Language Questions for the Web of Data
17 0.32913324 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
18 0.32786301 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections
19 0.32336956 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
20 0.3132312 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types
topicId topicWeight
[(2, 0.016), (12, 0.293), (16, 0.034), (25, 0.017), (34, 0.094), (45, 0.012), (60, 0.127), (63, 0.059), (64, 0.023), (65, 0.041), (70, 0.023), (73, 0.019), (74, 0.036), (76, 0.048), (80, 0.018), (86, 0.023), (95, 0.052)]
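The topic weights above form a sparse vector over topic ids. The page does not state how its simValue scores are computed, so the following is only a minimal sketch of one common way to compare two such vectors, cosine similarity; it is not a reproduction of the scores listed here. The function name `cosine_similarity` and the `other_paper` vector are hypothetical and introduced purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {topic_id: weight}."""
    # Only topic ids present in both vectors contribute to the dot product.
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# Topic weights copied from the (topicId, topicWeight) list above.
this_paper = dict([
    (2, 0.016), (12, 0.293), (16, 0.034), (25, 0.017), (34, 0.094),
    (45, 0.012), (60, 0.127), (63, 0.059), (64, 0.023), (65, 0.041),
    (70, 0.023), (73, 0.019), (74, 0.036), (76, 0.048), (80, 0.018),
    (86, 0.023), (95, 0.052),
])

# Hypothetical topic vector for some other paper (made-up values).
other_paper = {12: 0.2, 34: 0.1, 60: 0.05}

score = cosine_similarity(this_paper, other_paper)
```

Storing each vector as a dictionary keeps the comparison proportional to the number of non-zero topics rather than to the total number of topics in the model.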
simIndex simValue paperId paperTitle
same-paper 1 0.78468603 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao
Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.
2 0.60165429 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
Author: Heeyoung Lee ; Marta Recasens ; Angel Chang ; Mihai Surdeanu ; Dan Jurafsky
Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.
3 0.56774253 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing
Author: Hui Yang
Abstract: Taxonomies can serve as browsing tools for document collections. However, given an arbitrary collection, pre-constructed taxonomies could not easily adapt to the specific topic/task present in the collection. This paper explores techniques to quickly derive task-specific taxonomies supporting browsing in arbitrary document collections. The supervised approach directly learns semantic distances from users to propose meaningful task-specific taxonomies. The approach aims to produce globally optimized taxonomy structures by incorporating path consistency control and user-generated task specification into the general learning framework. A comparison to state-of-the-art systems and a user study jointly demonstrate that our techniques are highly effective.
4 0.53444558 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
Author: Jayant Krishnamurthy ; Tom Mitchell
Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.
5 0.53429532 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
Author: Mahesh Joshi ; Mark Dredze ; William W. Cohen ; Carolyn Rose
Abstract: We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multi-domain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced class label setting, although in practice many multi-domain settings have domain-specific class label biases. When multi-domain learning is applied to these settings, (2) are multi-domain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art.
6 0.53245151 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
7 0.53186768 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
8 0.52868003 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
9 0.52856112 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
10 0.52752024 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
11 0.52677333 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
12 0.52638608 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
13 0.5263443 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
14 0.52596796 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
15 0.52498317 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
16 0.52493316 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
17 0.52464026 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
18 0.52356791 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
19 0.52307916 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
20 0.5228405 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation