emnlp emnlp2013 emnlp2013-130 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li
Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblogs. Entity linking in long text has been well studied in previous work. However, little work has focused on short text such as microblog posts. Microblog posts are short and noisy. Previous methods can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.
Reference: text
sentIndex sentText sentNum sentScore
1 Entity linking in long text has been well studied in previous work. [sent-2, score-0.302]
2 However, little work has focused on short text such as microblog posts. [sent-3, score-0.429]
3 Previous methods can extract few features from the post context. [sent-5, score-0.459]
4 In this paper we propose to use extra posts for the microblog entity linking task. [sent-6, score-1.417]
5 Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively. [sent-7, score-0.34]
6 Millions of new microblog posts are generated on such open broadcasting platforms every day. [sent-13, score-0.792]
7 A necessary step for information acquisition on microblogs is to identify which entities a post is about. [sent-15, score-0.916]
8 Such identification can be challenging because the entity mention may be ambiguous. [sent-16, score-0.253]
9 This post is about an Australian political leader, Tony Abbott, and his opinion on flood tax policy. [sent-25, score-0.554]
10 Understanding that this post mentions Tony Abbott is not trivial, because the name Abbott can refer to many people and organizations. [sent-26, score-0.566]
11 Wikipedia), entity linking is the task of identifying the referent KB entity of a target name mention in plain text. [sent-31, score-0.911]
12 Most current entity linking techniques are designed for long text such as news/blog articles (Mihalcea and Csomai, 2007; Cucerzan, 2007; Milne and Witten, 2008; Han and Sun, 2011; Zhang et al. [sent-32, score-0.513]
13 Entity linking for microblog posts has not been well studied. [sent-37, score-1.076]
14 Compared with news/blog articles, microblog posts are: short (each post contains no more than 140 characters); fresh (new entity-related content may not yet be included in the knowledge base); and informal (acronyms and a spoken-language writing style are common). [sent-38, score-1.259]
15 Without enough features, previous entity linking methods may fail. [sent-40, score-0.513]
16 The content of post (2) is highly related to post (1). [sent-47, score-0.918]
17 In contrast to the confusing post (1), the text in post (2) explicitly indicates that the Abbott here refers to the Australian political leader. [sent-48, score-0.987]
18 This inspires us to bridge the confusing post and the knowledge base with other posts. [sent-49, score-0.563]
19 In this paper, we approach the microblog entity linking by leveraging extra posts. [sent-50, score-1.106]
20 A straightforward method is to expand the post context with similar posts, which we call Context-Expansion-based Microblog Entity Linking (CEMEL). [sent-51, score-0.494]
21 In this method, we first construct a query with the given post and then search for it in a collection of posts. [sent-52, score-0.514]
22 From the search result, we select the most similar posts for the context expansion. [sent-53, score-0.345]
23 The disambiguation will benefit from the extra posts if, hopefully, they are related to the given post in content and include explicit features for the disambiguation. [sent-54, score-0.958]
24 In contrast to CEMEL, the extra posts in the graph-based method (GMEL) are not directly added into the context. [sent-56, score-0.475]
25 Instead, they are represented as nodes in a graph, and weighted by their similarity with the target post. [sent-57, score-0.142]
26 We use an iterative algorithm in this graph to propagate the entity weights through the edges between the post nodes. [sent-58, score-0.771]
27 We conduct experiments on real microblog data which we harvested from Twitter. [sent-59, score-0.429]
28 Current entity linking corpora, such as the TAC-KBP data (McNamee and Dang, 2009), mainly focus on long text. [sent-60, score-0.513]
29 Few microblog entity linking corpora are publicly available. [sent-61, score-0.942]
30 In this work, we manually annotated a microblog entity linking corpus. [sent-62, score-0.942]
31 This corpus inherits the target names from TAC-KBP2009. [sent-63, score-0.134]
32 Experimental results show that the performance of previous methods drops on microblog posts compared with long text. [sent-65, score-0.802]
33 Both CEMEL and GMEL significantly improve the performance over baselines, which means that entity linking systems on microblogs can be improved by leveraging extra posts. [sent-66, score-1.106]
34 We propose a context-expansion-based method and a graph-based method for microblog entity linking by leveraging extra posts. [sent-69, score-0.804]
35 We annotate a microblog entity linking corpus which is comparable to an existing long text corpus. [sent-70, score-0.967]
36 We show the inefficiency of previous methods on the microblog corpus, and our methods can significantly improve the results. [sent-71, score-0.429]
37 2 Task definition The microblog entity linking task is as follows: for a name mention in a microblog post, the system must find the referent entity of the name in a knowledge base, or return a NIL mark if the entity is absent from the knowledge base. [sent-72, score-2.057]
38 This definition is close to the entity linking task in the TAC-KBP evaluation (Ji and Grishman, 2011), except that the context of the target name is a microblog post, whereas in TAC-KBP the context is a news article or web log. [sent-73, score-1.52]
39 Several related tasks have been studied on microblog posts. [sent-74, score-0.429]
40 (2012)’s work, they link a post, rather than a name mention in the post, to relevant Wikipedia concepts. [sent-76, score-0.157]
41 (2013) define entity linking as first detecting all the mentions in a post and then linking the mentions to the knowledge base. [sent-79, score-1.111]
42 In contrast, our definition (as well as the TAC-KBP definition) focuses on a concerned name mention across different posts/documents. [sent-80, score-0.114]
43 3 Method A typical entity linking system can be broken down into two steps: candidate generation: This step narrows down the candidate entity range from any entity in the world to a limited set. [sent-81, score-1.035]
44 candidate ranking: This step ranks the candidates and outputs the top-ranked entity as the result. [sent-82, score-0.261]
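The two steps above can be sketched as follows. This is a minimal illustration and not the authors' system: the toy alias table, the entity descriptions, the word-overlap scorer and the `nil_threshold` parameter are all assumptions introduced for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy knowledge base (not from the paper):
# alias -> candidate entity ids, and entity id -> short description.
ALIASES: Dict[str, List[str]] = {"abbott": ["Tony_Abbott", "Greg_Abbott"]}
DESCRIPTIONS: Dict[str, str] = {
    "Tony_Abbott": "australian politician opposition leader",
    "Greg_Abbott": "texas attorney general",
}


def generate_candidates(name: str) -> List[str]:
    """Candidate generation: narrow all KB entities down to those known by this name."""
    return ALIASES.get(name.lower(), [])


def overlap_score(context: str, entity: str) -> float:
    """Toy ranking score: number of words shared by the context and the entity description."""
    context_words = set(context.lower().split())
    return float(len(context_words & set(DESCRIPTIONS[entity].split())))


def link(
    name: str,
    context: str,
    score: Callable[[str, str], float] = overlap_score,
    nil_threshold: float = 0.0,  # assumed cut-off below which we answer NIL
) -> Optional[str]:
    """Candidate ranking: return the top-ranked entity, or None (NIL) if nothing fits."""
    candidates = generate_candidates(name)
    if not candidates:
        return None
    best = max(candidates, key=lambda entity: score(context, entity))
    return best if score(context, best) > nil_threshold else None


print(link("Abbott", "Texas attorney general Abbott files lawsuit"))  # Greg_Abbott
```

With a context that shares no words with any description, `link` returns `None`, which mirrors the lack-of-context problem that motivates using extra posts.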
45 Each post node is connected to the corresponding candidate nodes from the knowledge base. [sent-91, score-0.643]
46 The edges between the nodes are weighted by the similarity between them. [sent-92, score-0.12]
47 In this paper, we use the candidate generation method described in Guo et al. [sent-93, score-0.05]
48 For candidate ranking, we use a Vector Space Model (VSM) and a Learning to Rank (LTR) model as baselines. [sent-95, score-0.05]
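A vector space baseline of this kind can be sketched with TF-IDF weights and cosine similarity. The sketch below is an assumption-laden illustration: it uses whitespace tokenization and a smoothed IDF chosen for simplicity, and it does not reproduce the scoring of the Lucene implementation used in the experiments. The entity descriptions are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors with a smoothed IDF (always positive)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(context: str, candidates: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank candidate entities by similarity between the post context and their text."""
    names = list(candidates)
    vectors = tfidf_vectors([context] + [candidates[name] for name in names])
    scored = [(name, cosine(vectors[0], vec)) for name, vec in zip(names, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(
    "abbott says no to flood tax",
    {
        "Greg_Abbott": "texas attorney general",
        "Tony_Abbott": "australian opposition leader who opposed the flood tax",
    },
)
print(ranking[0][0])  # Tony_Abbott
```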
49 The major challenge in microblog entity linking is the lack of context in the post. [sent-98, score-0.942]
50 An ideal solution is to expand the context with the posts which contain the same entity. [sent-99, score-0.38]
51 However, automatically judging whether a name mention in two documents refers to the same entity, namely cross document coreference, is not trivial. [sent-100, score-0.133]
52 Here our solution is to rank the posts by their likelihood of being co-referent with the target post and select the most likely co-referent posts for the expansion. [sent-101, score-0.737]
53 CEMEL is based on the assumption that, given a name and two posts in which the name is mentioned, the higher the similarity between the posts, the higher the likelihood that they are co-referent, and that co-referent posts may contain useful features for the disambiguation. [sent-102, score-1.201]
54 However, two literally similar posts may not be co-referent. [sent-103, score-0.345]
55 If such a non-co-referent post is added to the context, noise may be introduced. [sent-104, score-0.486]
56 URL This post is similar to post (1) because both contain “says” and “URL”. [sent-107, score-0.918]
57 But the Abbott in post (3) refers to the Texas Attorney General Greg Abbott. [sent-108, score-0.478]
58 In this sense, the expanded context from post (3) could mislead the disambiguation for post (1). [sent-109, score-0.969]
59 Such noise can be controlled by setting a strict number of posts to expand the context or weighting the contribution of this post to the target one. [sent-110, score-0.886]
60 Our CEMEL method consists of the following steps: First we construct a query with the terms from the target post. [sent-111, score-0.083]
61 Second we search for the query in a microblog post collection using a common information retrieval model such as the vector space model. [sent-112, score-0.943]
62 Note that here we require the searched posts to contain the target name mention. [sent-113, score-0.5]
63 Then we expand the target post with top N similar posts and use a typical entity linking method (such as VSM and LTR) with the expanded context. [sent-114, score-1.426]
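The CEMEL steps above can be sketched as a small retrieval-and-concatenation routine. This is a hedged illustration only: raw term counts with cosine similarity stand in for the retrieval model, matching the name by whitespace token is a simplification, and the function name `expand_context`, the `top_n` parameter and the example posts are assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Query/document representation: lower-cased whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def expand_context(target_post: str, name: str, collection: List[str], top_n: int = 3) -> str:
    """Append the top_n most similar posts that mention the target name."""
    query = bag_of_words(target_post)
    # Only posts that contain the target name mention are eligible.
    pool = [
        post for post in collection
        if post != target_post and name.lower() in post.lower().split()
    ]
    ranked = sorted(pool, key=lambda post: cosine(query, bag_of_words(post)), reverse=True)
    return " ".join([target_post] + ranked[:top_n])


# Invented example posts, for illustration only.
collection = [
    "abbott australian opposition leader attacks flood tax",
    "abbott texas attorney general files lawsuit",
    "flood tax debate continues in parliament",
]
expanded = expand_context("abbott says no to flood tax", "Abbott", collection, top_n=1)
print(expanded)
```

The expanded text would then be passed to an ordinary ranker (such as VSM or LTR) in place of the original post. Keeping `top_n` small is one way to limit the noise from similar but non-co-referent posts.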
64 Each node of this graph represents a candidate entity (e. [sent-116, score-0.327]
65 p4) In this graph, each node represents an entity or a post of the given target name. [sent-126, score-0.752]
66 Between each pair of post nodes, each pair of entity nodes and each post node and its candidate entity nodes, there is an edge. [sent-127, score-1.498]
67 Entity nodes are labeled by themselves and candidate nodes are initialized as unlabeled nodes. [sent-129, score-0.196]
68 For the edges between post node pairs and entity node pairs, we use cosine similarity. [sent-130, score-0.765]
69 For the edges between a post node and its candidate entity nodes, we use the score given by traditional entity linking methods. [sent-131, score-1.313]
70 We use an iterative algorithm on this graph to propagate the labels from the entity nodes to the post nodes. [sent-132, score-0.819]
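The propagation idea can be sketched with a plain, clamped label propagation loop. This is an illustrative simplification and should not be read as the algorithm used in the experiments: it is neither the modified adsorption (MAD) algorithm nor the Junto implementation, and the function name, the edge weights and the iteration count are assumptions. Because entity nodes are clamped to their own label here, entity-to-entity edges have no effect in this sketch.

```python
from typing import Dict, List, Optional, Tuple


def propagate(
    entities: List[str],
    posts: List[str],
    edges: Dict[Tuple[str, str], float],
    iterations: int = 20,
) -> Dict[str, Optional[str]]:
    """Spread entity labels to post nodes over weighted, undirected edges."""
    neighbors: Dict[str, List[Tuple[str, float]]] = {n: [] for n in entities + posts}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))

    # Entity nodes carry their own label; post nodes start unlabeled.
    labels: Dict[str, Dict[str, float]] = {e: {e: 1.0} for e in entities}
    labels.update({p: {} for p in posts})

    for _ in range(iterations):
        updated = {}
        for post in posts:
            scores: Dict[str, float] = {}
            for neighbor, weight in neighbors[post]:
                for entity, mass in labels[neighbor].items():
                    scores[entity] = scores.get(entity, 0.0) + weight * mass
            total = sum(scores.values())
            updated[post] = {e: s / total for e, s in scores.items()} if total else {}
        labels.update(updated)  # synchronous update of all post nodes

    return {
        post: max(labels[post], key=labels[post].get) if labels[post] else None
        for post in posts
    }


# Invented weights: p1 is ambiguous on its own, p2 clearly favors Tony_Abbott,
# and the two posts are similar to each other.
edges = {
    ("p1", "Tony_Abbott"): 0.5,
    ("p1", "Greg_Abbott"): 0.5,
    ("p2", "Tony_Abbott"): 0.9,
    ("p2", "Greg_Abbott"): 0.1,
    ("p1", "p2"): 0.8,
}
result = propagate(["Tony_Abbott", "Greg_Abbott"], ["p1", "p2"], edges)
print(result)
```

In this toy graph the ambiguous post p1 inherits its label through its strong edge to p2, which is the effect the graph-based method relies on.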
71 1 Data Annotation Until now, little microblog entity linking data has been publicly available. [sent-135, score-0.942]
72 In this work, we manually annotate a data set on microblog posts. [sent-136, score-0.454]
73 6 million microblog posts in Twitter dated from January 23 to February 8, 2011. [sent-138, score-0.792]
74 In order to compare with existing entity linking on long text, we select a subset of target names from TAC-KBP2009 and inherit the knowledge base in the TAC-KBP evaluation. [sent-139, score-0.723]
75 Figure 2: Percentage of the co-reference posts in the top N similar posts. Figure 3: Impact of expansion post number in CEMEL. The TAC-KBP2009 data set includes 513 target names. [sent-141, score-1.227]
76 We search for all the target names in the post collection and get 26,643 matches. [sent-142, score-0.571]
77 We randomly sample 120 posts for each of the top 30 most frequently matched target names and filter out non-English and overly short (i. [sent-143, score-0.438]
78 Then we get 2,258 posts for 25 target names and manually link the target name mentions in the posts to the TAC-KBP knowledge base. [sent-146, score-1.006]
79 In order to evaluate the assumption in CEMEL: similar posts tend to co-reference, we randomly select 10 posts for 5 target names respectively and search for the posts in the post collection. [sent-147, score-1.587]
80 From the search result of each of the 50 posts, we select the top 20 posts and manually annotate whether they are co-referent with the query post. [sent-148, score-0.406]
81 We use Lucene and ListNet with default settings for the VSM and LTR implementations, respectively. Figure 4: Accuracy of GMEL with different rates of extra post nodes. [sent-154, score-0.662]
82 Given a target name, the GMEL graph includes all the evaluation posts as well as a set of extra post nodes searched from the post collection with the query of the target name. [sent-158, score-1.682]
83 We filter out determiners, interjections, punctuation, emoticons, discourse markers and URLs in the posts with a Twitter part-of-speech tagger (Owoputi et al. [sent-159, score-0.367]
84 The similarity between a post and its candidate entities is set with the score given by VSM or LTR and the similarity between other nodes is set with the corresponding cosine similarity. [sent-161, score-0.654]
85 When N reaches 10, about 60% of the similar posts are co-referent with the query post, and the rate of decrease slows down. [sent-166, score-0.84]
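The percentage of co-referent posts among the top N results can be computed from the manual judgments as sketched below. The function name and the data layout (one ranked list of booleans per query post) are assumptions for illustration, and the judgments shown are invented toy values, not the annotated data.

```python
from typing import List


def coreference_rate_at_n(judgments: List[List[bool]], n: int) -> float:
    """Fraction of co-referent posts among the top n results, pooled over all queries."""
    kept = [flag for ranked in judgments for flag in ranked[:n]]
    return sum(kept) / len(kept) if kept else 0.0


# One list per query post; True means the retrieved post is co-referent with the query post.
toy_judgments = [
    [True, True, False, False],
    [True, False, False, True],
]
for n in (1, 2, 4):
    print(n, coreference_rate_at_n(toy_judgments, n))
```

Plotting this value for increasing N gives the kind of curve described for Figure 2.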
86 Figure 3 shows the impact of the extra post number for the context expansion in CEMEL. [sent-170, score-0.62]
87 com/parthatalukdar/junto Figure 5: Label entropy of GMEL with different rate of extra post nodes Figure 6: Accuracy of the systems by CEMEL. [sent-172, score-0.662]
88 Then more extra posts will pull down the accuracy. [sent-174, score-0.475]
89 The x-axis is the ratio of the number of extra posts to the number of evaluation posts. [sent-176, score-1.048]
90 We can see that the accuracy of MAD increases with the number of extra post nodes at first and then stabilizes. [sent-177, score-0.68]
91 The accuracy of LP increases at first and drops when more extra posts are added into the graph. [sent-178, score-0.521]
92 Figure 6 shows the performance of the systems on the microblog data. [sent-187, score-0.448]
93 We use the optimal number of expansion posts for CEMEL and the MAD algorithm for GMEL with all retrieved extra post nodes. [sent-188, score-1.115]
94 5 Conclusion In this paper we approach microblog entity linking by leveraging extra posts. [sent-199, score-1.106]
95 Experimental results on our data set show that the performance of traditional methods drops on the microblog data. [sent-201, score-0.477]
96 In the graph-based method the modified adsorption algorithm performs better than the label propagation algorithm. [sent-203, score-0.059]
97 A generative entity-mention model for linking entities with knowledge base. [sent-234, score-0.356]
98 Overview of the TAC 2009 knowledge base population track. [sent-257, score-0.131]
99 Linden: linking named entities with knowledge base via semantic knowledge. [sent-289, score-0.406]
100 Entity linking with effective acronym expansion, instance selection, and topic modeling. [sent-303, score-0.302]
wordName wordTfidf (topN-words)
[('post', 0.459), ('microblog', 0.429), ('posts', 0.345), ('linking', 0.302), ('gmel', 0.223), ('entity', 0.211), ('ltr', 0.21), ('cemel', 0.203), ('vsm', 0.18), ('abbott', 0.135), ('extra', 0.13), ('nodes', 0.073), ('name', 0.072), ('lp', 0.064), ('abbot', 0.061), ('mad', 0.053), ('candidate', 0.05), ('base', 0.05), ('guo', 0.05), ('target', 0.047), ('names', 0.046), ('link', 0.043), ('mention', 0.042), ('adsorption', 0.041), ('flood', 0.041), ('inherit', 0.041), ('meij', 0.041), ('varma', 0.041), ('tony', 0.04), ('talukdar', 0.04), ('query', 0.036), ('searched', 0.036), ('mentions', 0.035), ('yuhang', 0.035), ('hachey', 0.035), ('owoputi', 0.035), ('sheng', 0.035), ('node', 0.035), ('expand', 0.035), ('wikipedia', 0.035), ('says', 0.035), ('tac', 0.035), ('leveraging', 0.034), ('url', 0.033), ('zheng', 0.032), ('mcnamee', 0.032), ('tax', 0.032), ('expansion', 0.031), ('graph', 0.031), ('wei', 0.029), ('confusing', 0.028), ('qin', 0.028), ('entities', 0.028), ('drops', 0.028), ('ny', 0.028), ('expanded', 0.027), ('oregon', 0.026), ('knowledge', 0.026), ('heng', 0.026), ('referent', 0.026), ('milne', 0.026), ('ting', 0.026), ('york', 0.025), ('edges', 0.025), ('kulkarni', 0.025), ('texas', 0.025), ('annotate', 0.025), ('disambiguation', 0.024), ('portland', 0.023), ('propagate', 0.023), ('pages', 0.023), ('political', 0.022), ('similarity', 0.022), ('iterative', 0.022), ('twitter', 0.022), ('posted', 0.022), ('kb', 0.022), ('population', 0.02), ('traditional', 0.02), ('ratinov', 0.02), ('pereira', 0.02), ('baselines', 0.02), ('liu', 0.019), ('mihalcea', 0.019), ('performances', 0.019), ('millions', 0.019), ('cikm', 0.019), ('kumar', 0.019), ('association', 0.019), ('collection', 0.019), ('refers', 0.019), ('han', 0.019), ('georgia', 0.018), ('atlanta', 0.018), ('accuracy', 0.018), ('day', 0.018), ('propagation', 0.018), ('lion', 0.018), ('instant', 0.018), ('meziane', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
2 0.32848537 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen
Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.
3 0.18930431 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
Author: Dong Nguyen ; A. Seza Dogruoz
Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.
4 0.16692632 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang
Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context document, performance is further improved.
5 0.15260717 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso
Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
6 0.13008127 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes
8 0.1231474 160 emnlp-2013-Relational Inference for Wikification
9 0.10766286 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
10 0.10444896 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves
11 0.10343507 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution
12 0.092673153 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
13 0.08157336 24 emnlp-2013-Application of Localized Similarity for Web Documents
14 0.054724328 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
15 0.053789482 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
16 0.050899498 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution
17 0.050846592 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution
18 0.04930836 143 emnlp-2013-Open Domain Targeted Sentiment
19 0.04902257 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
20 0.048885714 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
topicId topicWeight
[(0, -0.165), (1, 0.125), (2, 0.046), (3, -0.076), (4, 0.06), (5, -0.015), (6, 0.077), (7, 0.235), (8, 0.175), (9, -0.078), (10, -0.018), (11, 0.161), (12, 0.037), (13, 0.152), (14, -0.363), (15, -0.065), (16, 0.045), (17, 0.072), (18, -0.335), (19, 0.041), (20, 0.258), (21, -0.066), (22, -0.057), (23, -0.147), (24, -0.03), (25, 0.05), (26, -0.003), (27, 0.053), (28, 0.058), (29, -0.11), (30, 0.081), (31, -0.039), (32, 0.004), (33, -0.005), (34, -0.016), (35, 0.045), (36, -0.03), (37, 0.03), (38, -0.033), (39, 0.004), (40, 0.017), (41, 0.017), (42, -0.031), (43, 0.041), (44, 0.077), (45, 0.01), (46, 0.005), (47, 0.006), (48, 0.074), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.97263992 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
2 0.86756873 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen
Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.
3 0.59719318 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
Author: Dong Nguyen ; A. Seza Dogruoz
Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.
4 0.55686164 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes
Author: Ruihong Huang ; Ellen Riloff
Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.
5 0.46246588 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang
Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context document, performance is further improved.
7 0.34160921 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
8 0.33129245 151 emnlp-2013-Paraphrasing 4 Microblog Normalization
9 0.30767366 160 emnlp-2013-Relational Inference for Wikification
10 0.30410528 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
11 0.30195373 24 emnlp-2013-Application of Localized Similarity for Web Documents
12 0.28589588 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves
13 0.26952648 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
14 0.23663679 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution
15 0.2082061 23 emnlp-2013-Animacy Detection with Voting Models
16 0.20748305 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
18 0.1949662 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision
19 0.18535127 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
20 0.17849308 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model
topicId topicWeight
[(3, 0.025), (9, 0.013), (18, 0.027), (22, 0.076), (25, 0.269), (30, 0.106), (36, 0.015), (47, 0.012), (50, 0.013), (51, 0.147), (66, 0.038), (71, 0.033), (75, 0.024), (77, 0.011), (90, 0.011), (96, 0.08)]
simIndex simValue paperId paperTitle
same-paper 1 0.76379865 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts
2 0.60937971 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
Author: Qiming Diao ; Jing Jiang
Abstract: With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant message. On the one hand, people tweets about their daily lives, and on the other hand, when major events happen, people also follow and tweet about them. Moreover, people’s posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the association between events and topics. Our experiments shows that our model can accurately identify meaningful events and the event-topic affinity vectors are effective for event recommendation and grouping events by topics.
3 0.60918272 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou
Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that, people with similar academic, business or social connections (e.g. co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.
Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu
Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with a CET-based structure, users perform significantly better in knowledge learning than with a traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).
5 0.59529841 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
Author: Shashank Srivastava ; Dirk Hovy ; Eduard Hovy
Abstract: In this paper, we propose a walk-based graph kernel that generalizes the notion of tree kernels to continuous spaces. Our proposed approach subsumes a general framework for word similarity and, in particular, provides a flexible way to incorporate distributed representations. Using vector representations, such an approach captures both distributional semantic similarities among words and the structural relations between them (encoded as the structure of the parse tree). We show an efficient formulation to compute this kernel using simple matrix operations. We present our results on three diverse NLP tasks, showing state-of-the-art results.
6 0.59038198 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes
7 0.5883202 143 emnlp-2013-Open Domain Targeted Sentiment
8 0.58685672 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
9 0.58536822 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
10 0.58274132 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
11 0.58185458 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
12 0.57927144 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
13 0.57906199 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution
14 0.57860816 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
15 0.57856834 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
16 0.57718021 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
17 0.57701606 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
18 0.57682645 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
19 0.57580966 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
20 0.57562131 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction