130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts


Source: pdf

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Entity linking in long text has been well studied in previous works. [sent-2, score-0.302]

2 However few work has focused on short text such as microblog post. [sent-3, score-0.429]

3 Previous method can extract few features from the post context. [sent-5, score-0.459]

4 In this paper we propose to use extra posts for the microblog entity linking task. [sent-6, score-1.417]

5 Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively. [sent-7, score-0.34]

6 Millions of new microblog posts are generated on such open broadcasting platforms every day. [sent-13, score-0.792]

7 A necessary step for the information acquisition on microblog is to identify which entities a post is about. [sent-15, score-0.916]

8 Such identification can be challenging because the entity mention may be ambiguous. [sent-16, score-0.253]

9 This post is about an Australian political leader, Tony Abbott, and his opinion on flood tax policy. [sent-25, score-0.554]

10 To understand that this post mentions Tony Abbott is not trivial because the name Abbott can refer to many people and organizations. [sent-26, score-0.566]

11 Wikipedia), entity linking is the task to identify the referent KB entity of a target name mention in plain text. [sent-31, score-0.911]

12 Most current entity linking techniques are designed for long text such as news/blog articles (Mihalcea and Csomai, 2007; Cucerzan, 2007; Milne and Witten, 2008; Han and Sun, 2011; Zhang et al. [sent-32, score-0.513]

13 Entity linking for microblog posts has not been well studied. [sent-37, score-1.076]

14 Compared with news/blog articles, microblog posts are: short (each post contains no more than 140 characters); fresh (new entity-related content may not yet have been included in the knowledge base); informal (acronyms and a spoken-language writing style are common). [sent-38, score-1.259]

15 Without enough features, previous entity linking methods may fail. [sent-40, score-0.513]

16 The content of post (2) is highly related to post (1). [sent-47, score-0.918]

17 In contrast to the confusing post (1), the text in post (2) explicitly indicates that the Abbott here refers to the Australian political leader. [sent-48, score-0.987]

18 This inspires us to bridge the confusing post and the knowledge base with other posts. [sent-49, score-0.563]

19 In this paper, we approach the microblog entity linking by leveraging extra posts. [sent-50, score-1.106]

20 A straightforward method is to expand the post context with similar posts, which we call Context-Expansion-based Microblog Entity Linking (CEMEL). [sent-51, score-0.494]

21 In this method, we first construct a query with the given post and then search for it in a collection of posts. [sent-52, score-0.514]

22 From the search result, we select the most similar posts for the context expansion. [sent-53, score-0.345]

23 The disambiguation will benefit from the extra posts if, hopefully, they are related to the given post in content and include explicit features for the disambiguation. [sent-54, score-0.958]

24 In contrast to CEMEL, the extra posts in GMEL are not directly added into the context. [sent-56, score-0.475]

25 Instead, they are represented as nodes in a graph, and weighted by their similarity with the target post. [sent-57, score-0.142]

26 We use an iterative algorithm in this graph to propagate the entity weights through the edges between the post nodes. [sent-58, score-0.771]

27 We conduct experiments on real microblog data which we harvested from Twitter. [sent-59, score-0.429]

28 Current entity linking corpus, such as the TAC-KBP data (McNamee and Dang, 2009), mainly focuses on long text. [sent-60, score-0.513]

29 And few microblog entity linking corpora are publicly available. [sent-61, score-0.942]

30 In this work, we manually annotated a microblog entity linking corpus. [sent-62, score-0.942]

31 This corpus inherits the target names from TAC-KBP2009. [sent-63, score-0.134]

32 Experimental results show that the performance of previous methods drops on microblog posts compared with long text. [sent-65, score-0.802]

33 Both CEMEL and GMEL can significantly improve the performance over baselines, which means that entity linking systems on microblog can be improved by leveraging extra posts. [sent-66, score-1.106]

34 We propose a context-expansion-based and a graph-based method for microblog entity linking by leveraging extra posts. [sent-69, score-0.804]

35 We annotate a microblog entity linking corpus which is comparable to an existing long text corpus. [sent-70, score-0.967]

36 We show the inefficiency of previous methods on the microblog corpus and our method can significantly improve the results. [sent-71, score-0.429]

37 2 Task definition. The microblog entity linking task is: for a name mention in a microblog post, the system is to find the referent entity of the name in a knowledge base, or return a NIL mark if the entity is absent from the knowledge base. [sent-72, score-2.057]

38 This definition is close to the entity linking task in the TAC-KBP evaluation (Ji and Grishman, 2011), except that the context of the target name is a microblog post, whereas in TAC-KBP the context is a news article or web log. [sent-73, score-1.52]

39 Several related tasks have been studied on microblog posts. [sent-74, score-0.429]

40 (2012)’s work, they link a post, rather than a name mention in the post, to relevant Wikipedia concepts. [sent-76, score-0.157]

41 (2013) define entity linking as to first detect all the mentions in a post and then link the mentions to the knowledge base. [sent-79, score-1.111]

42 In contrast, our definition (as well as the TAC-KBP definition) focuses on a concerned name mention across different posts/documents. [sent-80, score-0.114]

43 3 Method. A typical entity linking system can be broken down into two steps. Candidate generation: this step narrows down the candidate entity range from any entity in the world to a limited set. [sent-81, score-1.035]

44 Candidate ranking: this step ranks the candidates and outputs the top-ranked entity as the result. [sent-82, score-0.261]

45 Each post node is connected to the corresponding candidate nodes from the knowledge base. [sent-91, score-0.643]

46 The edges between the nodes are weighted by the similarity between them. [sent-92, score-0.12]

47 In this paper, we use the candidate generation method described in Guo et al. [sent-93, score-0.05]

48 For the candidate ranking, we use a Vector Space Model (VSM) and a Learning to Rank (LTR) as baselines. [sent-95, score-0.05]

49 The major challenge in microblog entity linking is the lack of context in the post. [sent-98, score-0.942]

50 An ideal solution is to expand the context with the posts which contain the same entity. [sent-99, score-0.38]

51 However, automatically judging whether a name mention in two documents refers to the same entity, namely cross document coreference, is not trivial. [sent-100, score-0.133]

52 Here our solution is to rank the posts by their possibility of co-reference to the target one and select the most possible co-referent posts for the expansion. [sent-101, score-0.737]

53 CEMEL is based on two assumptions: given a name and two posts where the name is mentioned, the higher the similarity between the posts, the higher the possibility of their co-reference; and the co-referent posts may contain useful features for the disambiguation. [sent-102, score-1.201]

54 However, two literally similar posts may not be co-referent. [sent-103, score-0.345]

55 If such a non-co-referent post is added to the context, noise may be introduced. [sent-104, score-0.486]

56 URL This post is similar to post (1) because they both contain “says” and “URL”. [sent-107, score-0.918]

57 But the Abbott in post (3) refers to the Texas Attorney General Greg Abbott. [sent-108, score-0.478]

58 In this sense, the expanded context in post (3) could mislead the disambiguation for post (1). [sent-109, score-0.969]

59 Such noise can be controlled by setting a strict number of posts to expand the context or weighting the contribution of this post to the target one. [sent-110, score-0.886]

60 Our CEMEL method consists of the following steps: First we construct a query with the terms from the target post. [sent-111, score-0.083]

61 Second we search for the query in a microblog post collection using a common information retrieval model such as the vector space model. [sent-112, score-0.943]

62 Note that here we require that the searched posts contain the target name mention. [sent-113, score-0.5]

63 Then we expand the target post with top N similar posts and use a typical entity linking method (such as VSM and LTR) with the expanded context. [sent-114, score-1.426]

64 Each node of this graph represents a candidate entity (e. [sent-116, score-0.327]

65 p4) In this graph, each node represents an entity or a post of the given target name. [sent-126, score-0.752]

66 Between each pair of post nodes, each pair of entity nodes and each post node and its candidate entity nodes, there is an edge. [sent-127, score-1.498]

67 Entity nodes are labeled by themselves and candidate nodes are initialized as unlabeled nodes. [sent-129, score-0.196]

68 For the edges between post node pairs and entity node pairs, we use cosine similarity. [sent-130, score-0.765]

69 For the edges between a post node and its candidate entity nodes, we use the score given by traditional entity linking methods. [sent-131, score-1.313]

70 We use an iterative algorithm on this graph to propagate the labels from the entity nodes to the post nodes. [sent-132, score-0.819]

71 1 Data Annotation Till now, few microblog entity linking data is publicly available. [sent-135, score-0.942]

72 In this work, we manually annotate a data set on microblog posts. [sent-136, score-0.454]

73 6 million microblog posts in Twitter dated from January 23 to February 8, 2011. [sent-138, score-0.792]

74 In order to compare with existing entity linking on long text, we select a subset of target names from TAC-KBP2009 and inherit the knowledge base in the TAC-KBP evaluation. [sent-139, score-0.723]

75 (Figure 2: Percentage of the co-reference posts in the top N similar posts.) (Figure 3: Impact of expansion post number in CEMEL.) The TAC-KBP2009 data set includes 513 target names. [sent-141, score-1.227]

76 We search for all the target names in the post collection and get 26,643 matches. [sent-142, score-0.571]

77 We randomly sample 120 posts for each of the top 30 most frequently matched target names and filter out non-English and overly short (i. [sent-143, score-0.438]

78 Then we get 2,258 posts for 25 target names and manually link the target name mentions in the posts to the TAC-KBP knowledge base. [sent-146, score-1.006]

79 In order to evaluate the assumption in CEMEL: similar posts tend to co-reference, we randomly select 10 posts for 5 target names respectively and search for the posts in the post collection. [sent-147, score-1.587]

80 From the search result of each of the 50 posts, we select the top 20 posts and manually annotate if they coreference with the query post. [sent-148, score-0.406]

81 We use Lucene and ListNet with default settings for the VSM and LTR implementations respectively. (Figure 4: Accuracy of GMEL with different rates of extra post nodes.) [sent-154, score-0.662]

82 Given a target name, the GMEL graph includes all the evaluation posts as well as a set of extra post nodes searched from the post collection with the query of the target name. [sent-158, score-1.682]

83 We filter out determiners, interjections, punctuations, emoticons, discourse markers and URLs in the posts with a twitter part-of-speech tagger (Owoputi et al. [sent-159, score-0.367]

84 The similarity between a post and its candidate entities is set with the score given by VSM or LTR and the similarity between other nodes is set with the corresponding cosine similarity. [sent-161, score-0.654]

85 When the N is up to 10, about 60% of the similar posts co-reference with the query post and the decrease speed slows down. [sent-166, score-0.84]

86 Figure 3 shows the impact of the extra post number for the context expansion in CEMEL. [sent-170, score-0.62]

87 com/parthatalukdar/junto (Figure 5: Label entropy of GMEL with different rates of extra post nodes.) (Figure 6: Accuracy of the systems.) by CEMEL. [sent-172, score-0.662]

88 Then more extra posts will pull down the accuracy. [sent-174, score-0.475]

89 The x-axis is the rate of the extra post number over the evaluation post number. [sent-176, score-1.048]

90 We can see that the accuracy of MAD increases with the number of extra post nodes at first and then turns to be stable. [sent-177, score-0.68]

91 The accuracy of LP increases at first and drops when more extra posts are added into the graph. [sent-178, score-0.521]

92 867 Figure 6 shows the performances of the systems on the microblog data. [sent-187, score-0.448]

93 We set the optimal expansion post number of CEMEL and use MAD algorithm for GMEL with all searched extra post nodes. [sent-188, score-1.115]

94 5 Conclusion In this paper we approach microblog entity linking by leveraging extra posts. [sent-199, score-1.106]

95 Experimental results on our data set show that the performance of traditional method drops on the microblog data. [sent-201, score-0.477]

96 In the graph-based method the modified adsorption algorithm performs better than the label propagation algorithm. [sent-203, score-0.059]

97 A generative entity-mention model for linking entities with knowledge base. [sent-234, score-0.356]

98 Overview of the TAC 2009 knowledge base population track. [sent-257, score-0.131]

99 Linden: linking named entities with knowledge base via semantic knowledge. [sent-289, score-0.406]

100 Entity linking with effective acronym expansion, instance selection, and topic modeling. [sent-303, score-0.302]
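
The CEMEL steps in sentences 60 to 63 above (build a query from the target post, search a post collection, keep only posts that contain the target name, expand the context with the top N hits) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain term-count cosine similarity stands in for the retrieval model, and the function names (`tf_vector`, `cosine`, `expand_context`) and the toy posts are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Lowercased bag-of-words term counts (a stand-in for a real IR model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_context(target_post, name, collection, top_n=2):
    """Return the target post followed by its top_n most similar posts
    that contain the target name mention."""
    query = tf_vector(target_post)
    # Only posts that mention the target name are eligible for expansion.
    candidates = [p for p in collection
                  if name.lower() in p.lower() and p != target_post]
    ranked = sorted(candidates,
                    key=lambda p: cosine(query, tf_vector(p)),
                    reverse=True)
    return [target_post] + ranked[:top_n]


# Hypothetical toy data, invented for illustration only.
target = "Abbott says flood tax is wrong"
collection = [
    "Tony Abbott opposes the flood tax in Australia",
    "Greg Abbott says Texas will sue",
    "flood tax debate continues in parliament",  # no name mention: filtered out
    "Abbott says flood tax plan is wrong for Australia",
]
expanded = expand_context(target, "Abbott", collection, top_n=2)
print(expanded)
```

The expanded list would then be passed as context to a ranker such as the VSM or LTR baseline. Keeping `top_n` small is one way to limit the noise described in sentences 54 to 59: with a larger `top_n`, the post about Greg Abbott would also enter the context.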

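The graph-based idea in sentences 64 to 70 above (post nodes linked to candidate entity nodes by a base linker score, post nodes linked to each other by similarity, labels propagated iteratively) can be illustrated with a simplified label propagation loop. This is a sketch under assumptions, not the LP or MAD algorithm the authors ran: entity nodes are treated as clamped seeds, entity-to-entity edges are omitted because clamped seeds do not update, and the names `propagate`, `post_entity` and `post_post` as well as all edge weights are hypothetical.

```python
def propagate(post_entity, post_post, entities, iterations=20):
    """Simplified label propagation with clamped entity (seed) nodes.

    post_entity: {post: {entity: weight}}   scores from a base linker
    post_post:   {(post_a, post_b): weight} symmetric post similarity
    Returns {post: {entity: probability}}.
    """
    posts = list(post_entity)
    # Every post starts with a uniform distribution over the entities.
    labels = {p: {e: 1.0 / len(entities) for e in entities} for p in posts}

    def sim(a, b):
        return post_post.get((a, b), post_post.get((b, a), 0.0))

    for _ in range(iterations):
        updated = {}
        for p in posts:
            scores = {}
            for e in entities:
                # A clamped entity node carries its own label with mass 1.
                mass = post_entity[p].get(e, 0.0)
                # Neighbouring posts pass on their current belief in e.
                mass += sum(sim(p, q) * labels[q][e] for q in posts if q != p)
                scores[e] = mass
            total = sum(scores.values())
            updated[p] = {e: (s / total if total else 1.0 / len(entities))
                          for e, s in scores.items()}
        labels = updated
    return labels


# Hypothetical toy graph: p1 is ambiguous, p2 is explicit and similar to p1.
entities = ["Tony Abbott", "Greg Abbott"]
post_entity = {
    "p1": {"Tony Abbott": 0.5, "Greg Abbott": 0.5},
    "p2": {"Tony Abbott": 0.9, "Greg Abbott": 0.1},
}
post_post = {("p1", "p2"): 0.8}
labels = propagate(post_entity, post_post, entities)
best = max(labels["p1"], key=labels["p1"].get)
print(best, labels["p1"])
```

In this toy graph the base linker cannot separate the two candidates for p1, and the decision comes entirely from the similar post p2, which mirrors the motivation given in sentences 16 to 18.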

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('post', 0.459), ('microblog', 0.429), ('posts', 0.345), ('linking', 0.302), ('gmel', 0.223), ('entity', 0.211), ('ltr', 0.21), ('cemel', 0.203), ('vsm', 0.18), ('abbott', 0.135), ('extra', 0.13), ('nodes', 0.073), ('name', 0.072), ('lp', 0.064), ('abbot', 0.061), ('mad', 0.053), ('candidate', 0.05), ('base', 0.05), ('guo', 0.05), ('target', 0.047), ('names', 0.046), ('link', 0.043), ('mention', 0.042), ('adsorption', 0.041), ('flood', 0.041), ('inherit', 0.041), ('meij', 0.041), ('varma', 0.041), ('tony', 0.04), ('talukdar', 0.04), ('query', 0.036), ('searched', 0.036), ('mentions', 0.035), ('yuhang', 0.035), ('hachey', 0.035), ('owoputi', 0.035), ('sheng', 0.035), ('node', 0.035), ('expand', 0.035), ('wikipedia', 0.035), ('says', 0.035), ('tac', 0.035), ('leveraging', 0.034), ('url', 0.033), ('zheng', 0.032), ('mcnamee', 0.032), ('tax', 0.032), ('expansion', 0.031), ('graph', 0.031), ('wei', 0.029), ('confusing', 0.028), ('qin', 0.028), ('entities', 0.028), ('drops', 0.028), ('ny', 0.028), ('expanded', 0.027), ('oregon', 0.026), ('knowledge', 0.026), ('heng', 0.026), ('referent', 0.026), ('milne', 0.026), ('ting', 0.026), ('york', 0.025), ('edges', 0.025), ('kulkarni', 0.025), ('texas', 0.025), ('annotate', 0.025), ('disambiguation', 0.024), ('portland', 0.023), ('propagate', 0.023), ('pages', 0.023), ('political', 0.022), ('similarity', 0.022), ('iterative', 0.022), ('twitter', 0.022), ('posted', 0.022), ('kb', 0.022), ('population', 0.02), ('traditional', 0.02), ('ratinov', 0.02), ('pereira', 0.02), ('baselines', 0.02), ('liu', 0.019), ('mihalcea', 0.019), ('performances', 0.019), ('millions', 0.019), ('cikm', 0.019), ('kumar', 0.019), ('association', 0.019), ('collection', 0.019), ('refers', 0.019), ('han', 0.019), ('georgia', 0.018), ('atlanta', 0.018), ('accuracy', 0.018), ('day', 0.018), ('propagation', 0.018), ('lion', 0.018), ('instant', 0.018), ('meziane', 0.018)]
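
The sentScore values in the summary above appear consistent with summing the listed word weights over the whitespace-separated tokens of each sentence, matching case-sensitively and counting repeats. This is an observation from the numbers on this page, not a documented description of the tool that produced them; the function name `sentence_score` is hypothetical and only a handful of weights are copied in.

```python
# A few word weights copied from the wordTfidf list above.
WEIGHTS = {
    "post": 0.459, "microblog": 0.429, "posts": 0.345,
    "linking": 0.302, "entity": 0.211, "extra": 0.13, "expand": 0.035,
}


def sentence_score(sentence, weights=WEIGHTS):
    """Sum the weight of every whitespace token found verbatim in `weights`.

    Tokens with attached punctuation or different casing (e.g. "Entity",
    "posts,") do not match, which is what the listed scores suggest.
    """
    return sum(weights.get(token, 0.0) for token in sentence.split())


s13 = "Entity linking for microblog posts has not been well studied."
s16 = "The content of post (2) is highly related to post (1)."
print(round(sentence_score(s13), 3))  # listed score for sentence 13: 1.076
print(round(sentence_score(s16), 3))  # listed score for sentence 16: 0.918
```

Sentence 16 shows the repeat counting: "post" occurs twice and the listed score is twice its weight.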

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

2 0.32848537 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

3 0.18930431 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

4 0.16692632 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang

Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context document, performance is further improved.

5 0.15260717 151 emnlp-2013-Paraphrasing 4 Microblog Normalization

Author: Wang Ling ; Chris Dyer ; Alan W Black ; Isabel Trancoso

Abstract: Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization, replacing orthographically or lexically idiosyncratic forms with more standard variants, can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.

6 0.13008127 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

7 0.12668136 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

8 0.1231474 160 emnlp-2013-Relational Inference for Wikification

9 0.10766286 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

10 0.10444896 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

11 0.10343507 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

12 0.092673153 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

13 0.08157336 24 emnlp-2013-Application of Localized Similarity for Web Documents

14 0.054724328 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

15 0.053789482 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

16 0.050899498 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution

17 0.050846592 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

18 0.04930836 143 emnlp-2013-Open Domain Targeted Sentiment

19 0.04902257 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

20 0.048885714 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, 0.125), (2, 0.046), (3, -0.076), (4, 0.06), (5, -0.015), (6, 0.077), (7, 0.235), (8, 0.175), (9, -0.078), (10, -0.018), (11, 0.161), (12, 0.037), (13, 0.152), (14, -0.363), (15, -0.065), (16, 0.045), (17, 0.072), (18, -0.335), (19, 0.041), (20, 0.258), (21, -0.066), (22, -0.057), (23, -0.147), (24, -0.03), (25, 0.05), (26, -0.003), (27, 0.053), (28, 0.058), (29, -0.11), (30, 0.081), (31, -0.039), (32, 0.004), (33, -0.005), (34, -0.016), (35, 0.045), (36, -0.03), (37, 0.03), (38, -0.033), (39, 0.004), (40, 0.017), (41, 0.017), (42, -0.031), (43, 0.041), (44, 0.077), (45, 0.01), (46, 0.005), (47, 0.006), (48, 0.074), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97263992 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

2 0.86756873 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

3 0.59719318 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

4 0.55686164 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

5 0.46246588 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang

Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as base learner, to generate initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context document, performance is further improved.

6 0.40517464 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

7 0.34160921 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

8 0.33129245 151 emnlp-2013-Paraphrasing 4 Microblog Normalization

9 0.30767366 160 emnlp-2013-Relational Inference for Wikification

10 0.30410528 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

11 0.30195373 24 emnlp-2013-Application of Localized Similarity for Web Documents

12 0.28589588 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

13 0.26952648 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

14 0.23663679 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

15 0.2082061 23 emnlp-2013-Animacy Detection with Voting Models

16 0.20748305 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

17 0.20410384 131 emnlp-2013-Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs

18 0.1949662 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

19 0.18535127 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

20 0.17849308 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.025), (9, 0.013), (18, 0.027), (22, 0.076), (25, 0.269), (30, 0.106), (36, 0.015), (47, 0.012), (50, 0.013), (51, 0.147), (66, 0.038), (71, 0.033), (75, 0.024), (77, 0.011), (90, 0.011), (96, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76379865 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

2 0.60937971 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

Author: Qiming Diao ; Jing Jiang

Abstract: With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant message. On the one hand, people tweets about their daily lives, and on the other hand, when major events happen, people also follow and tweet about them. Moreover, people’s posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the association between events and topics. Our experiments shows that our model can accurately identify meaningful events and the event-topic affinity vectors are effective for event recommendation and grouping events by topics.

3 0.60918272 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization when confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that people with similar academic, business or social connections (e.g. co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

4 0.59690195 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu

Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with a CET-based structure, users perform significantly better in knowledge learning than with the traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).

5 0.59529841 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

Author: Shashank Srivastava ; Dirk Hovy ; Eduard Hovy

Abstract: In this paper, we propose a walk-based graph kernel that generalizes the notion of tree kernels to continuous spaces. Our proposed approach subsumes a general framework for word similarity, and in particular, provides a flexible way to incorporate distributed representations. Using vector representations, such an approach captures both distributional semantic similarities among words as well as the structural relations between them (encoded as the structure of the parse tree). We show an efficient formulation to compute this kernel using simple matrix operations. We present our results on three diverse NLP tasks, showing state-of-the-art results.

6 0.59038198 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes

7 0.5883202 143 emnlp-2013-Open Domain Targeted Sentiment

8 0.58685672 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

9 0.58536822 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

10 0.58274132 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

11 0.58185458 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

12 0.57927144 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

13 0.57906199 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

14 0.57860816 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

15 0.57856834 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

16 0.57718021 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

17 0.57701606 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

18 0.57682645 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

19 0.57580966 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

20 0.57562131 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
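
Rows in the list above follow the column order given in its header (simIndex, simValue, paperId, paperTitle), with an optional `same-paper` prefix on the entry for this paper itself. A minimal sketch of how such rows could be parsed is shown below; the names `SimRow`, `ROW_RE`, and `parse_row` are hypothetical and not part of the page or of any tool that produced it.

```python
import re
from typing import NamedTuple, Optional


class SimRow(NamedTuple):
    """One row of the similar-papers list (field names are illustrative)."""
    sim_index: int
    sim_value: float
    paper_id: int
    paper_title: str
    same_paper: bool


# Assumed layout: [same-paper] simIndex simValue paperId paperTitle
ROW_RE = re.compile(
    r"^(?P<same>same-paper\s+)?"
    r"(?P<idx>\d+)\s+"
    r"(?P<val>\d+(?:\.\d+)?)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<title>\S.*)$"
)


def parse_row(line: str) -> Optional[SimRow]:
    """Return a SimRow for a list row, or None for any other line."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    return SimRow(
        sim_index=int(m["idx"]),
        sim_value=float(m["val"]),
        paper_id=int(m["pid"]),
        paper_title=m["title"].strip(),
        same_paper=m["same"] is not None,
    )
```

Lines that are not rows, such as the `Author:` and `Abstract:` lines, do not match the pattern and yield `None`, so the function can be mapped over the whole listing and the `None` results filtered out.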