emnlp emnlp2013 emnlp2013-79 knowledge-graph by maker-knowledge-mining

79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery


Source: pdf

Author: Ruiji Fu ; Bing Qin ; Ting Liu

Abstract: Hypernym discovery aims to extract noun pairs in which one noun is a hypernym of the other. Most previous methods are based on lexical patterns but perform poorly on open-domain data. Other work extracts hypernym relations from encyclopedias but has limited coverage. This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from the above sources. Then, we apply a statistical ranking model to select correct hypernyms. A set of novel features is proposed for the ranking model. We also present a heuristic strategy to build a large-scale noisy training dataset for the model without human annotation. Experimental results demonstrate that our approach outperforms state-of-the-art methods on a manually labeled test dataset.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Other work extracts hypernym relations from encyclopedias but has limited coverage. [sent-3, score-0.836]

2 This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. [sent-4, score-0.674]

3 Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i. [sent-5, score-0.65]

4 , search engine results, encyclopedias, and morphology of the entity name. [sent-7, score-0.225]

5 First, we extract candidate hypernyms from the above sources. [sent-8, score-0.647]

6 1 Introduction Hypernym discovery is the task of extracting noun pairs in which one noun is a hypernym of the other (Snow et al., 2005). [sent-13, score-0.821]

7 A noun H is a hypernym of another noun E if E is an instance or subclass of H. [sent-15, score-0.764]

8 For instance, “actor” is a hypernym of “Mel Gibson”; “dog” is a hypernym of “Caucasian sheepdog”; “medicine” is a hypernym of “Aspirin”. [sent-17, score-2.022]

9 Most previous methods for automatic hypernym discovery are based on lexical patterns and suffer from the problem that such patterns cover only a small part of complex linguistic circumstances (Hearst, 1992; Turney et al. [sent-29, score-0.792]

10 Other work tries to extract hypernym relations from large-scale encyclopedias like Wikipedia and achieves high precision (Suchanek et al. [sent-32, score-0.901]

11 However, the coverage is limited since there exist many infrequent and new entities that are missing in encyclopedias (Lin et al. [sent-35, score-0.367]

12 This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. [sent-38, score-0.674]

13 Given an entity name, our goal is to discover its hypernyms by leveraging knowledge from multiple sources. [sent-39, score-0.65]

14 Considering the case where a person wants to know the meaning of an unknown entity, he/she may search for it in a search engine and then find the answer after going through the search results. [sent-40, score-0.242]

15 The evidence from the above sources is integrated in our hypernym discovery model. [sent-46, score-0.81]

16 Our approach is composed of two major steps: hypernym candidate extraction and ranking. [sent-47, score-0.76]

17 In the first step, we collect hypernym candidates from multiple sources. [sent-48, score-0.805]

18 Given an entity name, we search for it in a search engine and extract high-frequency nouns from the search results as its main candidate hypernyms. [sent-49, score-0.985]

19 We also collect the category tags for the entity from two Chinese encyclopedias, as well as the head word of the entity, as additional candidates. [sent-50, score-0.493]

20 In the second step, we identify correct hypernyms from the candidates. [sent-51, score-0.571]

21 Our contributions are as follows: • We are the first to discover hypernyms for Chinese open-domain entities by exploiting multiple sources. [sent-54, score-0.7]

22 • We propose a set of novel and effective features for hypernym ranking. [sent-58, score-0.674]

23 2 Related Work Previous methods for hypernym discovery can be summarized into two major categories, i.e., pattern-based methods and encyclopedia-based methods. [sent-66, score-0.702]

24 Pattern-based methods make use of manually or automatically constructed patterns to mine hypernym relations from text corpora. [sent-69, score-0.768]

25 The pioneering work by Hearst (1992) finds that linking two noun phrases (NPs) via certain lexical constructions often implies hypernym relations. [sent-70, score-0.719]

26 For example, NP1 is a hypernym of NP2 in the lexical pattern “such NP1 as NP2”. [sent-71, score-0.674]
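As a toy illustration of matching such a pattern (English only; the paper itself works on Chinese text), a regex-based sketch could look like the following. The single-word noun-phrase assumption is ours; a real system would use a parser or chunker.

```python
import re

# Hearst-style pattern: "such NP1 as NP2" implies NP1 is a hypernym of NP2.
# For simplicity, a noun phrase is approximated by a single word here.
HEARST = re.compile(r"such (?P<hyper>\w+) as (?P<hypo>\w+)")

def extract_pairs(text):
    """Return (hypernym, hyponym) pairs matched by the pattern."""
    return [(m.group("hyper"), m.group("hypo")) for m in HEARST.finditer(text)]

# extract_pairs("He owns such animals as dogs and cats.")
# -> [("animals", "dogs")]
```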

27 Evans (2004) considers the web data as a large corpus and uses search engines to identify hypernyms based on lexical patterns. [sent-77, score-0.605]

28 Given an arbitrary document, he takes each capitalized word sequence as an entity and aims to find its potential hypernyms through pattern-based web searching. [sent-78, score-0.644]

29 Then, in the retrieved documents, the nouns that immediately precede the pattern are recognized as the hypernyms of X. [sent-81, score-0.561]

30 Snow et al. (2005) are the first to propose automatically extracting large numbers of lexico-syntactic patterns and then detecting hypernym relations from a large newswire corpus. [sent-85, score-0.774]

31 Encyclopedia-based methods extract hypernym relations from encyclopedias like Wikipedia (Suchanek et al. [sent-91, score-0.865]

32 The user-labeled information in encyclopedias, such as category tags in Wikipedia, is often used to derive hypernym relations. [sent-94, score-0.742]

33 Lin et al. (2012) connect absent entities with the entities present in Wikipedia that share common contexts. [sent-103, score-0.371]

34 They utilize the Freebase semantic types to label the present entities and then propagate the types to the absent entities. [sent-104, score-0.233]

35 Compared with previous work, our approach tries to identify hypernyms from multiple sources. [sent-107, score-0.532]

36 First, we collect candidate hypernyms from multiple sources for a given entity. [sent-111, score-0.728]

37 Then, a statistical model is built for hypernym ranking based on a set of effective features. [sent-112, score-0.738]

38 3.1 Candidate Hypernym Collection from Multiple Sources In this work, we collect potential hypernyms from four sources, i. [sent-115, score-0.588]

39 , search engine results, two encyclopedias, and morphology of the entity name. [sent-117, score-0.225]

40 We count the co-occurrence frequency between the target entities and other words in the returned snippets and titles, and select the top N nouns (or noun phrases) as the main candidates. [sent-118, score-0.313]
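A minimal sketch of this counting step follows. It assumes a hypothetical Chinese tokenizer/POS tagger pos_tag that yields (word, tag) pairs with noun tags starting with "n" (a real toolkit such as LTP could stand in); it is an illustration, not the authors' implementation.

```python
from collections import Counter

def extract_main_candidates(entity, snippets_and_titles, pos_tag, top_n=10):
    """Count nouns co-occurring with `entity` in search-result snippets and
    titles, and return the top-N most frequent ones as candidate hypernyms."""
    counts = Counter()
    for text in snippets_and_titles:
        if entity not in text:  # count only texts that mention the entity
            continue
        for word, tag in pos_tag(text):
            if tag.startswith("n") and word != entity:  # keep nouns only
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]
```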

41 As the experiments show, this method can find at least one hypernym for 86. [sent-119, score-0.674]

42 This roughly explains why people can often infer the semantic meaning of unknown entities after going through several search results. [sent-122, score-0.262]

43 In this work, we consider two Chinese encyclopedias, Baidubaike and Hudongbaike, as hypernym sources. [sent-125, score-0.674]

44 In addition, the head words of entities are sometimes also their hypernyms. [sent-126, score-0.742]

45 Thus we add head words to the set of hypernym candidates. [sent-130, score-0.723]

46 We combine all of these hypernym candidates as the input to the second stage. [sent-137, score-0.749]

47 Considering that manually annotating a large-scale hypernym dataset is costly and time-consuming, we present a heuristic strategy to collect training data. [sent-144, score-0.818]

48 We compare three hypernym ranking models on this dataset: Support Vector Machine (SVM) with a linear kernel, SVM with a radial basis function (RBF) kernel, and Logistic Regression (LR). [sent-145, score-0.738]
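As a rough sketch, the three rankers could be instantiated as below with scikit-learn; the paper does not name its toolkit, and featurize (mapping an entity-candidate pair to the Table 1 feature vector), X_train, and y_train are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# The three models compared: linear-kernel SVM, RBF-kernel SVM, and LR.
models = {
    "SVM-linear": SVC(kernel="linear", probability=True),
    "SVM-RBF": SVC(kernel="rbf", probability=True),
    "LR": LogisticRegression(),
}

def rank_candidates(model, entity, candidates, featurize):
    """Score each candidate hypernym of `entity` and return them best-first."""
    feats = [featurize(entity, c) for c in candidates]
    scores = model.predict_proba(feats)[:, 1]  # P(candidate is a hypernym)
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]

# for name, m in models.items():
#     m.fit(X_train, y_train)  # noisy labels from the heuristic strategy
```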

49 Features for Ranking: The features for hypernym ranking are shown in Table 1. [sent-156, score-0.738]

50 Hypernym Prior: Intuitively, different words have different prior probabilities of being hypernyms of other words. [sent-158, score-0.532]

51 Using roughly 4 million pages in Baidubaike, we compute the prior probability prior(w) for a word w being a potential hypernym using Equation 1. [sent-164, score-0.697]

52 prior(w) = count_CT(w) / Σ_{w′} count_CT(w′) (1) In Titles: When we enter a query into a search engine, the engine returns a search result list, which contains document titles and their snippet text. [sent-166, score-0.222]
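A minimal sketch of Equation 1 follows, under the assumption (suggested by the count_CT notation) that CT is the multiset of category tags harvested from the encyclopedia pages; the tags input is assumed.

```python
from collections import Counter

def hypernym_prior(tags):
    """Equation 1 (as reconstructed): prior(w) is the relative frequency
    of w among all collected category tags CT."""
    counts = Counter(tags)        # count_CT(w) for every tag w
    total = sum(counts.values())  # sum over w' of count_CT(w')
    return {w: c / total for w, c in counts.items()}
```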

53 We discover that the average frequency of occurrence of hypernyms in titles is 15. [sent-168, score-0.616]

54 Table 2: Distributions of candidate hypernyms in titles and snippets. The feature value is divided into three cases: greater than 15. [sent-177, score-0.711]

55 Synonyms: If there exist synonyms of a candidate hypernym in the candidate list, the candidate is probably a correct answer. [sent-183, score-0.527]

56 ratio_syn(h, l_e) = count_syn(h, l_e) / len(l_e) (2) Given a hypernym candidate h of an entity e and the list l_e of all its candidates, we compute the ratio of the synonyms of h in l_e. [sent-186, score-0.983]
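A sketch of Equation 2, assuming a hypothetical synonyms(word) lookup backed by a thesaurus such as CilinE; whether the candidate itself counts toward the numerator is not specified in the excerpt.

```python
def ratio_syn(h, candidate_list, synonyms):
    """Equation 2 (as reconstructed): the fraction of candidates in the
    list l_e that are synonyms of the candidate hypernym h."""
    syns = synonyms(h)  # set of synonyms of h, e.g., from CilinE
    matched = sum(1 for c in candidate_list if c in syns)
    return matched / len(candidate_list)  # count_syn(h, l_e) / len(l_e)
```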

57 CilinE contains synonym and hypernym relations among 77 thousand words and is manually organized as a five-level hierarchy. [sent-189, score-0.723]

58 radical(e, h) = count_RM(e, h) / len(h) (3) Here radical(e, h) denotes the ratio of characters in the hypernym h whose radical matches that of the last character of the entity e. [sent-198, score-0.812]
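A sketch of Equation 3, assuming a hypothetical radical(ch) lookup table that maps a Chinese character to its radical.

```python
def radical_match(entity, hypernym, radical):
    """Equation 3 (as reconstructed): the ratio of characters in the
    hypernym whose radical matches that of the entity's last character."""
    target = radical(entity[-1])  # radical of e's last character
    matched = sum(1 for ch in hypernym if radical(ch) == target)
    return matched / len(hypernym)  # count_RM(e, h) / len(h)
```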

59 First, we randomly extract a number of open-domain entities from encyclopedias. [sent-207, score-0.326]

60 Then their hypernym candidates are collected using the method proposed in Section 3.1. [sent-208, score-0.749]

61 We select positive training instances following two principles (a sketch of the resulting selection rule follows the list): • Principle 1: Among the four sources used for candidate collection, the more sources from which a hypernym candidate is extracted, the more likely it is a correct one. [sent-210, score-0.929]

62 • Principle 2: The higher the prior of the candidate being a hypernym, the more likely it is a correct one. [sent-211, score-0.736]
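The sketch below turns the two principles into a selection rule; the thresholds and the way the principles are combined are assumptions, since the excerpt does not give the paper's exact cutoffs.

```python
def select_positives(candidates, source_count, prior,
                     min_sources=3, min_prior=0.01):
    """Noisy positive instances: candidates extracted from many sources
    (Principle 1) or with a high hypernym prior (Principle 2).
    source_count[c] is how many of the four sources produced candidate c;
    prior is the dictionary from Equation 1; thresholds are assumed."""
    return [c for c in candidates
            if source_count[c] >= min_sources
            or prior.get(c, 0.0) >= min_prior]
```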

63 4 Experimental Setup In this work, we use the Baidu search engine, the most popular search engine for Chinese, and get the top 100 search results for each entity. [sent-216, score-0.238]

64 Then we extract candidate hypernyms for the entities and ask two annotators to manually judge each hypernym relation pair as true or false. [sent-225, score-1.503]

65 53 candidate hypernyms for each entity on average, of which about 2. [sent-228, score-0.71]

66 4,330 hypernym relation pairs are judged by both annotators. [sent-230, score-0.674]

67 Figure 1: Effect of candidate hypernym coverage rate while varying N. The Kappa value is 0.

68 Coverage rate is the number of entities for which at least one correct hypernym is found divided by the total number of entities. [sent-246, score-0.91]

69 Precision@1: Our method returns a ranked list of hypernyms for each entity. [sent-247, score-0.532]

70 We evaluate precision of top-1 hypernyms (the most probable ones) in the ranked lists, which is the number of correct top-1 hypernyms divided by the number of all entities. [sent-248, score-1.139]

71 R-precision: It is equivalent to Precision@R where R is the total number of candidates labeled as true hypernyms of an entity. [sent-249, score-0.607]
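A minimal sketch of the three metrics above, where ranked maps each entity to its ranked hypernym list, gold maps each entity to the set of hypernyms labeled true (both assumed inputs), and R-precision is averaged over entities.

```python
def coverage_rate(ranked, gold):
    """Entities with at least one correct hypernym found, over all entities."""
    covered = sum(1 for e in ranked if set(ranked[e]) & gold[e])
    return covered / len(ranked)

def precision_at_1(ranked, gold):
    """Correct top-1 hypernyms over all entities."""
    correct = sum(1 for e in ranked if ranked[e] and ranked[e][0] in gold[e])
    return correct / len(ranked)

def r_precision(ranked, gold):
    """Precision@R per entity, with R = number of true hypernyms."""
    scores = []
    for e, hyps in ranked.items():
        r = len(gold[e])
        if r == 0:
            continue  # skip entities without any true hypernym
        scores.append(sum(1 for h in hyps[:r] if h in gold[e]) / r)
    return sum(scores) / len(scores)
```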

72 We check the candidate hypernyms of all 1,879 entities in the development and test sets and see for how many entities we can collect at least one correct hypernym. [sent-254, score-1.709]

73 For SR, we select the top N most frequent nouns (SRN) in the search results of an entity as its hypernym candidates. [sent-257, score-0.87]

74 When the candidates from all sources are merged, the coverage rate is further improved. [sent-261, score-0.235]

75 We can see that the top 10 frequent nouns in the search results contain at least one correct hypernym for 86. [sent-264, score-0.795]

76 This coincides with the intuition that people usually can infer the semantic classes of unknown entities by searching them in web search engines. [sent-266, score-0.282]

77 The average number of candidate hypernyms from ET is 3. [sent-270, score-0.618]

78 The reason is that for many present entities, the category tags include not only hypernyms but also related words. (∗ Since some entities are rare, there may be fewer than 10 nouns in their search results.) [sent-276, score-0.843]

79 Here, the present entities are the entities that exist in the encyclopedias. [sent-281, score-0.322]

80 Among them, only the term glossed as “arena” is a proper hypernym, whereas the others are related words indicating merely thematic vicinity. [sent-289, score-0.696]

81 That is why we try to automatically extract hypernym relations. [sent-303, score-0.703]

82 We select the top 100 search results for each query and get 1,285,209 results in all for the entities in the test set. [sent-318, score-0.257]

83 Then we use the patterns to extract hypernyms from the search results. [sent-319, score-0.659]

84 The result shows that 508 correct hypernyms are extracted for 568 entities (1,529 entities in total). [sent-320, score-0.893]

85 Hypernyms can be extracted for only a small portion of the entities. [sent-321, score-0.693]

86 This is mainly because only a few hypernym relations on the web are expressed in these fixed patterns; many are expressed in more flexible ways. [sent-322, score-0.745]

87 The hypernyms are ranked based on the amount of evidence from which they are extracted. [sent-323, score-1.139]

88 Then, an LR classifier is built based on these patterns to recognize hypernym relations. [sent-327, score-0.719]

89 Figure 2: Precision-Recall curves on the test set. In our corpus, there are 652,181 candidates for 1,529 entities (426. [sent-330, score-0.236]

90 For fair comparison of R-precision and recall, we add the extra correct hypernyms from MPattern and MSnow to the test data set. [sent-353, score-0.571]

91 This indicates the importance of the source information for hypernym ranking. [sent-368, score-0.674]

92 Uncovered entities are entities for which we do not collect any correct hypernym in the first step. [sent-382, score-0.949]

93 False positives are hypernyms ranked in first place that are not actually correct. [sent-383, score-0.571]

94 The identification of their hypernyms requires more human-crafted knowledge. [sent-389, score-0.532]

95 For example, the word glossed as “organism” is wrongly recognized as the most probable hypernym of “ethanolamine phosphotransferase”. [sent-393, score-0.674]

96 This is because the entity often co-occurs with the word “organism”, and the latter is often used as a hypernym of some other entities. [sent-394, score-0.766]

97 The correct hypernyms are actually the terms glossed as “enzyme”, “chemical substance”, and so on. [sent-395, score-0.571]

98 6 Conclusion This paper proposes a novel method for finding hypernyms of Chinese open-domain entities from multiple sources. [sent-396, score-0.693]

99 We collect candidate hypernyms with wide coverage from search results, encyclopedia category tags, and the head word of the entity. [sent-397, score-0.914]

100 Then, we propose a set of features to build statistical models to rank the candidate hypernyms on the training data collected automatically. [sent-398, score-0.618]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hypernym', 0.674), ('hypernyms', 0.532), ('entities', 0.161), ('encyclopedias', 0.136), ('entity', 0.092), ('chinese', 0.088), ('suchanek', 0.087), ('candidate', 0.086), ('baidubaike', 0.078), ('mlr', 0.078), ('candidates', 0.075), ('coverage', 0.07), ('ranking', 0.064), ('mpattern', 0.062), ('msnow', 0.062), ('titles', 0.058), ('engine', 0.058), ('lr', 0.056), ('collect', 0.056), ('synonyms', 0.056), ('evidences', 0.054), ('sources', 0.054), ('search', 0.053), ('absent', 0.049), ('head', 0.049), ('encyclopedia', 0.049), ('ciline', 0.047), ('mheuristic', 0.047), ('uncovered', 0.046), ('patterns', 0.045), ('noun', 0.045), ('snow', 0.043), ('wikipedia', 0.04), ('correct', 0.039), ('reaches', 0.038), ('hoffart', 0.037), ('zhenghua', 0.037), ('precision', 0.036), ('rate', 0.036), ('heuristic', 0.036), ('snippets', 0.035), ('radical', 0.035), ('rbf', 0.035), ('tags', 0.034), ('category', 0.034), ('medicine', 0.033), ('thesauri', 0.033), ('arena', 0.031), ('authenticate', 0.031), ('castellan', 0.031), ('hudongbaike', 0.031), ('organism', 0.031), ('siegel', 0.031), ('industry', 0.031), ('extract', 0.029), ('strategy', 0.029), ('nouns', 0.029), ('discovery', 0.028), ('hyponymy', 0.027), ('ltp', 0.027), ('msv', 0.027), ('oren', 0.027), ('svm', 0.027), ('ontology', 0.027), ('relations', 0.026), ('che', 0.026), ('hearst', 0.026), ('discover', 0.026), ('len', 0.025), ('radicals', 0.025), ('hints', 0.025), ('fabian', 0.025), ('mcnamee', 0.025), ('unknown', 0.025), ('principle', 0.024), ('domains', 0.024), ('character', 0.023), ('characters', 0.023), ('prior', 0.023), ('degraded', 0.023), ('wanxiang', 0.023), ('yago', 0.023), ('semantic', 0.023), ('turney', 0.023), ('manually', 0.023), ('movie', 0.022), ('morphology', 0.022), ('etzioni', 0.022), ('kind', 0.022), ('select', 0.022), ('ritter', 0.022), ('ciaramita', 0.022), ('merely', 0.022), ('get', 0.021), ('tag', 0.021), ('count', 0.021), ('false', 0.021), ('evans', 0.021), ('rion', 0.021), ('web', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

Author: Ruiji Fu ; Bing Qin ; Ting Liu

Abstract: Hypernym discovery aims to extract noun pairs in which one noun is a hypernym of the other. Most previous methods are based on lexical patterns but perform poorly on open-domain data. Other work extracts hypernym relations from encyclopedias but has limited coverage. This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from the above sources. Then, we apply a statistical ranking model to select correct hypernyms. A set of novel features is proposed for the ranking model. We also present a heuristic strategy to build a large-scale noisy training dataset for the model without human annotation. Experimental results demonstrate that our approach outperforms state-of-the-art methods on a manually labeled test dataset.

2 0.11449888 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

Author: Hrushikesh Mohapatra ; Siddhanth Jain ; Soumen Chakrabarti

Abstract: Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth the number of entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.

3 0.10061602 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang

Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in a knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as the base learner, to generate an initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space, and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context documents, performance is further improved.

4 0.10041779 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu

Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most of UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with CET-based structure, users perform significantly better in knowledge learning than using traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).

5 0.074379206 160 emnlp-2013-Relational Inference for Wikification

Author: Xiao Cheng ; Dan Roth

Abstract: Wikification, commonly referred to as Disambiguation to Wikipedia (D2W), is the task of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. Previous approaches to D2W focused on the use of local and global statistics over the given text, Wikipedia articles and their link structures, to evaluate context compatibility among a list of probable candidates. However, these methods fail (often embarrassingly) when some level of text understanding is needed to support Wikification. In this paper we introduce a novel approach to Wikification by incorporating, along with statistical methods, richer relational analysis of the text. We provide an extensible, efficient and modular Integer Linear Programming (ILP) formulation of Wikification that incorporates the entity-relation inference problem, and show that the ability to identify relations in text helps both candidate generation and the ranking of Wikipedia titles considerably. Our results show significant improvements in both Wikification and the TAC Entity Linking task.

6 0.065502375 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

7 0.061910674 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

8 0.05864523 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

9 0.057939522 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

10 0.057262387 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

11 0.056888174 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

12 0.054724328 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

13 0.05175003 137 emnlp-2013-Multi-Relational Latent Semantic Analysis

14 0.050712828 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

15 0.04983554 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

16 0.04743598 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

17 0.0437176 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

18 0.04326304 123 emnlp-2013-Learning to Rank Lexical Substitutions

19 0.042676408 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification

20 0.040750459 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.156), (1, 0.064), (2, 0.02), (3, -0.029), (4, -0.016), (5, 0.064), (6, 0.021), (7, 0.127), (8, 0.066), (9, 0.024), (10, 0.081), (11, 0.036), (12, -0.021), (13, -0.021), (14, -0.116), (15, -0.026), (16, -0.048), (17, 0.025), (18, 0.031), (19, 0.007), (20, -0.07), (21, 0.017), (22, -0.058), (23, 0.045), (24, 0.054), (25, -0.007), (26, 0.062), (27, 0.058), (28, -0.127), (29, -0.063), (30, 0.027), (31, 0.055), (32, 0.094), (33, -0.04), (34, 0.16), (35, -0.007), (36, -0.004), (37, -0.019), (38, -0.08), (39, -0.057), (40, -0.012), (41, 0.036), (42, -0.09), (43, -0.046), (44, -0.077), (45, 0.013), (46, 0.014), (47, 0.055), (48, -0.015), (49, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93932623 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

Author: Ruiji Fu ; Bing Qin ; Ting Liu

Abstract: Hypernym discovery aims to extract noun pairs in which one noun is a hypernym of the other. Most previous methods are based on lexical patterns but perform poorly on open-domain data. Other work extracts hypernym relations from encyclopedias but has limited coverage. This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from the above sources. Then, we apply a statistical ranking model to select correct hypernyms. A set of novel features is proposed for the ranking model. We also present a heuristic strategy to build a large-scale noisy training dataset for the model without human annotation. Experimental results demonstrate that our approach outperforms state-of-the-art methods on a manually labeled test dataset.

2 0.75387418 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

Author: Zhengyan He ; Shujie Liu ; Yang Song ; Mu Li ; Ming Zhou ; Houfeng Wang

Abstract: Entity disambiguation works by linking ambiguous mentions in text to their corresponding real-world entities in a knowledge base. Recent collective disambiguation methods enforce coherence among contextual decisions at the cost of non-trivial inference processes. We propose a fast collective disambiguation approach based on stacking. First, we train a local predictor g0 with learning to rank as the base learner, to generate an initial ranking list of candidates. Second, top k candidates of related instances are searched for constructing expressive global coherence features. A global predictor g1 is trained in the augmented feature space, and stacking is employed to tackle the train/test mismatch problem. The proposed method is fast and easy to implement. Experiments show its effectiveness over various algorithms on several public datasets. By learning a rich semantic relatedness measure between entity categories and context documents, performance is further improved.

3 0.73436338 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

Author: Hrushikesh Mohapatra ; Siddhanth Jain ; Soumen Chakrabarti

Abstract: Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth the number of entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.

4 0.68277746 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu

Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most of UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with CET-based structure, users perform significantly better in knowledge learning than using traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).

5 0.50176179 160 emnlp-2013-Relational Inference for Wikification

Author: Xiao Cheng ; Dan Roth

Abstract: Wikification, commonly referred to as Disambiguation to Wikipedia (D2W), is the task of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. Previous approaches to D2W focused on the use of local and global statistics over the given text, Wikipedia articles and their link structures, to evaluate context compatibility among a list of probable candidates. However, these methods fail (often embarrassingly) when some level of text understanding is needed to support Wikification. In this paper we introduce a novel approach to Wikification by incorporating, along with statistical methods, richer relational analysis of the text. We provide an extensible, efficient and modular Integer Linear Programming (ILP) formulation of Wikification that incorporates the entity-relation inference problem, and show that the ability to identify relations in text helps both candidate generation and the ranking of Wikipedia titles considerably. Our results show significant improvements in both Wikification and the TAC Entity Linking task.

6 0.48238394 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

7 0.4737924 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

8 0.46861464 131 emnlp-2013-Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs

9 0.45391256 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!

10 0.44487223 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

11 0.43040487 123 emnlp-2013-Learning to Rank Lexical Substitutions

12 0.40705162 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

13 0.3926374 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

14 0.38743347 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

15 0.38649866 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

16 0.38425109 197 emnlp-2013-Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems

17 0.36907881 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

18 0.36236036 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

19 0.35693681 43 emnlp-2013-Cascading Collective Classification for Bridging Anaphora Recognition using a Rich Linguistic Feature Set

20 0.35176012 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.04), (4, 0.249), (18, 0.054), (22, 0.058), (30, 0.088), (45, 0.013), (50, 0.019), (51, 0.176), (66, 0.037), (71, 0.023), (75, 0.05), (77, 0.014), (96, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78477609 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

Author: Ruiji Fu ; Bing Qin ; Ting Liu

Abstract: Hypernym discovery aims to extract noun pairs in which one noun is a hypernym of the other. Most previous methods are based on lexical patterns but perform poorly on open-domain data. Other work extracts hypernym relations from encyclopedias but has limited coverage. This paper proposes a simple yet effective distant supervision framework for Chinese open-domain hypernym discovery. Given an entity name, we try to discover its hypernyms by leveraging knowledge from multiple sources, i.e., search engine results, encyclopedias, and the morphology of the entity name. First, we extract candidate hypernyms from the above sources. Then, we apply a statistical ranking model to select correct hypernyms. A set of novel features is proposed for the ranking model. We also present a heuristic strategy to build a large-scale noisy training dataset for the model without human annotation. Experimental results demonstrate that our approach outperforms state-of-the-art methods on a manually labeled test dataset.

2 0.66625971 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization, given the large amount of available information. Therefore, it is always a challenge for people to find desired information in them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that people with similar academic, business or social connections (e.g., co-major, co-university, and co-corporation) tend to have similar experiences and summaries. To achieve this, we propose a collective factor graph (CoFG) model that incorporates all these sources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

3 0.66044819 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features for the tasks. We leverage large-scale unlabeled data to improve the internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.

4 0.65751362 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

Author: Jason Weston ; Antoine Bordes ; Oksana Yakhnenko ; Nicolas Usunier

Abstract: This paper proposes a novel approach for relation extraction from free text which is trained to jointly use information from the text and from existing knowledge. Our model is based on scoring functions that operate by learning low-dimensional embeddings of words, entities and relationships from a knowledge base. We empirically show on New York Times articles aligned with Freebase relations that our approach is able to efficiently use the extra information provided by a large subset of Freebase data (4M entities, 23k relationships) to improve over methods that rely on text features alone.

5 0.65595913 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

Author: Yangfeng Ji ; Jacob Eisenstein

Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.

6 0.65183586 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

7 0.65095747 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

8 0.650186 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

9 0.64895427 143 emnlp-2013-Open Domain Targeted Sentiment

10 0.64888632 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

11 0.64877015 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

12 0.64832896 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

13 0.64741051 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

14 0.6473437 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

15 0.64697307 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

16 0.64653099 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

17 0.64619172 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

18 0.64593643 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

19 0.64593363 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

20 0.64592874 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching