acl acl2011 acl2011-231 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fan Zhang ; Shuming Shi ; Jing Liu ; Shuqi Sun ; Chin-Yew Lin
Abstract: This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. A nonlinear probabilistic model is exploited to model the correlation between sentences in the aggregation of pattern matching results. Based on the model, we design a set of evidence combination and propagation algorithms. These significantly improve the result quality of existing approaches. Experimental results conducted on 500 million web pages and hypernym labels for 300 terms show over 20% performance improvement in terms of P@5, MAP and R-Precision.

1 Introduction

An important task in text mining is the automatic extraction of entities and their lexical relations; this has wide applications in natural language processing and web search. This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. From the viewpoint of entity classification, the problem is to automatically assign fine-grained class labels to terms. There have been a number of approaches (Hearst, 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Durme & Pasca, 2008; Talukdar et al., 2008) to address the problem. These methods typically exploited manually-designed or automatically-learned patterns (e.g., "NP such as NP", "NP like NP", "NP is a NP"). Although some degree of success has been achieved with these efforts, the results are still far from perfect, in terms of both recall and precision. As will be demonstrated in this paper, even by processing a large corpus of 500 million web pages with the most popular patterns, we are not able to extract correct labels for many (especially rare) entities. Even for popular terms, incorrect results often appear in their label lists.

The basic philosophy in existing hyponymy extraction approaches (and also many other text-mining methods) is counting: count the number of supporting sentences. Here a supporting sentence of a term-label pair is a sentence from which the pair can be extracted via an extraction pattern. We demonstrate that the specific way of counting has a great impact on result quality, and that the state-of-the-art counting methods are not optimal. Specifically, we examine the problem from the viewpoint of probabilistic evidence combination and find that the probabilistic assumption behind simple counting is the statistical independence between the observations of supporting sentences. By assuming a positive correlation between supporting-sentence observations and adopting properly designed nonlinear combination functions, the precision of the results can be improved.

It is hard to extract correct labels for rare terms from a web corpus due to the data sparseness problem. To address this issue, we propose an evidence propagation algorithm motivated by the observation that similar terms tend to share common hypernyms. For example, if we already know that 1) Helsinki and Tampere are cities, and 2) Porvoo is similar to Helsinki and Tampere, then Porvoo is very likely also a city. This intuition, however, does not mean that the labels of a term can always be transferred to its similar terms. For example, Mount Vesuvius and Kilimanjaro are volcanoes and Lhotse is similar to them, but Lhotse is not a volcano.

* This work was performed when Fan Zhang and Shuqi Sun were interns at Microsoft Research Asia.
Therefore we should be very conservative and careful in hypernym propagation. In our propagation algorithm, we first construct some pseudo supporting sentences for a term from the supporting sentences of its similar terms. Then we calculate label scores for terms by performing nonlinear evidence combination based on the (pseudo and real) supporting sentences. Such a nonlinear propagation algorithm is demonstrated to perform better than linear propagation. Experimental results on a publicly available collection of 500 million web pages, with hypernym labels annotated for 300 terms, show that our nonlinear evidence fusion and propagation significantly improve the precision and coverage of the extracted hyponymy data. This is one of the technologies adopted in our semantic search and mining system NeedleSeek².

² http://research.microsoft.com/en-us/projects/needleseek/ or http://needleseek.msra.cn/

In the next section, we discuss major related efforts and how they differ from our work. Section 3 is a brief description of the baseline approach. The probabilistic evidence combination model that we exploit is introduced in Section 4. Our main approach is illustrated in Section 5. Section 6 shows our experimental settings and results. Finally, Section 7 concludes this paper.

2 Related Work

Existing efforts for hyponymy relation extraction have been conducted upon various types of data sources, including plain-text corpora (Hearst, 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Banko et al., 2007; Durme & Pasca, 2008; Talukdar et al., 2008), semi-structured web pages (Cafarella et al., 2008; Shinzato & Torisawa, 2004), web search results (Geraci et al., 2006; Kozareva et al., 2008; Wang & Cohen, 2009), and query logs (Pasca, 2010). Our target for optimization in this paper is the approaches that use lexico-syntactic patterns to extract hyponymy relations from plain-text corpora. Our future work will study the application of the proposed algorithms to other types of approaches.

The probabilistic evidence combination model that we exploit here was first proposed in (Shi et al., 2009), for combining page in-link evidence in building a nonlinear static-rank computation algorithm. We applied it to the hyponymy extraction problem because the model takes the dependency between supporting sentences into consideration and the resultant evidence fusion formulas are quite simple. In (Snow et al., 2006), a probabilistic model was adopted to combine evidence from heterogeneous relationships to jointly optimize the relationships. The independence of evidence was assumed in their model. In comparison, we show that better results will be obtained if the evidence correlation is modeled appropriately.

Our evidence propagation is basically about using term similarity information to help instance labeling. There have been several approaches which improve hyponymy extraction with instance clusters built by distributional similarity. In (Pantel & Ravichandran, 2004), labels were assigned to the committee (i.e., representative members) of a semantic class and used as the hypernyms of the whole class. Labels generated by their approach tend to be rather coarse-grained, excluding the possibility of a term having its private labels (considering the case that one meaning of a term is not covered by the input semantic classes). In contrast to their method, our label scoring and ranking approach is applied to every single term rather than a semantic class.
In addition, we also compute label scores in a nonlinear way, which improves result quality. In Snow et al. (2005), a supervised approach was proposed to improve hypernym classification using coordinate terms. In comparison, our approach is unsupervised. Durme & Pasca (2008) cleaned the set of instance-label pairs with a TF*IDF-like method, by exploiting clusters of semantically related phrases. The core idea is to keep a term-label pair (T, L) only if the number of terms having the label L in the term T's cluster is above a threshold and if L is not the label of too many clusters (otherwise the pair will be discarded). In contrast, we are able to add new (high-quality) labels for a term with our evidence propagation method; on the other hand, low-quality labels get smaller score gains via propagation and are ranked lower. Label propagation is performed in (Talukdar et al., 2008; Talukdar & Pereira, 2010) based on multiple instance-label graphs. Term similarity information was not used in their approach.

Most existing work tends to utilize small-scale or private corpora, whereas the corpus that we use is publicly available and much larger than those used in most existing work. We publish our term sets (refer to Section 6.1) and their corresponding user judgments so that researchers working on similar topics can reproduce our results.

Table 1. Patterns adopted in this paper (NP: noun phrase representing an entity; NPL: label). The patterns include Hearst patterns such as "NPL {,} (such as|e.g.|including) {NP,}* (and|or) NP" and the IsA patterns:
IsA-I: NP (is|are|was|were|being) (a|an) NPL
IsA-II: NP (is|are|was|were|being) {the, those} NPL
IsA-III: NP (is|are|was|were|being) {another, any} NPL

3 Preliminaries

The problem addressed in this paper is corpus-based is-a relation mining: extracting hypernyms (as labels) for entities from a large-scale, open-domain document corpus. The desired output is a mapping from terms to their corresponding hypernyms, which can naturally be represented as a weighted bipartite graph (term-label graph). Typically we are only interested in the top labels of a term in the graph.

Following existing efforts, we adopt pattern matching as the basic way of extracting hypernymy/hyponymy relations. Two types of patterns (refer to Table 1) are employed, including the popular "Hearst patterns" (Hearst, 1992) and the IsA patterns, which are exploited less frequently in existing hyponym mining efforts. One or more term-label pairs can be extracted if a pattern matches a sentence. In the baseline approach, the weight of an edge T→L (from term T to hypernym label L) in the term-label graph is computed as

w(T→L) = m · IDF(L) = m · log(N / DF(L))    (3.1)

where m is the number of times the pair (T, L) is extracted from the corpus, DF(L) is the number of in-links of L in the graph, N is the total number of terms in the graph, and IDF means the "inverse document frequency". A term only keeps its top-k neighbors (according to the edge weight) in the graph as its final labels.

The pattern matching algorithm implemented in this paper uses part-of-speech (POS) tagging information, without adopting a parser or a chunker. The noun phrase boundaries (for terms and labels) are determined by a manually designed POS tag list.

4 Probabilistic Label-Scoring Model

Here we model the hyponymy extraction problem from the probability theory point of view, aiming at estimating the score of a term-label pair (i.e., the score of a label w.r.t. a term) with probabilistic evidence combination.
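To make the baseline concrete, the following is a minimal sketch of the counting-based scoring of Formula 3.1, assuming that pattern matching has already produced a list of extracted (term, label) pairs; the function and variable names are illustrative and not taken from the paper's implementation.

```python
import math
from collections import Counter, defaultdict

def baseline_label_scores(extractions, top_k=5):
    """Score term-label edges with Formula 3.1: w(T->L) = m * IDF(L),
    where m is the number of times (T, L) was extracted, IDF(L) = log(N / DF(L)),
    DF(L) is the number of distinct terms linked to label L (its in-links),
    and N is the total number of terms in the term-label graph."""
    pair_counts = Counter(extractions)                 # m for each (term, label)
    terms = {term for term, _ in pair_counts}
    df = Counter(label for _, label in pair_counts)    # in-link count of each label
    n_terms = len(terms)

    scores = defaultdict(dict)
    for (term, label), m in pair_counts.items():
        scores[term][label] = m * math.log(n_terms / df[label])

    # Each term keeps only its top-k labels by edge weight.
    return {t: sorted(ls.items(), key=lambda kv: -kv[1])[:top_k]
            for t, ls in scores.items()}

# Toy example: pairs extracted from supporting sentences by the patterns of Table 1.
pairs = ([("helsinki", "city")] * 3 + [("tampere", "city")] * 2
         + [("helsinki", "capital"), ("porvoo", "town")])
print(baseline_label_scores(pairs))
```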
The model was studied in (Shi et al., 2009) to combine page in-link evidence in building a nonlinear static-rank computation algorithm. We represent the score of a term-label pair by the probability of the label being a correct hypernym of the term, and define the following events:

AT,L: Label L is a hypernym of term T (the abbreviated form A is used in this paper unless it is ambiguous).
Ei: The observation that (T, L) is extracted from a sentence Si via pattern matching (i.e., Si is a supporting sentence of the pair).

Assuming that we already know m supporting sentences (S1~Sm), our problem is to compute P(A|E1,E2,…,Em), the posterior probability that L is a hypernym of term T given evidence E1~Em. Formally, we need to find a function f satisfying

P(A|E1,…,Em) = f(P(A), P(A|E1), …, P(A|Em))    (4.1)

For simplicity, we first consider the case of m=2; the case of m>2 is quite similar. We start from the simple case of independent supporting sentences, that is,

P(E1,E2) = P(E1) · P(E2)    (4.2)
P(E1,E2|A) = P(E1|A) · P(E2|A)    (4.3)

By applying Bayes' rule, we get

P(A|E1,E2) = P(E1,E2|A) · P(A) / P(E1,E2) = P(E1|A) · P(E2|A) · P(A) / (P(E1) · P(E2)) = P(A|E1) · P(A|E2) / P(A)    (4.4)

Then define

G(A|E) = log P(A|E) − log P(A)

Here G(A|E) represents the log-probability gain of A given E, with the meaning of the gain in the log-probability value of A after the evidence E is observed (or known). It is a measure of the impact of evidence E on the probability of event A. With the definition of G(A|E), Formula 4.4 can be transformed to

G(A|E1,E2) = G(A|E1) + G(A|E2)    (4.5)

Therefore, if E1 and E2 are independent, the log-probability gain of A given both pieces of evidence will exactly be the sum of the gains of A given each single piece of evidence. It is easy to prove (by following a similar procedure) that the above formula holds for the case of m>2, as long as the pieces of evidence are mutually independent. Therefore, for a term-label pair with m mutually independent supporting sentences, if we set every gain G(A|Ei) to a constant value g, the posterior gain score of the pair will be m·g. If the value g is the IDF of label L, the posterior gain will be

G(AT,L|E1,…,Em) = Σi IDF(L) = m · IDF(L)    (4.6)

This is exactly Formula 3.1. In this way, we provide a probabilistic explanation of scoring the candidate labels for a term via simple counting.

Table 2. Evidence dependency estimation for intra-pattern and inter-pattern supporting sentences

In the above analysis, we assume the statistical independence of the supporting sentence observations, which may not hold in reality. Intuitively, if we already know one supporting sentence S1 for a term-label pair (T, L), then we have a greater chance of finding another supporting sentence than if we do not know S1. The reason is that, before we find S1, we have to estimate the probability with the chance of discovering a supporting sentence for a random term-label pair. This probability is quite low, because most term-label pairs do not have hyponymy relations. Once we have observed S1, however, the chance of (T, L) having a hyponymy relation increases, and therefore the chance of observing another supporting sentence becomes larger than before.

Table 2 shows a rough estimation of P(E2|E1,A)/P(E2|A) (denoted as RA), P(E2|E1)/P(E2) (denoted as R), and their ratios. The statistics are obtained by performing maximum likelihood estimation (MLE) upon our corpus and a random selection of term-label pairs from our term sets (see Section 6.1) together with their top labels³.

³ RA is estimated from the labels judged as "Good", whereas the estimation of R is from all judged labels.
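As a sanity check on the derivation above, here is a toy computation (a sketch, not code from the paper) showing that under the independence assumptions of Formulas 4.2 and 4.3, Bayes' rule turns evidence combination into simple addition of log-probability gains, which is what reduces the model to counting.

```python
import math

def gain(p_a_given_e, p_a):
    """Log-probability gain G(A|E) = log P(A|E) - log P(A)."""
    return math.log(p_a_given_e) - math.log(p_a)

# Toy prior and per-sentence posteriors for a term-label pair (illustrative numbers).
p_a = 0.01      # P(A): a random term-label pair is a true hyponymy pair
p_a_e1 = 0.05   # P(A|E1)
p_a_e2 = 0.08   # P(A|E2)

# Formula 4.4 (independent evidence): P(A|E1,E2) = P(A|E1) * P(A|E2) / P(A)
p_a_e1e2 = p_a_e1 * p_a_e2 / p_a

# Formula 4.5: the combined gain equals the sum of the individual gains.
assert abs(gain(p_a_e1e2, p_a) - (gain(p_a_e1, p_a) + gain(p_a_e2, p_a))) < 1e-12

# With m sentences of identical gain g = IDF(L), the posterior gain is m * IDF(L),
# i.e., the counting-based score of Formulas 3.1 and 4.6.
m, idf_l = 5, 2.3
print(m * idf_l)
```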
The data verifies our analysis about the correlation between E1 and E2 (note that R=1 means independent). In addition, it can be seen that the conditional independence assumption of Formula 4.3 does not hold (because RA>1). It is hence necessary to consider the correlation between supporting sentences in the model. The estimation of Table 2 also indicates that

P(E2|E1,A) / P(E2|A) < P(E2|E1) / P(E2)    (4.7)

By following a similar procedure as above, with Formulas 4.2 and 4.3 replaced by 4.7, we have

G(A|E1,E2) < G(A|E1) + G(A|E2)    (4.8)

This formula indicates that when the supporting sentences are positively correlated, the posterior score of label L w.r.t. term T (given both sentences) is smaller than the sum of the gains caused by each sentence alone. In the extreme case that sentence S2 fully depends on E1 (i.e., P(E2|E1)=1), it is easy to prove that G(A|E1,E2) = G(A|E1). This is reasonable, since event E2 brings in no more information than E1.

Formula 4.8 cannot be used directly for computing the posterior gain. What we really need is a function h satisfying

G(A|E1,…,Em) = h(G(A|E1), …, G(A|Em))    (4.9)

and

h(x1,…,xm) ≤ Σi xi    (4.10)

Shi et al. (2009) discussed other constraints on h and suggested two nonlinear functions: a logarithm-based function (Formula 4.11) and the p-norm function

h(x1,…,xm) = (Σj xj^p)^(1/p), with p > 1    (4.12)

In the next section, we use the above two h functions as basic building blocks to compute label scores for terms.

5 Our Approach

Multiple types of patterns (Table 1) can be adopted to extract term-label pairs. For two supporting sentences, the correlation between them may depend on whether they correspond to the same pattern. In Section 5.1, our nonlinear evidence fusion formulas are constructed by making specific assumptions about the correlation between intra-pattern supporting sentences and inter-pattern ones. Then in Section 5.2, we introduce our evidence propagation technique, in which the evidence of a (T, L) pair is propagated to the terms similar to T.

5.1 Nonlinear evidence fusion

For a term-label pair (T, L), assume K patterns are used for hyponymy extraction and the supporting sentences discovered with pattern i are

Si,1, Si,2, …, Si,mi    (5.1)

where mi is the number of supporting sentences corresponding to pattern i. Also assume the gain score of Si,j is xi,j, i.e., xi,j = G(A|Si,j). Generally speaking, supporting sentences corresponding to the same pattern typically have a higher correlation than sentences corresponding to different patterns. This can be verified by the data in Table 2. By ignoring the inter-pattern correlations, we make the following simplified assumption:

Assumption: Supporting sentences corresponding to the same pattern are correlated, while those of different patterns are independent.

According to this assumption, our label-scoring function is

Score(T, L) = Σi=1..K h(xi,1, …, xi,mi)    (5.2)

In the simple case that every gain xi,j equals IDF(L), if the h function of Formula 4.12 is adopted, then

Score(T, L) = (Σi=1..K mi^(1/p)) · IDF(L)    (5.3)

We use an example to illustrate the above formula. Example: For term T and label L1, assume the numbers of supporting sentences corresponding to the six pattern types in Table 1 are (4, 4, 4, 4, 4, 4), which means the number of supporting sentences discovered by each pattern type is 4. Also assume the supporting-sentence-count vector of label L2 is (25, 0, 0, 0, 0, 0). If we use Formula 5.3 (with p=2) to compute the scores of L1 and L2, we have the following (ignoring IDF for simplicity): Score(L1) = 6·√4 = 12; Score(L2) = √25 = 5. On the other hand, if we simply count the total number of supporting sentences, the score of L2 will be larger (25 vs. 24).
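A minimal sketch of the per-pattern fusion of Formula 5.2 with the p-norm of Formula 4.12; it reproduces the worked example above (12 vs. 5 for p=2 when IDF is ignored). The constant per-sentence gain and the function names are illustrative assumptions.

```python
def pnorm_fusion(gains_by_pattern, p=2.0):
    """Formula 5.2 with h set to the p-norm of Formula 4.12: gains of supporting
    sentences from the same pattern are combined nonlinearly, and the per-pattern
    results are summed (inter-pattern evidence is treated as independent)."""
    return sum(sum(x ** p for x in gains) ** (1.0 / p)
               for gains in gains_by_pattern if gains)

# Worked example: constant gain g = 1 per supporting sentence (IDF ignored).
label_1 = [[1.0] * 4 for _ in range(6)]           # 4 sentences from each of 6 patterns
label_2 = [[1.0] * 25] + [[] for _ in range(5)]   # 25 sentences, all from one pattern

print(pnorm_fusion(label_1))   # 6 * sqrt(4) = 12.0
print(pnorm_fusion(label_2))   # sqrt(25)    =  5.0
```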
The rationale implied in the formula is: for a given term T, the labels supported by multiple types of patterns tend to be more reliable than those supported by a single pattern type, if they have the same number of supporting sentences.

5.2 Evidence propagation

According to the evidence fusion algorithm described above, in order to extract term labels reliably, it is desirable to have many supporting sentences of different types. This is a big challenge for rare terms, due to their low frequency in sentences (and even lower frequency in supporting sentences, because not all occurrences can be covered by patterns). With evidence propagation, we aim at discovering more supporting sentences for terms (especially rare terms). Evidence propagation is motivated by the following two observations:

(I) Similar entities or coordinate terms tend to share some common hypernyms.
(II) Large term similarity graphs can be built efficiently with state-of-the-art techniques (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). With the graphs, we can obtain the similarity between two terms without their hypernyms being available.

The first observation motivates us to "borrow" the supporting sentences from other terms as auxiliary evidence for a term. The second observation means that new information is brought in by the state-of-the-art term similarity graphs (in addition to the term-label information discovered with the patterns of Table 1).

Our evidence propagation algorithm contains two phases. In phase I, some pseudo supporting sentences are constructed for a term from the supporting sentences of its neighbors in the similarity graph. In phase II, we calculate the label scores for terms based on their (pseudo and real) supporting sentences.

Phase I: For every supporting sentence S and every similar term T1 of the term T, add a pseudo supporting sentence S1 for T1, with the gain score

G(AT1,L|S1) = λ · sim(T, T1) · G(AT,L|S)    (5.5)

where λ is the propagation factor and sim(T, T1) is the term similarity function taking values in [0, 1]. The formula reasonably assumes that the gain score of the pseudo supporting sentence depends on the gain score of the original real supporting sentence, the similarity between the two terms, and the propagation factor.

Phase II: The nonlinear evidence combination formulas in the previous subsection are adopted to combine the evidence of the pseudo supporting sentences.

Term similarity graphs can be obtained by distributional similarity or patterns (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). We call the first type of graph DS and the second type PB. DS approaches are based on the distributional hypothesis (Harris, 1985), which says that terms appearing in analogous contexts tend to be similar. In a DS approach, a term is represented by a feature vector, with each feature corresponding to a context in which the term appears. The similarity between two terms is computed as the similarity between their corresponding feature vectors. In PB approaches, a list of carefully designed (or automatically learned) patterns is exploited and applied to a text collection, with the hypothesis that the terms extracted by applying each of the patterns to a specific piece of text tend to be similar. Two categories of patterns have been studied in the literature (Hearst, 1992; Pasca, 2004; Kozareva et al., 2008; Zhang et al., 2009): sentence lexical patterns and HTML tag patterns. An example of a sentence lexical pattern is "T {, T}* {,} (and|or) T".
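Below is a sketch of the two propagation phases under stated assumptions: per-pattern gain scores of the real supporting sentences are already available for one candidate label, the similarity graph maps a term to (neighbor, similarity) pairs, and pseudo-sentence gains follow Formula 5.5. The propagation factor value and all names are illustrative, not the paper's.

```python
def propagate_gains(real_gains, similarity_graph, prop_factor=0.5):
    """Phase I: for every real supporting sentence of term T (with gain x) and
    every similar term T1, create a pseudo supporting sentence for T1 with gain
    prop_factor * sim(T, T1) * x, following Formula 5.5."""
    pseudo = {term: {} for term in real_gains}
    for term, gains_by_pattern in real_gains.items():
        for neighbor, sim in similarity_graph.get(term, []):
            target = pseudo.setdefault(neighbor, {})
            for pattern, gains in gains_by_pattern.items():
                target.setdefault(pattern, []).extend(prop_factor * sim * x for x in gains)
    return pseudo

def score_with_propagation(real_gains, pseudo_gains, p=2.0):
    """Phase II: combine real and pseudo supporting sentences with the same
    nonlinear per-pattern fusion used in Section 5.1 (p-norm over intra-pattern
    gains, summed across patterns)."""
    def pnorm(gains_by_pattern):
        return sum(sum(x ** p for x in g) ** (1.0 / p) for g in gains_by_pattern if g)

    scores = {}
    for term in set(real_gains) | set(pseudo_gains):
        merged = {}
        for source in (real_gains.get(term, {}), pseudo_gains.get(term, {})):
            for pattern, gains in source.items():
                merged.setdefault(pattern, []).extend(gains)
        scores[term] = pnorm(merged.values())
    return scores

# Gains below are for a single candidate label (e.g., "city"); in practice this is
# done per (term, label) pair. Porvoo has no real evidence but borrows some from
# Helsinki and Tampere through the similarity graph.
real = {"helsinki": {"Hearst-I": [2.3, 2.3]}, "tampere": {"IsA-I": [2.3]}, "porvoo": {}}
graph = {"helsinki": [("porvoo", 0.8)], "tampere": [("porvoo", 0.7)]}
print(score_with_propagation(real, propagate_gains(real, graph)))
```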
HTML tag patterns include HTML tables, drop-down lists, and other tag repeat patterns. In this paper, we generate the DS and PB graphs by adopting the best-performing methods studied in (Shi et al., 2010). We will compare, by experiments, the propagation performance of utilizing the two categories of graphs, and also investigate the performance of utilizing both graphs for evidence propagation.

6 Experiments

6.1 Experimental setup

Corpus: We adopt a publicly available dataset in our experiments: ClueWeb09⁴. This is a very large dataset collected by Carnegie Mellon University in early 2009 and has been used by several tracks of the Text Retrieval Conference (TREC)⁵. The whole dataset consists of 1.04 billion web pages in ten languages, while only those in English, about 500 million pages, are used in our experiments. The reason for selecting such a dataset is twofold: First, it is a corpus large enough for conducting web-scale experiments and getting meaningful results. Second, since it is publicly available, it is possible for other researchers to reproduce the experiments in this paper.

Term sets: Approaches are evaluated by using two sets of selected terms: Wiki200 and Ext100. For every term in the term sets, each approach generates a list of hypernym labels, which are manually judged by human annotators. Wiki200 is constructed by first randomly selecting 400 Wikipedia⁶ titles as our candidate terms, with the probability of a title T being selected determined as a function of F(T), the frequency of T in our data corpus. The reason for adopting such a probability formula is to balance popular terms and rare ones in our term set. Then 200 terms are manually selected from the 400 candidate terms, with the principle of maximizing the diversity of terms in terms of length (i.e., number of words) and type (person, location, organization, software, movie, song, animal, plant, etc.). Wiki200 is further divided into two subsets: Wiki100H and Wiki100L, containing respectively the 100 high-frequency and the 100 low-frequency terms. Ext100 is built by first selecting 200 non-Wikipedia-title terms at random from the term-label graph generated by the baseline approach (Formula 3.1), and then manually selecting 100 terms. Some sample terms in the term sets are listed in Table 3.

⁴ http://boston.lti.cs.cmu.edu/Data/clueweb09/
⁵ http://trec.nist.gov/
⁶ http://www.wikipedia.org/

Annotation: For each term in the term set, the top-5 results (i.e., hypernym labels) of the various methods are mixed and judged by human annotators. Each annotator assigns each result item a judgment of "Good", "Fair" or "Bad". The annotators do not know the method by which a result item is generated. Six annotators participated in the labeling, with a rough speed of 15 minutes per term. We also encourage the annotators to add new good results which are not discovered by any method. The term sets and their corresponding user annotations are available for download at the following links (dataset ID=data.queryset.semcat01): http://research.microsoft.com/en-us/projects/needleseek/ and http://needleseek.msra.cn/datasets/

Evaluation: We adopt the following metrics to evaluate the hypernym list of a term generated by each method. The evaluation score on a term set is the average over all the terms.
Precision@k: The percentage of relevant (Good or Fair) labels in the top-k results (labels judged as "Fair" are counted as 0.5).
Recall@k: The ratio of relevant labels in the top-k results to the total number of relevant labels.
R-Precision: Precision@R, where R is the total number of labels judged as "Good".
Mean average precision (MAP): The average of precision values at the positions of all Good or Fair results.

Before annotation and evaluation, the hypernym list generated by each method for each term is preprocessed to remove duplicate items. Two hypernyms are called duplicate items if they share the same head word (e.g., "military conflict" and "conflict"). For duplicate hypernyms, only the first (i.e., the highest ranked) one in the list is kept. The goal of such a preprocessing step is to partially consider result diversity in evaluation and to make a more meaningful comparison among different methods. Consider two hypernym lists for "subway":
List-1: restaurant; chain restaurant; worldwide chain restaurant; franchise; restaurant franchise…
List-2: restaurant; franchise; transportation; company; fast food…
There are more detailed hypernyms in the first list about "subway" as a restaurant or a franchise, while the second list covers a broader range of meanings for the term. It is hard to say which is better (without considering the upper-layer applications). With this preprocessing step, we keep our focus on short hypernyms rather than detailed ones.

Table 4. Performance comparison among various evidence fusion methods (Term sets: Wiki200 and Wiki100H; p=2 for PNorm)

6.2 Experimental results

We first compare the evaluation results of the different evidence fusion methods mentioned in Section 4.1. In Table 4, Linear means that Formula 3.1 is used to calculate label scores, whereas Log and PNorm represent our nonlinear approach with Formulas 4.11 and 4.12 being utilized. The performance improvement numbers shown in the table are relative to the linear version, with the upward-pointing arrows indicating relative percentage improvement over this baseline. From the table, we can see that the nonlinear methods outperform the linear ones on the Wiki200 term set. It is interesting to note that the performance improvement is more significant on Wiki100H, the set of high-frequency terms. By examining the labels and supporting sentences for the terms in each term set, we find that for many low-frequency terms (in Wiki100L), there are only a few supporting sentences (corresponding to one or two patterns), so the scores computed by the various fusion algorithms tend to be similar. In contrast, more supporting sentences can be discovered for high-frequency terms. Much information about the hypernyms of the high-frequency terms is contained in these sentences, but the linear function of Formula 3.1 fails to make effective use of it. The two nonlinear methods achieve better performance by appropriately modeling the dependency between supporting sentences and computing the log-probability gain in a better way.

The comparison of the linear and nonlinear methods on the Ext100 term set is shown in Table 5. Please note that the terms in Ext100 do not appear as Wikipedia titles. Thanks to the scale of the data corpus we are using, even the baseline approach achieves reasonably good performance; please note that the terms (refer to Table 3) we are using are "harder" than those adopted for evaluation in many existing papers.
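A sketch of the evaluation protocol described above, assuming each ranked label list has been judged and encoded as relevance values (Good = 1.0, Fair = 0.5, Bad = 0.0); the encoding and function names are illustrative.

```python
def dedup_by_head_word(labels):
    """Keep only the highest-ranked label among those sharing a head word
    (e.g., 'military conflict' and 'conflict' are duplicates)."""
    seen, kept = set(), []
    for label in labels:
        head = label.split()[-1]
        if head not in seen:
            seen.add(head)
            kept.append(label)
    return kept

def precision_at_k(judgments, k):
    """Relevant (Good or Fair) labels in the top k; 'Fair' counts as 0.5."""
    return sum(judgments[:k]) / k if k else 0.0

def recall_at_k(judgments, k, total_relevant):
    """Pseudo-recall: known relevant (Good or Fair) results stand in for all relevant ones."""
    return sum(1 for j in judgments[:k] if j > 0) / total_relevant if total_relevant else 0.0

def r_precision(judgments, num_good):
    """Precision@R, where R is the number of labels judged 'Good'."""
    return precision_at_k(judgments, num_good) if num_good else 0.0

def average_precision(judgments):
    """Average of precision values at the positions of all Good or Fair results."""
    precisions = [precision_at_k(judgments, i + 1) for i, j in enumerate(judgments) if j > 0]
    return sum(precisions) / len(precisions) if precisions else 0.0

print(dedup_by_head_word(["restaurant", "chain restaurant", "franchise"]))

# Top-5 judged labels for one term: Good, Fair, Bad, Good, Bad.
ranked = [1.0, 0.5, 0.0, 1.0, 0.0]
print(precision_at_k(ranked, 5), r_precision(ranked, 2),
      average_precision(ranked), recall_at_k(ranked, 5, total_relevant=3))
```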
Again, the result quality is improved with the nonlinear methods, although the performance improvement is not big because most terms in Ext100 are rare. Please note that the recall (R@1, R@5) in this paper is pseudo-recall, i.e., we treat the number of known relevant (Good or Fair) results as the total number of relevant ones.

Table 5. Performance comparison among various evidence fusion methods (Term set: Ext100; p=2 for PNorm)

The parameter p in the PNorm method is related to the degree of correlation among supporting sentences. The linear method of Formula 3.1 corresponds to the special case of p=1, while p=∞ represents the case that the other supporting sentences are fully correlated with the supporting sentence having the maximal log-probability gain. Figure 1 shows that, for most of the term sets, the best performance is obtained for p in [2.0, 4.0]. The reason may be that the sentence correlations are better estimated with p values in this range.

Figure 1. Performance curves of PNorm with different parameter values (Measure: MAP)

The experimental results of evidence propagation are shown in Table 6. The methods for comparison are:
Base: The linear function without propagation.
NL: Nonlinear evidence fusion (PNorm with p=2) without propagation.
LP: Linear propagation, i.e., the linear function is used to combine the evidence of pseudo supporting sentences.
NLP: Nonlinear propagation, where PNorm (p=2) is used to combine the pseudo supporting sentences.
NL+NLP: The nonlinear method is used to combine both supporting sentences and pseudo supporting sentences.

Table 6. Evidence propagation (Term set: Wiki200; Similarity graph: PB; Nonlinear formula: PNorm)

In this paper, we generate the DS (distributional similarity) and PB (pattern-based) graphs by adopting the best-performing methods studied in (Shi et al., 2010). The performance improvement numbers (indicated by the upward-pointing arrows) shown in Tables 6–9 are relative percentage improvements over the base approach (i.e., the linear function without propagation). The values of the propagation factor are set to maximize the MAP values.

Several observations can be made from Table 6. First, no performance improvement can be obtained with the linear propagation method (LP), while the nonlinear propagation algorithm (NLP) works quite well in improving both precision and recall. The results demonstrate the high correlation between pseudo supporting sentences and the great potential of using term similarity to improve hypernymy extraction. The second observation is that the NL+NLP approach achieves a much larger performance improvement than NL and NLP. Similar results (omitted due to space limitation) can be observed on the Ext100 term set.

Table 7. Evidence propagation (Term set: Wiki200; Nonlinear formula: Log)
Table 8. Evidence propagation (Term set: Wiki100L)

Now let us study whether it is possible to combine the PB and DS graphs to obtain better results. As shown in Tables 7, 8, and 9 (for term sets Wiki200, Wiki100L, and Ext100 respectively, using the Log formula for fusion and propagation), utilizing both graphs yields additional performance gains. We explain this by the fact that the information in the two term similarity graphs tends to be complementary. The performance improvement on Wiki100L is especially remarkable. This is reasonable because rare terms do not have adequate information in their supporting sentences due to data sparseness; as a result, they benefit the most from the pseudo supporting sentences propagated with the similarity graphs.
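The section reports that using the PB and DS graphs together yields further gains but, in the text above, does not spell out how the two graphs are merged. The sketch below shows one plausible merge, keeping for each neighbor the larger similarity from either graph before propagation; this is an illustrative assumption, not necessarily the paper's method.

```python
def merge_similarity_graphs(ds_graph, pb_graph):
    """Merge DS and PB neighbor lists; when a neighbor occurs in both graphs,
    keep the larger similarity (an illustrative choice, not from the paper)."""
    merged = {}
    for graph in (ds_graph, pb_graph):
        for term, neighbors in graph.items():
            slot = merged.setdefault(term, {})
            for neighbor, sim in neighbors:
                slot[neighbor] = max(sim, slot.get(neighbor, 0.0))
    return {t: sorted(nb.items(), key=lambda kv: -kv[1]) for t, nb in merged.items()}

ds = {"porvoo": [("helsinki", 0.6), ("tampere", 0.5)]}
pb = {"porvoo": [("helsinki", 0.8), ("rauma", 0.4)]}
print(merge_similarity_graphs(ds, pb))
# {'porvoo': [('helsinki', 0.8), ('tampere', 0.5), ('rauma', 0.4)]}
```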
Table 9. Evidence propagation (Term set: Ext100)

7 Conclusion

We demonstrated that the way of aggregating supporting sentences has considerable impact on the result quality of the hyponym extraction task using lexico-syntactic patterns, and that the widely-used counting method is not optimal. We applied a series of nonlinear evidence fusion formulas to the problem and saw noticeable performance improvement. The data quality is improved further with the combination of nonlinear evidence fusion and evidence propagation. We also introduced a new evaluation corpus with annotated hypernym labels for 300 terms, which has been shared with the research community.

Acknowledgments

We would like to thank Matt Callcut for reading through the paper. Thanks to the annotators for their efforts in judging the hypernym labels. Thanks to Yueguo Chen, Siyu Lei, and the anonymous reviewers for their helpful comments and suggestions. The first author is partially supported by the NSF of China (60903028, 61070014) and Key Projects in the Tianjin Science and Technology Pillar Program.

References

E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proc. of NAACL-HLT'2009.
M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. In Proc. of IJCAI'2007.
M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. In Proc. of the 34th Conference on Very Large Data Bases (VLDB'2008), pages 538–549, Auckland, New Zealand.
B. Van Durme and M. Pasca. 2008. Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. In Proc. of the Twenty-Third AAAI Conference on Artificial Intelligence.
F. Geraci, M. Pellegrini, M. Maggini, and F. Sebastiani. 2006. Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In Proc. of the 13th Conference on String Processing and Information Retrieval (SPIRE'2006), pages 25–36, Glasgow, Scotland.
Z. S. Harris. 1985. Distributional Structure. The Philosophy of Linguistics. New York: Oxford University Press.
M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proc. of the Fourteenth International Conference on Computational Linguistics, Nantes, France.
Z. Kozareva, E. Riloff, and E.H. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proc. of ACL'2008.
P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. In Proc. of EMNLP'2009, Singapore.
P. Pantel and D. Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL'2004), pages 321–328.
M. Pasca. 2004. Acquisition of Categorized Named Entities for Web Search. In Proc. of CIKM'2004.
M. Pasca. 2010. The Role of Queries in Ranking Labeled Instances Extracted from Text. In Proc. of COLING'2010, Beijing, China.
S. Shi, B. Lu, Y. Ma, and J.-R. Wen. 2009. Nonlinear Static-Rank Computation. In Proc. of CIKM'2009, Hong Kong.
S. Shi, H. Zhang, X. Yuan, and J.-R. Wen. 2010. Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches. In Proc. of COLING'2010, Beijing, China.
K. Shinzato and K. Torisawa. 2004.
Acquiring Hyponymy Relations from Web Documents. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL'2004).
R. Snow, D. Jurafsky, and A. Y. Ng. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proc. of the 19th Conference on Neural Information Processing Systems.
R. Snow, D. Jurafsky, and A. Y. Ng. 2006. Semantic Taxonomy Induction from Heterogeneous Evidence. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'2006), pages 801–808.
P. P. Talukdar and F. Pereira. 2010. Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'2010).
P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. 2008. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In Proc. of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP'2008), pages 581–589.
R.C. Wang and W.W. Cohen. 2009. Automatic Set Instance Extraction using the Web. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'2009), pages 441–449, Singapore.
H. Zhang, M. Zhu, S. Shi, and J.-R. Wen. 2009. Employing Topic Models for Pattern-based Semantic Class Discovery. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'2009), pages 441–449, Singapore.
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
Author: Fan Zhang ; Shuming Shi ; Jing Liu ; Shuqi Sun ; Chin-Yew Lin
Abstract: This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. A nonlinear probabilistic model is exploited to model the correlation between sentences in the aggregation of pattern matching results. Based on the model, we design a set of evidence combination and propagation algorithms. These significantly improve the result quality of existing approaches. Experimental results conducted on 500 million web pages and hypernym labels for 300 terms show over 20% performance improvement in terms of P@5, MAP and R-Precision. 1 Introduction1 An important task in text mining is the automatic extraction of entities and their lexical relations; this has wide applications in natural language processing and web search. This paper focuses on mining the hyponymy (or is-a) relation from largescale, open-domain web documents. From the viewpoint of entity classification, the problem is to automatically assign fine-grained class labels to terms. There have been a number of approaches (Hearst 1992; Pantel & Ravichandran 2004; Snow et al., 2005; Durme & Pasca, 2008; Talukdar et al., 2008) to address the problem. These methods typically exploited manually-designed or automatical* This work was performed when Fan Zhang and Shuqi Sun were interns at Microsoft Research Asia 1159 ly-learned patterns (e.g., “NP such as NP”, “NP like NP”, “NP is a NP”). Although some degree of success has been achieved with these efforts, the results are still far from perfect, in terms of both recall and precision. As will be demonstrated in this paper, even by processing a large corpus of 500 million web pages with the most popular patterns, we are not able to extract correct labels for many (especially rare) entities. Even for popular terms, incorrect results often appear in their label lists. The basic philosophy in existing hyponymy extraction approaches (and also many other textmining methods) is counting: count the number of supporting sentences. Here a supporting sentence of a term-label pair is a sentence from which the pair can be extracted via an extraction pattern. We demonstrate that the specific way of counting has a great impact on result quality, and that the state-ofthe-art counting methods are not optimal. Specifically, we examine the problem from the viewpoint of probabilistic evidence combination and find that the probabilistic assumption behind simple counting is the statistical independence between the observations of supporting sentences. By assuming a positive correlation between supporting sentence observations and adopting properly designed nonlinear combination functions, the results precision can be improved. It is hard to extract correct labels for rare terms from a web corpus due to the data sparseness problem. To address this issue, we propose an evidence propagation algorithm motivated by the observation that similar terms tend to share common hypernyms. For example, if we already know that 1) Helsinki and Tampere are cities, and 2) Porvoo is similar to Helsinki and Tampere, then Porvoo is ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s159–1168, very likely also a city. This intuition, however, does not mean that the labels of a term can always be transferred to its similar terms. For example, Mount Vesuvius and Kilimanjaro are volcanoes and Lhotse is similar to them, but Lhotse is not a volcano. 
Therefore we should be very conservative and careful in hypernym propagation. In our propagation algorithm, we first construct some pseudo supporting sentences for a term from the supporting sentences of its similar terms. Then we calculate label scores for terms by performing nonlinear evidence combination based on the (pseudo and real) supporting sentences. Such a nonlinear propagation algorithm is demonstrated to perform better than linear propagation. Experimental results on a publicly available collection of 500 million web pages with hypernym labels annotated for 300 terms show that our nonlinear evidence fusion and propagation significantly improve the precision and coverage of the extracted hyponymy data. This is one of the technologies adopted in our semantic search and mining system NeedleSeek (http://research.microsoft.com/en-us/projects/needleseek/ or http://needleseek.msra.cn/).

In the next section, we discuss major related efforts and how they differ from our work. Section 3 is a brief description of the baseline approach. The probabilistic evidence combination model that we exploited is introduced in Section 4. Our main approach is illustrated in Section 5. Section 6 shows our experimental settings and results. Finally, Section 7 concludes this paper.

2 Related Work

Existing efforts for hyponymy relation extraction have been conducted upon various types of data sources, including plain-text corpora (Hearst 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Banko et al., 2007; Durme & Pasca, 2008; Talukdar et al., 2008), semi-structured web pages (Cafarella et al., 2008; Shinzato & Torisawa, 2004), web search results (Geraci et al., 2006; Kozareva et al., 2008; Wang & Cohen, 2009), and query logs (Pasca 2010). Our target for optimization in this paper is the family of approaches that use lexico-syntactic patterns to extract hyponymy relations from plain-text corpora. Our future work will study the application of the proposed algorithms to other types of approaches.

The probabilistic evidence combination model that we exploit here was first proposed in (Shi et al., 2009), for combining page in-link evidence in building a nonlinear static-rank computation algorithm. We applied it to the hyponymy extraction problem because the model takes the dependency between supporting sentences into consideration and the resultant evidence fusion formulas are quite simple. In (Snow et al., 2006), a probabilistic model was adopted to combine evidence from heterogeneous relationships to jointly optimize the relationships. The independence of evidence was assumed in their model. In comparison, we show that better results can be obtained if the evidence correlation is modeled appropriately.

Our evidence propagation is basically about using term similarity information to help instance labeling. There have been several approaches which improve hyponymy extraction with instance clusters built by distributional similarity. In (Pantel & Ravichandran, 2004), labels were assigned to the committee (i.e., representative members) of a semantic class and used as the hypernyms of the whole class. Labels generated by their approach tend to be rather coarse-grained, excluding the possibility of a term having its private labels (considering the case that one meaning of a term is not covered by the input semantic classes). In contrast to their method, our label scoring and ranking approach is applied to every single term rather than a semantic class.
In addition, we also compute label scores in a nonlinear way, which improves result quality. In Snow et al. (2005), a supervised approach was proposed to improve hypernym classification using coordinate terms. In comparison, our approach is unsupervised. Durme & Pasca (2008) cleaned the set of instance-label pairs with a TF*IDF-like method, by exploiting clusters of semantically related phrases. The core idea is to keep a term-label pair (T, L) only if the number of terms having the label L in the term T’s cluster is above a threshold and if L is not the label of too many clusters (otherwise the pair will be discarded). In contrast, we are able to add new (high-quality) labels for a term with our evidence propagation method. On the other hand, low-quality labels get smaller score gains via propagation and are ranked lower. Label propagation is performed in (Talukdar et al., 2008; Talukdar & Pereira, 2010) based on multiple instance-label graphs. Term similarity information was not used in their approach.

Most existing work tends to utilize small-scale or private corpora, whereas the corpus that we use is publicly available and much larger than those of most existing work. We published our term sets (refer to Section 6.1) and their corresponding user judgments so researchers working on similar topics can reproduce our results.

Table 1. Patterns adopted in this paper (NP: noun phrase representing an entity; NPL: label)
Hearst-I: NPL {,} (including|such as|e.g.) {NP,}* {and|or} NP
IsA-I: NP (is|are|was|were|being) (a|an) NPL
IsA-II: NP (is|are|was|were|being) {the, those} NPL
IsA-III: NP (is|are|was|were|being) {another, any} NPL

3 Preliminaries

The problem addressed in this paper is corpus-based is-a relation mining: extracting hypernyms (as labels) for entities from a large-scale, open-domain document corpus. The desired output is a mapping from terms to their corresponding hypernyms, which can naturally be represented as a weighted bipartite graph (term-label graph). Typically we are only interested in the top labels of a term in the graph.

Following existing efforts, we adopt pattern matching as a basic way of extracting hypernymy/hyponymy relations. Two types of patterns (refer to Table 1) are employed, including the popular “Hearst patterns” (Hearst, 1992) and the IsA patterns, which are exploited less frequently in existing hyponym mining efforts. One or more term-label pairs can be extracted if a pattern matches a sentence. In the baseline approach, the weight of an edge T→L (from term T to hypernym label L) in the term-label graph is computed as

w(T→L) = m · IDF(L) = m · log(N / DF(L))   (3.1)

where m is the number of times the pair (T, L) is extracted from the corpus, DF(L) is the number of in-links of L in the graph, N is the total number of terms in the graph, and IDF means the “inverse document frequency”. A term can only keep its top-k neighbors (according to the edge weight) in the graph as its final labels.

Our pattern matching algorithm implemented in this paper uses part-of-speech (POS) tagging information, without adopting a parser or a chunker. The noun phrase boundaries (for terms and labels) are determined by a manually designed POS tag list.
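To make the baseline concrete, the edge weighting of Formula 3.1 and the top-k label selection can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the function names, the pair-count input format, and the top-k helper are assumptions made here.

import math
from collections import defaultdict

def baseline_scores(pair_counts):
    """Formula 3.1: w(T->L) = m * IDF(L) = m * log(N / DF(L)).
    pair_counts maps (term, label) -> m, the number of supporting sentences
    extracted for the pair by the patterns of Table 1."""
    df = defaultdict(set)   # DF(L): in-links of L, i.e. the distinct terms linking to it
    for (term, label), _ in pair_counts.items():
        df[label].add(term)
    n_terms = len({term for term, _ in pair_counts})   # N: total number of terms in the graph
    return {(term, label): m * math.log(n_terms / len(df[label]))
            for (term, label), m in pair_counts.items()}

def top_k_labels(scores, k=5):
    """Keep only the top-k neighbors of each term as its final labels."""
    per_term = defaultdict(list)
    for (term, label), s in scores.items():
        per_term[term].append((s, label))
    return {t: [label for _, label in sorted(pairs, reverse=True)[:k]]
            for t, pairs in per_term.items()}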
4 Probabilistic Label-Scoring Model

Here we model the hyponymy extraction problem from the probability theory point of view, aiming at estimating the score of a term-label pair (i.e., the score of a label w.r.t. a term) with probabilistic evidence combination. The model was studied in (Shi et al., 2009) to combine page in-link evidence in building a nonlinear static-rank computation algorithm.

We represent the score of a term-label pair by the probability of the label being a correct hypernym of the term, and define the following events:

A_{T,L}: Label L is a hypernym of term T (the abbreviated form A is used in this paper unless it is ambiguous).
Ei: The observation that (T, L) is extracted from a sentence Si via pattern matching (i.e., Si is a supporting sentence of the pair).

Assuming that we already know m supporting sentences (S1~Sm), our problem is to compute P(A|E1, E2, …, Em), the posterior probability that L is a hypernym of term T, given evidence E1~Em. Formally, we need to find a function f to satisfy

P(A|E1, …, Em) = f(P(A), P(A|E1), …, P(A|Em))   (4.1)

For simplicity, we first consider the case of m=2; the case of m>2 is quite similar. We start from the simple case of independent supporting sentences. That is,

P(E1, E2) = P(E1) · P(E2)   (4.2)
P(E1, E2|A) = P(E1|A) · P(E2|A)   (4.3)

By applying the Bayes rule, we get

P(A|E1, E2) = P(A) · P(E1, E2|A) / P(E1, E2) = P(A) · P(E1|A) · P(E2|A) / (P(E1) · P(E2)) = P(A|E1) · P(A|E2) / P(A)   (4.4)

Then define

G(A|E) = log P(A|E) − log P(A)

Here G(A|E) represents the log-probability gain of A given E, with the meaning of the gain in the log-probability value of A after the evidence E is observed (or known). It is a measure of the impact of evidence E on the probability of event A. With the definition of G(A|E), Formula 4.4 can be transformed to

G(A|E1, E2) = G(A|E1) + G(A|E2)   (4.5)

Therefore, if E1 and E2 are independent, the log-probability gain of A given both pieces of evidence will be exactly the sum of the gains of A given every single piece of evidence respectively. It is easy to prove (by following a similar procedure) that the above formula holds for the case of m>2, as long as the pieces of evidence are mutually independent. Therefore, for a term-label pair with m mutually independent supporting sentences, if we set every gain G(A|Ei) to be a constant value g, the posterior gain score of the pair will be Σ_{i=1..m} G(A|Ei) = m · g. If the value g is the IDF of label L, the posterior gain will be

G(A_{T,L}|E1, …, Em) = Σ_{i=1..m} IDF(L) = m · log(N / DF(L))   (4.6)

This is exactly Formula 3.1. In this way, we provide a probabilistic explanation of scoring the candidate labels for a term via simple counting.
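The independent-evidence case above (Formulas 4.4–4.6) can be illustrated with a short, self-contained sketch; the probability values below are invented purely to show that the gains add up, and are not estimates from the paper.

import math

def gain(p_a_given_e, p_a):
    """G(A|E) = log P(A|E) - log P(A), the log-probability gain of A given E."""
    return math.log(p_a_given_e) - math.log(p_a)

# Made-up numbers: a low prior that a random term-label pair is a true
# hyponymy pair, raised by each observed supporting sentence.
p_a = 1e-4
p_a_e1, p_a_e2 = 0.005, 0.008

# The independence assumptions (4.2)/(4.3) give P(A|E1,E2) = P(A|E1)*P(A|E2)/P(A)   (4.4)
p_a_e1e2 = p_a_e1 * p_a_e2 / p_a
# ... which is the same as saying that the gains simply add up                      (4.5)
assert abs(gain(p_a_e1e2, p_a) - (gain(p_a_e1, p_a) + gain(p_a_e2, p_a))) < 1e-9

# With m sentences and a constant per-sentence gain g = IDF(L), the posterior
# gain is m * IDF(L): exactly the counting score of Formula 3.1.                    (4.6)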
In the above analysis, we assume the statistical independence of the supporting sentence observations, which may not hold in reality. Intuitively, if we already know one supporting sentence S1 for a term-label pair (T, L), then we have a better chance of finding another supporting sentence than if we do not know S1. The reason is that, before we find S1, we have to estimate the probability with the chance of discovering a supporting sentence for a random term-label pair. This probability is quite low because most term-label pairs do not have hyponymy relations. Once we have observed S1, however, the chance of (T, L) having a hyponymy relation increases, and therefore the chance of observing another supporting sentence becomes larger than before.

Table 2 shows the rough estimation of P(E2|E1,A)/P(E2|A) (denoted as RA), P(E2|E1)/P(E2) (denoted as R), and their ratios, for intra-pattern and inter-pattern supporting sentences. The statistics are obtained by performing maximal likelihood estimation (MLE) upon our corpus and a random selection of term-label pairs from our term sets (see Section 6.1) together with their top labels; RA is estimated from the labels judged as “Good”, whereas the estimation of R is from all judged labels.

Table 2. Evidence dependency estimation (RA and R) for intra-pattern and inter-pattern supporting sentences

The data verifies our analysis about the correlation between E1 and E2 (note that R=1 would mean independence). In addition, it can be seen that the conditional independence assumption of Formula 4.3 does not hold either (because RA>1). It is hence necessary to consider the correlation between supporting sentences in the model. The estimation of Table 2 also indicates that

P(E2|E1, A) / P(E2|A) ≤ P(E2|E1) / P(E2)   (4.7)

By following a similar procedure as above, with Formulas 4.2 and 4.3 replaced by 4.7, we have

G(A|E1, E2) ≤ G(A|E1) + G(A|E2)   (4.8)

This formula indicates that when the supporting sentences are positively correlated, the posterior score of label L w.r.t. term T (given both sentences) is smaller than the sum of the gains caused by each sentence alone. In the extreme case that sentence S2 fully depends on E1 (i.e., P(E2|E1)=1), it is easy to prove that G(A|E1, E2) = G(A|E1). This is reasonable, since event E2 then brings in no more information than E1.

Formula 4.8 cannot be used directly for computing the posterior gain. What we really need is a function h satisfying

G(A|E1, …, Em) = h(G(A|E1), …, G(A|Em))   (4.9)

and

h(x1, …, xm) ≤ Σ_{i=1..m} xi   (4.10)

Shi et al. (2009) discussed other constraints on h and suggested the following nonlinear functions,

h(x1, …, xm) = log(Σ_{i=1..m} exp(xi))   (4.11)
h(x1, …, xm) = (Σ_{i=1..m} xi^p)^{1/p}   (p > 1)   (4.12)

In the next section, we use the above two h functions as basic building blocks to compute label scores for terms.

5 Our Approach

Multiple types of patterns (Table 1) can be adopted to extract term-label pairs. For two supporting sentences, the correlation between them may depend on whether they correspond to the same pattern. In Section 5.1, our nonlinear evidence fusion formulas are constructed by making specific assumptions about the correlation between intra-pattern supporting sentences and inter-pattern ones. Then in Section 5.2, we introduce our evidence propagation technique, in which the evidence of a (T, L) pair is propagated to the terms similar to T.

5.1 Nonlinear evidence fusion

For a term-label pair (T, L), assume K patterns are used for hyponymy extraction and the supporting sentences discovered with pattern i are

S_{i,1}, S_{i,2}, …, S_{i,mi}   (5.1)

where mi is the number of supporting sentences corresponding to pattern i. Also assume the gain score of S_{i,j} is x_{i,j}, i.e., x_{i,j} = G(A|S_{i,j}). Generally speaking, supporting sentences corresponding to the same pattern typically have a higher correlation than sentences corresponding to different patterns. This can be verified by the data in Table 2. By ignoring the inter-pattern correlations, we make the following simplified assumption:

Assumption: Supporting sentences corresponding to the same pattern are correlated, while those of different patterns are independent.

According to this assumption, our label-scoring function is

G(A|E1, …, Em) = Σ_{i=1..K} h(x_{i,1}, …, x_{i,mi})   (5.2)

In the simple case that x_{i,j} = IDF(L), if the h function of Formula 4.12 is adopted (with p=2), then

Score(T, L) = (Σ_{i=1..K} √mi) · IDF(L)   (5.3)

We use an example to illustrate the above formula. Example: For term T and label L1, assume the numbers of the supporting sentences corresponding to the six pattern types in Table 1 are (4, 4, 4, 4, 4, 4), which means the number of supporting sentences discovered by each pattern type is 4. Also assume the supporting-sentence-count vector of label L2 is (25, 0, 0, 0, 0, 0). If we use Formula 5.3 to compute the scores of L1 and L2, we have (ignoring IDF for simplicity) Score(L1) = 6·√4 = 12 and Score(L2) = √25 = 5, so Score(L1) > Score(L2). On the other hand, if we simply count the total number of supporting sentences, the score of L2 will be larger. The rationale implied in the formula is: for a given term T, the labels supported by multiple types of patterns tend to be more reliable than those supported by a single pattern type, if they have the same number of supporting sentences.
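To make the fusion step concrete, here is a small sketch of the two h functions and the per-pattern scoring of Formula 5.2, reproducing the worked example above. The exact form of the Log function is reconstructed here as a log-sum-exp, one function satisfying constraints 4.9–4.10; treat that form, and the helper names, as assumptions rather than the paper's definition.

import math

def h_log(gains):
    """'Log' fusion, reconstructed here as the log of the summed exp-gains (Formula 4.11)."""
    return math.log(sum(math.exp(x) for x in gains))

def h_pnorm(gains, p=2.0):
    """'PNorm' fusion (Formula 4.12): (sum_i x_i^p)^(1/p) with p > 1."""
    return sum(x ** p for x in gains) ** (1.0 / p)

def label_score(per_pattern_gains, h=h_pnorm):
    """Formula 5.2: combine gains within each pattern with h, then sum across patterns."""
    return sum(h(gains) for gains in per_pattern_gains if gains)

# Worked example from the text, with unit gains and IDF ignored: a label backed
# by 4 sentences under each of six patterns beats one backed by 25 sentences
# under a single pattern when p = 2.
l1 = [[1.0] * 4 for _ in range(6)]   # supporting-sentence counts (4, 4, 4, 4, 4, 4)
l2 = [[1.0] * 25]                    # supporting-sentence counts (25, 0, 0, 0, 0, 0)
print(label_score(l1), label_score(l2))   # 6*sqrt(4) = 12.0  vs  sqrt(25) = 5.0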
5.2 Evidence propagation

According to the evidence fusion algorithm described above, in order to extract term labels reliably, it is desirable to have many supporting sentences of different types. This is a big challenge for rare terms, due to their low frequency in sentences (and their even lower frequency in supporting sentences, because not all occurrences can be covered by patterns). With evidence propagation, we aim at discovering more supporting sentences for terms (especially rare terms). Evidence propagation is motivated by the following two observations:

(I) Similar entities or coordinate terms tend to share some common hypernyms.
(II) Large term similarity graphs can be built efficiently with state-of-the-art techniques (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). With the graphs, we can obtain the similarity between two terms without their hypernyms being available.

The first observation motivates us to “borrow” the supporting sentences from other terms as auxiliary evidence of the term. The second observation means that new information is brought in with the state-of-the-art term similarity graphs (in addition to the term-label information discovered with the patterns of Table 1).

Our evidence propagation algorithm contains two phases. In phase I, some pseudo supporting sentences are constructed for a term from the supporting sentences of its neighbors in the similarity graph. Then we calculate the label scores for terms based on their (pseudo and real) supporting sentences.

Phase I: For every supporting sentence S and every similar term T1 of the term T, add a pseudo supporting sentence S1 for T1, with the gain score

G(A_{T1,L}|S1) = λ · sim(T, T1) · G(A_{T,L}|S)   (5.5)

where λ is the propagation factor, and sim(T, T1) is the term similarity function taking values in [0, 1]. The formula reasonably assumes that the gain score of the pseudo supporting sentence depends on the gain score of the original real supporting sentence, the similarity between the two terms, and the propagation factor.

Phase II: The nonlinear evidence combination formulas in the previous subsection are adopted to combine the evidence of the pseudo supporting sentences.
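A minimal sketch of the two phases under the notation above. The propagation factor (lam for λ), the dictionary-based similarity-graph interface, and the data layout are assumptions made for illustration, not the paper's data structures.

def propagate_pseudo_gains(gains_by_term, similar_terms, lam=0.5):
    """Phase I (Formula 5.5): for every real supporting-sentence gain of term T,
    add a pseudo gain lam * sim(T, T1) * G(S) for each similar term T1.

    gains_by_term: {term: {label: [gain, ...]}}             real supporting-sentence gains
    similar_terms: {term: [(other_term, similarity), ...]}  edges of the similarity graph
    """
    pseudo = {}
    for term, by_label in gains_by_term.items():
        for t1, sim in similar_terms.get(term, []):
            for label, gains in by_label.items():
                pseudo.setdefault(t1, {}).setdefault(label, []).extend(
                    lam * sim * g for g in gains)
    return pseudo

# Phase II: merge the real and pseudo gains of a (term, label) pair, grouped per
# pattern, and rescore them with the nonlinear fusion, e.g. label_score(..., h=h_pnorm)
# from the sketch above.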
Term similarity graphs can be obtained by distributional similarity or by patterns (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). We call the first type of graph DS and the second type PB. DS approaches are based on the distributional hypothesis (Harris, 1985), which says that terms appearing in analogous contexts tend to be similar. In a DS approach, a term is represented by a feature vector, with each feature corresponding to a context in which the term appears. The similarity between two terms is computed as the similarity between their corresponding feature vectors. In PB approaches, a list of carefully designed (or automatically learned) patterns is exploited and applied to a text collection, with the hypothesis that the terms extracted by applying each of the patterns to a specific piece of text tend to be similar. Two categories of patterns have been studied in the literature (Hearst 1992; Pasca 2004; Kozareva et al., 2008; Zhang et al., 2009): sentence lexical patterns and HTML tag patterns. An example of a sentence lexical pattern is “T {, T}* {,} (and|or) T”. HTML tag patterns include HTML tables, drop-down lists, and other tag repeat patterns. In this paper, we generate the DS and PB graphs by adopting the best-performed methods studied in (Shi et al., 2010). We will compare, by experiments, the propagation performance of utilizing the two categories of graphs, and also investigate the performance of utilizing both graphs for evidence propagation.

6 Experiments

6.1 Experimental setup

Corpus: We adopt a publicly available dataset in our experiments: ClueWeb09 (http://boston.lti.cs.cmu.edu/Data/clueweb09/). This is a very large dataset collected by Carnegie Mellon University in early 2009 and has been used by several tracks of the Text Retrieval Conference (TREC, http://trec.nist.gov/). The whole dataset consists of 1.04 billion web pages in ten languages, while only those in English, about 500 million pages, are used in our experiments. The reason for selecting such a dataset is twofold: First, it is a corpus large enough for conducting web-scale experiments and getting meaningful results. Second, since it is publicly available, it is possible for other researchers to reproduce the experiments in this paper.

Term sets: Approaches are evaluated by using two sets of selected terms: Wiki200 and Ext100. For every term in the term sets, each approach generates a list of hypernym labels, which are manually judged by human annotators. Wiki200 is constructed by first randomly selecting 400 Wikipedia (http://www.wikipedia.org/) titles as our candidate terms, with the probability of a title T being selected proportional to log(F(T)), where F(T) is the frequency of T in our data corpus. The reason for adopting such a probability formula is to balance popular terms and rare ones in our term set. Then 200 terms are manually selected from the 400 candidate terms, with the principle of maximizing the diversity of terms in terms of length (i.e., number of words) and type (person, location, organization, software, movie, song, animal, plant, etc.). Wiki200 is further divided into two subsets: Wiki100H and Wiki100L, containing respectively the 100 high-frequency and the 100 low-frequency terms. Ext100 is built by first selecting 200 non-Wikipedia-title terms at random from the term-label graph generated by the baseline approach (Formula 3.1), then manually selecting 100 terms. Some sample terms in the term sets are listed in Table 3.
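For illustration, the candidate-term sampling can be sketched as below. The reading that the selection probability is proportional to log(F(T)) is an assumption about the partially garbled formula above, as are the function and argument names.

import math
import random

def sample_candidate_titles(title_freqs, n=400, seed=0):
    """Draw candidate Wikipedia titles with probability proportional to log(F(T)),
    which damps the bias toward very frequent titles so rare terms are also drawn.

    title_freqs: dict mapping title -> corpus frequency F(T).
    """
    rng = random.Random(seed)
    titles = [t for t, f in title_freqs.items() if f >= 2]   # log(1) = 0 would never be drawn
    weights = [math.log(title_freqs[t]) for t in titles]
    # Sampling with replacement; duplicates can be dropped before the manual selection step.
    return rng.choices(titles, weights=weights, k=n)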
Annotation: For each term in the term set, the top-5 results (i.e., hypernym labels) of the various methods are mixed and judged by human annotators. Each annotator assigns each result item a judgment of “Good”, “Fair” or “Bad”. The annotators do not know the method by which a result item is generated. Six annotators participated in the labeling, with a rough speed of 15 minutes per term. We also encourage the annotators to add new good results which are not discovered by any method. The term sets and their corresponding user annotations are available for download at the following links (dataset ID=data.queryset.semcat01): http://research.microsoft.com/en-us/projects/needleseek/ and http://needleseek.msra.cn/datasets/

Evaluation: We adopt the following metrics to evaluate the hypernym list of a term generated by each method. The evaluation score on a term set is the average over all the terms.

Precision@k: The percentage of relevant (good or fair) labels in the top-k results (labels judged as “Fair” are counted as 0.5).
Recall@k: The ratio of relevant labels in the top-k results to the total number of relevant labels.
R-Precision: Precision@R, where R is the total number of labels judged as “Good”.
Mean average precision (MAP): The average of the precision values at the positions of all good or fair results.

Before annotation and evaluation, the hypernym list generated by each method for each term is preprocessed to remove duplicate items. Two hypernyms are called duplicate items if they share the same head word (e.g., “military conflict” and “conflict”). For duplicate hypernyms, only the first (i.e., the highest ranked) one in the list is kept. The goal of such a preprocessing step is to partially consider result diversity in evaluation and to make a more meaningful comparison among different methods. Consider two hypernym lists for “subway”:

List-1: restaurant; chain restaurant; worldwide chain restaurant; franchise; restaurant franchise…
List-2: restaurant; franchise; transportation; company; fast food…

There are more detailed hypernyms in the first list about “subway” as a restaurant or a franchise, while the second list covers a broader range of meanings of the term. It is hard to say which is better (without considering the upper-layer applications). With this preprocessing step, we keep our focus on short hypernyms rather than detailed ones.
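The metrics and the head-word de-duplication can be sketched as follows. Whether the 0.5 credit for “Fair” also applies inside the precision values averaged by MAP is not fully specified above, so this sketch simply reuses the same credit throughout; treat that choice, and the helper names, as assumptions.

def dedupe_by_head_word(labels):
    """Keep only the highest-ranked label for each head word
    (e.g. 'military conflict' and 'conflict' share the head word 'conflict')."""
    seen, kept = set(), []
    for label in labels:
        head = label.split()[-1]
        if head not in seen:
            seen.add(head)
            kept.append(label)
    return kept

CREDIT = {"Good": 1.0, "Fair": 0.5, "Bad": 0.0}

def precision_at_k(judgments, k):
    """Precision@k over a ranked list of 'Good'/'Fair'/'Bad' judgments."""
    return sum(CREDIT[j] for j in judgments[:k]) / k

def r_precision(judgments):
    r = sum(1 for j in judgments if j == "Good")
    return precision_at_k(judgments, r) if r else 0.0

def mean_average_precision(judgments):
    """Average of the precision values at the positions of all Good or Fair results."""
    precisions = [precision_at_k(judgments, i + 1)
                  for i, j in enumerate(judgments) if j in ("Good", "Fair")]
    return sum(precisions) / len(precisions) if precisions else 0.0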
Table 4. Performance comparison among various evidence fusion methods (Term sets: Wiki200 and Wiki100H; p=2 for PNorm)

6.2 Experimental results

We first compare the evaluation results of the different evidence fusion methods described in Sections 4 and 5.1. In Table 4, Linear means that Formula 3.1 is used to calculate label scores, whereas Log and PNorm represent our nonlinear approach with Formulas 4.11 and 4.12 being utilized. The performance improvement numbers shown in the table are based on the linear version; the upward pointing arrows indicate relative percentage improvement over the baseline. From the table, we can see that the nonlinear methods outperform the linear ones on the Wiki200 term set. It is interesting to note that the performance improvement is more significant on Wiki100H, the set of high-frequency terms. By examining the labels and supporting sentences for the terms in each term set, we find that for many low-frequency terms (in Wiki100L) there are only a few supporting sentences (corresponding to one or two patterns), so the scores computed by the various fusion algorithms tend to be similar. In contrast, more supporting sentences can be discovered for high-frequency terms. Much information about the hypernyms of the high-frequency terms is contained in their sentences, but the linear function of Formula 3.1 fails to make effective use of it. The two nonlinear methods achieve better performance by appropriately modeling the dependency between supporting sentences and computing the log-probability gain in a better way.

The comparison of the linear and nonlinear methods on the Ext100 term set is shown in Table 5. Please note that the terms in Ext100 do not appear in Wikipedia titles. Thanks to the scale of the data corpus we are using, even the baseline approach achieves reasonably good performance. Please also note that the terms (refer to Table 3) we are using are “harder” than those adopted for evaluation in many existing papers. Again, the results quality is improved with the nonlinear methods, although the performance improvement is not large because most terms in Ext100 are rare. Please note that the recall (R@1, R@5) in this paper is pseudo-recall, i.e., we treat the number of known relevant (Good or Fair) results as the total number of relevant ones.

Table 5. Performance comparison among various evidence fusion methods (Term set: Ext100; p=2 for PNorm)

The parameter p in the PNorm method is related to the degree of correlation among supporting sentences. The linear method of Formula 3.1 corresponds to the special case of p=1, while p=∞ represents the case that the other supporting sentences are fully correlated with the supporting sentence with the maximal log-probability gain. Figure 1 shows that, for most of the term sets, the best performance is obtained for p in [2.0, 4.0]. The reason may be that the sentence correlations are better estimated with p values in this range.

Figure 1. Performance curves of PNorm with different parameter values (Measure: MAP)

The experimental results of evidence propagation are shown in Table 6. The methods for comparison are:

Base: The linear function without propagation.
NL: Nonlinear evidence fusion (PNorm with p=2) without propagation.
LP: Linear propagation, i.e., the linear function is used to combine the evidence of pseudo supporting sentences.
NLP: Nonlinear propagation, where PNorm (p=2) is used to combine the pseudo supporting sentences.
NL+NLP: The nonlinear method is used to combine both supporting sentences and pseudo supporting sentences.

Table 6. Experimental results of evidence propagation (Term set: Wiki200; Similarity graph: PB; Nonlinear formula: PNorm)

In this paper, we generate the DS (distributional similarity) and PB (pattern-based) graphs by adopting the best-performed methods studied in (Shi et al., 2010). The performance improvement numbers (indicated by the upward pointing arrows) shown in Tables 6~9 are relative percentage improvements over the base approach (i.e., the linear function without propagation). The values of the propagation factor λ are set to maximize the MAP values. Several observations can be made from Table 6. First, no performance improvement can be obtained with the linear propagation method (LP), while the nonlinear propagation algorithm (NLP) works quite well in improving both precision and recall. The results demonstrate the high correlation between pseudo supporting sentences and the great potential of using term similarity to improve hypernymy extraction. The second observation is that the NL+NLP approach achieves a much larger performance improvement than NL and NLP. Similar results (omitted due to space limitation) can be observed on the Ext100 term set.

Table 7. Experimental results of evidence propagation (Term set: Wiki200; Nonlinear formula: Log)
Table 8. Experimental results of evidence propagation (Term set: Wiki100L)

Now let us study whether it is possible to combine the PB and DS graphs to obtain better results. As shown in Tables 7, 8, and 9 (for term sets Wiki200, Wiki100L, and Ext100 respectively, using the Log formula for fusion and propagation), utilizing both graphs indeed yields additional performance gains. We explain this by the fact that the information in the two term similarity graphs tends to be complementary. The performance improvement on Wiki100L is especially remarkable. This is reasonable because rare terms do not have adequate information in their supporting sentences due to data sparseness. As a result, they benefit the most from the pseudo supporting sentences propagated with the similarity graphs.
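For reference, the compared configurations can be summarized in one scoring routine. How the real (non-pseudo) evidence is combined inside LP and NLP is this sketch's reading of the setup, and the data layout is assumed; it is not the authors' implementation.

def score_with(config, real, pseudo, p=2.0):
    """Score one (term, label) pair; real / pseudo are lists of per-pattern gain lists
    for real and propagated (pseudo) supporting sentences respectively."""
    linear = lambda gains: sum(gains)                               # counting, as in Formula 3.1
    pnorm = lambda gains: sum(x ** p for x in gains) ** (1.0 / p)   # PNorm, Formula 4.12
    if config == "Base":
        return sum(linear(g) for g in real)
    if config == "NL":
        return sum(pnorm(g) for g in real)
    if config == "LP":
        return sum(linear(g) for g in real + pseudo)
    if config == "NLP":
        return sum(linear(g) for g in real) + sum(pnorm(g) for g in pseudo)
    if config == "NL+NLP":
        return sum(pnorm(g) for g in real + pseudo)
    raise ValueError(config)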
Table 9. Experimental results of evidence propagation (Term set: Ext100)

7 Conclusion

We demonstrated that the way of aggregating supporting sentences has considerable impact on the result quality of the hyponym extraction task using lexico-syntactic patterns, and that the widely used counting method is not optimal. We applied a series of nonlinear evidence fusion formulas to the problem and saw noticeable performance improvement. The data quality is improved further with the combination of nonlinear evidence fusion and evidence propagation. We also introduced a new evaluation corpus with annotated hypernym labels for 300 terms, which has been shared with the research community.

Acknowledgments

We would like to thank Matt Callcut for reading through the paper. Thanks to the annotators for their efforts in judging the hypernym labels. Thanks to Yueguo Chen, Siyu Lei, and the anonymous reviewers for their helpful comments and suggestions. The first author is partially supported by the NSF of China (60903028, 61070014), and Key Projects in the Tianjin Science and Technology Pillar Program.

References

E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proc. of NAACL-HLT’2009.
M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. In Proc. of IJCAI’2007.
M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. In Proceedings of the 34th Conference on Very Large Data Bases (VLDB’2008), pages 538–549, Auckland, New Zealand.
B. Van Durme and M. Pasca. 2008. Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. In Twenty-Third AAAI Conference on Artificial Intelligence.
F. Geraci, M. Pellegrini, M. Maggini, and F. Sebastiani. 2006. Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In Proceedings of the 13th Conference on String Processing and Information Retrieval (SPIRE’2006), pages 25–36, Glasgow, Scotland.
Z. S. Harris. 1985. Distributional Structure. The Philosophy of Linguistics. New York: Oxford University Press.
M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Fourteenth International Conference on Computational Linguistics, Nantes, France.
Z. Kozareva, E. Riloff, and E.H. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proc. of ACL’2008.
P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. In Proc. of EMNLP’2009, Singapore.
P. Pantel and D. Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL’2004), pages 321–328.
M. Pasca. 2004. Acquisition of Categorized Named Entities for Web Search. In Proc. of CIKM’2004.
M. Pasca. 2010. The Role of Queries in Ranking Labeled Instances Extracted from Text. In Proc. of COLING’2010, Beijing, China.
S. Shi, B. Lu, Y. Ma, and J.-R. Wen. 2009. Nonlinear Static-Rank Computation. In Proc. of CIKM’2009, Hong Kong.
S. Shi, H. Zhang, X. Yuan, and J.-R. Wen. 2010. Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches. In Proc. of COLING’2010, Beijing, China.
K. Shinzato and K. Torisawa. 2004.
Acquiring Hyponymy Relations from Web Documents. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL’2004).
R. Snow, D. Jurafsky, and A. Y. Ng. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proceedings of the 19th Conference on Neural Information Processing Systems.
R. Snow, D. Jurafsky, and A. Y. Ng. 2006. Semantic Taxonomy Induction from Heterogenous Evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06), pages 801–808.
P. P. Talukdar and F. Pereira. 2010. Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. In 48th Annual Meeting of the Association for Computational Linguistics (ACL’2010).
P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. 2008. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP’2008), pages 581–589.
R.C. Wang and W.W. Cohen. 2009. Automatic Set Instance Extraction using the Web. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP’2009), pages 441–449, Singapore.
H. Zhang, M. Zhu, S. Shi, and J.-R. Wen. 2009. Employing Topic Models for Pattern-based Semantic Class Discovery. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP’2009), pages 441–449, Singapore.
2 0.10144502 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Author: Dipanjan Das ; Slav Petrov
Abstract: We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resource-poor languages. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (BergKirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4% over a state-of-the-art baseline, and 16.7% over vanilla hidden Markov models induced with the Expectation Maximization algorithm.
3 0.087218054 258 acl-2011-Ranking Class Labels Using Query Sessions
Author: Marius Pasca
Abstract: The role of search queries, as available within query sessions or in isolation from one another, is examined in the context of ranking the class labels (e.g., brazilian cities, business centers, hilly sites) extracted from Web documents for various instances (e.g., rio de janeiro). The co-occurrence of a class label and an instance, in the same query or within the same query session, is used to reinforce the estimated relevance of the class label for the instance. Experiments over evaluation sets of instances associated with Web search queries illustrate the higher quality of the query-based, re-ranked class labels, relative to ranking baselines using document-based counts.
4 0.087184116 174 acl-2011-Insights from Network Structure for Text Mining
Author: Zornitsa Kozareva ; Eduard Hovy
Abstract: Text mining and data harvesting algorithms have become popular in the computational linguistics community. They employ patterns that specify the kind of information to be harvested, and usually bootstrap either the pattern learning or the term harvesting process (or both) in a recursive cycle, using data learned in one step to generate more seeds for the next. They therefore treat the source text corpus as a network, in which words are the nodes and relations linking them are the edges. The results of computational network analysis, especially from the world wide web, are thus applicable. Surprisingly, these results have not yet been broadly introduced into the computational linguistics community. In this paper we show how various results apply to text mining, how they explain some previously observed phenomena, and how they can be helpful for computational linguistics applications.
5 0.084601603 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
Author: Ryan Gabbard ; Marjorie Freedman ; Ralph Weischedel
Abstract: As an alternative to requiring substantial supervised relation training data, many have explored bootstrapping relation extraction from a few seed examples. Most techniques assume that the examples are based on easily spotted anchors, e.g., names or dates. Sentences in a corpus which contain the anchors are then used to induce alternative ways of expressing the relation. We explore whether coreference can improve the learning process. That is, if the algorithm considered examples such as his sister, would accuracy be improved? With coreference, we see on average a 2-fold increase in F-Score. Despite using potentially errorful machine coreference, we see significant increase in recall on all relations. Precision increases in four cases and decreases in six.
6 0.076526627 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
7 0.07154572 52 acl-2011-Automatic Labelling of Topic Models
8 0.070963964 28 acl-2011-A Statistical Tree Annotator and Its Applications
9 0.069951847 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates
10 0.066738978 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
11 0.066708259 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
12 0.064757772 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
13 0.059480671 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
14 0.057143737 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories
15 0.056701344 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
16 0.056209691 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
17 0.056104004 115 acl-2011-Engkoo: Mining the Web for Language Learning
18 0.055737343 144 acl-2011-Global Learning of Typed Entailment Rules
19 0.055467758 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities
20 0.055429868 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
simIndex simValue paperId paperTitle
same-paper 1 0.93624872 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
Author: Fan Zhang ; Shuming Shi ; Jing Liu ; Shuqi Sun ; Chin-Yew Lin
Abstract: This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. A nonlinear probabilistic model is exploited to model the correlation between sentences in the aggregation of pattern matching results. Based on the model, we design a set of evidence combination and propagation algorithms. These significantly improve the result quality of existing approaches. Experimental results conducted on 500 million web pages and hypernym labels for 300 terms show over 20% performance improvement in terms of P@5, MAP and R-Precision. 1 Introduction1 An important task in text mining is the automatic extraction of entities and their lexical relations; this has wide applications in natural language processing and web search. This paper focuses on mining the hyponymy (or is-a) relation from largescale, open-domain web documents. From the viewpoint of entity classification, the problem is to automatically assign fine-grained class labels to terms. There have been a number of approaches (Hearst 1992; Pantel & Ravichandran 2004; Snow et al., 2005; Durme & Pasca, 2008; Talukdar et al., 2008) to address the problem. These methods typically exploited manually-designed or automatical* This work was performed when Fan Zhang and Shuqi Sun were interns at Microsoft Research Asia 1159 ly-learned patterns (e.g., “NP such as NP”, “NP like NP”, “NP is a NP”). Although some degree of success has been achieved with these efforts, the results are still far from perfect, in terms of both recall and precision. As will be demonstrated in this paper, even by processing a large corpus of 500 million web pages with the most popular patterns, we are not able to extract correct labels for many (especially rare) entities. Even for popular terms, incorrect results often appear in their label lists. The basic philosophy in existing hyponymy extraction approaches (and also many other textmining methods) is counting: count the number of supporting sentences. Here a supporting sentence of a term-label pair is a sentence from which the pair can be extracted via an extraction pattern. We demonstrate that the specific way of counting has a great impact on result quality, and that the state-ofthe-art counting methods are not optimal. Specifically, we examine the problem from the viewpoint of probabilistic evidence combination and find that the probabilistic assumption behind simple counting is the statistical independence between the observations of supporting sentences. By assuming a positive correlation between supporting sentence observations and adopting properly designed nonlinear combination functions, the results precision can be improved. It is hard to extract correct labels for rare terms from a web corpus due to the data sparseness problem. To address this issue, we propose an evidence propagation algorithm motivated by the observation that similar terms tend to share common hypernyms. For example, if we already know that 1) Helsinki and Tampere are cities, and 2) Porvoo is similar to Helsinki and Tampere, then Porvoo is ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s159–1168, very likely also a city. This intuition, however, does not mean that the labels of a term can always be transferred to its similar terms. For example, Mount Vesuvius and Kilimanjaro are volcanoes and Lhotse is similar to them, but Lhotse is not a volcano. 
Therefore we should be very conservative and careful in hypernym propagation. In our propagation algorithm, we first construct some pseudo supporting sentences for a term from the supporting sentences of its similar terms. Then we calculate label scores for terms by performing nonlinear evidence combination based on the (pseudo and real) supporting sentences. Such a nonlinear propagation algorithm is demonstrated to perform better than linear propagation. Experimental results on a publicly available collection of 500 million web pages with hypernym labels annotated for 300 terms show that our nonlinear evidence fusion and propagation significantly improve the precision and coverage of the extracted hyponymy data. This is one of the technologies adopted in our semantic search and min- ing system NeedleSeek2. In the next section, we discuss major related efforts and how they differ from our work. Section 3 is a brief description of the baseline approach. The probabilistic evidence combination model that we exploited is introduced in Section 4. Our main approach is illustrated in Section 5. Section 6 shows our experimental settings and results. Finally, Section 7 concludes this paper. 2 Related Work Existing efforts for hyponymy relation extraction have been conducted upon various types of data sources, including plain-text corpora (Hearst 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Banko, et al., 2007; Durme & Pasca, 2008; Talukdar et al., 2008), semistructured web pages (Cafarella et al., 2008; Shinzato & Torisawa, 2004), web search results (Geraci et al., 2006; Kozareva et al., 2008; Wang & Cohen, 2009), and query logs (Pasca 2010). Our target for optimization in this paper is the approaches that use lexico-syntactic patterns to extract hyponymy relations from plain-text corpora. Our future work will study the application of the proposed algorithms on other types of approaches. 2 http://research.microsoft.com/en-us/projects/needleseek/ or http://needleseek.msra.cn/ 1160 The probabilistic evidence combination model that we exploit here was first proposed in (Shi et al., 2009), for combining the page in-link evidence in building a nonlinear static-rank computation algorithm. We applied it to the hyponymy extraction problem because the model takes the dependency between supporting sentences into consideration and the resultant evidence fusion formulas are quite simple. In (Snow et al., 2006), a probabilistic model was adopted to combine evidence from heterogeneous relationships to jointly optimize the relationships. The independence of evidence was assumed in their model. In comparison, we show that better results will be obtained if the evidence correlation is modeled appropriately. Our evidence propagation is basically about using term similarity information to help instance labeling. There have been several approaches which improve hyponymy extraction with instance clusters built by distributional similarity. In (Pantel & Ravichandran, 2004), labels were assigned to the committee (i.e., representative members) of a semantic class and used as the hypernyms of the whole class. Labels generated by their approach tend to be rather coarse-grained, excluding the possibility of a term having its private labels (considering the case that one meaning of a term is not covered by the input semantic classes). In contrast to their method, our label scoring and ranking approach is applied to every single term rather than a semantic class. 
In addition, we also compute label scores in a nonlinear way, which improves results quality. In Snow et al. (2005), a supervised approach was proposed to improve hypernym classification using coordinate terms. In comparison, our approach is unsupervised. Durme & Pasca (2008) cleaned the set of instance-label pairs with a TF*IDF like method, by exploiting clusters of semantically related phrases. The core idea is to keep a term-label pair (T, L) only if the number of terms having the label L in the term T’s cluster is above a threshold and if L is not the label of too many clusters (otherwise the pair will be discarded). In contrast, we are able to add new (high-quality) labels for a term with our evidence propagation method. On the other hand, low quality labels get smaller score gains via propagation and are ranked lower. Label propagation is performed in (Talukdar et al., 2008; Talukdar & Pereira, 2010) based on multiple instance-label graphs. Term similarity information was not used in their approach. Most existing work tends to utilize small-scale or private corpora, whereas the corpus that we used is publicly available and much larger than most of the existing work. We published our term sets (refer to Section 6. 1) and their corresponding user judgments so researchers working on similar topics can reproduce our results. H eTIsaryApst-eI {N aP nLd(|{iso,}ra (eNisn|.wuPcgalhsu|edwa.gs(e)r{PN|baiPent,c}lg*ur){nd(ai n|dg)o {r}N P,L}* IsA-II NP (is|are|was|were|being) {the, those} NPL IsA-III NP (is|are|was|were|being) {another, any} NPL Table 1. Patterns adopted in this paper (NP: named phrase representing an entity; NPL: label) 3 Preliminaries The problem addressed in this paper is corpusbased is-a relation mining: extracting hypernyms (as labels) for entities from a large-scale, open- domain document corpus. The desired output is a mapping from terms to their corresponding hypernyms, which can naturally be represented as a weighted bipartite graph (term-label graph). Typically we are only interested in top labels of a term in the graph. Following existing efforts, we adopt patternmatching as a basic way of extracting hypernymy/hyponymy relations. Two types of patterns (refer to Table 1) are employed, including the popular “Hearst patterns” (Hearst, 1992) and the IsA patterns which are exploited less frequently in existing hyponym mining efforts. One or more termlabel pairs can be extracted if a pattern matches a sentence. In the baseline approach, the weight of an edge TL (from term T to hypernym label L) in the term-label graph is computed as, ( ) w(TL) ( ) (3.1) where m is the number of times the pair (T, L) is extracted from the corpus, DF(L) is the number of in-links of L in the graph, N is total number of terms in the graph, and IDF means the “inverse document frequency”. A term can only keep its top-k neighbors (according to the edge weight) in the graph as its final labels. 1161 Our pattern matching algorithm implemented in this paper uses part-of-speech (POS) tagging information, without adopting a parser or a chunker. The noun phrase boundaries (for terms and labels) are determined by a manually designed POS tag list. 4 Probabilistic Label-Scoring Model Here we model the hyponymy extraction problem from the probability theory point of view, aiming at estimating the score of a term-label pair (i.e., the score of a label w.r.t. a term) with probabilistic evidence combination. 
The model was studied in (Shi et al., 2009) to combine the page in-link evidence in building a nonlinear static-rank computation algorithm. We represent the score of a term-label pair by the probability of the label being a correct hypernym of the term, and define the following events, AT,L: Label L is a hypernym of term T (the abbreviated form A is used in this paper unless it is ambiguous). Ei: The observation that (T, L) is extracted from a sentence Si via pattern matching (i.e., Si is a sup- porting sentence of the pair). Assuming that we already know m supporting sentences (S1~Sm), our problem is to compute P(A|E1,E2,..,Em), the posterior probability that L is a hypernym of term T, given evidence E1~Em. Formally, we need to find a function f to satisfy, P(A|E1,… ,Em) = f(P(A), P(A|E1)… P(A|Em) ) (4.1) … … …, For simplicity, we first consider the case of m=2. The case of m>2 is quite similar. We start from the simple case of independent supporting sentences. That is, ( ) ( ) ( ) ( ) ( )( ) By applying Bayes rule, we get, ( (4.2) (4.3) ) ( ( ) )() ( ( ) ) ( ) ( () ) ( ) ( ) (4.4) ( ) ( ) ( ) Then define ( ) ( ( )) ( ( )) ( ( )) Here G(A|E) represents the log-probability-gain of A given E, with the meaning of the gain in the log-probability value of A after the evidence E is observed (or known). It is a measure of the impact of evidence E to the probability of event A. With the definition of G(A|E), Formula 4.4 can be transformed to, ( ) ( ) ( ) (4.5) Therefore, if E1 and E2 are independent, the logprobability-gain of A given both pieces of evidence will exactly be the sum of the gains of A given every single piece of evidence respectively. It is easy to prove (by following a similar procedure) that the above Formula holds for the case of m>2, as long as the pieces of evidence are mutually independent. Therefore for a term-label pair with m mutually independent supporting sentences, if we set every gain G(A|Ei) to be a constant value g, the posterior gain score of the pair will be ∑ If the value g is the IDF of label L, the posterior gain will be, . G(AT,L|E1… ,Em) ∑ ( ) ( ) (4.6) This is exactly the Formula 3. 1. By this way, we provide a probabilistic explanation of scoring the candidate labels for a term via simple counting. … TRaAb:le(2.A/E)Rv(ide)ncHd065epa.9r81sn7t-dIec10ys.7Ae3-1I0timEa2o:8nH0Is.4feA2oa3-r78I0sitna- pattern and inter-pattern supporting sentences In the above analysis, we assume the statistical independence of the supporting sentence observations, which may not hold in reality. Intuitively, if we already know one supporting sentence S1 for a term-label pair (T, L), then we have more chance to find another supporting sentence than if we do not know S1. The reason is that, before we find S1, we have to estimate the probability with the chance of discovering a supporting sentence for a random term-label pair. The probability is quite low because most term-label pairs do not have hyponymy relations. Once we have observed S1, however, the chance of (T, L) having a hyponymy relation in1162 creases. Therefore the chance of observing another supporting sentence becomes larger than before. Table 2 shows the rough estimation of ( ( ) ( ) ) (denoted as RA), ( ( ) ( ) ) (denoted as R), and their ratios. The statistics are obtained by performing maximal likelihood estimation (MLE) upon our corpus and a random selection of term-label pairs from our term sets (see Section 6. 1) together with their top labels3. 
The data verifies our analysis about the correlation between E1 and E2 (note that R=1 means independent). In addition, it can be seen that the conditional independence assumption of Formula 4.3 does not hold (because RA>1). It is hence necessary to consider the correlation between supporting sentences in the model. The estimation of Table 2 also indicates that, ( ( ) ( )) ( ( ) ( ) ) (4.7) By following a similar procedure as above, with Formulas 4.2 and 4.3 replaced by 4.7, we have, ( ) ( ) ( ) (4.8) This formula indicates that when the supporting sentences are positively correlated, the posterior score of label L w.r.t. term T (given both the sen- tences) is smaller than the sum of the gains caused by one sentence only. In the extreme case that sentence S2 fully depends on E1 (i.e. P(E2|E1)=1), it is easy to prove that ( ) ( ) It is reasonable, since event E2 does not bring in more information than E1. Formula 4.8 cannot be used directly for computing the posterior gain. What we really need is a function h satisfying () ( ( ) ( )) (4.9) and ( )∑ (4.10) Shi et al. (2009) discussed other constraints to h and suggested the following nonlinear functions, ( ) ( ∑ ( )) (4. 11) 3 RA is estimated from the labels judged as “Good”; whereas the estimation of R is from all judged labels. ( ) √ ∑ (p>1) (4.12) In the next section, we use the above two h func- tions as basic building blocks to compute label scores for terms. 5 Our Approach Multiple types of patterns (Table 1) can be adopted to extract term-label pairs. For two supporting sentences the correlation between them may depend on whether they correspond to the same pattern. In Section 5. 1, our nonlinear evidence fusion formulas are constructed by making specific assumptions about the correlation between intra-pattern supporting sentences and inter-pattern ones. Then in Section 5.2, we introduce our evidence propagation technique in which the evidence of a (T, L) pair is propagated to the terms similar to T. 5.1 Nonlinear evidence fusion For a term-label pair (T, L), assuming K patterns are used for hyponymy extraction and the supporting sentences discovered with pattern iare, (5.1) where mi is the number of supporting sentences corresponding to pattern i. Also assume the gain score of Si,j is xi,j, i.e., xi,j=G(A|Si,j). Generally speaking, supporting sentences corre- sponding to the same pattern typically have a higher correlation than the sentences corresponding to different patterns. This can be verified by the data in Table-2. By ignoring the inter-pattern correlations, we make the following simplified assumption: Assumption: Supporting sentences corresponding to the same pattern are correlated, while those of different patterns are independent. According to this assumption, our label-scoring function is, ( ) ∑ ( ) (5.2) In the simple case that ( ) , if the h function of Formula 4. 12 is adopted, then, ( ) (∑ √ ) ( ) (5.3) 1163 We use an example to illustrate the above formula. Example: For term T and label L1, assume the numbers of the supporting sentences corresponding to the six pattern types in Table 1 are (4, 4, 4, 4, 4, 4), which means the number of supporting sentences discovered by each pattern type is 4. Also assume the supporting-sentence-count vector of label L2 is (25, 0, 0, 0, 0, 0). If we use Formula 5.3 to compute the scores of L1 and L2, we can have the following (ignoring IDF for simplicity), Score(L1) Score(L2) One the other hand, if we simply count the total number of supporting sentences, the score of L2 will be larger. 
The rationale implied in the formula is: For a given term T, the labels supported by multiple types of patterns tend to be more reliable than those supported by a single pattern type, if they have the same number of supporting sentences. √ ; √ 5.2 Evidence propagation According to the evidence fusion algorithm described above, in order to extract term labels reliably, it is desirable to have many supporting sentences of different types. This is a big challenge for rare terms, due to their low frequency in sentences (and even lower frequency in supporting sentences because not all occurrences can be covered by patterns). With evidence propagation, we aim at discovering more supporting sentences for terms (especially rare terms). Evidence propagation is motivated by the following two observations: (I) Similar entities or coordinate terms tend to share some common hypernyms. (II) Large term similarity graphs are able to be built efficiently with state-of-the-art techniques (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). With the graphs, we can obtain the similarity between two terms without their hypernyms being available. The first observation motivates us to “borrow” the supporting sentences from other terms as auxiliary evidence of the term. The second observation means that new information is brought with the state-of-the-art term similarity graphs (in addition to the term-label information discovered with the patterns of Table 1). Our evidence propagation algorithm contains two phases. In phase I, some pseudo supporting sentences are constructed for a term from the supporting sentences of its neighbors in the similarity graph. Then we calculate the label scores for terms based on their (pseudo and real) supporting sentences. Phase I: For every supporting sentence S and every similar term T1 of the term T, add a pseudo supporting sentence S1 for T1, with the gain score, ( ) ( ( ) ) (5.5) where is the propagation factor, and ( ) is the term similarity function taking values in [0, 1]. The formula reasonably assumes that the gain score of the pseudo supporting sentence depends on the gain score of the original real supporting sentence, the similarity between the two terms, and the propagation factor. Phase II: The nonlinear evidence combination formulas in the previous subsection are adopted to combine the evidence of pseudo supporting sentences. Term similarity graphs can be obtained by distributional similarity or patterns (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). We call the first type of graph DS and the second type PB. DS approaches are based on the distributional hypothesis (Harris, 1985), which says that terms appearing in analogous contexts tend to be similar. In a DS approach, a term is represented by a feature vector, with each feature corresponding to a context in which the term appears. The similarity between two terms is computed as the similarity between their corresponding feature vectors. In PB approaches, a list of carefully-designed (or automatically learned) patterns is exploited and applied to a text collection, with the hypothesis that the terms extracted by applying each of the patterns to a specific piece of text tend to be similar. Two categories of patterns have been studied in the literature (Heast 1992; Pasca 2004; Kozareva et al., 2008; Zhang et al., 2009): sentence lexical patterns, and HTML tag patterns. An example of sentence lexical patterns is “T {, T} *{,} (and|or) T”. 
HTML tag patterns include HTML tables, drop-down lists, and other repeated-tag patterns. In this paper, we generate the DS and PB graphs by adopting the best-performing methods studied in (Shi et al., 2010). We will compare, by experiments, the propagation performance of utilizing the two categories of graphs, and also investigate the performance of utilizing both graphs for evidence propagation.

6 Experiments

6.1 Experimental setup

Corpus: We adopt a publicly available dataset in our experiments: ClueWeb09 (http://boston.lti.cs.cmu.edu/Data/clueweb09/). This is a very large dataset collected by Carnegie Mellon University in early 2009 and has been used by several tracks of the Text Retrieval Conference (TREC, http://trec.nist.gov/). The whole dataset consists of 1.04 billion web pages in ten languages, while only those in English, about 500 million pages, are used in our experiments. The reason for selecting such a dataset is twofold: First, it is a corpus large enough for conducting web-scale experiments and getting meaningful results. Second, since it is publicly available, other researchers can reproduce the experiments in this paper.

Term sets: Approaches are evaluated by using two sets of selected terms: Wiki200 and Ext100. For every term in the term sets, each approach generates a list of hypernym labels, which are manually judged by human annotators. Wiki200 is constructed by first randomly selecting 400 Wikipedia (http://www.wikipedia.org/) titles as our candidate terms, with the probability of a title T being selected set as a function of F(T), the frequency of T in our data corpus. The reason for adopting such a probability formula is to balance popular terms and rare ones in our term set. Then 200 terms are manually selected from the 400 candidate terms, with the principle of maximizing the diversity of terms in terms of length (i.e., number of words) and type (person, location, organization, software, movie, song, animal, plant, etc.). Wiki200 is further divided into two subsets, Wiki100H and Wiki100L, containing the 100 high-frequency and the 100 low-frequency terms, respectively. Ext100 is built by first selecting 200 non-Wikipedia-title terms at random from the term-label graph generated by the baseline approach (Formula 3.1), then manually selecting 100 terms. Some sample terms in the term sets are listed in Table 3.

Annotation: For each term in the term set, the top-5 results (i.e., hypernym labels) of the various methods are mixed and judged by human annotators. Each annotator assigns each result item a judgment of "Good", "Fair" or "Bad". The annotators do not know the method by which a result item is generated. Six annotators participated in the labeling, at a rough speed of 15 minutes per term. We also encouraged the annotators to add new good results which are not discovered by any method. The term sets and their corresponding user annotations are available for download at the following links (dataset ID=data.queryset.semcat01): http://research.microsoft.com/en-us/projects/needleseek/ http://needleseek.msra.cn/datasets/

Evaluation: We adopt the following metrics to evaluate the hypernym list of a term generated by each method. The evaluation score on a term set is the average over all the terms.
Precision@k: the percentage of relevant (Good or Fair) labels in the top-k results (labels judged as "Fair" are counted as 0.5).
Recall@k: the ratio of relevant labels in the top-k results to the total number of relevant labels.
R-Precision: Precision@R, where R is the total number of labels judged as "Good".
Mean average precision (MAP): the average of the precision values at the positions of all Good or Fair results.

Before annotation and evaluation, the hypernym list generated by each method for each term is preprocessed to remove duplicate items. Two hypernyms are called duplicate items if they share the same head word (e.g., "military conflict" and "conflict"). For duplicate hypernyms, only the first (i.e., the highest-ranked) one in the list is kept. The goal of this preprocessing step is to partially consider result diversity in evaluation and to make a more meaningful comparison among different methods. Consider two hypernym lists for "subway":

List-1: restaurant; chain restaurant; worldwide chain restaurant; franchise; restaurant franchise...
List-2: restaurant; franchise; transportation; company; fast food...

There are more detailed hypernyms in the first list about "subway" as a restaurant or a franchise, while the second list covers a broader range of meanings for the term. It is hard to say which is better (without considering the upper-layer applications). With this preprocessing step, we keep our focus on short hypernyms rather than detailed ones.

[Table 4. Performance comparison of evidence fusion methods (Term sets: Wiki200 and Wiki100H; p=2 for PNorm)]

6.2 Experimental results

We first compare the evaluation results of the different evidence fusion methods mentioned in Section 4. In Table 4, Linear means that Formula 3.1 is used to calculate label scores, whereas Log and PNorm represent our nonlinear approach with Formulas 4.11 and 4.12 being utilized. The performance improvement numbers shown in the table are based on the linear version; the upward-pointing arrows indicate relative percentage improvement over the baseline. From the table, we can see that the nonlinear methods outperform the linear ones on the Wiki200 term set. It is interesting to note that the performance improvement is more significant on Wiki100H, the set of high-frequency terms. By examining the labels and supporting sentences for the terms in each term set, we find that for many low-frequency terms (in Wiki100L) there are only a few supporting sentences (corresponding to one or two patterns), so the scores computed by the various fusion algorithms tend to be similar. In contrast, more supporting sentences can be discovered for high-frequency terms. Much information about the hypernyms of the high-frequency terms is contained in those sentences, but the linear function of Formula 3.1 fails to make effective use of it. The two nonlinear methods achieve better performance by appropriately modeling the dependency between supporting sentences and computing the log-probability gain in a better way. The comparison of the linear and nonlinear methods on the Ext100 term set is shown in Table 5. Please note that the terms in Ext100 do not appear as Wikipedia titles. Thanks to the scale of the data corpus we are using, even the baseline approach achieves reasonably good performance. Please also note that the terms (refer to Table 3) we are using are "harder" than those adopted for evaluation in many existing papers. Again, the result quality is improved by the nonlinear methods, although the improvement is not large, because most terms in Ext100 are rare. Please note that the recall (R@1, R@5) reported in this paper is pseudo-recall, i.e., we treat the number of known relevant (Good or Fair) results as the total number of relevant ones.

[Table 5. Performance comparison of evidence fusion methods (Term set: Ext100; p=2 for PNorm)]
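The evaluation just described can be restated compactly in code. The sketch below is our reading of the metric definitions in Section 6.1 and of the head-word de-duplication step, not the authors' evaluation script; taking the head word to be the last token, and applying the 0.5 credit for "Fair" inside every precision value, are assumptions.

```python
def dedup_by_head(labels):
    """Keep only the highest-ranked hypernym for each head word (here: the last token)."""
    seen, kept = set(), []
    for lab in labels:
        head = lab.split()[-1]
        if head not in seen:
            seen.add(head)
            kept.append(lab)
    return kept

CREDIT = {"Good": 1.0, "Fair": 0.5, "Bad": 0.0}

def precision_at_k(judged, k):
    """Precision@k with 'Fair' labels counted as 0.5."""
    return sum(CREDIT[j] for j in judged[:k]) / k

def recall_at_k(judged, k):
    """Pseudo-recall: the known relevant (Good or Fair) results serve as the full relevant set."""
    relevant = sum(1 for j in judged if j in ("Good", "Fair"))
    found = sum(1 for j in judged[:k] if j in ("Good", "Fair"))
    return found / relevant if relevant else 0.0

def r_precision(judged):
    """Precision@R, where R is the number of labels judged 'Good'."""
    r = sum(1 for j in judged if j == "Good")
    return precision_at_k(judged, r) if r else 0.0

def average_precision(judged):
    """Average of the precision values at the positions of all Good or Fair results."""
    positions = [i for i, j in enumerate(judged, start=1) if j in ("Good", "Fair")]
    return sum(precision_at_k(judged, i) for i in positions) / len(positions) if positions else 0.0

judgments = ["Good", "Fair", "Bad", "Good", "Bad"]   # hypothetical top-5 judgments for one term
print(precision_at_k(judgments, 5))                  # 0.5
print(r_precision(judgments))                        # 0.75
print(round(average_precision(judgments), 2))        # 0.79
```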
The parameter p in the PNorm method is related to the degree of correlation among supporting sentences. The linear method of Formula 3.1 corresponds to the special case of p=1, while p=∞ represents the case that the other supporting sentences are fully correlated with the supporting sentence that has the maximal log-probability gain. Figure 1 shows that, for most of the term sets, the best performance is obtained for p in [2.0, 4.0]. The reason may be that the sentence correlations are better estimated with p values in this range.

[Figure 1. Performance curves of PNorm with different parameter values (Measure: MAP)]

The experimental results of evidence propagation are shown in Table 6. The methods for comparison are:
Base: the linear function without propagation.
NL: nonlinear evidence fusion (PNorm with p=2) without propagation.
LP: linear propagation, i.e., the linear function is used to combine the evidence of pseudo supporting sentences.
NLP: nonlinear propagation, where PNorm (p=2) is used to combine the pseudo supporting sentences.
NL+NLP: the nonlinear method is used to combine both supporting sentences and pseudo supporting sentences.

[Table 6. Performance comparison of evidence propagation methods (Term set: Wiki200; Similarity graph: PB; Nonlinear formula: PNorm)]

In this paper, we generate the DS (distributional similarity) and PB (pattern-based) graphs by adopting the best-performing methods studied in (Shi et al., 2010). The performance improvement numbers (indicated by the upward-pointing arrows) shown in Tables 6–9 are relative percentage improvements over the Base approach (i.e., the linear function without propagation). The values of the propagation factor are set to maximize the MAP values. Several observations can be made from Table 6. First, no performance improvement can be obtained with the linear propagation method (LP), while the nonlinear propagation algorithm (NLP) works quite well in improving both precision and recall. The results demonstrate the high correlation among pseudo supporting sentences, and the great potential of using term similarity to improve hypernymy extraction. The second observation is that the NL+NLP approach achieves a much larger performance improvement than NL and NLP. Similar results (omitted due to space limitations) can be observed on the Ext100 term set.

[Table 7. Performance comparison of evidence propagation (Term set: Wiki200; Nonlinear formula: Log)]

[Table 8. Performance comparison of evidence propagation (Term set: Wiki100L)]

Now let us study whether it is possible to combine the PB and DS graphs to obtain better results. As shown in Tables 7, 8, and 9 (for term sets Wiki200, Wiki100L, and Ext100 respectively, using the Log formula for fusion and propagation), utilizing both graphs yields additional performance gains. We explain this by the fact that the information in the two term similarity graphs tends to be complementary. The performance improvement over Wiki100L is especially remarkable. This is reasonable because rare terms do not have adequate information in their supporting sentences due to data sparseness. As a result, they benefit the most from the pseudo supporting sentences propagated with the similarity graphs.
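A minimal sketch linking Formula 5.5 (Phase I of Section 5.2) to the NL+NLP configuration compared above; the data structures, the symbol for the propagation factor, and the default value 0.5 are our own assumptions, not values from the paper.

```python
def h_pnorm(gains, p=2.0):
    """PNorm fusion (Formula 4.12), as in the fusion sketch above."""
    return sum(x ** p for x in gains) ** (1.0 / p) if gains else 0.0

def propagate(gains_by_term, sim_graph, prop_factor=0.5):
    """Phase I (Formula 5.5): build pseudo supporting-sentence gains for similar terms.

    gains_by_term: {term: {label: {pattern_id: [gain, ...]}}} for the real supporting sentences
    sim_graph:     {term: [(similar_term, similarity in [0, 1]), ...]}
    """
    pseudo = {}
    for term, labels in gains_by_term.items():
        for neighbor, sim in sim_graph.get(term, []):
            for label, by_pattern in labels.items():
                for pat, xs in by_pattern.items():
                    slot = pseudo.setdefault(neighbor, {}).setdefault(label, {}).setdefault(pat, [])
                    # pseudo gain = propagation factor * sim(T, T1) * gain of the real sentence
                    slot.extend(prop_factor * sim * x for x in xs)
    return pseudo

def nl_plus_nlp_score(term, label, gains_by_term, pseudo):
    """NL+NLP: nonlinear fusion over the union of real and pseudo supporting sentences."""
    real = gains_by_term.get(term, {}).get(label, {})
    extra = pseudo.get(term, {}).get(label, {})
    return sum(h_pnorm(real.get(pat, []) + extra.get(pat, [])) for pat in set(real) | set(extra))

gains = {"Helsinki": {"city": {0: [1.0, 1.0], 1: [1.0]}}, "Porvoo": {}}
sims = {"Helsinki": [("Porvoo", 0.8)]}
pseudo = propagate(gains, sims)
print(round(nl_plus_nlp_score("Porvoo", "city", gains, pseudo), 2))   # 0.97, from pseudo evidence only
```

The LP variant in Table 6 would replace h_pnorm with a plain sum, which matches the observation above that linear propagation gains nothing because pseudo supporting sentences are highly correlated.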
[Table 9. Performance comparison of evidence propagation (Term set: Ext100)]

7 Conclusion

We demonstrated that the way of aggregating supporting sentences has a considerable impact on the result quality of hyponym extraction with lexico-syntactic patterns, and that the widely used counting method is not optimal. We applied a series of nonlinear evidence fusion formulas to the problem and saw noticeable performance improvement. The result quality is improved further with the combination of nonlinear evidence fusion and evidence propagation. We also introduced a new evaluation corpus with annotated hypernym labels for 300 terms, which has been shared with the research community.

Acknowledgments

We would like to thank Matt Callcut for reading through the paper. Thanks to the annotators for their efforts in judging the hypernym labels. Thanks to Yueguo Chen, Siyu Lei, and the anonymous reviewers for their helpful comments and suggestions. The first author is partially supported by the NSF of China (60903028, 61070014) and Key Projects in the Tianjin Science and Technology Pillar Program.

References

E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proc. of NAACL-HLT'2009.
M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. In Proc. of IJCAI'2007.
M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. In Proceedings of the 34th Conference on Very Large Data Bases (VLDB'2008), pages 538–549, Auckland, New Zealand.
B. Van Durme and M. Pasca. 2008. Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. In Twenty-Third AAAI Conference on Artificial Intelligence.
F. Geraci, M. Pellegrini, M. Maggini, and F. Sebastiani. 2006. Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In Proceedings of the 13th Conference on String Processing and Information Retrieval (SPIRE'2006), pages 25–36, Glasgow, Scotland.
Z. S. Harris. 1985. Distributional Structure. The Philosophy of Linguistics. New York: Oxford University Press.
M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Fourteenth International Conference on Computational Linguistics, Nantes, France.
Z. Kozareva, E. Riloff, and E.H. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proc. of ACL'2008.
P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. In Proc. of EMNLP'2009, Singapore.
P. Pantel and D. Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL'2004), pages 321–328.
M. Pasca. 2004. Acquisition of Categorized Named Entities for Web Search. In Proc. of CIKM'2004.
M. Pasca. 2010. The Role of Queries in Ranking Labeled Instances Extracted from Text. In Proc. of COLING'2010, Beijing, China.
S. Shi, B. Lu, Y. Ma, and J.-R. Wen. 2009. Nonlinear Static-Rank Computation. In Proc. of CIKM'2009, Hong Kong.
S. Shi, H. Zhang, X. Yuan, and J.-R. Wen. 2010. Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches. In Proc. of COLING'2010, Beijing, China.
K. Shinzato and K. Torisawa. 2004. Acquiring Hyponymy Relations from Web Documents. In Proc. of the 2004 Human Language Technology Conference (HLT-NAACL'2004).
R. Snow, D. Jurafsky, and A. Y. Ng. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proceedings of the 19th Conference on Neural Information Processing Systems.
R. Snow, D. Jurafsky, and A. Y. Ng. 2006. Semantic Taxonomy Induction from Heterogeneous Evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06), pages 801–808.
P. P. Talukdar and F. Pereira. 2010. Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. In 48th Annual Meeting of the Association for Computational Linguistics (ACL'2010).
P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. 2008. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP'2008), pages 581–589.
R.C. Wang and W.W. Cohen. 2009. Automatic Set Instance Extraction using the Web. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'2009), pages 441–449, Singapore.
H. Zhang, M. Zhu, S. Shi, and J.-R. Wen. 2009. Employing Topic Models for Pattern-based Semantic Class Discovery. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'2009), pages 441–449, Singapore.
3 0.7679472 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
Author: Ang Sun ; Ralph Grishman ; Satoshi Sekine
Abstract: We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system.
4 0.75975788 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manually-labeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
5 0.69652951 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
Author: Raphael Hoffmann ; Congle Zhang ; Xiao Ling ; Luke Zettlemoyer ; Daniel S. Weld
Abstract: Information extraction (IE) holds the promise of generating a large-scale knowledge base from the Web's natural language text. Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors. Recently, researchers have developed multi-instance learning algorithms to combat the noisy training data that can come from heuristic labeling, but their models assume relations are disjoint; for example, they cannot extract both Founded(Jobs, Apple) and CEO-of(Jobs, Apple). This paper presents a novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts. We apply our model to learn extractors for NY Times text using weak supervision from Freebase. Experiments show that the approach runs quickly and yields surprising gains in accuracy, at both the aggregate and sentence level.
6 0.68577957 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
7 0.6761772 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
8 0.67433405 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
9 0.67323637 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
10 0.67284817 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
11 0.67249334 11 acl-2011-A Fast and Accurate Method for Approximate String Search
12 0.67241466 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
13 0.67221832 117 acl-2011-Entity Set Expansion using Topic information
14 0.67219293 193 acl-2011-Language-independent compound splitting with morphological operations
15 0.67108345 178 acl-2011-Interactive Topic Modeling
16 0.67031854 28 acl-2011-A Statistical Tree Annotator and Its Applications
17 0.6693635 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
18 0.668028 187 acl-2011-Jointly Learning to Extract and Compress
19 0.66780084 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
20 0.66774797 258 acl-2011-Ranking Class Labels Using Query Sessions