acl acl2010 acl2010-111 knowledge-graph by maker-knowledge-mining

111 acl-2010-Extracting Sequences from the Web


Source: pdf

Author: Anthony Fader ; Stephen Soderland ; Oren Etzioni

Abstract: Classical Information Extraction (IE) systems fill slots in domain-specific frames. This paper reports on SEQ, a novel open IE system that leverages a domainindependent frame to extract ordered sequences such as presidents of the United States or the most common causes of death in the U.S. SEQ leverages regularities about sequences to extract a coherent set of sequences from Web text. SEQ nearly doubles the area under the precision-recall curve compared to an extractor that does not exploit these regularities.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Classical Information Extraction (IE) systems fill slots in domain-specific frames. [sent-3, score-0.08]

2 This paper reports on SEQ, a novel open IE system that leverages a domain-independent frame to extract ordered sequences such as presidents of the United States or the most common causes of death in the U.S. [sent-4, score-0.418]

3 SEQ leverages regularities about sequences to extract a coherent set of sequences from Web text. [sent-6, score-0.376]

4 SEQ nearly doubles the area under the precision-recall curve compared to an extractor that does not exploit these regularities. [sent-7, score-0.18]

5 1 Introduction. Classical IE systems fill slots in domain-specific frames such as the time and location slots in seminar announcements (Freitag, 2000) or the terrorist organization slot in news stories (Chieu et al.). [sent-8, score-0.178]

6 In contrast, open IE systems are domain-independent, but extract “flat” sets of assertions that are not organized into frames and slots (Sekine, 2006; Banko et al.). [sent-10, score-0.121]

7 This paper reports on SEQ—an open IE system that leverages a domain-independent frame to extract ordered sequences of objects from Web text. [sent-12, score-0.3]

8 We show that the novel, domain-independent sequence frame in SEQ substantially boosts the precision and recall of the system and yields coherent sequences filtered from low-precision extractions (Table 1). [sent-13, score-0.8]

9 Sequence extraction is distinct from set expansion (Etzioni et al. [sent-14, score-0.096]

10 , 2004; Wang and Cohen, 2007) because sequences are ordered and because the extraction process does not require seeds or HTML lists as input. [sent-15, score-0.21]

11 The domain-independent sequence frame consists of a sequence name s (e.g., presidents of the United States), and a set of ordered pairs (x, k), where x is a string naming a member of the sequence with name s and k is an integer indicating x's position in the sequence. [sent-16, score-0.525] [sent-18, score-0.545]

12 Table 1: Examples of sequences extracted by SEQ from unstructured Web text.

13 The task of sequence extraction is to automatically instantiate sequence frames given a corpus of unstructured text. [sent-22, score-0.48]

14 By definition, sequences have two properties that we can leverage in creating a sequence extractor: functionality and density. [sent-23, score-0.502]

15 Functionality means position k in a sequence is occupied by a single real-world entity x. [sent-24, score-0.214]

16 Density means that if a value has been observed at position k, then there must exist values for all i < k, and possibly more after it. [sent-25, score-0.042]
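The two properties can be illustrated directly. A minimal sketch (the function names and the one-value-per-position threshold are illustrative choices, not from the paper):

```python
from collections import defaultdict

def position_values(pairs):
    """Group candidate values by sequence position k."""
    by_k = defaultdict(set)
    for x, k in pairs:
        by_k[k].add(x)
    return by_k

def looks_functional(pairs, max_values_per_position=1):
    """Functionality: each position k should hold a single real-world entity."""
    by_k = position_values(pairs)
    return all(len(v) <= max_values_per_position for v in by_k.values())

def looks_dense(pairs):
    """Density: if position k is observed, positions 1..k-1 should be too."""
    seen = {k for _, k in pairs}
    return all(i in seen for k in seen for i in range(1, k))

good = [("Washington", 1), ("Adams", 2), ("Jefferson", 3)]
sparse = [("Eisenhower", 34)]  # only one, high, position filled
print(looks_functional(good), looks_dense(good))  # True True
print(looks_dense(sparse))                        # False
```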

17 2 The SEQ System Sequence extraction has two parts: identifying possible extractions (x, k, s) from text, and then classifying those extractions as either correct or incorrect. [sent-26, score-1.098]

18 In the following section, we describe a way to identify candidate extractions from text using a set of lexico-syntactic patterns. [sent-27, score-0.524]

19 We then show that classifying extractions based on sentence-level features and redundancy alone yields low precision, which is improved by leveraging the functionality and density properties of sequences as done in our SEQ system. [sent-28, score-1.177]

20 2.1 Generating Sequence Extractions. To obtain candidate sequence extractions (x, k, s) from text, the SEQ system finds sentences in its input corpus that contain an ordinal phrase (OP). [sent-33, score-0.877]

21 Table 2 lists the lexico-syntactic patterns SEQ uses to detect ordinal phrases. [sent-34, score-0.176]

22 The value of k is set to the integer corresponding to the ordinal number in the OP. [sent-35, score-0.151]
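The paper's Table 2 patterns are not reproduced in this summary; as a rough sketch of the idea, a simplified regex for one common ordinal-phrase shape (both numeric and spelled-out ordinals) might look like:

```python
import re

# Spelled-out ordinals mapped to integers (first ten shown; extend as needed).
WORD_ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
}

# Matches "the 3rd ..." or "the third ..." style ordinal phrases.
OP_RE = re.compile(
    r"\bthe\s+(?:(\d+)(?:st|nd|rd|th)|(" + "|".join(WORD_ORDINALS) + r"))\b",
    re.IGNORECASE,
)

def find_ordinal(sentence):
    """Return the integer k of the first ordinal phrase, or None."""
    m = OP_RE.search(sentence)
    if not m:
        return None
    if m.group(1):
        return int(m.group(1))
    return WORD_ORDINALS[m.group(2).lower()]

print(find_ordinal("Kennedy was the 35th president of the United States"))  # 35
print(find_ordinal("the second deepest lake in Africa"))                    # 2
```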

23 Next, SEQ takes each sentence that contains an ordinal phrase o, and finds candidate items of the form (x, k) for the sequence with name s. [sent-36, score-0.574]

24 SEQ constrains x to be an NP that is disjoint from o, and s to be an NP (which may have post-modifying PPs or clauses) following the ordinal number in o. [sent-37, score-0.151]

25 We use heuristics to filter out many of the candidate values. [sent-39, score-0.068]

26 This process of generating candidate extractions has high coverage, but low precision. [sent-42, score-0.524]

27 The first step in identifying correct extractions is to compute a confidence measure localConf(x, k, s|sentence), which measures how likely (x, k, s) is given the sentence it came from. [sent-43, score-0.548]

28 We do this using domain-independent syntactic features based on POS tags and the pattern-based features “x {is,are,was,were} the kth s” and “the kth s {is,are,was,were} x”. [sent-44, score-0.121]

29 These features are then combined using a Naive Bayes classifier. [sent-45, score-0.048]
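The two pattern-based features can be sketched as simple substring tests. This is a toy version over raw strings (the actual system presumably operates on POS-tagged text):

```python
def ordinal_suffix(k):
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 23 -> '23rd'."""
    if k % 100 in (11, 12, 13):
        return f"{k}th"
    return f"{k}" + {1: "st", 2: "nd", 3: "rd"}.get(k % 10, "th")

def pattern_features(sentence, x, k, s):
    """Two pattern features: 'x <be> the kth s' and 'the kth s <be> x'."""
    kth = ordinal_suffix(k)
    forward = any(f"{x} {be} the {kth} {s}" in sentence
                  for be in ("is", "are", "was", "were"))
    backward = any(f"the {kth} {s} {be} {x}" in sentence
                   for be in ("is", "are", "was", "were"))
    return {"x_be_the_kth_s": forward, "the_kth_s_be_x": backward}

feats = pattern_features(
    "Jefferson was the 3rd president of the United States",
    "Jefferson", 3, "president of the United States")
print(feats)  # {'x_be_the_kth_s': True, 'the_kth_s_be_x': False}
```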

30 In addition to the local, sentence-based features, we define the measure totalConf, which takes into account redundancy in an input corpus C. [sent-46, score-0.356] [sent-50, score-0.04] (Footnote 1: Sequences often use a superlative for the first item (k = 1), such as “the deepest lake in Africa”, “the second deepest lake in Africa”, or “the 2nd deepest …”.)

32 Extractions that occur more frequently in multiple distinct sentences are more likely to be correct. [sent-53, score-0.045]
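This summary does not spell out how totalConf aggregates the per-sentence localConf scores; a noisy-or combination (an assumption here, not the paper's stated formula) captures the idea that repeated evidence from distinct sentences raises total confidence:

```python
def total_conf(local_confs):
    """Combine per-sentence localConf scores with a noisy-or:
    the extraction is wrong only if every piece of evidence is wrong."""
    p_wrong = 1.0
    for p in local_confs:
        p_wrong *= (1.0 - p)
    return 1.0 - p_wrong

print(total_conf([0.6]))       # 0.6
print(total_conf([0.6, 0.6]))  # 0.84
```

Two independent sentences at 0.6 confidence each yield more total confidence than either alone, which is the redundancy effect described above.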

33 2.2 Challenges. The scores localConf and totalConf are not sufficient to identify valid sequence extractions. [sent-55, score-0.172]

34 They tend to give high scores to extractions where the sequence scope is too general or too specific. [sent-56, score-0.628]

35 In our running example, the sequence name “President” is too general: many countries and organizations have a president. [sent-57, score-0.299]

36 The sequence name “President of the United States in 1960” is too specific: there were not multiple U.S. presidents in 1960. [sent-58, score-0.299]

37 These errors can be explained as violations of functionality and density. [sent-61, score-0.236]

38 The sequence with name “President” will have many distinct candidate extractions in its positions, which is a violation of functionality. [sent-62, score-0.868]

39 The sequence with name “President of the United States in 1960” will not satisfy density, since it will have extractions for only one position. [sent-63, score-0.755]

40 In the next section, we present the details of how SEQ incorporates functionality and density into its assessment of a candidate extraction. [sent-64, score-0.516]

41 Given an extraction (x, k, s), SEQ must classify it as either correct or incorrect. [sent-65, score-0.133]

42 SEQ breaks this problem down into two parts: (1) determining whether s is a correct sequence name, and (2) determining whether (x, k) is an item in s, assuming s is correct. [sent-66, score-0.278]

43 A joint probabilistic model of these two decisions would require a significant amount of labeled data. [sent-67, score-0.024]

44 To get around this problem, we represent each (x, k, s) as a vector of features and train two Naive Bayes classifiers: one for classifying s and one for classifying (x, k). [sent-68, score-0.193]

45 We then rank extractions by taking the product of the two classifiers’ confidence scores. [sent-69, score-0.493]
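The ranking step can be sketched as follows, with the two trained classifiers abstracted as score functions (`seq_conf` and `item_conf` are illustrative names, not the paper's):

```python
def rank_extractions(extractions, seq_conf, item_conf):
    """Rank (x, k, s) triples by P(s correct) * P((x, k) correct | s)."""
    scored = [(seq_conf(s) * item_conf(x, k, s), (x, k, s))
              for x, k, s in extractions]
    return [e for _, e in sorted(scored, reverse=True)]

cands = [("Lincoln", 16, "president of the United States"),
         ("Lincoln", 16, "president")]
# Toy confidences: the over-general name "president" scores low.
ranked = rank_extractions(
    cands,
    seq_conf=lambda s: 0.9 if "United States" in s else 0.2,
    item_conf=lambda x, k, s: 0.8)
print(ranked[0][2])  # president of the United States
```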

46 We now describe the features used in the two classifiers and how the classifiers are trained. [sent-70, score-0.095]

47 Classifying Sequences. To classify a sequence name s, SEQ uses features to measure the functionality and density of s. [sent-71, score-0.807]

48 Functionality means that a correct sequence with name s has one correct value x at each position k, possibly with additional noise due to extraction errors and synonymous values of x. [sent-72, score-0.502]

49 We found that a good measure of the overall nonfunctionality of s is the average value of H(k, s|C) for k = 1, 2, 3, 4. [sent-75, score-0.024]
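Reading H(k, s|C) as the entropy of the distribution over candidate values at position k (an interpretation consistent with the functionality discussion, though the exact estimator is not given in this summary), the measure can be sketched as:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def nonfunctionality(values_at_k, positions=(1, 2, 3, 4)):
    """Average H(k, s|C) over the first few positions; higher means
    many competing values per position, i.e. a less functional name s."""
    hs = []
    for k in positions:
        counts = Counter(values_at_k.get(k, [])).values()
        hs.append(entropy(counts) if counts else 0.0)
    return sum(hs) / len(positions)

functional = {1: ["Washington", "Washington"], 2: ["Adams"]}
messy = {1: ["Obama", "Sarkozy", "Putin", "Hu"]}  # "President" is too general
print(nonfunctionality(functional))  # 0.0
print(nonfunctionality(messy) > 0)   # True
```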

50 For a sequence name s that is too specific, we would expect that there are only a few filled-in positions. [sent-76, score-0.299]

51 The first is FilledPos(s|C), the number of distinct values of k such that there is some extraction (x, k) for s in the corpus. [sent-79, score-0.051]

52 The second is totalSeqConf(s|C), which is the sum of the scores of the most confident x in each position: totalSeqConf(s|C) = Σ_k max_x totalConf(x, k, s|C) (2). The functionality and density features are combined using a Naive Bayes classifier. [sent-80, score-0.481]
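Both density features from this passage can be sketched over a list of scored extractions (the tuple layout is an illustrative choice):

```python
def filled_pos(extractions, s):
    """FilledPos(s|C): number of distinct positions k with some extraction."""
    return len({k for (_, k, name, _) in extractions if name == s})

def total_seq_conf(extractions, s):
    """totalSeqConf(s|C) = sum over k of max_x totalConf(x, k, s|C)."""
    best = {}
    for x, k, name, conf in extractions:
        if name == s:
            best[k] = max(best.get(k, 0.0), conf)
    return sum(best.values())

ex = [("Washington", 1, "president of the US", 0.9),
      ("Adams", 2, "president of the US", 0.7),
      ("Jefferson", 2, "president of the US", 0.4)]
print(filled_pos(ex, "president of the US"))               # 2
print(round(total_seq_conf(ex, "president of the US"), 2))  # 1.6
```

Only the best-scoring candidate per position contributes, so a too-specific name with a single filled position gets a low sum.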

53 To train the classifier, we use a set of sequence names s labeled as either correct or incorrect, which we describe in Section 3. [sent-81, score-0.27]

54 Classifying Sequence Items. To classify (x, k) given s, SEQ uses two features: the total confidence totalConf(x, k, s|C) and the same total confidence normalized to sum to 1 over all x, holding k and s constant. [sent-82, score-0.064]

55 To train the classifier, we use a set of extractions (x, k, s) where s is known to be a correct sequence name. [sent-83, score-0.683]

56 3 Experimental Results. This section reports on two experiments. [sent-84, score-0.026]

57 First, we measured how the density and functionality features improve performance on the sequence name classification task. (Figure 1: Using density or functionality features alone is effective in identifying correct sequence names.) [sent-85, score-1.488]

58 Combining both types of features outperforms either by a statistically significant margin (paired t-test, p < 0. [sent-86, score-0.033]

59 To create a test set, we selected all sentences containing ordinal phrases from Banko’s 500M Web page corpus (2008). [sent-90, score-0.151]

60 For each sequence name s satisfying localConf(x, k, s|sentence) ≥ 0. [sent-93, score-0.299]

61 This procedure resulted in making 95,611 search engine queries. [sent-100, score-0.032]

62 The final corpus contained 3,716,745 distinct sentences containing an OP. [sent-101, score-0.045]

63 Generating candidate extractions using the method from Section 2.1 resulted in a set of over 40 million distinct extractions, the vast majority of which are incorrect. [sent-102, score-0.524] [sent-103, score-0.077]

65 To get a sample with a significant number of correct extractions, we filtered this set to include only extractions with totalConf(x, k, s|C) ≥ 0.8 for some sentence, resulting in a set of 2,409,02… extractions. [sent-104, score-0.511] [sent-105, score-0.035]

67 We then randomly sampled and manually labeled 2,000 of these extractions for evaluation. [sent-107, score-0.456]

68 We did a Web search to verify the correctness of the sequence name s and that x is the kth item in the sequence. [sent-108, score-0.405]

69 In some cases, the ordering relation of the sequence name was ambiguous. [sent-109, score-0.329]

70 Footnote 2: We queried for both the numeric form of the ordinal and the number spelled out. [sent-111, score-0.187]

71 Figure 2: SEQ outperforms the baseline systems, increasing the area under the curve by 247% relative to LOCAL and by 90% relative to REDUND. [sent-120, score-0.085]

72 For example, “largest state in the US” could refer to land area or population, which could lead to merging two distinct sequences. [sent-121, score-0.085]

73 In practice, we found that most ordering relations were used in a consistent way (e.g., “largest city in” always means largest by population), and only about 5% of the sequence names in our sample have an ambiguous ordering relation. [sent-122, score-0.03] [sent-124, score-0.271]

75 We compute precision-recall curves relative to this random sample by changing a confidence threshold. [sent-125, score-0.074]

76 Precision is the percentage of correct extractions above a threshold, while recall is the number of correct extractions above a threshold divided by the total number of correct extractions. [sent-126, score-0.621]
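The evaluation described here can be sketched as a threshold sweep over labeled, scored extractions (trapezoidal area under such a curve would give the AUC figures quoted elsewhere in this summary):

```python
def pr_curve(scored_labels):
    """scored_labels: (confidence, is_correct) pairs.
    Returns (recall, precision) points, sweeping the threshold
    down through the observed confidence scores."""
    total_correct = sum(1 for _, ok in scored_labels if ok)
    points = []
    tp = kept = 0
    for conf, ok in sorted(scored_labels, reverse=True):
        kept += 1       # everything at or above this threshold
        tp += ok        # correct extractions kept so far
        points.append((tp / total_correct, tp / kept))
    return points

data = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
for r, p in pr_curve(data):
    print(f"recall={r:.2f} precision={p:.2f}")
```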

77 The functionality and density features boost SEQ’s ability to correctly identify sequence names. [sent-128, score-0.653]

78 Figure 1 shows how well SEQ can identify correct sequence names using only functionality, only density, and using functionality and density in concert. [sent-129, score-0.718]

79 Both the density features and the functionality features are effective at this task, but using both types of features resulted in a statistically significant improvement over using either type of feature individually (paired t-test of area under the curve, p < 0. [sent-131, score-0.619]

80 The first is LOCAL, which ranks extractions by localConf. [sent-134, score-0.48]

81 The second is REDUND, which ranks extractions by totalConf. (Footnote 3: If an extraction arises from multiple sentences, we use …) [sent-135, score-0.531]

82 Figure 2 shows the precision-recall curves for each system on the test data. [sent-136, score-0.037]

83 The area under the curves for SEQ, REDUND, and LOCAL are 0. [sent-137, score-0.077]

84 The low precision and flat curve for LOCAL suggest that localConf is not informative for classifying extractions on its own. [sent-141, score-0.612]

85 On the subset of extractions with correct s, REDUND can identify x as the kth item with precision of 0. [sent-143, score-0.617]

86 This is consistent with previous work on redundancy-based extractors on the Web. [sent-146, score-0.055]

87 SEQ reduces the negative effects of these problems by decreasing the scores of sequence names that appear too general or too specific. [sent-148, score-0.215]

88 4 Related Work. There has been extensive work in extracting lists or sets of entities from the Web. [sent-149, score-0.025]

89 These extractors rely on either (1) HTML features (Cohen et al.) [sent-150, score-0.088]

90 SEQ is most similar to this second type of extractor, but additionally leverages the sequence regularities of functionality and density. [sent-153, score-0.3]

91 These regularities allow the system to overcome the poor performance of the purely syntactic extractor LOCAL and the redundancy-based extractor REDUND. [sent-154, score-0.315]

92 5 Conclusions. We have demonstrated that an extractor leveraging sequence regularities can greatly outperform extractors without this knowledge. [sent-155, score-0.449]

93 Identifying likely sequence names and then filling in sequence items proved to be an effective approach to sequence extraction. [sent-156, score-0.585]

94 One line of future research is to investigate other types of domain-independent frames that exhibit useful regularities. [sent-157, score-0.044]

95 Other examples include events (with regularities about actor, location, and time) and a generic organization-role frame (with regularities about person, organization, and role played). [sent-158, score-0.256]

96 Closing the gap: Learning-based information extraction rivaling knowledge-engineering methods. [sent-182, score-0.051]

97 A flexible learning system for wrapping tables and lists in html documents. [sent-188, score-0.062]

98 Methods for domain-independent information extraction from the Web: An experimental comparison. [sent-204, score-0.051]

99 Unsupervised named-entity extraction from the web: An experimental study. [sent-209, score-0.051]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('seq', 0.545), ('extractions', 0.456), ('functionality', 0.236), ('density', 0.212), ('sequence', 0.172), ('localconf', 0.161), ('totalconf', 0.161), ('ordinal', 0.151), ('president', 0.135), ('redund', 0.134), ('name', 0.127), ('regularities', 0.101), ('extractor', 0.095), ('sequences', 0.094), ('united', 0.087), ('jfk', 0.081), ('classifying', 0.08), ('etzioni', 0.076), ('presidents', 0.071), ('deepest', 0.071), ('candidate', 0.068), ('downey', 0.065), ('leverages', 0.063), ('oren', 0.059), ('kth', 0.055), ('correct', 0.055), ('extractors', 0.055), ('frame', 0.054), ('banko', 0.054), ('slots', 0.054), ('africa', 0.054), ('totalseqconf', 0.054), ('ie', 0.052), ('extraction', 0.051), ('item', 0.051), ('states', 0.049), ('domainindependent', 0.047), ('chieu', 0.047), ('cohen', 0.046), ('cafarella', 0.046), ('web', 0.046), ('curve', 0.045), ('distinct', 0.045), ('frames', 0.044), ('names', 0.043), ('position', 0.042), ('soderland', 0.041), ('unstructured', 0.041), ('father', 0.04), ('redundancy', 0.04), ('area', 0.04), ('ordered', 0.04), ('local', 0.04), ('confidence', 0.037), ('html', 0.037), ('curves', 0.037), ('queried', 0.036), ('xw', 0.035), ('ifn', 0.035), ('washington', 0.035), ('lake', 0.034), ('features', 0.033), ('doug', 0.033), ('shaked', 0.033), ('resulted', 0.032), ('michele', 0.032), ('classifiers', 0.031), ('flat', 0.031), ('op', 0.031), ('bayes', 0.031), ('finds', 0.03), ('ordering', 0.03), ('stephen', 0.029), ('population', 0.029), ('weld', 0.027), ('classify', 0.027), ('ijcai', 0.026), ('items', 0.026), ('reports', 0.026), ('largest', 0.026), ('popescu', 0.026), ('fill', 0.026), ('leveraging', 0.026), ('lists', 0.025), ('ranks', 0.024), ('naive', 0.024), ('coherent', 0.024), ('soderl', 0.024), ('cel', 0.024), ('closing', 0.024), ('pps', 0.024), ('mofa', 0.024), ('tactic', 0.024), ('beled', 0.024), ('fader', 0.024), ('ined', 0.024), ('sbi', 0.024), ('superlative', 0.024), ('open', 0.023), ('classical', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 111 acl-2010-Extracting Sequences from the Web

Author: Anthony Fader ; Stephen Soderland ; Oren Etzioni


2 0.06665998 185 acl-2010-Open Information Extraction Using Wikipedia

Author: Fei Wu ; Daniel S. Weld

Abstract: Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

3 0.05601082 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning

Author: Francois Mairesse ; Milica Gasic ; Filip Jurcicek ; Simon Keizer ; Blaise Thomson ; Kai Yu ; Steve Young

Abstract: Most previous work on trainable language generation has focused on two paradigms: (a) using a statistical model to rank a set of generated utterances, or (b) using statistics to inform the generation decision process. Both approaches rely on the existence of a handcrafted generator, which limits their scalability to new domains. This paper presents BAGEL, a statistical language generator which uses dynamic Bayesian networks to learn from semantically-aligned data produced by 42 untrained annotators. A human evaluation shows that BAGEL can generate natural and informative utterances from unseen inputs in the information presentation domain. Additionally, generation performance on sparse datasets is improved significantly by using certainty-based active learning, yielding ratings close to the human gold standard with a fraction of the data.

4 0.053529482 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly.

5 0.052450344 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

Author: Xianpei Han ; Jun Zhao

Abstract: Name ambiguity problem has raised urgent demands for efficient, high-quality named entity disambiguation methods. In recent years, the increasing availability of large-scale, rich semantic knowledge sources (such as Wikipedia and WordNet) creates new opportunities to enhance the named entity disambiguation by developing algorithms which can exploit these knowledge sources at best. The problem is that these knowledge sources are heterogeneous and most of the semantic knowledge within them is embedded in complex structures, such as graphs and networks. This paper proposes a knowledge-based method, called Structural Semantic Relatedness (SSR), which can enhance the named entity disambiguation by capturing and leveraging the structural semantic knowledge in multiple knowledge sources. Empirical results show that, in comparison with the classical BOW based methods and social network based methods, our method can significantly improve the disambiguation performance by respectively 8.7% and 14.7%.

6 0.049703971 159 acl-2010-Learning 5000 Relational Extractors

7 0.046645749 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

8 0.04307599 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

9 0.042899322 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

10 0.041730292 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

11 0.041465241 217 acl-2010-String Extension Learning

12 0.039268155 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

13 0.038550969 165 acl-2010-Learning Script Knowledge with Web Experiments

14 0.037611701 108 acl-2010-Expanding Verb Coverage in Cyc with VerbNet

15 0.037516546 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

16 0.037196837 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

17 0.036632322 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

18 0.036116414 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences

19 0.03605308 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

20 0.035181362 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.108), (1, 0.034), (2, -0.012), (3, -0.011), (4, 0.021), (5, -0.012), (6, 0.012), (7, 0.002), (8, 0.003), (9, -0.018), (10, -0.035), (11, -0.004), (12, -0.048), (13, -0.084), (14, 0.036), (15, 0.038), (16, 0.019), (17, 0.107), (18, -0.009), (19, 0.052), (20, -0.022), (21, 0.005), (22, -0.051), (23, 0.01), (24, -0.022), (25, -0.026), (26, 0.007), (27, 0.026), (28, 0.016), (29, 0.005), (30, 0.057), (31, 0.024), (32, 0.041), (33, -0.037), (34, 0.011), (35, 0.06), (36, -0.001), (37, -0.008), (38, 0.027), (39, 0.006), (40, 0.071), (41, 0.046), (42, -0.045), (43, 0.02), (44, 0.071), (45, -0.07), (46, 0.022), (47, 0.139), (48, -0.009), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92248446 111 acl-2010-Extracting Sequences from the Web

Author: Anthony Fader ; Stephen Soderland ; Oren Etzioni


2 0.63302517 159 acl-2010-Learning 5000 Relational Extractors

Author: Raphael Hoffmann ; Congle Zhang ; Daniel S. Weld

Abstract: Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn't scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations (more than an order of magnitude greater than any previous approach) with an average F1 score of 61%. Crucial to LUCHS's performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.

3 0.60279971 185 acl-2010-Open Information Extraction Using Wikipedia

Author: Fei Wu ; Daniel S. Weld

Abstract: Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

4 0.53994584 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

Author: Dmitry Davidov ; Ari Rappoport

Abstract: We present a novel framework for automated extraction and approximation of numerical object attributes such as height and weight from the Web. Given an object-attribute pair, we discover and analyze attribute information for a set of comparable objects in order to infer the desired value. This allows us to approximate the desired numerical values even when no exact values can be found in the text. Our framework makes use of relation defining patterns and WordNet similarity information. First, we obtain from the Web and WordNet a list of terms similar to the given object. Then we retrieve attribute values for each term in this list, and information that allows us to compare different objects in the list and to infer the attribute value range. Finally, we combine the retrieved data for all terms from the list to select or approximate the requested value. We evaluate our method using automated question answering, WordNet enrichment, and comparison with answers given in Wikipedia and by leading search engines. In all of these, our framework provides a significant improvement.

5 0.46979696 68 acl-2010-Conditional Random Fields for Word Hyphenation

Author: Nikolaos Trogkanis ; Charles Elkan

Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.

6 0.46084738 139 acl-2010-Identifying Generic Noun Phrases

7 0.45748079 197 acl-2010-Practical Very Large Scale CRFs

8 0.45083779 63 acl-2010-Comparable Entity Mining from Comparative Questions

9 0.41655585 85 acl-2010-Detecting Experiences from Weblogs

10 0.41076124 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.

11 0.40094733 258 acl-2010-Weakly Supervised Learning of Presupposition Relations between Verbs

12 0.39909026 165 acl-2010-Learning Script Knowledge with Web Experiments

13 0.3943167 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

14 0.39159507 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning

15 0.38867354 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

16 0.38363343 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

17 0.38314649 141 acl-2010-Identifying Text Polarity Using Random Walks

18 0.38032094 108 acl-2010-Expanding Verb Coverage in Cyc with VerbNet

19 0.37762091 225 acl-2010-Temporal Information Processing of a New Language: Fast Porting with Minimal Resources

20 0.37494817 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.071), (33, 0.028), (39, 0.027), (42, 0.028), (59, 0.065), (72, 0.022), (73, 0.07), (76, 0.01), (78, 0.042), (80, 0.011), (83, 0.079), (84, 0.034), (96, 0.314), (98, 0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76700038 111 acl-2010-Extracting Sequences from the Web

Author: Anthony Fader ; Stephen Soderland ; Oren Etzioni


2 0.70482808 248 acl-2010-Unsupervised Ontology Induction from Text

Author: Hoifung Poon; Pedro Domingos

Abstract: Extracting knowledge from unstructured text is a long-standing goal of NLP. Although learning approaches to many of its subtasks have been developed (e.g., parsing, taxonomy induction, information extraction), all end-to-end solutions to date require heavy supervision and/or manual engineering, limiting their scope and scalability. We present OntoUSP, a system that induces and populates a probabilistic ontology using only dependency-parsed text as input. OntoUSP builds on the USP unsupervised semantic parser by jointly forming ISA and IS-PART hierarchies of lambda-form clusters. The ISA hierarchy allows more general knowledge to be learned and smoothing to be used for parameter estimation. We evaluate OntoUSP by using it to extract a knowledge base from biomedical abstracts and answer questions. OntoUSP improves on the recall of USP by 47% and greatly outperforms previous state-of-the-art approaches.

3 0.49499673 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

Author: Partha Pratim Talukdar; Fernando Pereira

Abstract: Graph-based semi-supervised learning (SSL) algorithms have been successfully used to extract class-instance pairs from large unstructured and structured text collections. However, a careful comparison of different graph-based SSL algorithms on that task has been lacking. We compare three graph-based SSL algorithms for class-instance acquisition on a variety of graphs constructed from different domains. We find that the recently proposed MAD algorithm is the most effective. We also show that class-instance extraction can be significantly improved by adding semantic information in the form of instance-attribute edges derived from an independently developed knowledge base. All of our code and data will be made publicly available to encourage reproducible research in this area.

4 0.49483457 71 acl-2010-Convolution Kernel over Packed Parse Forest

Author: Min Zhang; Hui Zhang; Haizhou Li

Abstract: This paper proposes a convolution forest kernel to effectively explore the rich structured features embedded in a packed parse forest. As opposed to the convolution tree kernel, the proposed forest kernel does not have to commit to a single best parse tree, and is thus able to explore a much larger object space and many more structured features embedded in a forest. This makes the proposed kernel more robust against parsing errors and data-sparseness issues than the convolution tree kernel. The paper presents the formal definition of the convolution forest kernel and an algorithm for computing it efficiently. Experimental results on two NLP applications, relation extraction and semantic role labeling, show that the proposed forest kernel significantly outperforms the convolution tree kernel baseline.

5 0.49174929 185 acl-2010-Open Information Extraction Using Wikipedia

Author: Fei Wu; Daniel S. Weld

Abstract: Information-extraction (IE) systems seek to distill semantic relations from natural-language text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner's precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors, using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE's extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

6 0.49086559 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews

7 0.48986202 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

8 0.48984873 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

9 0.48918116 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

10 0.48487982 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

11 0.48442388 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

12 0.48426482 39 acl-2010-Automatic Generation of Story Highlights

13 0.48414987 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

14 0.48410925 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

15 0.48406214 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

16 0.48379046 158 acl-2010-Latent Variable Models of Selectional Preference

17 0.48331586 169 acl-2010-Learning to Translate with Source and Target Syntax

18 0.48316509 130 acl-2010-Hard Constraints for Grammatical Function Labelling

19 0.48293379 162 acl-2010-Learning Common Grammar from Multilingual Corpus

20 0.48223692 209 acl-2010-Sentiment Learning on Product Reviews via Sentiment Ontology Tree