100 emnlp-2012-Open Language Learning for Information Extraction


Source: pdf

Author: Mausam ; Michael Schmitz ; Stephen Soderland ; Robert Bart ; Oren Etzioni

Abstract: Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences. However, state-of-the-art Open IE systems such as REVERB and WOE share two important weaknesses: (1) they extract only relations that are mediated by verbs, and (2) they ignore context, thus extracting tuples that are not asserted as factual. This paper presents OLLIE, a substantially improved Open IE system that addresses both these limitations. First, OLLIE achieves high yield by extracting relations mediated by nouns, adjectives, and more. Second, a context-analysis step increases precision by including contextual information from the sentence in the extractions. OLLIE obtains 2.7 times the area under the precision-yield curve (AUC) compared to REVERB and 1.9 times the AUC of WOEparse.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences. [sent-3, score-0.256]

2 However, state-of-the-art Open IE systems such as REVERB and WOE share two important weaknesses: (1) they extract only relations that are mediated by verbs, and (2) they ignore context, thus extracting tuples that are not asserted as factual. [sent-4, score-0.262]

3 First, OLLIE achieves high yield by extracting relations mediated by nouns, adjectives, and more. [sent-6, score-0.173]

4 1 Introduction While traditional Information Extraction (IE) (ARPA, 1991; ARPA, 1998) focused on identifying and extracting specific relations of interest, there has been great interest in scaling IE to a broader set of relations and to far larger corpora (Banko et al. [sent-11, score-0.138]

5 The substantial endeavor in … extractions for the first three sentences where REVERB (R) and WOEparse (W) find none. [sent-19, score-0.169]

6 Both extract only relations that are mediated by verbs, and REVERB further restricts this to a subset of verbal patterns. [sent-28, score-0.137]

7 This misses important information mediated via other syntactic entities such as nouns and adjectives, as well as a wider range of verbal structures (examples #1-3 in Figure 1). [sent-29, score-0.183]

8 Secondly, REVERB and WOEparse perform only a local analysis of a sentence, so they often extract relations that are not asserted as factual in the sentence (examples #4,5). [sent-32, score-0.192]

9 OLLIE extractions obtain a dramatically higher yield at higher or comparable precision relative to existing systems. [sent-35, score-0.229]

10 Section 3 describes the syntactic scope expansion component, which is based on a novel approach that learns open pattern templates. [sent-38, score-0.188]

11 Moreover, for specific relations commonly mediated by nouns (e. [sent-45, score-0.179]

12 2 Background Open IE systems extract tuples consisting of argument phrases from the input sentence and a phrase 1Available for download at http://openie. [sent-50, score-0.134]

13 , 2011), which uses shallow syntactic processing to identify relation phrases that begin with a verb and occur between the argument phrases;2 (2) WOEparse (Wu and Weld, 2010), which uses bootstrapping from entries in Wikipedia info-boxes to learn extraction patterns in dependency parses. [sent-56, score-0.382]

14 Like REVERB, the relation phrases begin with verbs, but can handle long-range dependencies and relation phrases that do not come between the arguments. [sent-57, score-0.298]

15 Unlike REVERB, WOE does not include nouns within the relation phrases (e. [sent-58, score-0.191]

16 Both systems ignore context around the extracted relations that may indicate whether it is a supposition or conditionally true rather than asserted as factual (see #4-5 in Figure 1). [sent-61, score-0.212]

17 The task of Semantic role labeling is to identify arguments of verbs in a sentence, and then to classify the arguments by mapping the verb to a semantic frame and mapping the argument phrases to roles in that frame, such as agent, patient, instrument, or benefactive. [sent-62, score-0.303]

18 SRL systems can also identify and classify arguments of relations that are mediated by nouns when trained on NomBank annotations. [sent-63, score-0.246]

19 Where SRL begins with a verb or noun and then looks for arguments that play roles with respect to that verb or noun, Open IE looks for a phrase that expresses a relation between a pair of arguments. [sent-64, score-0.259]

20 First, it uses a set of high precision seed tuples from REVERB to bootstrap a large training set. [sent-67, score-0.14]

21 Second, it learns open pattern templates over this training set. [sent-68, score-0.211]

22 Next, OLLIE applies these pattern templates at extraction time. [sent-69, score-0.163]

23 Figure 2: System architecture: OLLIE begins with seed tuples from REVERB, uses them to build a bootstrap training set, and learns open pattern templates. [sent-75, score-0.281]

24 The key observation is that almost every relation can also be expressed via a REVERB-style verb-based expression. [sent-79, score-0.147]

25 So, bootstrapping sentences based on REVERB’s tuples will likely capture all relation expressions. [sent-80, score-0.202]

26 We start with over 110,000 seed tuples; these are high-confidence REVERB extractions from a large Web corpus (ClueWeb)3 that are asserted at least twice and contain only proper nouns in the arguments. [sent-81, score-0.436]
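The seed selection criteria in the sentence above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `SeedCandidate` record, the Penn Treebank proper-noun tags, and the `min_confidence` threshold of 0.9 are hypothetical and are not taken from the paper, which does not give its confidence cutoff here.

```python
from dataclasses import dataclass
from typing import List

# Penn Treebank tags for singular and plural proper nouns (assumed tag set).
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


@dataclass
class SeedCandidate:
    """Hypothetical record for one REVERB extraction considered as a seed."""
    arg1_tags: List[str]   # POS tags of the first argument's tokens
    arg2_tags: List[str]   # POS tags of the second argument's tokens
    times_asserted: int    # how often the tuple was extracted from the corpus
    confidence: float      # extractor confidence in [0, 1]


def is_seed(candidate: SeedCandidate, min_confidence: float = 0.9) -> bool:
    """Keep tuples that are confident, asserted at least twice,
    and whose arguments consist only of proper nouns."""
    tags = candidate.arg1_tags + candidate.arg2_tags
    only_proper_nouns = bool(tags) and all(t in PROPER_NOUN_TAGS for t in tags)
    return (
        only_proper_nouns
        and candidate.times_asserted >= 2
        and candidate.confidence >= min_confidence
    )


good = SeedCandidate(["NNP", "NNP"], ["NNP"], times_asserted=3, confidence=0.95)
rare = SeedCandidate(["NNP", "NNP"], ["NNP"], times_asserted=1, confidence=0.95)
common_noun = SeedCandidate(["DT", "NN"], ["NNP"], times_asserted=5, confidence=0.95)
print(is_seed(good), is_seed(rare), is_seed(common_noun))
```

Restricting arguments to proper nouns is a cheap way to keep seeds that refer to specific entities, which makes the later sentence retrieval less ambiguous.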

27 For example, a seed tuple may be (Paul Annacone; is the coach of; Federer) that REVERB extracts from the sentence “Paul Annacone is the coach of Federer. [sent-83, score-0.26]

28 As an example, for a seed tuple (Boyle; is born in; Ireland) we may retrieve a sentence “Felix G. [sent-90, score-0.142]

29 We only allow sentences where the content words from arguments and relation can be linked to each other via a linear path of size four in the dependency parse. [sent-95, score-0.237]

30 In our case, by enforcing that a sentence additionally contains some syntactic form of the relation content words, our bootstrapping set is naturally much cleaner. [sent-113, score-0.185]

31 Since the relation words in the sentence and seed match, we can learn general pattern templates that may apply to other relations too. [sent-115, score-0.375]

32 OLLIE learns open pattern templates: a mapping from a dependency path to an open extraction, i. [sent-124, score-0.362]

33 , one that identifies both the arguments and the exact (REVERB-style) relation phrase. [sent-126, score-0.19]

34 Open pattern templates encode the ways in which a relation (in the first column) may be expressed in a sentence (second column). [sent-129, score-0.254]

35 For example, a relation (Godse; kill; Gandhi) may be expressed with a dependency path (#2) {Godse}↑nsubj↑{kill:postag=VBD}↓dobj↓{Gandhi}. [sent-130, score-0.194]

36 To learn the pattern templates, we first extract the dependency path connecting the arguments and relation words for each seed tuple and the associated sentence. [sent-131, score-0.44]

37 We annotate the relation node in the path with the exact relation word (as a lexical constraint) and the POS (postag constraint). [sent-132, score-0.269]

38 We create a relation template from the seed tuple by normalizing ‘is’/‘was’/‘will be’ to ‘be’, and replacing the relation content word with {rel}. [sent-133, score-0.388]
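The template construction described above can be sketched in a few lines. This is a minimal illustration, assuming the relation phrase is a plain string and that the single relation content word is already known; the function name `make_relation_template` is hypothetical.

```python
import re

# Copula forms that the text says are normalized to 'be'.
# 'will be' is listed first so it is matched as a unit.
_COPULA = re.compile(r"\b(?:will be|is|was)\b")


def make_relation_template(relation_phrase: str, content_word: str) -> str:
    """Normalize the copula and replace the relation content word with {rel}."""
    normalized = _COPULA.sub("be", relation_phrase)
    word_pattern = r"\b" + re.escape(content_word) + r"\b"
    # Replace only the first occurrence: a single content word is assumed.
    return re.sub(word_pattern, "{rel}", normalized, count=1)


print(make_relation_template("is the coach of", "coach"))  # be the {rel} of
print(make_relation_template("will be born in", "born"))   # be {rel} in
```

Abstracting the content word into `{rel}` is what lets a template learned from one relation apply to others with the same surface shape.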

39 As an example, ‘hired’ is a slot word for the tuple (Annacone; is the coach of; Federer) in the sentence “Federer hired Annacone as a coach”. [sent-137, score-0.205]

40 The checks are: (1) There are no slot … (footnote 4) Our current implementation only allows a single relation content word; extending to multiple words is straightforward: the templates will require rel1, rel2,. [sent-142, score-0.241]

41 We remove all lexical restrictions from the relation nodes. [sent-154, score-0.184]

42 Both these data points return the same open pattern after generalization: “{arg1} ↑nsubj↑ {rel:postag=VBD} ↓{prep ∗}↓ {arg2}” with the extraction template (arg1, {rel} {prep}, arg2). [sent-165, score-0.165]

43 To enable such patterns we retain the lexical constraints on the relation words and slot words. [sent-180, score-0.228]

44 We collect all patterns together based only on the syntactic restrictions and convert the lexical constraint into a list of words with which the pattern was seen. [sent-181, score-0.212]

45 This imposes a natural ranking on the patterns: more frequent patterns are likely to give higher-precision extractions. [sent-197, score-0.158]
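A frequency-based ranking of this kind can be sketched with a counter. The pattern strings below are made-up placeholders and the function name `rank_patterns` is hypothetical; the sketch only shows ordering by how often each pattern was observed in the bootstrap data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_patterns(observed_patterns: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(observed_patterns).most_common()


# Hypothetical observations: one entry per training sentence that produced the pattern.
observations = [
    "arg1 nsubj rel dobj arg2",
    "arg1 nsubj rel prep arg2",
    "arg1 nsubj rel dobj arg2",
    "arg1 appos rel prep arg2",
    "arg1 nsubj rel dobj arg2",
    "arg1 nsubj rel prep arg2",
]
for pattern, count in rank_patterns(observations):
    print(count, pattern)
```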

46 3 Pattern Matching for Extraction We now describe how these open patterns are used to extract binary relations from a new sentence. [sent-199, score-0.24]

47 We first match the open patterns with the dependency parse of the sentence and identify the base nodes for arguments and relations. [sent-200, score-0.262]

48 To apply pattern #1 from Figure 3 we first match arg1 to ‘festival’, rel to ‘scheduled’ and arg2 to ‘25th’ with prep ‘for’. [sent-204, score-0.16]

49 Since WOE does not have access to a seed relation phrase, it heuristically assigns all intervening words between the arguments in the parse as the relation phrase. [sent-223, score-0.389]

50 WOE’s heuristics will extract the relation ‘divorced was pursuing’ between ‘Tom Cruise’ and ‘Nicole Kidman’. [sent-226, score-0.192]

51 OLLIE, in contrast, produces well-formed relation phrases by basing its templates on REVERB relation phrases. [sent-227, score-0.318]

52 Finally, WOE is designed to have verb-mediated relation phrases that do not include nouns, thus missing important relations such as ‘is the president of’. [sent-229, score-0.264]

53 4 Context Analysis in OLLIE We now turn to the context analysis component, which handles the problem of extractions that are not asserted as factual in the text. [sent-231, score-0.143]

54 Our algorithm first checks for the presence of a ccomp edge to the relation node. [sent-246, score-0.196]
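The first check can be sketched over a dependency parse stored as labeled edges. This is an assumption-laden illustration: edges are modeled as `(head, label, dependent)` triples, "an edge to the relation node" is read as the relation node being the dependent of a `ccomp` edge, and the example parse is hand-written rather than produced by a parser.

```python
from typing import Iterable, Tuple

# (head word, dependency label, dependent word); an assumed representation.
Edge = Tuple[str, str, str]


def has_ccomp_edge_to(edges: Iterable[Edge], relation_node: str) -> bool:
    """True if some 'ccomp' edge points at the relation node."""
    return any(
        label == "ccomp" and dependent == relation_node
        for _head, label, dependent in edges
    )


# Hand-written edges for a hypothetical sentence:
# "Critics said that the company hired a new coach."
edges = [
    ("said", "nsubj", "Critics"),
    ("said", "ccomp", "hired"),
    ("hired", "nsubj", "company"),
    ("hired", "dobj", "coach"),
]
print(has_ccomp_edge_to(edges, "hired"))  # the 'hired' relation sits inside a complement clause
print(has_ccomp_edge_to(edges, "said"))
```

When the check succeeds, the governing verb (here ‘said’) is the natural candidate for attribution; how OLLIE decides which governing verbs count is not shown in this excerpt.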

55 … and ClausalModifier fields, nearly 98% on a development set; however, these two fields do not cover all the cases where an extraction is not asserted as factual. [sent-260, score-0.163]

56 Our training set was 1000 extractions drawn evenly from Wikipedia, News, and Biology sentences. [sent-265, score-0.169]

57 (3) How do OLLIE’s extractions compare with semantic role labeling argument identification? [sent-269, score-0.237]

58 We ran three systems, OLLIE, REVERB and WOEparse, on this dataset, resulting in a total of 1,945 extractions from all three systems. [sent-273, score-0.169]

59 Two annotators tagged the extractions as correct if the sentence asserted or implied that the relation was true. [sent-274, score-0.377]

60 We find that 40% of the OLLIE extractions that REVERB misses are due to OLLIE’s use of parsers; REVERB misses those because its shallow syntactic analysis cannot skip over the intervening clauses or prepositional phrases between the relation phrase and the arguments. [sent-290, score-0.441]

61 About 30% of the additional yield is those extractions where the relation is not between its arguments (see instance #1 in Figure 1). [sent-291, score-0.395]

62 In contrast, OLLIE misses very few extractions returned by REVERB, mostly due to parser errors. [sent-293, score-0.219]

63 We find that WOEparse misses extractions found by OLLIE for a variety of reasons. [sent-294, score-0.219]

64 The primary cause is that WOEparse does not include nouns in relation phrases. [sent-295, score-0.165]

65 In other cases, WOEparse misses extractions due to ill-formed relation phrases (as in the example of Section 3. [sent-297, score-0.368]

66 While the bulk of OLLIE’s extractions in our test … (footnote 6) Evaluating recall is difficult at this scale; however, since yield is proportional to recall, the area differences also hold for the equivalent precision-recall curves. [sent-299, score-0.228]
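The footnote's argument can be checked numerically: if recall is yield divided by a constant, every area shrinks by the same factor, so the ratio between two systems' areas is unchanged. The sketch below assumes trapezoidal integration, which the text does not specify, and uses made-up curve points and a made-up total of 400 correct extractions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (yield or recall, precision)


def area_under_curve(points: List[Point]) -> float:
    """Trapezoidal area under a curve given as (x, precision) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def to_recall(points: List[Point], total_correct: float) -> List[Point]:
    """Rescale the yield axis to recall by dividing by a constant."""
    return [(x / total_correct, y) for x, y in points]


# Hypothetical precision-yield curves for two systems.
system_a = [(0, 1.0), (100, 0.8), (200, 0.6)]
system_b = [(0, 1.0), (50, 0.8), (100, 0.5)]

yield_ratio = area_under_curve(system_a) / area_under_curve(system_b)
recall_ratio = (
    area_under_curve(to_recall(system_a, 400))
    / area_under_curve(to_recall(system_b, 400))
)
print(math.isclose(yield_ratio, recall_ratio))
```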

67 … for relations that are typically expressed by noun phrases, up to 146 times that of REVERB. [sent-301, score-0.142]

68 OLLIE found up to 146 times as many extractions for these relations as REVERB. [sent-309, score-0.238]

69 Because WOEparse does not include nouns in relation phrases, it is unable to extract any instance of these relations. [sent-310, score-0.165]

70 We examine a sample of the extractions to verify that noun-mediated extractions are the main reason for this large yield boost over REVERB (73% of OLLIE extractions were noun-mediated). [sent-311, score-0.543]

71 High-frequency noun patterns like “Obama, the president of the US”, “Obama, the US president”, “US President Obama” far outnumber sentences of the form “Obama is the president of the US”. [sent-312, score-0.182]

72 2 Analysis of OLLIE We perform two control experiments to understand the value of semantic/lexical restrictions in pattern learning and the precision boost due to the context analysis component. [sent-317, score-0.146]

73 (footnote 7) We multiply the total number of extractions by the precision on a sample for that relation to estimate the yield. [sent-318, score-0.316]
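The yield estimate in this footnote is a single multiplication. A small sketch follows; the function name `estimated_yield` and the counts are hypothetical and are not figures from the paper.

```python
def estimated_yield(total_extractions: int, sample_precision: float) -> float:
    """Estimate the number of correct extractions for one relation.

    total_extractions: all extractions a system returned for the relation.
    sample_precision: fraction judged correct in a hand-labeled sample.
    """
    if not 0.0 <= sample_precision <= 1.0:
        raise ValueError("sample_precision must lie in [0, 1]")
    return total_extractions * sample_precision


# Hypothetical numbers: 2,000 extractions, half of a labeled sample judged correct.
print(estimated_yield(2000, 0.5))  # 1000.0
```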

74 Figure 7: Results on the subset of extractions from patterns with semantic/lexical restrictions. [sent-319, score-0.236]

75 Are semantic restrictions important for open pattern learning? [sent-323, score-0.226]

76 To answer these questions we compare three systems: OLLIE without semantic or lexical restrictions (OLLIE[syn]), OLLIE with lexical restrictions but no type generalization (OLLIE[lex]) and the full system (OLLIE). [sent-325, score-0.15]

77 We restrict this experiment to the patterns where OLLIE adds semantic/lexical restrictions, rather than dilute the result with patterns that would be unchanged by these variants. [sent-326, score-0.134]

78 This matches our intuition, since these are not completely general patterns and generalizing to all unseen relations results in a large number of errors. [sent-329, score-0.136]

79 Adding ClausalModifier corrects errors for 21% of extractions that have a ClausalModifier and does not introduce any new errors. [sent-342, score-0.169]

80 Adding AttributedTo corrects errors for 55% of the extractions with AttributedTo and introduces an error for 3% of the extractions. [sent-343, score-0.169]

81 18% of the errors are due to aggressive generalization of a pattern to all unseen relations and 12% due to incorrect application of lexically annotated patterns. [sent-347, score-0.158]

82 SRL, as discussed in Section 2, has a very different goal: analyzing verbs and nouns to identify their arguments, then mapping the verb or noun to a semantic frame and determining the role that each argument plays in that frame. [sent-357, score-0.208]

83 These verbs and nouns need not make the full relation phrase, although recent work has shown that they may be converted to Open IE style extractions with additional postprocessing (Christensen et al. [sent-358, score-0.356]

84 This task is permissive for both systems, as it does not require finding an exact relation phrase or argument boundary, or determining the argument roles in a relation. [sent-362, score-0.259]

85 We only counted relations expressed by a verb or noun in the text, and did not include relations expressed simply with “of” or apostrophe-s. [sent-364, score-0.286]

86 Where a verb mediates between an argument and multiple NPs, we represent this as a binary relation for all pairs of NPs. [sent-365, score-0.214]

87 Recall is based on the percentage of NP pairs where the head nouns match the head nouns of two different arguments in an extraction or semantic frame. [sent-386, score-0.207]

88 5, since it is tuned for high precision extraction, and avoids less reliable extractions from constructions such as reduced relative clauses and gerunds, or from noun-mediated relations with long-range dependencies. [sent-396, score-0.262]
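The recall measure described above can be sketched as follows. The data layout is an assumption: each target is a pair of head nouns, and each extraction or semantic frame is reduced to the list of its arguments' head nouns. The example data is hypothetical.

```python
from typing import List, Sequence, Tuple

HeadPair = Tuple[str, str]


def pair_matched(pair: HeadPair, argument_heads: Sequence[str]) -> bool:
    """True if both heads appear as heads of two different arguments."""
    first, second = pair
    return any(
        i != j and a == first and b == second
        for i, a in enumerate(argument_heads)
        for j, b in enumerate(argument_heads)
    )


def head_pair_recall(targets: List[HeadPair],
                     extractions: List[Sequence[str]]) -> float:
    """Fraction of target NP pairs covered by at least one extraction."""
    if not targets:
        return 0.0
    covered = sum(
        1 for pair in targets
        if any(pair_matched(pair, heads) for heads in extractions)
    )
    return covered / len(targets)


# Hypothetical gold pairs and system output (argument head nouns only).
targets = [("Clarcor", "products"), ("Obama", "US")]
extractions = [["Clarcor", "products"]]
print(head_pair_recall(targets, extractions))  # 0.5
```

Matching on head nouns rather than full spans is what makes the task permissive: an argument with the wrong boundary still counts as long as its head is right.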

89 The missing recall from SRL is primarily where it does not identify both arguments of a binary relation, or where the correct argument is buried in a long argument phrase, but is not its head noun. [sent-398, score-0.203]

90 OLLIE finds the extraction (Clarcor; be a maker of; packaging and filtration products) where the heads of both arguments matched those of the target. [sent-407, score-0.182]

91 All these approaches first bootstrap data based on seed instances of a relation (or seed data from existing resources such as Wikipedia) and then learn lexical or lexico-POS patterns to create an extractor. [sent-421, score-0.342]

92 First, and most importantly, these previous systems learn an extractor for each relation of interest, whereas OLLIE is an open extractor. [sent-425, score-0.249]

93 OLLIE’s strength is its ability to generalize from one relation to many other relations that are expressed in similar forms. [sent-426, score-0.244]

94 This happens both via syntactic generalization and type generalization of relation words (sections 3. [sent-427, score-0.202]

95 This capability is essential as many relations in the test set are not even seen in the training set; in early experiments we found that non-generalized pattern learning (equivalent to traditional IE) had significantly less yield at a slightly higher precision. [sent-432, score-0.166]

96 The closest to our work is the pattern learning based open extractor WOEparse. [sent-437, score-0.187]

97 Second, by analyzing the context around an extraction, OLLIE is able to identify cases where the relation is not asserted as factual, but is hypothetical or conditionally true. [sent-447, score-0.228]

98 OLLIE increases precision by reducing confidence in those extractions or by associating additional context in the extractions, in the form of attribution and clausal modifiers. [sent-448, score-0.253]

99 7 times more area under precision-yield curves compared to open extractors. [sent-451, score-0.147]

100 An analysis of open information extraction based on semantic role labeling. [sent-516, score-0.16]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ollie', 0.807), ('reverb', 0.205), ('extractions', 0.169), ('woeparse', 0.157), ('relation', 0.123), ('ie', 0.112), ('open', 0.104), ('clausalmodifier', 0.089), ('lund', 0.089), ('asserted', 0.085), ('srl', 0.08), ('annacone', 0.079), ('attributedto', 0.079), ('seed', 0.076), ('relations', 0.069), ('federer', 0.069), ('argument', 0.068), ('mediated', 0.068), ('woe', 0.068), ('patterns', 0.067), ('arguments', 0.067), ('tuple', 0.066), ('restrictions', 0.061), ('pattern', 0.061), ('coach', 0.059), ('prep', 0.059), ('extraction', 0.056), ('misses', 0.05), ('scheduled', 0.049), ('templates', 0.046), ('president', 0.046), ('rna', 0.042), ('hired', 0.042), ('postag', 0.042), ('nouns', 0.042), ('tuples', 0.04), ('rel', 0.04), ('ccomp', 0.039), ('maker', 0.039), ('pursuing', 0.039), ('bootstrapping', 0.039), ('factual', 0.038), ('clausal', 0.038), ('slot', 0.038), ('yield', 0.036), ('fader', 0.036), ('checks', 0.034), ('auc', 0.034), ('festival', 0.034), ('frame', 0.03), ('christensen', 0.03), ('clarcor', 0.03), ('divorced', 0.03), ('macromolecules', 0.03), ('ceo', 0.028), ('generalize', 0.028), ('generalization', 0.028), ('ritter', 0.027), ('phrases', 0.026), ('soderland', 0.026), ('hoffmann', 0.026), ('expressed', 0.024), ('precision', 0.024), ('dependency', 0.024), ('confidence', 0.024), ('suchanek', 0.024), ('curve', 0.024), ('oren', 0.024), ('wikipedia', 0.023), ('verb', 0.023), ('path', 0.023), ('nombank', 0.023), ('ireland', 0.023), ('area', 0.023), ('noun', 0.023), ('syntactic', 0.023), ('verbs', 0.022), ('attribution', 0.022), ('extractor', 0.022), ('fields', 0.022), ('enter', 0.021), ('obama', 0.021), ('curves', 0.02), ('weld', 0.02), ('conditionally', 0.02), ('mausam', 0.02), ('lex', 0.02), ('amod', 0.02), ('boyle', 0.02), ('clueweb', 0.02), ('founder', 0.02), ('godse', 0.02), ('janara', 0.02), ('ofextractions', 0.02), ('orchestra', 0.02), ('packaging', 0.02), ('sasquatch', 0.02), ('symphony', 0.02), ('tubes', 0.02), ('webb', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 100 emnlp-2012-Open Language Learning for Information Extraction

2 0.12079298 40 emnlp-2012-Ensemble Semantics for Large-scale Unsupervised Relation Extraction

Author: Bonan Min ; Shuming Shi ; Ralph Grishman ; Chin-Yew Lin

Abstract: Discovering significant types of relations from the web is challenging because of its open nature. Unsupervised algorithms are developed to extract relations from a corpus without knowing the relations in advance, but most of them rely on tagging arguments of predefined types. Recently, a new algorithm was proposed to jointly extract relations and their argument semantic classes, taking a set of relation instances extracted by an open IE algorithm as input. However, it cannot handle polysemy of relation phrases and fails to group many similar (“synonymous”) relation instances because of the sparseness of features. In this paper, we present a novel unsupervised algorithm that provides a more general treatment of the polysemy and synonymy problems. The algorithm incorporates various knowledge sources which we will show to be very effective for unsupervised extraction. Moreover, it explicitly disambiguates polysemous relation phrases and groups synonymous ones. While maintaining approximately the same precision, the algorithm achieves significant improvement on recall compared to the previous method. It is also very efficient. Experiments on a real-world dataset show that it can handle 14.7 million relation instances and extract a very large set of relations from the web. (* Work done during an internship at Microsoft Research Asia.)
1 Introduction Relation extraction aims at discovering semantic relations between entities. It is an important task that has many applications in answering factoid questions, building knowledge bases and improving search engine relevance. The web has become a massive potential source of such relations. However, its open nature brings an open-ended set of relation types. To extract these relations, a system should not assume a fixed set of relation types, nor rely on a fixed set of relation argument types. The past decade has seen some promising solutions, unsupervised relation extraction (URE) algorithms that extract relations from a corpus without knowing the relations in advance. However, most algorithms (Hasegawa et al., 2004; Shinyama and Sekine, 2006; Chen et al., 2005) rely on tagging predefined types of entities as relation arguments, and thus are not well-suited for the open domain. Recently, Kok and Domingos (2008) proposed Semantic Network Extractor (SNE), which generates argument semantic classes and sets of synonymous relation phrases at the same time, thus avoiding the requirement of tagging relation arguments of predefined types. However, SNE has 2 limitations: 1) Following previous URE algorithms, it only uses features from the set of input relation instances for clustering. Empirically we found that it fails to group many relevant relation instances. These features, such as the surface forms of arguments and lexical sequences in between, are very sparse in practice. In contrast, there exist several well-known corpus-level semantic resources that can be automatically derived from a source corpus and are shown to be useful for generating the key elements of a relation: its 2 argument semantic classes and a set of synonymous phrases. For example, semantic classes can be derived from a source corpus with contextual distributional similarity and web table co-occurrences. The “synonymy” problem for clustering relation instances could potentially be better solved by adding these resources. 2) SNE assumes that each entity or relation phrase belongs to exactly one cluster, thus is not able to effectively handle polysemy of relation phrases. An example of a polysemous phrase is be the currency of as in 2 triples

3 0.1008783 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

4 0.09477362 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types

Author: Ndapandula Nakashole ; Gerhard Weikum ; Fabian Suchanek

Abstract: This paper presents PATTY: a large resource for textual patterns that denote binary relations between entities. The patterns are semantically typed and organized into a subsumption taxonomy. The PATTY system is based on efficient algorithms for frequent itemset mining and can process Web-scale corpora. It harnesses the rich type system and entity population of large knowledge bases. The PATTY taxonomy comprises 350,569 pattern synsets. Random-sampling-based evaluation shows a pattern accuracy of 84.7%. PATTY has 8,162 subsumptions, with a random-sampling-based precision of 75%. The PATTY resource is freely available for interactive access and download.

5 0.09225107 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

Author: Mihai Surdeanu ; Julie Tibshirani ; Ramesh Nallapati ; Christopher D. Manning

Abstract: Distant supervision for relation extraction (RE), gathering training data by aligning a database of facts with text, is an efficient approach to scale RE to thousands of different relations. However, this introduces a challenging learning scenario where the relation expressed by a pair of entities found in a sentence is unknown. For example, a sentence containing Balzac and France may express BornIn or Died, an unknown relation, or no relation at all. Because of this, traditional supervised learning, which assumes that each example is explicitly mapped to a label, is not appropriate. We propose a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables. Our model performs competitively on two difficult domains.

6 0.07503359 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

7 0.072931424 62 emnlp-2012-Identifying Constant and Unique Relations by using Time-Series Text

8 0.071576409 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence

9 0.059832692 97 emnlp-2012-Natural Language Questions for the Web of Data

10 0.058078077 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation

11 0.057316601 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

12 0.042495508 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

13 0.042409699 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering

14 0.036812473 84 emnlp-2012-Linking Named Entities to Any Database

15 0.036122367 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web

16 0.035475578 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

17 0.034759909 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

18 0.034088649 65 emnlp-2012-Improving NLP through Marginalization of Hidden Syntactic Structure

19 0.031949971 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

20 0.030977238 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, 0.095), (2, -0.002), (3, -0.054), (4, 0.06), (5, -0.017), (6, 0.081), (7, 0.177), (8, -0.149), (9, 0.03), (10, -0.066), (11, -0.009), (12, -0.061), (13, -0.02), (14, -0.044), (15, -0.028), (16, -0.199), (17, -0.026), (18, -0.01), (19, -0.009), (20, -0.04), (21, -0.063), (22, -0.008), (23, -0.017), (24, -0.023), (25, -0.154), (26, 0.023), (27, -0.029), (28, 0.065), (29, 0.067), (30, -0.082), (31, 0.038), (32, -0.113), (33, 0.134), (34, -0.032), (35, -0.04), (36, 0.022), (37, 0.06), (38, -0.022), (39, 0.008), (40, -0.093), (41, -0.049), (42, -0.025), (43, -0.06), (44, -0.147), (45, 0.002), (46, -0.188), (47, 0.122), (48, 0.039), (49, 0.132)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92964321 100 emnlp-2012-Open Language Learning for Information Extraction

2 0.76390523 40 emnlp-2012-Ensemble Semantics for Large-scale Unsupervised Relation Extraction

Author: Bonan Min ; Shuming Shi ; Ralph Grishman ; Chin-Yew Lin

Abstract: Discovering significant types of relations from the web is challenging because of its open nature. Unsupervised algorithms are developed to extract relations from a corpus without knowing the relations in advance, but most of them rely on tagging arguments of predefined types. Recently, a new algorithm was proposed to jointly extract relations and their argument semantic classes, taking a set of relation instances extracted by an open IE algorithm as input. However, it cannot handle polysemy of relation phrases and fails to group many similar (“synonymous”) relation instances because of the sparseness of features. In this paper, we present a novel unsupervised algorithm that provides a more general treatment of the polysemy and synonymy problems. The algorithm incorporates various knowledge sources which we will show to be very effective for unsupervised extraction. Moreover, it explicitly disambiguates polysemous relation phrases and groups synonymous ones. While maintaining approximately the same precision, the algorithm achieves significant improvement on recall compared to the previous method. It is also very efficient. Experiments on a real-world dataset show that it can handle 14.7 million relation instances and extract a very large set of relations from the web.

1 Introduction

Relation extraction aims at discovering semantic relations between entities. It is an important task that has many applications in answering factoid questions, building knowledge bases and improving search engine relevance. The web has become a massive potential source of such relations. However, its open nature brings an open-ended set of relation types. To extract these relations, a system should not assume a fixed set of relation types, nor rely on a fixed set of relation argument types. The past decade has seen some promising solutions, unsupervised relation extraction (URE) algorithms that extract relations from a corpus without knowing the relations in advance.
However, most algorithms (Hasegawa et al., 2004; Shinyama and Sekine, 2006; Chen et al., 2005) rely on tagging predefined types of entities as relation arguments, and thus are not well-suited for the open domain. Recently, Kok and Domingos (2008) proposed Semantic Network Extractor (SNE), which generates argument semantic classes and sets of synonymous relation phrases at the same time, thus avoiding the requirement of tagging relation arguments of predefined types. However, SNE has 2 limitations: 1) Following previous URE algorithms, it only uses features from the set of input relation instances for clustering. Empirically we found that it fails to group many relevant relation instances. These features, such as the surface forms of arguments and lexical sequences in between, are very sparse in practice. In contrast, there exist several well-known corpus-level semantic resources that can be automatically derived from a source corpus and are shown to be useful for generating the key elements of a relation: its 2 argument semantic classes and a set of synonymous phrases. For example, semantic classes can be derived from a source corpus with contextual distributional similarity and web table co-occurrences. The “synonymy” problem for clustering relation instances could potentially be better solved by adding these resources. 2) SNE assumes that each entity or relation phrase belongs to exactly one cluster, thus is not able to effectively handle polysemy of relation phrases. An example of a polysemous phrase is “be the currency of” as in 2 triples

* Work done during an internship at Microsoft Research Asia.

Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1027–1037, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics

3 0.64925921 62 emnlp-2012-Identifying Constant and Unique Relations by using Time-Series Text

Author: Yohei Takaku ; Nobuhiro Kaji ; Naoki Yoshinaga ; Masashi Toyoda

Abstract: Because the real world evolves over time, numerous relations between entities written in presently available texts are already obsolete or will potentially evolve in the future. This study aims at resolving the intricacy in consistently compiling relations extracted from text, and presents a method for identifying constancy and uniqueness of the relations in the context of supervised learning. We exploit massive time-series web texts to induce features on the basis of time-series frequency and linguistic cues. Experimental results confirmed that the time-series frequency distributions contributed much to the recall of constancy identification and the precision of the uniqueness identification.

4 0.60349762 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types

Author: Ndapandula Nakashole ; Gerhard Weikum ; Fabian Suchanek

Abstract: This paper presents PATTY: a large resource for textual patterns that denote binary relations between entities. The patterns are semantically typed and organized into a subsumption taxonomy. The PATTY system is based on efficient algorithms for frequent itemset mining and can process Web-scale corpora. It harnesses the rich type system and entity population of large knowledge bases. The PATTY taxonomy comprises 350,569 pattern synsets. Random-sampling-based evaluation shows a pattern accuracy of 84.7%. PATTY has 8,162 subsumptions, with a random-sampling-based precision of 75%. The PATTY resource is freely available for interactive access and download.

5 0.42896122 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

6 0.42403385 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

7 0.42218885 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence

8 0.3963145 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web

9 0.3745037 97 emnlp-2012-Natural Language Questions for the Web of Data

10 0.36005843 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation

11 0.34656072 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

12 0.34636137 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

13 0.27192476 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs

14 0.22979718 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

15 0.21851252 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

16 0.21017286 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

17 0.20927924 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

18 0.20538633 85 emnlp-2012-Local and Global Context for Supervised and Unsupervised Metonymy Resolution

19 0.18467587 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

20 0.18339226 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.032), (16, 0.026), (18, 0.015), (25, 0.012), (34, 0.032), (60, 0.06), (63, 0.451), (64, 0.025), (65, 0.051), (70, 0.021), (73, 0.017), (74, 0.036), (76, 0.044), (80, 0.026), (86, 0.017), (94, 0.02), (95, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94535077 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

Author: Lan Du ; Wray Buntine ; Huidong Jin

Abstract: Topic models are increasingly being used for text analysis tasks, often times replacing earlier semantic techniques such as latent semantic analysis. In this paper, we develop a novel adaptive topic model with the ability to adapt topics from both the previous segment and the parent document. For this proposed model, a Gibbs sampler is developed for doing posterior inference. Experimental results show that with topic adaptation, our model significantly improves over existing approaches in terms of perplexity, and is able to uncover clear sequential structure on, for example, Herman Melville’s book “Moby Dick”.

2 0.94199574 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

Author: Kristian Woodsend ; Mirella Lapata

Abstract: Multi-document summarization involves many aspects of content selection and surface realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

same-paper 3 0.93442327 100 emnlp-2012-Open Language Learning for Information Extraction

Author: Mausam ; Michael Schmitz ; Stephen Soderland ; Robert Bart ; Oren Etzioni

Abstract: Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences. However, state-of-the-art Open IE systems such as REVERB and WOE share two important weaknesses: (1) they extract only relations that are mediated by verbs, and (2) they ignore context, thus extracting tuples that are not asserted as factual. This paper presents OLLIE, a substantially improved Open IE system that addresses both these limitations. First, OLLIE achieves high yield by extracting relations mediated by nouns, adjectives, and more. Second, a context-analysis step increases precision by including contextual information from the sentence in the extractions. OLLIE obtains 2.7 times the area under precision-yield curve (AUC) compared to REVERB and 1.9 times the AUC of WOEparse.

4 0.90475559 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

Author: Thomas Francois ; Cedrick Fairon

Abstract: This paper presents a new readability formula for French as a foreign language (FFL), which relies on 46 textual features representative of the lexical, syntactic, and semantic levels as well as some of the specificities of the FFL context. We report comparisons between several techniques for feature selection and various learning algorithms. Our best model, based on support vector machines (SVM), significantly outperforms previous FFL formulas. We also found that semantic features behave poorly in our case, in contrast with some previous readability studies on English as a first language.

5 0.84144604 97 emnlp-2012-Natural Language Questions for the Web of Data

Author: Mohamed Yahya ; Klaus Berberich ; Shady Elbassuoni ; Maya Ramanath ; Volker Tresp ; Gerhard Weikum

Abstract: The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources. Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

6 0.67869455 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

7 0.67675543 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

8 0.66843343 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

9 0.6440621 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types

10 0.57728994 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

11 0.57532579 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

12 0.57119668 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

13 0.56137222 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

14 0.55523551 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

15 0.55410963 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

16 0.55398148 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models

17 0.55091906 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques

18 0.54653138 19 emnlp-2012-An Entity-Topic Model for Entity Linking

19 0.54375887 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

20 0.54205227 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns