acl acl2010 acl2010-43 knowledge-graph by maker-knowledge-mining

43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies


Source: pdf

Author: Karin Murthy ; Tanveer A Faruquie ; L Venkata Subramaniam ; Hima Prasad K ; Mukesh Mohania

Abstract: We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for our approach and demonstrate its effectiveness and robustness in extracting knowledge from real-world data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. [sent-3, score-0.509]

2 A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. [sent-4, score-0.997]

3 However, building taxonomies manually for specific domains or data sources is time consuming and expensive. [sent-7, score-0.126]

4 Techniques to automatically deduce a taxonomy in an unsupervised manner are thus indispensable. [sent-8, score-0.509]

5 Automatic deduction of taxonomies consists of two tasks: extracting relevant terms to represent concepts of the taxonomy and discovering relationships between concepts. [sent-9, score-0.748]

6 For unstructured text, the extraction of relevant terms relies on information extraction methods (Etzioni et al. [sent-10, score-0.109]

7 Though producing accurate results, these approaches usually have low coverage for many domains and suffer from the problem of inconsistency between terms when connecting the instances as chains to form a taxonomy. [sent-16, score-0.237]

8 The second category of approaches uses clustering to discover terms and the relationships between them (Roy and Subramaniam, 2006), even if those relationships do not explicitly appear in the text. [sent-17, score-0.214]

9 Though these methods tackle inconsistency by addressing taxonomy deduction globally, the relationships extracted are often difficult to interpret by humans. [sent-18, score-0.613]

10 We show that for certain domains, the frequency with which terms appear in a corpus on their own and in conjunction with other terms induces a natural taxonomy. [sent-19, score-0.368]

11 We formally define the concept of a term-frequency-based taxonomy and show its applicability for an example application. [sent-20, score-0.509]

12 We present an unsupervised method to generate such a taxonomy from scratch and outline how domainspecific constraints can easily be integrated into the generation process. [sent-21, score-0.546]

13 For addresses from emerging geographies, no standard postal address scheme exists, and our objective was to produce a postal taxonomy that is useful in standardizing addresses (Kothari et al. [sent-24, score-1.228]

14 Specifically, the experiments were designed to investigate the effectiveness of our approach on noisy terms with lots of variations. [sent-26, score-0.109]

15 The results show that our method is able to induce a taxonomy without using any kind of lexical-semantic patterns. [sent-27, score-0.509]

16 One approach for taxonomy deduction is to use explicit expressions (Iwaska et al. [sent-28, score-0.573]

17 Supervised methods for taxonomy induction provide training instances with global semantic information about concepts (Fleischman and Hovy, 2002) and use bootstrapping to induce new seeds to extract further patterns (Cimiano et al. [sent-39, score-0.562]

18 Semi-supervised approaches start with known terms belonging to a category, construct context vectors of classified terms, and associate categories to previously unclassified terms depending on the similarity of their context (Tanev and Magnini, 2006). [sent-41, score-0.218]

19 Recently, lexical entailment has been used where the term is assigned to a category if its occurrence in the corpus can be replaced by the lexicalization of the category (Giuliano and Gliozzo, 2008). [sent-49, score-0.178]

20 In our method, terms are incrementally added to the taxonomy based on their support and context. [sent-50, score-0.655]

21 Association rule mining (Agrawal and Srikant, 1994) discovers interesting relations between terms, based on the frequency with which terms appear together. [sent-51, score-0.259]

22 However, the number of patterns generated is often huge, and constructing a taxonomy from all the patterns can be challenging. [sent-52, score-0.615]

23 In our approach, we employ similar concepts but make taxonomy construction part of the relationship discovery process. [sent-53, score-0.509]

24 For some application domains, a taxonomy is induced by the frequency with which terms appear in a corpus on their own and in combination with other terms. [sent-54, score-0.798]

25 Let f(t) denote the frequency of term t, that is, the number of times t appears in the corpus C. [sent-61, score-0.088]

26 Let F(t, T+, T−) denote the frequency of term t given a set of must-also-appear terms T+ and a set of cannot-also-appear terms T−. [sent-63, score-0.398]
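The conditional frequency F(t, T+, T−) can be sketched in a few lines. This is an illustration, not the paper's implementation; modeling each record as a set of terms is an assumption:

```python
def conditional_frequency(term, must_appear, cannot_appear, records):
    """A sketch of F(t, T+, T-): the number of records that contain
    `term` and all terms in `must_appear` but none in `cannot_appear`.
    Records are modeled as sets of terms (an assumption)."""
    return sum(1 for r in records
               if term in r and must_appear <= r and not (cannot_appear & r))
```

For example, with records {texas, houston}, {texas, dallas}, {vermont}, the frequency of "houston" given T+ = {texas} is 1, while the frequency of "texas" with T− = {houston} is also 1.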

27 t is the term at n, A(n) the ancestors of n, and P(n) the predecessors of n. [sent-68, score-0.128]

28 A TFIT has a root node with the special term 'root' and a conditional frequency of ∞. [sent-69, score-0.248]

29 Every other node holds the term with the highest conditional frequency in the context of the node's ancestors and predecessors. [sent-75, score-0.124]

30 Only terms with a conditional frequency above zero are added to a TFIT. [sent-76, score-0.197]

31 We show in Section 4 how a TFIT taxonomy can be automatically induced from a given corpus. [sent-77, score-0.509]

32 But before that, we show that TFITs are useful in practice and reflect a natural ordering of terms for application domains where the concept hierarchy is expressed through the frequency with which terms appear. [sent-78, score-0.438]

33 An address taxonomy is a key enabler for address standardization. [sent-80, score-0.739]

34 Figure 1 shows part of such an address taxonomy, where the root contains the most generic term and leaf-level nodes contain the most specific terms. [sent-81, score-0.716]

35 For emerging economies, building a standardized address taxonomy is a huge challenge. [sent-82, score-0.682]

36 First, new areas and with them new addresses constantly emerge. [sent-83, score-0.251]

37 Second, there are very limited conventions for specifying an address (Faruquie et al. [sent-84, score-0.115]

38 However, while many developing countries do not have a postal taxonomy, there is often no lack of address data to learn a taxonomy from. [sent-86, score-0.743]

39 Although Indian addresses tend to follow the general principle that more specific information is mentioned earlier, there is no fixed order for different elements of an address. [sent-88, score-0.189]

40 For example, the ZIP code of an address may be mentioned before or after the state information and, although ZIP code information is more specific than city information, it is generally mentioned later in the address. [sent-89, score-0.315]

41 Taking all this into account, there is often not enough structure available to automatically infer a taxonomy purely based on the structural or semantic aspects of an address. [sent-92, score-0.509]

42 However, for address data, the general-to-specific concept hierarchy is reflected in the frequency with which terms appear on their own and together with other terms. [sent-93, score-0.416]

43 It mostly holds that f(s) > f(d) > f(c) > f(z) where s is a state name, d is a district name, c is a city name, and z is a ZIP code. [sent-94, score-0.245]

44 However, sometimes the name of a large city may be more frequent than the name of a small state. [sent-95, score-0.273]

45 For example, in a given corpus, the term ’Houston’ (a populous US city) may appear more frequently than the term ’Vermont’ (a small US state). [sent-96, score-0.3]

46 To avoid ’Houston’ being picked as a node at the first level of the taxonomy (which should only contain states), the conditional-frequency constraint introduced in Section 3.1 helps. [sent-97, score-0.671]

47 ’Houston’s state ’Texas’ (which is more frequent) is picked before ’Houston’. [sent-99, score-0.103]

48 After ’Texas’ is picked, it appears in the ”cannot-also-appear” list for all further siblings on the first level, thus giving ’Houston’ a conditional frequency of zero. [sent-100, score-0.152]

49 We show in Section 5 that an address taxonomy can be inferred by generating a TFIT taxonomy. [sent-101, score-0.624]

50 // For initialization, T+ and T− are empty and l, w are zero. genTFIT(T+, T−, C, l, w): // select most frequent term: tnext = tj such that F(tj, T+, T−) is maximal amongst all tj ∈ C; fnext = F(tnext, T+, T−); if fnext ≥ support then // output node (tnext, l, w). [sent-105, score-0.391]

51 // Generate child node: genTFIT(T+ ∪ {tnext}, T−, C, l + 1, w); // Generate sibling node: genTFIT(T+, T− ∪ {tnext}, C, l, w + 1); end if. To generate a TFIT taxonomy as defined in Section 3. [sent-108, score-0.614]

52 1, we recursively pick the most frequent term given previously chosen terms. [sent-109, score-0.146]

53 With each call of genTFIT, a new node n in the taxonomy is created with (t, l, w), where t is the most frequent term given T+ and T−, and l and w capture the position in the taxonomy. [sent-112, score-0.723]
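The recursive procedure in sentences 50–53 can be rendered as a small self-contained Python sketch. The record representation, tie-breaking rule, and vocabulary handling are assumptions for illustration, not the paper's implementation:

```python
def freq(term, must, cannot, records):
    """F(t, T+, T-): number of records containing `term` and every
    term in `must` (T+) but no term in `cannot` (T-)."""
    return sum(1 for r in records
               if term in r and must <= r and not (cannot & r))

def gen_tfit(must, cannot, records, level, width, support, out):
    """Mirror of genTFIT: emit (term, level, width) nodes by picking
    the most frequent term in context, then recursing on the child
    call (term added to T+) and the sibling call (term added to T-)."""
    vocab = set().union(*records) - must - cannot
    scored = [(freq(t, must, cannot, records), t) for t in vocab]
    if not scored:
        return
    f_next, t_next = max(scored)  # ties broken deterministically by term
    if f_next >= support:
        out.append((t_next, level, width))           # output node (t, l, w)
        gen_tfit(must | {t_next}, cannot, records,   # child node
                 level + 1, width, support, out)
        gen_tfit(must, cannot | {t_next}, records,   # sibling node
                 level, width + 1, support, out)
```

On a toy corpus where "texas" co-occurs with "houston" and "dallas", and support is 2, the sketch emits "texas" at level 0 and "houston" as its child at level 1, matching the behavior described for the address example.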

54 Instead of adding all terms with a conditional frequency above zero, we only add terms with a conditional frequency equal to or higher than support. [sent-115, score-0.394]

55 For example, limiting the depth of the taxonomy by introducing a maxLevel constraint and checking before each recursive call whether maxLevel has been reached is a taxonomy-level constraint. [sent-121, score-0.539]

56 A node-level constraint applies to each node and affects the way the frequency of terms is determined. [sent-122, score-0.295]

57 For our example application, we introduce the following node-level constraint: at each node we only count terms that appear at specific positions in records with respect to the current level of the node. [sent-123, score-0.308]

58 Specifically, we slide (or incrementally increase) a window over the address records starting from the end. [sent-124, score-0.216]

59 For example, when picking the term ’Washington’ as a state name, occurrences of ’Washington’ as city or street name are ignored. [sent-125, score-0.306]

60 That is, we divide all positions in an address by the average length of an address (which is 10 for our 40 million addresses). [sent-128, score-0.23]
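Sentences 58–60 only sketch the windowing, so the following illustration fills in the details with assumptions: the window-growth rate and the bound are invented here, and only the division by the average address length comes from the text:

```python
AVG_LEN = 10  # average address length reported for the 40 million addresses

def in_window(position_from_end, level, window_growth=1):
    """Hypothetical sketch of the node-level position constraint: an
    occurrence counts only if its position from the end of the record,
    divided by the average address length (as the text describes),
    lies inside a window that widens with the taxonomy level. The
    `window_growth` factor and the bound are assumptions; the paper
    does not give the exact formula."""
    normalized = position_from_end / AVG_LEN
    return normalized <= window_growth * (level + 1) / AVG_LEN
```

Under these assumptions, an occurrence of ’Washington’ at the very end of a record counts when picking states (level 0), while the same term ten positions earlier (e.g., as a street name) is ignored, matching the intent of sentence 59.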

61 In addition to syntactical constraints, semantic constraints can be integrated by classifying terms for use when picking the next frequent term. [sent-131, score-0.193]

62 In our example application, markers tend to appear much more often than any proper noun. [sent-132, score-0.094]

63 For example, the term ’Road’ appears in almost all addresses, and might be picked up as the most frequent term very early in the process. [sent-133, score-0.302]

64 Thus, it is beneficial to ignore marker terms during taxonomy generation and to add them in a post-processing step. [sent-134, score-0.618]

65 Misspelled terms are generally infrequent and as such will not become part of the taxonomy. [sent-137, score-0.109]

66 Incomplete addresses partially contribute to the taxonomy and only cause a problem if the same information is missing too often. [sent-139, score-0.728]

67 For example, if more than support addresses with the city ’Houston’ are missing the state ’Texas’, then ’Houston’ may become a node at the first level and appear to be a state. [sent-140, score-0.496]

68 We present an evaluation of our approach for address data from an emerging economy. [sent-142, score-0.173]

69 Each address record was given to us as a single string and was first tokenized into a sequence of terms as shown in Table 1. [sent-147, score-0.224]

70 We used tools to detect synonyms with the same context to generate a list of rules to map terms to a standard form (Lin, 1998). [sent-150, score-0.181]

71 We also used a list of keywords to classify some terms as markers such as ’Road’ and ’Nagar’ shown in Table 1. [sent-152, score-0.141]

72 To evaluate the precision and recall, we also retrieved post office addresses from India Post, cleaned them, and organized them in a tree. [sent-155, score-0.386]

73 Second, we use our approach to enrich the existing hierarchy created from post office addresses with additional area terms. [sent-156, score-0.418]

74 To validate the result, we also retrieved data about which area names appear within a ZIP code. [sent-157, score-0.128]

75 We generated a taxonomy O using all 40 million addresses. [sent-160, score-0.509]

76 We compare the terms assigned to the category levels district and taluk in O with the tree P constructed from post office addresses. [sent-161, score-0.408]

77 Each district and taluk has at least one post office. [sent-162, score-0.424]

78 Thus P covers all districts and taluks and allows us to test coverage and precision. [sent-163, score-0.168]

79 We compute the precision and recall for each category level CL. [sent-164, score-0.119]

80 For taluk it is lower because a major part of the data belongs to urban areas where taluk information is missing. [sent-177, score-0.51]

81 The precision seems low, but note that in almost 75% of the addresses either district or taluk information is missing or noisy. [sent-178, score-0.626]

82 Again, both districts and taluks appear at the next level of the taxonomy. [sent-181, score-0.202]

83 For a support of 200 there are 19 entries in O of which all but two appear in P as district or taluk. [sent-182, score-0.234]

84 One entry is a taluk that actually belongs to Maharashtra and one entry is a name variation of a taluk in P. [sent-183, score-0.522]

85 There were not enough addresses to get a good coverage of all districts and taluks. [sent-184, score-0.301]

86 We used P and ran our algorithm for each branch in P to include area information. [sent-186, score-0.1]

87 The recall is low because many addresses do not mention a ZIP code or use an incorrect ZIP code. [sent-188, score-0.262]

88 For each detected area we compared whether the area is also listed on whereincity. [sent-192, score-0.132]

89 com, part of a post office name (PO), or shown on Google Maps. [sent-193, score-0.195]

90 Out of the unconfirmed terms Fanaswadi and MarineDrive seem to be genuine area names but we could not confirm DhakurdwarRoad. [sent-195, score-0.175]

91 The term ’th’ is due to our tokenization process. (Table 3 lists the area names found for ZIP codes 400002 (top) and 400004 (bottom).) [sent-196, score-0.092]

92 16 correct terms out of 18 terms results in a precision of 89%. [sent-197, score-0.266]

93 We also ran experiments to measure the coverage of area detection for Mumbai without using ZIP codes. [sent-198, score-0.094]

94 However, again the precision is low because quite a few of those areas are actually taluk names. [sent-200, score-0.334]

95 Using a large number of addresses is necessary to achieve good recall and precision. [sent-201, score-0.217]

96 In this paper, we presented a novel approach to generate a taxonomy for data where terms exhibit an inherent frequency-based hierarchy. [sent-202, score-0.655]

97 We showed that term frequency can be used to generate a meaningful taxonomy from address records. [sent-203, score-0.841]

98 The presented approach can also be used to extend an existing taxonomy, which is a big advantage for emerging countries where geographical areas evolve continuously. [sent-204, score-0.664]

99 While we have evaluated our approach on address data, it is applicable to all data sources where the inherent hierarchical structure is encoded in the frequency with which terms appear on their own and together with other terms. [sent-205, score-0.374]

100 Preliminary experiments on real-time analysts’ stock market tips produced a taxonomy of (TV station, Analyst, Affiliation) with decent precision and recall. [sent-206, score-0.588]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('taxonomy', 0.509), ('tfit', 0.252), ('zip', 0.245), ('taluk', 0.224), ('gentfit', 0.196), ('addresses', 0.189), ('houston', 0.172), ('district', 0.135), ('address', 0.115), ('faruquie', 0.112), ('subramaniam', 0.112), ('tnext', 0.112), ('terms', 0.109), ('term', 0.092), ('frequency', 0.088), ('cimiano', 0.084), ('districts', 0.084), ('iwaska', 0.084), ('maharashtra', 0.084), ('mukesh', 0.084), ('postal', 0.084), ('tanveer', 0.084), ('venkata', 0.084), ('indian', 0.08), ('name', 0.074), ('city', 0.071), ('records', 0.069), ('node', 0.068), ('area', 0.066), ('taxonomies', 0.066), ('post', 0.065), ('picked', 0.064), ('deduction', 0.064), ('areas', 0.062), ('appear', 0.062), ('domains', 0.06), ('emerging', 0.058), ('kozareva', 0.057), ('office', 0.056), ('fthrsom', 0.056), ('hima', 0.056), ('kothari', 0.056), ('lucja', 0.056), ('maxlevel', 0.056), ('mohania', 0.056), ('peactth', 0.056), ('roomot', 0.056), ('taluks', 0.056), ('tfits', 0.056), ('tooo', 0.056), ('girju', 0.055), ('frequent', 0.054), ('patterns', 0.053), ('prasad', 0.053), ('agrawal', 0.049), ('giuliano', 0.049), ('inl', 0.049), ('tanev', 0.049), ('tj', 0.048), ('texas', 0.048), ('precision', 0.048), ('code', 0.045), ('callan', 0.045), ('tlo', 0.045), ('category', 0.043), ('mumbai', 0.042), ('hui', 0.042), ('jamie', 0.042), ('landmark', 0.042), ('psa', 0.042), ('hierarchy', 0.042), ('fleischman', 0.04), ('inconsistency', 0.04), ('state', 0.039), ('road', 0.038), ('support', 0.037), ('generate', 0.037), ('ancestors', 0.036), ('india', 0.036), ('countries', 0.035), ('rr', 0.035), ('synonyms', 0.035), ('yang', 0.035), ('branch', 0.034), ('roy', 0.033), ('snow', 0.032), ('markers', 0.032), ('bunescu', 0.032), ('window', 0.032), ('market', 0.031), ('lin', 0.031), ('constraint', 0.03), ('picking', 0.03), ('application', 0.03), ('missing', 0.03), ('amounts', 0.03), ('pages', 0.029), ('recall', 0.028), ('coverage', 0.028), ('international', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

Author: Karin Murthy ; Tanveer A Faruquie ; L Venkata Subramaniam ; Hima Prasad K ; Mukesh Mohania

Abstract: We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for our approach and demonstrate its effectiveness and robustness in extracting knowledge from real-world data.

2 0.14745703 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation

Author: Stephen Tratz ; Eduard Hovy

Abstract: The automatic interpretation of noun-noun compounds is an important subproblem within many natural language processing applications and is an area of increasing interest. The problem is difficult, with disagreement regarding the number and nature of the relations, low inter-annotator agreement, and limited annotated data. In this paper, we present a novel taxonomy of relations that integrates previous relations, the largest publicly-available annotated dataset, and a supervised classification method for automatic noun compound interpretation.

3 0.10192145 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds

Author: Ashwin Ittoo ; Gosse Bouma

Abstract: An important relation in information extraction is the part-whole relation. Ontological studies mention several types of this relation. In this paper, we show that the traditional practice of initializing minimally-supervised algorithms with a single set that mixes seeds of different types fails to capture the wide variety of part-whole patterns and tuples. The results obtained with mixed seeds ultimately converge to one of the part-whole relation types. We also demonstrate that all the different types of part-whole relations can still be discovered, regardless of the type characterized by the initializing seeds. We performed our experiments with a state-ofthe-art information extraction algorithm. 1

4 0.090805031 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: A challenging problem in open information extraction and text mining is the learning of the selectional restrictions of semantic relations. We propose a minimally supervised bootstrapping algorithm that uses a single seed and a recursive lexico-syntactic pattern to learn the arguments and the supertypes of a diverse set of semantic relations from the Web. We evaluate the performance of our algorithm on multiple semantic relations expressed using “verb”, “noun”, and “verb prep” lexico-syntactic patterns. Humanbased evaluation shows that the accuracy of the harvested information is about 90%. We also compare our results with existing knowledge base to outline the similarities and differences of the granularity and diversity of the harvested knowledge.

5 0.079812184 138 acl-2010-Hunting for the Black Swan: Risk Mining from Text

Author: Jochen Leidner ; Frank Schilder

Abstract: In the business world, analyzing and dealing with risk permeates all decisions and actions. However, to date, risk identification, the first step in the risk management cycle, has always been a manual activity with little to no intelligent software tool support. In addition, although companies are required to list risks to their business in their annual SEC filings in the USA, these descriptions are often very highlevel and vague. In this paper, we introduce Risk Mining, which is the task of identifying a set of risks pertaining to a business area or entity. We argue that by combining Web mining and Information Extraction (IE) techniques, risks can be detected automatically before they materialize, thus providing valuable business intelligence. We describe a system that induces a risk taxonomy with concrete risks (e.g., interest rate changes) at its leaves and more abstract risks (e.g., financial risks) closer to its root node. The taxonomy is induced via a bootstrapping algorithms starting with a few seeds. The risk taxonomy is used by the system as input to a risk monitor that matches risk mentions in financial documents to the abstract risk types, thus bridging a lexical gap. Our system is able to automatically generate company specific “risk maps”, which we demonstrate for a corpus of earnings report conference calls.

6 0.068827644 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

7 0.061793808 166 acl-2010-Learning Word-Class Lattices for Definition and Hypernym Extraction

8 0.059207194 127 acl-2010-Global Learning of Focused Entailment Graphs

9 0.056751195 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining

10 0.056565754 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

11 0.056096837 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

12 0.055194553 139 acl-2010-Identifying Generic Noun Phrases

13 0.054151095 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

14 0.051387008 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences

15 0.05004767 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

16 0.047163751 33 acl-2010-Assessing the Role of Discourse References in Entailment Inference

17 0.04634691 44 acl-2010-BabelNet: Building a Very Large Multilingual Semantic Network

18 0.0453251 156 acl-2010-Knowledge-Rich Word Sense Disambiguation Rivaling Supervised Systems

19 0.045127805 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

20 0.044584092 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.152), (1, 0.062), (2, -0.024), (3, -0.012), (4, 0.041), (5, 0.007), (6, 0.052), (7, 0.022), (8, -0.01), (9, -0.028), (10, -0.037), (11, 0.02), (12, -0.05), (13, -0.072), (14, 0.039), (15, 0.076), (16, 0.047), (17, 0.056), (18, 0.023), (19, -0.007), (20, -0.003), (21, 0.069), (22, -0.022), (23, 0.029), (24, -0.054), (25, -0.055), (26, -0.054), (27, 0.076), (28, -0.024), (29, 0.026), (30, -0.02), (31, 0.058), (32, 0.04), (33, 0.007), (34, 0.015), (35, -0.006), (36, 0.041), (37, 0.043), (38, -0.071), (39, 0.116), (40, -0.02), (41, -0.025), (42, 0.083), (43, 0.05), (44, 0.121), (45, 0.054), (46, -0.051), (47, -0.09), (48, 0.171), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93175995 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

Author: Karin Murthy ; Tanveer A Faruquie ; L Venkata Subramaniam ; Hima Prasad K ; Mukesh Mohania

Abstract: We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for our approach and demonstrate its effectiveness and robustness in extracting knowledge from real-world data.

2 0.83082235 138 acl-2010-Hunting for the Black Swan: Risk Mining from Text

Author: Jochen Leidner ; Frank Schilder

Abstract: In the business world, analyzing and dealing with risk permeates all decisions and actions. However, to date, risk identification, the first step in the risk management cycle, has always been a manual activity with little to no intelligent software tool support. In addition, although companies are required to list risks to their business in their annual SEC filings in the USA, these descriptions are often very highlevel and vague. In this paper, we introduce Risk Mining, which is the task of identifying a set of risks pertaining to a business area or entity. We argue that by combining Web mining and Information Extraction (IE) techniques, risks can be detected automatically before they materialize, thus providing valuable business intelligence. We describe a system that induces a risk taxonomy with concrete risks (e.g., interest rate changes) at its leaves and more abstract risks (e.g., financial risks) closer to its root node. The taxonomy is induced via a bootstrapping algorithms starting with a few seeds. The risk taxonomy is used by the system as input to a risk monitor that matches risk mentions in financial documents to the abstract risk types, thus bridging a lexical gap. Our system is able to automatically generate company specific “risk maps”, which we demonstrate for a corpus of earnings report conference calls.

3 0.81754869 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds

Author: Ashwin Ittoo ; Gosse Bouma

Abstract: An important relation in information extraction is the part-whole relation. Ontological studies mention several types of this relation. In this paper, we show that the traditional practice of initializing minimally-supervised algorithms with a single set that mixes seeds of different types fails to capture the wide variety of part-whole patterns and tuples. The results obtained with mixed seeds ultimately converge to one of the part-whole relation types. We also demonstrate that all the different types of part-whole relations can still be discovered, regardless of the type characterized by the initializing seeds. We performed our experiments with a state-ofthe-art information extraction algorithm. 1

4 0.65274155 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation

Author: Stephen Tratz ; Eduard Hovy

Abstract: The automatic interpretation of noun-noun compounds is an important subproblem within many natural language processing applications and is an area of increasing interest. The problem is difficult, with disagreement regarding the number and nature of the relations, low inter-annotator agreement, and limited annotated data. In this paper, we present a novel taxonomy of relations that integrates previous relations, the largest publicly-available annotated dataset, and a supervised classification method for automatic noun compound interpretation.

5 0.63788807 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: A challenging problem in open information extraction and text mining is the learning of the selectional restrictions of semantic relations. We propose a minimally supervised bootstrapping algorithm that uses a single seed and a recursive lexico-syntactic pattern to learn the arguments and the supertypes of a diverse set of semantic relations from the Web. We evaluate the performance of our algorithm on multiple semantic relations expressed using “verb”, “noun”, and “verb prep” lexico-syntactic patterns. Humanbased evaluation shows that the accuracy of the harvested information is about 90%. We also compare our results with existing knowledge base to outline the similarities and differences of the granularity and diversity of the harvested knowledge.

6 0.62279153 166 acl-2010-Learning Word-Class Lattices for Definition and Hypernym Extraction

7 0.62073863 64 acl-2010-Complexity Assumptions in Ontology Verbalisation

8 0.52690232 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction

9 0.50482368 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.

10 0.47806588 61 acl-2010-Combining Data and Mathematical Models of Language Change

11 0.46546155 139 acl-2010-Identifying Generic Noun Phrases

12 0.46368051 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

13 0.42261323 248 acl-2010-Unsupervised Ontology Induction from Text

14 0.41968945 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

15 0.41580394 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining

16 0.40483141 258 acl-2010-Weakly Supervised Learning of Presupposition Relations between Verbs

17 0.39724672 176 acl-2010-Mood Patterns and Affective Lexicon Access in Weblogs

18 0.38804018 112 acl-2010-Extracting Social Networks from Literary Fiction

19 0.38724348 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

20 0.38416538 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.016), (25, 0.062), (28, 0.011), (39, 0.012), (42, 0.029), (59, 0.102), (72, 0.019), (73, 0.046), (76, 0.017), (78, 0.039), (80, 0.024), (82, 0.309), (83, 0.082), (84, 0.03), (98, 0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7605654 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

Author: Karin Murthy ; Tanveer A Faruquie ; L Venkata Subramaniam ; Hima Prasad K ; Mukesh Mohania

Abstract: We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for our approach and demonstrate its effectiveness and robustness in extracting knowledge from real-world data.

2 0.71745157 257 acl-2010-WSD as a Distributed Constraint Optimization Problem

Author: Siva Reddy ; Abhilash Inumella

Abstract: This work models Word Sense Disambiguation (WSD) problem as a Distributed Constraint Optimization Problem (DCOP). To model WSD as a DCOP, we view information from various knowledge sources as constraints. DCOP algorithms have the remarkable property to jointly maximize over a wide range of utility functions associated with these constraints. We show how utility functions can be designed for various knowledge sources. For the purpose of evaluation, we modelled all words WSD as a simple DCOP problem. The results are competitive with state-of-the-art knowledge-based systems.

3 0.52911282 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

Author: Mohit Bansal ; Dan Klein

Abstract: We present a simple but accurate parser which exploits both large tree fragments and symbol refinement. We parse with all fragments of the training set, in contrast to much recent work on tree selection in data-oriented parsing and tree-substitution grammar learning. We require only simple, deterministic grammar symbol refinement, in contrast to recent work on latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery, instead parsing input sentences as character streams. Despite its simplicity, our parser achieves accuracies of over 88% F1 on the standard English WSJ task, which is competitive with substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional specific contributions center on making implicit all-fragments parsing efficient, including a coarse-to-fine inference scheme and a new graph encoding.

4 0.525913 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

5 0.52434188 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

Author: Dmitry Davidov ; Ari Rappoport

Abstract: We present a novel framework for automated extraction and approximation of numerical object attributes such as height and weight from the Web. Given an object-attribute pair, we discover and analyze attribute information for a set of comparable objects in order to infer the desired value. This allows us to approximate the desired numerical values even when no exact values can be found in the text. Our framework makes use of relation defining patterns and WordNet similarity information. First, we obtain from the Web and WordNet a list of terms similar to the given object. Then we retrieve attribute values for each term in this list, and information that allows us to compare different objects in the list and to infer the attribute value range. Finally, we combine the retrieved data for all terms from the list to select or approximate the requested value. We evaluate our method using automated question answering, WordNet enrichment, and comparison with answers given in Wikipedia and by leading search engines. In all of these, our framework provides a significant improvement.
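The fallback idea in this abstract, approximating a value from comparable objects when no exact value can be found, can be sketched as a toy aggregation step. The retrieved values and the `approximate` helper are hypothetical stand-ins, not the paper's actual pattern- and WordNet-based framework:

```python
import statistics

# Hypothetical height values (in metres) retrieved for terms similar to the target.
retrieved = {"giraffe": [5.5, 5.8], "okapi": [1.5], "camel": [2.1, 1.9]}

def approximate(values_by_term, target):
    """Return the median of the target's own retrieved values if any exist;
    otherwise approximate from the pooled values of comparable terms."""
    if values_by_term.get(target):
        return statistics.median(values_by_term[target])
    pooled = [v for vals in values_by_term.values() for v in vals]
    return statistics.median(pooled)

print(approximate(retrieved, "giraffe"))  # exact values were retrieved
print(approximate(retrieved, "zebra"))    # approximated from comparable objects
```

The real framework also weights comparable objects by WordNet similarity rather than pooling them uniformly; the median here is only the simplest robust aggregate.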

6 0.52403331 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

7 0.52390647 169 acl-2010-Learning to Translate with Source and Target Syntax

8 0.52303421 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

9 0.52291101 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

10 0.52223009 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification

11 0.52222282 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

12 0.52184993 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

13 0.52109933 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

14 0.52068895 158 acl-2010-Latent Variable Models of Selectional Preference

15 0.51952845 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

16 0.51942039 248 acl-2010-Unsupervised Ontology Induction from Text

17 0.51902103 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

18 0.51897883 162 acl-2010-Learning Common Grammar from Multilingual Corpus

19 0.51851189 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

20 0.51849508 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences