12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web


Source: pdf

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: Although many algorithms have been developed to harvest lexical resources, few organize the mined terms into taxonomies. We propose (1) a semi-supervised algorithm that uses a root concept, a basic level concept, and recursive surface patterns to learn automatically from the Web hyponym-hypernym pairs subordinated to the root; (2) a Web based concept positioning procedure to validate the learned pairs’ is-a relations; and (3) a graph algorithm that derives from scratch the integrated taxonomy structure of all the terms. Comparing results with WordNet, we find that the algorithm misses some concepts and links, but also that it discovers many additional ones lacking in WordNet. We evaluate the taxonomization power of our method on reconstructing parts of the WordNet taxonomy. Experiments show that starting from scratch, the algorithm can reconstruct 62% of the WordNet taxonomy for the regions tested.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Although many algorithms have been developed to harvest lexical resources, few organize the mined terms into taxonomies. [sent-2, score-0.175]

2 Comparing results with WordNet, we find that the algorithm misses some concepts and links, but also that it discovers many additional ones lacking in WordNet. [sent-4, score-0.154]

3 We evaluate the taxonomization power of our method on reconstructing parts of the WordNet taxonomy. [sent-5, score-0.298]

4 Experiments show that starting from scratch, the algorithm can reconstruct 62% of the WordNet taxonomy for the regions tested. [sent-6, score-0.506]

5 1 Introduction A variety of NLP tasks, including inference, textual entailment (Glickman et al. [sent-7, score-0.044]

6 , 1999), rely on semantic knowledge derived from term taxonomies and thesauri such as WordNet. [sent-10, score-0.369]

7 However, the coverage of WordNet is still limited in many regions (even well-studied ones such as the concepts and instances below Animals and People), as noted by researchers such as (Pennacchiotti and Pantel, 2006) and (Hovy et al. [sent-11, score-0.162]

8 This happens because WordNet and most other existing taxonomies are manually created, which makes them difficult to maintain in rapidly changing domains, and (in the face of taxonomic complexity) makes them hard to build with consistency. [sent-13, score-0.568]

9 To surmount these problems, it would be advantageous to have an automatic procedure that can not only augment existing resources but can also produce taxonomies for existing and new domains and tasks starting from scratch. [sent-14, score-0.48]

10 The main stages of automatic taxonomy induction are term extraction and term organization. [sent-15, score-0.703]

11 In recent years there has been a substantial amount of work on term extraction, including semantic class learning (Hearst, 1992; Riloff and Shepherd, 1997; Etzioni et al. [sent-16, score-0.142]

12 , 2007), and creation of concept lists (Katz and Lin, 2003). [sent-20, score-0.299]

13 Various attempts have been made to learn the taxonomic organization of concepts (Widdows, 2003; Snow et al. [sent-21, score-0.57]

14 Among the most common is to start with a good ontology and then to try to position the missing concepts into it. [sent-23, score-0.123]

15 , 2006) maximize the conditional probability of hyponym-hypernym relations given certain evidence, while (Yang and Callan, 2009) combines heterogeneous features like context, co-occurrence, and surface patterns to produce a more-inclusive inclusion ranking formula. [sent-25, score-0.2]

16 The obtained results are promising, but the problem of how to organize the gathered knowledge when there is no initial taxonomy, or when the initial taxonomy is grossly impoverished, still remains. [sent-26, score-0.458]

17 The major problem in performing taxonomy construction from scratch is that overall concept positioning is not trivial. [sent-29, score-0.82]

18 It is difficult to discover whether concepts are unrelated, subordinated, or parallel to each other. [sent-30, score-0.176]

19 In this paper, we address the following question: How can one induce the taxonomic organization of concepts in a given domain starting from scratch? [sent-31, score-0.538]

20 The contributions of this paper are as follows: An automatic procedure for harvesting hyponym-hypernym pairs given a domain of interest. [sent-32, score-0.249]

21 A ranking mechanism for validating the learned is-a relations between the pairs. [sent-33, score-0.065]

22 A graph-based approach for inducing the taxonomic organization of the harvested terms starting from scratch. [sent-34, score-0.081]

23 An experiment on reconstructing WordNet’s taxonomy for given domains. [sent-35, score-0.44]

24 Before focusing on the harvesting and taxonomy induction algorithms, we are going to describe some basic terminology following (Hovy et al. [sent-36, score-0.732]

25 A term is an English word (for our current purposes, a noun or a proper name). [sent-38, score-0.108]

26 A concept is an item in the classification taxonomy we are building. [sent-39, score-0.637]

27 A root concept is a fairly general concept which is located on the high level of the taxonomy. [sent-40, score-0.533]

28 A basic-level concept corresponds to the Basic Level categories defined in Prototype Theory in Psychology (Rosch, 1978). [sent-41, score-0.219]

29 An instance is an item in the classification taxonomy that is more specific than a concept. [sent-43, score-0.418]

30 2 Related Work The first stage of automatic taxonomy induction, term and relation extraction, is relatively well understood. [sent-50, score-0.486]

31 Methods have matured to the point of achieving high accuracy (Girju et al. [sent-51, score-0.033]

32 The produced output typically contains flat lists of terms and/or ground instance facts (lion is-a mammal) and general relation types (mammal is-a animal). [sent-54, score-0.079]

33 Most approaches use either clustering or patterns to mine knowledge from structured and unstructured text. [sent-55, score-0.088]

34 Clustering approaches (Lin, 1998; Lin and Pantel, 2002; Davidov and Rappoport, 2006) are fully unsupervised and discover relations that are not directly expressed in text. [sent-56, score-0.102]

35 Their main drawback is that they may or may not produce the term types and granularities useful to the user. [sent-57, score-0.143]

36 In contrast, pattern-based approaches harvest information with high accuracy, but they require a set of seeds and surface patterns to initiate the learning process. [sent-58, score-0.204]

37 These methods are successfully used to collect semantic lexicons (Riloff and Shepherd, 1997; Etzioni et al. [sent-59, score-0.034]

38 , 2007), concept lists (Katz and Lin, 2003), and relations between terms, such as hypernyms (Ritter et al. [sent-62, score-0.452]

39 However, simple term lists are not enough to solve many problems involving natural language. [sent-66, score-0.149]

40 Terms may be augmented with information that is required for knowledge-intensive tasks such as textual entailment (Glickman et al. [sent-67, score-0.044]

41 , 2010) learn the selectional restrictions of semantic relations, and (Pennacchiotti and Pantel, 2006) ontologize the learned arguments using WordNet. [sent-72, score-0.071]

42 Taxonomizing the terms is a very powerful method to leverage added information. [sent-73, score-0.038]

43 Subordinated terms (hyponyms) inherit information from their superordinates (hypernyms), making it unnecessary to learn all relevant information over and over for every term in the language. [sent-74, score-0.217]

44 But despite many attempts, no ‘correct’ taxonomization has ever been constructed for the terms of, say, English. [sent-75, score-0.239]

45 Typically, people build term taxonomies (and/or richer structures like ontologies) for particular purposes, using specific taxonomization criteria. [sent-76, score-0.566]

46 Different tasks and criteria produce different taxonomies, even when using the same basic level concepts. [sent-77, score-0.094]

47 This is because most basic level concepts admit to multiple perspectives, while each task focuses on one, or at most two, perspectives at a time. [sent-78, score-0.258]

48 For example, a dolphin is a Mammal (and not a Fish) to a biologist, but is a Fish (and hence not a Mammal) to a fisherman or anyone building or visiting an aquarium. [sent-79, score-0.034]

49 Attempts at producing a single multi-perspective taxonomy fail due to the complexity of interaction among perspectives, and people are notoriously bad at constructing taxonomies adherent to a single perspective when given terms from multiple perspectives. [sent-81, score-0.741]

50 This issue and the major alternative principles for taxonomization are discussed in (Hovy, 2002). [sent-82, score-0.201]

51 It is therefore not surprising that the second stage of automated taxonomy induction is harder to achieve. [sent-83, score-0.456]

52 As mentioned, most attempts to learn taxonomy structures start with a reasonably complete taxonomy and then insert the newly learned terms into it, one term at a time (Widdows, 2003; Pasca, 2004; Snow et al. [sent-84, score-0.992]

53 (Yang and Callan, 2009) introduce a taxonomy induction framework which combines the power of surface patterns and clustering through combining numerous heterogeneous features. [sent-88, score-0.64]

54 Still, one would like a procedure to organize the harvested terms into a taxonomic structure starting fresh (i. [sent-89, score-0.535]

55 We propose an approach that bridges the gap between the term extraction algorithms that focus mainly on harvesting but do not taxonomize, and those that accept a new term and seek to enrich an already existing taxonomy. [sent-92, score-0.464]

56 Our aim is to perform both stages: to extract the terms of a given domain and to induce their taxonomic organization without any initial taxonomic structure and information. [sent-93, score-0.671]

57 This task is challenging because it is not trivial to discover both the hierarchically related and the parallel (perspectival) organizations of concepts. [sent-94, score-0.053]

58 Achieving this goal can provide the research community with the ability to produce taxonomies for domains for which currently there are no existing or manually created ontologies. [sent-95, score-0.325]

59 Starting with the root concept animal and the basic level concept lion, the algorithm learns new terms like tiger, puma, deer, donkey of class animal. [sent-108, score-0.746]

60 Next for each basic level concept, the algorithm harvests hypernyms and learns that a lion is-a vertebrate, chordate, feline and mammal. [sent-109, score-0.404]

61 Finally, the taxonomic structure of each basic level concept and its hypernyms is induced: animal→chordate→vertebrate→mammal→feline→lion. [sent-110, score-0.697]

62 2 Knowledge Harvesting The main objective of our work is not the creation of a new harvesting algorithm, but rather the organization of the harvested information in a taxonomy structure starting from scratch. [sent-112, score-0.446]

63 There are many algorithms for hyponym and hypernym harvesting from the Web. [sent-113, score-0.217]

64 In our experiments, we use the doubly-anchored lexico-syntactic patterns and bootstrapping algorithm introduced by (Kozareva et al. [sent-114, score-0.055]

65 , 2005; Pasca, 2004); and (4) adapts easily to different domains. [sent-118, score-0.034]

66 The general framework of the knowledge harvesting algorithm is shown in Figure 2. [sent-119, score-0.217]
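
The harvesting and induction steps summarized in the sentences above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `harvest` and `longest_chain` are hypothetical, the text snippets stand in for Web search results, the pattern shape "<root> such as <seed> and *" is one commonly described form of a doubly-anchored pattern, and choosing the longest root-to-concept path as the taxonomic chain is an assumption that this summary does not spell out.

```python
import re
from functools import lru_cache


def harvest(root, seed, snippets):
    """Collect new terms from snippets matching '<root> such as <seed> and X'.

    Hypothetical helper: both the root concept and the seed anchor the pattern,
    and X is the newly learned term. Plural 's' is tolerated on both anchors.
    """
    pattern = re.compile(
        rf"\b{re.escape(root)}s? such as {re.escape(seed)}s? and (\w+)",
        re.IGNORECASE,
    )
    found = []
    for text in snippets:
        for match in pattern.finditer(text):
            term = match.group(1).lower()
            if term not in found:  # keep first-seen order, no duplicates
                found.append(term)
    return found


def longest_chain(edges, root, target):
    """Return the longest path from root to target.

    edges is a list of (hypernym, hyponym) pairs. Assumption: the pairs form
    a directed acyclic graph; a cycle would make the recursion never end.
    """
    children = {}
    for hypernym, hyponym in edges:
        children.setdefault(hypernym, []).append(hyponym)

    @lru_cache(maxsize=None)
    def best(node):
        if node == target:
            return (node,)
        # Keep only children from which the target is reachable.
        paths = [p for p in (best(child) for child in children.get(node, [])) if p]
        if not paths:
            return ()
        return (node,) + max(paths, key=len)

    return list(best(root))


# Toy input: hand-written snippets and is-a pairs, purely illustrative.
snippets = [
    "Animals such as lions and tigers live in the wild.",
    "Animals such as lion and puma are predators.",
]
edges = [
    ("animal", "lion"), ("animal", "mammal"), ("animal", "chordate"),
    ("chordate", "vertebrate"), ("chordate", "lion"),
    ("vertebrate", "mammal"), ("vertebrate", "lion"),
    ("mammal", "feline"), ("mammal", "lion"), ("feline", "lion"),
]
new_terms = harvest("animal", "lion", snippets)
chain = longest_chain(edges, "animal", "lion")
```

With these toy inputs, the direct link animal→lion is passed over in favor of the path that goes through every intermediate hypernym, which mirrors the chain given in sentence 61.
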


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('taxonomy', 0.378), ('taxonomic', 0.276), ('mammal', 0.241), ('taxonomies', 0.227), ('concept', 0.219), ('harvesting', 0.217), ('taxonomization', 0.201), ('kozareva', 0.172), ('callan', 0.161), ('pennacchiotti', 0.155), ('hypernyms', 0.143), ('scratch', 0.143), ('pasca', 0.138), ('pantel', 0.137), ('concepts', 0.123), ('subordinated', 0.121), ('hovy', 0.117), ('term', 0.108), ('lion', 0.103), ('root', 0.095), ('tiger', 0.093), ('snow', 0.092), ('wordnet', 0.087), ('animal', 0.086), ('organization', 0.081), ('chordate', 0.08), ('habitat', 0.08), ('positioning', 0.08), ('szpektor', 0.08), ('vertebrate', 0.08), ('organize', 0.08), ('induction', 0.078), ('girju', 0.076), ('perspectives', 0.076), ('feline', 0.069), ('fish', 0.069), ('puppy', 0.069), ('reconstructing', 0.062), ('ritter', 0.062), ('shepherd', 0.062), ('surface', 0.061), ('basic', 0.059), ('yang', 0.059), ('etzioni', 0.058), ('starting', 0.058), ('davidov', 0.057), ('glickman', 0.057), ('harvest', 0.057), ('hyponyms', 0.057), ('widdows', 0.057), ('patterns', 0.055), ('attempts', 0.053), ('discover', 0.053), ('harvested', 0.051), ('moldovan', 0.051), ('relations', 0.049), ('katz', 0.048), ('dog', 0.044), ('entailment', 0.044), ('riloff', 0.042), ('lists', 0.041), ('item', 0.04), ('lin', 0.04), ('regions', 0.039), ('creation', 0.039), ('terms', 0.038), ('perspective', 0.037), ('learn', 0.037), ('power', 0.035), ('produce', 0.035), ('impoverished', 0.034), ('confusingly', 0.034), ('adapts', 0.034), ('advantageous', 0.034), ('animals', 0.034), ('anyone', 0.034), ('deer', 0.034), ('inherit', 0.034), ('mammals', 0.034), ('pens', 0.034), ('suchanek', 0.034), ('tfoher', 0.034), ('usc', 0.034), ('purposes', 0.034), ('semantic', 0.034), ('achieving', 0.033), ('clustering', 0.033), ('domains', 0.032), ('procedure', 0.032), ('stages', 0.031), ('admiralty', 0.031), ('initiate', 0.031), ('misses', 0.031), ('notoriously', 0.031), ('ontologies', 0.031), ('reconstruct', 0.031), ('validating', 0.031), ('existing', 0.031), 
('learns', 0.03), ('people', 0.03)]
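
As a rough guide to how word weights like those above and the simValue scores below could be obtained, here is a minimal sketch. It assumes a plain term-frequency times log inverse-document-frequency weighting and cosine similarity between papers; the page does not state the exact weighting or normalization it used, and the names `tfidf_vectors`, `cosine`, and `top_words` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a {word: tf-idf weight} dictionary.

    docs: dict of name -> list of tokens. Assumed weighting:
    (count / document length) * log(number of documents / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_words(vector, n):
    """Return the n highest-weighted (word, weight) pairs."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy corpus, purely illustrative (not the real paper texts).
docs = {
    "a": ["taxonomy", "taxonomy", "concept", "web"],
    "b": ["taxonomy", "relation", "web"],
    "c": ["question", "classification", "web"],
}
vectors = tfidf_vectors(docs)
sim_ab = cosine(vectors["a"], vectors["b"])
sim_ac = cosine(vectors["a"], vectors["c"])
```

In this toy corpus the word "web" occurs in every document, so its weight is zero and it contributes nothing to the similarity between papers.
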

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

2 0.2406524 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

3 0.13324372 51 emnlp-2010-Function-Based Question Classification for General QA

Author: Fan Bu ; Xingwei Zhu ; Yu Hao ; Xiaoyan Zhu

Abstract: In contrast with the booming increase of internet data, state-of-art QA (question answering) systems, otherwise, concerned data from specific domains or resources such as search engine snippets, online forums and Wikipedia in a somewhat isolated way. Users may welcome a more general QA system for its capability to answer questions of various sources, integrated from existed specialized sub-QA engines. In this framework, question classification is the primary task. However, the current paradigms of question classification were focused on some specified type of questions, i.e. factoid questions, which are inappropriate for the general QA. In this paper, we propose a new question classification paradigm, which includes a question taxonomy suitable to the general QA and a question classifier based on MLN (Markov logic network), where rule-based methods and statistical methods are unified into a single framework in a fuzzy discriminative learning approach. Experiments show that our method outperforms traditional question classification approaches.

4 0.071828499 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

Author: Roberto Navigli ; Giuseppe Crisafulli

Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graph-based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.

5 0.055044468 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

6 0.054422691 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

7 0.05222182 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

8 0.052084107 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

9 0.045212865 59 emnlp-2010-Identifying Functional Relations in Web Text

10 0.043875415 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

11 0.043070827 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

12 0.040987156 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

13 0.037583217 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

14 0.036098007 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

15 0.035437305 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

16 0.033690177 20 emnlp-2010-Automatic Detection and Classification of Social Events

17 0.033522997 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

18 0.033474345 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.033469316 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

20 0.03332248 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.121), (1, 0.089), (2, -0.044), (3, 0.185), (4, 0.082), (5, -0.032), (6, -0.08), (7, 0.078), (8, 0.038), (9, 0.068), (10, -0.015), (11, -0.203), (12, 0.006), (13, -0.242), (14, 0.027), (15, 0.163), (16, -0.204), (17, -0.102), (18, -0.002), (19, -0.098), (20, 0.016), (21, -0.057), (22, 0.162), (23, 0.222), (24, -0.045), (25, 0.091), (26, -0.232), (27, -0.115), (28, 0.047), (29, -0.134), (30, -0.283), (31, 0.024), (32, -0.17), (33, 0.155), (34, 0.112), (35, -0.064), (36, 0.094), (37, -0.065), (38, -0.064), (39, -0.007), (40, 0.11), (41, 0.012), (42, 0.016), (43, -0.008), (44, 0.088), (45, 0.043), (46, 0.042), (47, 0.03), (48, 0.035), (49, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97897232 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

2 0.8227033 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

3 0.30166072 51 emnlp-2010-Function-Based Question Classification for General QA

Author: Fan Bu ; Xingwei Zhu ; Yu Hao ; Xiaoyan Zhu

Abstract: In contrast with the booming increase of internet data, state-of-art QA (question answering) systems, otherwise, concerned data from specific domains or resources such as search engine snippets, online forums and Wikipedia in a somewhat isolated way. Users may welcome a more general QA system for its capability to answer questions of various sources, integrated from existed specialized sub-QA engines. In this framework, question classification is the primary task. However, the current paradigms of question classification were focused on some specified type of questions, i.e. factoid questions, which are inappropriate for the general QA. In this paper, we propose a new question classification paradigm, which includes a question taxonomy suitable to the general QA and a question classifier based on MLN (Markov logic network), where rule-based methods and statistical methods are unified into a single framework in a fuzzy discriminative learning approach. Experiments show that our method outperforms traditional question classification approaches.

4 0.20746422 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

5 0.18429673 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

6 0.18358022 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

7 0.1753473 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

8 0.16916852 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

9 0.15262172 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

10 0.14833701 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

11 0.14094061 59 emnlp-2010-Identifying Functional Relations in Web Text

12 0.13509642 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs

13 0.12183844 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

14 0.12064612 80 emnlp-2010-Modeling Organization in Student Essays

15 0.11845117 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

16 0.11166728 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

17 0.11065545 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

18 0.10944894 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

19 0.10872228 74 emnlp-2010-Learning the Relative Usefulness of Questions in Community QA

20 0.10503574 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.043), (12, 0.043), (22, 0.457), (29, 0.049), (30, 0.041), (52, 0.014), (56, 0.052), (62, 0.016), (66, 0.089), (72, 0.038), (76, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78160965 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

2 0.30239278 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

3 0.27025113 51 emnlp-2010-Function-Based Question Classification for General QA

Author: Fan Bu ; Xingwei Zhu ; Yu Hao ; Xiaoyan Zhu

Abstract: In contrast with the booming increase of internet data, state-of-the-art QA (question answering) systems have so far handled data from specific domains or resources, such as search engine snippets, online forums and Wikipedia, in a somewhat isolated way. Users may welcome a more general QA system that can answer questions from various sources by integrating existing specialized sub-QA engines. In this framework, question classification is the primary task. However, the current paradigms of question classification focus on specific types of questions, i.e. factoid questions, which makes them inappropriate for general QA. In this paper, we propose a new question classification paradigm, which includes a question taxonomy suitable for general QA and a question classifier based on MLN (Markov logic network), where rule-based methods and statistical methods are unified into a single framework in a fuzzy discriminative learning approach. Experiments show that our method outperforms traditional question classification approaches.

4 0.26828656 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev

Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is experimentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.

5 0.26777554 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

Author: Niklas Jakob ; Iryna Gurevych

Abstract: In this paper, we focus on the opinion target extraction as part of the opinion mining task. We model the problem as an information extraction task, which we address based on Conditional Random Fields (CRF). As a baseline we employ the supervised algorithm by Zhuang et al. (2006), which represents the state-of-the-art on the employed data. We evaluate the algorithms comprehensively on datasets from four different domains annotated with individual opinion target instances on a sentence level. Furthermore, we investigate the performance of our CRF-based approach and the baseline in a single- and cross-domain opinion target extraction setting. Our CRF-based approach improves the performance by 0.077, 0.126, 0.071 and 0.178 regarding F-Measure in the single-domain extraction in the four domains. In the cross-domain setting our approach improves the performance by 0.409, 0.242, 0.294 and 0.343 regarding F-Measure over the baseline.

6 0.26753119 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

7 0.26677507 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

8 0.26634216 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

9 0.26620579 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

10 0.26411507 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

11 0.26410517 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

12 0.26397797 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

13 0.26387388 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

14 0.26382264 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

15 0.26304585 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

16 0.26265141 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

17 0.26262465 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

18 0.26110068 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

19 0.26092163 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

20 0.26054215 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification