nips nips2004 nips2004-101 knowledge-graph by maker-knowledge-mining

101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery


Source: pdf

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Learning syntactic patterns for automatic hypernym discovery Rion Snow Daniel Jurafsky Andrew Y. Ng [sent-1, score-0.84]

2 Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. [sent-7, score-0.082]

3 Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. [sent-8, score-0.742]

4 Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. [sent-9, score-0.725]

5 Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. [sent-11, score-0.906]

6 On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. [sent-12, score-0.986]

7 1 Introduction Semantic taxonomies and thesauri such as WordNet [5] are a key source of knowledge for natural language processing applications, and provide structured information about semantic relations between words. [sent-13, score-0.213]

8 Further, semantic taxonomies are invariably limited in scope and domain, and the high cost of extending or customizing them for an application has often limited their usefulness. [sent-15, score-0.204]

9 Consequently, there has been significant recent interest in finding methods for automatically learning taxonomic relations and constructing semantic hierarchies. [sent-16, score-0.178]

10 [1, 2, 3, 4, 6, 8, 9, 13, 15, 17, 18, 19, 20, 21] In this paper, we build an automatic classifier for the hypernym/hyponym relation. [sent-17, score-0.048]

11 A noun X is a hyponym of a noun Y if X is a subtype or instance of Y. [sent-18, score-0.806]

12 Thus “Shakespeare” is a hyponym of “author” (and conversely “author” is a hypernym of “Shakespeare”), “dog” is a hyponym of “canine”, “desk” is a hyponym of “furniture”, and so on. [sent-19, score-1.005]

13 Much of the previous work on automatic semantic classification of words has been based on a key insight first articulated by Hearst [8], that the presence of certain “lexico-syntactic patterns” can indicate a particular semantic relationship between two nouns. [sent-20, score-0.244]

14 Hearst noticed that, for example, linking two noun phrases (NPs) via the constructions “Such NP_Y as NP_X” or “NP_X and other NP_Y” often implies that NP_X is a hyponym of NP_Y, i.e., that NP_X names a kind of NP_Y. [sent-21, score-0.468]

15 Since then, several researchers have used a small number (typically less than ten) of hand-crafted patterns like these to try to automatically label such semantic relations [1, 2, 6, 13, 17, 18]. [sent-24, score-0.288]

16 While these patterns have been successful at identifying some examples of relationships like hypernymy, this method of lexicon construction is tedious and severely limited by the small number of patterns typically employed. [sent-25, score-0.302]

17 Our goal is to use machine learning to automatically replace this hand-built knowledge. [sent-26, score-0.047]

18 We first use examples of known hypernym pairs to automatically identify large numbers of useful lexico-syntactic patterns, and then combine these patterns using a supervised learning algorithm to obtain a high accuracy hypernym classifier. [sent-27, score-1.45]

19 Training: (a) Collect noun pairs from corpora, identifying examples of hypernym pairs (pairs of nouns in a hypernym/hyponym relation) using WordNet. [sent-29, score-1.305]

20 (b) For each noun pair, collect sentences in which both nouns occur. [sent-30, score-0.592]

21 (c) Parse the sentences, and automatically extract patterns from the parse tree. [sent-31, score-0.181]

22 (d) Train a hypernym classifier based on these features. [sent-32, score-0.615]

23 Test: (a) Given a pair of nouns in the test set, extract features and use the classifier to determine if the noun pair is in the hypernym/hyponym relation or not. [sent-34, score-0.722]
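
To make the training and test steps above concrete, here is a minimal, self-contained Python sketch. It substitutes toy noun pairs, invented dependency-path feature names, and scikit-learn's DictVectorizer and LogisticRegression for the real corpus, MINIPAR parses, and WordNet labelling, so it illustrates only the flow of steps (a)-(d) and the test step.

```python
# Hedged sketch of the pipeline; all data below are toy stand-ins.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# (a)-(c): each noun pair maps to counts of dependency paths observed between
# its members, plus a WordNet-derived label (1 = Known Hypernym pair).
train_pairs = {
    ("shakespeare", "author"): ({"such_NP_as_NP": 3, "NP_and_other_NP": 1}, 1),
    ("dog", "canine"):         ({"NP_such_as_NP": 2}, 1),
    ("desk", "lamp"):          ({"NP_and_NP": 4}, 0),
}
features = [f for f, _ in train_pairs.values()]
labels = [y for _, y in train_pairs.values()]

vec = DictVectorizer()
X = vec.fit_transform(features)

# (d): train a hypernym classifier on the dependency-path features.
clf = LogisticRegression().fit(X, labels)

# Test step (a): extract the same features for an unseen pair and classify it.
unseen = {"such_NP_as_NP": 1}
print(clf.predict_proba(vec.transform([unseen]))[0, 1])  # P(hypernym)
```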

24 Section 2 introduces our method for automatically discovering patterns indicative of hypernymy. [sent-36, score-0.183]

25 In Section 4 we analyze our feature space, and in Section 5 we describe a classifier using these features that achieves high accuracy on the task of hypernym identification. [sent-38, score-0.636]

26 Section 6 shows how this classifier can be improved by adding a new source of knowledge, coordinate terms. [sent-39, score-0.119]

27 2 Representing lexico-syntactic patterns with dependency paths The first goal of our work is to automatically identify lexico-syntactic patterns indicative of hypernymy. [sent-40, score-0.501]

28 We propose the use of dependency paths as a general-purpose formalization of the space of lexico-syntactic patterns. [sent-42, score-0.227]

29 Dependency paths have been used successfully in the past to represent lexico-syntactic relations suitable for semantic processing [11]. [sent-43, score-0.18]

30 A dependency parser produces a dependency tree that represents the syntactic relations between words by a list of edge tuples of the form (word1, CATEGORY1:RELATION:CATEGORY2, word2). [sent-44, score-0.434]

31 In this formulation each word is the stemmed form of the word or multi-word phrase (so that “authors” becomes “author”), and corresponds to a specific node in the dependency tree; each category is the part-of-speech label of the corresponding word (e.g., [sent-45, score-0.228]

32 N for noun or PREP for preposition); and the relation is the directed syntactic relationship exhibited between word1 and word2 (e.g., …). [sent-47, score-0.413]

33 We may then define our space of lexico-syntactic patterns to be all shortest paths of four links or less between any two nouns in a dependency tree. [sent-50, score-0.604]
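
As a hedged illustration of this definition, the sketch below treats a hand-written toy dependency parse as an undirected graph and keeps the shortest path between each noun pair only if it uses at most four links. The edge labels merely imitate the CATEGORY:RELATION:CATEGORY notation; they are not actual MINIPAR output.

```python
# Toy path extraction over a hand-built dependency graph (requires networkx).
import networkx as nx

# Edges loosely following "such authors as Herrick and Shakespeare".
edges = [
    ("authors", "such", "N:pre:PreDet"),
    ("authors", "as", "N:mod:Prep"),
    ("as", "Herrick", "Prep:pcomp-n:N"),
    ("Herrick", "Shakespeare", "N:conj:N"),
]
g = nx.Graph()
for a, b, rel in edges:
    g.add_edge(a, b, rel=rel)

nouns = ["authors", "Herrick", "Shakespeare"]
for i, n1 in enumerate(nouns):
    for n2 in nouns[i + 1:]:
        path = nx.shortest_path(g, n1, n2)
        if len(path) - 1 <= 4:  # keep paths of four links or less
            rels = [g[u][v]["rel"] for u, v in zip(path, path[1:])]
            print(f"{n1} -> {n2}: {rels}")
```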

34 Figure 1 shows the partial dependency tree for the sentence fragment “… [sent-51, score-0.204]

35 …such authors as Herrick and Shakespeare” generated by the broad-coverage dependency parser MINIPAR [10]. [sent-54, score-0.18]

36 Each dependency path may then be presented as an ordered list of dependency tuples. [sent-59, score-0.368]

37 First, we extend each path with optional single links not already contained in the dependency path, added on either side of each noun. [sent-63, score-0.242]

38 Second, we capitalize on the distributive nature of the syntactic conjunction relation (nouns linked by “and” or “or”, or in comma-separated lists) by distributing dependency links across such conjunctions. [sent-64, score-0.328]

39 For example, in the simple 2-member conjunction chain of “Herrick” and “Shakespeare” in Figure 1, we add the entrance link “as, -PREP:PCOMP-N:N” to the single element “Shakespeare” (as a dotted line in the figure). [sent-65, score-0.061]

40 Our extended dependency notation is able to capture the power of the hand-engineered patterns described in the literature. [sent-66, score-0.269]

41 Table 1 shows the six patterns used in [1, 2, 8] and their corresponding dependency path formalizations. [sent-67, score-0.319]

42 3 Experimental paradigm Our goal is to build a classifier which, when given an ordered pair of nouns, makes the binary decision of whether the nouns are related by hypernymy. [sent-68, score-0.299]

43 All of our experiments are based on a corpus of over 6 million newswire sentences. [sent-69, score-0.064]

44 We first parsed each of the sentences in the corpus using MINIPAR. [sent-70, score-0.059]

45 We extract every pair of nouns from each sentence. [sent-71, score-0.281]

46 752,311 of the resulting unique noun pairs were labeled as Known Hypernym or Known Non-Hypernym using WordNet. [sent-72, score-0.401]

47 A noun pair (ni, nj) is labeled Known Hypernym if nj is an ancestor of the first sense of ni in the WordNet hypernym taxonomy, and if the only “frequently-used” sense of each noun is the first noun sense listed in WordNet. [sent-73, score-2.042]

48 Note that nj is considered a hypernym of ni regardless of how much higher in the hierarchy it is with respect to ni . [sent-74, score-0.898]

49 A noun pair may be assigned to the second set of Known Non-Hypernym pairs if both nouns are contained within WordNet, but neither noun is an ancestor of the other in the WordNet hypernym taxonomy for any senses of either noun. [sent-75, score-1.774]
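
A hedged sketch of this labelling rule, written against NLTK's WordNet interface rather than whatever tooling the authors used; the “frequently-used sense” filter is omitted, and the snippet assumes the WordNet corpus has been downloaded through nltk.

```python
# Approximate Known Hypernym / Known Non-Hypernym labelling with NLTK WordNet.
from nltk.corpus import wordnet as wn

def ancestors(synset):
    """All hypernym (and instance-hypernym) ancestors of a synset."""
    return set(synset.closure(lambda s: s.hypernyms() + s.instance_hypernyms()))

def label_pair(ni, nj):
    senses_i = wn.synsets(ni, pos=wn.NOUN)
    senses_j = wn.synsets(nj, pos=wn.NOUN)
    if not senses_i or not senses_j:
        return "unknown"
    # Known Hypernym: some sense of nj is an ancestor of the first sense of ni.
    if set(senses_j) & ancestors(senses_i[0]):
        return "Known Hypernym"
    # Known Non-Hypernym: no sense of either noun is an ancestor of the other.
    for si in senses_i:
        for sj in senses_j:
            if sj in ancestors(si) or si in ancestors(sj):
                return "excluded"  # related under some sense pair
    return "Known Non-Hypernym"

print(label_pair("dog", "canine"))  # expected: Known Hypernym
print(label_pair("desk", "sofa"))   # expected: Known Non-Hypernym
```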

50 For both sets of evaluations, our classifier was given a pair of nouns from an unseen sentence and had to make a hypernym vs. non-hypernym decision. [sent-78, score-0.923]

51 3 A noun sense is determined to be “frequently-used” if it occurs at least once in the sense-tagged Brown Corpus Semantic Concordance files (as reported in the cntlist file distributed as part of WordNet 2. [sent-84, score-0.356]

52 This determination is made so as to reduce the number of false hypernym/hyponym classifications due to highly polysemous nouns (nouns which have multiple meanings). [sent-86, score-0.226]

53 For the second set of evaluations we hand-labeled a test set of 5,387 noun pairs from randomly-selected paragraphs within our corpus (with part-of-speech labels assigned by MINIPAR). [sent-107, score-0.449]

54 The annotators were instructed to label each ordered noun pair as one of “hyponym-to-hypernym”, “hypernym-to-hyponym”, “coordinate”, or “unrelated” (the coordinate relation will be defined in Section 6). [sent-108, score-0.543]

55 As expected, the vast majority of pairs (5,122) were found to be unrelated by these measures; the rest were split evenly between hypernym and coordinate pairs (134 and 131, resp.). [sent-109, score-0.86]

56 Interannotator agreement was obtained between four labelers (all native speakers of English) on a set of 511 noun pairs, and determined for each task according to the averaged F-Score across all pairs of the four labelers. [sent-111, score-0.432]
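
The averaging procedure is only loosely described in this summary, so the following is a hedged sketch of one plausible reading: compute an F-score between each pair of labelers on their binary hypernym judgments and average over all labeler pairs. The judgments below are toy data, not the actual annotations.

```python
# Pairwise F-score agreement between labelers (toy data, assumed procedure).
from itertools import combinations

def f_score(gold, pred):
    tp = sum(g and p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)

# Four labelers' binary hypernym judgments on the same five noun pairs.
labelers = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
]
scores = [f_score(a, b) for a, b in combinations(labelers, 2)]
print(sum(scores) / len(scores))
```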

57 Agreement was 83% and 64% for the hypernym and coordinate term classification tasks, respectively. [sent-112, score-0.734]

58 4 Features: pattern discovery Our first study focused on discovering which dependency paths might prove useful features for our classifiers. [sent-113, score-0.261]

59 We created a feature lexicon of 69,592 dependency paths, consisting of every dependency path that occurred between at least five unique noun pairs in our corpus. [sent-114, score-0.872]

60 To evaluate these features, we constructed a binary classifier for each pattern, which simply classifies a noun pair as hypernym/hyponym if and only if the specific pattern occurs at least once for that noun pair. [sent-115, score-0.779]
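
A small sketch of this per-pattern evaluation: every dependency path is treated as a stand-alone classifier that predicts “hypernym” for a noun pair iff the path occurs at least once for that pair, and its precision and recall are tallied against the WordNet labels. The pattern names and counts are toy stand-ins.

```python
# Precision/recall of single-pattern classifiers on toy data.
from collections import defaultdict

# pair -> (set of dependency paths observed for the pair, gold hypernym label)
data = {
    ("shakespeare", "author"): ({"such_NP_as_NP", "NP_and_other_NP"}, True),
    ("dog", "canine"):         ({"NP_such_as_NP"}, True),
    ("dog", "cat"):            ({"NP_and_other_NP"}, False),
    ("desk", "lamp"):          ({"NP_or_NP"}, False),
}

stats = defaultdict(lambda: [0, 0])  # path -> [true positives, predicted positives]
gold_pos = sum(1 for _, is_hyp in data.values() if is_hyp)
for paths, is_hyp in data.values():
    for p in paths:
        stats[p][1] += 1
        stats[p][0] += int(is_hyp)

for p, (tp, pred) in sorted(stats.items()):
    print(f"{p:16s}  P={tp / pred:.2f}  R={tp / gold_pos:.2f}")
```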

61 Figure 2 depicts the precision and recall of all such classifiers (with recall at least …). [sent-116, score-0.092]

62 This analysis gives a quantitative justification to Hearst’s initial intuition as to the power of hand-selected patterns; nearly all of Hearst’s patterns are at the high-performance boundary of precision and recall for individual features. [sent-119, score-0.172]

63 We record in our noun pair lexicon each noun pair that occurs with at least five unique paths from our feature lexicon discussed in the previous section. [sent-129, score-1.02]

64 We then create a feature count vector for each such noun pair. [sent-130, score-0.359]

65 Each entry of the 69,592-dimension vector represents a particular dependency path, and contains the total number of times that that path was the shortest path connecting that noun pair in some dependency tree in our corpus. [sent-131, score-0.856]

66 We thus define as our task the binary classification of a noun pair as a hypernym pair based on its feature vector of dependency paths. [sent-132, score-1.261]

67 We train a number of classifiers on this data set, including multinomial Naive Bayes, complement Naive Bayes [16], and logistic regression. [sent-134, score-0.07]

68 The second classifier consists of only the “NP and/or other NP” subset of Hearst’s patterns, as used in the automatic construction of a noun-labeled hypernym taxonomy in [1]. [sent-138, score-0.719]

69 In our tests we found greatest performance from a binary logistic regression model with 14 redundant threshold buckets spaced at the exponentially increasing intervals {1, 2, 4, …}. [sent-139, score-0.134]

70 These buckets are defined such that a feature corresponding to pattern p at threshold t will be activated by a noun pair n if and only if p has been observed to occur as a shortest dependency path between n at least t times. [sent-143, score-0.721]
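
To make the bucket encoding concrete, here is a hedged sketch; the threshold set follows the quoted {1, 2, 4, …} pattern but is truncated to six values rather than the 14 used in the paper, and the feature-naming convention is invented.

```python
# Threshold-bucket expansion of raw dependency-path counts.
BUCKETS = [2 ** k for k in range(6)]  # 1, 2, 4, 8, 16, 32

def bucketize(path_counts):
    """Emit a binary feature (path, t) iff the path was seen at least t times."""
    features = {}
    for path, count in path_counts.items():
        for t in BUCKETS:
            if count >= t:
                features[f"{path}>={t}"] = 1.0
    return features

print(bucketize({"such_NP_as_NP": 5, "NP_and_other_NP": 1}))
# such_NP_as_NP fires at thresholds 1, 2, and 4; NP_and_other_NP only at 1.
```

These binary features could then be fed to a logistic regression exactly as in the earlier pipeline sketch.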

71 Our classifier shows a dramatic improvement over previous classifiers; in particular, using our best logistic regression classifier trained on newswire corpora, we observe a 132% relative improvement of average maximum F-score over the classifier based on Hearst’s patterns. [sent-144, score-0.16]

72 6 Using coordinate terms to improve hypernym classification While our hypernym-only classifier performed better than previous classifiers based on hand-built patterns, there is still much room for improvement. [sent-145, score-0.734]

73 As [2] points out, one problem with pattern-based hypernym classifiers in general is that within-sentence hypernym pattern information is quite sparse. [sent-146, score-1.26]

74 Patterns are useful only for classifying noun pairs which happen to occur in the same sentence; many hypernym/hyponym pairs may simply not occur in the same sentence in the corpus. [sent-147, score-0.491]

75 The WordNet glossary defines coordinate terms as “nouns or verbs that have the same hypernym”. [sent-149, score-0.119]

76 Here we treat the coordinate relation as a symmetric relation that exists between two nouns that share at least one common ancestor in the hypernym taxonomy, and are therefore “the same kind of thing” at some level. [sent-150, score-1.084]

77 Many methods exist for inferring that two nouns are coordinate terms (a common subtask in automatic thesaurus induction). [sent-151, score-0.393]

78 Table 4: Summary of maximum F-scores on hand-labeled coordinate pairs. Figure: coordinate term classifiers on the hand-labeled test set. [sent-156, score-0.234]

79 6.1 Coordinate Term Classification. Prior work for identifying coordinate terms includes automatic word sense clustering methods based on distributional similarity (e.g., …). [sent-194, score-0.274]

80 First we construct a vector-space model similar to [12] using single MINIPAR dependency links as our distributional features. [sent-200, score-0.237]
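
A hedged sketch of a distributional model of this general kind: each noun is described by counts of the single dependency links it occurs in, and noun pairs are scored with a cosine similarity standing in for whatever normalized similarity the authors actually used. The link names and counts are invented.

```python
# Cosine similarity over toy dependency-link profiles.
import math

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

dep_profiles = {
    "dog":  {"subj:bark": 4, "obj:feed": 2, "mod:small": 3},
    "cat":  {"subj:meow": 3, "obj:feed": 3, "mod:small": 2},
    "idea": {"obj:propose": 5, "mod:new": 4},
}

print(cosine(dep_profiles["dog"], dep_profiles["cat"]))   # shares links: higher score
print(cosine(dep_profiles["dog"], dep_profiles["idea"]))  # no shared links: 0.0
```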

81 We use the normalized similarity score from this model for coordinate term classification. [sent-201, score-0.14]

82 For purposes of comparison we construct a series of classifiers from WordNet, which make the binary decision of determining whether two nouns are coordinate according to whether they share a common ancestor within k nouns higher up in the hypernym taxonomy, for all k from 1 to 6. [sent-203, score-1.266]
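
A hedged sketch of this WordNet baseline using NLTK: walk up to k hypernym links from the first noun sense of each word and test whether the two ancestor sets intersect. Sense handling in the original experiments may differ.

```python
# WordNet coordinate-term baseline: common ancestor within k links.
from nltk.corpus import wordnet as wn

def within_k_ancestors(synset, k):
    frontier, seen = {synset}, {synset}
    for _ in range(k):
        frontier = {h for s in frontier for h in s.hypernyms()} - seen
        seen |= frontier
    return seen

def wordnet_coordinate(n1, n2, k):
    s1 = wn.synsets(n1, pos=wn.NOUN)
    s2 = wn.synsets(n2, pos=wn.NOUN)
    if not s1 or not s2:
        return False
    return bool(within_k_ancestors(s1[0], k) & within_k_ancestors(s2[0], k))

print(wordnet_coordinate("dog", "cat", 3))    # expected True: both reach "carnivore"
print(wordnet_coordinate("dog", "sonata", 3)) # expected False: no nearby common ancestor
```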

83 Also, we compare a simple pattern-based classifier based on the conjunction pattern, which thresholds simply on the number of conjunction patterns found between a given pair. [sent-204, score-0.232]

84 6.2 Hybrid hypernym-coordinate classification. We now combine our hypernym and coordinate models in order to improve hypernym classification. [sent-208, score-1.349]

85 We define two probabilities of pair relationships between nouns, P(ni <_H nj) and P(ni ∼_C nj), described below. (Footnote: we use the same 6 million MINIPAR-parsed sentences used in our hypernym training set.) [sent-209, score-0.772]

86 Our feature lexicon consists of the 30,000 most frequent noun-connected dependency edges. [sent-210, score-0.262]

87 (Table 5: Maximum F-score of hypernym classifiers on the hand-labeled test set.) P(ni <_H nj) represents the probability that noun ni has nj as an ancestor in its hypernym hierarchy, and P(ni ∼_C nj) is the probability that nouns ni and nj are coordinate terms, i.e. [sent-220, score-2.402]

88 that they share a common hypernym ancestor at some level. [sent-222, score-0.677]

89 Defining the probability produced by our best hypernym-only classifier as P_old(ni <_H nj), and a probability obtained by normalizing the similarity score from our coordinate classifier as P(ni ∼_C nj), we apply a simple linear interpolation scheme to compute a new hypernymy probability. [sent-223, score-0.33]
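
The summary says only that a “simple linear interpolation scheme” is used, so the sketch below is an illustrative guess at one such scheme rather than the authors' exact formula: the hypernym-only probability for (ni, nj) is interpolated with hypernym evidence routed through ni's coordinate terms.

```python
# Hypothetical hybrid hypernym/coordinate score (weighting scheme is assumed).
LAMBDA = 0.7  # hypothetical interpolation weight

def p_hybrid(ni, nj, p_old, p_coord, nouns):
    """p_old[(a, b)]: hypernym-only P(a <_H b); p_coord[(a, b)]: P(a ~_C b)."""
    propagated = max(
        (p_coord.get((ni, nk), 0.0) * p_old.get((nk, nj), 0.0)
         for nk in nouns if nk != ni),
        default=0.0,
    )
    return LAMBDA * p_old.get((ni, nj), 0.0) + (1 - LAMBDA) * propagated

p_old = {("dalmatian", "animal"): 0.2, ("dog", "animal"): 0.9}
p_coord = {("dalmatian", "dog"): 0.8}
print(p_hybrid("dalmatian", "animal", p_old, p_coord, ["dalmatian", "dog", "animal"]))
```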

90 We evaluated several classifiers based on the WordNet hypernym taxonomy. [sent-226, score-0.615]

91 Our logistic regression hypernym-only model trained on the newswire corpora has a 16% relative F-score improvement over the best WordNet classifier, while the combined hypernym/coordinate model has a 40% relative F-score improvement. [sent-228, score-0.179]

92 Our best-performing classifier is a hypernym-only model additionally trained on the Wikipedia corpus, with an expanded feature lexicon of 200,000 dependency paths; this classifier shows a 54% improvement over WordNet. [sent-229, score-0.279]

93 8 Conclusions. Our experiments demonstrate that automatic methods can be competitive with WordNet for the identification of hypernym pairs in newswire corpora. [sent-232, score-0.759]

94 In future work we will use the presented method to automatically generate flexible, statistically-grounded hypernym taxonomies directly from corpora. [sent-233, score-0.744]

95 These taxonomies will be made publicly available to complement existing semantic resources. [sent-234, score-0.201]

96 8 We tried all combinations of the following parameters: the maximum number of senses of a hyponym for which to find hypernyms, the maximum distance between the hyponym and its hypernym in the WordNet taxonomy, and whether or not to allow synonyms. [sent-238, score-0.896]

97 The WordNet model achieving the maximum F-score uses only the first sense of a hyponym and allows a maximum distance of 4 links between a hyponym and hypernym. [sent-239, score-0.311]

98 9 There are 31 such disagreements, with WordNet agreeing with the human labels on 5 and our hybrid model agreeing on the other 26. [sent-240, score-0.07]

99 (1993) Customizing a lexicon to better suit a computational task. [sent-289, score-0.082]

100 (2003) Unsupervised methods for developing taxonomies by combining syntactic and statistical information. [sent-353, score-0.126]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hypernym', 0.615), ('noun', 0.338), ('wordnet', 0.308), ('np', 0.256), ('nouns', 0.226), ('hearst', 0.178), ('dependency', 0.159), ('hyponym', 0.13), ('coordinate', 0.119), ('patterns', 0.11), ('semantic', 0.098), ('classi', 0.095), ('ni', 0.094), ('er', 0.091), ('taxonomies', 0.082), ('lexicon', 0.082), ('mod', 0.079), ('nj', 0.074), ('shakespeare', 0.071), ('pairs', 0.063), ('ancestor', 0.062), ('conjunction', 0.061), ('conj', 0.059), ('interannotator', 0.059), ('minipar', 0.059), ('nk', 0.058), ('taxonomy', 0.056), ('trec', 0.056), ('pair', 0.055), ('rep', 0.051), ('path', 0.05), ('paths', 0.049), ('logistic', 0.049), ('automatic', 0.048), ('automatically', 0.047), ('pcomp', 0.047), ('wikipedia', 0.047), ('ers', 0.046), ('distributional', 0.045), ('syntactic', 0.044), ('buckets', 0.041), ('corpora', 0.036), ('appo', 0.036), ('herrick', 0.036), ('hypernyms', 0.036), ('pantel', 0.036), ('pold', 0.036), ('punc', 0.036), ('classifiers', 0.035), ('links', 0.033), ('newswire', 0.033), ('relations', 0.033), ('precision', 0.032), ('corpus', 0.031), ('agreement', 0.031), ('relation', 0.031), ('recall', 0.03), ('stanford', 0.03), ('pattern', 0.03), ('sentences', 0.028), ('hybrid', 0.028), ('shortest', 0.027), ('sentence', 0.027), ('indicative', 0.026), ('regression', 0.026), ('naive', 0.026), ('table', 0.024), ('ciaramita', 0.024), ('customizing', 0.024), ('hypernymy', 0.024), ('jurafsky', 0.024), ('ravichandran', 0.024), ('rion', 0.024), ('widdows', 0.024), ('parse', 0.024), ('discovery', 0.023), ('word', 0.023), ('company', 0.022), ('feature', 0.021), ('complement', 0.021), ('hierarchy', 0.021), ('similarity', 0.021), ('parser', 0.021), ('agreeing', 0.021), ('lexical', 0.021), ('obj', 0.021), ('senses', 0.021), ('author', 0.019), ('disagreements', 0.019), ('snow', 0.019), ('formalization', 0.019), ('rennie', 0.019), ('binary', 0.018), ('sense', 0.018), ('acquisition', 0.018), ('tree', 0.018), ('best', 0.018), ('named', 0.018), ('test', 0.017), ('improvement', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1

2 0.10450276 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

3 0.080689959 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

4 0.06769719 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

5 0.054824118 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

Abstract: We study the problem of hierarchical classification when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters. 1

6 0.050574027 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

7 0.049985155 136 nips-2004-On Semi-Supervised Classification

8 0.049094878 54 nips-2004-Distributed Information Regularization on Graphs

9 0.047505379 70 nips-2004-Following Curved Regularized Optimization Solution Paths

10 0.039601546 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

11 0.038903948 137 nips-2004-On the Adaptive Properties of Decision Trees

12 0.038050596 164 nips-2004-Semi-supervised Learning by Entropy Minimization

13 0.037541777 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

14 0.036848921 42 nips-2004-Computing regularization paths for learning multiple kernels

15 0.036387902 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

16 0.036335919 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

17 0.036091395 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

18 0.036040593 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

19 0.034773882 145 nips-2004-Parametric Embedding for Class Visualization

20 0.033389311 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.104), (1, 0.042), (2, -0.025), (3, 0.017), (4, 0.021), (5, 0.076), (6, 0.048), (7, 0.071), (8, 0.026), (9, 0.025), (10, 0.01), (11, 0.026), (12, -0.099), (13, 0.053), (14, -0.04), (15, -0.045), (16, 0.073), (17, -0.038), (18, -0.009), (19, 0.007), (20, 0.03), (21, -0.06), (22, 0.008), (23, -0.045), (24, 0.011), (25, -0.007), (26, -0.039), (27, -0.014), (28, -0.053), (29, 0.03), (30, -0.061), (31, 0.035), (32, 0.044), (33, 0.018), (34, 0.025), (35, 0.008), (36, -0.109), (37, -0.193), (38, 0.01), (39, -0.008), (40, -0.024), (41, -0.099), (42, 0.137), (43, 0.101), (44, -0.14), (45, -0.139), (46, -0.125), (47, 0.122), (48, -0.015), (49, -0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92762941 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1

2 0.57631522 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

3 0.49592537 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

4 0.45923162 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

5 0.43936092 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

6 0.42851898 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

7 0.42289966 87 nips-2004-Integrating Topics and Syntax

8 0.38553751 82 nips-2004-Incremental Algorithms for Hierarchical Classification

9 0.38029534 109 nips-2004-Mass Meta-analysis in Talairach Space

10 0.37995183 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

11 0.37399819 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

12 0.36590391 136 nips-2004-On Semi-Supervised Classification

13 0.31629017 145 nips-2004-Parametric Embedding for Class Visualization

14 0.31543592 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

15 0.29799828 193 nips-2004-Theories of Access Consciousness

16 0.29104954 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

17 0.28463945 54 nips-2004-Distributed Information Regularization on Graphs

18 0.27350068 57 nips-2004-Economic Properties of Social Networks

19 0.27225196 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

20 0.26941013 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.043), (13, 0.052), (15, 0.082), (17, 0.012), (25, 0.012), (26, 0.034), (31, 0.021), (33, 0.176), (35, 0.011), (39, 0.012), (50, 0.036), (51, 0.015), (86, 0.393)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7688787 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1

2 0.70241827 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

3 0.51154101 102 nips-2004-Learning first-order Markov models for control

Author: Pieter Abbeel, Andrew Y. Ng

Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1

4 0.47135657 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

Author: Ligen Wang, Balázs Kégl

Abstract: In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve ADABOOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use ADABOOST’s efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Besides the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms. 1

5 0.46450907 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1

6 0.46299487 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

7 0.45906267 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

8 0.458323 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

9 0.45714238 44 nips-2004-Conditional Random Fields for Object Recognition

10 0.45706475 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

11 0.45604867 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

12 0.45549634 77 nips-2004-Hierarchical Clustering of a Mixture Model

13 0.45477748 161 nips-2004-Self-Tuning Spectral Clustering

14 0.45454472 166 nips-2004-Semi-supervised Learning via Gaussian Processes

15 0.45421627 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

16 0.45387182 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

17 0.45380995 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

18 0.4534674 179 nips-2004-Surface Reconstruction using Learned Shape Models

19 0.45331615 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

20 0.45330542 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data