nips nips2004 nips2004-156 knowledge-graph by maker-knowledge-mining

156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge


Source: pdf

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The NIPS 2003 workshops included a feature selection competition organized by the authors. [sent-13, score-0.393]

2 We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. [sent-14, score-0.461]

3 The competition took place over a period of 13 weeks and attracted 78 research groups. [sent-15, score-0.162]

4 Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. [sent-16, score-0.893]

5 In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. [sent-17, score-0.494]

6 The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. [sent-18, score-0.131]

7 Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. [sent-19, score-0.466]

8 The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. [sent-20, score-0.613]

9 In fact, the proliferation of datasets combined with the creativity of researchers in designing experiments makes it hardly possible to compare one paper with another [12]. [sent-30, score-0.273]

10 A number of large conferences have regularly organized competitions (e. [sent-31, score-0.072]

11 In 2003, we organized a competition on the theme of feature selection, the results of which were presented at a workshop on feature extraction, which attracted 98 participants. [sent-35, score-0.436]

12 In this paper, we present to the NIPS community a concise summary of our challenge design and the findings of the result analysis. [sent-37, score-0.254]

13 2 Benchmark design We formatted five datasets (Table 1) from various application domains. [sent-38, score-0.297]

14 The data were split into three subsets: a training set, a validation set, and a test set. [sent-40, score-0.185]

15 The class labels for the validation set and the test set were withheld. [sent-42, score-0.185]

16 The identity of the datasets and of the features (some of which were random features artificially generated) were kept secret. [sent-43, score-0.347]

17 The participants could submit prediction results on the validation set and get their performance results and ranking on-line for a period of 12 weeks. [sent-44, score-0.738]

18 By December 1st , 2003, which marked the end of the development period, the participants had to turn in their results on the test set. [sent-45, score-0.397]

19 Immediately after that, the validation set labels were revealed. [sent-46, score-0.137]

20 On December 8th , 2003, the participants could make submissions of test set predictions, after having trained on both the training and the validation set. [sent-47, score-0.708]

21 Some details on the benchmark design are provided in this Section. [sent-48, score-0.212]

22 Challenge format We gave our benchmark the format of a challenge to stimulate participation. [sent-49, score-0.524]

23 We made available an automatic web-based system to submit prediction results and get immediate feed-back, inspired by the system of the NIPS2000 and NIPS2001 unlabelled data competitions [4, 5]. [sent-50, score-0.191]

24 During development participants could submit validation results on any of the five datasets proposed (not necessarily all). [sent-52, score-0.762]

25 Competitors were required to submit results on all five test sets by the challenge deadline to be included in the final ranking. [sent-53, score-0.363]

26 This avoided a common problem of “multiple track” benchmarks in which no conclusion can be drawn because too few participants enter all tracks. [sent-54, score-0.304]

27 “pure” methods and “hybrids”), we allowed participating groups to submit a total of five final entries on December 1st and five entries on December 8th . [sent-57, score-0.434]

28 Our format was very successful: it attracted 78 research groups who competed for 13 weeks and made (submitted) a total of 1863 entries. [sent-58, score-0.231]

29 Twenty groups were eligible for being ranked on December 1st (56 submissions1), and 16 groups on December 8th (36 submissions). [sent-59, score-0.27]

30 The web site www.nipsfsc.ecs.soton.ac.uk remains available as a resource for researchers in feature selection. [sent-65, score-0.206]

31 1 After imposing a maximum of 5 submissions per group and eliminating some incomplete submissions, there remained 56 eligible submissions out of the 135 received. [sent-66, score-0.534]

32 The situation has changed considerably in the past few years, and in the 2003 special issue we edited for JMLR including papers from the proceedings of the NIPS 2001 workshop [7], most papers explore domains with hundreds to tens of thousands of variables or features. [sent-71, score-0.09]

33 Feature selection is a particular way of tackling the problem of space dimensionality reduction. [sent-73, score-0.153]

34 The necessary computing power to handle large datasets is now available in simple laptops, so there is a proliferation of solutions proposed for such feature selection problems. [sent-74, score-0.51]

35 We formatted five datasets for the purpose of benchmarking variable selection algorithms (see Table 1). [sent-76, score-0.392]

36 The datasets were chosen to span a variety of domains and difficulties (the input variables are continuous or binary, sparse or dense; one dataset has unbalanced classes). [sent-77, score-0.157]

37 One dataset (Madelon) was artificially constructed to illustrate a particular difficulty: selecting a feature set when no feature is informative by itself. [sent-78, score-0.29]

38 We chose datasets that had sufficiently many examples to create a large enough test set to obtain statistically significant results [6]. [sent-79, score-0.205]

39 To prevent researchers familiar with the datasets from having an advantage, we concealed the identity of the datasets during the benchmark. [sent-80, score-0.375]

40 In particular, we introduced a number of features called probes. [sent-82, score-0.095]

41 The probes were drawn at random from a distribution resembling that of the real features, but carrying no information about the class labels. [sent-83, score-0.152]

42 Such probes have a function in performance assessment: a good feature selection algorithm should eliminate most of the probes. [sent-84, score-0.45]
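To make the probe idea concrete, here is a minimal sketch (illustrative only; `add_probes` and its arguments are made-up names, not part of the challenge software) that appends probe columns obtained by permuting randomly chosen real feature columns, so each probe resembles a real feature's marginal distribution but carries no information about the labels:

```python
import numpy as np

def add_probes(X, n_probes, seed=0):
    """Append n_probes random 'probe' columns to the feature matrix X.

    Each probe is a random permutation of a randomly chosen real column,
    so its distribution resembles a real feature's but it is independent
    of the class labels.
    """
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, X.shape[1], size=n_probes)
    probes = np.stack([rng.permutation(X[:, c]) for c in cols], axis=1)
    # Boolean mask marking which columns of the augmented matrix are probes.
    probe_mask = np.r_[np.zeros(X.shape[1], bool), np.ones(n_probes, bool)]
    return np.hstack([X, probes]), probe_mask
```

A feature selection method applied to the augmented matrix can then be judged in part by how few entries of `probe_mask` it retains.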

43 2 In this paper, we do not make a distinction between features and variables. [sent-86, score-0.095]

44 The benchmark addresses the problem of selecting input variables. [sent-87, score-0.154]

45 Those may actually be features derived from the original variables through preprocessing. [sent-88, score-0.095]

46 (b) December 8th 2003 challenge results. [sent-130, score-0.196]

47 2 Performance assessment Final submissions qualified for scoring if they included the class predictions for training, validation, and test sets for all five tasks proposed, and the list of features used. [sent-232, score-0.422]

48 This metric was used because some datasets (particularly Dorothea) are unbalanced. [sent-235, score-0.157]

49 The curve represents the fraction of true positives as a function of the fraction of false positives. [sent-238, score-0.104]

50 • Fprobe: Fraction of probes found in the feature set selected. [sent-241, score-0.297]
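A rough sketch of the evaluation statistics mentioned here (balanced error rate, AUC, Ffeat and Fprobe), assuming NumPy and scikit-learn; the official scoring code is the published software, not this sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_error_rate(y_true, y_pred):
    # BER: average of the error rates on the positive and the negative class,
    # which is why it is preferred on unbalanced datasets such as Dorothea.
    pos, neg = y_true == 1, y_true == -1
    return 0.5 * (np.mean(y_pred[pos] != 1) + np.mean(y_pred[neg] != -1))

def challenge_metrics(y_true, y_pred, scores, selected, probe_mask):
    """selected and probe_mask are boolean masks over all candidate features."""
    return {
        "BER": balanced_error_rate(y_true, y_pred),
        "AUC": roc_auc_score(y_true, scores),
        "Ffeat": float(selected.mean()),               # fraction of features used
        "Fprobe": float(probe_mask[selected].mean()),  # fraction of probes among them
    }
```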

51 We ranked the participants with the test set results using a score combining BER, Ffeat and Fprobe. [sent-242, score-0.43]

52 Briefly: We used the McNemar test to determine whether classifier A is better than classifier B according to the BER with 5% risk, yielding a score of 1 (better), 0 (don’t know), or −1 (worse). [sent-243, score-0.126]

53 The overall score for each dataset is the sum of the pairwise comparison scores (normalized by the maximum achievable score, that is, the number of submissions minus one). [sent-246, score-0.297]
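The pairwise scoring scheme could look roughly like the sketch below (a simplification: it applies a standard McNemar test to the raw error indicators, whereas the challenge compared BER values and handled ties as described in the paper):

```python
import numpy as np
from scipy.stats import chi2

def pairwise_score(err_a, err_b, risk=0.05):
    """+1 if submission A beats B with 5% risk, -1 if it loses, 0 if undecided."""
    n01 = int(np.sum(~err_a & err_b))   # examples B gets wrong but A gets right
    n10 = int(np.sum(err_a & ~err_b))   # examples A gets wrong but B gets right
    if n01 + n10 == 0:
        return 0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)   # McNemar statistic
    if chi2.sf(stat, df=1) > risk:
        return 0
    return 1 if n01 > n10 else -1

def overall_scores(error_vectors):
    """error_vectors[i] is the boolean test-set error vector of submission i."""
    k = len(error_vectors)
    return [sum(pairwise_score(error_vectors[i], error_vectors[j])
                for j in range(k) if j != i) / (k - 1)   # normalize by max score
            for i in range(k)]
```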

54 Our scoring method favors accuracy over feature set compactness. [sent-249, score-0.205]

55 Our benchmark design could not prevent participants from “cheating” in the following way. [sent-250, score-0.516]

56 An entrant could “declare” a smaller feature subset than the one used to make predictions. [sent-251, score-0.145]

57 To deter participants from cheating, we warned them that we would be performing a stage of verification. [sent-252, score-0.304]

58 3 Challenge results The overall scores of the best entries are shown in Table 2. [sent-254, score-0.102]

59 The main features of the methods of the participants listed in that table are summarized in Table 3. [sent-255, score-0.441]

60 The analysis of this section also includes a survey of ten more top-ranking participants. [sent-256, score-0.165]

61 Winners The winners of the benchmark (both December 1st and 8th ) are Radford Neal and Jianguo Zhang, with a combination of Bayesian neural networks [14] and Dirichlet diffusion trees [15]. [sent-257, score-0.285]

62 They are also the top entrants on December 1st for Arcene and Dexter, and on December 1st and 8th for Dorothea. [sent-259, score-0.129]

63 • They then applied a classification method based on Bayesian learning, using an Automatic Relevance Determination (ARD) prior that allows the model to determine which of these features are most relevant. [sent-261, score-0.095]

64 As allowed by the challenge rules, the winners constructed these trees using both the training data and the unlabeled data in the validation and test sets. [sent-264, score-0.556]
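The winners' Bayesian neural networks and Dirichlet diffusion trees are not reproduced here, but the effect of an ARD prior (driving the weights attached to irrelevant inputs toward zero) can be illustrated on a toy linear model with scikit-learn's ARDRegression; this is only an illustration of the ARD idea, not the winning method:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))            # only the first 5 features matter
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

ard = ARDRegression()                         # one precision hyperparameter per weight
ard.fit(X, y)
kept = np.flatnonzero(np.abs(ard.coef_) > 1e-3)
print("features retained by the ARD prior:", kept)
```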

65 The classifiers are grouped in four categories: N - neural network, K - SVM or other kernel method, T - tree classifiers (none found in the top ranking methods), O - other. [sent-267, score-0.165]

66 The feature selection engines (Fengine) are grouped in three categories: C - single variable criteria including correlation coefficients, T - tree classifiers or RF used as a filter, E - wrapper or embedded methods. [sent-268, score-0.398]

67 The search methods are identified by: E embedded, R - feature ranking, B - backward elimination, S - more elaborated search. [sent-269, score-0.145]

68 Our findings include: Feature selection The winners and several top ranking challengers use a combination of filters and embedded methods3 . [sent-271, score-0.549]

69 Several high ranking participants obtain good results using only filters, even simple correlation coefficients. [sent-272, score-0.422]

70 The second best entrants use Random Forests, an ensemble of tree classifiers, to perform feature selection [3]. [sent-273, score-0.454]
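The general pattern, Random Forest importances used only as a filter with a separate classifier trained on the retained subset, might be sketched as follows (the number of trees, the number of kept features, and the SVM settings are arbitrary choices for illustration, not the entrants' settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def rf_filter_then_classify(X_train, y_train, X_test, n_keep=200):
    # 1. Rank features with Random Forest importances (RF used as a filter only).
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X_train, y_train)
    keep = np.argsort(rf.feature_importances_)[::-1][:n_keep]
    # 2. Train a different classifier on the retained features.
    clf = SVC(kernel="rbf")
    clf.fit(X_train[:, keep], y_train)
    return clf.predict(X_test[:, keep]), keep
```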

71 Search strategies are generally unsophisticated (simple feature ranking, forward selection or backward elimination). [sent-274, score-0.298]
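A bare-bones greedy forward selection of the kind alluded to here, assuming scikit-learn (wrapper-style, scored by cross-validation; actual entries varied in the scoring criterion and stopping rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, n_features, cv=5):
    """Greedily add, one at a time, the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best_score, best_j = max(
            (np.mean(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv)), j)
            for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```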

72 Only one group used “random probes” purposely introduced to track the fraction of falsely selected features. [sent-278, score-0.1]

73 In spite of the high risk of overfitting, 7 of the 9 top groups using kernel methods found that Gaussian kernels gave them better results than the linear kernel on Arcene, Dexter, Dorothea, or Gisette (for Madelon all best ranking groups used a Gaussian kernel). [sent-282, score-0.387]

74 Ensemble methods Some groups relied on a committee of classifiers to make the final decision. [sent-283, score-0.111]

75 The techniques to build such a committee include sampling. 3 We distinguish embedded methods that have a feature selection mechanism built into the learning algorithm from wrappers, which perform feature selection by using the classifier as a black box. [sent-284, score-0.696]

76 4 Random Forests (RF) are classification techniques with an embedded feature selection mechanism. [sent-285, score-0.398]

77 The participants used the features generated by RF, but did not use RF for classification. [sent-286, score-0.399]

78 Most groups that used ensemble methods reported improved accuracy. [sent-288, score-0.185]
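One simple way to build such a committee, shown only as an illustration rather than any particular entry's method, is a soft-voting ensemble of heterogeneous classifiers; sampling-based schemes such as bagging over data subsets are another common choice:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Committee of three different classifiers; predicted probabilities are
# averaged ("soft" vote) to produce the final decision.
committee = VotingClassifier(
    estimators=[("svm", SVC(kernel="rbf", probability=True)),
                ("logreg", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="soft")
# committee.fit(X_train, y_train); y_pred = committee.predict(X_test)
```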

79 Transduction Since all the datasets were provided from the beginning of the benchmark (validation and test sets deprived of their class labels), it was possible to make use of the unlabelled data as part of learning (sometimes referred to as transduction [17]). [sent-289, score-0.42]

80 Only two groups took advantage of that, including the winners. [sent-290, score-0.111]
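The winners exploited the unlabeled validation and test inputs inside their Dirichlet diffusion tree model, which is not reproduced here; as a generic stand-in, the sketch below propagates labels to the unlabeled points over a nearest-neighbor graph with scikit-learn's LabelSpreading (labels are recoded to {0, 1} because -1 marks unlabeled points in that API):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def transductive_predict(X_train, y_train, X_unlabeled):
    X_all = np.vstack([X_train, X_unlabeled])
    y_all = np.concatenate([(y_train == 1).astype(int),              # labeled: 0/1
                            -np.ones(len(X_unlabeled), dtype=int)])  # unlabeled: -1
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X_all, y_all)
    return model.transduction_[len(X_train):]   # labels inferred for the unlabeled part
```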

81 Preprocessing Centering and scaling the features was the most common preprocessing used. [sent-291, score-0.095]

82 4 Conclusions and future work The challenge demonstrated both that feature selection can be performed effectively and that eliminating meaningless features is not critical to achieve good classification performance. [sent-295, score-0.589]

83 By design, our datasets include many irrelevant “distracters” features, called “probes”. [sent-296, score-0.157]

84 It is surprising that some of the best entries use all the features. [sent-298, score-0.102]

85 For many years, filter methods have dominated feature selection for computational reasons. [sent-301, score-0.298]

86 It was understood that wrapper and embedded methods are more powerful, but too computationally expensive. [sent-302, score-0.172]

87 Some of the top ranking entries use one or several filters as their only selection strategy. [sent-303, score-0.42]

88 A filter as simple as the Pearson correlation coefficient proves to be very effective, even though it does not remove feature redundancy and therefore yields unnecessarily large feature subsets. [sent-304, score-0.29]
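Such a filter amounts to ranking features by the absolute Pearson correlation between each column and the labels, as in this small sketch:

```python
import numpy as np

def pearson_filter(X, y, n_keep):
    """Keep the n_keep features with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(np.abs(corr))[::-1][:n_keep]
```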

89 Other entrants combined filters and embedded methods to further reduce the feature set and eliminate redundancies. [sent-305, score-0.327]

90 Several challenge datasets included a very large number of features (up to 100,000) and only a few hundred examples. [sent-307, score-0.503]

91 Not surprisingly, the winning entries use as classifiers either ensemble methods or strongly regularized classifiers. [sent-309, score-0.176]

92 Hence, with adequate regularization, non-linear classifiers do not overfit the data, even when the number of features exceeds the number of examples by orders of magnitude. [sent-311, score-0.095]
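In practice this usually means tuning the strength of a ridge-style penalty by cross-validation, for example as in the illustrative pipeline below (not any entrant's actual model):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L2-regularized linear classifier for the many-features / few-examples regime;
# the regularization strength C is selected by cross-validation.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", max_iter=5000))
search = GridSearchCV(model,
                      {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1.0, 10.0]},
                      cv=5)
# search.fit(X_train, y_train); best = search.best_estimator_
```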

93 Principal Component Analysis was successfully used by several researchers to reduce the dimension of input space down to a few hundred features, without any knowledge of the class labels. [sent-312, score-0.116]
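Because PCA uses no class labels, it can be fit on all available inputs (training plus the unlabeled validation and test sets) before classification; a minimal sketch, with made-up function and variable names:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(X_train, X_unlabeled, n_components=100):
    # Fit the projection on all inputs (no labels involved), then
    # project the training data onto the leading components.
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack([X_train, X_unlabeled]))
    return pca.transform(X_train), pca
```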

94 The analysis of the challenge results revealed that hyperparameter selection may have played an important role in winning the challenge. [sent-314, score-0.349]

95 Indeed, several groups were using the same classifier (e. [sent-315, score-0.111]

96 We have started laying the basis of a new benchmark on the theme of model selection and hyperparameter selection [8]. [sent-318, score-0.504]

97 We also thank the people who formatted the data and made them available: Thorsten Joachims, Yann Le Cun, and the KDD Cup 2001 organizers. [sent-321, score-0.082]

98 Selection of relevant features and examples in machine learning. [sent-327, score-0.095]

99 Design of experiments of the NIPS 2003 variable selection benchmark. [sent-351, score-0.153]

100 Model selection and ensemble methods challenge in preparation http://clopinet. [sent-367, score-0.423]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('participants', 0.304), ('submissions', 0.219), ('december', 0.196), ('challenge', 0.196), ('ber', 0.174), ('datasets', 0.157), ('benchmark', 0.154), ('selection', 0.153), ('probes', 0.152), ('feature', 0.145), ('yes', 0.143), ('validation', 0.137), ('winners', 0.131), ('rf', 0.127), ('neal', 0.127), ('submit', 0.119), ('ranking', 0.118), ('groups', 0.111), ('ffeat', 0.11), ('ghostminer', 0.11), ('usion', 0.11), ('entries', 0.102), ('auc', 0.101), ('embedded', 0.1), ('classi', 0.099), ('fe', 0.095), ('features', 0.095), ('arcene', 0.082), ('ari', 0.082), ('cbagroup', 0.082), ('dexter', 0.082), ('dorothea', 0.082), ('entrants', 0.082), ('formatted', 0.082), ('madelon', 0.082), ('di', 0.08), ('score', 0.078), ('ve', 0.074), ('ensemble', 0.074), ('competitions', 0.072), ('rlsc', 0.072), ('wrapper', 0.072), ('guyon', 0.07), ('roc', 0.066), ('format', 0.065), ('nips', 0.063), ('dense', 0.061), ('researchers', 0.061), ('transduction', 0.061), ('scoring', 0.06), ('dirichlet', 0.06), ('period', 0.06), ('ers', 0.06), ('lters', 0.059), ('chen', 0.059), ('design', 0.058), ('coe', 0.057), ('forests', 0.057), ('cheating', 0.055), ('distracters', 0.055), ('elissee', 0.055), ('fengine', 0.055), ('gideon', 0.055), ('gisette', 0.055), ('gunn', 0.055), ('isabelle', 0.055), ('nikravesh', 0.055), ('proliferation', 0.055), ('team', 0.055), ('hundred', 0.055), ('attracted', 0.055), ('fraction', 0.052), ('pr', 0.05), ('group', 0.048), ('sa', 0.048), ('test', 0.048), ('kdd', 0.048), ('bagging', 0.048), ('contributed', 0.048), ('asa', 0.048), ('eligible', 0.048), ('ard', 0.048), ('workshops', 0.048), ('wrappers', 0.048), ('kremer', 0.048), ('top', 0.047), ('competition', 0.047), ('balanced', 0.045), ('development', 0.045), ('papers', 0.045), ('unlabeled', 0.044), ('leo', 0.044), ('stimulate', 0.044), ('theme', 0.044), ('parenthesis', 0.044), ('table', 0.042), ('http', 0.041), ('lter', 0.041), ('er', 0.041), ('ties', 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

2 0.13701141 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an -accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an -accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1

3 0.13118185 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1

4 0.1253994 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

5 0.08981435 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

6 0.088344492 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

7 0.074071467 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

8 0.071080551 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

9 0.070331082 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

10 0.069638439 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

11 0.06904804 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

12 0.064889796 68 nips-2004-Face Detection --- Efficient and Rank Deficient

13 0.064853445 100 nips-2004-Learning Preferences for Multiclass Problems

14 0.064730033 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

15 0.063474603 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

16 0.063173316 164 nips-2004-Semi-supervised Learning by Entropy Minimization

17 0.062987052 19 nips-2004-An Application of Boosting to Graph Classification

18 0.062983595 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

19 0.062933944 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

20 0.062498074 34 nips-2004-Breaking SVM Complexity with Cross-Training


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, 0.082), (2, -0.041), (3, 0.029), (4, -0.015), (5, 0.12), (6, 0.087), (7, 0.081), (8, 0.13), (9, -0.07), (10, -0.162), (11, 0.086), (12, 0.013), (13, 0.074), (14, -0.056), (15, 0.128), (16, -0.09), (17, -0.044), (18, 0.051), (19, 0.011), (20, 0.011), (21, 0.016), (22, -0.04), (23, 0.041), (24, 0.035), (25, 0.118), (26, -0.037), (27, -0.04), (28, 0.027), (29, 0.039), (30, -0.044), (31, -0.016), (32, 0.028), (33, 0.051), (34, -0.083), (35, -0.08), (36, -0.163), (37, 0.006), (38, -0.053), (39, -0.024), (40, 0.029), (41, -0.198), (42, 0.085), (43, 0.054), (44, 0.007), (45, -0.103), (46, 0.017), (47, -0.012), (48, -0.015), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9477939 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

2 0.62477505 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classifier is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classification system is essential. The criterion widely used to measure the ranking quality of a classification algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of confidence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the confidence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar confidence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of confidence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight confidence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose confidence intervals, or strong distribution-dependent assumptions leading to tight but unsafe confidence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. Our results are also useful for testing the statistical significance of the difference of the AUC values of two classifiers. The paper is organized as follows. We first introduce the definition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and briefly review some essential aspects of the existing literature related to the computation of confidence intervals for the AUC. Our computation of the expected value and variance of the AUC for a fixed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. 

3 0.59320611 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1
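As a loose illustration of the kind of conic quadratic (second-order cone) program such a reduction produces, the sketch below solves a hinge-loss problem with an l2 budget on the weights using cvxpy; the objective and constraint are stand-ins, not the data-dependent generalization bound minimized in the paper, and availability of cvxpy with a conic solver is assumed.

```python
import cvxpy as cp
import numpy as np

# Toy data: X is (n_samples, n_features), labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(40) > 0, 1.0, -1.0)

w = cp.Variable(5)
b = cp.Variable()
hinge = cp.pos(1 - cp.multiply(y, X @ w + b))   # elementwise hinge terms

# Hinge loss under an l2 budget on w; cvxpy compiles this into a
# second-order cone (conic quadratic) program.
problem = cp.Problem(cp.Minimize(cp.sum(hinge)), [cp.norm(w, 2) <= 1.0])
problem.solve()
print("learned feature weights:", np.round(w.value, 3))
```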

4 0.58231139 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1
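A quick way to build intuition for the sample-size comparison is to simulate the sampling variability of the empirical AUC and of the empirical error rate under some assumed score model. The sketch below does this with Gaussian scores and a fixed threshold; the score model, threshold, and sample sizes are arbitrary assumptions, and the simulation is not a substitute for the distribution-free bound derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_auc(pos, neg):
    return (pos[:, None] > neg[None, :]).mean()

def empirical_error(pos, neg, threshold=0.5):
    # Errors of the thresholded classifier: missed positives plus false alarms.
    return ((pos <= threshold).sum() + (neg > threshold).sum()) / (pos.size + neg.size)

aucs, errors = [], []
for _ in range(500):
    pos = rng.normal(1.0, 1.0, 100)   # assumed score model for positives
    neg = rng.normal(0.0, 1.0, 100)   # assumed score model for negatives
    aucs.append(empirical_auc(pos, neg))
    errors.append(empirical_error(pos, neg))

print("spread of empirical AUC over test draws:   %.4f" % np.std(aucs))
print("spread of empirical error over test draws: %.4f" % np.std(errors))
```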

5 0.53328353 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1
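For orientation, the sketch below builds a conjunction of single-attribute threshold tests with a plain hard-greedy rule; the scoring heuristic and stopping rule are assumptions, and the paper's actual procedure is a soft greedy choice guided by the PAC-Bayes bound rather than this simplification.

```python
import numpy as np

def greedy_conjunction(X, y, max_rays=3):
    """Build a conjunction of single-attribute threshold tests ("rays") with a
    plain hard-greedy rule.  Illustrative only; not the soft greedy,
    bound-driven algorithm of the paper."""
    rays = []
    kept = np.ones(len(y), dtype=bool)          # examples the conjunction still accepts
    for _ in range(max_rays):
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for direction in (1, -1):
                    accept = kept & (direction * (X[:, j] - t) >= 0)
                    # Heuristic score: positives accepted minus negatives accepted.
                    score = int((y[accept] == 1).sum()) - int((y[accept] == 0).sum())
                    if best is None or score > best[0]:
                        best = (score, j, t, direction, accept)
        _, j, t, direction, accept = best
        rays.append((j, t, direction))
        kept = accept
    return rays, kept   # the conjunction predicts 1 exactly on `kept`

# Toy usage on a few examples with two real-valued attributes.
X = np.array([[0.2, 1.5], [0.8, 0.3], [0.9, 1.2], [0.1, 0.4]])
y = np.array([0, 1, 1, 0])
print(greedy_conjunction(X, y, max_rays=2))
```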

6 0.53284621 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

7 0.52282935 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

8 0.51010311 136 nips-2004-On Semi-Supervised Classification

9 0.47908512 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

10 0.46113804 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

11 0.42826724 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

12 0.42643511 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

13 0.4158774 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

14 0.41030261 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

15 0.39639139 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

16 0.39438009 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

17 0.3918024 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

18 0.3905732 127 nips-2004-Neighbourhood Components Analysis

19 0.38069817 193 nips-2004-Theories of Access Consciousness

20 0.37950554 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.089), (15, 0.102), (26, 0.061), (31, 0.031), (33, 0.18), (35, 0.018), (39, 0.022), (50, 0.051), (76, 0.019), (86, 0.361)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80963987 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1
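The earlier approach that this work generalizes relied on a handful of hand-crafted lexico-syntactic patterns. The sketch below shows that pattern-based style of extraction with two illustrative regular expressions; the pattern set is a toy assumption and does not reflect the learned dependency-path features described in the abstract.

```python
import re

# Two classic hand-crafted patterns of the kind this work generalizes
# (the pattern set here is a toy assumption).
PATTERNS = [
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>\w+)"),
    re.compile(r"(?P<hypo>\w+) and other (?P<hyper>\w+)"),
]

def extract_hypernym_pairs(text):
    """Return a set of (hyponym, hypernym) candidates matched by the patterns."""
    pairs = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.add((match.group("hypo"), match.group("hyper")))
    return pairs

print(extract_hypernym_pairs(
    "He cites authors such as Shakespeare, and admires oaks and other trees."))
```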

same-paper 2 0.7741425 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

3 0.59933192 102 nips-2004-Learning first-order Markov models for control

Author: Pieter Abbeel, Andrew Y. Ng

Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
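The baseline the abstract argues against is the standard single-step counting estimator. A minimal sketch of that maximum likelihood baseline is given below; the paper's own training criterion is different and is not reproduced here. The fallback to a uniform row for unvisited states is an added assumption.

```python
import numpy as np

def mle_transition_matrix(sequences, n_states):
    """Standard maximum likelihood estimate of a first-order transition matrix:
    count single-step transitions and normalize each row.  Rows for states
    never visited fall back to a uniform distribution."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1
    rows = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.divide(counts, rows, out=uniform, where=rows > 0)

P = mle_transition_matrix([[0, 1, 1, 2], [0, 0, 2, 1]], n_states=3)
print(np.round(P, 2))
```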

4 0.55031359 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al. 1

5 0.54794151 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
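For orientation, the sketch below implements a plain bounded-depth context-counting predictor over suffixes of the recent history; it is only a simplified stand-in, not the margin-based, self-bounded online PST learner analysed in the paper, and the class name and depth limit are assumptions.

```python
from collections import defaultdict

class SuffixContextPredictor:
    """Bounded-depth context counting over suffixes of the recent history.
    A simplified stand-in, not the margin-based online PST learner."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # Credit the observed symbol to every suffix context up to max_depth.
        for d in range(min(self.max_depth, len(history)) + 1):
            context = tuple(history[len(history) - d:])
            self.counts[context][symbol] += 1

    def predict(self, history):
        # Back off from the longest matching context to shorter ones.
        for d in range(min(self.max_depth, len(history)), -1, -1):
            context = tuple(history[len(history) - d:])
            if context in self.counts:
                table = self.counts[context]
                return max(table, key=table.get)
        return None

model = SuffixContextPredictor(max_depth=2)
sequence = "abracadabra"
for i, symbol in enumerate(sequence[:-1]):
    model.update(sequence[:i], symbol)      # online, one symbol at a time
print(model.predict(sequence[:-1]))         # predict the final symbol
```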

6 0.54511243 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

7 0.54499269 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.54491097 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

9 0.54254156 131 nips-2004-Non-Local Manifold Tangent Learning

10 0.54212058 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

11 0.5412941 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

12 0.54097307 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

13 0.54047155 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

14 0.54011619 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

15 0.5400458 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.53990477 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

17 0.53977621 103 nips-2004-Limits of Spectral Clustering

18 0.53869176 116 nips-2004-Message Errors in Belief Propagation

19 0.53852564 124 nips-2004-Multiple Alignment of Continuous Time Series

20 0.53846127 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices