nips nips2004 nips2004-156 knowledge-graph by maker-knowledge-mining

156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge


Source: pdf

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The NIPS 2003 workshops included a feature selection competition organized by the authors. [sent-13, score-0.393]

2 We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. [sent-14, score-0.461]

3 The competition took place over a period of 13 weeks and attracted 78 research groups. [sent-15, score-0.162]

4 Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. [sent-16, score-0.893]

5 In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. [sent-17, score-0.494]

6 The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. [sent-18, score-0.131]

7 Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. [sent-19, score-0.466]

8 The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. [sent-20, score-0.613]

9 In fact, the proliferation of datasets combined with the creativity of researchers in designing experiments makes it hardly possible to compare one paper with another [12]. [sent-30, score-0.273]

10 A number of large conferences have regularly organized competitions (e. [sent-31, score-0.072]

11 In 2003, we organized a competition on the theme of feature selection, the results of which were presented at a workshop on feature extraction, which attracted 98 participants. [sent-35, score-0.436]

12 In this paper, we present to the NIPS community a concise summary of our challenge design and the findings of the result analysis. [sent-37, score-0.254]

13 2 Benchmark design We formatted five datasets (Table 1) from various application domains. [sent-38, score-0.297]

14 The data were split into three subsets: a training set, a validation set, and a test set. [sent-40, score-0.185]

15 The class labels for the validation set and the test set were withheld. [sent-42, score-0.185]

16 The identity of the datasets and of the features (some of which were random features artificially generated) were kept secret. [sent-43, score-0.347]

17 The participants could submit prediction results on the validation set and get their performance results and ranking on-line for a period of 12 weeks. [sent-44, score-0.738]

18 By December 1st , 2003, which marked the end of the development period, the participants had to turn in their results on the test set. [sent-45, score-0.397]

19 Immediately after that, the validation set labels were revealed. [sent-46, score-0.137]

20 On December 8th , 2003, the participants could make submissions of test set predictions, after having trained on both the training and the validation set. [sent-47, score-0.708]

21 Some details on the benchmark design are provided in this Section. [sent-48, score-0.212]

22 Challenge format We gave our benchmark the format of a challenge to stimulate participation. [sent-49, score-0.524]

23 We made available an automatic web-based system to submit prediction results and get immediate feed-back, inspired by the system of the NIPS2000 and NIPS2001 unlabelled data competitions [4, 5]. [sent-50, score-0.191]

24 During development participants could submit validation results on any of the five datasets proposed (not necessarily all). [sent-52, score-0.762]

25 Competitors were required to submit results on all five test sets by the challenge deadline to be included in the final ranking. [sent-53, score-0.363]

26 This avoided a common problem of “multiple track” benchmarks in which no conclusion can be drawn because too few participants enter all tracks. [sent-54, score-0.304]

27 “pure” methods and “hybrids”), we allowed participating groups to submit a total of five final entries on December 1st and five entries on December 8th . [sent-57, score-0.434]

28 Our format was very successful: it attracted 78 research groups who competed for 13 weeks and made (submitted) a total of 1863 entries. [sent-58, score-0.231]

29 Twenty groups were eligible for being ranked on December 1st (56 submissions1), and 16 groups on December 8th (36 submissions). [sent-59, score-0.27]

30 The web site www.nipsfsc.ecs.soton.ac.uk remains available as a resource for researchers in feature selection. [sent-65, score-0.206]

31 1 After imposing a maximum of 5 submissions per group and eliminating some incomplete submissions, there remained 56 eligible submissions out of the 135 received. [sent-66, score-0.534]

32 The situation has changed considerably in the past few years, and in the 2003 special issue we edited for JMLR including papers from the proceedings of the NIPS 2001 workshop [7], most papers explore domains with hundreds to tens of thousands of variables or features. [sent-71, score-0.09]

33 Feature selection is a particular way of tackling the problem of space dimensionality reduction. [sent-73, score-0.153]

34 The necessary computing power to handle large datasets is now available in simple laptops, so there is a proliferation of solutions proposed for such feature selection problems. [sent-74, score-0.51]

35 We formatted five datasets for the purpose of benchmarking variable selection algorithms (see Table 1). [sent-76, score-0.392]

36 The datasets were chosen to span a variety of domains and difficulties (the input variables are continuous or binary, sparse or dense; one dataset has unbalanced classes). [sent-77, score-0.157]

37 One dataset (Madelon) was artificially constructed to illustrate a particular difficulty: selecting a feature set when no feature is informative by itself. [sent-78, score-0.29]

38 We chose datasets that had sufficiently many examples to create a large enough test set to obtain statistically significant results [6]. [sent-79, score-0.205]

39 To prevent researchers familiar with the datasets from having an advantage, we concealed the identity of the datasets during the benchmark. [sent-80, score-0.375]

40 In particular, we introduced a number of features called probes. [sent-82, score-0.095]

41 The probes were drawn at random from a distribution resembling that of the real features, but carrying no information about the class labels. [sent-83, score-0.152]

42 Such probes have a function in performance assessment: a good feature selection algorithm should eliminate most of the probes. [sent-84, score-0.45]
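To make the probe idea concrete, here is a minimal sketch (illustrative only; `add_probes` and its arguments are made-up names, not part of the challenge software) that appends probe columns obtained by permuting randomly chosen real feature columns, so each probe resembles a real feature's marginal distribution but carries no information about the labels:

```python
import numpy as np

def add_probes(X, n_probes, seed=0):
    """Append n_probes random 'probe' columns to the feature matrix X.

    Each probe is a random permutation of a randomly chosen real column,
    so its distribution resembles a real feature's but it is independent
    of the class labels.
    """
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, X.shape[1], size=n_probes)
    probes = np.stack([rng.permutation(X[:, c]) for c in cols], axis=1)
    # Boolean mask marking which columns of the augmented matrix are probes.
    probe_mask = np.r_[np.zeros(X.shape[1], bool), np.ones(n_probes, bool)]
    return np.hstack([X, probes]), probe_mask
```

A feature selection method applied to the augmented matrix can then be judged in part by how few entries of `probe_mask` it retains.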

43 2 In this paper, we do not make a distinction between features and variables. [sent-86, score-0.095]

44 The benchmark addresses the problem of selecting input variables. [sent-87, score-0.154]

45 Those may actually be features derived from the original variables through preprocessing. [sent-88, score-0.095]

46 (b) December 8th 2003 challenge results. [sent-130, score-0.196]

47 2 Performance assessment Final submissions qualified for scoring if they included the class predictions for training, validation, and test sets for all five tasks proposed, and the list of features used. [sent-232, score-0.422]

48 This metric was used because some datasets (particularly Dorothea) are unbalanced. [sent-235, score-0.157]

49 The curve represents the fraction of true positives as a function of the fraction of false positives. [sent-238, score-0.104]

50 • Fprobe: Fraction of probes found in the feature set selected. [sent-241, score-0.297]
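A rough sketch of the evaluation statistics mentioned here (balanced error rate, AUC, Ffeat and Fprobe), assuming NumPy and scikit-learn; the official scoring code is the published software, not this sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_error_rate(y_true, y_pred):
    # BER: average of the error rates on the positive and the negative class,
    # which is why it is preferred on unbalanced datasets such as Dorothea.
    pos, neg = y_true == 1, y_true == -1
    return 0.5 * (np.mean(y_pred[pos] != 1) + np.mean(y_pred[neg] != -1))

def challenge_metrics(y_true, y_pred, scores, selected, probe_mask):
    """selected and probe_mask are boolean masks over all candidate features."""
    return {
        "BER": balanced_error_rate(y_true, y_pred),
        "AUC": roc_auc_score(y_true, scores),
        "Ffeat": float(selected.mean()),               # fraction of features used
        "Fprobe": float(probe_mask[selected].mean()),  # fraction of probes among them
    }
```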

51 We ranked the participants with the test set results using a score combining BER, Ffeat and Fprobe. [sent-242, score-0.43]

52 Briefly: We used the McNemar test to determine whether classifier A is better than classifier B according to the BER with 5% risk, yielding a score of 1 (better), 0 (don’t know), or −1 (worse). [sent-243, score-0.126]

53 The overall score for each dataset is the sum of the pairwise comparison scores (normalized by the maximum achievable score, that is, the number of submissions minus one). [sent-246, score-0.297]
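The pairwise scoring scheme could look roughly like the sketch below (a simplification: it applies a standard McNemar test to the raw error indicators, whereas the challenge compared BER values and handled ties as described in the paper):

```python
import numpy as np
from scipy.stats import chi2

def pairwise_score(err_a, err_b, risk=0.05):
    """+1 if submission A beats B with 5% risk, -1 if it loses, 0 if undecided."""
    n01 = int(np.sum(~err_a & err_b))   # examples B gets wrong but A gets right
    n10 = int(np.sum(err_a & ~err_b))   # examples A gets wrong but B gets right
    if n01 + n10 == 0:
        return 0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)   # McNemar statistic
    if chi2.sf(stat, df=1) > risk:
        return 0
    return 1 if n01 > n10 else -1

def overall_scores(error_vectors):
    """error_vectors[i] is the boolean test-set error vector of submission i."""
    k = len(error_vectors)
    return [sum(pairwise_score(error_vectors[i], error_vectors[j])
                for j in range(k) if j != i) / (k - 1)   # normalize by max score
            for i in range(k)]
```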

54 Our scoring method favors accuracy over feature set compactness. [sent-249, score-0.205]

55 Our benchmark design could not prevent participants from “cheating” in the following way. [sent-250, score-0.516]

56 An entrant could “declare” a smaller feature subset than the one used to make predictions. [sent-251, score-0.145]

57 To deter participants from cheating, we warned them that we would be performing a stage of verification. [sent-252, score-0.304]

58 3 Challenge results The overall scores of the best entries are shown in Table 2. [sent-254, score-0.102]

59 The main features of the methods of the participants listed in that table are summarized in Table 3. [sent-255, score-0.441]

60 The analysis of this section also includes a survey of ten more top-ranking participants. [sent-256, score-0.165]

61 Winners The winners of the benchmark (both December 1st and 8th ) are Radford Neal and Jianguo Zhang, with a combination of Bayesian neural networks [14] and Dirichlet diffusion trees [15]. [sent-257, score-0.285]

62 They are also the top entrants on December 1st for Arcene and Dexter, and on December 1st and 8th for Dorothea. [sent-259, score-0.129]

63 • They then applied a classification method based on Bayesian learning, using an Automatic Relevance Determination (ARD) prior that allows the model to determine which of these features are most relevant. [sent-261, score-0.095]

64 As allowed by the challenge rules, the winners constructed these trees using both the training data and the unlabeled data in the validation and test sets. [sent-264, score-0.556]
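The winners' Bayesian neural networks and Dirichlet diffusion trees are not reproduced here, but the effect of an ARD prior (driving the weights attached to irrelevant inputs toward zero) can be illustrated on a toy linear model with scikit-learn's ARDRegression; this is only an illustration of the ARD idea, not the winning method:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))            # only the first 5 features matter
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

ard = ARDRegression()                         # one precision hyperparameter per weight
ard.fit(X, y)
kept = np.flatnonzero(np.abs(ard.coef_) > 1e-3)
print("features retained by the ARD prior:", kept)
```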

65 The classifiers are grouped in four categories: N - neural network, K - SVM or other kernel method, T - tree classifiers (none found in the top ranking methods), O - other. [sent-267, score-0.165]

66 The feature selection engines (Fengine) are grouped in three categories: C - single variable criteria including correlation coefficients, T - tree classifiers or RF used as a filter, E - wrapper or embedded methods. [sent-268, score-0.398]

67 The search methods are identified by: E embedded, R - feature ranking, B - backward elimination, S - more elaborated search. [sent-269, score-0.145]

68 Our findings include: Feature selection The winners and several top ranking challengers use a combination of filters and embedded methods3 . [sent-271, score-0.549]

69 Several high ranking participants obtain good results using only filters, even simple correlation coefficients. [sent-272, score-0.422]

70 The second best entrants use Random Forests, an ensemble of tree classifiers, to perform feature selection [3]. [sent-273, score-0.454]
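The general pattern, Random Forest importances used only as a filter with a separate classifier trained on the retained subset, might be sketched as follows (the number of trees, the number of kept features, and the SVM settings are arbitrary choices for illustration, not the entrants' settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def rf_filter_then_classify(X_train, y_train, X_test, n_keep=200):
    # 1. Rank features with Random Forest importances (RF used as a filter only).
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X_train, y_train)
    keep = np.argsort(rf.feature_importances_)[::-1][:n_keep]
    # 2. Train a different classifier on the retained features.
    clf = SVC(kernel="rbf")
    clf.fit(X_train[:, keep], y_train)
    return clf.predict(X_test[:, keep]), keep
```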

71 Search strategies are generally unsophisticated (simple feature ranking, forward selection or backward elimination). [sent-274, score-0.298]
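A bare-bones greedy forward selection of the kind alluded to here, assuming scikit-learn (wrapper-style, scored by cross-validation; actual entries varied in the scoring criterion and stopping rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, n_features, cv=5):
    """Greedily add, one at a time, the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best_score, best_j = max(
            (np.mean(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv)), j)
            for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```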

72 Only one group used “random probes” purposely introduced to track the fraction of falsely selected features. [sent-278, score-0.1]

73 In spite of the high risk of overfitting, 7 of the 9 top groups using kernel methods found that Gaussian kernels gave them better results than the linear kernel on Arcene, Dexter, Dorothea, or Gisette (for Madelon all best ranking groups used a Gaussian kernel). [sent-282, score-0.387]

74 Ensemble methods Some groups relied on a committee of classifiers to make the final decision. [sent-283, score-0.111]

75 The techniques to build such a committee include sampling. 3 We distinguish embedded methods that have a feature selection mechanism built into the learning algorithm from wrappers, which perform feature selection by using the classifier as a black box. [sent-284, score-0.696]

76 4 Random Forests (RF) are classification techniques with an embedded feature selection mechanism. [sent-285, score-0.398]

77 The participants used the features generated by RF, but did not use RF for classification. [sent-286, score-0.399]

78 Most groups that used ensemble methods reported improved accuracy. [sent-288, score-0.185]
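One simple way to build such a committee, shown only as an illustration rather than any particular entry's method, is a soft-voting ensemble of heterogeneous classifiers; sampling-based schemes such as bagging over data subsets are another common choice:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Committee of three different classifiers; predicted probabilities are
# averaged ("soft" vote) to produce the final decision.
committee = VotingClassifier(
    estimators=[("svm", SVC(kernel="rbf", probability=True)),
                ("logreg", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="soft")
# committee.fit(X_train, y_train); y_pred = committee.predict(X_test)
```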

79 Transduction Since all the datasets were provided from the beginning of the benchmark (validation and test sets deprived of their class labels), it was possible to make use of the unlabelled data as part of learning (sometimes referred to as transduction [17]). [sent-289, score-0.42]

80 Only two groups took advantage of that, including the winners. [sent-290, score-0.111]
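The winners exploited the unlabeled validation and test inputs inside their Dirichlet diffusion tree model, which is not reproduced here; as a generic stand-in, the sketch below propagates labels to the unlabeled points over a nearest-neighbor graph with scikit-learn's LabelSpreading (labels are recoded to {0, 1} because -1 marks unlabeled points in that API):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def transductive_predict(X_train, y_train, X_unlabeled):
    X_all = np.vstack([X_train, X_unlabeled])
    y_all = np.concatenate([(y_train == 1).astype(int),              # labeled: 0/1
                            -np.ones(len(X_unlabeled), dtype=int)])  # unlabeled: -1
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X_all, y_all)
    return model.transduction_[len(X_train):]   # labels inferred for the unlabeled part
```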

81 Preprocessing Centering and scaling the features was the most common preprocessing used. [sent-291, score-0.095]

82 4 Conclusions and future work The challenge demonstrated both that feature selection can be performed effectively and that eliminating meaningless features is not critical to achieve good classification performance. [sent-295, score-0.589]

83 By design, our datasets include many irrelevant “distracters” features, called “probes”. [sent-296, score-0.157]

84 It is surprising that some of the best entries use all the features. [sent-298, score-0.102]

85 For many years, filter methods have dominated feature selection for computational reasons. [sent-301, score-0.298]

86 It was understood that wrapper and embedded methods are more powerful, but too computationally expensive. [sent-302, score-0.172]

87 Some of the top ranking entries use one or several filters as their only selection strategy. [sent-303, score-0.42]

88 A filter as simple as the Pearson correlation coefficient proves to be very effective, even though it does not remove feature redundancy and therefore yields unnecessarily large feature subsets. [sent-304, score-0.29]
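Such a filter amounts to ranking features by the absolute Pearson correlation between each column and the labels, as in this small sketch:

```python
import numpy as np

def pearson_filter(X, y, n_keep):
    """Keep the n_keep features with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(np.abs(corr))[::-1][:n_keep]
```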

89 Other entrants combined filters and embedded methods to further reduce the feature set and eliminate redundancies. [sent-305, score-0.327]

90 Several challenge datasets included a very large number of features (up to 100,000) and only a few hundred examples. [sent-307, score-0.503]

91 Not surprisingly, the winning entries use as classifiers either ensemble methods or strongly regularized classifiers. [sent-309, score-0.176]

92 Hence, with adequate regularization, non-linear classifiers do not overfit the data, even when the number of features exceeds the number of examples by orders of magnitude. [sent-311, score-0.095]
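In practice this usually means tuning the strength of a ridge-style penalty by cross-validation, for example as in the illustrative pipeline below (not any entrant's actual model):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L2-regularized linear classifier for the many-features / few-examples regime;
# the regularization strength C is selected by cross-validation.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", max_iter=5000))
search = GridSearchCV(model,
                      {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1.0, 10.0]},
                      cv=5)
# search.fit(X_train, y_train); best = search.best_estimator_
```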

93 Principal Component Analysis was successfully used by several researchers to reduce the dimension of input space down to a few hundred features, without any knowledge of the class labels. [sent-312, score-0.116]
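Because PCA uses no class labels, it can be fit on all available inputs (training plus the unlabeled validation and test sets) before classification; a minimal sketch, with made-up function and variable names:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(X_train, X_unlabeled, n_components=100):
    # Fit the projection on all inputs (no labels involved), then
    # project the training data onto the leading components.
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack([X_train, X_unlabeled]))
    return pca.transform(X_train), pca
```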

94 The analysis of the challenge results revealed that hyperparameter selection may have played an important role in winning the challenge. [sent-314, score-0.349]

95 Indeed, several groups were using the same classifier (e. [sent-315, score-0.111]

96 We have started laying the basis of a new benchmark on the theme of model selection and hyperparameter selection [8]. [sent-318, score-0.504]

97 We also thank the people who formatted the data and made them available: Thorsten Joachims, Yann Le Cun, and the KDD Cup 2001 organizers. [sent-321, score-0.082]

98 Selection of relevant features and examples in machine learning. [sent-327, score-0.095]

99 Design of experiments of the NIPS 2003 variable selection benchmark. [sent-351, score-0.153]

100 Model selection and ensemble methods challenge in preparation http://clopinet. [sent-367, score-0.423]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('participants', 0.304), ('submissions', 0.219), ('december', 0.196), ('challenge', 0.196), ('ber', 0.174), ('datasets', 0.157), ('benchmark', 0.154), ('selection', 0.153), ('probes', 0.152), ('feature', 0.145), ('yes', 0.143), ('validation', 0.137), ('winners', 0.131), ('rf', 0.127), ('neal', 0.127), ('submit', 0.119), ('ranking', 0.118), ('groups', 0.111), ('ffeat', 0.11), ('ghostminer', 0.11), ('usion', 0.11), ('entries', 0.102), ('auc', 0.101), ('embedded', 0.1), ('classi', 0.099), ('fe', 0.095), ('features', 0.095), ('arcene', 0.082), ('ari', 0.082), ('cbagroup', 0.082), ('dexter', 0.082), ('dorothea', 0.082), ('entrants', 0.082), ('formatted', 0.082), ('madelon', 0.082), ('di', 0.08), ('score', 0.078), ('ve', 0.074), ('ensemble', 0.074), ('competitions', 0.072), ('rlsc', 0.072), ('wrapper', 0.072), ('guyon', 0.07), ('roc', 0.066), ('format', 0.065), ('nips', 0.063), ('dense', 0.061), ('researchers', 0.061), ('transduction', 0.061), ('scoring', 0.06), ('dirichlet', 0.06), ('period', 0.06), ('ers', 0.06), ('lters', 0.059), ('chen', 0.059), ('design', 0.058), ('coe', 0.057), ('forests', 0.057), ('cheating', 0.055), ('distracters', 0.055), ('elissee', 0.055), ('fengine', 0.055), ('gideon', 0.055), ('gisette', 0.055), ('gunn', 0.055), ('isabelle', 0.055), ('nikravesh', 0.055), ('proliferation', 0.055), ('team', 0.055), ('hundred', 0.055), ('attracted', 0.055), ('fraction', 0.052), ('pr', 0.05), ('group', 0.048), ('sa', 0.048), ('test', 0.048), ('kdd', 0.048), ('bagging', 0.048), ('contributed', 0.048), ('asa', 0.048), ('eligible', 0.048), ('ard', 0.048), ('workshops', 0.048), ('wrappers', 0.048), ('kremer', 0.048), ('top', 0.047), ('competition', 0.047), ('balanced', 0.045), ('development', 0.045), ('papers', 0.045), ('unlabeled', 0.044), ('leo', 0.044), ('stimulate', 0.044), ('theme', 0.044), ('parenthesis', 0.044), ('table', 0.042), ('http', 0.041), ('lter', 0.041), ('er', 0.041), ('ties', 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

2 0.13701141 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an -accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an -accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1

3 0.13118185 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1

4 0.1253994 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

5 0.08981435 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

6 0.088344492 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

7 0.074071467 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

8 0.071080551 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

9 0.070331082 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

10 0.069638439 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

11 0.06904804 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

12 0.064889796 68 nips-2004-Face Detection --- Efficient and Rank Deficient

13 0.064853445 100 nips-2004-Learning Preferences for Multiclass Problems

14 0.064730033 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

15 0.063474603 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

16 0.063173316 164 nips-2004-Semi-supervised Learning by Entropy Minimization

17 0.062987052 19 nips-2004-An Application of Boosting to Graph Classification

18 0.062983595 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

19 0.062933944 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

20 0.062498074 34 nips-2004-Breaking SVM Complexity with Cross-Training


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, 0.082), (2, -0.041), (3, 0.029), (4, -0.015), (5, 0.12), (6, 0.087), (7, 0.081), (8, 0.13), (9, -0.07), (10, -0.162), (11, 0.086), (12, 0.013), (13, 0.074), (14, -0.056), (15, 0.128), (16, -0.09), (17, -0.044), (18, 0.051), (19, 0.011), (20, 0.011), (21, 0.016), (22, -0.04), (23, 0.041), (24, 0.035), (25, 0.118), (26, -0.037), (27, -0.04), (28, 0.027), (29, 0.039), (30, -0.044), (31, -0.016), (32, 0.028), (33, 0.051), (34, -0.083), (35, -0.08), (36, -0.163), (37, 0.006), (38, -0.053), (39, -0.024), (40, 0.029), (41, -0.198), (42, 0.085), (43, 0.054), (44, 0.007), (45, -0.103), (46, 0.017), (47, -0.012), (48, -0.015), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9477939 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

2 0.62477505 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classifier is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classification system is essential. The criterion widely used to measure the ranking quality of a classification algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of confidence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the confidence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar confidence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of confidence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight confidence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose confidence intervals, or strong distribution-dependent assumptions leading to tight but unsafe confidence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. Our results are also useful for testing the statistical significance of the difference of the AUC values of two classifiers. The paper is organized as follows. We first introduce the definition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and briefly review some essential aspects of the existing literature related to the computation of confidence intervals for the AUC. Our computation of the expected value and variance of the AUC for a fixed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. 

3 0.59320611 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1
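As a loose illustration of the kind of conic quadratic (second-order cone) program such a reduction produces, the sketch below solves a hinge-loss problem with an l2 budget on the weights using cvxpy; the objective and constraint are stand-ins, not the data-dependent generalization bound minimized in the paper, and availability of cvxpy with a conic solver is assumed.

```python
import cvxpy as cp
import numpy as np

# Toy data: X is (n_samples, n_features), labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(40) > 0, 1.0, -1.0)

w = cp.Variable(5)
b = cp.Variable()
hinge = cp.pos(1 - cp.multiply(y, X @ w + b))   # elementwise hinge terms

# Hinge loss under an l2 budget on w; cvxpy compiles this into a
# second-order cone (conic quadratic) program.
problem = cp.Problem(cp.Minimize(cp.sum(hinge)), [cp.norm(w, 2) <= 1.0])
problem.solve()
print("learned feature weights:", np.round(w.value, 3))
```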

4 0.58231139 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth

Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1
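A quick way to build intuition for the sample-size comparison is to simulate the sampling variability of the empirical AUC and of the empirical error rate under some assumed score model. The sketch below does this with Gaussian scores and a fixed threshold; the score model, threshold, and sample sizes are arbitrary assumptions, and the simulation is not a substitute for the distribution-free bound derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_auc(pos, neg):
    return (pos[:, None] > neg[None, :]).mean()

def empirical_error(pos, neg, threshold=0.5):
    # Errors of the thresholded classifier: missed positives plus false alarms.
    return ((pos <= threshold).sum() + (neg > threshold).sum()) / (pos.size + neg.size)

aucs, errors = [], []
for _ in range(500):
    pos = rng.normal(1.0, 1.0, 100)   # assumed score model for positives
    neg = rng.normal(0.0, 1.0, 100)   # assumed score model for negatives
    aucs.append(empirical_auc(pos, neg))
    errors.append(empirical_error(pos, neg))

print("spread of empirical AUC over test draws:   %.4f" % np.std(aucs))
print("spread of empirical error over test draws: %.4f" % np.std(errors))
```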

5 0.53328353 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Author: Mario Marchand, Mohak Shah

Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1
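For orientation, the sketch below builds a conjunction of single-attribute threshold tests with a plain hard-greedy rule; the scoring heuristic and stopping rule are assumptions, and the paper's actual procedure is a soft greedy choice guided by the PAC-Bayes bound rather than this simplification.

```python
import numpy as np

def greedy_conjunction(X, y, max_rays=3):
    """Build a conjunction of single-attribute threshold tests ("rays") with a
    plain hard-greedy rule.  Illustrative only; not the soft greedy,
    bound-driven algorithm of the paper."""
    rays = []
    kept = np.ones(len(y), dtype=bool)          # examples the conjunction still accepts
    for _ in range(max_rays):
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for direction in (1, -1):
                    accept = kept & (direction * (X[:, j] - t) >= 0)
                    # Heuristic score: positives accepted minus negatives accepted.
                    score = int((y[accept] == 1).sum()) - int((y[accept] == 0).sum())
                    if best is None or score > best[0]:
                        best = (score, j, t, direction, accept)
        _, j, t, direction, accept = best
        rays.append((j, t, direction))
        kept = accept
    return rays, kept   # the conjunction predicts 1 exactly on `kept`

# Toy usage on a few examples with two real-valued attributes.
X = np.array([[0.2, 1.5], [0.8, 0.3], [0.9, 1.2], [0.1, 0.4]])
y = np.array([0, 1, 1, 0])
print(greedy_conjunction(X, y, max_rays=2))
```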

6 0.53284621 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

7 0.52282935 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

8 0.51010311 136 nips-2004-On Semi-Supervised Classification

9 0.47908512 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

10 0.46113804 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

11 0.42826724 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

12 0.42643511 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

13 0.4158774 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

14 0.41030261 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

15 0.39639139 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

16 0.39438009 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

17 0.3918024 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

18 0.3905732 127 nips-2004-Neighbourhood Components Analysis

19 0.38069817 193 nips-2004-Theories of Access Consciousness

20 0.37950554 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.089), (15, 0.102), (26, 0.061), (31, 0.031), (33, 0.18), (35, 0.018), (39, 0.022), (50, 0.051), (76, 0.019), (86, 0.361)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80963987 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng

Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1
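The earlier approach that this work generalizes relied on a handful of hand-crafted lexico-syntactic patterns. The sketch below shows that pattern-based style of extraction with two illustrative regular expressions; the pattern set is a toy assumption and does not reflect the learned dependency-path features described in the abstract.

```python
import re

# Two classic hand-crafted patterns of the kind this work generalizes
# (the pattern set here is a toy assumption).
PATTERNS = [
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>\w+)"),
    re.compile(r"(?P<hypo>\w+) and other (?P<hyper>\w+)"),
]

def extract_hypernym_pairs(text):
    """Return a set of (hyponym, hypernym) candidates matched by the patterns."""
    pairs = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.add((match.group("hypo"), match.group("hyper")))
    return pairs

print(extract_hypernym_pairs(
    "He cites authors such as Shakespeare, and admires oaks and other trees."))
```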

same-paper 2 0.7741425 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

3 0.59933192 102 nips-2004-Learning first-order Markov models for control

Author: Pieter Abbeel, Andrew Y. Ng

Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
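The baseline the abstract argues against is the standard single-step counting estimator. A minimal sketch of that maximum likelihood baseline is given below; the paper's own training criterion is different and is not reproduced here. The fallback to a uniform row for unvisited states is an added assumption.

```python
import numpy as np

def mle_transition_matrix(sequences, n_states):
    """Standard maximum likelihood estimate of a first-order transition matrix:
    count single-step transitions and normalize each row.  Rows for states
    never visited fall back to a uniform distribution."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1
    rows = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.divide(counts, rows, out=uniform, where=rows > 0)

P = mle_transition_matrix([[0, 1, 1, 2], [0, 0, 2, 1]], n_states=3)
print(np.round(P, 2))
```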

4 0.55031359 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al. 1

5 0.54794151 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
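For orientation, the sketch below implements a plain bounded-depth context-counting predictor over suffixes of the recent history; it is only a simplified stand-in, not the margin-based, self-bounded online PST learner analysed in the paper, and the class name and depth limit are assumptions.

```python
from collections import defaultdict

class SuffixContextPredictor:
    """Bounded-depth context counting over suffixes of the recent history.
    A simplified stand-in, not the margin-based online PST learner."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # Credit the observed symbol to every suffix context up to max_depth.
        for d in range(min(self.max_depth, len(history)) + 1):
            context = tuple(history[len(history) - d:])
            self.counts[context][symbol] += 1

    def predict(self, history):
        # Back off from the longest matching context to shorter ones.
        for d in range(min(self.max_depth, len(history)), -1, -1):
            context = tuple(history[len(history) - d:])
            if context in self.counts:
                table = self.counts[context]
                return max(table, key=table.get)
        return None

model = SuffixContextPredictor(max_depth=2)
sequence = "abracadabra"
for i, symbol in enumerate(sequence[:-1]):
    model.update(sequence[:i], symbol)      # online, one symbol at a time
print(model.predict(sequence[:-1]))         # predict the final symbol
```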

6 0.54511243 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

7 0.54499269 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.54491097 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

9 0.54254156 131 nips-2004-Non-Local Manifold Tangent Learning

10 0.54212058 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

11 0.5412941 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

12 0.54097307 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

13 0.54047155 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

14 0.54011619 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

15 0.5400458 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.53990477 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

17 0.53977621 103 nips-2004-Limits of Spectral Clustering

18 0.53869176 116 nips-2004-Message Errors in Belief Propagation

19 0.53852564 124 nips-2004-Multiple Alignment of Continuous Time Series

20 0.53846127 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices