nips nips2003 nips2003-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. [sent-4, score-0.789]
2 A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. [sent-5, score-1.093]
3 We present a simple algorithm to obtain such a smooth solution. [sent-6, score-0.099]
4 Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. [sent-7, score-0.359]
5 1 Introduction We consider the general problem of learning from labeled and unlabeled data. [sent-8, score-0.689]
6 Given a point set X = {x1, . . . , xl, xl+1, . . . , xn} and a label set L = {1, . . . , c}, the first l points have labels {y1, . . . , yl} ∈ L and the remaining points are unlabeled. [sent-21, score-0.101]
7 The goal is to predict the labels of the unlabeled points. [sent-22, score-0.351]
8 The performance of an algorithm is measured by the error rate on these unlabeled points only. [sent-23, score-0.421]
9 Since labeling often requires expensive human labor, whereas unlabeled data is far easier to obtain, semi-supervised learning is very useful in many real-world problems and has recently attracted a considerable amount of research [10]. [sent-25, score-0.421]
10 A typical application is web categorization, in which manually classified web pages are always a very small part of the entire web, and the number of unlabeled examples is large. [sent-26, score-0.408]
11 The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label. [sent-27, score-0.377]
12 This argument is akin to that in [2, 3, 4, 10, 15] and often called the cluster assumption [4, 10]. [sent-28, score-0.096]
13 Orthodox supervised learning algorithms, such as k-NN, in general depend only on the first assumption of local consistency. [sent-30, score-0.195]
14 To illustrate the prior assumption of consistency underlying semi-supervised learning, let us consider a toy dataset generated according to a pattern of two intertwining moons in Figure 1(a). [sent-31, score-0.72]
15 Every point should be similar to points in its local neighborhood, and furthermore, points in one moon should be more similar to each other than to points in the other moon. [sent-32, score-0.38]
16 The classification results given by the Support Vector Machine (SVM) with a RBF kernel and by k-NN are shown in Figure 1(b) and 1(c), respectively. [sent-33, score-0.096]
17 [Figure 1 legend: unlabeled point; labeled point −1; labeled point +1.] [sent-34, score-1.109]
18 Figure 1: Classification on the two moons pattern. [sent-69, score-0.225]
19 (a) toy data set with two labeled points; (b) classifying result given by the SVM with a RBF kernel; (c) k-NN with k = 1; (d) ideal classification that we hope to obtain. [sent-70, score-0.733]
20 According to the assumption of consistency, however, the two moons should be classified as shown in Figure 1(d). [sent-72, score-0.286]
21 The main differences between the various semi-supervised learning algorithms, such as spectral methods [2, 4, 6], random walks [13, 15], graph mincuts [3] and transductive SVM [14], lie in their way of realizing the assumption of consistency. [sent-73, score-0.275]
22 A principled approach to formalize the assumption is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure revealed by known labeled and unlabeled points. [sent-74, score-1.121]
23 Here we propose a simple iteration algorithm to construct such a smooth function inspired by the work on spreading activation networks [1, 11] and diffusion kernels [7, 8, 12], recent work on semi-supervised learning and clustering [2, 4, 9], and more specifically by the work of Zhu et al. [sent-75, score-0.512]
24 The keynote of our method is to let every point iteratively spread its label information to its neighbors until a global stable state is achieved. [sent-77, score-0.296]
25 Given a point set X = {x1, . . . , xl, xl+1, . . . , xn} ⊂ R^m and a label set L = {1, . . . , c}, the first l points xi (i ≤ l) are labeled as yi ∈ L and the remaining points xu (l+1 ≤ u ≤ n) are unlabeled. [sent-88, score-0.64]
26 The goal is to predict the label of the unlabeled points. [sent-89, score-0.464]
27 Let F denote the set of n × c matrices with nonnegative entries; a matrix F = [F1^T, . . . , Fn^T]^T ∈ F corresponds to a classification on the dataset X by labeling each point xi as a label yi = arg maxj≤c Fij. [sent-94, score-0.402]
28 We can understand F as a vectorial function F : X → Rc which assigns a vector Fi to each point xi . [sent-95, score-0.106]
29 Define an n × c matrix Y ∈ F with Yij = 1 if xi is labeled as yi = j and Yij = 0 otherwise. [sent-96, score-0.493]
30 Form the affinity matrix W defined by Wij = exp(−‖xi − xj‖^2 / 2σ^2) if i ≠ j and Wii = 0. [sent-99, score-0.17]
31 Construct the matrix S = D^{-1/2} W D^{-1/2} in which D is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of W. [sent-101, score-0.166]
32 Label each point xi as a label yi = arg maxj≤c F∗ij. [sent-106, score-0.323]
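As a minimal NumPy sketch of the four steps just described; the function name, the σ value, the tolerance, and the iteration cap below are illustrative assumptions rather than settings taken from the experiments:

```python
import numpy as np

def consistency_method(X, y, sigma=0.5, alpha=0.99, tol=1e-6, max_iter=1000):
    # X: (n, d) array of points; y: length-n integer array with -1 marking unlabeled points.
    n = X.shape[0]
    classes = np.unique(y[y >= 0])
    # Step 1: affinity matrix with zero diagonal, W_ij = exp(-||x_i - x_j||^2 / 2 sigma^2).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Initial label matrix Y: Y_ij = 1 if x_i is labeled as class j, 0 otherwise.
    Y = np.zeros((n, classes.size))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    # Step 3: iterate F(t+1) = alpha S F(t) + (1 - alpha) Y until convergence.
    F = Y.copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1.0 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            F = F_next
            break
        F = F_next
    # Step 4: label each point by the class from which it received most information.
    return classes[F.argmax(axis=1)]
```

For large datasets a sparse affinity matrix would replace the dense one used in this sketch.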
33 This algorithm can be understood intuitively in terms of spreading activation networks [1, 11] from experimental psychology. [sent-107, score-0.139]
34 We first define a pairwise relationship W on the dataset X with the diagonal elements being zero. [sent-108, score-0.173]
35 In the second step, the weight matrix W of G is normalized symmetrically, which is necessary for the convergence of the following iteration. [sent-110, score-0.097]
36 The first two steps are exactly the same as in spectral clustering [9]. [sent-111, score-0.103]
37 During each iteration of the third step each point receives the information from its neighbors (first term), and also retains its initial information (second term). [sent-112, score-0.204]
38 The parameter α specifies the relative amount of the information from its neighbors and its initial label information. [sent-113, score-0.254]
39 It is worth mentioning that self-reinforcement is avoided since the diagonal elements of the affinity matrix are set to zero in the first step. [sent-114, score-0.179]
40 Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process. [sent-116, score-0.589]
41 Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1] (note that S is similar to the stochastic matrix P = D^{-1}W = D^{-1/2} S D^{1/2}), lim t→∞ (αS)^{t−1} = 0 and lim t→∞ Σ_{i=0}^{t−1} (αS)^i = (I − αS)^{−1}. [sent-120, score-0.159]
42 This also shows that the iteration result does not depend on the initial value for the iteration. [sent-123, score-0.125]
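A reconstruction of the expansion behind these two sentences, written for the convention F(0) = Y (the limit is the same for any bounded initial value):

```latex
F(t) \;=\; (\alpha S)^{t} Y \;+\; (1-\alpha)\sum_{i=0}^{t-1} (\alpha S)^{i} Y
\quad\Longrightarrow\quad
F^{*} \;=\; \lim_{t\to\infty} F(t) \;=\; (1-\alpha)\,(I-\alpha S)^{-1} Y .
```

Since the scalar factor (1 − α) does not change the arg max over classes, the closed form is often quoted simply as F∗ = (I − αS)^{−1}Y.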
43 In addition, it is worth noticing that (I − αS)^{−1} is in fact a graph or diffusion kernel [7, 12]. [sent-124, score-0.254]
44 Now we discuss some possible variants of this method. [sent-125, score-0.103]
45 Then the classifying function is F∗ = (I − αP^T)^{−1}Y. [sent-131, score-0.197]
46 We will compare these variants with the original algorithm in the experiments. [sent-133, score-0.103]
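A small sketch of how the closed form and the variants compared in the experiments could be computed directly; the function and variable names are assumptions, and for large n an iterative or sparse solver would be preferable to the dense solves shown here:

```python
import numpy as np

def closed_form_solutions(W, Y, alpha=0.99):
    # W: (n, n) affinity matrix with zero diagonal; Y: (n, c) initial label matrix.
    d = W.sum(axis=1)
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt                     # symmetric normalization
    P = D_inv @ W                                       # row-stochastic transition matrix
    I = np.eye(W.shape[0])
    F_main = np.linalg.solve(I - alpha * S, Y)          # F* = (I - alpha S)^{-1} Y
    F_var1 = np.linalg.solve(I - alpha * P.T, Y)        # F* = (I - alpha P^T)^{-1} Y
    F_var2 = np.linalg.solve(np.diag(d) - alpha * W, Y) # F* = (D - alpha W)^{-1} Y
    return F_main, F_var1, F_var2
```

Constant scalings of these solutions do not affect the final arg-max labeling.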
47 3 Regularization Framework Here we develop a regularization framework for the above iteration algorithm. [sent-134, score-0.148]
48 Then the classifying function is F∗ = arg min_{F∈F} Q(F). (5) [sent-136, score-0.235]
49 The first term of the right-hand side in the cost function is the smoothness constraint, which means that a good classifying function should not change too much between nearby points. [sent-137, score-0.286]
50 The second term is the fitting constraint, which means a good classifying function should not change too much from the initial label assignment. [sent-138, score-0.38]
51 Note that the fitting constraint contains labeled as well as unlabeled data. [sent-140, score-0.656]
52 As we have mentioned, the points involving pairwise relationships can be thought of as an undirected weighted graph, the weights of which represent the pairwise relationships. [sent-144, score-0.179]
53 The smoothness term essentially splits the function value at each point among the edges attached to it before computing the local changes, and the value assigned to each edge is proportional to its weight. [sent-147, score-0.12]
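For reference, a reconstruction of the cost function whose two terms are described above, with μ > 0 as the regularization parameter trading the fitting constraint off against the smoothness constraint:

```latex
\mathcal{Q}(F) \;=\; \frac{1}{2}\Biggl(
  \sum_{i,j=1}^{n} W_{ij}\,
  \Bigl\lVert \tfrac{1}{\sqrt{D_{ii}}}\,F_i - \tfrac{1}{\sqrt{D_{jj}}}\,F_j \Bigr\rVert^{2}
  \;+\; \mu \sum_{i=1}^{n} \bigl\lVert F_i - Y_i \bigr\rVert^{2}
\Biggr),
\qquad
F^{*} \;=\; \arg\min_{F \in \mathcal{F}} \mathcal{Q}(F).
```

Setting the derivative of Q with respect to F to zero gives (I − S)F∗ + μ(F∗ − Y) = 0, i.e. F∗ = (1 − α)(I − αS)^{−1}Y with α = 1/(1 + μ), which recovers the limit of the iteration above.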
54 Similarly we can develop the optimization frameworks for the variants F∗ = (I − αP)^{−1}Y and F∗ = (D − αW)^{−1}Y. [sent-153, score-0.103]
55 We also compare with Zhu et al.'s harmonic Gaussian field method coupled with the Class Mass Normalization (CMN) [15], which is closely related to ours. [sent-157, score-0.331]
56 To the best of our knowledge, there is no reliable approach for model selection if only very few labeled points are available. [sent-158, score-0.437]
57 Hence we let all algorithms use their respective optimal parameters, except that the parameter α used in our methods and its variants was simply fixed at 0.99. [sent-159, score-0.103]
58 The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown from (a) to (d). [sent-198, score-0.161]
59 Note that the initial label information is diffused along the moons. [sent-199, score-0.254]
60 Figure 3: The real-valued classifying function becomes flatter and flatter with respect to the two moons pattern with increasing t. [sent-200, score-0.455]
61 Figure 4: Smooth classification results given by supervised classifiers with the global consistency: (a) the classification result given by the SVM with a RBF kernel; (b) smoothing the result of the SVM using the consistency method. [sent-220, score-0.385]
62 4.1 Toy Problem In this experiment we considered the toy problem mentioned in Section 1 (Figure 1). [sent-222, score-0.169]
63 The affinity matrix is defined by a RBF kernel but the diagonal elements are set to zero. [sent-223, score-0.243]
64 The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown in Figure 2(a)-2(d). [sent-224, score-0.161]
65 Note that the initial label information is diffused along the moons. [sent-225, score-0.254]
66 The assumption of consistency essentially means that a good classifying function should change slowly on the coherent structure aggregated by a large amount of data. [sent-226, score-0.512]
67 This can be illustrated by this toy problem very clearly. [sent-227, score-0.169]
68 In Figure 3, we show that f(xi) becomes successively flatter with respect to the two moons pattern from Figure 3(a)-3(d) with increasing t. [sent-229, score-0.258]
69 Note that two clear moons emerge in Figure 3(d). [sent-230, score-0.281]
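As an illustrative reproduction of this toy experiment, a sketch using scikit-learn's LabelSpreading estimator, which implements a closely related symmetrically normalized spreading scheme; the sample size, noise level, and kernel width γ below are assumptions, not the exact settings used here:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two intertwining moons with exactly one labeled point per class.
X, y_true = make_moons(n_samples=200, noise=0.07, random_state=0)
y = np.full_like(y_true, -1)          # -1 marks unlabeled points
y[np.where(y_true == 0)[0][0]] = 0    # one labeled point from each moon
y[np.where(y_true == 1)[0][0]] = 1

model = LabelSpreading(kernel='rbf', gamma=20, alpha=0.99, max_iter=1000)
model.fit(X, y)
print('accuracy on unlabeled points:',
      (model.transduction_[y == -1] == y_true[y == -1]).mean())
```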
70 The basic idea of our method is to construct a smooth function. [sent-231, score-0.138]
71 It is natural to consider using this method to improve a supervised classifier by smoothing its classifying result. [sent-232, score-0.299]
72 In other words, we use the classifying result given by a supervised classifier as the input of our algorithm. [sent-233, score-0.26]
73 This conjecture is demonstrated by a toy problem in Figure 4. [sent-234, score-0.169]
74 Note that the points classified incorrectly by the SVM are successfully smoothed by the consistency method. [sent-238, score-0.324]
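A hypothetical sketch of this post-processing idea: the hard labels produced by an SVM serve as the initial label matrix Y, which is then smoothed by the closed-form consistency solution (the helper name and the σ value are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def smooth_svm_output(X, labeled_idx, y_labeled, sigma=0.5, alpha=0.99):
    # Train an SVM on the labeled points and take its predictions as the initial labeling.
    svm = SVC(kernel='rbf').fit(X[labeled_idx], y_labeled)
    y_svm = svm.predict(X)
    classes = np.unique(y_svm)
    Y = (y_svm[:, None] == classes[None, :]).astype(float)   # one-hot initial labels
    # Affinity with zero diagonal and symmetric normalization, as before.
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    # Closed-form smoothing F* = (I - alpha S)^{-1} Y, then relabel by arg max.
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
    return classes[F.argmax(axis=1)]
```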
75 4.2 Digit Recognition In this experiment, we addressed a classification task using the USPS handwritten 16x16 digits dataset. [sent-240, score-0.111]
76 The width of the RBF kernel for SVM was set to 5, and for the harmonic Gaussian field method it was set to 1.25. [sent-244, score-0.478]
77 In our method and its variants, the affinity matrix was constructed by the RBF kernel with the same width used as in the harmonic Gaussian method, but the diagonal elements were set to 0. [sent-246, score-0.625]
78 Samples were chosen so that they contain at least one labeled point for each class. [sent-248, score-0.375]
79 Our consistency method and one of its variants are clearly superior to the orthodox supervised learning algorithms k-NN and SVM, and also better than the harmonic Gaussian method. [sent-249, score-0.792]
80 This enables us to incorporate prior knowledge about digit image invariance in an elegant way, e.g., by using a jittered kernel to compute the affinity matrix [5]. [sent-251, score-0.108] [sent-253, score-0.096]
82 [Figure 5 legend: k-NN (k = 1); SVM (RBF kernel); harmonic Gaussian; consistency method; variant consistency (1); variant consistency (2). Horizontal axes: # labeled points, from 4 to 50.] [sent-256, sent-261, sent-271]
85 Figure 5: Left panel: the error rates of digit recognition with the USPS handwritten 16x16 digits dataset for a total of 3874 examples (a subset containing digits from 1 to 4). [sent-272, score-0.806]
86 Right panel: the error rates of text classification with 3970 document vectors in an 8014-dimensional space. [sent-273, score-0.093]
87 Samples are chosen so that they contain at least one labeled point for each class. [sent-274, score-0.375]
88 The distance between points xi and xj was defined to be d(xi, xj) = 1 − ⟨xi, xj⟩/(‖xi‖ ‖xj‖) [15]. [sent-285, score-0.494]
89 The width of the RBF kernel for SVM was set to 1.5, and for the harmonic Gaussian method it was set to 0.15. [sent-287, score-0.147] [sent-288, score-0.331]
91 In our methods, the affinity matrix was constructed by the RBF kernel with the same width used as in the harmonic Gaussian method, but the diagonal elements were set to 0. [sent-290, score-0.586]
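A sketch of how such a document affinity could be assembled from the cosine-based distance defined above; the width value is a placeholder, and whether the distance enters the exponent squared is an assumption here:

```python
import numpy as np

def text_affinity(X, sigma=0.15):
    # RBF affinity on the cosine distance d(x_i, x_j) = 1 - <x_i, x_j>/(||x_i|| ||x_j||).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos_sim = (X @ X.T) / (norms * norms.T)
    dist = 1.0 - cos_sim
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))   # assumes the squared distance is used
    np.fill_diagonal(W, 0.0)                    # zero diagonal, as in the other experiments
    return W
```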
92 Samples were chosen so that they contain at least one labeled point for each class. [sent-292, score-0.375]
93 It is interesting to note that the harmonic method is very good when the number of labeled points is 4, i.e., one labeled point for each class. [sent-293, score-0.768]
94 We think this is because there are almost equal proportions of different classes in the dataset, and so with four labeled points, the proportions happen to be estimated exactly. [sent-296, score-0.46]
95 The harmonic method becomes worse, however, if slightly more labeled points are used, for instance, 10 labeled points, which leads to a pretty poor estimation of the class proportions. [sent-297, score-1.135]
96 As the number of labeled points increases further, the harmonic method works well again and somewhat better than our method, since the proportions of classes are estimated successfully again. [sent-298, score-0.83]
97 However, our decision rule is much simpler, which in fact corresponds to the so-called naive threshold, the baseline of the harmonic method. [sent-299, score-0.292]
98 5 Conclusion The key to semi-supervised learning problems is the consistency assumption, which essentially requires a classifying function to be sufficiently smooth with respect to the intrinsic structure revealed by a huge amount of labeled and unlabeled points. [sent-300, score-1.347]
99 We proposed a simple algorithm to obtain such a solution, which demonstrated effective use of unlabeled data in experiments including toy data, digit recognition and text categorization. [sent-301, score-0.687]
100 [3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. [sent-320, score-0.708]
wordName wordTfidf (topN-words)
[('labeled', 0.336), ('unlabeled', 0.32), ('harmonic', 0.292), ('moons', 0.225), ('consistency', 0.223), ('classifying', 0.197), ('rbf', 0.191), ('toy', 0.169), ('svm', 0.157), ('label', 0.144), ('nity', 0.124), ('af', 0.124), ('classi', 0.115), ('digit', 0.108), ('atter', 0.107), ('sf', 0.105), ('variants', 0.103), ('points', 0.101), ('smooth', 0.099), ('xl', 0.099), ('kernel', 0.096), ('spreading', 0.093), ('iteration', 0.086), ('zhu', 0.083), ('diffusion', 0.074), ('digits', 0.071), ('diffused', 0.071), ('orthodox', 0.071), ('variant', 0.071), ('nn', 0.07), ('transductive', 0.067), ('xi', 0.067), ('supervised', 0.063), ('proportions', 0.062), ('regularization', 0.062), ('spectral', 0.062), ('fij', 0.062), ('panel', 0.061), ('assumption', 0.061), ('cation', 0.061), ('icml', 0.06), ('yij', 0.056), ('emerge', 0.056), ('diagonal', 0.056), ('revealed', 0.055), ('matrix', 0.055), ('intrinsic', 0.053), ('text', 0.053), ('graph', 0.052), ('lim', 0.052), ('width', 0.051), ('symmetrically', 0.049), ('maxj', 0.049), ('xj', 0.048), ('activation', 0.046), ('nearby', 0.046), ('web', 0.044), ('nips', 0.043), ('fi', 0.043), ('smoothness', 0.043), ('dataset', 0.042), ('convergence', 0.042), ('olivier', 0.041), ('clustering', 0.041), ('kernels', 0.04), ('chapelle', 0.04), ('handwritten', 0.04), ('document', 0.04), ('neighbors', 0.04), ('point', 0.039), ('initial', 0.039), ('documents', 0.039), ('usps', 0.039), ('method', 0.039), ('pairwise', 0.039), ('sch', 0.039), ('local', 0.038), ('arg', 0.038), ('weston', 0.038), ('recognition', 0.037), ('labeling', 0.037), ('elements', 0.036), ('cluster', 0.035), ('yi', 0.035), ('gaussian', 0.034), ('spread', 0.034), ('increasing', 0.033), ('learning', 0.033), ('worth', 0.032), ('germany', 0.032), ('amount', 0.031), ('ideal', 0.031), ('wij', 0.031), ('mass', 0.031), ('labels', 0.031), ('harvard', 0.031), ('constructive', 0.031), ('pretty', 0.031), ('djj', 0.031), ('autos', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
2 0.18670644 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. 1
3 0.18310721 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.15606248 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
5 0.13962758 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
6 0.1341922 172 nips-2003-Semi-Supervised Learning with Trees
7 0.1280264 1 nips-2003-1-norm Support Vector Machines
8 0.11341741 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
9 0.10788281 112 nips-2003-Learning to Find Pre-Images
10 0.10673018 107 nips-2003-Learning Spectral Clustering
11 0.10097387 124 nips-2003-Max-Margin Markov Networks
12 0.098259553 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
13 0.09779761 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
14 0.093844123 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
15 0.090446264 53 nips-2003-Discriminating Deformable Shape Classes
16 0.089935504 48 nips-2003-Convex Methods for Transduction
17 0.088589728 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
18 0.087641239 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
19 0.087330945 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
20 0.086876482 121 nips-2003-Log-Linear Models for Label Ranking
topicId topicWeight
[(0, -0.275), (1, -0.178), (2, -0.076), (3, -0.077), (4, -0.01), (5, 0.059), (6, -0.178), (7, -0.033), (8, 0.033), (9, 0.022), (10, 0.181), (11, 0.091), (12, 0.003), (13, -0.043), (14, 0.102), (15, -0.019), (16, 0.003), (17, 0.02), (18, -0.07), (19, -0.106), (20, 0.012), (21, 0.06), (22, 0.059), (23, 0.03), (24, -0.018), (25, -0.01), (26, 0.049), (27, 0.155), (28, 0.05), (29, -0.06), (30, -0.055), (31, -0.082), (32, -0.011), (33, 0.102), (34, 0.12), (35, 0.114), (36, 0.066), (37, 0.027), (38, -0.1), (39, -0.04), (40, -0.086), (41, 0.123), (42, -0.057), (43, 0.043), (44, 0.058), (45, -0.02), (46, -0.109), (47, 0.072), (48, 0.019), (49, -0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.9608565 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
2 0.7600922 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. 1
3 0.6788559 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.63710022 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
5 0.56684995 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.51842237 112 nips-2003-Learning to Find Pre-Images
7 0.51238155 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
8 0.50221515 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
9 0.49872416 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
10 0.48250824 164 nips-2003-Ranking on Data Manifolds
11 0.47323793 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
12 0.45973656 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
13 0.45876354 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
14 0.45756194 53 nips-2003-Discriminating Deformable Shape Classes
15 0.45515791 48 nips-2003-Convex Methods for Transduction
16 0.45026487 124 nips-2003-Max-Margin Markov Networks
17 0.44943726 120 nips-2003-Locality Preserving Projections
18 0.44015834 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
19 0.43681538 107 nips-2003-Learning Spectral Clustering
20 0.4334287 1 nips-2003-1-norm Support Vector Machines
topicId topicWeight
[(0, 0.079), (11, 0.044), (16, 0.013), (30, 0.02), (35, 0.076), (50, 0.219), (53, 0.104), (69, 0.04), (71, 0.089), (76, 0.053), (85, 0.071), (91, 0.111)]
simIndex simValue paperId paperTitle
1 0.90044653 56 nips-2003-Dopamine Modulation in a Basal Ganglio-Cortical Network of Working Memory
Author: Aaron J. Gruber, Peter Dayan, Boris S. Gutkin, Sara A. Solla
Abstract: Dopamine exerts two classes of effect on the sustained neural activity in prefrontal cortex that underlies working memory. Direct release in the cortex increases the contrast of prefrontal neurons, enhancing the robustness of storage. Release of dopamine in the striatum is associated with salient stimuli and makes medium spiny neurons bistable; this modulation of the output of spiny neurons affects prefrontal cortex so as to indirectly gate access to working memory and additionally damp sensitivity to noise. Existing models have treated dopamine in one or other structure, or have addressed basal ganglia gating of working memory exclusive of dopamine effects. In this paper we combine these mechanisms and explore their joint effect. We model a memory-guided saccade task to illustrate how dopamine’s actions lead to working memory that is selective for salient input and has increased robustness to distraction. 1
same-paper 2 0.82748103 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
3 0.71111858 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
4 0.68244195 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
5 0.68243957 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.68136555 81 nips-2003-Geometric Analysis of Constrained Curves
7 0.67902702 143 nips-2003-On the Dynamics of Boosting
8 0.67707789 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
9 0.67485428 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
10 0.67333448 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.67326516 189 nips-2003-Tree-structured Approximations by Expectation Propagation
12 0.6728344 107 nips-2003-Learning Spectral Clustering
13 0.6723879 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
14 0.67202115 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
15 0.6714437 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
16 0.67104584 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.67072713 172 nips-2003-Semi-Supervised Learning with Trees
18 0.66968501 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
19 0.66921902 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.66677445 41 nips-2003-Boosting versus Covering