nips nips2003 nips2003-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. [sent-4, score-0.789]
2 A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. [sent-5, score-1.093]
3 We present a simple algorithm to obtain such a smooth solution. [sent-6, score-0.099]
4 Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. [sent-7, score-0.359]
5 1 Introduction We consider the general problem of learning from labeled and unlabeled data. [sent-8, score-0.689]
6 Given a point set X = {x1, . . . , xl, xl+1, . . . , xn} and a label set L = {1, . . . , c}, the first l points have labels {y1, . . . , yl} ∈ L and the remaining points are unlabeled. [sent-21, score-0.101]
7 The goal is to predict the labels of the unlabeled points. [sent-22, score-0.351]
8 The performance of an algorithm is measured by the error rate on these unlabeled points only. [sent-23, score-0.421]
9 Since labeling often requires expensive human labor, whereas unlabeled data is far easier to obtain, semi-supervised learning is very useful in many real-world problems and has recently attracted a considerable amount of research [10]. [sent-25, score-0.421]
10 A typical application is web categorization, in which manually classified web pages are always a very small part of the entire web, and the number of unlabeled examples is large. [sent-26, score-0.408]
11 The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label. [sent-27, score-0.377]
12 This argument is akin to that in [2, 3, 4, 10, 15] and often called the cluster assumption [4, 10]. [sent-28, score-0.096]
13 Orthodox supervised learning algorithms, such as k-NN, in general depend only on the first assumption of local consistency. [sent-30, score-0.195]
14 To illustrate the prior assumption of consistency underlying semi-supervised learning, let us consider a toy dataset generated according to a pattern of two intertwining moons in Figure 1(a). [sent-31, score-0.72]
15 Every point should be similar to points in its local neighborhood, and furthermore, points in one moon should be more similar to each other than to points in the other moon. [sent-32, score-0.38]
16 The classification results given by the Support Vector Machine (SVM) with a RBF kernel and by k-NN are shown in Figure 1(b) and 1(c), respectively. [sent-33, score-0.096]
17 [Figure 1 legend: unlabeled point; labeled point −1; labeled point +1.] [sent-34, score-1.109]
18 Figure 1: Classification on the two moons pattern. [sent-69, score-0.225]
19 (a) toy data set with two labeled points; (b) classifying result given by the SVM with a RBF kernel; (c) k-NN with k = 1; (d) ideal classification that we hope to obtain. [sent-70, score-0.733]
20 According to the assumption of consistency, however, the two moons should be classified as shown in Figure 1(d). [sent-72, score-0.286]
21 The main differences between the various semi-supervised learning algorithms, such as spectral methods [2, 4, 6], random walks [13, 15], graph mincuts [3] and transductive SVM [14], lie in their way of realizing the assumption of consistency. [sent-73, score-0.275]
22 A principled approach to formalize the assumption is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure revealed by known labeled and unlabeled points. [sent-74, score-1.121]
23 Here we propose a simple iteration algorithm to construct such a smooth function inspired by the work on spreading activation networks [1, 11] and diffusion kernels [7, 8, 12], recent work on semi-supervised learning and clustering [2, 4, 9], and more specifically by the work of Zhu et al. [sent-75, score-0.512]
24 The keynote of our method is to let every point iteratively spread its label information to its neighbors until a global stable state is achieved. [sent-77, score-0.296]
25 Given a point set X = {x1, . . . , xl, xl+1, . . . , xn} ⊂ R^m and a label set L = {1, . . . , c}, the first l points xi (i ≤ l) are labeled as yi ∈ L and the remaining points xu (l+1 ≤ u ≤ n) are unlabeled. [sent-88, score-0.64]
26 The goal is to predict the label of the unlabeled points. [sent-89, score-0.464]
27 Let F denote the set of n × c matrices with nonnegative entries; a matrix F = [F1^T, . . . , Fn^T]^T ∈ F corresponds to a classification on the dataset X by labeling each point xi as a label yi = arg maxj≤c Fij. [sent-94, score-0.402]
28 We can understand F as a vectorial function F : X → Rc which assigns a vector Fi to each point xi . [sent-95, score-0.106]
29 Define an n × c matrix Y ∈ F with Yij = 1 if xi is labeled as yi = j and Yij = 0 otherwise. [sent-96, score-0.493]
30 Form the affinity matrix W defined by Wij = exp(−‖xi − xj‖^2 / 2σ^2) if i ≠ j and Wii = 0. [sent-99, score-0.17]
31 Construct the matrix S = D^{-1/2} W D^{-1/2} in which D is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of W. [sent-101, score-0.166]
32 Label each point xi as a label yi = arg maxj≤c F∗ij. [sent-106, score-0.323]
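As a minimal NumPy sketch of the four steps just described; the function name, the σ value, the tolerance, and the iteration cap below are illustrative assumptions rather than settings taken from the experiments:

```python
import numpy as np

def consistency_method(X, y, sigma=0.5, alpha=0.99, tol=1e-6, max_iter=1000):
    # X: (n, d) array of points; y: length-n integer array with -1 marking unlabeled points.
    n = X.shape[0]
    classes = np.unique(y[y >= 0])
    # Step 1: affinity matrix with zero diagonal, W_ij = exp(-||x_i - x_j||^2 / 2 sigma^2).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Initial label matrix Y: Y_ij = 1 if x_i is labeled as class j, 0 otherwise.
    Y = np.zeros((n, classes.size))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    # Step 3: iterate F(t+1) = alpha S F(t) + (1 - alpha) Y until convergence.
    F = Y.copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1.0 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            F = F_next
            break
        F = F_next
    # Step 4: label each point by the class from which it received most information.
    return classes[F.argmax(axis=1)]
```

For large datasets a sparse affinity matrix would replace the dense one used in this sketch.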
33 This algorithm can be understood intuitively in terms of spreading activation networks [1, 11] from experimental psychology. [sent-107, score-0.139]
34 We first define a pairwise relationship W on the dataset X with the diagonal elements being zero. [sent-108, score-0.173]
35 In the second step, the weight matrix W of G is normalized symmetrically, which is necessary for the convergence of the following iteration. [sent-110, score-0.097]
36 The first two steps are exactly the same as in spectral clustering [9]. [sent-111, score-0.103]
37 During each iteration of the third step each point receives the information from its neighbors (first term), and also retains its initial information (second term). [sent-112, score-0.204]
38 The parameter α specifies the relative amount of the information from its neighbors and its initial label information. [sent-113, score-0.254]
39 It is worth mentioning that self-reinforcement is avoided since the diagonal elements of the affinity matrix are set to zero in the first step. [sent-114, score-0.179]
40 Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process. [sent-116, score-0.589]
41 Since 0 < α < 1 and the eigenvalues of S lie in [−1, 1] (note that S is similar to the stochastic matrix P = D^{-1}W = D^{-1/2} S D^{1/2}), lim t→∞ (αS)^{t−1} = 0 and lim t→∞ Σ_{i=0}^{t−1} (αS)^i = (I − αS)^{−1}. [sent-120, score-0.159]
42 This also shows that the iteration result does not depend on the initial value for the iteration. [sent-123, score-0.125]
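A reconstruction of the expansion behind these two sentences, written for the convention F(0) = Y (the limit is the same for any bounded initial value):

```latex
F(t) \;=\; (\alpha S)^{t} Y \;+\; (1-\alpha)\sum_{i=0}^{t-1} (\alpha S)^{i} Y
\quad\Longrightarrow\quad
F^{*} \;=\; \lim_{t\to\infty} F(t) \;=\; (1-\alpha)\,(I-\alpha S)^{-1} Y .
```

Since the scalar factor (1 − α) does not change the arg max over classes, the closed form is often quoted simply as F∗ = (I − αS)^{−1}Y.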
43 In addition, it is worth noticing that (I − αS)^{−1} is in fact a graph or diffusion kernel [7, 12]. [sent-124, score-0.254]
44 Now we discuss some possible variants of this method. [sent-125, score-0.103]
45 Then the classifying function is F∗ = (I − αP^T)^{−1}Y. [sent-131, score-0.197]
46 We will compare these variants with the original algorithm in the experiments. [sent-133, score-0.103]
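A small sketch of how the closed form and the variants compared in the experiments could be computed directly; the function and variable names are assumptions, and for large n an iterative or sparse solver would be preferable to the dense solves shown here:

```python
import numpy as np

def closed_form_solutions(W, Y, alpha=0.99):
    # W: (n, n) affinity matrix with zero diagonal; Y: (n, c) initial label matrix.
    d = W.sum(axis=1)
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt                     # symmetric normalization
    P = D_inv @ W                                       # row-stochastic transition matrix
    I = np.eye(W.shape[0])
    F_main = np.linalg.solve(I - alpha * S, Y)          # F* = (I - alpha S)^{-1} Y
    F_var1 = np.linalg.solve(I - alpha * P.T, Y)        # F* = (I - alpha P^T)^{-1} Y
    F_var2 = np.linalg.solve(np.diag(d) - alpha * W, Y) # F* = (D - alpha W)^{-1} Y
    return F_main, F_var1, F_var2
```

Constant scalings of these solutions do not affect the final arg-max labeling.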
47 3 Regularization Framework Here we develop a regularization framework for the above iteration algorithm. [sent-134, score-0.148]
48 Then the classifying function is F∗ = arg min_{F∈F} Q(F). (5) [sent-136, score-0.235]
49 The first term of the right-hand side in the cost function is the smoothness constraint, which means that a good classifying function should not change too much between nearby points. [sent-137, score-0.286]
50 The second term is the fitting constraint, which means a good classifying function should not change too much from the initial label assignment. [sent-138, score-0.38]
51 Note that the fitting constraint contains labeled as well as unlabeled data. [sent-140, score-0.656]
52 As we have mentioned, the points involving pairwise relationships can be thought of as an undirected weighted graph, the weights of which represent the pairwise relationships. [sent-144, score-0.179]
53 The smoothness term essentially splits the function value at each point among the edges attached to it before computing the local changes, and the value assigned to each edge is proportional to its weight. [sent-147, score-0.12]
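For reference, a reconstruction of the cost function whose two terms are described above, with μ > 0 as the regularization parameter trading the fitting constraint off against the smoothness constraint:

```latex
\mathcal{Q}(F) \;=\; \frac{1}{2}\Biggl(
  \sum_{i,j=1}^{n} W_{ij}\,
  \Bigl\lVert \tfrac{1}{\sqrt{D_{ii}}}\,F_i - \tfrac{1}{\sqrt{D_{jj}}}\,F_j \Bigr\rVert^{2}
  \;+\; \mu \sum_{i=1}^{n} \bigl\lVert F_i - Y_i \bigr\rVert^{2}
\Biggr),
\qquad
F^{*} \;=\; \arg\min_{F \in \mathcal{F}} \mathcal{Q}(F).
```

Setting the derivative of Q with respect to F to zero gives (I − S)F∗ + μ(F∗ − Y) = 0, i.e. F∗ = (1 − α)(I − αS)^{−1}Y with α = 1/(1 + μ), which recovers the limit of the iteration above.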
54 Similarly we can develop the optimization frameworks for the variants F∗ = (I − αP)^{−1}Y and F∗ = (D − αW)^{−1}Y. [sent-153, score-0.103]
55 We also compare with Zhu et al.'s harmonic Gaussian field method coupled with the Class Mass Normalization (CMN) [15], which is closely related to ours. [sent-157, score-0.331]
56 To the best of our knowledge, there is no reliable approach for model selection if only very few labeled points are available. [sent-158, score-0.437]
57 Hence we let all algorithms use their respective optimal parameters, except that the parameter α used in our methods and its variants was simply fixed at 0.99. [sent-159, score-0.103]
58 The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown from (a) to (d). [sent-198, score-0.161]
59 Note that the initial label information is diffused along the moons. [sent-199, score-0.254]
60 Figure 3: The real-valued classifying function becomes flatter and flatter with respect to the two moons pattern with increasing t. [sent-200, score-0.455]
61 Figure 4: Smooth classification results given by supervised classifiers with the global consistency: (a) the classification result given by the SVM with a RBF kernel; (b) smoothing the result of the SVM using the consistency method. [sent-220, score-0.385]
62 4.1 Toy Problem In this experiment we considered the toy problem mentioned in Section 1 (Figure 1). [sent-222, score-0.169]
63 The affinity matrix is defined by a RBF kernel but the diagonal elements are set to zero. [sent-223, score-0.243]
64 The convergence process of our iteration algorithm with t increasing from 1 to 400 is shown in Figure 2(a)-2(d). [sent-224, score-0.161]
65 Note that the initial label information is diffused along the moons. [sent-225, score-0.254]
66 The assumption of consistency essentially means that a good classifying function should change slowly on the coherent structure aggregated by a large amount of data. [sent-226, score-0.512]
67 This can be illustrated by this toy problem very clearly. [sent-227, score-0.169]
68 In Figure 3, we show that f(xi) becomes successively flatter with respect to the two moons pattern from Figure 3(a)-3(d) with increasing t. [sent-229, score-0.258]
69 Note that two clear moons emerge in Figure 3(d). [sent-230, score-0.281]
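As an illustrative reproduction of this toy experiment, a sketch using scikit-learn's LabelSpreading estimator, which implements a closely related symmetrically normalized spreading scheme; the sample size, noise level, and kernel width γ below are assumptions, not the exact settings used here:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two intertwining moons with exactly one labeled point per class.
X, y_true = make_moons(n_samples=200, noise=0.07, random_state=0)
y = np.full_like(y_true, -1)          # -1 marks unlabeled points
y[np.where(y_true == 0)[0][0]] = 0    # one labeled point from each moon
y[np.where(y_true == 1)[0][0]] = 1

model = LabelSpreading(kernel='rbf', gamma=20, alpha=0.99, max_iter=1000)
model.fit(X, y)
print('accuracy on unlabeled points:',
      (model.transduction_[y == -1] == y_true[y == -1]).mean())
```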
70 The basic idea of our method is to construct a smooth function. [sent-231, score-0.138]
71 It is natural to consider using this method to improve a supervised classifier by smoothing its classifying result. [sent-232, score-0.299]
72 In other words, we use the classifying result given by a supervised classifier as the input of our algorithm. [sent-233, score-0.26]
73 This conjecture is demonstrated by a toy problem in Figure 4. [sent-234, score-0.169]
74 Note that the points classified incorrectly by the SVM are successfully smoothed by the consistency method. [sent-238, score-0.324]
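A hypothetical sketch of this post-processing idea: the hard labels produced by an SVM serve as the initial label matrix Y, which is then smoothed by the closed-form consistency solution (the helper name and the σ value are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def smooth_svm_output(X, labeled_idx, y_labeled, sigma=0.5, alpha=0.99):
    # Train an SVM on the labeled points and take its predictions as the initial labeling.
    svm = SVC(kernel='rbf').fit(X[labeled_idx], y_labeled)
    y_svm = svm.predict(X)
    classes = np.unique(y_svm)
    Y = (y_svm[:, None] == classes[None, :]).astype(float)   # one-hot initial labels
    # Affinity with zero diagonal and symmetric normalization, as before.
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    # Closed-form smoothing F* = (I - alpha S)^{-1} Y, then relabel by arg max.
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
    return classes[F.argmax(axis=1)]
```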
75 4.2 Digit Recognition In this experiment, we addressed a classification task using the USPS handwritten 16x16 digits dataset. [sent-240, score-0.111]
76 The width of the RBF kernel for SVM was set to 5, and for the harmonic Gaussian field method it was set to 1.25. [sent-244, score-0.478]
77 In our method and its variants, the affinity matrix was constructed by the RBF kernel with the same width used as in the harmonic Gaussian method, but the diagonal elements were set to 0. [sent-246, score-0.625]
78 Samples were chosen so that they contain at least one labeled point for each class. [sent-248, score-0.375]
79 Our consistency method and one of its variants are clearly superior to the orthodox supervised learning algorithms k-NN and SVM, and also better than the harmonic Gaussian method. [sent-249, score-0.792]
80 This enables us to incorporate prior knowledge about digit image invariance in an elegant way, e.g., by using a jittered kernel to compute the affinity matrix [5]. [sent-251, score-0.108] [sent-253, score-0.096]
82 [Figure 5 legend: k-NN (k = 1); SVM (RBF kernel); harmonic Gaussian; consistency method; variant consistency (1); variant consistency (2). Horizontal axes: # labeled points, from 4 to 50.] [sent-256, sent-261, sent-271]
85 Figure 5: Left panel: the error rates of digit recognition with the USPS handwritten 16x16 digits dataset for a total of 3874 examples (a subset containing digits from 1 to 4). [sent-272, score-0.806]
86 Right panel: the error rates of text classification with 3970 document vectors in an 8014-dimensional space. [sent-273, score-0.093]
87 Samples are chosen so that they contain at least one labeled point for each class. [sent-274, score-0.375]
88 The distance between points xi and xj was defined to be d(xi, xj) = 1 − ⟨xi, xj⟩/(‖xi‖ ‖xj‖) [15]. [sent-285, score-0.494]
89 The width of the RBF kernel for SVM was set to 1.5, and for the harmonic Gaussian method it was set to 0.15. [sent-287, score-0.147] [sent-288, score-0.331]
91 In our methods, the affinity matrix was constructed by the RBF kernel with the same width used as in the harmonic Gaussian method, but the diagonal elements were set to 0. [sent-290, score-0.586]
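A sketch of how such a document affinity could be assembled from the cosine-based distance defined above; the width value is a placeholder, and whether the distance enters the exponent squared is an assumption here:

```python
import numpy as np

def text_affinity(X, sigma=0.15):
    # RBF affinity on the cosine distance d(x_i, x_j) = 1 - <x_i, x_j>/(||x_i|| ||x_j||).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos_sim = (X @ X.T) / (norms * norms.T)
    dist = 1.0 - cos_sim
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))   # assumes the squared distance is used
    np.fill_diagonal(W, 0.0)                    # zero diagonal, as in the other experiments
    return W
```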
92 Samples were chosen so that they contain at least one labeled point for each class. [sent-292, score-0.375]
93 It is interesting to note that the harmonic method is very good when the number of labeled points is 4, i.e., one labeled point for each class. [sent-293, score-0.768]
94 We think this is because there are almost equal proportions of different classes in the dataset, and so with four labeled points, the proportions happen to be estimated exactly. [sent-296, score-0.46]
95 The harmonic method becomes worse, however, if slightly more labeled points are used, for instance, 10 labeled points, which leads to a pretty poor estimation of the class proportions. [sent-297, score-1.135]
96 As the number of labeled points increases further, the harmonic method works well again and somewhat better than our method, since the proportions of classes are estimated successfully again. [sent-298, score-0.83]
97 However, our decision rule is much simpler, which in fact corresponds to the so-called naive threshold, the baseline of the harmonic method. [sent-299, score-0.292]
98 5 Conclusion The key to semi-supervised learning problems is the consistency assumption, which essentially requires a classifying function to be sufficiently smooth with respect to the intrinsic structure revealed by a huge amount of labeled and unlabeled points. [sent-300, score-1.347]
99 We proposed a simple algorithm to obtain such a solution, which demonstrated effective use of unlabeled data in experiments including toy data, digit recognition and text categorization. [sent-301, score-0.687]
100 [3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. [sent-320, score-0.708]
wordName wordTfidf (topN-words)
[('labeled', 0.336), ('unlabeled', 0.32), ('harmonic', 0.292), ('moons', 0.225), ('consistency', 0.223), ('classifying', 0.197), ('rbf', 0.191), ('toy', 0.169), ('svm', 0.157), ('label', 0.144), ('nity', 0.124), ('af', 0.124), ('classi', 0.115), ('digit', 0.108), ('atter', 0.107), ('sf', 0.105), ('variants', 0.103), ('points', 0.101), ('smooth', 0.099), ('xl', 0.099), ('kernel', 0.096), ('spreading', 0.093), ('iteration', 0.086), ('zhu', 0.083), ('diffusion', 0.074), ('digits', 0.071), ('diffused', 0.071), ('orthodox', 0.071), ('variant', 0.071), ('nn', 0.07), ('transductive', 0.067), ('xi', 0.067), ('supervised', 0.063), ('proportions', 0.062), ('regularization', 0.062), ('spectral', 0.062), ('fij', 0.062), ('panel', 0.061), ('assumption', 0.061), ('cation', 0.061), ('icml', 0.06), ('yij', 0.056), ('emerge', 0.056), ('diagonal', 0.056), ('revealed', 0.055), ('matrix', 0.055), ('intrinsic', 0.053), ('text', 0.053), ('graph', 0.052), ('lim', 0.052), ('width', 0.051), ('symmetrically', 0.049), ('maxj', 0.049), ('xj', 0.048), ('activation', 0.046), ('nearby', 0.046), ('web', 0.044), ('nips', 0.043), ('fi', 0.043), ('smoothness', 0.043), ('dataset', 0.042), ('convergence', 0.042), ('olivier', 0.041), ('clustering', 0.041), ('kernels', 0.04), ('chapelle', 0.04), ('handwritten', 0.04), ('document', 0.04), ('neighbors', 0.04), ('point', 0.039), ('initial', 0.039), ('documents', 0.039), ('usps', 0.039), ('method', 0.039), ('pairwise', 0.039), ('sch', 0.039), ('local', 0.038), ('arg', 0.038), ('weston', 0.038), ('recognition', 0.037), ('labeling', 0.037), ('elements', 0.036), ('cluster', 0.035), ('yi', 0.035), ('gaussian', 0.034), ('spread', 0.034), ('increasing', 0.033), ('learning', 0.033), ('worth', 0.032), ('germany', 0.032), ('amount', 0.031), ('ideal', 0.031), ('wij', 0.031), ('mass', 0.031), ('labels', 0.031), ('harvard', 0.031), ('constructive', 0.031), ('pretty', 0.031), ('djj', 0.031), ('autos', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
2 0.18670644 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. 1
3 0.18310721 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.15606248 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
5 0.13962758 152 nips-2003-Pairwise Clustering and Graphical Models
Author: Noam Shental, Assaf Zomet, Tomer Hertz, Yair Weiss
Abstract: Significant progress in clustering has been achieved by algorithms that are based on pairwise affinities between the datapoints. In particular, spectral clustering methods have the advantage of being able to divide arbitrarily shaped clusters and are based on efficient eigenvector calculations. However, spectral methods lack a straightforward probabilistic interpretation which makes it difficult to automatically set parameters using training data. In this paper we use the previously proposed typical cut framework for pairwise clustering. We show an equivalence between calculating the typical cut and inference in an undirected graphical model. We show that for clustering problems with hundreds of datapoints exact inference may still be possible. For more complicated datasets, we show that loopy belief propagation (BP) and generalized belief propagation (GBP) can give excellent results on challenging clustering problems. We also use graphical models to derive a learning algorithm for affinity matrices based on labeled data. 1
6 0.1341922 172 nips-2003-Semi-Supervised Learning with Trees
7 0.1280264 1 nips-2003-1-norm Support Vector Machines
8 0.11341741 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
9 0.10788281 112 nips-2003-Learning to Find Pre-Images
10 0.10673018 107 nips-2003-Learning Spectral Clustering
11 0.10097387 124 nips-2003-Max-Margin Markov Networks
12 0.098259553 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
13 0.09779761 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
14 0.093844123 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
15 0.090446264 53 nips-2003-Discriminating Deformable Shape Classes
16 0.089935504 48 nips-2003-Convex Methods for Transduction
17 0.088589728 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
18 0.087641239 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
19 0.087330945 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
20 0.086876482 121 nips-2003-Log-Linear Models for Label Ranking
topicId topicWeight
[(0, -0.275), (1, -0.178), (2, -0.076), (3, -0.077), (4, -0.01), (5, 0.059), (6, -0.178), (7, -0.033), (8, 0.033), (9, 0.022), (10, 0.181), (11, 0.091), (12, 0.003), (13, -0.043), (14, 0.102), (15, -0.019), (16, 0.003), (17, 0.02), (18, -0.07), (19, -0.106), (20, 0.012), (21, 0.06), (22, 0.059), (23, 0.03), (24, -0.018), (25, -0.01), (26, 0.049), (27, 0.155), (28, 0.05), (29, -0.06), (30, -0.055), (31, -0.082), (32, -0.011), (33, 0.102), (34, 0.12), (35, 0.114), (36, 0.066), (37, 0.027), (38, -0.1), (39, -0.04), (40, -0.086), (41, 0.123), (42, -0.057), (43, 0.043), (44, 0.058), (45, -0.02), (46, -0.109), (47, 0.072), (48, 0.019), (49, -0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.9608565 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
2 0.7600922 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
Author: Jason Weston, Dengyong Zhou, André Elisseeff, William S. Noble, Christina S. Leslie
Abstract: A key issue in supervised protein classification is the representation of input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data — examples with known 3D structures, organized into structural classes — while in practice, unlabeled data is far more plentiful. In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. 1
3 0.6788559 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.63710022 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
5 0.56684995 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
6 0.51842237 112 nips-2003-Learning to Find Pre-Images
7 0.51238155 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
8 0.50221515 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
9 0.49872416 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
10 0.48250824 164 nips-2003-Ranking on Data Manifolds
11 0.47323793 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
12 0.45973656 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
13 0.45876354 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
14 0.45756194 53 nips-2003-Discriminating Deformable Shape Classes
15 0.45515791 48 nips-2003-Convex Methods for Transduction
16 0.45026487 124 nips-2003-Max-Margin Markov Networks
17 0.44943726 120 nips-2003-Locality Preserving Projections
18 0.44015834 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
19 0.43681538 107 nips-2003-Learning Spectral Clustering
20 0.4334287 1 nips-2003-1-norm Support Vector Machines
topicId topicWeight
[(0, 0.079), (11, 0.044), (16, 0.013), (30, 0.02), (35, 0.076), (50, 0.219), (53, 0.104), (69, 0.04), (71, 0.089), (76, 0.053), (85, 0.071), (91, 0.111)]
simIndex simValue paperId paperTitle
1 0.90044653 56 nips-2003-Dopamine Modulation in a Basal Ganglio-Cortical Network of Working Memory
Author: Aaron J. Gruber, Peter Dayan, Boris S. Gutkin, Sara A. Solla
Abstract: Dopamine exerts two classes of effect on the sustained neural activity in prefrontal cortex that underlies working memory. Direct release in the cortex increases the contrast of prefrontal neurons, enhancing the robustness of storage. Release of dopamine in the striatum is associated with salient stimuli and makes medium spiny neurons bistable; this modulation of the output of spiny neurons affects prefrontal cortex so as to indirectly gate access to working memory and additionally damp sensitivity to noise. Existing models have treated dopamine in one or other structure, or have addressed basal ganglia gating of working memory exclusive of dopamine effects. In this paper we combine these mechanisms and explore their joint effect. We model a memory-guided saccade task to illustrate how dopamine’s actions lead to working memory that is selective for salient input and has increased robustness to distraction. 1
same-paper 2 0.82748103 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
3 0.71111858 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
4 0.68244195 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
5 0.68243957 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.68136555 81 nips-2003-Geometric Analysis of Constrained Curves
7 0.67902702 143 nips-2003-On the Dynamics of Boosting
8 0.67707789 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
9 0.67485428 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
10 0.67333448 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.67326516 189 nips-2003-Tree-structured Approximations by Expectation Propagation
12 0.6728344 107 nips-2003-Learning Spectral Clustering
13 0.6723879 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
14 0.67202115 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
15 0.6714437 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
16 0.67104584 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.67072713 172 nips-2003-Semi-Supervised Learning with Trees
18 0.66968501 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
19 0.66921902 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.66677445 41 nips-2003-Boosting versus Covering