nips nips2003 nips2003-164 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. [sent-4, score-1.1]
2 Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. [sent-5, score-0.87]
3 The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. [sent-6, score-0.309]
4 1 Introduction The Google search engine [2] accomplishes web page ranking using PageRank algorithm, which exploits the global, rather than local, hyperlink structure of the web [1]. [sent-8, score-1.066]
5 Intuitively, it can be thought of as modelling the behavior of a random surfer on the graph of the web, who simply keeps clicking on successive links at random and also periodically jumps to a random page. [sent-9, score-0.127]
6 The web pages are ranked according to the stationary distribution of the random walk. [sent-10, score-0.255]
7 Empirical results show PageRank is superior to the naive ranking method, in which the web pages are simply ranked according to the sum of inbound hyperlinks, and accordingly only the local structure of the web is exploited. [sent-11, score-1.106]
8 Our interest here is in the situation where the objects to be ranked are represented as vectors in Euclidean space, such as text or image data. [sent-12, score-0.151]
9 Our goal is to rank the data with respect to the intrinsic global manifold structure [6, 7] collectively revealed by a huge amount of data. [sent-13, score-0.351]
10 We believe for many real world data types this should be superior to a local method, which rank data simply by pairwise Euclidean distances or inner products. [sent-14, score-0.163]
11 Let us consider a toy problem to explain our motivation. [sent-15, score-0.057]
12 We are given a set of points constructed in two moons pattern (Figure 1(a)). [sent-16, score-0.115]
13 A query is given in the upper moon, and the task is to rank the remaining points according to their relevances to the query. [sent-17, score-0.484]
14 Intuitively, the relevant degrees of points in the upper moon to the query should decrease along the moon shape. [sent-18, score-0.624]
15 This should also happen for the points in the lower moon. [sent-19, score-0.058]
16 Furthermore, all of the points in the upper moon should be more relevant to the query than the points in the lower moon. [sent-20, score-0.541]
17 If we rank the points with respect to the query simply by Euclidean distance, then the left-most points in the lower moon will be more relevant to the query than the right-most points in the upper moon (Figure 1(b)). [sent-21, score-1.145]
18 We propose a simple universal ranking algorithm, which can exploit the intrinsic manifold (a) Two moons ranking problem (b) Ranking by Euclidean distance (c) Ideal ranking query Figure 1: Ranking on the two moons pattern. [sent-23, score-3.008]
19 The marker sizes are proportional to the ranking in the last two figures. [sent-24, score-0.811]
20 (a) toy data set with a single query; (b) ranking by the Euclidean distances; (c) ideal ranking result we hope to obtain. [sent-25, score-1.663]
21 In fact the ranking problem can be viewed as an extreme case of semi-supervised learning, in which only positive labeled points are available. [sent-28, score-0.871]
22 We first form a weighted network on the data, and assign a positive ranking score to each query and zero to the remaining points which are ranked with respect to the queries. [sent-30, score-1.352]
23 All points then spread their ranking score to their nearby neighbors via the weighted network. [sent-31, score-1.003]
24 The spread process is repeated until a global stable state is achieved, and all points except queries are ranked according to their final ranking scores. [sent-32, score-1.234]
25 Section 4 further introduces a variant of PageRank, which can rank the data with respect to the specific queries. [sent-36, score-0.104]
26 Finally, Section 5 presents experimental results on toy data, on digit image, and on text documents, and Section 6 concludes this paper. [sent-37, score-0.231]
27 , xn } ⊂ Rm , the first q points are the queries and the rest are the points that we want to rank according to their relevances to the queries. [sent-44, score-0.448]
28 Let d : X × X −→ R denote a metric on X , such as Euclidean distance, which assigns each pair of points xi and xi a distance d(xi , xj ). [sent-45, score-0.191]
29 Let f : X −→ R denote a ranking function which assigns to each point xi a ranking value fi . [sent-46, score-1.663]
30 If we have prior knowledge about the confidences of queries, then we can assign different ranking scores to the queries proportional to their respective confidences. [sent-53, score-1.111]
31 Sort the pairwise distances among points in ascending order. [sent-55, score-0.078]
32 Repeat connecting the two points with an edge according the order until a connected graph is obtained. [sent-56, score-0.152]
33 Form the affinity matrix W defined by Wij = exp[−d2 (xi , xj )/2σ 2 ] if there is an edge linking xi and xj . [sent-58, score-0.085]
34 Rank each point xi according its ranking scores fi∗ (largest ranked first). [sent-66, score-1.025]
35 The network is simply weighted in the second step and the weight is symmetrically normalized in the third step. [sent-69, score-0.073]
36 In the forth step, all points spread their ranking score to their neighbors via the weighted network. [sent-71, score-1.003]
37 The spread process is repeated until a global stable state is achieved, and in the fifth step the points are ranked according to their final ranking scores. [sent-72, score-1.044]
38 The parameter α specifies the relative contributions to the ranking scores from neighbors and the initial ranking scores. [sent-73, score-1.714]
39 It is worth mentioning that self-reinforcement is avoided since the diagonal elements of the affinity matrix are set to zero in the second step. [sent-74, score-0.064]
40 In addition, the information is spread symmetrically since S is a symmetric matrix. [sent-75, score-0.088]
41 Clearly, the scaling factor β does not make contributions for our ranking task. [sent-82, score-0.811]
42 (2) We can use this closed form to compute the ranking scores of points directly. [sent-84, score-0.981]
43 Our experiments show that a few iterations are enough to yield high quality ranking results. [sent-86, score-0.795]
44 3 Connections with Google Let G = (V, E) denote a directed graph with vertices. [sent-87, score-0.044]
45 Let W denote the n × n adjacency matrix W, in which Wij = 1 if there is a link in E from vertex xi to vertex xj , and Wij = 0 otherwise. [sent-88, score-0.158]
46 Define a random walk on G determined by the following transition probability matrix P = (1 − )U + D −1 W, (3) where U is the matrix with all entries equal to 1/n. [sent-90, score-0.106]
47 This can be interpreted as a probability of transition to an adjacent vertex, and a probability 1 − of jumping to any point on the graph uniform randomly. [sent-91, score-0.044]
48 Then the ranking scores over V computed by PageRank is given by the stationary distribution π of the random walk. [sent-92, score-0.933]
49 If we also rank all points without queries using our method, as is done by Google, then we have the following theorem: Theorem 2 For the task of ranking data represented by a connected and undirected graph without queries, f ∗ and PageRank yield the same ranking list. [sent-95, score-2.02]
50 We fist show that the stationary distribution π of the random walk used in Google is proportional to the vertex degree if the graph G is undirected and connected. [sent-97, score-0.211]
51 Let vol G denote the volume of G, which is given by the sum of vertex degrees. [sent-100, score-0.046]
52 Hence π is also the the stationary distribution of the random walk determined by the transition probability matrix D −1 W. [sent-103, score-0.1]
53 Now we consider the ranking result given by our method in the situation without queries. [sent-104, score-0.795]
54 (5) A standard result [4] of linear algebra states that if f (0) is a vector not orthogonal to the principal eigenvector, then the sequence {f (t)} converges to the principal eigenvector of S. [sent-106, score-0.082]
55 (6) Comparing (4) with (6), it is clear that f ∗ and π give the same ranking list. [sent-111, score-0.795]
56 4 Personalized Google Although PageRank is designed to rank all points without respect to any query, it is easy to modify for query-based ranking problems. [sent-113, score-0.957]
57 The ranking scores given by PageRank are the elements of the convergence solution π ∗ of the iteration equation π(t + 1) = αP T π(t). [sent-115, score-0.905]
58 (7) By analogy with the algorithm in Section 2, we can add a query term on the right-hand side of (7) for the query-based ranking, π(t + 1) = αP T π(t) + (1 − α)y. [sent-116, score-0.284]
59 (8) This can be viewed as the personalized version of PageRank. [sent-117, score-0.047]
60 (10) Hence the main difference between π ∗ and f ∗ is that in the latter the initial ranking score yi of each query xi is weighted with respect to its degree. [sent-122, score-1.204]
61 The above observation motivates us to propose a more general personalized PageRank algorithm, π(t + 1) = αP T π(t) + (1 − α)D k y, (11) in which we assign different importance to queries with respect to their degree. [sent-123, score-0.279]
62 This shows the ranking and classification problems are closely related. [sent-127, score-0.795]
63 We can do a similar analysis of the relations to Kleinberg’s HITS [5], which is another popular web page ranking algorithm. [sent-128, score-0.907]
64 The basic idea of this method is also to iteratively spread the ranking scores via the existing web graph. [sent-129, score-1.023]
65 5 Experiments We validate our method using a toy problem and two real-world domains: image and text. [sent-131, score-0.082]
66 the image and document categories (which is not true in real-world ranking problems), we can compute the ranking error using the Receiver Operator Characteristic (ROC) score [3] to evaluate ranking algorithms. [sent-136, score-2.49]
67 The returned score is between 0 and 1, a score of 1 indicating a perfect ranking. [sent-137, score-0.12]
68 1 Toy Problem In this experiment we considered the toy ranking problem mentioned in the introduction section. [sent-139, score-0.852]
69 The connected graph described in the first step of our algorithm is shown in Figure 2(a). [sent-140, score-0.072]
70 The ranking scores with different time steps: t = 5, 10, 50, 100 are shown in Figures 2(b)-(e). [sent-141, score-0.883]
71 Note that the scores on each moon decrease along the moon shape away from the query, and the scores on the moon containing the query point are larger than on the other moon. [sent-142, score-0.883]
72 Ranking by Euclidean distance is shown in Figure 2(f), which fails to capture the two moons structure. [sent-143, score-0.101]
73 It is worth mentioning that simply ranking the data according to the shortest paths [7] on the graph does not work well. [sent-144, score-0.966]
74 It appears that shortest paths are sensitive to the small changes in the graph. [sent-146, score-0.045]
75 2 Image Ranking In this experiment we address a task of ranking on the USPS handwritten 16x16 digits dataset. [sent-151, score-0.858]
76 (a) Connected graph Figure 2: Ranking on the pattern of two moons. [sent-154, score-0.044]
77 (a) connected graph; (b)-(e) ranking with the different time steps: t = 5, 10, 50, 100; (f) ranking by Euclidean distance. [sent-155, score-1.618]
78 6 10 Manifold ranking Euclidean distance 2 4 2 4 1 0. [sent-177, score-0.839]
79 65 Manifold ranking Euclidean distance 2 4 6 # queries 8 10 0. [sent-200, score-1.029]
80 6 Manifold ranking Euclidean distance 6 # queries 8 10 Figure 3: ROC on USPS for queries from digits 1 to 6. [sent-201, score-1.282]
81 Note that this experimental results also provide indirect proof of the intrinsic manifold structure in USPS. [sent-202, score-0.184]
82 The left panel shows the top 99 by the manifold ranking; and the right panel shows the top 99 by the Euclidean distance based ranking. [sent-205, score-0.265]
83 We randomly select examples from one class of digits to be the query set over 30 trials, and then rank the remaining digits with respect to these sets. [sent-207, score-0.514]
84 The Euclidean distance based ranking method is used as the baseline: given a query set {xs }(s ∈ S), the points x are ranked according to that the highest ranking is given to the point x with the lowest score of mins∈S x − xs . [sent-210, score-2.173]
85 The results, measured as ROC scores, are summarized in Figure 3; each plot corresponds to a different query class, from digit one to six respectively. [sent-211, score-0.426]
86 Our algorithm is comparable to the baseline when a digit 1 is the query. [sent-212, score-0.142]
87 This experimental result also provides indirect proof of the underlying manifold structure in the USPS digit dataset [6, 7]. [sent-214, score-0.29]
88 The top ranked 99 images obtained by our algorithm and Euclidean distance, with a random digit 2 as the query, are shown in Figure 4. [sent-215, score-0.252]
89 Furthermore, there are many curly 2s in the right panel, which do not match well with the query: the 2s in the left panel are more similar to the query than the 2s in the right panel. [sent-218, score-0.331]
90 This subtle superiority makes a great deal of sense in the real-word ranking task, in which users are only interested in very few leading ranking results. [sent-219, score-1.637]
91 The ROC measure is too simple to reflect this subtle superiority however. [sent-220, score-0.047]
92 3 Text Ranking In this experiment, we investigate the task of text ranking using the 20-newsgroups dataset. [sent-222, score-0.827]
93 We choose the topic rec which contains autos, motorcycles, baseball and hockey from the version 20-news-18828. [sent-223, score-0.094]
94 We use the ranking method based on normalized inner product as the baseline. [sent-228, score-0.861]
95 The affinity matrix W is also constructed by inner product, i. [sent-229, score-0.063]
96 The ROC scores for 100 randomly selected queries for each class are given in Figure 5. [sent-232, score-0.278]
97 7 manifold ranking Figure 5: ROC score scatter plots of 100 random queries from the category autos, motorcycles, baseball and hockey contained in the 20-newsgroups dataset. [sent-289, score-1.282]
98 Potentially, if one was given a small labeled set or a query set greater than size 1, one could use standard cross validation techniques. [sent-291, score-0.302]
99 For example one could implement an iterative feedback framework: as the user specifies positive feedback this can be used to extend the query set and improve the ranking output. [sent-294, score-1.079]
100 The anatomy of a large scale hypertextual web search engine. [sent-305, score-0.089]
wordName wordTfidf (topN-words)
[('ranking', 0.795), ('query', 0.284), ('queries', 0.19), ('digit', 0.142), ('moon', 0.141), ('manifold', 0.127), ('pagerank', 0.125), ('euclidean', 0.124), ('roc', 0.108), ('ranked', 0.094), ('web', 0.089), ('scores', 0.088), ('google', 0.085), ('rank', 0.084), ('digits', 0.063), ('score', 0.06), ('points', 0.058), ('toy', 0.057), ('moons', 0.057), ('spread', 0.051), ('autos', 0.047), ('baseball', 0.047), ('hockey', 0.047), ('motorcycles', 0.047), ('personalized', 0.047), ('panel', 0.047), ('vertex', 0.046), ('distance', 0.044), ('graph', 0.044), ('inner', 0.042), ('closed', 0.04), ('sf', 0.04), ('symmetrically', 0.037), ('nity', 0.036), ('intrinsic', 0.036), ('af', 0.036), ('relevances', 0.036), ('stationary', 0.034), ('text', 0.032), ('xq', 0.031), ('documents', 0.029), ('walk', 0.029), ('usps', 0.029), ('fi', 0.029), ('hyperlink', 0.028), ('connected', 0.028), ('superiority', 0.027), ('mentioning', 0.027), ('eigenvector', 0.026), ('xi', 0.026), ('undirected', 0.026), ('image', 0.025), ('paths', 0.025), ('global', 0.024), ('product', 0.024), ('wij', 0.024), ('engine', 0.024), ('collectively', 0.024), ('bousquet', 0.024), ('dy', 0.024), ('page', 0.023), ('according', 0.022), ('assign', 0.022), ('zhou', 0.022), ('converges', 0.022), ('iteration', 0.022), ('matrix', 0.021), ('indirect', 0.021), ('xs', 0.021), ('shortest', 0.02), ('subtle', 0.02), ('document', 0.02), ('neighbors', 0.02), ('respect', 0.02), ('distances', 0.02), ('xj', 0.019), ('weston', 0.019), ('entries', 0.019), ('weighted', 0.019), ('revealed', 0.018), ('universal', 0.018), ('huge', 0.018), ('let', 0.018), ('links', 0.018), ('labeled', 0.018), ('exploits', 0.018), ('assigns', 0.018), ('simply', 0.017), ('principal', 0.017), ('worth', 0.016), ('contributions', 0.016), ('theorem', 0.016), ('ideal', 0.016), ('proportional', 0.016), ('random', 0.016), ('albert', 0.016), ('jeong', 0.016), ('baltimore', 0.016), ('kleinberg', 0.016), ('fth', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
2 0.48541766 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
3 0.15606248 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
4 0.11761042 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
5 0.088969596 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 2 Definition and properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1X the indicator function of a set X. Lemma 1 ([8]) Let c be a fixed classifier. Let x1 , . . . , xm be the output of c on the positive examples and y1 , . . . , yn its output on the negative examples. Then, the AUC, A, associated to c is given by: m n i=1 j=1 1xi >yj (1) A= mn that is the value of the Wilcoxon-Mann-Whitney statistic [8]. Proof. The proof is based on the observation that the AUC value is exactly the probability P (X > Y ) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8]. Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC. 2 An attempt in that direction was made by [15], but, unfortunately, the authors’ analysis and the result are both wrong. Threshold θ k − x Positive examples x Negative examples n − x Negative examples m − (k − x) Positive examples Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. 3 The Expected Value of the AUC In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate. Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 3 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts: A1 + A2 + A3 A= , with (2) mn A1 = the sum over all pairs (xi , yj ) with xi and yj in distinct regions; A2 = the sum over all pairs (xi , yj ) with xi and yj in the region above the threshold; A3 = the sum over all pairs (xi , yj ) with xi and yj in the region below the threshold. The first term, A1 , is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A1 is given by: A1 = (m − (k − x))(n − x) (3) To compute A2 , we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α1 < . . . < αx denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A2 , x A2 = (N − αi ) − (x − i) (4) i=1 where the first term N − αi represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α1 < . . . < αk−x denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A3 is given by: x A3 = (N − αj ) − (x − j) (5) j=1 with N = n − x + (k − x) and x = k − x. Combining the expressions of A1 , A2 , and A3 leads to: A= A1 + A2 + A3 (k − 2x)2 + k ( =1+ − mn 2mn x i=1 αi + mn x j=1 αj ) (6) Lemma 2 For a fixed x, the average value of the AUC A is given by: < A >x = 1 − x n + k−x m 2 (7) x Proof. The proof is based on the computation of the average values of i=1 αi and x j=1 αj for a given x. We start by computing the average value < αi >x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α1 . . . αi−1 and αi+1 . . . αx , when the value of αi is fixed at say αi = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions before αi and N − (x − i) above. There are l − 1 possible positions for α1 . . . αi−1 and N − l possible positions for αi+1 . . . αx . Since the total number of ways of choosing the x positions for α1 . . . αx out of N is N , the average value < αi >x is: x N −(x−i) l=i < αi >x = l l−1 i−1 N −l x−i (8) N x Thus, x < αi >x = x i=1 i=1 Using the classical identity: x < αi >x = N −(x−i) l−1 l i−1 l=i N x u p1 +p2 =p p1 N l=1 l N −1 x−1 N x i=1 N −l x−i v p2 = = N l=1 = u+v p N (N + 1) 2 x l−1 i=1 i−1 N x l N −l x−i (9) , we can write: N −1 x−1 N x = x(N + 1) 2 (10) Similarly, we have: x < αj >x = j=1 x Replacing < i=1 αi >x and < Eq. (10) and Eq. (11) leads to: x j=1 x (N + 1) 2 (11) αj >x in Eq. (6) by the expressions given by (k − 2x)2 + k − x(N + 1) − x (N + 1) =1− 2mn which ends the proof of the lemma. < A >x = 1 + x n + k−x m 2 (12) Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes. Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classifications with k errors is given by: < A >= 1 − k (n − m)2 (m + n + 1) − m+n 4mn k−1 m+n x=0 x k m+n+1 x=0 x k − m+n (13) Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are N possible ways of x choosing the positions of the x misclassified negative examples, and similarly N possible x ways of choosing the positions of the x = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by: < A >= k N x=0 x N x (1 − k N x=0 x N x k−x x n+ m 2 ) (14) r=0.05 r=0.01 r=0.1 r=0.25 0.0 0.1 0.2 r=0.5 0.3 Error rate 0.4 0.5 .00 .05 .10 .15 .20 .25 0.5 0.6 0.7 0.8 0.9 1.0 Mean value of the AUC Relative standard deviation r=0.01 r=0.05 r=0.1 0.0 0.1 r=0.25 0.2 0.3 Error rate r=0.5 0.4 0.5 Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a fixed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m. This expression can be simplified into Eq. (13)3 using the following novel identities: k X N x x=0 k X N x x x=0 ! N x ! ! N x ! = = ! k X n+m+1 x x=0 (15) ! k X (k − x)(m − n) + k n + m + 1 2 x x=0 (16) that we obtained by using Zeilberger’s algorithm4 and numerous combinatorial ’tricks’. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n = m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k = 0. 4 The Variance of the AUC 2 Let D = mn + (k−2x) +k , a = i=1 αi , a = j=1 αj , and α = a + a . Then, by 2 Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ 2 (A), is given by: (mn)2 σ 2 (A) x x = < (D − α)2 − (< D > − < α >)2 > = < D2 > − < D >2 + < α2 > − < α >2 −2(< αD > − < α >< D >) (17) As before, to compute the average of a term X over all classifications, we can first determine its average < X >x for a fixed x, and then use the function F defined by: F (Y ) = k N N x=0 x x k N N x=0 x x Y (18) and < X >= F (< X >x ). A crucial step in computing the exact value of the variance of x the AUC is to determine the value of the terms of the type < a2 >x =< ( i=1 αi )2 >x . 3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors’ analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false. 4 We thank Neil Sloane for having pointed us to Zeilberger’s algorithm and Maple package. x Lemma 3 For a fixed x, the average of ( i=1 αi )2 is given by: x(N + 1) < a2 > x = (3N x + 2x + N ) 12 (19) Proof. By definition of a, < a2 >x = b + 2c with: x x α2 >x i b =< c =< αi αj >x (20) 1≤i
6 0.085987851 12 nips-2003-A Model for Learning the Semantics of Pictures
7 0.068785064 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
8 0.065548703 108 nips-2003-Learning a Distance Metric from Relative Comparisons
9 0.058355272 126 nips-2003-Measure Based Regularization
10 0.054945417 120 nips-2003-Locality Preserving Projections
11 0.047097143 118 nips-2003-Link Prediction in Relational Data
12 0.045576133 152 nips-2003-Pairwise Clustering and Graphical Models
13 0.044615451 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
14 0.044284929 88 nips-2003-Image Reconstruction by Linear Programming
15 0.04365531 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
16 0.042792942 73 nips-2003-Feature Selection in Clustering Problems
17 0.042146303 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
18 0.041294746 128 nips-2003-Minimax Embeddings
19 0.040119972 112 nips-2003-Learning to Find Pre-Images
20 0.038244437 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
topicId topicWeight
[(0, -0.142), (1, -0.1), (2, -0.057), (3, -0.017), (4, -0.04), (5, -0.007), (6, -0.173), (7, -0.049), (8, 0.069), (9, -0.163), (10, 0.272), (11, 0.095), (12, 0.131), (13, 0.026), (14, 0.2), (15, -0.269), (16, -0.349), (17, -0.301), (18, -0.094), (19, -0.128), (20, -0.181), (21, 0.093), (22, 0.122), (23, -0.034), (24, 0.154), (25, 0.138), (26, -0.084), (27, -0.005), (28, -0.054), (29, -0.099), (30, -0.069), (31, -0.027), (32, -0.007), (33, -0.046), (34, 0.007), (35, 0.094), (36, 0.01), (37, -0.02), (38, 0.07), (39, 0.038), (40, -0.083), (41, -0.039), (42, 0.036), (43, 0.077), (44, 0.033), (45, -0.01), (46, 0.022), (47, -0.038), (48, -0.024), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.97657615 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
2 0.86617112 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
3 0.46271953 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 2 Definition and properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1X the indicator function of a set X. Lemma 1 ([8]) Let c be a fixed classifier. Let x1 , . . . , xm be the output of c on the positive examples and y1 , . . . , yn its output on the negative examples. Then, the AUC, A, associated to c is given by: m n i=1 j=1 1xi >yj (1) A= mn that is the value of the Wilcoxon-Mann-Whitney statistic [8]. Proof. The proof is based on the observation that the AUC value is exactly the probability P (X > Y ) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8]. Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC. 2 An attempt in that direction was made by [15], but, unfortunately, the authors’ analysis and the result are both wrong. Threshold θ k − x Positive examples x Negative examples n − x Negative examples m − (k − x) Positive examples Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. 3 The Expected Value of the AUC In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate. Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 3 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts: A1 + A2 + A3 A= , with (2) mn A1 = the sum over all pairs (xi , yj ) with xi and yj in distinct regions; A2 = the sum over all pairs (xi , yj ) with xi and yj in the region above the threshold; A3 = the sum over all pairs (xi , yj ) with xi and yj in the region below the threshold. The first term, A1 , is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A1 is given by: A1 = (m − (k − x))(n − x) (3) To compute A2 , we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α1 < . . . < αx denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A2 , x A2 = (N − αi ) − (x − i) (4) i=1 where the first term N − αi represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α1 < . . . < αk−x denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A3 is given by: x A3 = (N − αj ) − (x − j) (5) j=1 with N = n − x + (k − x) and x = k − x. Combining the expressions of A1 , A2 , and A3 leads to: A= A1 + A2 + A3 (k − 2x)2 + k ( =1+ − mn 2mn x i=1 αi + mn x j=1 αj ) (6) Lemma 2 For a fixed x, the average value of the AUC A is given by: < A >x = 1 − x n + k−x m 2 (7) x Proof. The proof is based on the computation of the average values of i=1 αi and x j=1 αj for a given x. We start by computing the average value < αi >x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α1 . . . αi−1 and αi+1 . . . αx , when the value of αi is fixed at say αi = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions before αi and N − (x − i) above. There are l − 1 possible positions for α1 . . . αi−1 and N − l possible positions for αi+1 . . . αx . Since the total number of ways of choosing the x positions for α1 . . . αx out of N is N , the average value < αi >x is: x N −(x−i) l=i < αi >x = l l−1 i−1 N −l x−i (8) N x Thus, x < αi >x = x i=1 i=1 Using the classical identity: x < αi >x = N −(x−i) l−1 l i−1 l=i N x u p1 +p2 =p p1 N l=1 l N −1 x−1 N x i=1 N −l x−i v p2 = = N l=1 = u+v p N (N + 1) 2 x l−1 i=1 i−1 N x l N −l x−i (9) , we can write: N −1 x−1 N x = x(N + 1) 2 (10) Similarly, we have: x < αj >x = j=1 x Replacing < i=1 αi >x and < Eq. (10) and Eq. (11) leads to: x j=1 x (N + 1) 2 (11) αj >x in Eq. (6) by the expressions given by (k − 2x)2 + k − x(N + 1) − x (N + 1) =1− 2mn which ends the proof of the lemma. < A >x = 1 + x n + k−x m 2 (12) Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes. Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classifications with k errors is given by: < A >= 1 − k (n − m)2 (m + n + 1) − m+n 4mn k−1 m+n x=0 x k m+n+1 x=0 x k − m+n (13) Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are N possible ways of x choosing the positions of the x misclassified negative examples, and similarly N possible x ways of choosing the positions of the x = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by: < A >= k N x=0 x N x (1 − k N x=0 x N x k−x x n+ m 2 ) (14) r=0.05 r=0.01 r=0.1 r=0.25 0.0 0.1 0.2 r=0.5 0.3 Error rate 0.4 0.5 .00 .05 .10 .15 .20 .25 0.5 0.6 0.7 0.8 0.9 1.0 Mean value of the AUC Relative standard deviation r=0.01 r=0.05 r=0.1 0.0 0.1 r=0.25 0.2 0.3 Error rate r=0.5 0.4 0.5 Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a fixed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m. This expression can be simplified into Eq. (13)3 using the following novel identities: k X N x x=0 k X N x x x=0 ! N x ! ! N x ! = = ! k X n+m+1 x x=0 (15) ! k X (k − x)(m − n) + k n + m + 1 2 x x=0 (16) that we obtained by using Zeilberger’s algorithm4 and numerous combinatorial ’tricks’. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n = m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k = 0. 4 The Variance of the AUC 2 Let D = mn + (k−2x) +k , a = i=1 αi , a = j=1 αj , and α = a + a . Then, by 2 Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ 2 (A), is given by: (mn)2 σ 2 (A) x x = < (D − α)2 − (< D > − < α >)2 > = < D2 > − < D >2 + < α2 > − < α >2 −2(< αD > − < α >< D >) (17) As before, to compute the average of a term X over all classifications, we can first determine its average < X >x for a fixed x, and then use the function F defined by: F (Y ) = k N N x=0 x x k N N x=0 x x Y (18) and < X >= F (< X >x ). A crucial step in computing the exact value of the variance of x the AUC is to determine the value of the terms of the type < a2 >x =< ( i=1 αi )2 >x . 3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors’ analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false. 4 We thank Neil Sloane for having pointed us to Zeilberger’s algorithm and Maple package. x Lemma 3 For a fixed x, the average of ( i=1 αi )2 is given by: x(N + 1) < a2 > x = (3N x + 2x + N ) 12 (19) Proof. By definition of a, < a2 >x = b + 2c with: x x α2 >x i b =< c =< αi αj >x (20) 1≤i
4 0.34071723 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
5 0.25593552 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
6 0.22396143 120 nips-2003-Locality Preserving Projections
7 0.1986056 126 nips-2003-Measure Based Regularization
8 0.19016074 108 nips-2003-Learning a Distance Metric from Relative Comparisons
9 0.18633382 30 nips-2003-Approximability of Probability Distributions
10 0.18035689 12 nips-2003-A Model for Learning the Semantics of Pictures
11 0.1776499 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
12 0.17496572 172 nips-2003-Semi-Supervised Learning with Trees
13 0.17355172 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
14 0.16844395 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
15 0.16331989 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
16 0.1607502 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
17 0.16044113 152 nips-2003-Pairwise Clustering and Graphical Models
18 0.16003111 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
19 0.15552811 168 nips-2003-Salient Boundary Detection using Ratio Contour
20 0.15327463 118 nips-2003-Link Prediction in Relational Data
topicId topicWeight
[(0, 0.043), (11, 0.066), (16, 0.241), (29, 0.015), (30, 0.015), (35, 0.064), (50, 0.057), (53, 0.107), (69, 0.013), (71, 0.087), (76, 0.038), (85, 0.037), (91, 0.08), (99, 0.028)]
simIndex simValue paperId paperTitle
1 0.86360496 118 nips-2003-Link Prediction in Relational Data
Author: Ben Taskar, Ming-fai Wong, Pieter Abbeel, Daphne Koller
Abstract: Many real-world domains are relational in nature, consisting of a set of objects related to each other in complex ways. This paper focuses on predicting the existence and the type of links between entities in such domains. We apply the relational Markov network framework of Taskar et al. to define a joint probabilistic model over the entire link graph — entity attributes and links. The application of the RMN algorithm to this task requires the definition of probabilistic patterns over subgraph structures. We apply this method to two new relational datasets, one involving university webpages, and the other a social network. We show that the collective classification approach of RMNs, and the introduction of subgraph patterns over link labels, provide significant improvements in accuracy over flat classification, which attempts to predict each link in isolation. 1
same-paper 2 0.84996575 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
3 0.61143428 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
4 0.58013517 12 nips-2003-A Model for Learning the Semantics of Pictures
Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon
Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allow us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1
5 0.5786714 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.57827473 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
7 0.57737553 122 nips-2003-Margin Maximizing Loss Functions
8 0.57469273 30 nips-2003-Approximability of Probability Distributions
9 0.57387596 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
10 0.57324106 189 nips-2003-Tree-structured Approximations by Expectation Propagation
11 0.57305908 81 nips-2003-Geometric Analysis of Constrained Curves
12 0.57240397 168 nips-2003-Salient Boundary Detection using Ratio Contour
13 0.57238132 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
14 0.57229495 107 nips-2003-Learning Spectral Clustering
15 0.57222539 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
16 0.56985557 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
17 0.56944478 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
18 0.56887013 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
19 0.56883198 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
20 0.56874418 112 nips-2003-Learning to Find Pre-Images