nips nips2004 nips2004-165 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf
Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. [sent-7, score-0.812]
2 To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. [sent-8, score-0.638]
3 1 Introduction We consider semi-supervised classification problems on weighted directed graphs, in which some nodes in the graph are labeled as positive or negative, and where the task consists in classifying unlabeled nodes. [sent-11, score-0.669]
4 Typical examples of this kind are Web page categorization based on hyperlink structure [4, 11] and document classification or recommendation based on citation graphs [10], yet similar problems exist in other domains such as computational biology. [sent-12, score-0.382]
5 For the sake of concreteness, we will mainly focus on the Web graph in the sequel, i. [sent-13, score-0.16]
6 the considered graph represents a subgraph of the Web, where nodes correspond to Web pages and directed edges represent hyperlinks between them (cf. [sent-15, score-0.715]
7 We refrain from utilizing attributes or features associated with each node, which may or may not be available in applications, but rather focus on the analysis of the connectivity of the graph as a means for classifying unlabeled nodes. [sent-17, score-0.288]
8 Such an approach inevitably needs to make some a priori premise about how connectivity and categorization of individual nodes may be related in real-world graphs. [sent-18, score-0.199]
9 The fundamental assumption of our framework is the category similarity of co-linked nodes in a directed graph. [sent-19, score-0.403]
10 This is a slightly more complex concept than in the case of undirected (weighted) graphs [1, 18, 12, 15, 17], where a typical assumption is that an edge connecting two nodes will more or less increase the likelihood of the nodes belonging to the same category. [sent-20, score-0.382]
11 Co-linkage on the other hand seems a more suitable and promising concept in directed graphs, as is witnessed by its successful use in Web page categorization [4] as well as co-citation analysis for information retrieval [10]. [sent-21, score-0.438]
12 nodes with common parents, and co-parent structures, i. [sent-24, score-0.086]
13 In most Web and citation graph related application, the first assumption, namely that nodes with highly overlapping parent sets are likely to belong to the same category, seems to be more relevant (cf. [sent-27, score-0.332]
14 One possible way of designing classifiers based on graph connectivity is to construct a kernel matrix based on pairwise links [11] and then to adopt a standard kernel method, e. [sent-29, score-0.367]
15 The idea of exploiting global rather than local graph structure is widely used in other Web-related techniques, including Web page ranking [2, 13], finding similar Web pages [7], detecting Web communities [13, 9] and so on. [sent-33, score-0.376]
16 The major innovation of this paper is a general regularization framework on directed graphs, in which the directionality and global relationships are considered, and a computationally attractive classification algorithm, which is derived from the proposed regularization framework. [sent-34, score-0.563]
17 1 Regularization Framework Preliminaries A directed graph Γ = (V, E) consists of a set of vertices, denoted by V and a set of edges, denoted by E ⊆ V × V . [sent-36, score-0.505]
18 Each edge is an ordered pair of nodes [u, v] representing a directed connection from u to v. [sent-37, score-0.377]
19 In a weighted directed graph, a weight function w : V × V → R+ is associated with Γ, satisfying w([u, v]) = 0 if and only if [u, v] ∈ E. [sent-41, score-0.291]
20 Typically, we can equip a directed graph with a canonical weight function by defining w([u, v]) ≡ 1 if and only if [u, v] ∈ E. [sent-42, score-0.495]
21 The in-degree p(v) and out-degree q(v) of a vertex v ∈ V , respectively, are defined as p(v) ≡ w([u, v]), and {u|[u,v]∈E} q(v) ≡ w([v, u]) . [sent-43, score-0.167]
22 (1) {u|[v,u]∈E} Let H(V ) denote the space of functions f : V → R, which assigns a real value f (v) to each vertex v. [sent-44, score-0.167]
23 The function f can be represented as a column vector in R|V | , where |V | denotes the number of the vertices in V . [sent-45, score-0.09]
24 Bipartite Graphs A bipartite graph G = (H, A, L) is a special type of directed graph that consists of two sets of vertices, denoted by H and A respectively, and a set of edges (or links), denoted by L ⊆ H × A. [sent-49, score-0.886]
25 In a bipartite graph, each edge connects a vertex in H to a vertex in A. [sent-50, score-0.523]
26 Any directed graph Γ = (V, E) can be regarded as a bipartite graph using the following simple construction [14]: H ≡ {h|h ∈ V, q(h) > 0}, A ≡ {a|a ∈ V, p(a) > 0}, and L ≡ E. [sent-51, score-0.849]
27 Figure 1 depicts the construction of the bipartite graph. [sent-52, score-0.212]
28 Notice that vertices of the original graph Γ may appear in both vertex sets H and A of the constructed bipartite graph. [sent-53, score-0.606]
29 The intuition behind the construction of the bipartite graph is provided by the so-called hub and authority web model introduced by Kleinberg [13]. [sent-54, score-1.588]
30 The model distinguishes between two types of Web pages: authoritative pages, which are pages relevant to some topic, and hub pages, which are pages pointing to relevant pages. [sent-55, score-0.661]
31 Note that some Web pages can Figure 1: Constructing a bipartite graph from a directed one. [sent-56, score-0.747]
32 The hub set H = {1, 3, 5, 6}, and the authority set A = {2, 3, 4}. [sent-59, score-0.844]
33 Notice that the vertex indexed by 3 is simultaneously in the hub and authority set. [sent-60, score-1.011]
34 simultaneously be both hub and authority pages (see Figure 1). [sent-61, score-0.951]
35 Hubs and authorities exhibit a mutually reinforcement relationship: a good hub node points to many good authorities and a good authority node is pointed to by many good hubs. [sent-62, score-1.019]
36 It is interesting to note that in general there is no direct link from one authority to another. [sent-63, score-0.508]
37 It is the hub pages that glue together authorities on a common topic. [sent-64, score-0.578]
38 According to Kleinberg’s model, we suggestively call the vertex set H in the bipartite graph the hub set, and the vertex set A the authority set. [sent-65, score-1.527]
39 q(h) (3) In addition, we define ch (v, v) = 0 for all v in the authority set A and for all h in the hub set H. [sent-68, score-0.954]
40 Such a relevance measure can be naturally understood in the situation of citation networks. [sent-69, score-0.108]
41 If two articles are simultaneously cited by some other article, then this should make it more likely that both articles deal with a similar topics. [sent-70, score-0.132]
42 Moreover, the more articles cite both articles together, the more significant the connection. [sent-71, score-0.132]
43 Let us consider the following two web sites: Yahoo! [sent-73, score-0.372]
44 consists of pages having a large number of diverse hyperlinks. [sent-76, score-0.107]
45 The fact that two web pages are co-linked by Yahoo! [sent-77, score-0.479]
46 In contrast, the pages on the kernel machine Web site have much fewer hyperlinks, but the Web pages pointed to are closely related in topic. [sent-79, score-0.316]
47 Let f denote a function defined on the authority set A. [sent-80, score-0.447]
48 The smoothness of function f can be measured by the following functional: ΩA (f ) = 1 2 ch ([u, v]) u,v h f (u) p(u) − f (v) p(v) 2 . [sent-81, score-0.21]
49 (4) The smoothness functional penalizes large differences in function values for vertices in the authority set A that are strongly related. [sent-82, score-0.637]
50 Left panel: vertices u and v in the authority set A are colinked by vertex h in the hub set H. [sent-84, score-1.101]
51 Right panel: vertices u and v in the hub set H co-link vertex a in the authority set A. [sent-85, score-1.101]
52 Many web pages contain links to popular sites like the Google search engine. [sent-88, score-0.578]
53 This does not mean though that all these Web pages share a common topic. [sent-89, score-0.107]
54 However, if two web pages point to web page like the one of the Learning with Kernels book, it is likely to express a common interest for kernel methods. [sent-90, score-0.975]
55 Now define a linear operator T : H(A) → H(H) by w([h, a]) (T f )(h) = a q(h)p(a) f (a). [sent-91, score-0.08]
56 (6) Then its adjoint T ∗ : H(H) → H(A) is given by w([h, a]) (T ∗ f )(a) = h q(h)p(a) These two operators T and T ∗ were also implicitly suggested by [8] for developing a new Web page ranking algorithm. [sent-93, score-0.146]
57 Further define the operator SA : H(A) → H(A) by composing T and T ∗ , i. [sent-94, score-0.08]
58 Comparing with the combinatorial Laplace operator defined on undirected graphs [5], we can think of the operator ∆A as a Laplacian but defined on the authority set of directed graphs. [sent-99, score-1.151]
59 In fact, we can further show that the eigenvalues of the operator SA are scattered in [0, 1], and accordingly the eigenvalues of the Laplacian ∆A fall into [0, 1]. [sent-101, score-0.108]
60 Similarly, if two distinct vertices u and v co-link vertex a in the authority set A as shown in right panel of Figure 2, then u and v are also thought to be related. [sent-102, score-0.749]
61 p(a) (9) and the smoothness of function f on the hub set H can be measured by: ΩH (f ) = 1 2 ca ([u, v]) u,v a f (u) q(u) − f (v) q(v) 2 . [sent-104, score-0.497]
62 Convexly combining together the two smoothness functionals (4) and (10), we obtain a smoothness measure of function f defined on the whole vertex set V : Ωγ (f ) = γΩA (f ) + (1 − γ)ΩH (f ), 0 ≤ γ ≤ 1, (11) where the parameter γ weighs the relative importance between ΩA (f ) and ΩH (f ). [sent-107, score-0.404]
63 Extend the operator T to H(V ) by defining (T f )(v) = 0 if v is only in the authority set A and not in the hub set H. [sent-108, score-0.924]
64 Similarly extend T ∗ by defining (T ∗ f )(v) = 0 if v is only in the hub set H and not in the authority set A. [sent-109, score-0.844]
65 Then, if the remaining operators are extended correspondingly, one can define the operator Sγ : H(V ) → H(V ) by Sγ = γSA + (1 − γ)SH , (12) and the Laplacian on directed graphs ∆γ : H(V ) → H(V ) by ∆γ = I − S γ . [sent-110, score-0.531]
66 4 Regularization Define a function y in H(V ) in which y(v) = 1 or −1 if vertex v is labeled as positive or negative, and 0, if it is not labeled. [sent-115, score-0.236]
67 The classification problem can be regarded as the problem of finding a function f , which reproduces the target function y to a sufficient degree of accuracy while being smooth in a sense quantified by the above smoothness functional. [sent-116, score-0.126]
68 (14) f ∗ = argmin Ωγ (f ) + 2 f ∈H(V ) The final classification of vertex v is obtained as sign f ∗ (v). [sent-118, score-0.167]
69 The first term in the bracket is called the smoothness term or regularizer, which measures the smoothness of function f, and the second term is called the fitting term, which measures its closeness to the given function y. [sent-119, score-0.315]
70 It is worth noting that the closed form solution presented by Corollary 5 shares the same appearance as the algorithm proposed by [17], which operates on undirected graphs. [sent-129, score-0.137]
71 3 Experiments We considered the Web page categorization task on the WebKB dataset [6]. [sent-130, score-0.147]
72 We only addressed a subset which contains the pages from the four universities: Cornell, Texas, Washington, Wisconsin. [sent-131, score-0.107]
73 We removed pages without incoming or outgoing links, resulting in 858, 825, 1195 and 1238 pages respectively, for a total of 4116. [sent-132, score-0.214]
74 These pages were manually classified into the following seven categories: student, faculty, staff, department, course, project and other. [sent-133, score-0.107]
75 The first is used to illustrate the significance of connectivity information in classification, whereas the second one stresses the importance of preserving the directionality of edges. [sent-135, score-0.139]
76 We may assign a weight to each hyperlink according to the textual content of web pages or the anchor text contained in hyperlinks. [sent-136, score-0.556]
77 However, here we are only interested in how much we can obtain from link structure only and hence adopt the canonical weight function defined in Section 2. [sent-137, score-0.083]
78 We first study an extreme classification problem: predicting which university the pages belong to from very few labeled training examples. [sent-139, score-0.176]
79 Since pages within a university are well-linked, and cross links between different universities are rare, we can imagine that few training labels are enough to exactly classify pages based on link information only. [sent-140, score-0.444]
80 For each of the universities, we in turn viewed corresponding pages as positive examples and the pages from the remaining universities as negative examples. [sent-141, score-0.337]
81 We have randomly draw two pages as the training examples under the constraint that there is at least one labeled instance for each class. [sent-142, score-0.176]
82 ) Since the Web graph is not connected, some small isolated subgraphs possibly do not contain labeled instances. [sent-147, score-0.278]
83 The values of our classifying function on the pages contained in these subgraphs will be zeros and we simply think of these pages as negative examples. [sent-148, score-0.366]
84 We compare our method with SVMs using a kernel matrix K constructed as K = W T W [11], where W denotes the adjacency matrix of the web graph and W T denotes the transpose of W . [sent-150, score-0.579]
85 02) However, to be fair, we should state that the kernel matrix that we used in the SVM may not be the best possible kernel matrix for this task — this is an ongoing research issue which is not the topic of the present paper. [sent-167, score-0.118]
86 The other investigated task is to discriminate the student pages in a university from the non-student pages in the same university. [sent-168, score-0.291]
87 As a baseline we have applied our regularization method on the undirected graph [17] obtained by treating links as undirected or bidirectional, i. [sent-169, score-0.527]
88 The experimental results in Figure 3(a)-3(d) clearly demonstrate that taking the directionality of edges into account can yield substantial accuracy gains. [sent-173, score-0.106]
89 We think that is because the subgraphs in each university are quite small, limiting the information conveyed in the graph structure. [sent-176, score-0.252]
90 This confirms the conjecture that co-link structure among authority pages is much more important than within the hub set. [sent-179, score-0.951]
91 10) 2 4 6 8 10 12 14 # labeled points 16 18 directed (γ=1, α=0. [sent-188, score-0.36]
92 76 4 6 8 10 12 14 # labeled points (d) Wisconsin 16 18 20 12 14 # labeled points 16 18 20 0. [sent-194, score-0.138]
93 Figure (a)-(d) depict the AUC scores of the directed and undirected regularization methods on the classification problem student vs. [sent-248, score-0.532]
94 4 Conclusions We proposed a general regularization framework on directed graphs, which has been validated on a real-word Web data set. [sent-251, score-0.392]
95 In addition, it is worth noticing that this framework can be applied without any essential changes to bipartite graphs, e. [sent-253, score-0.241]
96 to graphs describing customers’ purchase behavior in market basket analysis. [sent-255, score-0.143]
97 Moreover, in the absence of labeled instances, this framework can be utilized in an unsupervised setting as a (spectral) clustering method for directed or bipartite graphs. [sent-256, score-0.575]
98 The anatomy of a large scale hypertextual web search engine. [sent-268, score-0.372]
99 PageRank, HITS and a unified framework for link analysis. [sent-327, score-0.087]
100 (4): ΩA (f ) = f (u)f (v) f 2 (u) − p(u) p(u)p(v) ch ([u, v]) u,v h = f 2 (u) − p(u) ch ([u, v]) u v h ch ([u, v]) u,v h f (u)f (v) . [sent-393, score-0.33]
wordName wordTfidf (topN-words)
[('authority', 0.447), ('hub', 0.397), ('web', 0.372), ('directed', 0.291), ('bipartite', 0.189), ('vertex', 0.167), ('graph', 0.16), ('auc', 0.147), ('undirected', 0.111), ('ch', 0.11), ('pages', 0.107), ('smoothness', 0.1), ('universities', 0.099), ('graphs', 0.099), ('cornell', 0.098), ('vertices', 0.09), ('citation', 0.086), ('nodes', 0.086), ('operator', 0.08), ('page', 0.077), ('proposition', 0.076), ('regularization', 0.075), ('authorities', 0.074), ('directionality', 0.074), ('categorization', 0.07), ('links', 0.07), ('labeled', 0.069), ('articles', 0.066), ('www', 0.065), ('link', 0.061), ('sa', 0.058), ('texas', 0.055), ('student', 0.055), ('sh', 0.052), ('authoritative', 0.05), ('hyperlink', 0.05), ('subgraphs', 0.049), ('yahoo', 0.049), ('laplacian', 0.048), ('kernel', 0.047), ('panel', 0.045), ('think', 0.043), ('connectivity', 0.043), ('classi', 0.043), ('hypertext', 0.043), ('webkb', 0.043), ('bracket', 0.043), ('acm', 0.042), ('svms', 0.041), ('hyperlinks', 0.039), ('kleinberg', 0.039), ('functionals', 0.037), ('giles', 0.037), ('operators', 0.037), ('washington', 0.034), ('wisconsin', 0.033), ('zhou', 0.033), ('classifying', 0.033), ('ranking', 0.032), ('cation', 0.032), ('notice', 0.032), ('edges', 0.032), ('unlabeled', 0.03), ('icml', 0.03), ('sites', 0.029), ('completes', 0.029), ('site', 0.028), ('accordingly', 0.028), ('pointed', 0.027), ('contained', 0.027), ('denoted', 0.027), ('spectral', 0.027), ('worth', 0.026), ('framework', 0.026), ('regarded', 0.026), ('substituting', 0.026), ('corollary', 0.024), ('uence', 0.024), ('remaining', 0.024), ('term', 0.024), ('topic', 0.024), ('induced', 0.024), ('construction', 0.023), ('relevance', 0.022), ('canonical', 0.022), ('investigated', 0.022), ('dom', 0.022), ('innovation', 0.022), ('basket', 0.022), ('zha', 0.022), ('symbolic', 0.022), ('flake', 0.022), ('freitag', 0.022), ('gondek', 0.022), ('stresses', 0.022), ('purchase', 0.022), ('refrain', 0.022), ('libraries', 0.022), ('chakrabarti', 0.022), ('equip', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 165 nips-2004-Semi-supervised Learning on Directed Graphs
Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf
Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1
2 0.1368694 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
Author: Corinna Cortes, Mehryar Mohri
Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classifier is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classification system is essential. The criterion widely used to measure the ranking quality of a classification algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of confidence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the confidence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar confidence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of confidence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight confidence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose confidence intervals, or strong distribution-dependent assumptions leading to tight but unsafe confidence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. Our results are also useful for testing the statistical significance of the difference of the AUC values of two classifiers. The paper is organized as follows. We first introduce the definition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and briefly review some essential aspects of the existing literature related to the computation of confidence intervals for the AUC. Our computation of the expected value and variance of the AUC for a fixed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. Section 5 gives the reduced expressions for the expected value and variance of the AUC for a fixed error rate. These can be efficiently computed and used to determine our confidence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Definition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. The AUC is defined as the area under the ROC curve. Consider a binary classification task with m positive examples and n negative examples. Let C be a fixed classifier that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classification. It can be viewed as a measure based on pairwise comparisons between classifications of the two classes. It is an estimate of the probability Pxy that the classifier ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section briefly describes some previous distribution-dependent approaches presented in the statistical literature to derive confidence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a fixed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classifier ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . The resulting confidence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can fix m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose confidence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we find it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justified empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a fixed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the confidence intervals. For all q ≥ 0, let Xq (k, m, n) be defined by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a fixed error rate k/N assuming that all classifications or rankings with k errors are equiprobable. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classifications with exactly k errors. In previous work, we gave the exact expression of the expectation of the AUC for a fixed number of errors k: Proposition 2 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by: E[A] = 1 − k (n − m)2 (m + n + 1) − m+n 4mn k − m+n k−1 m+n x=0 x k m+n+1 x=0 x . Note that the two sums in this expression cannot be further simplified since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F defined for all Y by: F (Y ) = k M M x=0 x x k M M x=0 x x Y . (6) The following proposition reproduces that result: Proposition 3 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by: σ 2 (A) = F ((1 − k−x x n+ m )2 ) 2 2 2 F ( mx +n(k−x) +(m(m+1)x+n(n+1)(k−x))−2x(k−x)(m+n+1) ). 12m2 n2 − F ((1 − k−x x n+ m 2 ))2 + Because of the products of binomial terms, the computation of the variance using this expression is inefficient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section which leads to significantly more efficient computations that we have been using in all our experiments. Corollary 1 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with ((m+n−2)Z4 −(2m−n+3k−10)Z3 ) k errors is given by: σ 2 (A) = (m+n+1)(m+n)(m+n−1)T72m2 n2 + (m+n+1)(m+n)T (m2 −nm+3km−5m+2k2 −nk+12−9k)Z2 48m2 n2 (m+n+1)Q1 Z1 kQ0 + 144m2 n2 with: 2 n2 72m Pk−i m+n+1−i Zi = x=0 Pk ( x=0 x (m+n+1) x ) − 2 (m+n+1)2 (m−n)4 Z1 16m2 n2 − , T = 3((m − n)2 + m + n) + 2, and: Q0 = (m + n + 1)T k2 + ((−3n2 + 3mn + 3m + 1)T − 12(3mn + m + n) − 8)k + (−3m2 + 7m + 10n + 3nm + 10)T − 4(3mn + m + n + 1) Q1 = T k3 + 3(m − 1)T k2 + ((−3n2 + 3mn − 3m + 8)T − 6(6mn + m + n))k + (−3m2 + 7(m + n) + 3mn)T − 2(6mn + m + n) Proof. The expression of the variance given in Proposition 3 requires the computation of Xq (k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X0 and X1 and Proposition 1, which provides the expression of X2 , σ 2 (A) can be reduced to the expression given by the corollary. 6 Theory and Analysis Our estimate of the confidence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a confidence interval E = [e1 , e2 ] is given for the error rate of a classifier C over a sample S, with the confidence level 1 − . This interval may have have been derived from a binomial model of C, which is a standard assumption for determining a confidence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassifications, we can use the expectation and variance computed in the previous section and Chebyshev’s inequality to predict a confidence interval Ae for the AUC at the confidence level 1 − . Since our equiprobable model for the classifications is independent of the model used to compute the interval of confidence for the error rate, we can use E and Ae , e ∈ E, to compute a confidence interval of the AUC at the level (1 − )(1 − ). Theorem 1 Let C be a binary classifier and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e1 , e2 ] be a confidence interval for the error rate of C over S at the confidence level 1 − . Then, for any , 0 ≤ ≤ 1, we can compute a confidence interval for the AUC value of the classifier C at the confidence level (1 − )(1 − ) that depends only on , , m, n, and the interval E. Proof. Let k1 = N e1 and k2 = N e2 be the number of errors associated to the error rates e1 and e2 , and let IK be the interval IK = [k1 , k2 ]. For a fixed k ∈ IK , by Propositions 2 and Corollary 1, we can compute the exact value of the expectation E[Ak ] and variance σ 2 (Ak ) of the AUC Ak . Using Chebyshev’s inequality, for any k ∈ IK and any k > 0, σ(Ak ) P |Ak − E[Ak ]| ≥ √ k ≤ (7) k where E[Ak ] and σ(Ak ) are the expressions given in Propositions 2 and Corollary 1, which depend only on k, m, and n. Let α1 and α2 be defined by: α1 = min k∈IK σ(Ak ) E[Ak ] − √ σ(Ak ) α2 = max E[Ak ] + √ k∈IK k (8) k α1 and α2 only depend on IK (i.e., on e1 and e2 ), and on k, m, and n. Let IA be the confidence interval defined by IA = [α1 , α2 ] and let k = for all k ∈ IK . Using the fact that the confidence interval E is independent of our equiprobability model for fixed-k AUC values and the Bayes’ rule: P(A ∈ IA ) = k∈R+ ≥ k∈IK P (A ∈ IA | K = k)P (K = k) (9) P (A ∈ IA | K = k)P (K = k) (10) ≥ (1 − ) k∈IK P (K = k) ≥ (1 − )(1 − ) (11) where we used the property of Eq. 7 and the definitions of the intervals IK and IA . Thus, IA constitutes a confidence interval for the AUC value of C at the confidence level (1 − )(1 − ). In practice, the confidence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem. .020 .035 Standard deviation Standard deviation .030 .015 .010 Max Distribution−dependent Distribution−independent .005 .025 .020 .015 Max Distribution−dependent Distribution−independent .010 .005 0.75 0.80 0.85 0.90 0.95 1.00 0.6 0.7 0.8 AUC (a) 0.9 1.0 AUC (b) Figure 1: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10]. Theorem 2 Let C be a binary classifier, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k0 be the number of misclassifications of C on S. Assume that C follows a binomial law, then, for any , 0 ≤ ≤ 1, we can compute a confidence interval of the AUC value of the classifier C at the confidence level 1 − that depends only on , k0 , m, and n. Proof. Assume that C follows a binomial law with coefficient p. Then, Chebyshev’s inequality yields: 1 p(1 − p) ≤ (12) P(|C − E[C]| ≥ η) ≤ 2 Nη 4N η 2 1 1 Thus, E = [ k0 − √ √ , k0 + √ √ ] forms a confidence interval for the N 2 (1− 1− )N N 2 √ (1− 1− )N error rate of C at the confidence level 1 − . By Theorem 1, we can√ compute for the √ AUC value a confidence interval at the level (1 − (1 − 1 − ))(1 − (1 − 1 − )) = 1 − depending only on , m, n, and the interval E, i.e., k0 , N = m + n, and . For large N , we can use the normal approximation of the binomial law to determine a finer interval E. Indeed, for large N , √ (13) P(|C − E[C]| ≥ η) ≤ 2Φ(2 N η) with Φ(u) = ∞ e−x2 /2 √ dx. u 2π Thus, E = [ k0 − N √ Φ−1 ( 1− 21− ) k0 √ ,N 2 N √ confidence interval for the error rate at the confidence level √ + Φ−1 ( 1− 21− ) √ ] 2 N is the 1− . For simplicity, in the proof of Theorem 2, k was chosen to be a constant ( k = ) but, in general, it can be another function of k leading to tighter confidence intervals. The results presented in the next section were obtained with k = a0 exp((k − k0 )2 /2a2 ), where a0 1 and a1 are constants selected so that the inequality 11 be verified. 7 Experiments and Comparisons The analysis in the previous section provides a principled method for computing a confidence interval of the AUC value of a classier C at the confidence level 1 − that depends only on k, n and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow confidence intervals based on questionable assumptions on the probability functions Px and Py [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for NAME m+n n m+n AUC k m+n σindep σA σdep σmax 368 700 303 1159 2473 201 0.63 0.67 0.54 0.17 0.10 0.37 0.70 0.63 0.87 0.85 0.84 0.85 0.24 0.26 0.13 0.05 0.03 0.13 0.0297 0.0277 0.0176 0.0177 0.0164 0.0271 0.0440 0.0330 0.0309 0.0161 0.0088 0.0463 0.0269 0.0215 0.0202 0.0176 0.0161 0.0306 0.0392 0.0317 0.0281 0.0253 0.0234 0.0417 pima yeast credit internet-ads page-blocks ionosphere Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σindep is a distribution-independent standard deviation obtained using our method (Theorem 2). σA is given by Eq. (2) with the values of A, Pxxy , and Pxyy derived from data. σdep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σmax is defined by Eq. (3). All results were obtained on a randomly selected test set of size m + n. various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter confidence intervals (Fig. 1 (a)). It also leads to tighter confidence intervals for AUC values more than .75 for the uneven distribution m = 400 and n = 200 (Fig. 1 (b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold. A different comparison was made using several datasets available from the UC Irvine repository (Table 1). The table shows that our estimates of the standard deviations (σindep ) are in general close to or tighter than the distribution-dependent standard deviation σdep of Hanley [10]. This is despite we do not make any assumption about the distributions of positive and negative examples. In contrast, Hanley’s method is based on specific assumptions about these distributions. Plots of the actual ranking distribution demonstrate that these assumptions are often violated however. Thus, the relatively good performance of Hanley’s approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight confidence intervals, in some cases tighter than those derived from σA , in particular because it exploits the information provided by the error rate. Our analysis can also be used to determine if the AUC values produced by two classifiers are statistically significant by checking if the AUC value of one falls within the confidence interval of the other. 8 Conclusion We presented principled techniques for computing useful confidence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these confidence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a fixed k, which can be of interest in other analyses related to the AUC. The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classifier ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrarily to some belief, is already captured to a certain degree by the definition of the Wilcoxon-Mann-Whitney statistic which penalizes more errors at the top or the bottom of the ranking. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-specific weight in the ranking measure, a weighted WilcoxonMann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a mea- sure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difficult to adapt our results to the distribution-dependent case and compare them to those of [10]. Acknowledgments We thank Rob Schapire for pointing out to us the problem of the statistical significance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper. References [1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975. [2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957. [3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001. [4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press. [5] D. Van Dantzig. On the Consistency and Power of Wilcoxon’s Two Sample Test. In Koninklijke Nederlandse Akademie van Weterschappen, Series A, volume 54, 1915. [6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975. [7] C. Ferri, P. Flach, and J. Hern´ ndez-Orallo. Learning Decision Trees Using the Area a Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002. [8] Yoav Freund and Robert E. Schapire. A Decision Theoretical Generalization of OnLine Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995. [9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Addison-Wesley, Reading, Massachusetts, 1994. [10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982. [11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975. [12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classifier Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002. [13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003. [14] F. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997. [15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efficient Comparison of 2-Class Models. Master’s thesis, Tel-Aviv University, 1999. [16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.
3 0.1167357 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth
Abstract: The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an -accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an -accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes. 1
4 0.11046536 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
5 0.10747682 54 nips-2004-Distributed Information Regularization on Graphs
Author: Adrian Corduneanu, Tommi S. Jaakkola
Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1
6 0.095849283 19 nips-2004-An Application of Boosting to Graph Classification
7 0.088823795 177 nips-2004-Supervised Graph Inference
8 0.077790499 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
9 0.075646907 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
10 0.074058004 141 nips-2004-Optimal sub-graphical models
11 0.068556763 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
12 0.067429975 145 nips-2004-Parametric Embedding for Class Visualization
13 0.066281267 103 nips-2004-Limits of Spectral Clustering
14 0.063492455 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
15 0.060606625 168 nips-2004-Semigroup Kernels on Finite Sets
16 0.0593316 136 nips-2004-On Semi-Supervised Classification
17 0.059233021 57 nips-2004-Economic Properties of Social Networks
18 0.05848052 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
19 0.055864505 164 nips-2004-Semi-supervised Learning by Entropy Minimization
20 0.05314656 30 nips-2004-Binet-Cauchy Kernels
topicId topicWeight
[(0, -0.164), (1, 0.081), (2, -0.034), (3, 0.054), (4, -0.017), (5, 0.046), (6, -0.021), (7, 0.102), (8, -0.006), (9, -0.181), (10, -0.166), (11, 0.0), (12, 0.05), (13, 0.046), (14, -0.042), (15, 0.195), (16, 0.014), (17, -0.097), (18, 0.074), (19, 0.055), (20, -0.015), (21, -0.065), (22, 0.06), (23, 0.022), (24, -0.136), (25, -0.008), (26, -0.031), (27, -0.087), (28, -0.022), (29, -0.038), (30, -0.018), (31, 0.005), (32, 0.023), (33, 0.008), (34, 0.006), (35, -0.033), (36, 0.145), (37, -0.08), (38, 0.105), (39, -0.05), (40, -0.035), (41, -0.011), (42, -0.127), (43, 0.096), (44, -0.099), (45, 0.15), (46, -0.073), (47, 0.068), (48, -0.004), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94842309 165 nips-2004-Semi-supervised Learning on Directed Graphs
Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf
Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1
2 0.59156287 141 nips-2004-Optimal sub-graphical models
Author: Mukund Narasimhan, Jeff A. Bilmes
Abstract: We investigate the problem of reducing the complexity of a graphical model (G, PG ) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H ∈ H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k − 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries The complexity of inference in graphical models is typically exponential in some parameter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability distribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graphical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph. 1.1 Notation and Terminology A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S ⊆ V such that {a, b} ∈ E for all ∗ This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant. {b, c, d} d {c, f, g} {b, c} {b, e, c} b c {f, c} {c, e} {e, c, f } g {b, e} a e f {a, b, e} Figure 1: A triangulated graph G and a junction-tree for G a, b ∈ S. A clique S is maximal if S is not properly contained in another clique. If α and β are non-adjacent vertices of G then a set of vertices S ⊆ V \ {α, β} is called an (α, β)-separator if α and β are in distinct components of G[V \ S]. S is a minimal (α, β)-separator if no proper subset of S is an (α, β)-separator. S is said to be a minimal separator if S is a minimal (α, β)-separator for some non adjacent a, b ∈ V . If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximalcliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d} , {a, b, e} , {b, e, c} , {e, c, f } , {c, f, g}}. The links S of T correspond to minimal-separators of G in the following way. If Vi Vj ∈ S (where Vi , Vj ∈ K and hence are cliques of G), then Vi ∩ Vj = φ. We label each edge Vi Vj ∈ S with the set Vij = Vi ∩ Vj , which is a non-empty complete separator in G. The removal of any link Vi Vj ∈ S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi ). We will let K(i) be the nodes of T (i) , and V (i) = ∪V ∈K (i) V be the set of vertices corresponding to the subtree T (i) . The junction tree property ensures that V (i) ∩ V (j) = Vi ∩ Vj = Vij . We will let G(i) be the subgraph induced by V (i) . A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1 , X2 , . . . , Xn , and G is a graph with vertex set V (G) = {X1 , X2 , . . . , Xn } such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P . The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG ) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG ) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model. 2 Graph Decompositions and Divide-and-Conquer Algorithms For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if Vi Vj ∈ S, then the removal {b, c, d} d {b, c} {b, e, c} b c c {c, f, g} {f, c} {e, c, f } g {b, e} a e e f {a, b, e} Figure 2: The graphs G(i) , G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e} of Vij = Vi ∩ Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting subgraphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum independent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges, or a bound on the tree-width and which optimize separable objective functions (described in Section 2) One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this : If Vij is a separator decomposing G into G(i) and G(j) , then a divide-and-conquer approach would suggest that we try and find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G (i) and G(j) need not necessarily match up on Vij , the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated). To rectify the situation, we can do the following. We partition the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph G(j) corresponding to the same class are compatible in the sense that they match up on their intersection namely Vij , and so that by splicing the two subgraphs together, we get a subgraph of G which is acceptable (and in particular is triangulated). Then given optimal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a.k.a dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V |) time as long as we have only a polynomial number of classes. Now, if we have a polynomial number of classes, these classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of to be the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here. If k |Vij | = k, then in any subgraph H of G, H[Vij ] must be one of the 2(2) possible subgraphs k of G[Vij ]. So, if k is sufficiently small (so 2(2) is bounded by some polynomial in |V |), then this procedure results in a polynomial time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant. 2.1 Separable objective functions For cases where exact inference in the graphical model (G, PG ) is intractable, it is natural to try to find a subgraphical model (H, PH ) such that D(PG PH ) is minimized, and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with at d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG PH ). Hence, finding a optimal subgraphical model is equivalent to finding a subgraph H for which D(PG PH ) is minimized. If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G (i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criteria also decomposes along the separator Vij . Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V (i) and V (j) respectively, and PH (i) and PH (j) be the (marginalized) distributions of the distribution PH on V (i) and V (j) where H (i) = H[V (i) ] and H (j) = H[V (j) ], The following result assures us that the KL-divergence also factors according to the separator Vij . Lemma 1. Suppose that (G, PG ) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then D(PG PH ) = D(PG(i) PH (i) ) + D(PG(j) PH (j) ) − D(PG[Vij ] PH[Vij ] ). Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a separator of H. Therefore, PH {Xv }v∈V = PH (i) ({Xv }v∈V (i) )·PH (j) ({Xv }v∈V (j) ) . PH[Vij ] ({Xv }v∈V ) The result ij follows immediately. Therefore, there is hope that we can reduce our our original problem of finding an optimal subgraph H ∈ H as one of finding subgraphs of H (i) ⊆ G(i) and H (j) ⊆ G(j) that are compatible, in the sense that they match up on the overlap Vij , and for which D(PG PH ) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they “separate” in the sense that for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i) ) + r(G(j) ) if G has a separator G(i) ∩ G(j) . Therefore, the representation cost reduction would satisfy r(G) − r(H) = (r(G (i) ) − r(H (i) )) + (r(G(j) ) − r(H (j) )), and so also factors according to the separators. Finally note that any linear combinations of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss for example). In Section 4 we discuss some issues regarding computing this function. 2.2 Decompositions and decomposition trees For the algorithms considered in this paper, we will be mostly interested in the decompositions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche’s dtrees [6] which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and show how it is used to solve a number of optimization problems. abd; ce; gf a; be; cd d; bc; e abe dbc ebc e; cf ; g cef cf g Figure 3: The separator tree corresponding to Figure 1 A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junctiontree T = (K, S) for G. The interior nodes of the decomposition tree R(T ) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G). R(T ) can be recursively constructed from T as follows : If T consists of just one node K, (and hence no edges), then R consists of just one node, which is given the label K as well. If however, T has more than one node, then T must contain at least one link. To begin, let Vi Vj ∈ S be any link in T . Then removal of the link Vi Vj results in two disjoint junctiontrees T (i) and T (j) . We label the root of R by the decomposition (V (i) ; Vij ; V (j) ). The rest of R is recursively built by successively picking links of T (i) and T (j) (decompositions of G(i) and G(j) ) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with the interior nodes is shown inside the nodes. Let M be the set of all nodes of R(T ). For any interior node M induced by the the link Vi Vj ∈ S of T , then we will let M (i) and M (j) represent the left and right children of M , and R(i) and R(j) be the left and right trees below M . 3 3.1 Finding optimal subgraphical models Optimal sub (k − 1)-trees of k-trees Suppose that G is a k-tree. A sub (k − 1)-tree of G is a subgraph H of G that is (k − 1)tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V (i) and V (j) respectively. It is clear that the induced subgraphs H[V (i) ] and H[V (j) ] are subgraphs of G(i) and G(j) and are partial (k − 1)-trees. We will be interested in finding sub (k − 1)-trees of k trees and this problem is trivial by the result of [1] when k = 2. Therefore, we assume that k ≥ 3. The following result characterizes the various possibilities for H[Vij ] in this case. Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corresponding to the link ij of the junction-tree T . In any (k − 1)-tree H ⊆ G either 1. There is a u ∈ S such that u is not connected to vertices in both V (i) \ S and V (j) \ S. Then S \ {u} is a minimal separator in H and hence is complete. 2. Every vertex in S is connected to vertices in both V (i) \S and V (j) \S. Then there are vertices {x, y} ⊆ S such that the edge H[S] is missing only the edge {x, y}. Further either H[V (i) ] or H[V (j) ] does not contain a unchorded x-y path. Proof. We consider two possibilities. In the first, there is some vertex u ∈ S such that u is not connected to vertices in both V (i) \S and V (j) \. Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k − 1)-tree, all minimal separators of H must contain k − 1 vertices which must therefore be S \{u}. This corresponds to case (1) above. Clearly this possiblity can occur. If there is no such u ∈ S, then every vertex in S is connected to vertices in both V (i) \ S and V (j) \ S. If x ∈ S is connected to some yi ∈ V (i) \ S and yj ∈ V (j) \ S, then x is contained in every minimal yi /yj separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k − 1 vertices, there must be at least two distinct minimum separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w ∈ S, with z = w and {z, w} = {x, y} (as sets), then either {z, w} ⊂ Sy or {z, w} ⊂ Sx . Since both Sx and Sy are complete in H, this edge must be present in H. The edge {x, y} is not present in H[S] because all minimal separators in H must be of size k − 1. Further, if both V (i) and V (j) contain an unchorded path between x and y, then by joining the two paths at x and y, we get a unchorded cycle in H which contradicts the fact that H is triangulated. Therefore, we may associate k · 2 + 2 · k constraints with each separator Vij of G as 2 follows. There are k possible constraints corresponding to case (1) above (one for each choice of x), and k · 2 choices corresponding to case (2) above. This is because for each 2 pair {x, y} corresponding to the missing edge, we have either V (i) contains no unchorded xy paths or V (j) contains no unchorded xy paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = { (x, y, s) : x ∈ S, y ∈ S, s ∈ {i, j}}. If y = x, then this corresponds to case (1) of the above lemma. If s = i, then x is connected only to H (i) and if s = j, then x is connected only to H (j) . If y = x, then this corresponds to case (2) in the above lemma. If s = i, then H (i) does not contain any unchorded path between x and y, and there is no constraint on H (j) . Similarly if s = j, then H (j) does not contain any unchorded path between x and y, and there is no constraint on H (i) . Now suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively, then it is clear that if H (i) and H (j) both satisfy the same constraint they must match up on the common vertices Vij . Therefore to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated. Lemma 3. Suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H (i) and H (j) together is triangulated. Proof. Suppose that both H (i) and H (j) are both triangulated and both satisfy the same constraint. If both H (i) and H (j) satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H (i) and H (j) . Therefore, there must be two vertices of S \{u} on the cycle, and hence this cycle has a chord as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t ≥ 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H (i) and H (j) are individually triangulated, the cycle must contain x, y and vertices of both V (i) \ S and V (j) \ S. We may split this unchorded cycle into two unchorded paths, one contained in V (i) and one in V (j) thus violating our assumption that both H (i) and H (j) satisfy the same constraint. If |S| = k, then there are 2k + 2 · k ∈ O(k 2 ) ∈ O(n2 ). We can use a divide and conquer 2 strategy to find the optimal sub (k − 1) tree once we have taken care of the base case, where G is just a single clique (of k + 1) elements. However, for this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge results in a (k − 1) tree, and every sub (k−1)-tree results from this operation. Therefore, the optimal (k−1)-tree can be found using Algorithm 1, and in this case, the complexity of Algorithm 1 is O(n(k + 1) 2 ). This procedure can be generalized to find the optimal sub (k − d)- tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore for small, fixed values of d, we can compute the optimal sub (k − d)-tree of G. While we can compute (k − d)-trees of G by first going from a k tree to a (k − 1) tree, then from a (k − 1)-tree to a (k − 2)-tree, and so on in a greedy fashion, this will not be optimal in general. However, this might be a good algorithm to try when d is large. 3.2 Optimal triangulated subgraphs with |E(G)| − d edges Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges that G does. That is, we want to find an optimal subgraph H ⊂ G such that |E(H)| = |E(G)| − d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (if d < |E(G)|). Two possibilities for finding such an optimal subgraph are 1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each state, the edge is picked from the set of edges whose deletion leaves a triangulated graph. Then the edge which causes the least increase in KL-divergence is picked at each stage. 2. For each possible subset A of E(G) of size d, whose deletion leaves a triangulated graph, compute the KL divergence using the formula above, and then pick the optimal one. Since there are |E(G)| such sets, this can be done in polynomial d time (in |V (G)|) when d is a constant. The first greedy algorithm is not guaranteed to yield the optimal solution. The second takes time that is O(n2d ). Now, let us solve this problem using the framework we’ve described. Let H be the set of subgraphs of G which may be obtained by deletion of d edges. For each M = ij ∈ M corresponding to the separator Vij , let CM = (l, r, c, s, A) : l + r − c = d, s a d bit string, A ∈ E(G[Vij ]) . The constraint reprec sented by (l, r, c, A) is this : A is a set of d edges of G[Vij ] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph. c represents the double count, and so is subtracted from the total. If k is the size of the largest k clique, then the total number of such constraints is bounded by 2d · 2d · (2) ∈ O(k 2d ) d which could be better than O(n2d ) and is polynomial in |V | when d is constant. See [10] for additional details. 4 Conclusions Algorithm 1 will compute the optimal H ∈ H for the two examples discussed above and is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which will allow finding the optimal solution for other classes of subgraphical models. Now, we assume an oracle model for computing KL-divergences of probability distributions on vertex sets of cliques. It is clear that these KL-divergences can be computed R ← separator-tree for G; for each vertex M of R in order of increasing height (bottom up) do for each constraint cM of M do if M is an interior vertex of R corresponding to edge ij of the junction tree then Let Ml and Mr be the left and right children of M ; Pick constraint cl ∈ CMl compatible with cM to minimize table[Ml , cl ]; Pick constraint cr ∈ CMr compatible with cM to minimize table[Mr , cr ]; loss ← D(PG [M ] PH [M ]); table[M, cM ] ← table[Ml , cl ] + table[Mr , cr ] − loss; else table[M, cM ] ← D(PG [M ] PH [M ]); end end end Algorithm 1: Finding optimal set of constraints efficiently for distributions like Gaussians, but for discrete distributions this may not be possible when k is large. However even in this case this algorithm will result in only polynomial calls to the oracle. The standard algorithm [3] which is exponential in the treewidth will make O(2k ) calls to this oracle. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive as it results in exponential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution by using the same algorithm. References [1] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees”, IEEE Transactions on Information Theory, v. 14, 1968, Pages 462–467. [2] F. Gavril, “Algorithms on clique separable graphs”, Discrete Mathematics v. 9 (1977), pp. 159–165. [3] R. E. Tarjan. “Decomposition by Clique Separators”, Discrete Mathematics, v. 55 (1985), pp. 221–232. [4] U. Kjaerulff. “Reduction of computational complexity in Bayesian networks through removal of weak dependencies”, Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, pp. 374–382, 1994. [5] T. Kloks, “Treewidth: Computations and Approximations”, Springer-Verlag, 1994. [6] A. Darwiche and M. Hopkins. “Using recursive decomposition to construct elimination orders, jointrees and dtrees”, Technical Report D-122, Computer Science Dept., UCLA. [7] S. Lauritzen. “Graphical Models”, Oxford University Press, Oxford, 1996. [8] T. A. McKee and F. R. McMorris. “Topics in Intersection Graph Theory”, SIAM Monographs on Discrete Mathematics and Applications, 1999. [9] D. Karger and N. Srebro. “Learning Markov networks: Maximum bounded tree-width graphs.” In Symposium on Discrete Algorithms, 2001, Pages 391-401. [10] M. Narasimhan and J. Bilmes. “Optimization on separator-clique trees.”, Technical report UWEETR 2004-10, June 2004.
3 0.56932837 57 nips-2004-Economic Properties of Social Networks
Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri
Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1
4 0.54143161 177 nips-2004-Supervised Graph Inference
Author: Jean-philippe Vert, Yoshihiro Yamanishi
Abstract: We formulate the problem of graph inference where part of the graph is known as a supervised learning problem, and propose an algorithm to solve it. The method involves the learning of a mapping of the vertices to a Euclidean space where the graph is easy to infer, and can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of metabolic network reconstruction from genomic data. 1
5 0.51451415 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
Author: Corinna Cortes, Mehryar Mohri
Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classifier is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classification system is essential. The criterion widely used to measure the ranking quality of a classification algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of confidence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the confidence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar confidence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of confidence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight confidence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose confidence intervals, or strong distribution-dependent assumptions leading to tight but unsafe confidence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classification tasks demonstrating the benefits of our analysis. Our results are also useful for testing the statistical significance of the difference of the AUC values of two classifiers. The paper is organized as follows. We first introduce the definition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and briefly review some essential aspects of the existing literature related to the computation of confidence intervals for the AUC. Our computation of the expected value and variance of the AUC for a fixed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. Section 5 gives the reduced expressions for the expected value and variance of the AUC for a fixed error rate. These can be efficiently computed and used to determine our confidence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Definition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. The AUC is defined as the area under the ROC curve. Consider a binary classification task with m positive examples and n negative examples. Let C be a fixed classifier that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classification. It can be viewed as a measure based on pairwise comparisons between classifications of the two classes. It is an estimate of the probability Pxy that the classifier ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section briefly describes some previous distribution-dependent approaches presented in the statistical literature to derive confidence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a fixed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classifier ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . The resulting confidence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can fix m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose confidence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we find it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justified empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a fixed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the confidence intervals. For all q ≥ 0, let Xq (k, m, n) be defined by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a fixed error rate k/N assuming that all classifications or rankings with k errors are equiprobable. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classifications with exactly k errors. In previous work, we gave the exact expression of the expectation of the AUC for a fixed number of errors k: Proposition 2 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by: E[A] = 1 − k (n − m)2 (m + n + 1) − m+n 4mn k − m+n k−1 m+n x=0 x k m+n+1 x=0 x . Note that the two sums in this expression cannot be further simplified since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F defined for all Y by: F (Y ) = k M M x=0 x x k M M x=0 x x Y . (6) The following proposition reproduces that result: Proposition 3 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by: σ 2 (A) = F ((1 − k−x x n+ m )2 ) 2 2 2 F ( mx +n(k−x) +(m(m+1)x+n(n+1)(k−x))−2x(k−x)(m+n+1) ). 12m2 n2 − F ((1 − k−x x n+ m 2 ))2 + Because of the products of binomial terms, the computation of the variance using this expression is inefficient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section which leads to significantly more efficient computations that we have been using in all our experiments. Corollary 1 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with ((m+n−2)Z4 −(2m−n+3k−10)Z3 ) k errors is given by: σ 2 (A) = (m+n+1)(m+n)(m+n−1)T72m2 n2 + (m+n+1)(m+n)T (m2 −nm+3km−5m+2k2 −nk+12−9k)Z2 48m2 n2 (m+n+1)Q1 Z1 kQ0 + 144m2 n2 with: 2 n2 72m Pk−i m+n+1−i Zi = x=0 Pk ( x=0 x (m+n+1) x ) − 2 (m+n+1)2 (m−n)4 Z1 16m2 n2 − , T = 3((m − n)2 + m + n) + 2, and: Q0 = (m + n + 1)T k2 + ((−3n2 + 3mn + 3m + 1)T − 12(3mn + m + n) − 8)k + (−3m2 + 7m + 10n + 3nm + 10)T − 4(3mn + m + n + 1) Q1 = T k3 + 3(m − 1)T k2 + ((−3n2 + 3mn − 3m + 8)T − 6(6mn + m + n))k + (−3m2 + 7(m + n) + 3mn)T − 2(6mn + m + n) Proof. The expression of the variance given in Proposition 3 requires the computation of Xq (k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X0 and X1 and Proposition 1, which provides the expression of X2 , σ 2 (A) can be reduced to the expression given by the corollary. 6 Theory and Analysis Our estimate of the confidence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a confidence interval E = [e1 , e2 ] is given for the error rate of a classifier C over a sample S, with the confidence level 1 − . This interval may have have been derived from a binomial model of C, which is a standard assumption for determining a confidence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassifications, we can use the expectation and variance computed in the previous section and Chebyshev’s inequality to predict a confidence interval Ae for the AUC at the confidence level 1 − . Since our equiprobable model for the classifications is independent of the model used to compute the interval of confidence for the error rate, we can use E and Ae , e ∈ E, to compute a confidence interval of the AUC at the level (1 − )(1 − ). Theorem 1 Let C be a binary classifier and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e1 , e2 ] be a confidence interval for the error rate of C over S at the confidence level 1 − . Then, for any , 0 ≤ ≤ 1, we can compute a confidence interval for the AUC value of the classifier C at the confidence level (1 − )(1 − ) that depends only on , , m, n, and the interval E. Proof. Let k1 = N e1 and k2 = N e2 be the number of errors associated to the error rates e1 and e2 , and let IK be the interval IK = [k1 , k2 ]. For a fixed k ∈ IK , by Propositions 2 and Corollary 1, we can compute the exact value of the expectation E[Ak ] and variance σ 2 (Ak ) of the AUC Ak . Using Chebyshev’s inequality, for any k ∈ IK and any k > 0, σ(Ak ) P |Ak − E[Ak ]| ≥ √ k ≤ (7) k where E[Ak ] and σ(Ak ) are the expressions given in Propositions 2 and Corollary 1, which depend only on k, m, and n. Let α1 and α2 be defined by: α1 = min k∈IK σ(Ak ) E[Ak ] − √ σ(Ak ) α2 = max E[Ak ] + √ k∈IK k (8) k α1 and α2 only depend on IK (i.e., on e1 and e2 ), and on k, m, and n. Let IA be the confidence interval defined by IA = [α1 , α2 ] and let k = for all k ∈ IK . Using the fact that the confidence interval E is independent of our equiprobability model for fixed-k AUC values and the Bayes’ rule: P(A ∈ IA ) = k∈R+ ≥ k∈IK P (A ∈ IA | K = k)P (K = k) (9) P (A ∈ IA | K = k)P (K = k) (10) ≥ (1 − ) k∈IK P (K = k) ≥ (1 − )(1 − ) (11) where we used the property of Eq. 7 and the definitions of the intervals IK and IA . Thus, IA constitutes a confidence interval for the AUC value of C at the confidence level (1 − )(1 − ). In practice, the confidence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem. .020 .035 Standard deviation Standard deviation .030 .015 .010 Max Distribution−dependent Distribution−independent .005 .025 .020 .015 Max Distribution−dependent Distribution−independent .010 .005 0.75 0.80 0.85 0.90 0.95 1.00 0.6 0.7 0.8 AUC (a) 0.9 1.0 AUC (b) Figure 1: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10]. Theorem 2 Let C be a binary classifier, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k0 be the number of misclassifications of C on S. Assume that C follows a binomial law, then, for any , 0 ≤ ≤ 1, we can compute a confidence interval of the AUC value of the classifier C at the confidence level 1 − that depends only on , k0 , m, and n. Proof. Assume that C follows a binomial law with coefficient p. Then, Chebyshev’s inequality yields: 1 p(1 − p) ≤ (12) P(|C − E[C]| ≥ η) ≤ 2 Nη 4N η 2 1 1 Thus, E = [ k0 − √ √ , k0 + √ √ ] forms a confidence interval for the N 2 (1− 1− )N N 2 √ (1− 1− )N error rate of C at the confidence level 1 − . By Theorem 1, we can√ compute for the √ AUC value a confidence interval at the level (1 − (1 − 1 − ))(1 − (1 − 1 − )) = 1 − depending only on , m, n, and the interval E, i.e., k0 , N = m + n, and . For large N , we can use the normal approximation of the binomial law to determine a finer interval E. Indeed, for large N , √ (13) P(|C − E[C]| ≥ η) ≤ 2Φ(2 N η) with Φ(u) = ∞ e−x2 /2 √ dx. u 2π Thus, E = [ k0 − N √ Φ−1 ( 1− 21− ) k0 √ ,N 2 N √ confidence interval for the error rate at the confidence level √ + Φ−1 ( 1− 21− ) √ ] 2 N is the 1− . For simplicity, in the proof of Theorem 2, k was chosen to be a constant ( k = ) but, in general, it can be another function of k leading to tighter confidence intervals. The results presented in the next section were obtained with k = a0 exp((k − k0 )2 /2a2 ), where a0 1 and a1 are constants selected so that the inequality 11 be verified. 7 Experiments and Comparisons The analysis in the previous section provides a principled method for computing a confidence interval of the AUC value of a classier C at the confidence level 1 − that depends only on k, n and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow confidence intervals based on questionable assumptions on the probability functions Px and Py [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for NAME m+n n m+n AUC k m+n σindep σA σdep σmax 368 700 303 1159 2473 201 0.63 0.67 0.54 0.17 0.10 0.37 0.70 0.63 0.87 0.85 0.84 0.85 0.24 0.26 0.13 0.05 0.03 0.13 0.0297 0.0277 0.0176 0.0177 0.0164 0.0271 0.0440 0.0330 0.0309 0.0161 0.0088 0.0463 0.0269 0.0215 0.0202 0.0176 0.0161 0.0306 0.0392 0.0317 0.0281 0.0253 0.0234 0.0417 pima yeast credit internet-ads page-blocks ionosphere Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σindep is a distribution-independent standard deviation obtained using our method (Theorem 2). σA is given by Eq. (2) with the values of A, Pxxy , and Pxyy derived from data. σdep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σmax is defined by Eq. (3). All results were obtained on a randomly selected test set of size m + n. various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter confidence intervals (Fig. 1 (a)). It also leads to tighter confidence intervals for AUC values more than .75 for the uneven distribution m = 400 and n = 200 (Fig. 1 (b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold. A different comparison was made using several datasets available from the UC Irvine repository (Table 1). The table shows that our estimates of the standard deviations (σindep ) are in general close to or tighter than the distribution-dependent standard deviation σdep of Hanley [10]. This is despite we do not make any assumption about the distributions of positive and negative examples. In contrast, Hanley’s method is based on specific assumptions about these distributions. Plots of the actual ranking distribution demonstrate that these assumptions are often violated however. Thus, the relatively good performance of Hanley’s approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight confidence intervals, in some cases tighter than those derived from σA , in particular because it exploits the information provided by the error rate. Our analysis can also be used to determine if the AUC values produced by two classifiers are statistically significant by checking if the AUC value of one falls within the confidence interval of the other. 8 Conclusion We presented principled techniques for computing useful confidence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these confidence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a fixed k, which can be of interest in other analyses related to the AUC. The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classifier ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrarily to some belief, is already captured to a certain degree by the definition of the Wilcoxon-Mann-Whitney statistic which penalizes more errors at the top or the bottom of the ranking. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-specific weight in the ranking measure, a weighted WilcoxonMann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a mea- sure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difficult to adapt our results to the distribution-dependent case and compare them to those of [10]. Acknowledgments We thank Rob Schapire for pointing out to us the problem of the statistical significance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper. References [1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975. [2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957. [3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001. [4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press. [5] D. Van Dantzig. On the Consistency and Power of Wilcoxon’s Two Sample Test. In Koninklijke Nederlandse Akademie van Weterschappen, Series A, volume 54, 1915. [6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975. [7] C. Ferri, P. Flach, and J. Hern´ ndez-Orallo. Learning Decision Trees Using the Area a Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002. [8] Yoav Freund and Robert E. Schapire. A Decision Theoretical Generalization of OnLine Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995. [9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Addison-Wesley, Reading, Massachusetts, 1994. [10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982. [11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975. [12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classifier Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002. [13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003. [14] F. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997. [15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efficient Comparison of 2-Class Models. Master’s thesis, Tel-Aviv University, 1999. [16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.
6 0.50049937 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
7 0.45633224 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
8 0.44804686 54 nips-2004-Distributed Information Regularization on Graphs
9 0.43965581 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
10 0.40915418 19 nips-2004-An Application of Boosting to Graph Classification
11 0.40369469 30 nips-2004-Binet-Cauchy Kernels
12 0.38710701 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
13 0.38582203 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
14 0.34649622 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
15 0.34005174 168 nips-2004-Semigroup Kernels on Finite Sets
16 0.31978804 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
17 0.31290114 130 nips-2004-Newscast EM
18 0.29747468 100 nips-2004-Learning Preferences for Multiclass Problems
19 0.295349 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
20 0.29523784 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
topicId topicWeight
[(13, 0.075), (15, 0.133), (16, 0.329), (25, 0.01), (26, 0.043), (31, 0.024), (33, 0.14), (35, 0.02), (39, 0.013), (50, 0.068), (76, 0.013), (77, 0.012), (87, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.77980375 165 nips-2004-Semi-supervised Learning on Directed Graphs
Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf
Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1
2 0.70905077 148 nips-2004-Probabilistic Computation in Spiking Populations
Author: Richard S. Zemel, Rama Natarajan, Peter Dayan, Quentin J. Huys
Abstract: As animals interact with their environments, they must constantly update estimates about their states. Bayesian models combine prior probabilities, a dynamical model and sensory evidence to update estimates optimally. These models are consistent with the results of many diverse psychophysical studies. However, little is known about the neural representation and manipulation of such Bayesian information, particularly in populations of spiking neurons. We consider this issue, suggesting a model based on standard neural architecture and activations. We illustrate the approach on a simple random walk example, and apply it to a sensorimotor integration task that provides a particularly compelling example of dynamic probabilistic computation. Bayesian models have been used to explain a gamut of experimental results in tasks which require estimates to be derived from multiple sensory cues. These include a wide range of psychophysical studies of perception;13 motor action;7 and decision-making.3, 5 Central to Bayesian inference is that computations are sensitive to uncertainties about afferent and efferent quantities, arising from ignorance, noise, or inherent ambiguity (e.g., the aperture problem), and that these uncertainties change over time as information accumulates and dissipates. Understanding how neurons represent and manipulate uncertain quantities is therefore key to understanding the neural instantiation of these Bayesian inferences. Most previous work on representing probabilistic inference in neural populations has focused on the representation of static information.1, 12, 15 These encompass various strategies for encoding and decoding uncertain quantities, but do not readily generalize to real-world dynamic information processing tasks, particularly the most interesting cases with stimuli changing over the same timescale as spiking itself.11 Notable exceptions are the recent, seminal, but, as we argue, representationally restricted, models proposed by Gold and Shadlen,5 Rao,10 and Deneve.4 In this paper, we first show how probabilistic information varying over time can be represented in a spiking population code. Second, we present a method for producing spiking codes that facilitate further processing of the probabilistic information. Finally, we show the utility of this method by applying it to a temporal sensorimotor integration task. 1 TRAJECTORY ENCODING AND DECODING We assume that population spikes R(t) arise stochastically in relation to the trajectory X(t) of an underlying (but hidden) variable. We use RT and XT for the whole trajectory and spike trains respectively from time 0 to T . The spikes RT constitute the observations and are assumed to be probabilistically related to the signal by a tuning function f (X, θ i ): P (R(i, T )|X(T )) ∝ f (X, θi ) (1) for the spike train of the ith neuron, with parameters θi . Therefore, via standard Bayesian inference, RT determines a distribution over the hidden variable at time T , P (X(T )|RT ). We first consider a version of the dynamics and input coding that permits an analytical examination of the impact of spikes. Let X(t) follow a stationary Gaussian process such that the joint distribution P (X(t1 ), X(t2 ), . . . , X(tm )) is Gaussian for any finite collection of times, with a covariance matrix which depends on time differences: Ctt = c(|t − t |). Function c(|∆t|) controls the smoothness of the resulting random walks. Then, P (X(T )|RT ) ∝ p(X(T )) X(T ) dX(T )P (RT |X(T ))P (X(T )|X(T )) (2) where P (X(T )|X(T )) is the distribution over the whole trajectory X(T ) conditional on the value of X(T ) at its end point. If RT are a set of conditionally independent inhomogeneous Poisson processes, we have P (RT |X(T )) ∝ iτ f (X(tiτ ), θi ) exp − i τ dτ f (X(τ ), θi ) , (3) where tiτ ∀τ are the spike times τ of neuron i in RT . Let χ = [X(tiτ )] be the vector of stimulus positions at the times at which we observed a spike and Θ = [θ(tiτ )] be the vector of spike positions. If the tuning functions are Gaussian f (X, θi ) ∝ exp(−(X − θi )2 /2σ 2 ) and sufficiently dense that i τ dτ f (X, θi ) is independent of X (a standard assumption in population coding), then P (RT |X(T )) ∝ exp(− χ − Θ 2 /2σ 2 ) and in Equation 2, we can marginalize out X(T ) except at the spike times tiτ : P (X(T )|RT ) ∝ p(X(T )) −1 χ dχ exp −[χ, X(T )]T C 2 [χ, X(T )] − χ−Θ 2σ 2 2 (4) and C is the block covariance matrix between X(tiτ ), x(T ) at the spike times [ttτ ] and the final time T . This Gaussian integral has P (X(T )|RT ) ∼ N (µ(T ), ν(T )), with µ(T ) = CT t (Ctt + Iσ 2 )−1 Θ = kΘ ν(T ) = CT T − kCtT (5) CT T is the T, T th element of the covariance matrix and CT t is similarly a row vector. The dependence in µ on past spike times is specified chiefly by the inverse covariance matrix, and acts as an effective kernel (k). This kernel is not stationary, since it depends on factors such as the local density of spiking in the spike train RT . For example, consider where X(t) evolves according to a diffusion process with drift: dX = −αXdt + σ dN (t) (6) where α prevents it from wandering too far, N (t) is white Gaussian noise with mean zero and σ 2 variance. Figure 1A shows sample kernels for this process. Inspection of Figure 1A reveals some important traits. First, the monotonically decreasing kernel magnitude as the time span between the spike and the current time T grows matches the intuition that recent spikes play a more significant role in determining the posterior over X(T ). Second, the kernel is nearly exponential, with a time constant that depends on the time constant of the covariance function and the density of the spikes; two settings of these parameters produced the two groupings of kernels in the figure. Finally, the fully adaptive kernel k can be locally well approximated by a metronomic kernel k (shown in red in Figure 1A) that assumes regular spiking. This takes advantage of the general fact, indicated by the grouping of kernels, that the kernel depends weakly on the actual spike pattern, but strongly on the average rate. The merits of the metronomic kernel are that it is stationary and only depends on a single mean rate rather than the full spike train RT . It also justifies s Kernels k and k −0.5 C 5 0 0.03 0.06 0.09 0.04 0.06 0.08 t−t Time spike True stimulus and means D Full kernel E Regular, stationary kernel −0.5 0 −0.5 0.03 0.04 0.05 0.06 0.07 Time 0.08 0.09 0 0.5 0.1 Space 0 Space −4 10 Space Variance ratio 10 −2 10 0.5 B ν2 / σ2 Kernel size (weight) A 0.1 0 0.5 0.03 0.04 0.05 0.06 0.07 Time 0.08 0.09 0.1 Figure 1: Exact and approximate spike decoding with the Gaussian process prior. Spikes are shown in yellow, the true stimulus in green, and P (X(T )|RT ) in gray. Blue: exact inference with nonstationary and red: approximate inference with regular spiking. A Kernel samples for a diffusion process as defined by equations 5, 6. B, C: Mean and variance of the inference. D: Exact inference with full kernel k and E: approximation based on metronomic kernel k
3 0.54996729 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
4 0.54136032 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
5 0.541224 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
6 0.53980106 178 nips-2004-Support Vector Classification with Input Data Uncertainty
7 0.5393796 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.53868824 168 nips-2004-Semigroup Kernels on Finite Sets
9 0.53856617 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
10 0.5381946 69 nips-2004-Fast Rates to Bayes for Kernel Machines
11 0.53765082 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
12 0.53752655 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
13 0.53682339 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
14 0.53552067 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
15 0.53469777 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
16 0.53459525 29 nips-2004-Beat Tracking the Graphical Model Way
17 0.53425688 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
18 0.53409678 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
19 0.53383011 17 nips-2004-Adaptive Manifold Learning
20 0.53360695 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill