nips nips2003 nips2003-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. [sent-13, score-0.68]
2 We present a general framework for batch learning of label ranking functions from supervised data. [sent-14, score-0.735]
3 This enables us to accommodate a variety of ranking problems. [sent-16, score-0.559]
4 In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. [sent-17, score-0.625]
5 We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. [sent-19, score-0.824]
6 That is, a label deemed relevant to an instance should be ranked higher than a label which is considered less relevant. [sent-23, score-0.548]
7 With each training instance we receive supervision given as a set of preferences over the labels. [sent-24, score-0.282]
8 Concretely, the supervision we receive with each instance is given in the form of a preference graph: a simple directed graph for which the labels are the graph vertices. [sent-25, score-1.102]
9 A directed edge from a label y to another label y′ denotes that, according to the supervision, y is more relevant to the instance than y′. [sent-26, score-0.601]
10 We do not impose any further constraints on the structure of the preference graph. [sent-27, score-0.268]
11 The simplest setting is multiclass categorization in which each instance is associated with a single label out of k possible labels. [sent-29, score-0.392]
12 Using the graph representation for multiclass problems, the preference graph induced by the supervision has k vertices and k − 1 edges. [sent-32, score-1.001]
13 A directed edge points from the (single) relevant label to each of the k − 1 irrelevant labels (Fig. 1a). [sent-33, score-0.509]
14 An interesting and practical generalization of multiclass problems is multilabel problems [10, 6, 4], in which a set of relevant labels (rather than a single label) is associated with each instance. [sent-35, score-0.449]
15 In this case the supervision is represented by a directed bipartite [sent-36, score-0.788]
19 graph where the relevant labels constitute one side of the graph and the irrelevant labels the other side, and there is a directed edge from each relevant label to each irrelevant label. [sent-42, score-1.134]
Figure 1: The supervision provided to the algorithm associates every training instance with a preference graph.
16 (b) multiclass multilabel categorization where {1, 2} is the set of correct labels. [sent-39, score-0.326]
17 (c) A multi-layer graph that encodes three levels of label “goodness”, useful for instance in hierarchical multiclass settings. [sent-40, score-0.541]
18 (d) a general (possibly cyclic) preference graph with no predefined structure. [sent-41, score-0.479]
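The bipartite construction just described translates into a few lines of code. The following is an illustrative sketch (function and variable names are ours, not the paper's), representing a preference graph as a list of directed (init, term) edges:

```python
def bipartite_preference_graph(relevant, all_labels):
    """Build the multilabel supervision graph: a directed edge from every
    relevant label to every irrelevant label. Multiclass categorization is
    the special case where `relevant` holds a single label (k - 1 edges)."""
    irrelevant = [y for y in all_labels if y not in relevant]
    return [(r, s) for r in relevant for s in irrelevant]

labels = [1, 2, 3, 4, 5]
g = bipartite_preference_graph({1, 2}, labels)       # the Fig. 1(b) setting
assert len(g) == 2 * 3                               # |Yi| * |Y \ Yi| edges
assert bipartite_preference_graph({1}, labels) == [(1, 2), (1, 3), (1, 4), (1, 5)]
```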
20 For concreteness, we use the term label ranking for all of these problems. [sent-50, score-0.735]
21 Our learning framework decomposes each preference graph into subgraphs, where the graph decomposition procedure may take a general form and can change as a function of the instances. [sent-51, score-0.828]
22 Ranking algorithms, especially in multilabel categorization problems, often reduce the ranking task into multiple binary decision problems by enumerating over all pairs of labels [7, 6, 4]. [sent-52, score-0.922]
23 Such a reduction can easily be accommodated within our framework by decomposing the preference graph into elementary subgraphs, each consisting of a single edge. [sent-53, score-0.479]
24 Another approach is to compare a highly preferred label (such as the correct or best parse of a sentence) with less preferred labels. [sent-54, score-0.364]
25 Such approaches can be analyzed within our framework by defining a graph decomposition procedure that generates a subgraph for each relevant label and the neighboring labels that it is preferred over. [sent-55, score-0.786]
26 Returning to multilabel settings, this decomposition amounts to a loss that counts the number of relevant labels which are wrongly ranked below irrelevant ones. [sent-56, score-0.684]
27 Our framework employing graph decomposition can also be used in other settings such as element ranking via projections [3, 11]. [sent-59, score-0.915]
28 Furthermore, settings in which a semi-metric is defined over the label-set can also be reduced to the problem of label ranking, such as the parse ordering case mentioned above or when the labels are arranged in a hierarchical structure. [sent-60, score-0.409]
29 We employ such a reduction in the category ranking experiments described in Sec. 4. [sent-61, score-0.654]
30 In Sec. 3 we present an algorithm for learning label ranking functions. [sent-66, score-0.735]
31 A label ranking for an instance x ∈ X is a total order over Y, where y ≻ y′ implies that y is preferred over y′ as a label for x. [sent-71, score-1.038]
32 A label ranking function f : X × Y → R induces a label ranking for x ∈ X by y ≻ y′ ⇐⇒ f(x, y) > f(x, y′). [sent-72, score-1.536]
33 Overloading our notation, we denote the label ranking induced by f for x by f (x). [sent-73, score-0.757]
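As a concrete illustration of how f induces a total order: sort the labels by f(x, ·) in decreasing order. The scorer below is a toy stand-in, not the paper's learned function:

```python
def induced_ranking(f, x, labels):
    """A ranking function f(x, y) -> real induces a total order on the
    labels: y precedes y' whenever f(x, y) > f(x, y')."""
    return sorted(labels, key=lambda y: f(x, y), reverse=True)

# Toy scorer (illustrative only): score label y for instance x by -|x - y|.
f = lambda x, y: -abs(x - y)
# Keys for x = 2 are 1 -> -1, 2 -> 0, 3 -> -1, 4 -> -2; Python's stable
# sort breaks the 1/3 tie by original order.
assert induced_ranking(f, 2, [1, 2, 3, 4]) == [2, 1, 3, 4]
```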
34 We are also provided with a training set S = {(xi, Gi)}_{i=1}^{m}, where every example is comprised of an instance xi ∈ X and a preference graph Gi. [sent-78, score-0.768]
35 As defined in the previous section, a preference graph is a directed graph G = (V, E), for which the set of vertices V is defined to be the set of labels Y and E is some finite set of directed edges. [sent-79, score-1.019]
36 Every edge e ∈ E in a directed graph is associated with an initial vertex, init(e) ∈ V, and a terminal vertex, term(e) ∈ V. [sent-80, score-0.341]
37 The existence of a directed edge between two labels in a preference graph indicates that init(e) is preferred over term(e) and should be ranked higher. [sent-81, score-0.857]
38 We require preference graphs to be simple, namely to have no more than a single edge between any pair of vertices and to not contain any self-loops. [sent-82, score-0.457]
39 However, we impose no additional constraints on the supervision, namely, the set of edges in a preference graph may be sparse and may even include cycles. [sent-83, score-0.554]
40 Informally, our goal is for the label ranking induced by f to be as consistent as possible with all of the preference graphs given in S. [sent-86, score-1.06]
41 We say that f(xi) disagrees with a preference graph Gi = (Vi, Ei) if there exists an edge e ∈ Ei for which f(xi, init(e)) ≤ f(xi, term(e)). [sent-87, score-0.774]
42 A simple measure of empirical ranking accuracy immediately follows from the definition of δ: we define the 0−1 error attained by a ranking function f on a training set S to be the number of training examples for which f(xi) disagrees with Gi, namely, ε0−1(f, S) = Σ_{i=1}^{m} δ(f(xi), Gi). [sent-91, score-1.404]
43 The 0−1 error may be natural for certain ranking problems; however, in general it is a rather crude measure of ranking inaccuracy, as it is invariant to the exact number of edges in Gi with which f(xi) disagrees. [sent-92, score-1.253]
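The 0−1 error can be sketched as follows, treating each training example as an instance paired with its preference-graph edge list (all names here are illustrative, not from the paper):

```python
def delta(f, x, edges):
    """delta = 1 if the ranking induced by f(x, .) disagrees with the
    preference graph, i.e. some edge (init, term) has
    f(x, init) <= f(x, term); otherwise 0."""
    return int(any(f(x, i) <= f(x, t) for i, t in edges))

def zero_one_error(f, S):
    """epsilon_{0-1}(f, S): number of training examples whose preference
    graph is violated at least once."""
    return sum(delta(f, x, edges) for x, edges in S)

f = lambda x, y: -abs(x - y)                 # illustrative scorer
S = [(1, [(1, 2), (1, 3)]),                  # satisfied: f(1,1) beats both
     (1, [(3, 2)])]                          # violated: f(1,3) <= f(1,2)
assert zero_one_error(f, S) == 1
```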
44 Many ranking problems require a more refined notion of ranking accuracy. [sent-93, score-1.147]
45 Thus, we define the disagreement error attained by f (xi ) with respect to Gi to be the fraction of edges in Ei with which f (xi ) disagrees. [sent-94, score-0.425]
46 The disagreement error attained on the entire training set is the sum of disagreement errors over all training examples. [sent-95, score-0.551]
47 Formally, we define the disagreement error attained on S as εdis(f, S) = Σ_{i=1}^{m} (1/|Ei|) · |{e ∈ Ei : f(xi, init(e)) ≤ f(xi, term(e))}|. [sent-96, score-0.326]
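A sketch of the disagreement error, representing each preference graph as a list of (init, term) edges (that representation is our own convention for this illustration):

```python
def disagreement_error(f, S):
    """epsilon_dis(f, S): for each example, the fraction of edges in its
    preference graph with which the induced ranking disagrees, summed
    over the training set."""
    total = 0.0
    for x, edges in S:
        bad = sum(1 for i, t in edges if f(x, i) <= f(x, t))
        total += bad / len(edges)
    return total

f = lambda x, y: -abs(x - y)                 # illustrative scorer
S = [(1, [(1, 2), (1, 3), (3, 2)])]          # one of three edges violated
assert abs(disagreement_error(f, S) - 1 / 3) < 1e-12
```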
48 Both the 0 − 1 error and the disagreement error are reasonable measures of ranking inaccuracy. [sent-99, score-0.828]
49 It turns out that both are instances of a more general notion of ranking error of which additional meaningful instances exist. [sent-100, score-0.675]
50 The missing ingredient needed to define the generalized error is a graph decomposition procedure A that we assume is given together with the training data. [sent-102, score-0.523]
51 A takes as input a preference graph Gi and returns a set of si subgraphs of Gi, denoted {Gi,1, …, Gi,si}. [sent-106, score-0.659]
52 Each subgraph Gi,k is itself a preference graph and therefore δ(f (xi ), Gi,k ) is well defined. [sent-110, score-0.527]
53 We now define the generalized error attained by f (xi ) with respect to Gi as the fraction of subgraphs in A(Gi ) with which f (xi ) disagrees. [sent-111, score-0.382]
54 The generalized error attained on S is the sum of generalized errors over all training instances. [sent-112, score-0.367]
55 Formally, the generalized ranking error is defined as εgen(f, S, A) = Σ_{i=1}^{m} (1/si) Σ_{k=1}^{si} δ(f(xi), Gi,k), (1) where {Gi,1, …, Gi,si} = A(Gi). [sent-113, score-0.845]
56 Previously used losses for label ranking are special cases of the generalized error and are derived by choosing an appropriate decomposition procedure A. [sent-117, score-1.042]
57 For instance, when A is defined to be the identity transformation on graphs (A(G) = {G}), then the generalized ranking error is reduced to the 0 − 1 error. [sent-118, score-0.73]
58 Alternatively, for a graph G with s edges, we can define A to return s different subgraphs of G, each consisting of a single edge from G (Fig. 2, top), [sent-119, score-0.376]
59 and the generalized ranking error reduces to the disagreement error. [sent-120, score-0.844]
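Both reductions can be checked mechanically. This sketch implements Eq. (1) with graphs represented as edge lists and decomposition procedures as functions on edge lists (that representation is our own assumption):

```python
def delta(f, x, edges):
    """1 if some edge (init, term) has f(x, init) <= f(x, term), else 0."""
    return int(any(f(x, i) <= f(x, t) for i, t in edges))

def generalized_error(f, S, A):
    """epsilon_gen (Eq. 1): for every example, average delta over the
    subgraphs produced by the decomposition procedure A, then sum."""
    total = 0.0
    for x, edges in S:
        subgraphs = A(edges)
        total += sum(delta(f, x, g) for g in subgraphs) / len(subgraphs)
    return total

identity = lambda edges: [edges]                     # recovers the 0-1 error
single_edges = lambda edges: [[e] for e in edges]    # recovers eps_dis

f = lambda x, y: -abs(x - y)                         # illustrative scorer
S = [(1, [(1, 2), (1, 3), (3, 2)])]
assert generalized_error(f, S, identity) == 1.0      # some edge is violated
assert abs(generalized_error(f, S, single_edges) - 1 / 3) < 1e-12
```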
60 A vertex is said to dominate the set of neighboring vertices that are connected to its outgoing edges. [sent-122, score-0.28]
61 We would like every vertex in the preference graph to be ranked above all of its dominated neighbors. [sent-123, score-0.779]
62 The domination error attained by f (xi ) with respect to Gi is the fraction of vertices with outgoing edges which are not ranked above all of their dominated neighbors. [sent-124, score-0.645]
63 Formally, let A be the procedure that takes a preference graph G = (V, E) and returns a subgraph for each vertex with outgoing edges, each such subgraph consisting of a dominating vertex, its dominated neighbors and the edges between them (Fig. 2). [sent-125, score-0.916]
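The domination decomposition can be sketched by grouping a graph's edges by their initial vertex (an illustrative fragment, not code from the paper):

```python
def domination_decomposition(edges):
    """One subgraph per dominating vertex: the vertex's outgoing edges,
    i.e. the vertex, its dominated neighbors, and the edges between them."""
    out = {}
    for i, t in edges:
        out.setdefault(i, []).append((i, t))
    return list(out.values())

# Bipartite multilabel graph with relevant labels {1, 2} over {1,...,4}:
edges = [(1, 3), (1, 4), (2, 3), (2, 4)]
subs = domination_decomposition(edges)
assert len(subs) == 2                        # one subgraph per relevant label
assert sorted(map(sorted, subs)) == [[(1, 3), (1, 4)], [(2, 3), (2, 4)]]
```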
64 Minimizing the domination error is useful for solving multilabel classification problems. [sent-128, score-0.325]
65 In these problems Y is of finite cardinality and every instance xi is associated with a set of correct labels Yi ⊆ Y. [sent-129, score-0.411]
66 In order to reduce this problem to a ranking problem, we construct preference graphs Gi = (Y, Ei ), where Ei contains edges from every vertex in Yi to every vertex in Y \ Yi . [sent-130, score-1.283]
67 In this case, the domination loss simply counts the number of labels in Yi that are not ranked above all of the labels in Y \ Yi . [sent-131, score-0.54]
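The domination loss for this multilabel reduction can be sketched as a direct count (toy scores; all names are ours):

```python
def domination_loss(f, x, relevant, all_labels):
    """Count the relevant labels that are NOT ranked above every
    irrelevant label (the domination loss for the multilabel reduction)."""
    irrelevant = [y for y in all_labels if y not in relevant]
    top_irrelevant = max(f(x, y) for y in irrelevant)  # best-scoring irrelevant
    return sum(1 for r in relevant if f(x, r) <= top_irrelevant)

scores = {1: 3.0, 2: 0.5, 3: 1.0, 4: -2.0}            # illustrative scores
f = lambda x, y: scores[y]
# Label 1 beats every irrelevant label; label 2 does not (f(x,2) <= f(x,3)).
assert domination_loss(f, None, {1, 2}, [1, 2, 3, 4]) == 1
```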
68 The dominated error is proportional to the number of labels with incoming edges that are not ranked below all of the labels that dominate them. [sent-133, score-0.53]
69 Its graph decomposition procedure is depicted at the bottom of Fig. 2. [sent-134, score-0.349]
70 Additional instances of the generalized ranking error exist, and can be tailored to fit most ranking problems. [sent-136, score-1.282]
71 Input: a training set S = {(xi, Gi)}_{i=1}^{m} where each xi ∈ X and each Gi is a preference graph, a decomposition procedure A, and a set of base ranking functions {h1, …, hn}. [sent-140, score-1.094]
72 Initialize: λ1 = (0, …, 0); πi,e,j = hj(xi, term(e)) − hj(xi, init(e)) [1 ≤ i ≤ m, e ∈ Ei, 1 ≤ j ≤ n]; ρ = max_{i,e} Σ_j |πi,e,j|. Iterate: for t = 1, 2, … [sent-147, score-0.302]
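The initialization step above can be sketched as follows. The function name and the dictionary keyed by (example, edge) are our own choices, and the boosting iterations themselves are omitted:

```python
def initialize(S, base_rankers):
    """Compute pi[(i, e)][j] = h_j(x_i, term(e)) - h_j(x_i, init(e)) for
    every example i, edge e, and base ranker j, plus
    rho = max over (i, e) of sum_j |pi[(i, e)][j]|, and the all-zeros
    weight vector lambda_1."""
    n = len(base_rankers)
    lam = [0.0] * n
    pi, rho = {}, 0.0
    for i, (x, edges) in enumerate(S):
        for e in edges:
            init_v, term_v = e
            row = [h(x, term_v) - h(x, init_v) for h in base_rankers]
            pi[(i, e)] = row
            rho = max(rho, sum(abs(v) for v in row))
    return lam, pi, rho

h1 = lambda x, y: float(y)                   # toy base rankers
h2 = lambda x, y: float(x * y)
S = [(2, [(1, 3)])]
lam, pi, rho = initialize(S, [h1, h2])
assert lam == [0.0, 0.0]
assert pi[(0, (1, 3))] == [2.0, 4.0]         # [3 - 1, 6 - 2]
assert rho == 6.0
```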
3 Minimizing the Generalized Ranking Error
73 Our goal is to minimize the generalized error for a given training set S and graph decomposition procedure A. [sent-151, score-0.523]
74 Theorem 1 Let S = {(xi, Gi)}_{i=1}^{m} be a training set such that every xi ∈ X and every Gi is a preference graph. [sent-165, score-0.501]
75 Let A be a graph decomposition procedure that defines for each preference graph Gi a set of subgraphs {Gi,1, …, Gi,si}. [sent-166, score-0.933]
76 Denote by ft the ranking function obtained at iteration t of the algorithm given in Fig. [sent-170, score-0.707]
77 Proof Define ∆t,i,k to be the difference between the loss attained by ft and the loss attained by ft+1 on (xi, Gi,k), that is, ∆t,i,k = L(ft(xi), Gi,k) − L(ft+1(xi), Gi,k), and define φt,i,k = Σ_{e∈Ei,k} exp(λt · πi,e). [sent-174, score-0.517]
4 Experiments
78 To demonstrate our framework, we chose to learn a category ranking problem on a subset of the Reuters Corpus, Vol. 1. [sent-203, score-0.632]
79 It would certainly be correct to categorize an article on government borrowing as either government finance or economics; however, these general categories are less specific and do not describe the article as well. [sent-230, score-0.391]
80 In summary, the category hierarchy induces a preference over the set of labels. [sent-232, score-0.407]
81 We exploit this property to generate supervision for the label ranking problem at hand. [sent-233, score-0.879]
82 Formally, we view every category as a vertex in a rooted tree, where the tree root corresponds to a general abstract category that is relevant to all of the articles in the corpus and every category is a specific instance of its parent in the tree. [sent-234, score-0.743]
83 To fix this inconsistency, we added a dummy child vertex to every inner vertex and diverted all paths that originally end in this inner vertex to its new child. [sent-237, score-0.501]
84 Our learning problem then becomes the problem of ranking leaves. [sent-238, score-0.559]
85 The severity of wrongly categorizing an article by a leaf is proportional to the graph distance between this leaf and the closest correct leaf given in the corpus. [sent-239, score-0.485]
86 The preference graph that encodes this preference is a multi-layer graph where the top layer contains all of the correct labels, the second layer contains all of their sibling vertices in the tree and so on. [sent-240, score-1.087]
87 Every vertex in the multi-layer preference graph has outgoing edges to all vertices in lower layers, but there are no edges between vertices in the same layer. [sent-241, score-0.951]
88 For practical purposes, we conducted experiments using only 3-layer preference graphs generated by collapsing all of the layers below 3 to a single layer. [sent-242, score-0.303]
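The layered construction described above can be sketched as follows (an illustrative fragment; the layer contents are toy values):

```python
def multilayer_preference_graph(layers):
    """Edges from every vertex to all vertices in strictly lower layers;
    no edges within a layer. `layers` is ordered best-first."""
    edges = []
    for k, layer in enumerate(layers):
        lower = [v for lo in layers[k + 1:] for v in lo]
        edges += [(v, u) for v in layer for u in lower]
    return edges

# 3-layer graph: correct labels, their tree siblings, everything else.
g = multilayer_preference_graph([[1], [2, 3], [4, 5]])
assert len(g) == 1 * 4 + 2 * 2               # 4 edges from layer 1, 4 from layer 2
assert (2, 3) not in g and (3, 2) not in g   # no edges within a layer
```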
89 The word counts for each article were used to construct base ranking functions in the following way: for every word w and every category y, let w(xi ) denote the number of appearances of w in the article xi . [sent-244, score-1.043]
90 These words then define 103 · 3200 base ranking functions as shown in Eq. [sent-247, score-0.587]
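The exact functional form referenced above (the elided equation) is not recoverable from this extraction; the following indicator-style construction is only a plausible guess, labeled as such, in which each (word, category) pair yields a base ranker that fires the word count when the candidate label matches the category:

```python
def make_base_ranker(w_count, category):
    """Hypothetical base ranker for a (word, category) pair: returns the
    word count w(x) when the candidate label equals the category, else 0.
    This form is an assumption; the paper's exact equation is not shown here."""
    return lambda x, y: float(w_count(x)) if y == category else 0.0

w = lambda x: x.lower().split().count("finance")     # toy word counter
h = make_base_ranker(w, "government finance")
assert h("finance ministry finance", "government finance") == 2.0
assert h("finance ministry finance", "economics") == 0.0
```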
91 Next, we ran our learning algorithm using each of the 4 graph decomposition procedures discussed above: zero-one, disagreement, domination and dominated. [sent-249, score-0.433]
92 First, these results are not comparable with previous results for multilabel problems using this corpus, since label ranking is a more difficult task. [sent-254, score-0.915]
93 For instance, an average preference graph in the test data has 820 edges, and the error for such a graph equals zero only if every single edge agrees with the ranking function. [sent-255, score-1.416]
94 Second, the experiments clearly indicate that the results obtained by minimizing the domination loss are better than the other ranking losses, no matter what error is used for evaluation. [sent-256, score-0.812]
95 In particular, employing the domination loss yields significantly better results than using the disagreement loss which has been the commonly used decomposition method in categorization problems [7, 10, 6, 4]. [sent-257, score-0.62]
5 Summary
96 We presented a general framework for label ranking problems by means of preference graphs and the graph decomposition procedure. [sent-258, score-1.386]
97 We then described and analyzed a boosting algorithm that works with any choice of graph decomposition. [sent-260, score-0.3]
98 We are currently exporting the approach to learning in inner product spaces, where different graph decomposition procedures result in different bindings of slack variables. [sent-261, score-0.344]
99 Another interesting question is whether the graph decomposition approach can be combined with probabilistic models for orderings [9] to achieve algorithmic efficiency. [sent-262, score-0.319]
100 New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. [sent-266, score-0.587]
same-paper 1 1.0000001 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
2 0.48541766 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
3 0.10501904 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
4 0.10207935 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
5 0.095992692 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
6 0.092069663 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
7 0.086876482 113 nips-2003-Learning with Local and Global Consistency
8 0.085217074 99 nips-2003-Kernels for Structured Natural Language Data
9 0.078032337 30 nips-2003-Approximability of Probability Distributions
10 0.071201451 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
11 0.071044654 168 nips-2003-Salient Boundary Detection using Ratio Contour
12 0.069672689 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
13 0.065254882 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.064842969 1 nips-2003-1-norm Support Vector Machines
15 0.06278722 117 nips-2003-Linear Response for Approximate Inference
16 0.061603956 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
17 0.061206222 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
18 0.061040241 3 nips-2003-AUC Optimization vs. Error Rate Minimization
19 0.060614806 128 nips-2003-Minimax Embeddings
20 0.059635121 48 nips-2003-Convex Methods for Transduction
topicId topicWeight
[(0, -0.197), (1, -0.108), (2, -0.083), (3, -0.04), (4, 0.062), (5, -0.094), (6, -0.159), (7, -0.081), (8, 0.023), (9, -0.127), (10, 0.268), (11, 0.092), (12, 0.14), (13, 0.026), (14, 0.16), (15, -0.229), (16, -0.359), (17, -0.348), (18, -0.084), (19, -0.127), (20, -0.184), (21, 0.031), (22, 0.135), (23, -0.018), (24, 0.137), (25, 0.089), (26, -0.055), (27, -0.011), (28, -0.097), (29, -0.089), (30, 0.014), (31, -0.031), (32, -0.056), (33, -0.089), (34, -0.027), (35, 0.008), (36, -0.021), (37, 0.005), (38, 0.079), (39, 0.044), (40, 0.018), (41, -0.014), (42, 0.027), (43, 0.045), (44, -0.007), (45, 0.043), (46, 0.064), (47, 0.048), (48, -0.007), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96174735 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
2 0.91137493 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
3 0.459916 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. 
A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. 
We report the results of our experiments with RankBoost on several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms that optimize an approximation of the AUC or optimize it only locally.

2 Definition and properties of the AUC

Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have attracted increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of an ROC curve.

The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification, as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1_X the indicator function of a set X.

Lemma 1 ([8]) Let c be a fixed classifier. Let x_1, \ldots, x_m be the output of c on the positive examples and y_1, \ldots, y_n its output on the negative examples. Then, the AUC, A, associated to c is given by:

    A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i > y_j}    (1)

that is, the value of the Wilcoxon-Mann-Whitney statistic [8].

Proof.
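Lemma 1 can be checked directly. Below is a minimal sketch (not from the paper; the function name is ours) that computes the AUC as the fraction of correctly ordered positive-negative pairs, i.e., the Wilcoxon-Mann-Whitney statistic of Eq. (1):

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic of Eq. (1):
    the fraction of (positive, negative) pairs ranked correctly.
    Assumes a strict ordering (no tied scores), as in Lemma 1."""
    m, n = len(pos_scores), len(neg_scores)
    correct = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return correct / (m * n)

# Perfect ranking: every positive scores above every negative -> A = 1.
print(pairwise_auc([0.9, 0.8], [0.3, 0.1]))   # 1.0
# One inverted pair out of four -> A = 0.75.
print(pairwise_auc([0.9, 0.2], [0.3, 0.1]))   # 0.75
```

Any single swap of a positive below a negative removes one correctly ordered pair, so the AUC decreases by 1/(mn), consistent with the discussion following the lemma.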
The proof is based on the observation that the AUC value is exactly the probability P(X > Y) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8].

Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC.

² An attempt in that direction was made by [15], but, unfortunately, the authors' analysis and the result are both wrong.

Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false positive examples. Above the threshold θ lie m − (k − x) positive and x negative examples; below it, n − x negative and k − x positive examples.

3 The Expected Value of the AUC

In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate.

Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general.
For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 2 shows the corresponding configuration: the two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts:

    A = \frac{A_1 + A_2 + A_3}{mn},    (2)

with

A_1 = the sum over all pairs (x_i, y_j) with x_i and y_j in distinct regions;
A_2 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region above the threshold;
A_3 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region below the threshold.

The first term, A_1, is easy to compute. Since there are m − (k − x) positive examples above the threshold and n − x negative examples below the threshold, A_1 is given by:

    A_1 = (m − (k − x))(n − x)    (3)

To compute A_2, we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α_1 < \ldots < α_x denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A_2,

    A_2 = \sum_{i=1}^{x} \left[ (N − α_i) − (x − i) \right]    (4)

where the first term, N − α_i, represents the number of examples ranked higher than the i-th example and the second term, x − i, discounts the number of negative examples incorrectly ranked higher than the i-th example. Similarly, let α'_1 < \ldots < α'_{k−x} denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A_3 is given by:

    A_3 = \sum_{j=1}^{x'} \left[ (N' − α'_j) − (x' − j) \right]    (5)

with N' = n − x + (k − x) and x' = k − x.
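The three-part decomposition of Eqs. (2)-(5) can be checked numerically against the direct pairwise count of Eq. (1). The sketch below (our own construction, not the paper's code) encodes a ranking as two top-to-bottom label lists, one per region, with positions α counted away from the threshold as in the text:

```python
from itertools import combinations

def auc_parts(above, below):
    """Compute A1 + A2 + A3 of Eqs. (3)-(5) for a ranking given as two
    top-to-bottom lists of labels ('+' or '-'), above and below the
    threshold.  Positions alpha are counted away from the threshold,
    position 1 being nearest to it, per the text."""
    N, Np = len(above), len(below)
    # alpha_i: positions of the negatives above the threshold
    alphas = sorted(N - i for i, lab in enumerate(above) if lab == '-')
    # alpha'_j: positions of the positives below the threshold
    alphas_p = sorted(i + 1 for i, lab in enumerate(below) if lab == '+')
    x, xp = len(alphas), len(alphas_p)
    A1 = (N - x) * (Np - xp)                                  # Eq. (3)
    A2 = sum((N - a) - (x - i) for i, a in enumerate(alphas, 1))     # Eq. (4)
    A3 = sum((Np - a) - (xp - j) for j, a in enumerate(alphas_p, 1)) # Eq. (5)
    return A1 + A2 + A3

def direct_count(above, below):
    """Number of (positive, negative) pairs ranked correctly, Eq. (1) x mn."""
    order = above + below                     # full ranking, top to bottom
    return sum(1 for i, j in combinations(range(len(order)), 2)
               if order[i] == '+' and order[j] == '-')

above = ['+', '-', '+', '-']   # x = 2 false positives above the threshold
below = ['+', '-', '-']        # k - x = 1 false negative below it
print(auc_parts(above, below), direct_count(above, below))   # 9 9
```

Dividing either quantity by mn yields the AUC of Eq. (2).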
Combining the expressions of A_1, A_2, and A_3 leads to:

    A = \frac{A_1 + A_2 + A_3}{mn} = 1 + \frac{(k − 2x)^2 + k}{2mn} − \frac{\sum_{i=1}^{x} α_i + \sum_{j=1}^{x'} α'_j}{mn}    (6)

Lemma 2 For a fixed x, the average value of the AUC, A, is given by:

    \langle A \rangle_x = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (7)

Proof. The proof is based on the computation of the average values of \sum_{i=1}^{x} α_i and \sum_{j=1}^{x'} α'_j for a given x. We start by computing the average value \langle α_i \rangle_x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α_1, \ldots, α_{i−1} and α_{i+1}, \ldots, α_x when the value of α_i is fixed at, say, α_i = l. We have i ≤ l ≤ N − (x − i), since there need to be at least i − 1 positions before α_i and x − i after it. There are \binom{l−1}{i−1} possible choices of positions for α_1, \ldots, α_{i−1} and \binom{N−l}{x−i} for α_{i+1}, \ldots, α_x. Since the total number of ways of choosing the x positions for α_1, \ldots, α_x out of N is \binom{N}{x}, the average value \langle α_i \rangle_x is:

    \langle α_i \rangle_x = \frac{\sum_{l=i}^{N−(x−i)} l \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}}    (8)

Thus,

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{\sum_{i=1}^{x} \sum_{l=i}^{N−(x−i)} l \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}} = \frac{\sum_{l=1}^{N} l \sum_{i=1}^{x} \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}}    (9)

Using the classical identity \sum_{p_1 + p_2 = p} \binom{u}{p_1} \binom{v}{p_2} = \binom{u+v}{p} and \sum_{l=1}^{N} l = \frac{N(N+1)}{2}, we can write:

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{\sum_{l=1}^{N} l \binom{N−1}{x−1}}{\binom{N}{x}} = \frac{N(N+1)}{2} \cdot \frac{\binom{N−1}{x−1}}{\binom{N}{x}} = \frac{x(N + 1)}{2}    (10)

Similarly, we have:

    \sum_{j=1}^{x'} \langle α'_j \rangle_x = \frac{x'(N' + 1)}{2}    (11)

Replacing \sum_{i=1}^{x} \langle α_i \rangle_x and \sum_{j=1}^{x'} \langle α'_j \rangle_x in Eq. (6) by the expressions given by Eq. (10) and Eq. (11) leads to:

    \langle A \rangle_x = 1 + \frac{(k − 2x)^2 + k − x(N + 1) − x'(N' + 1)}{2mn} = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (12)

which ends the proof of the lemma.

Note that Eq. (7) shows that the average AUC value for a given x is simply the average of the per-class accuracies, that is, one minus the average of the per-class error rates for the positive and negative classes.

Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by:

    \langle A \rangle = 1 − \frac{k}{m+n} − \frac{(n − m)^2 (m + n + 1)}{4mn} \left( \frac{k}{m+n} − \frac{\sum_{x=0}^{k−1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right)    (13)

Proof. Lemma 2 gives the average value of the AUC for a fixed value of x.
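Lemma 2 can be verified by brute force for small problem sizes. The sketch below (our own, hypothetical helper name) enumerates every placement of the x false positives above the threshold and the k − x false negatives below it, all equiprobable as in the model of Section 3, and compares the average pairwise AUC with Eq. (7):

```python
from itertools import combinations

def avg_auc_for_x(m, n, k, x):
    """Average pairwise AUC over all equiprobable placements of the
    x false positives above the threshold and the k - x false
    negatives below it (the model of Section 3)."""
    N = m - (k - x) + x           # examples classified as positive
    Np = n - x + (k - x)          # examples classified as negative
    total, count = 0.0, 0
    for neg_pos in combinations(range(N), x):           # negatives above
        above = ['-' if i in neg_pos else '+' for i in range(N)]
        for pos_pos in combinations(range(Np), k - x):  # positives below
            below = ['+' if j in pos_pos else '-' for j in range(Np)]
            order = above + below                       # full ranking
            correct = sum(1 for a, b in combinations(order, 2)
                          if a == '+' and b == '-')
            total += correct / (m * n)
            count += 1
    return total / count

m, n, k, x = 3, 5, 2, 1
print(avg_auc_for_x(m, n, k, x))          # brute-force average, ~0.7333
print(1 - (x / n + (k - x) / m) / 2)      # Lemma 2, Eq. (7): 11/15 ~ 0.7333
```

The two values agree exactly (up to floating-point rounding) for every small (m, n, k, x) we tried.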
To compute the average over all possible values of x, we need to weight the expression of Eq. (7) by the total number of possible classifications for a given x. There are \binom{N}{x} possible ways of choosing the positions of the x misclassified negative examples, and similarly \binom{N'}{x'} possible ways of choosing the positions of the x' = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by:

    \langle A \rangle = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} \left( 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k−x}{m} \right) \right)}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}}    (14)

This expression can be simplified into Eq. (13)³ using the following novel identities:

    \sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \binom{m+n+1}{x}    (15)

    \sum_{x=0}^{k} x \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \frac{(k − x)(m − n) + k}{2} \binom{m+n+1}{x}    (16)

that we obtained by using Zeilberger's algorithm⁴ and numerous combinatorial 'tricks'.

Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate (0.0 to 0.5). Each curve corresponds to a fixed ratio r = n/(n + m), for r = 0.01, 0.05, 0.1, 0.25, 0.5. The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m.

From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n ≠ m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0.
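Proposition 1 can be checked numerically by comparing the direct summation of Eq. (14) with the closed form of Eq. (13). The sketch below (function names are ours) assumes k ≤ min(m, n), i.e., every value of x in 0..k yields a valid configuration:

```python
from math import comb

def expected_auc_sum(m, n, k):
    """Expected AUC by direct weighted summation over x, Eq. (14)."""
    num = den = 0.0
    for x in range(k + 1):
        if x > n or k - x > m:         # configuration impossible, skip
            continue
        N = m - (k - x) + x            # examples classified as positive
        Np = n - x + (k - x)           # examples classified as negative
        w = comb(N, x) * comb(Np, k - x)
        num += w * (1 - (x / n + (k - x) / m) / 2)
        den += w
    return num / den

def expected_auc_closed(m, n, k):
    """Closed form of Proposition 1, Eq. (13)."""
    s1 = sum(comb(m + n, x) for x in range(k))          # x = 0 .. k-1
    s2 = sum(comb(m + n + 1, x) for x in range(k + 1))  # x = 0 .. k
    return (1 - k / (m + n)
            - (n - m) ** 2 * (m + n + 1) / (4 * m * n)
              * (k / (m + n) - s1 / s2))

for m, n, k in [(1, 2, 1), (3, 5, 2), (4, 10, 3), (6, 6, 4)]:
    print(expected_auc_sum(m, n, k), expected_auc_closed(m, n, k))
```

The two columns agree; for the balanced case m = n the closed form reduces to the accuracy, 1 − k/(m + n), as stated in the text.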
In other words, on average, there seems to be nothing gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k ≠ 0.

³ An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors' analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false.
⁴ We thank Neil Sloane for having pointed us to Zeilberger's algorithm and Maple package.

4 The Variance of the AUC

Let D = mn + \frac{(k − 2x)^2 + k}{2}, a = \sum_{i=1}^{x} α_i, a' = \sum_{j=1}^{x'} α'_j, and α = a + a'. Then, by Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ²(A), is given by:

    (mn)^2 σ^2(A) = \langle (D − α)^2 \rangle − (\langle D \rangle − \langle α \rangle)^2
                  = \langle D^2 \rangle − \langle D \rangle^2 + \langle α^2 \rangle − \langle α \rangle^2 − 2 (\langle αD \rangle − \langle α \rangle \langle D \rangle)    (17)

As before, to compute the average of a term X over all classifications, we can first determine its average \langle X \rangle_x for a fixed x, and then use the function F defined by:

    F(Y) = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} Y}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}}    (18)

so that \langle X \rangle = F(\langle X \rangle_x). A crucial step in computing the exact value of the variance of the AUC is to determine the value of terms of the type \langle a^2 \rangle_x = \langle (\sum_{i=1}^{x} α_i)^2 \rangle_x.

Lemma 3 For a fixed x, the average of (\sum_{i=1}^{x} α_i)^2 is given by:

    \langle a^2 \rangle_x = \frac{x(N + 1)}{12} (3Nx + 2x + N)    (19)

Proof. By definition of a, \langle a^2 \rangle_x = b + 2c with:

    b = \sum_{i=1}^{x} \langle α_i^2 \rangle_x, \qquad c = \sum_{1 ≤ i < j ≤ x} \langle α_i α_j \rangle_x    (20)
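Lemma 3 is easy to verify by exhaustive enumeration for small N and x. The sketch below (our own check, not the paper's code) averages (\sum_i α_i)^2 over all equiprobable choices of x positions out of {1, \ldots, N} and compares with Eq. (19):

```python
from itertools import combinations

def avg_a_squared(N, x):
    """Average of (sum of alpha_i)^2 over all equiprobable choices of
    x positions out of {1, ..., N}."""
    vals = [sum(c) ** 2 for c in combinations(range(1, N + 1), x)]
    return sum(vals) / len(vals)

N, x = 7, 3
print(avg_a_squared(N, x))                          # brute force: 152.0
print(x * (N + 1) * (3 * N * x + 2 * x + N) / 12)   # Lemma 3, Eq. (19): 152.0
```

The same kind of enumeration also confirms that the variance of the AUC over configurations with k ≠ 0 errors is strictly positive, which is the point of this section.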
4 0.31125423 30 nips-2003-Approximability of Probability Distributions
Author: Alina Beygelzimer, Irina Rish
Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1
5 0.30497909 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.2935915 99 nips-2003-Kernels for Structured Natural Language Data
7 0.28304222 168 nips-2003-Salient Boundary Detection using Ratio Contour
8 0.26945168 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
9 0.2649748 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
10 0.24016081 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
11 0.23650774 118 nips-2003-Link Prediction in Relational Data
12 0.22716638 124 nips-2003-Max-Margin Markov Networks
13 0.22633079 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
14 0.22247082 122 nips-2003-Margin Maximizing Loss Functions
15 0.19990005 108 nips-2003-Learning a Distance Metric from Relative Comparisons
16 0.19936776 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
17 0.19607919 48 nips-2003-Convex Methods for Transduction
18 0.19451506 128 nips-2003-Minimax Embeddings
19 0.19412397 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
20 0.19169939 146 nips-2003-Online Learning of Non-stationary Sequences
topicId topicWeight
[(0, 0.053), (11, 0.031), (29, 0.01), (30, 0.014), (35, 0.048), (53, 0.066), (71, 0.104), (76, 0.035), (77, 0.3), (85, 0.074), (91, 0.117), (99, 0.053)]
simIndex simValue paperId paperTitle
1 0.81839716 123 nips-2003-Markov Models for Automated ECG Interval Analysis
Author: Nicholas P. Hughes, Lionel Tarassenko, Stephen J. Roberts
Abstract: We examine the use of hidden Markov and hidden semi-Markov models for automatically segmenting an electrocardiogram waveform into its constituent waveform features. An undecimated wavelet transform is used to generate an overcomplete representation of the signal that is more appropriate for subsequent modelling. We show that the state durations implicit in a standard hidden Markov model are ill-suited to those of real ECG features, and we investigate the use of hidden semi-Markov models for improved state duration modelling. 1
same-paper 2 0.8183859 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
3 0.53650254 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.53368437 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
5 0.53026652 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
Author: Alan Fern, Sungwook Yoon, Robert Givan
Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1
6 0.52746761 78 nips-2003-Gaussian Processes in Reinforcement Learning
7 0.52726847 189 nips-2003-Tree-structured Approximations by Expectation Propagation
8 0.52579081 117 nips-2003-Linear Response for Approximate Inference
9 0.52562308 41 nips-2003-Boosting versus Covering
10 0.52385467 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
11 0.5237906 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
12 0.52340078 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
13 0.52298456 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.52252817 143 nips-2003-On the Dynamics of Boosting
15 0.52203119 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
16 0.52134347 30 nips-2003-Approximability of Probability Distributions
17 0.52082366 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
18 0.52037233 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
19 0.51990062 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
20 0.51980603 100 nips-2003-Laplace Propagation