nips nips2003 nips2003-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. [sent-13, score-0.68]
2 We present a general framework for batch learning of label ranking functions from supervised data. [sent-14, score-0.735]
3 This enables us to accommodate a variety of ranking problems. [sent-16, score-0.559]
4 In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. [sent-17, score-0.625]
5 We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. [sent-19, score-0.824]
6 That is, a label deemed relevant to an instance should be ranked higher than a label which is considered less relevant. [sent-23, score-0.548]
7 With each training instance we receive supervision given as a set of preferences over the labels. [sent-24, score-0.282]
8 Concretely, the supervision we receive with each instance is given in the form of a preference graph: a simple directed graph for which the labels are the graph vertices. [sent-25, score-1.102]
9 A directed edge from a label y to another label y′ denotes that, according to the supervision, y is more relevant to the instance than y′. [sent-26, score-0.601]
10 We do not impose any further constraints on the structure of the preference graph. [sent-27, score-0.268]
11 The simplest setting is multiclass categorization in which each instance is associated with a single label out of k possible labels. [sent-29, score-0.392]
12 Using the graph representation for multiclass problems, the preference graph induced by the supervision has k vertices and k − 1 edges. [sent-32, score-1.001]
13 A directed edge points from the (single) relevant label to each of the k − 1 irrelevant labels (Fig. 1a). [sent-33, score-0.509]
14 An interesting and practical generalization of multiclass problems is multilabel problems [10, 6, 4], in which a set of relevant labels (rather than a single label) is associated with each instance. [sent-35, score-0.449]
15 In this case the supervision is represented by a directed bipartite [sent-36, score-0.788]
19 graph where the relevant labels constitute one side of the graph and the irrelevant labels the other side, and there is a directed edge from each relevant label to each irrelevant label. [sent-42, score-1.134]
Figure 1: The supervision provided to the algorithm associates every training instance with a preference graph.
16 (b) multiclass multilabel categorization where {1, 2} is the set of correct labels. [sent-39, score-0.326]
17 (c) A multi-layer graph that encodes three levels of label “goodness”, useful for instance in hierarchical multiclass settings. [sent-40, score-0.541]
18 (d) a general (possibly cyclic) preference graph with no predefined structure. [sent-41, score-0.479]
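The bipartite construction just described translates into a few lines of code. The following is an illustrative sketch (function and variable names are ours, not the paper's), representing a preference graph as a list of directed (init, term) edges:

```python
def bipartite_preference_graph(relevant, all_labels):
    """Build the multilabel supervision graph: a directed edge from every
    relevant label to every irrelevant label. Multiclass categorization is
    the special case where `relevant` holds a single label (k - 1 edges)."""
    irrelevant = [y for y in all_labels if y not in relevant]
    return [(r, s) for r in relevant for s in irrelevant]

labels = [1, 2, 3, 4, 5]
g = bipartite_preference_graph({1, 2}, labels)       # the Fig. 1(b) setting
assert len(g) == 2 * 3                               # |Yi| * |Y \ Yi| edges
assert bipartite_preference_graph({1}, labels) == [(1, 2), (1, 3), (1, 4), (1, 5)]
```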
20 For concreteness, we use the term label ranking for all of these problems. [sent-50, score-0.735]
21 Our learning framework decomposes each preference graph into subgraphs, where the graph decomposition procedure may take a general form and can change as a function of the instances. [sent-51, score-0.828]
22 Ranking algorithms, especially in multilabel categorization problems, often reduce the ranking task into multiple binary decision problems by enumerating over all pairs of labels [7, 6, 4]. [sent-52, score-0.922]
23 Such a reduction can easily be accommodated within our framework by decomposing the preference graph into elementary subgraphs, each consisting of a single edge. [sent-53, score-0.479]
24 Another approach is to compare a highly preferred label (such as the correct or best parse of a sentence) with less preferred labels. [sent-54, score-0.364]
25 Such approaches can be analyzed within our framework by defining a graph decomposition procedure that generates a subgraph for each relevant label and the neighboring labels that it is preferred over. [sent-55, score-0.786]
26 Returning to multilabel settings, this decomposition amounts to a loss that counts the number of relevant labels which are wrongly ranked below irrelevant ones. [sent-56, score-0.684]
27 Our framework employing graph decomposition can also be used in other settings such as element ranking via projections [3, 11]. [sent-59, score-0.915]
28 Furthermore, settings in which a semi-metric is defined over the label-set can also be reduced to the problem of label ranking, such as the parse ordering case mentioned above or when the labels are arranged in a hierarchical structure. [sent-60, score-0.409]
29 We employ such a reduction in the category ranking experiments described in Sec. 4. [sent-61, score-0.654]
30 In Sec. 3 we present an algorithm for learning label ranking functions. [sent-66, score-0.735]
31 A label ranking for an instance x ∈ X is a total order over Y, where y ≻ y′ implies that y is preferred over y′ as a label for x. [sent-71, score-1.038]
32 A label ranking function f : X × Y → R induces a label ranking for x ∈ X by y ≻ y′ ⇐⇒ f(x, y) > f(x, y′). [sent-72, score-1.536]
33 Overloading our notation, we denote the label ranking induced by f for x by f (x). [sent-73, score-0.757]
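As a concrete illustration of how f induces a total order: sort the labels by f(x, ·) in decreasing order. The scorer below is a toy stand-in, not the paper's learned function:

```python
def induced_ranking(f, x, labels):
    """A ranking function f(x, y) -> real induces a total order on the
    labels: y precedes y' whenever f(x, y) > f(x, y')."""
    return sorted(labels, key=lambda y: f(x, y), reverse=True)

# Toy scorer (illustrative only): score label y for instance x by -|x - y|.
f = lambda x, y: -abs(x - y)
# Keys for x = 2 are 1 -> -1, 2 -> 0, 3 -> -1, 4 -> -2; Python's stable
# sort breaks the 1/3 tie by original order.
assert induced_ranking(f, 2, [1, 2, 3, 4]) == [2, 1, 3, 4]
```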
34 We are also provided with a training set S = {(xi, Gi)}_{i=1}^{m}, where every example is comprised of an instance xi ∈ X and a preference graph Gi. [sent-78, score-0.768]
35 As defined in the previous section, a preference graph is a directed graph G = (V, E), for which the set of vertices V is defined to be the set of labels Y and E is some finite set of directed edges. [sent-79, score-1.019]
36 Every edge e ∈ E in a directed graph is associated with an initial vertex, init(e) ∈ V, and a terminal vertex, term(e) ∈ V. [sent-80, score-0.341]
37 The existence of a directed edge between two labels in a preference graph indicates that init(e) is preferred over term(e) and should be ranked higher. [sent-81, score-0.857]
38 We require preference graphs to be simple, namely to have no more than a single edge between any pair of vertices and to not contain any self-loops. [sent-82, score-0.457]
39 However, we impose no additional constraints on the supervision, namely, the set of edges in a preference graph may be sparse and may even include cycles. [sent-83, score-0.554]
40 Informally, our goal is for the label ranking induced by f to be as consistent as possible with all of the preference graphs given in S. [sent-86, score-1.06]
41 We say that f(xi) disagrees with a preference graph Gi = (Vi, Ei) if there exists an edge e ∈ Ei for which f(xi, init(e)) ≤ f(xi, term(e)). [sent-87, score-0.774]
42 A simple measure of empirical ranking accuracy immediately follows from the definition of δ: we define the 0−1 error attained by a ranking function f on a training set S to be the number of training examples for which f(xi) disagrees with Gi, namely, ε0−1(f, S) = Σ_{i=1}^{m} δ(f(xi), Gi). [sent-91, score-1.404]
43 The 0−1 error may be natural for certain ranking problems; however, in general it is a rather crude measure of ranking inaccuracy, as it is invariant to the exact number of edges in Gi with which f(xi) disagrees. [sent-92, score-1.253]
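The 0−1 error can be sketched as follows, treating each training example as an instance paired with its preference-graph edge list (all names here are illustrative, not from the paper):

```python
def delta(f, x, edges):
    """delta = 1 if the ranking induced by f(x, .) disagrees with the
    preference graph, i.e. some edge (init, term) has
    f(x, init) <= f(x, term); otherwise 0."""
    return int(any(f(x, i) <= f(x, t) for i, t in edges))

def zero_one_error(f, S):
    """epsilon_{0-1}(f, S): number of training examples whose preference
    graph is violated at least once."""
    return sum(delta(f, x, edges) for x, edges in S)

f = lambda x, y: -abs(x - y)                 # illustrative scorer
S = [(1, [(1, 2), (1, 3)]),                  # satisfied: f(1,1) beats both
     (1, [(3, 2)])]                          # violated: f(1,3) <= f(1,2)
assert zero_one_error(f, S) == 1
```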
44 Many ranking problems require a more refined notion of ranking accuracy. [sent-93, score-1.147]
45 Thus, we define the disagreement error attained by f (xi ) with respect to Gi to be the fraction of edges in Ei with which f (xi ) disagrees. [sent-94, score-0.425]
46 The disagreement error attained on the entire training set is the sum of disagreement errors over all training examples. [sent-95, score-0.551]
47 Formally, we define the disagreement error attained on S as εdis(f, S) = Σ_{i=1}^{m} (1/|Ei|) · |{e ∈ Ei : f(xi, init(e)) ≤ f(xi, term(e))}|. [sent-96, score-0.326]
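A sketch of the disagreement error, representing each preference graph as a list of (init, term) edges (that representation is our own convention for this illustration):

```python
def disagreement_error(f, S):
    """epsilon_dis(f, S): for each example, the fraction of edges in its
    preference graph with which the induced ranking disagrees, summed
    over the training set."""
    total = 0.0
    for x, edges in S:
        bad = sum(1 for i, t in edges if f(x, i) <= f(x, t))
        total += bad / len(edges)
    return total

f = lambda x, y: -abs(x - y)                 # illustrative scorer
S = [(1, [(1, 2), (1, 3), (3, 2)])]          # one of three edges violated
assert abs(disagreement_error(f, S) - 1 / 3) < 1e-12
```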
48 Both the 0 − 1 error and the disagreement error are reasonable measures of ranking inaccuracy. [sent-99, score-0.828]
49 It turns out that both are instances of a more general notion of ranking error of which additional meaningful instances exist. [sent-100, score-0.675]
50 The missing ingredient needed to define the generalized error is a graph decomposition procedure A that we assume is given together with the training data. [sent-102, score-0.523]
51 A takes as input a preference graph Gi and returns a set of si subgraphs of Gi, denoted {Gi,1, …, Gi,si}. [sent-106, score-0.659]
52 Each subgraph Gi,k is itself a preference graph and therefore δ(f (xi ), Gi,k ) is well defined. [sent-110, score-0.527]
53 We now define the generalized error attained by f (xi ) with respect to Gi as the fraction of subgraphs in A(Gi ) with which f (xi ) disagrees. [sent-111, score-0.382]
54 The generalized error attained on S is the sum of generalized errors over all training instances. [sent-112, score-0.367]
55 Formally, the generalized ranking error is defined as εgen(f, S, A) = Σ_{i=1}^{m} (1/si) Σ_{k=1}^{si} δ(f(xi), Gi,k), (1) where {Gi,1, …, Gi,si} = A(Gi). [sent-113, score-0.845]
56 Previously used losses for label ranking are special cases of the generalized error and are derived by choosing an appropriate decomposition procedure A. [sent-117, score-1.042]
57 For instance, when A is defined to be the identity transformation on graphs (A(G) = {G}), then the generalized ranking error is reduced to the 0 − 1 error. [sent-118, score-0.73]
58 Alternatively, for a graph G with s edges, we can define A to return s different subgraphs of G, each consisting of a single edge from G (Fig. 2, top), [sent-119, score-0.376]
59 and the generalized ranking error reduces to the disagreement error. [sent-120, score-0.844]
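Both reductions can be checked mechanically. This sketch implements Eq. (1) with graphs represented as edge lists and decomposition procedures as functions on edge lists (that representation is our own assumption):

```python
def delta(f, x, edges):
    """1 if some edge (init, term) has f(x, init) <= f(x, term), else 0."""
    return int(any(f(x, i) <= f(x, t) for i, t in edges))

def generalized_error(f, S, A):
    """epsilon_gen (Eq. 1): for every example, average delta over the
    subgraphs produced by the decomposition procedure A, then sum."""
    total = 0.0
    for x, edges in S:
        subgraphs = A(edges)
        total += sum(delta(f, x, g) for g in subgraphs) / len(subgraphs)
    return total

identity = lambda edges: [edges]                     # recovers the 0-1 error
single_edges = lambda edges: [[e] for e in edges]    # recovers eps_dis

f = lambda x, y: -abs(x - y)                         # illustrative scorer
S = [(1, [(1, 2), (1, 3), (3, 2)])]
assert generalized_error(f, S, identity) == 1.0      # some edge is violated
assert abs(generalized_error(f, S, single_edges) - 1 / 3) < 1e-12
```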
60 A vertex is said to dominate the set of neighboring vertices that are connected to its outgoing edges. [sent-122, score-0.28]
61 We would like every vertex in the preference graph to be ranked above all of its dominated neighbors. [sent-123, score-0.779]
62 The domination error attained by f (xi ) with respect to Gi is the fraction of vertices with outgoing edges which are not ranked above all of their dominated neighbors. [sent-124, score-0.645]
63 Formally, let A be the procedure that takes a preference graph G = (V, E) and returns a subgraph for each vertex with outgoing edges, each such subgraph consisting of a dominating vertex, its dominated neighbors and the edges between them (Fig. 2). [sent-125, score-0.916]
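The domination decomposition can be sketched by grouping a graph's edges by their initial vertex (an illustrative fragment, not code from the paper):

```python
def domination_decomposition(edges):
    """One subgraph per dominating vertex: the vertex's outgoing edges,
    i.e. the vertex, its dominated neighbors, and the edges between them."""
    out = {}
    for i, t in edges:
        out.setdefault(i, []).append((i, t))
    return list(out.values())

# Bipartite multilabel graph with relevant labels {1, 2} over {1,...,4}:
edges = [(1, 3), (1, 4), (2, 3), (2, 4)]
subs = domination_decomposition(edges)
assert len(subs) == 2                        # one subgraph per relevant label
assert sorted(map(sorted, subs)) == [[(1, 3), (1, 4)], [(2, 3), (2, 4)]]
```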
64 Minimizing the domination error is useful for solving multilabel classification problems. [sent-128, score-0.325]
65 In these problems Y is of finite cardinality and every instance xi is associated with a set of correct labels Yi ⊆ Y. [sent-129, score-0.411]
66 In order to reduce this problem to a ranking problem, we construct preference graphs Gi = (Y, Ei ), where Ei contains edges from every vertex in Yi to every vertex in Y \ Yi . [sent-130, score-1.283]
67 In this case, the domination loss simply counts the number of labels in Yi that are not ranked above all of the labels in Y \ Yi . [sent-131, score-0.54]
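The domination loss for this multilabel reduction can be sketched as a direct count (toy scores; all names are ours):

```python
def domination_loss(f, x, relevant, all_labels):
    """Count the relevant labels that are NOT ranked above every
    irrelevant label (the domination loss for the multilabel reduction)."""
    irrelevant = [y for y in all_labels if y not in relevant]
    top_irrelevant = max(f(x, y) for y in irrelevant)  # best-scoring irrelevant
    return sum(1 for r in relevant if f(x, r) <= top_irrelevant)

scores = {1: 3.0, 2: 0.5, 3: 1.0, 4: -2.0}            # illustrative scores
f = lambda x, y: scores[y]
# Label 1 beats every irrelevant label; label 2 does not (f(x,2) <= f(x,3)).
assert domination_loss(f, None, {1, 2}, [1, 2, 3, 4]) == 1
```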
68 The dominated error is proportional to the number of labels with incoming edges that are not ranked below all of the labels that dominate them. [sent-133, score-0.53]
69 Its graph decomposition procedure is depicted at the bottom of Fig. 2. [sent-134, score-0.349]
70 Additional instances of the generalized ranking error exist, and can be tailored to fit most ranking problems. [sent-136, score-1.282]
71 Input: a training set S = {(xi, Gi)}_{i=1}^{m} where each xi ∈ X and each Gi is a preference graph, a decomposition procedure A, and a set of base ranking functions {h1, …, hn}. [sent-140, score-1.094]
72 Initialize: λ1 = (0, …, 0); πi,e,j = hj(xi, term(e)) − hj(xi, init(e)) [1 ≤ i ≤ m, e ∈ Ei, 1 ≤ j ≤ n]; ρ = max_{i,e} Σ_j |πi,e,j|. Iterate: for t = 1, 2, … [sent-147, score-0.302]
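The initialization step above can be sketched as follows. The function name and the dictionary keyed by (example, edge) are our own choices, and the boosting iterations themselves are omitted:

```python
def initialize(S, base_rankers):
    """Compute pi[(i, e)][j] = h_j(x_i, term(e)) - h_j(x_i, init(e)) for
    every example i, edge e, and base ranker j, plus
    rho = max over (i, e) of sum_j |pi[(i, e)][j]|, and the all-zeros
    weight vector lambda_1."""
    n = len(base_rankers)
    lam = [0.0] * n
    pi, rho = {}, 0.0
    for i, (x, edges) in enumerate(S):
        for e in edges:
            init_v, term_v = e
            row = [h(x, term_v) - h(x, init_v) for h in base_rankers]
            pi[(i, e)] = row
            rho = max(rho, sum(abs(v) for v in row))
    return lam, pi, rho

h1 = lambda x, y: float(y)                   # toy base rankers
h2 = lambda x, y: float(x * y)
S = [(2, [(1, 3)])]
lam, pi, rho = initialize(S, [h1, h2])
assert lam == [0.0, 0.0]
assert pi[(0, (1, 3))] == [2.0, 4.0]         # [3 - 1, 6 - 2]
assert rho == 6.0
```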
3 Minimizing the Generalized Ranking Error
73 Our goal is to minimize the generalized error for a given training set S and graph decomposition procedure A. [sent-151, score-0.523]
74 Theorem 1 Let S = {(xi, Gi)}_{i=1}^{m} be a training set such that every xi ∈ X and every Gi is a preference graph. [sent-165, score-0.501]
75 Let A be a graph decomposition procedure that defines for each preference graph Gi a set of subgraphs {Gi,1, …, Gi,si}. [sent-166, score-0.933]
76 Denote by ft the ranking function obtained at iteration t of the algorithm given in Fig. [sent-170, score-0.707]
77 Proof Define ∆t,i,k to be the difference between the loss attained by ft and the loss attained by ft+1 on (xi, Gi,k), that is, ∆t,i,k = L(ft(xi), Gi,k) − L(ft+1(xi), Gi,k), and define φt,i,k = Σ_{e∈Ei,k} exp(λt · πi,e). [sent-174, score-0.517]
4 Experiments
78 To demonstrate our framework, we chose to learn a category ranking problem on a subset of the Reuters Corpus, Vol. 1. [sent-203, score-0.632]
79 It would certainly be correct to categorize an article on government borrowing as either government finance or economics; however, these general categories are less specific and do not describe the article as well. [sent-230, score-0.391]
80 In summary, the category hierarchy induces a preference over the set of labels. [sent-232, score-0.407]
81 We exploit this property to generate supervision for the label ranking problem at hand. [sent-233, score-0.879]
82 Formally, we view every category as a vertex in a rooted tree, where the tree root corresponds to a general abstract category that is relevant to all of the articles in the corpus and every category is a specific instance of its parent in the tree. [sent-234, score-0.743]
83 To fix this inconsistency, we added a dummy child vertex to every inner vertex and diverted all paths that originally end in this inner vertex to its new child. [sent-237, score-0.501]
84 Our learning problem then becomes the problem of ranking leaves. [sent-238, score-0.559]
85 The severity of wrongly categorizing an article by a leaf is proportional to the graph distance between this leaf and the closest correct leaf given in the corpus. [sent-239, score-0.485]
86 The preference graph that encodes this preference is a multi-layer graph where the top layer contains all of the correct labels, the second layer contains all of their sibling vertices in the tree and so on. [sent-240, score-1.087]
87 Every vertex in the multi-layer preference graph has outgoing edges to all vertices in lower layers, but there are no edges between vertices in the same layer. [sent-241, score-0.951]
88 For practical purposes, we conducted experiments using only 3-layer preference graphs generated by collapsing all of the layers below 3 to a single layer. [sent-242, score-0.303]
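The layered construction described above can be sketched as follows (an illustrative fragment; the layer contents are toy values):

```python
def multilayer_preference_graph(layers):
    """Edges from every vertex to all vertices in strictly lower layers;
    no edges within a layer. `layers` is ordered best-first."""
    edges = []
    for k, layer in enumerate(layers):
        lower = [v for lo in layers[k + 1:] for v in lo]
        edges += [(v, u) for v in layer for u in lower]
    return edges

# 3-layer graph: correct labels, their tree siblings, everything else.
g = multilayer_preference_graph([[1], [2, 3], [4, 5]])
assert len(g) == 1 * 4 + 2 * 2               # 4 edges from layer 1, 4 from layer 2
assert (2, 3) not in g and (3, 2) not in g   # no edges within a layer
```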
89 The word counts for each article were used to construct base ranking functions in the following way: for every word w and every category y, let w(xi ) denote the number of appearances of w in the article xi . [sent-244, score-1.043]
90 These words then define 103 · 3200 base ranking functions as shown in Eq. [sent-247, score-0.587]
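The exact functional form referenced above (the elided equation) is not recoverable from this extraction; the following indicator-style construction is only a plausible guess, labeled as such, in which each (word, category) pair yields a base ranker that fires the word count when the candidate label matches the category:

```python
def make_base_ranker(w_count, category):
    """Hypothetical base ranker for a (word, category) pair: returns the
    word count w(x) when the candidate label equals the category, else 0.
    This form is an assumption; the paper's exact equation is not shown here."""
    return lambda x, y: float(w_count(x)) if y == category else 0.0

w = lambda x: x.lower().split().count("finance")     # toy word counter
h = make_base_ranker(w, "government finance")
assert h("finance ministry finance", "government finance") == 2.0
assert h("finance ministry finance", "economics") == 0.0
```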
91 Next, we ran our learning algorithm using each of the 4 graph decomposition procedures discussed above: zero-one, disagreement, domination and dominated. [sent-249, score-0.433]
92 First, these results are not comparable with previous results for multilabel problems using this corpus, since label ranking is a more difficult task. [sent-254, score-0.915]
93 For instance, an average preference graph in the test data has 820 edges, and the error for such a graph equals zero only if every single edge agrees with the ranking function. [sent-255, score-1.416]
94 Second, the experiments clearly indicate that the results obtained by minimizing the domination loss are better than the other ranking losses, no matter what error is used for evaluation. [sent-256, score-0.812]
95 In particular, employing the domination loss yields significantly better results than using the disagreement loss which has been the commonly used decomposition method in categorization problems [7, 10, 6, 4]. [sent-257, score-0.62]
5 Summary
96 We presented a general framework for label ranking problems by means of preference graphs and the graph decomposition procedure. [sent-258, score-1.386]
97 We then described and analyzed a boosting algorithm that works with any choice of graph decomposition. [sent-260, score-0.3]
98 We are currently exporting the approach to learning in inner product spaces, where different graph decomposition procedures result in different bindings of slack variables. [sent-261, score-0.344]
99 Another interesting question is whether the graph decomposition approach can be combined with probabilistic models for orderings [9] to achieve algorithmic efficiency. [sent-262, score-0.319]
100 New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. [sent-266, score-0.587]
same-paper 1 1.0000001 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
2 0.48541766 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
3 0.10501904 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
4 0.10207935 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
5 0.095992692 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
6 0.092069663 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
7 0.086876482 113 nips-2003-Learning with Local and Global Consistency
8 0.085217074 99 nips-2003-Kernels for Structured Natural Language Data
9 0.078032337 30 nips-2003-Approximability of Probability Distributions
10 0.071201451 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
11 0.071044654 168 nips-2003-Salient Boundary Detection using Ratio Contour
12 0.069672689 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
13 0.065254882 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.064842969 1 nips-2003-1-norm Support Vector Machines
15 0.06278722 117 nips-2003-Linear Response for Approximate Inference
16 0.061603956 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
17 0.061206222 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
18 0.061040241 3 nips-2003-AUC Optimization vs. Error Rate Minimization
19 0.060614806 128 nips-2003-Minimax Embeddings
20 0.059635121 48 nips-2003-Convex Methods for Transduction
topicId topicWeight
[(0, -0.197), (1, -0.108), (2, -0.083), (3, -0.04), (4, 0.062), (5, -0.094), (6, -0.159), (7, -0.081), (8, 0.023), (9, -0.127), (10, 0.268), (11, 0.092), (12, 0.14), (13, 0.026), (14, 0.16), (15, -0.229), (16, -0.359), (17, -0.348), (18, -0.084), (19, -0.127), (20, -0.184), (21, 0.031), (22, 0.135), (23, -0.018), (24, 0.137), (25, 0.089), (26, -0.055), (27, -0.011), (28, -0.097), (29, -0.089), (30, 0.014), (31, -0.031), (32, -0.056), (33, -0.089), (34, -0.027), (35, 0.008), (36, -0.021), (37, 0.005), (38, 0.079), (39, 0.044), (40, 0.018), (41, -0.014), (42, 0.027), (43, 0.045), (44, -0.007), (45, 0.043), (46, 0.064), (47, 0.048), (48, -0.007), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96174735 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
2 0.91137493 164 nips-2003-Ranking on Data Manifolds
Author: Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, Bernhard Schölkopf
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method. 1
3 0.459916 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. 
A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. 
We report the results of our experiments with RankBoost on several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms that optimize an approximation of the AUC or optimize it only locally.

2 Definition and properties of the AUC

Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have attracted increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of an ROC curve.

The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification, as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1_X the indicator function of a set X.

Lemma 1 ([8]) Let c be a fixed classifier. Let x_1, \ldots, x_m be the output of c on the positive examples and y_1, \ldots, y_n its output on the negative examples. Then, the AUC, A, associated to c is given by:

    A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i > y_j}    (1)

that is, the value of the Wilcoxon-Mann-Whitney statistic [8].

Proof.
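Lemma 1 can be checked directly. Below is a minimal sketch (not from the paper; the function name is ours) that computes the AUC as the fraction of correctly ordered positive-negative pairs, i.e., the Wilcoxon-Mann-Whitney statistic of Eq. (1):

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic of Eq. (1):
    the fraction of (positive, negative) pairs ranked correctly.
    Assumes a strict ordering (no tied scores), as in Lemma 1."""
    m, n = len(pos_scores), len(neg_scores)
    correct = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return correct / (m * n)

# Perfect ranking: every positive scores above every negative -> A = 1.
print(pairwise_auc([0.9, 0.8], [0.3, 0.1]))   # 1.0
# One inverted pair out of four -> A = 0.75.
print(pairwise_auc([0.9, 0.2], [0.3, 0.1]))   # 0.75
```

Any single swap of a positive below a negative removes one correctly ordered pair, so the AUC decreases by 1/(mn), consistent with the discussion following the lemma.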
The proof is based on the observation that the AUC value is exactly the probability P(X > Y) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8].

Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC.

² An attempt in that direction was made by [15], but, unfortunately, the authors' analysis and the result are both wrong.

Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false positive examples. Above the threshold θ lie m − (k − x) positive and x negative examples; below it, n − x negative and k − x positive examples.

3 The Expected Value of the AUC

In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate.

Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general.
For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 2 shows the corresponding configuration: the two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts:

    A = \frac{A_1 + A_2 + A_3}{mn},    (2)

with

A_1 = the sum over all pairs (x_i, y_j) with x_i and y_j in distinct regions;
A_2 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region above the threshold;
A_3 = the sum over all pairs (x_i, y_j) with x_i and y_j in the region below the threshold.

The first term, A_1, is easy to compute. Since there are m − (k − x) positive examples above the threshold and n − x negative examples below the threshold, A_1 is given by:

    A_1 = (m − (k − x))(n − x)    (3)

To compute A_2, we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α_1 < \ldots < α_x denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A_2,

    A_2 = \sum_{i=1}^{x} \left[ (N − α_i) − (x − i) \right]    (4)

where the first term, N − α_i, represents the number of examples ranked higher than the i-th example and the second term, x − i, discounts the number of negative examples incorrectly ranked higher than the i-th example. Similarly, let α'_1 < \ldots < α'_{k−x} denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A_3 is given by:

    A_3 = \sum_{j=1}^{x'} \left[ (N' − α'_j) − (x' − j) \right]    (5)

with N' = n − x + (k − x) and x' = k − x.
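The three-part decomposition of Eqs. (2)-(5) can be checked numerically against the direct pairwise count of Eq. (1). The sketch below (our own construction, not the paper's code) encodes a ranking as two top-to-bottom label lists, one per region, with positions α counted away from the threshold as in the text:

```python
from itertools import combinations

def auc_parts(above, below):
    """Compute A1 + A2 + A3 of Eqs. (3)-(5) for a ranking given as two
    top-to-bottom lists of labels ('+' or '-'), above and below the
    threshold.  Positions alpha are counted away from the threshold,
    position 1 being nearest to it, per the text."""
    N, Np = len(above), len(below)
    # alpha_i: positions of the negatives above the threshold
    alphas = sorted(N - i for i, lab in enumerate(above) if lab == '-')
    # alpha'_j: positions of the positives below the threshold
    alphas_p = sorted(i + 1 for i, lab in enumerate(below) if lab == '+')
    x, xp = len(alphas), len(alphas_p)
    A1 = (N - x) * (Np - xp)                                  # Eq. (3)
    A2 = sum((N - a) - (x - i) for i, a in enumerate(alphas, 1))     # Eq. (4)
    A3 = sum((Np - a) - (xp - j) for j, a in enumerate(alphas_p, 1)) # Eq. (5)
    return A1 + A2 + A3

def direct_count(above, below):
    """Number of (positive, negative) pairs ranked correctly, Eq. (1) x mn."""
    order = above + below                     # full ranking, top to bottom
    return sum(1 for i, j in combinations(range(len(order)), 2)
               if order[i] == '+' and order[j] == '-')

above = ['+', '-', '+', '-']   # x = 2 false positives above the threshold
below = ['+', '-', '-']        # k - x = 1 false negative below it
print(auc_parts(above, below), direct_count(above, below))   # 9 9
```

Dividing either quantity by mn yields the AUC of Eq. (2).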
Combining the expressions of A_1, A_2, and A_3 leads to:

    A = \frac{A_1 + A_2 + A_3}{mn} = 1 + \frac{(k − 2x)^2 + k}{2mn} − \frac{\sum_{i=1}^{x} α_i + \sum_{j=1}^{x'} α'_j}{mn}    (6)

Lemma 2 For a fixed x, the average value of the AUC, A, is given by:

    \langle A \rangle_x = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (7)

Proof. The proof is based on the computation of the average values of \sum_{i=1}^{x} α_i and \sum_{j=1}^{x'} α'_j for a given x. We start by computing the average value \langle α_i \rangle_x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α_1, \ldots, α_{i−1} and α_{i+1}, \ldots, α_x when the value of α_i is fixed at, say, α_i = l. We have i ≤ l ≤ N − (x − i), since there need to be at least i − 1 positions before α_i and x − i after it. There are \binom{l−1}{i−1} possible choices of positions for α_1, \ldots, α_{i−1} and \binom{N−l}{x−i} for α_{i+1}, \ldots, α_x. Since the total number of ways of choosing the x positions for α_1, \ldots, α_x out of N is \binom{N}{x}, the average value \langle α_i \rangle_x is:

    \langle α_i \rangle_x = \frac{\sum_{l=i}^{N−(x−i)} l \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}}    (8)

Thus,

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{\sum_{i=1}^{x} \sum_{l=i}^{N−(x−i)} l \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}} = \frac{\sum_{l=1}^{N} l \sum_{i=1}^{x} \binom{l−1}{i−1} \binom{N−l}{x−i}}{\binom{N}{x}}    (9)

Using the classical identity \sum_{p_1 + p_2 = p} \binom{u}{p_1} \binom{v}{p_2} = \binom{u+v}{p} and \sum_{l=1}^{N} l = \frac{N(N+1)}{2}, we can write:

    \sum_{i=1}^{x} \langle α_i \rangle_x = \frac{\sum_{l=1}^{N} l \binom{N−1}{x−1}}{\binom{N}{x}} = \frac{N(N+1)}{2} \cdot \frac{\binom{N−1}{x−1}}{\binom{N}{x}} = \frac{x(N + 1)}{2}    (10)

Similarly, we have:

    \sum_{j=1}^{x'} \langle α'_j \rangle_x = \frac{x'(N' + 1)}{2}    (11)

Replacing \sum_{i=1}^{x} \langle α_i \rangle_x and \sum_{j=1}^{x'} \langle α'_j \rangle_x in Eq. (6) by the expressions given by Eq. (10) and Eq. (11) leads to:

    \langle A \rangle_x = 1 + \frac{(k − 2x)^2 + k − x(N + 1) − x'(N' + 1)}{2mn} = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (12)

which ends the proof of the lemma.

Note that Eq. (7) shows that the average AUC value for a given x is simply the average of the per-class accuracies, that is, one minus the average of the per-class error rates for the positive and negative classes.

Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by:

    \langle A \rangle = 1 − \frac{k}{m+n} − \frac{(n − m)^2 (m + n + 1)}{4mn} \left( \frac{k}{m+n} − \frac{\sum_{x=0}^{k−1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right)    (13)

Proof. Lemma 2 gives the average value of the AUC for a fixed value of x.
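Lemma 2 can be verified by brute force for small problem sizes. The sketch below (our own, hypothetical helper name) enumerates every placement of the x false positives above the threshold and the k − x false negatives below it, all equiprobable as in the model of Section 3, and compares the average pairwise AUC with Eq. (7):

```python
from itertools import combinations

def avg_auc_for_x(m, n, k, x):
    """Average pairwise AUC over all equiprobable placements of the
    x false positives above the threshold and the k - x false
    negatives below it (the model of Section 3)."""
    N = m - (k - x) + x           # examples classified as positive
    Np = n - x + (k - x)          # examples classified as negative
    total, count = 0.0, 0
    for neg_pos in combinations(range(N), x):           # negatives above
        above = ['-' if i in neg_pos else '+' for i in range(N)]
        for pos_pos in combinations(range(Np), k - x):  # positives below
            below = ['+' if j in pos_pos else '-' for j in range(Np)]
            order = above + below                       # full ranking
            correct = sum(1 for a, b in combinations(order, 2)
                          if a == '+' and b == '-')
            total += correct / (m * n)
            count += 1
    return total / count

m, n, k, x = 3, 5, 2, 1
print(avg_auc_for_x(m, n, k, x))          # brute-force average, ~0.7333
print(1 - (x / n + (k - x) / m) / 2)      # Lemma 2, Eq. (7): 11/15 ~ 0.7333
```

The two values agree exactly (up to floating-point rounding) for every small (m, n, k, x) we tried.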
To compute the average over all possible values of x, we need to weight the expression of Eq. (7) by the total number of possible classifications for a given x. There are \binom{N}{x} possible ways of choosing the positions of the x misclassified negative examples, and similarly \binom{N'}{x'} possible ways of choosing the positions of the x' = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by:

    \langle A \rangle = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} \left( 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k−x}{m} \right) \right)}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}}    (14)

This expression can be simplified into Eq. (13)³ using the following novel identities:

    \sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \binom{m+n+1}{x}    (15)

    \sum_{x=0}^{k} x \binom{N}{x} \binom{N'}{x'} = \sum_{x=0}^{k} \frac{(k − x)(m − n) + k}{2} \binom{m+n+1}{x}    (16)

that we obtained by using Zeilberger's algorithm⁴ and numerous combinatorial 'tricks'.

Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate (0.0 to 0.5). Each curve corresponds to a fixed ratio r = n/(n + m), for r = 0.01, 0.05, 0.1, 0.25, 0.5. The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m.

From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n ≠ m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0.
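Proposition 1 can be checked numerically by comparing the direct summation of Eq. (14) with the closed form of Eq. (13). The sketch below (function names are ours) assumes k ≤ min(m, n), i.e., every value of x in 0..k yields a valid configuration:

```python
from math import comb

def expected_auc_sum(m, n, k):
    """Expected AUC by direct weighted summation over x, Eq. (14)."""
    num = den = 0.0
    for x in range(k + 1):
        if x > n or k - x > m:         # configuration impossible, skip
            continue
        N = m - (k - x) + x            # examples classified as positive
        Np = n - x + (k - x)           # examples classified as negative
        w = comb(N, x) * comb(Np, k - x)
        num += w * (1 - (x / n + (k - x) / m) / 2)
        den += w
    return num / den

def expected_auc_closed(m, n, k):
    """Closed form of Proposition 1, Eq. (13)."""
    s1 = sum(comb(m + n, x) for x in range(k))          # x = 0 .. k-1
    s2 = sum(comb(m + n + 1, x) for x in range(k + 1))  # x = 0 .. k
    return (1 - k / (m + n)
            - (n - m) ** 2 * (m + n + 1) / (4 * m * n)
              * (k / (m + n) - s1 / s2))

for m, n, k in [(1, 2, 1), (3, 5, 2), (4, 10, 3), (6, 6, 4)]:
    print(expected_auc_sum(m, n, k), expected_auc_closed(m, n, k))
```

The two columns agree; for the balanced case m = n the closed form reduces to the accuracy, 1 − k/(m + n), as stated in the text.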
In other words, on average, there seems to be nothing gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k ≠ 0.

³ An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors' analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false.
⁴ We thank Neil Sloane for having pointed us to Zeilberger's algorithm and Maple package.

4 The Variance of the AUC

Let D = mn + \frac{(k − 2x)^2 + k}{2}, a = \sum_{i=1}^{x} α_i, a' = \sum_{j=1}^{x'} α'_j, and α = a + a'. Then, by Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ²(A), is given by:

    (mn)^2 σ^2(A) = \langle (D − α)^2 \rangle − (\langle D \rangle − \langle α \rangle)^2
                  = \langle D^2 \rangle − \langle D \rangle^2 + \langle α^2 \rangle − \langle α \rangle^2 − 2 (\langle αD \rangle − \langle α \rangle \langle D \rangle)    (17)

As before, to compute the average of a term X over all classifications, we can first determine its average \langle X \rangle_x for a fixed x, and then use the function F defined by:

    F(Y) = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} Y}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}}    (18)

so that \langle X \rangle = F(\langle X \rangle_x). A crucial step in computing the exact value of the variance of the AUC is to determine the value of terms of the type \langle a^2 \rangle_x = \langle (\sum_{i=1}^{x} α_i)^2 \rangle_x.

Lemma 3 For a fixed x, the average of (\sum_{i=1}^{x} α_i)^2 is given by:

    \langle a^2 \rangle_x = \frac{x(N + 1)}{12} (3Nx + 2x + N)    (19)

Proof. By definition of a, \langle a^2 \rangle_x = b + 2c with:

    b = \sum_{i=1}^{x} \langle α_i^2 \rangle_x, \qquad c = \sum_{1 ≤ i < j ≤ x} \langle α_i α_j \rangle_x    (20)
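Lemma 3 is easy to verify by exhaustive enumeration for small N and x. The sketch below (our own check, not the paper's code) averages (\sum_i α_i)^2 over all equiprobable choices of x positions out of {1, \ldots, N} and compares with Eq. (19):

```python
from itertools import combinations

def avg_a_squared(N, x):
    """Average of (sum of alpha_i)^2 over all equiprobable choices of
    x positions out of {1, ..., N}."""
    vals = [sum(c) ** 2 for c in combinations(range(1, N + 1), x)]
    return sum(vals) / len(vals)

N, x = 7, 3
print(avg_a_squared(N, x))                          # brute force: 152.0
print(x * (N + 1) * (3 * N * x + 2 * x + N) / 12)   # Lemma 3, Eq. (19): 152.0
```

The same kind of enumeration also confirms that the variance of the AUC over configurations with k ≠ 0 errors is strictly positive, which is the point of this section.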
4 0.31125423 30 nips-2003-Approximability of Probability Distributions
Author: Alina Beygelzimer, Irina Rish
Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1
5 0.30497909 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.2935915 99 nips-2003-Kernels for Structured Natural Language Data
7 0.28304222 168 nips-2003-Salient Boundary Detection using Ratio Contour
8 0.26945168 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
9 0.2649748 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
10 0.24016081 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
11 0.23650774 118 nips-2003-Link Prediction in Relational Data
12 0.22716638 124 nips-2003-Max-Margin Markov Networks
13 0.22633079 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
14 0.22247082 122 nips-2003-Margin Maximizing Loss Functions
15 0.19990005 108 nips-2003-Learning a Distance Metric from Relative Comparisons
16 0.19936776 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
17 0.19607919 48 nips-2003-Convex Methods for Transduction
18 0.19451506 128 nips-2003-Minimax Embeddings
19 0.19412397 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
20 0.19169939 146 nips-2003-Online Learning of Non-stationary Sequences
topicId topicWeight
[(0, 0.053), (11, 0.031), (29, 0.01), (30, 0.014), (35, 0.048), (53, 0.066), (71, 0.104), (76, 0.035), (77, 0.3), (85, 0.074), (91, 0.117), (99, 0.053)]
simIndex simValue paperId paperTitle
1 0.81839716 123 nips-2003-Markov Models for Automated ECG Interval Analysis
Author: Nicholas P. Hughes, Lionel Tarassenko, Stephen J. Roberts
Abstract: We examine the use of hidden Markov and hidden semi-Markov models for automatically segmenting an electrocardiogram waveform into its constituent waveform features. An undecimated wavelet transform is used to generate an overcomplete representation of the signal that is more appropriate for subsequent modelling. We show that the state durations implicit in a standard hidden Markov model are ill-suited to those of real ECG features, and we investigate the use of hidden semi-Markov models for improved state duration modelling. 1
same-paper 2 0.8183859 121 nips-2003-Log-Linear Models for Label Ranking
Author: Ofer Dekel, Yoram Singer, Christopher D. Manning
Abstract: Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the label-set, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boosting-based learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a large-scale text corpus. 1
3 0.53650254 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.53368437 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
5 0.53026652 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
Author: Alan Fern, Sungwook Yoon, Robert Givan
Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1
6 0.52746761 78 nips-2003-Gaussian Processes in Reinforcement Learning
7 0.52726847 189 nips-2003-Tree-structured Approximations by Expectation Propagation
8 0.52579081 117 nips-2003-Linear Response for Approximate Inference
9 0.52562308 41 nips-2003-Boosting versus Covering
10 0.52385467 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
11 0.5237906 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
12 0.52340078 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
13 0.52298456 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.52252817 143 nips-2003-On the Dynamics of Boosting
15 0.52203119 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
16 0.52134347 30 nips-2003-Approximability of Probability Distributions
17 0.52082366 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
18 0.52037233 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
19 0.51990062 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
20 0.51980603 100 nips-2003-Laplace Propagation