jmlr jmlr2005 jmlr2005-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, Dan Roth
Abstract: We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different quantity from the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter. Keywords: generalization bounds, area under the ROC curve, ranking, large deviations, uniform convergence ∗. Parts of the results contained in this paper were presented at the 18th Annual Conference on Neural Information Processing Systems in December, 2004 (Agarwal et al., 2005a) and at the 10th International Workshop on Artificial Intelligence and Statistics in January, 2005 (Agarwal et al., 2005b).
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. [sent-10, score-0.757]
2 Introduction In many learning problems, the goal is not simply to classify objects into one of a fixed number of classes; instead, a ranking of objects is desired. [sent-23, score-0.475]
3 In such problems, one wants to return to the user a list of documents that contains relevant documents at the top and irrelevant documents at the bottom; in other words, one wants a ranking of the documents such that relevant documents are ranked higher than irrelevant documents. [sent-25, score-0.47]
4 The problem of ranking has been studied from a learning perspective under a variety of settings (Cohen et al. [sent-26, score-0.433]
5 Here we consider the setting in which objects come from two categories, positive and negative; the learner is given examples of objects labeled as positive or negative, and the goal is to learn a ranking in which positive objects are ranked higher than negative ones. [sent-30, score-0.533]
6 This form of ranking problem corresponds to the ‘bipartite feedback’ case of Freund et al. [sent-32, score-0.433]
7 (2003); for this reason, we refer to it as the bipartite ranking problem. [sent-33, score-0.71]
8 Formally, the setting of the bipartite ranking problem is similar to that of the binary classification problem. [sent-34, score-0.71]
9 On the other hand, in ranking, one seeks a real-valued function f : X → R that induces a ranking over X ; an instance that is assigned a higher value by f is ranked higher than one that is assigned a lower value by f . [sent-42, score-0.47]
10 Intuitively, a good classification function should classify most instances correctly, while a good ranking function should rank most instances labeled as positive higher than most instances labeled as negative. [sent-44, score-0.537]
11 1 However, despite the apparently close relation between classification and ranking, on formalizing the above intuitions about evaluation criteria for classification and ranking functions, it turns out that a good classification function may not always translate into a good ranking function. [sent-47, score-0.866]
12 (2000) the problem of learning a ranking function is also reduced to a classification problem, but on pairs of instances. [sent-50, score-0.433]
13 2 Evaluation of (Bipartite) Ranking Functions Evaluating a ranking function has proved to be somewhat more difficult. [sent-65, score-0.433]
14 Another empirical quantity that has recently gained some attention as being well-suited for evaluating ranking functions relates to receiver operating characteristic (ROC) curves. [sent-68, score-0.455]
15 Given a ranking function f : X → R and a finite data sequence T = ((x_1, y_1), \dots, (x_N, y_N)) ∈ (X × Y)^N. [sent-70, score-0.475]
16 It is the area under the ROC curve (AUC) that has been used as an indicator of the quality of the ranking function f (Cortes and Mohri, 2004; Rosset, 2004). [sent-79, score-0.458]
17 An AUC value of one corresponds to a perfect ranking on the given data sequence (i.e., every positive example is ranked higher than every negative example). [sent-80, score-0.475]
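The empirical AUC described here is just the fraction of positive-negative pairs ranked correctly, with ties credited 1/2. The sketch below computes it directly; the function name and the NumPy-based implementation are illustrative choices, not taken from the paper.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly, ties counted 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == +1]   # scores of examples labeled positive
    neg = scores[labels == -1]   # scores of examples labeled negative
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]           # all m*n pairwise differences
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Example: a perfect ranking gives AUC = 1.0.
print(empirical_auc([0.9, 0.8, 0.2, 0.1], [+1, +1, -1, -1]))  # 1.0
```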
18 The first is that the error rate of a classification function is not necessarily a good indicator of the AUC of a ranking function derived from it; different classification functions with the same error rate may produce ranking functions with very different AUC values. [sent-88, score-0.952]
19 The exact relationship between the (empirical) error rate of a classification function h of the form h(x) = θ(f_h(x)) and the AUC value of the corresponding ranking function f_h with respect to a given data sequence was studied in detail by Cortes and Mohri (2004). [sent-91, score-0.562]
20 The corresponding classification functions have the same (empirical) error rate, but the AUC values of the ranking functions are different. [sent-98, score-0.433]
21 The second important observation about the AUC is that, as defined above, it is an empirical quantity that evaluates a ranking function with respect to a particular data sequence. [sent-100, score-0.455]
22 What does the empirical AUC tell us about the expected performance of a ranking function on future examples? [sent-101, score-0.433]
23 First, what can be said about the expected performance of a ranking function based on its empirical AUC on an independent test sequence? [sent-104, score-0.433]
24 Second, what can be said about the expected performance of a learned ranking function based on its empirical AUC on the training sequence from which it is learned? [sent-105, score-0.507]
25 We start by defining the expected ranking accuracy of a ranking function (analogous to the expected error rate of a classification function) in Section 2. [sent-108, score-0.934]
26 Section 3 contains our large deviation result, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. [sent-109, score-0.568]
27 Section 4 contains our uniform convergence result, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. [sent-112, score-0.592]
28 We define below a quantity that we term the expected ranking accuracy; the purpose of this quantity will be to serve as an evaluation criterion for ranking functions (analogous to the use of the expected error rate as an evaluation criterion for classification functions). [sent-147, score-0.953]
29 Definition 1 (Expected ranking accuracy) Let f : X →R be a ranking function on X . [sent-148, score-0.866]
30 Define the expected ranking accuracy (or simply ranking accuracy) of f, denoted by A(f), as follows: $A(f) = \mathbf{E}_{X \sim \mathcal{D}_{+1},\, X' \sim \mathcal{D}_{-1}}\left[ I_{\{f(X) > f(X')\}} + \frac{1}{2} I_{\{f(X) = f(X')\}} \right]$. [sent-149, score-0.891]
31 As in the case of classification, the true ranking accuracy depends on the underlying distribution of the data and cannot be observed directly. [sent-151, score-0.458]
32 Our goal shall be to derive generalization bounds that allow the true accuracy of a ranking function to be estimated from its empirical AUC with respect to a finite data sample. [sent-152, score-0.497]
33 The following simple lemma shows that this makes sense, for, given a fixed label sequence, the empirical AUC of a ranking function f is an unbiased estimator of the expected ranking accuracy of f: Lemma 2 Let f : X → R be a ranking function on X, and let y = (y_1, \dots, y_N) ∈ Y^N. [sent-153, score-1.355]
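Lemma 2 can be checked with a quick simulation. In the sketch below, the distributions, sample sizes, and the identity ranking function are all hypothetical choices made for illustration: the average empirical AUC over many test sequences with a fixed label pattern matches a Monte Carlo estimate of A(f).

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(pos, neg):
    # Pairwise AUC with ties counted 1/2.
    d = pos[:, None] - neg[None, :]
    return float(np.mean((d > 0) + 0.5 * (d == 0)))

# Hypothetical setup: f is the identity on R; positives ~ N(1,1), negatives ~ N(0,1).
m, n, trials = 20, 30, 2000
aucs = [auc(rng.normal(1, 1, m), rng.normal(0, 1, n)) for _ in range(trials)]

# A(f) = P(f(X) > f(X')) for X ~ D_+1, X' ~ D_-1, estimated from independent pairs.
pairs = rng.normal(1, 1, 10**6) - rng.normal(0, 1, 10**6)
print(np.mean(aucs), np.mean(pairs > 0))  # both ~0.76 = Phi(1/sqrt(2))
```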
34 Note that, since the AUC of a ranking function f with respect to a data sequence T ∈ (X × Y)^N is independent of the actual ordering of examples in the sequence, our results involving the conditional distribution D_{T_X | T_Y = y} for some label sequence y = (y_1, \dots, y_N) ∈ Y^N. [sent-162, score-0.548]
35 We are now ready to present the main results of this paper, namely, a large deviation bound in Section 3 and a uniform convergence bound in Section 4. [sent-170, score-0.212]
36 Large Deviation Bound for the AUC In this section we are interested in bounding the probability that the empirical AUC of a ranking function f with respect to a (random) test sequence T will have a large deviation from its expected ranking accuracy. [sent-173, score-0.974]
37 Our main tool in deriving such a large deviation bound will be the following powerful concentration inequality of McDiarmid (1989), which bounds the deviation of any function of a sample for which a single change in the sample has limited effect: Theorem 3 (McDiarmid, 1989) Let X_1, \dots, X_N be independent random variables. [sent-175, score-0.236]
38 (Eq. (6) defines the positive skew as the fraction of positive labels, $\rho(y) = \frac{1}{N} \sum_{i : y_i = +1} 1$.) The following is the main result of this section: Theorem 5 Let f : X → R be a fixed ranking function on X and let y = (y_1, \dots, y_N) ∈ Y^N. [sent-208, score-0.433]
39 From Theorem 5, we can derive a confidence interval interpretation of the bound that gives, for any 0 < δ ≤ 1, a confidence interval based on the empirical AUC of a ranking function (on a random test sequence) which is likely to contain the true ranking accuracy with probability at least 1 − δ. [sent-248, score-1.012]
40 More specifically, we have: Corollary 6 Let f : X → R be a fixed ranking function on X and let y = (y_1, \dots, y_N) ∈ Y^N. [sent-249, score-0.433]
41 Theorem 5 also allows us to obtain an expression for a test sample size that is sufficient to obtain, for given 0 < ε, δ ≤ 1, an ε-accurate estimate of the ranking accuracy with δ-confidence: Corollary 7 Let f : X →R be a fixed ranking function on X and let 0 < ε, δ ≤ 1. [sent-256, score-0.891]
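A minimal sketch of how Corollaries 6 and 7 would be used in practice, assuming the McDiarmid-based tail bound takes the form P(|Â − A| ≥ ε) ≤ 2 exp(−2mnε²/(m+n)) as reconstructed from the surrounding discussion; the exact constants should be checked against the original statements.

```python
import math

def auc_ci_halfwidth(m, n, delta):
    """Eps such that |empirical AUC - A(f)| <= eps with prob. >= 1 - delta
    (Corollary 6 flavour), assuming the tail bound 2*exp(-2*m*n*eps^2/(m+n))."""
    return math.sqrt((m + n) * math.log(2.0 / delta) / (2.0 * m * n))

def sufficient_test_size(eps, delta, rho):
    """Test size N sufficient for an eps-accurate, delta-confident AUC estimate
    at positive skew rho = m/N (Corollary 7 flavour)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * rho * (1.0 - rho) * eps ** 2))

print(auc_ci_halfwidth(m=100, n=900, delta=0.05))           # ~0.143
print(sufficient_test_size(eps=0.05, delta=0.05, rho=0.1))  # ~8198
```

The 1/(ρ(1 − ρ)) factor inside `sufficient_test_size` is exactly the blow-up, relative to the corresponding error-rate bound, discussed around Figure 2 below.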
42 The confidence interval of Corollary 6 can in fact be generalized to remove the conditioning on the label vector completely: Theorem 8 Let f : X →R be a fixed ranking function on X and let N ∈ N. [sent-263, score-0.491]
43 Let f : X → R be a fixed ranking function on X and let y = (y_1, \dots, y_N) ∈ Y^N. [sent-276, score-0.433]
44 Comparing Eq. (16) to the bound of Theorem 5, we see that the AUC bound differs from the error rate bound by a factor of ρ(y)(1 − ρ(y)) in the exponent. [sent-310, score-0.175]
45 Figure 2: The test sample size bound for the AUC, for positive skew ρ ≡ ρ(y) for some label sequence y, is larger than the corresponding test sample size bound for the error rate by a factor of 1/(ρ(1 − ρ)) (see text for discussion). [sent-361, score-0.235]
46 3 Bound for Learned Ranking Functions Chosen from Finite Function Classes The large deviation result of Theorem 5 bounds the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. [sent-363, score-0.563]
47 More specifically, we have: Theorem 10 Let F be a finite class of real-valued functions on X and let fS ∈ F denote the ranking function chosen by a learning algorithm based on the training sequence S. [sent-365, score-0.475]
48 Corollary 12 Let F be a finite class of real-valued functions on X and let fS ∈ F denote the ranking function chosen by a learning algorithm based on the training sequence S. [sent-378, score-0.475]
49 Theorem 13 Let F be a finite class of real-valued functions on X and let fS ∈ F denote the ranking function chosen by a learning algorithm based on the training sequence S. [sent-384, score-0.475]
50 The above results apply only to ranking functions learned from finite function classes. [sent-387, score-0.465]
51 The general case, when the learned ranking function may be chosen from a possibly infinite function class, is the subject of the next section. [sent-388, score-0.465]
52 Our uniform convergence result for the AUC is expressed in terms of a new set of combinatorial parameters, termed the bipartite rank-shatter coefficients, that we define below. [sent-394, score-0.335]
53 1 Bipartite Rank-Shatter Coefficients We define first the notion of a bipartite rank matrix; this is used in our definition of bipartite rank-shatter coefficients. [sent-397, score-0.603]
54 Definition 14 (Bipartite rank matrix) Let f : X → R be a ranking function on X, let m, n ∈ N, and let x = (x_1, \dots, x_m) ∈ X^m and x' = (x'_1, \dots, x'_n) ∈ X^n. [sent-398, score-0.459]
55 Define the bipartite rank matrix of f with respect to x, x', denoted by B_f(x, x'), to be the matrix in $\{0, \frac{1}{2}, 1\}^{m \times n}$ whose (i, j)-th element is given by $[B_f(x, x')]_{ij} = I_{\{f(x_i) > f(x'_j)\}} + \frac{1}{2} I_{\{f(x_i) = f(x'_j)\}}$ (18) for all i ∈ {1, \dots, m} and j ∈ {1, \dots, n}. [sent-405, score-0.303]
56 Define the (m, n)-th bipartite rank-shatter coefficient of F, denoted by r(F, m, n), as follows: $r(\mathcal{F}, m, n) = \max_{x \in X^m,\, x' \in X^n} \left| \left\{ B_f(x, x') \mid f \in \mathcal{F} \right\} \right|$. [sent-413, score-0.277]
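Definitions 14 and 15 can be made concrete with a brute-force enumeration. The sketch below searches only a small grid of candidate points rather than all of X^m × X^n, so it yields a lower bound on r(F, m, n); the helper names and the toy function class are assumptions made for illustration.

```python
import itertools
import numpy as np

def bipartite_rank_matrix(f, x, xp):
    """Definition 14: entry (i, j) is 1, 1/2 or 0 as f(x_i) >, =, < f(x'_j)."""
    fx = np.array([f(v) for v in x])
    fxp = np.array([f(v) for v in xp])
    d = fx[:, None] - fxp[None, :]
    return tuple(map(tuple, (d > 0) + 0.5 * (d == 0)))  # hashable nested tuples

def rank_shatter_coefficient(fs, points, m, n):
    """Brute-force count of distinct rank matrices over a grid of sample points.

    Only a lower bound on Definition 15, since the true coefficient maximizes
    over all of X^m x X^n rather than the supplied grid.
    """
    best = 0
    for x in itertools.product(points, repeat=m):
        for xp in itertools.product(points, repeat=n):
            best = max(best, len({bipartite_rank_matrix(f, x, xp) for f in fs}))
    return best

# Flin(1) on a tiny grid: w > 0, w < 0 and w = 0 give exactly three behaviours.
fs = [lambda v: v, lambda v: -v, lambda v: 0.0]
print(rank_shatter_coefficient(fs, points=[0.0, 1.0, 2.0], m=2, n=2))  # 3
```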
57 In fact, not all 3^{mn} matrices in {0, 1/2, 1}^{m×n} can be realized as bipartite rank matrices. [sent-416, score-0.349]
58 Therefore, we have r(F, m, n) ≤ ψ(m, n), where ψ(m, n) is the number of matrices in {0, 1/2, 1}^{m×n} that can be realized as a bipartite rank matrix. [sent-417, score-0.349]
59 We show that a matrix B ∈ {0, 1/2, 1}^{m×n} can be realized as a bipartite rank matrix if and only if the corresponding bipartite graph G(B) ∈ G(m, n) is acyclic. [sent-436, score-0.626]
60 Then, from the definition of a bipartite rank matrix, we get f(x_{i_1}) < f(x'_{j_1}) = f(x_{i_2}) = f(x'_{j_2}) = ⋯ [sent-441, score-0.303]
61 Recall that a matrix B ∈ {0, 1/2, 1}^{m×n} can be realized as a bipartite rank matrix if and only if the corresponding bipartite graph G(B) ∈ G(m, n) is acyclic; the desired result then follows by Part 1 of the theorem. [sent-462, score-0.277]
62 Since G(B) is a complete bipartite graph, there must be an edge between these vertices. [sent-473, score-0.277]
63 We discuss further properties of the bipartite rank-shatter coefficients in Section 4. [sent-485, score-0.277]
64 Then for any 0 < δ ≤ 1, $P_{S \sim \mathcal{D}^M}\left( \sup_{f \in \mathcal{F}} \left| \hat{A}(f; S) - A(f) \right| \ge \sqrt{\frac{8 \left( \ln r\left(\mathcal{F}, 2\rho(S_Y)M, 2(1 - \rho(S_Y))M\right) + \ln \frac{4}{\delta} \right)}{\rho(S_Y)(1 - \rho(S_Y))M}} \right) \le \delta$. [sent-510, score-0.186]
65 As shown earlier, we have r(F, m, n) ≤ ψ(m, n), where ψ(m, n) is the number of matrices in {0, 1/2, 1}^{m×n} that can be realized as a bipartite rank matrix. [sent-513, score-0.349]
66 (To see this, note that the number of distinct bipartite rank matrices of size m × n is bounded above by the total number of permutations of (m + n) objects, allowing for objects to be placed at the same position.) [sent-515, score-0.324]
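Assuming the count alluded to in the parenthetical is the number of rankings of m + n objects that allow ties (the ordered Bell numbers, an assumption about the intended combinatorial object), a short computation shows how much smaller ψ(m, n) is than 3^{mn}:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def weak_orderings(k):
    """Rankings of k objects allowing ties (ordered Bell / Fubini numbers)."""
    if k == 0:
        return 1
    # Choose which j objects share the top position, then rank the rest.
    return sum(comb(k, j) * weak_orderings(k - j) for j in range(1, k + 1))

m, n = 3, 3
print(weak_orderings(m + n), 3 ** (m * n))  # 4683 vs. 19683
```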
67 Next we define a series of Y ∗ -valued function classes derived from a given ranking function class. [sent-531, score-0.433]
68 Below we make use of the above result to derive polynomial upper bounds on the bipartite rank-shatter coefficients for linear and higher-order polynomial ranking functions. [sent-546, score-0.772]
69 Lemma 23 For d ∈ N, let Flin(d) denote the class of linear ranking functions on R^d: Flin(d) = { f : R^d → R | f(x) = w · x + b for some w ∈ R^d, b ∈ R }. [sent-548, score-0.433]
70 Theorem 24 For d ∈ N, let Flin(d) denote the class of linear ranking functions on R^d (defined in Lemma 23 above). [sent-564, score-0.433]
71 Lemma 25 For d, q ∈ N, let Fpoly(d,q) denote the class of polynomial ranking functions on R^d with degree less than or equal to q. [sent-567, score-0.433]
72 Theorem 26 For d, q ∈ N, let Fpoly(d,q) denote the class of polynomial ranking functions on R^d with degree less than or equal to q. [sent-585, score-0.433]
73 Freund et al. (2003) recently derived a uniform convergence bound for a quantity closely related to the AUC, namely the ranking loss for the bipartite ranking problem. [sent-592, score-0.69]
74 As pointed out by Cortes and Mohri (2004), the bipartite ranking loss is equal to one minus the AUC; the uniform convergence bound of Freund et al. [sent-593, score-0.812]
75 Then for any 0 < δ ≤ 1, $P_{S_X | S_Y = y}\left( \sup_{f \in \mathcal{F}} \left| \hat{A}(f; S) - A(f) \right| \ge 2\sqrt{\frac{\ln s(\check{\mathcal{F}}, 2m) + \ln \frac{12}{\delta}}{m}} + 2\sqrt{\frac{\ln s(\check{\mathcal{F}}, 2n) + \ln \frac{12}{\delta}}{n}} \right) \le \delta$, where $\check{\mathcal{F}}$ is the class of Y*-valued functions on X defined by Eq. [sent-608, score-0.29]
76 As in the AUC definition of (Cortes and Mohri, 2004), the ranking loss defined in (Freund et al. [sent-611, score-0.433]
77 Specifically, consider the function class Flin(1) of linear ranking functions on R, given by Flin(1) = { f : R→R | f (x) = wx + b for some w ∈ R, b ∈ R} . [sent-622, score-0.433]
78 (To see this, note that for any set of m + n distinct points in R, one can obtain exactly three different ranking behaviours with functions in Flin(1): one by setting w > 0, another by setting w < 0, and the third by setting w = 0.) [sent-624, score-0.433]
79 We thus get from our result (Corollary 18) that $P_{S_X | S_Y = y}\left( \sup_{f \in F_{\mathrm{lin}(1)}} \left| \hat{A}(f; S) - A(f) \right| \ge \sqrt{\frac{8(m + n)\left(\ln 3 + \ln \frac{4}{\delta}\right)}{mn}} \right) \le \delta$, and from the result of Freund et al. [sent-627, score-0.261]
80 (Theorem 27) that $P_{S_X | S_Y = y}\left( \sup_{f \in F_{\mathrm{lin}(1)}} \left| \hat{A}(f; S) - A(f) \right| \ge 2\sqrt{\frac{\ln(8m + 1) + \ln \frac{12}{\delta}}{m}} + 2\sqrt{\frac{\ln(8n + 1) + \ln \frac{12}{\delta}}{n}} \right) \le \delta$. [sent-628, score-0.186]
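Plugging numbers into the two Flin(1) bounds, with the constants as reconstructed in the two displays above (an assumption worth checking against the original statements), makes the gap concrete:

```python
import math

def our_bound(m, n, delta):
    # Corollary 18 with r(Flin(1), 2m, 2n) = 3, constants as reconstructed above.
    return math.sqrt(8 * (m + n) * (math.log(3) + math.log(4 / delta)) / (m * n))

def freund_bound(m, n, delta):
    # Theorem 27 with the shatter-coefficient bounds s(F, 2m) <= 8m + 1 etc.
    return (2 * math.sqrt((math.log(8 * m + 1) + math.log(12 / delta)) / m)
            + 2 * math.sqrt((math.log(8 * n + 1) + math.log(12 / delta)) / n))

for m, n in [(100, 100), (1000, 1000), (10000, 10000)]:
    print(m, n, round(our_bound(m, n, 0.01), 3), round(freund_bound(m, n, 0.01), 3))
# The first bound is smaller at every size, matching the "considerably tighter" claim.
```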
81 For each training sequence, a linear ranking function in Flin(16) was learned using the RankBoost algorithm of Freund et al. [sent-641, score-0.465]
82 (2003) for the class of linear ranking functions on R. [sent-689, score-0.433]
83 The training AUC of the learned ranking function, its AUC on the independent test sequence, and the lower bound on its expected ranking accuracy obtained from our uniform convergence result (using Corollary 18, at a confidence level δ = 0. [sent-695, score-1.025]
84 [Figure: lower bound on expected ranking accuracy plotted against training sample size (ticks 0–600).] [sent-716, score-0.502]
85 Conclusion and Open Questions We have derived generalization bounds for the area under the ROC curve (AUC), a quantity used as an evaluation criterion for the bipartite ranking problem. [sent-726, score-0.796]
86 Our uniform convergence bound for the AUC is expressed in terms of a new set of combinatorial parameters that we have termed the bipartite rank-shatter coefficients. [sent-731, score-0.379]
87 For the case of linear ranking functions on R, for which we could compute the bipartite rank-shatter coefficients exactly, we have shown that our uniform convergence bound is considerably tighter than a recent uniform convergence bound derived by Freund et al. [sent-733, score-0.958]
88 This suggests that the bipartite rank-shatter coefficients we have introduced may be a more appropriate complexity measure for studying the bipartite ranking problem. [sent-735, score-0.987]
89 However, in order to take advantage of our results, one needs to be able to characterize these coefficients for the class of ranking functions of interest. [sent-736, score-0.433]
90 The biggest open question that arises from our study is, for what other function classes F can the bipartite rank-shatter coefficients r(F , m, n) be characterized? [sent-737, score-0.277]
91 We have derived in Theorem 22 a general upper bound on the bipartite rank-shatter coefficients of a function class F in terms of the standard shatter coefficients of the function class F̃ (see Eq. (21)). [sent-738, score-0.403]
92 This allows us to establish a polynomial upper bound on the bipartite rank-shatter coefficients for linear and higher-order polynomial ranking functions on R^d and other algebraically well-behaved function classes. [sent-739, score-0.754]
93 First, can we establish analogous complexity measures and generalization bounds for other forms of ranking problems (i.e., beyond the bipartite setting)? [sent-743, score-0.472]
94 Therefore, using U(D) to denote the uniform distribution over a discrete set D, we have the following: $P_{S_X \tilde{S}_X | S_Y = y}\left( \sup_{f \in \mathcal{F}} \left| \hat{A}(f; S) - \hat{A}(f; \tilde{S}) \right| \ge \frac{\varepsilon}{2} \right) = P_{S_X \tilde{S}_X | S_Y = y}\left( \sup_{f \in \mathcal{F}} \left| \beta_f(\tilde{X}_1, \dots) \right| \ge \frac{\varepsilon}{2} \right) = \cdots$ [sent-835, score-0.201]
95 From the definition of bipartite rank matrices (Definition 14), it follows that for any x, x̃ ∈ X^M, as f ranges over F, the number of different random variables β_f(σ(x_1), … [sent-875, score-0.303]
96 , σ(x̃_M)) is at most the number of different bipartite rank matrices B_f(z, z') that can be realized by functions in F, where z ∈ X^{2m} contains x_i, x̃_i for i : y_i = +1 and z' ∈ X^{2n} contains x_j, x̃_j for j : y_j = −1. [sent-881, score-0.395]
97 This number, by definition, cannot exceed r(F , 2m, 2n) (see the definition of bipartite rank-shatter coefficients, Definition 15). [sent-882, score-0.277]
98 x̃_k, if W_k = x_k. [sent-911, score-0.234]
99 Using the notation L̂*(·; ·) for the empirical ℓ*-error and L*_D(h) for its expectation, we thus get $\sup_{f \in \mathcal{F}} \left| \hat{A}(f; S) - A(f) \right| \le \sup_{\check{f}_z \in \check{\mathcal{F}}} \left| \hat{L}^*(\check{f}_z; S_{-1}^{(n)}) - L^*_{\mathcal{D}_{-1}}(\check{f}_z) \right| + \sup_{\check{f}_z \in \check{\mathcal{F}}} \left| \hat{L}^*(\check{f}_z; S_{+1}^{(m)}) - L^*_{\mathcal{D}_{+1}}(\check{f}_z) \right|$ (28), where $S_{+1}^{(m)}$ and $S_{-1}^{(n)}$ denote the subsequences of S containing the m positive and n negative examples, respectively. [sent-953, score-0.246]
100 From Eq. (27), we have $P_{S_{+1}^{(m)} \sim \mathcal{D}_{+1}^m}\left( \sup_{\check{f}_z \in \check{\mathcal{F}}} \left| \hat{L}^*(\check{f}_z; S_{+1}^{(m)}) - L^*_{\mathcal{D}_{+1}}(\check{f}_z) \right| \ge 2\sqrt{\frac{\ln s(\check{\mathcal{F}}, 2m) + \ln \frac{12}{\delta}}{m}} \right) \le \frac{\delta}{2}$ (29), and similarly $P_{S_{-1}^{(n)} \sim \mathcal{D}_{-1}^n}\left( \sup_{\check{f}_z \in \check{\mathcal{F}}} \left| \hat{L}^*(\check{f}_z; S_{-1}^{(n)}) - L^*_{\mathcal{D}_{-1}}(\check{f}_z) \right| \ge 2\sqrt{\frac{\ln s(\check{\mathcal{F}}, 2n) + \ln \frac{12}{\delta}}{n}} \right) \le \frac{\delta}{2}$. [sent-955, score-0.29]
wordName wordTfidf (topN-words)
[('auc', 0.503), ('ranking', 0.433), ('bipartite', 0.277), ('sy', 0.255), ('fs', 0.172), ('agarwal', 0.165), ('psx', 0.158), ('sx', 0.145), ('roc', 0.131), ('flin', 0.126), ('eled', 0.12), ('raepel', 0.12), ('rea', 0.12), ('urve', 0.12), ('xk', 0.117), ('ty', 0.103), ('nder', 0.101), ('erbrich', 0.101), ('roth', 0.095), ('mcdiarmid', 0.084), ('shatter', 0.082), ('sup', 0.082), ('eneralization', 0.081), ('ounds', 0.075), ('mn', 0.075), ('xm', 0.072), ('freund', 0.069), ('chebyshev', 0.068), ('ptx', 0.068), ('deviation', 0.066), ('ar', 0.066), ('fpoly', 0.06), ('mohri', 0.06), ('cients', 0.059), ('dence', 0.057), ('coef', 0.057), ('ln', 0.052), ('rd', 0.048), ('cortes', 0.047), ('realized', 0.046), ('confidence', 0.045), ('bound', 0.044), ('ex', 0.043), ('rate', 0.043), ('sequence', 0.042), ('bounds', 0.039), ('yn', 0.037), ('ranked', 0.037), ('uniform', 0.037), ('devroye', 0.035), ('xn', 0.034), ('yk', 0.033), ('learned', 0.032), ('skew', 0.031), ('ld', 0.031), ('thore', 0.031), ('labels', 0.031), ('label', 0.031), ('ties', 0.03), ('etx', 0.03), ('exy', 0.03), ('sariel', 0.03), ('shivani', 0.03), ('herbrich', 0.029), ('graepel', 0.028), ('interval', 0.027), ('instances', 0.026), ('rank', 0.026), ('width', 0.025), ('intervals', 0.025), ('area', 0.025), ('accuracy', 0.025), ('ps', 0.025), ('ralf', 0.025), ('ym', 0.024), ('hi', 0.023), ('xi', 0.023), ('contain', 0.023), ('dtx', 0.023), ('esx', 0.023), ('ety', 0.023), ('fpri', 0.023), ('matou', 0.023), ('rankboost', 0.023), ('rankshatter', 0.023), ('tpri', 0.023), ('uiuc', 0.023), ('vik', 0.023), ('quantity', 0.022), ('classi', 0.022), ('fh', 0.022), ('vertices', 0.022), ('theorem', 0.022), ('convergence', 0.021), ('deriving', 0.021), ('tighter', 0.021), ('objects', 0.021), ('cycle', 0.021), ('ck', 0.021), ('wm', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
3 0.050008841 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
4 0.047868989 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 10^4 to 10^7 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple least-squares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence shows that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
6 0.040158655 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
7 0.039913118 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
8 0.034655858 22 jmlr-2005-Concentration Bounds for Unigram Language Models
9 0.033818122 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
10 0.03156814 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
11 0.029528107 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
12 0.029332332 71 jmlr-2005-Variational Message Passing
13 0.027402677 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
14 0.025889207 41 jmlr-2005-Kernel Methods for Measuring Independence
15 0.025598379 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
16 0.025535157 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.025497377 29 jmlr-2005-Efficient Margin Maximizing with Boosting
18 0.025234414 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
19 0.025135601 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
20 0.025036143 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
topicId topicWeight
[(0, 0.147), (1, -0.06), (2, 0.033), (3, 0.103), (4, 0.006), (5, -0.124), (6, 0.019), (7, -0.047), (8, -0.012), (9, -0.065), (10, -0.036), (11, -0.004), (12, -0.097), (13, -0.014), (14, -0.036), (15, -0.01), (16, 0.156), (17, 0.058), (18, -0.085), (19, -0.101), (20, 0.061), (21, 0.082), (22, 0.15), (23, 0.169), (24, 0.294), (25, 0.406), (26, -0.204), (27, -0.226), (28, -0.04), (29, -0.311), (30, -0.113), (31, 0.049), (32, 0.123), (33, -0.102), (34, 0.018), (35, -0.002), (36, 0.329), (37, 0.228), (38, -0.06), (39, 0.143), (40, 0.186), (41, 0.068), (42, 0.179), (43, 0.049), (44, 0.03), (45, -0.062), (46, -0.05), (47, 0.038), (48, 0.095), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95716941 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
2 0.17691886 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
3 0.14911966 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
4 0.14829604 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: Standard SVM training has O(m^3) time and O(m^2) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
7 0.13457201 11 jmlr-2005-Algorithmic Stability and Meta-Learning
8 0.12741408 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.12716001 71 jmlr-2005-Variational Message Passing
10 0.12509377 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
11 0.1244611 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
12 0.11706275 22 jmlr-2005-Concentration Bounds for Unigram Language Models
13 0.11604982 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
14 0.11186785 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
15 0.10439217 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
16 0.10405952 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
17 0.10166022 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.095194265 41 jmlr-2005-Kernel Methods for Measuring Independence
19 0.092344165 64 jmlr-2005-Semigroup Kernels on Measures
20 0.085350893 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
topicId topicWeight
[(13, 0.014), (17, 0.032), (19, 0.016), (36, 0.024), (37, 0.018), (42, 0.014), (43, 0.039), (47, 0.012), (52, 0.097), (59, 0.022), (62, 0.434), (70, 0.019), (88, 0.124), (90, 0.018), (94, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.66677475 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
2 0.35674465 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.35633683 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.35370073 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for meta-algorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
5 0.34904218 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.34811169 39 jmlr-2005-Information Bottleneck for Gaussian Variables
7 0.3474443 44 jmlr-2005-Learning Module Networks
8 0.34724656 3 jmlr-2005-A Classification Framework for Anomaly Detection
9 0.34603238 71 jmlr-2005-Variational Message Passing
10 0.34557807 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
11 0.34214932 20 jmlr-2005-Clustering with Bregman Divergences
12 0.34011465 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
13 0.33999553 48 jmlr-2005-Learning the Kernel Function via Regularization
14 0.33885437 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.33869368 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
16 0.33832431 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
17 0.33782685 67 jmlr-2005-Stability of Randomized Learning Algorithms
18 0.3375366 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
19 0.33726561 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
20 0.33232608 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors