jmlr jmlr2009 jmlr2009-37 knowledge-graph by maker-knowledge-mining

37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability


Source: pdf

Author: Shivani Agarwal, Partha Niyogi

Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. [sent-5, score-1.512]

2 Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. [sent-7, score-1.259]

3 important is the relative ranking of instances induced by those scores. [sent-24, score-0.635]

4 Although there have been several recent advances in developing algorithms for various settings of the ranking problem, the study of generalization properties of ranking algorithms has been largely limited to the special setting of bipartite ranking (Freund et al. [sent-26, score-1.832]

5 In this paper, we study generalization properties of ranking algorithms in a more general setting of the ranking problem that arises frequently in applications. [sent-29, score-1.191]

6 Our generalization bounds are derived using the notion of algorithmic stability; we show that a number of practical ranking algorithms satisfy the required stability conditions, and therefore can be analyzed using our bounds. [sent-30, score-0.912]

7 Since then, however, the number of domains in which ranking has found applications has grown quickly, and as a result, ranking has gained considerable attention in machine learning and learning theory in recent years. [sent-34, score-1.128]

8 Since then there have been many other algorithmic developments: for example, Radlinski and Joachims (2005) have developed an algorithmic framework for ranking in information retrieval applications; Burges et al. [sent-39, score-0.61]

9 (2008) have studied statistical convergence properties of ranking algorithms—specifically, ranking algorithms based on empirical and convex risk minimization—using the theory of U-statistics. [sent-43, score-1.162]

10 Their bound was derived from uniform convergence results for the binary classification error, and was expressed in terms of the VC-dimension of a class of binary classification functions derived from the class of ranking functions searched by the algorithm. [sent-46, score-0.68]

11 Agarwal and Niyogi (2005) used a different tool, namely that of algorithmic stability (Rogers and Wagner, 1978; Bousquet and Elisseeff, 2002), to obtain generalization bounds for bipartite ranking algorithms that have good stability properties. [sent-49, score-1.219]

12 As can be noted from the above discussion, the question of generalization properties of ranking algorithms has so far been investigated mainly in the special setting of bipartite ranking. [sent-51, score-0.704]

13 (2005) derived a margin-based bound which is expressed in terms of covering numbers and relies ultimately on a uniform convergence result; this bound is derived for a non-bipartite setting of the ranking problem, but under the restrictive distributional assumption of a “truth” function. [sent-54, score-0.702]

14 (2007) consider a different setting of the ranking problem and derive stability-based generalization bounds for algorithms in this setting. [sent-56, score-0.686]

15 2 Our Results We use the notion of algorithmic stability to study generalization properties of ranking algorithms in a more general setting of the ranking problem than has been considered previously, and that arises frequently in applications. [sent-63, score-1.444]

16 Here we show that algorithmic stability can be useful also in analyzing generalization properties of ranking algorithms in the setting we consider; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. [sent-65, score-1.769]

17 We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space (RKHS) have such stability properties, and therefore our bounds can be applied to these algorithms. [sent-66, score-0.889]

18 We describe the general ranking problem and the setting we consider in detail in Section 2, and define notions of stability for ranking algorithms in this setting in Section 3. [sent-69, score-1.452]

19 Using these notions, we derive generalization bounds for stable ranking algorithms in Section 4. [sent-70, score-0.659]

20 In Section 5 we show stability of kernel-based ranking algorithms that perform regularization in an RKHS, and apply the results of Section 4 to obtain generalization bounds for these algorithms. [sent-71, score-0.91]

21 The Ranking Problem In the problem of ranking, one is given a finite number of examples of order relationships among instances in some instance space X , and the goal is to learn from these examples a ranking or ordering over X that ranks accurately future instances. [sent-74, score-0.635]

22 , (xm , x′m , rm )), the goal is to learn a real-valued ranking function f : X →R that ranks accurately future instances; f is considered to rank an instance x ∈ X higher than an instance x′ ∈ X if f (x) > f (x′ ), and lower than x′ if f (x) < f (x′ ). [sent-80, score-0.618]

23 Thus, assuming that ties are broken uniformly at random, the expected penalty (or loss) incurred by a ranking function f on a pair of instances (x, x′ ) labeled by r can be written as |r| ( I{r( f (x)− f (x′ ))<0} + (1/2) I{ f (x)= f (x′ )} ), where I{φ} is 1 if φ is true and 0 otherwise. [sent-81, score-0.741]
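
The pairwise penalty above translates directly into code. The following is a minimal sketch (the function name and interface are illustrative, not from the paper):

```python
def pair_penalty(f, x, x_prime, r):
    """Penalty of ranking function f on a pair (x, x') labeled by r:
    |r| * (1{r*(f(x)-f(x'))<0} + (1/2)*1{f(x)=f(x')}),
    i.e., a tie is counted as a mis-ranking half of the time."""
    diff = f(x) - f(x_prime)
    if diff == 0:
        return abs(r) * 0.5  # tie broken uniformly at random
    return abs(r) if r * diff < 0 else 0.0
```

For example, with f(x) = x the pair (2, 1) with preference r = 1 is ranked correctly and incurs zero penalty, while the reversed pair incurs the full penalty |r|.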

24 A particular setting of the ranking problem that has been investigated in some detail in recent years is the bipartite setting (Freund et al. [sent-82, score-0.695]

25 In the bipartite ranking problem, instances come from two categories, positive and negative; the learner is given examples of instances labeled as positive or negative, and the goal is to learn a ranking in which positive instances are ranked higher than negative ones. [sent-85, score-1.453]
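
In the bipartite setting described above, the ranking error of a scorer reduces to the fraction of positive-negative pairs it mis-orders (ties counted half), which equals one minus the AUC. A minimal sketch, with an illustrative interface:

```python
def bipartite_ranking_error(f, positives, negatives):
    """Fraction of positive-negative pairs mis-ordered by scorer f,
    counting ties as half an error; this equals 1 - AUC of f."""
    errors = 0.0
    for xp in positives:
        for xn in negatives:
            if f(xp) < f(xn):
                errors += 1.0   # negative instance ranked above positive
            elif f(xp) == f(xn):
                errors += 0.5   # tie broken uniformly at random
    return errors / (len(positives) * len(negatives))
```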

26 , x−_n ), the x+_i and x−_j being instances in some instance space X , and the goal is to learn a real-valued ranking function f : X →R that ranks future positive instances higher than negative ones. [sent-92, score-0.706]

27 In this paper, we consider a more general setting: the learner is given examples of instances labeled by real numbers, and the goal is to learn a ranking in which instances labeled by larger numbers are ranked higher than instances labeled by smaller numbers. [sent-95, score-0.812]

28 2 This problem can also be seen to be a special case of the general ranking problem described above; a training sample S = ((x1 , y1 ), . [sent-105, score-0.608]

29 The quality of a ranking function f : X →R can then be measured by its expected ranking error, which we denote by R( f ) and define as follows: R( f ) = E_{((X,Y ),(X ′ ,Y ′ ))∼D ×D} [ |Y −Y ′ | ( I{(Y −Y ′ )( f (X)− f (X ′ ))<0} + (1/2) I{ f (X)= f (X ′ )} ) ]. [sent-111, score-1.151]

30 In practice, since the distribution D is unknown, the expected error of a ranking function f must be estimated from an empirically observable quantity, such as its empirical ranking error with respect to a sample S = ((x1 , y1 ), . [sent-113, score-1.17]
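
The empirical ranking error averages the pairwise penalty over the pairs of training examples, with the preference for a pair (xi, xj) given by yi − yj. A hedged sketch (normalizing over all unordered pairs is an assumption about the exact definition, which this summary elides):

```python
from itertools import combinations

def empirical_ranking_error(f, sample):
    """Average pairwise penalty of f over all unordered pairs of
    sample = [(x1, y1), ..., (xm, ym)]; pair (xi, xj) has preference r = yi - yj."""
    total, num_pairs = 0.0, 0
    for (xi, yi), (xj, yj) in combinations(sample, 2):
        r, diff = yi - yj, f(xi) - f(xj)
        if diff == 0:
            total += abs(r) * 0.5   # tie broken uniformly at random
        elif r * diff < 0:
            total += abs(r)         # mis-ranked pair, weighted by |yi - yj|
        num_pairs += 1
    return total / num_pairs if num_pairs else 0.0
```

A perfectly order-preserving scorer has empirical error zero; an order-reversing one incurs the full |yi − yj| weight on every pair.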

31 A learning algorithm for the ranking problem described above takes as input a training sample S ∈ ⋃_{m=1}^{∞} (X × Y )^m and returns as output a ranking function fS : X →R. [sent-118, score-1.172]

32 Let ℓ : RX × (X × Y ) × (X × Y ) → R+ ∪ {0} be a ranking loss function. [sent-128, score-0.611]

33 Let ℓ : RX × (X × Y ) × (X × Y ) → R+ ∪ {0} be a ranking loss function. [sent-134, score-0.611]

34 As mentioned above, one choice for a ranking loss function could be a 0-1 loss that simply assigns a constant penalty of 1 to any mis-ranked pair: ℓ0-1 ( f , (x, y), (x′ , y′ )) = I{(y−y′ )( f (x)− f (x′ ))<0} + (1/2) I{ f (x)= f (x′ )} . [sent-136, score-0.685]

35 While our focus will be on bounding the expected (ℓdisc -)ranking error of a learned ranking function, our results can be used also to bound the expected ℓ0-1 -error. [sent-141, score-0.651]

36 Several other ranking loss functions will be useful in our study; these will be introduced in the following sections as needed. [sent-142, score-0.611]

37 The input to a ranking algorithm in our setting is a training sample of the form S = ((x1 , y1 ), . [sent-145, score-0.635]

38 The notions of stability that we define below for ranking algorithms in our setting are based on those defined earlier for bipartite ranking algorithms (Agarwal and Niyogi, 2005) and are most closely related to the notions of stability used by Bousquet and Elisseeff (2002). [sent-152, score-1.772]

39 Definition 4 (Uniform loss stability) Let A be a ranking algorithm whose output on a training sample S we denote by fS , and let ℓ be a ranking loss function. [sent-153, score-1.281]

40 Definition 5 (Uniform score stability) Let A be a ranking algorithm whose output on a training sample S we denote by fS . [sent-156, score-0.631]

41 If a ranking algorithm has uniform loss stability β with respect to ℓ, then changing an input training sample of size m by a single example leads to a difference of at most β(m) in the ℓ-loss incurred by the output ranking function on any pair of examples (x, y), (x′ , y′ ). [sent-159, score-1.519]

42 Similarly, if a ranking algorithm has uniform score stability ν, then changing an input training sample of size m by a single example leads to a difference of at most ν(m) in the score assigned by the output ranking function to any instance x. [sent-161, score-1.499]
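
Uniform score stability can also be probed empirically: retrain after swapping out each single training example and record the largest change in any assigned score. The harness below is an illustrative sketch (the `train` interface is hypothetical, not an API from the paper):

```python
def score_stability_probe(train, sample, replacement, test_points):
    """Empirical lower bound on the score-stability coefficient nu(m):
    the largest |f_S(x) - f_{S^i}(x)| observed when a single training
    example is replaced. `train` maps a sample to a scoring callable."""
    f_S = train(sample)
    worst = 0.0
    for i in range(len(sample)):
        f_Si = train(sample[:i] + [replacement] + sample[i + 1:])
        for x in test_points:
            worst = max(worst, abs(f_S(x) - f_Si(x)))
    return worst
```

Any single observed change is only a lower bound on ν(m); the theoretical guarantees in the paper bound the supremum over all samples and instances.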

43 However, we do not consider such weaker notions of stability in this paper; we shall show later (Section 5) that several practical ranking algorithms in fact exhibit good uniform stability properties. [sent-165, score-1.135]

44 Generalization Bounds for Stable Ranking Algorithms In this section we give generalization bounds for ranking algorithms that exhibit good (uniform) stability properties. [sent-167, score-0.889]

45 Our main result is the following, which bounds the expected ℓ-error of a ranking function learned by an algorithm with good uniform loss stability in terms of its empirical ℓ-error on the training sample. [sent-169, score-1.019]

46 Theorem 6 Let A be a symmetric ranking algorithm whose output on a training sample S ∈ (X × Y )^m we denote by fS , and let ℓ be a bounded ranking loss function such that 0 ≤ ℓ( f , (x, y), (x′ , y′ )) ≤ B for all f : X →R and (x, y), (x′ , y′ ) ∈ (X × Y ). [sent-170, score-1.25]

47 Indeed, this is the reason that in deriving uniform convergence bounds for the ranking error, the standard Hoeffding inequality used to obtain such bounds in classification and regression can no longer be applied (Agarwal et al. [sent-179, score-0.79]

48 This means the theorem cannot be applied directly to obtain bounds on the expected ranking error, since it is not possible to have non-trivial uniform loss stability with respect to the discrete ranking loss ℓdisc defined in Eq. [sent-186, score-1.602]

49 ) However, for any ranking loss ℓ that satisfies ℓdisc ≤ ℓ, [sent-190, score-0.611]

50 Theorem 6 can be applied to ranking algorithms that have good uniform loss stability with respect to ℓ to obtain bounds on the expected ℓ-error; since in this case R ≤ Rℓ , these bounds apply also to the expected ranking error. [sent-192, score-1.62]

51 We consider below a specific ranking loss that satisfies this condition, and with respect to which we will be able to show good loss stability of certain ranking algorithms; other ranking losses which can also be used in this manner will be discussed in later sections. [sent-193, score-2.016]

52 Therefore, for any ranking algorithm that has good uniform loss stability with respect to ℓγ for some γ > 0, Theorem 6 can be applied to bound the expected ranking error of a learned ranking function in terms of its empirical ℓγ -error on the training sample. [sent-196, score-2.109]

53 The following lemma shows that, for every γ > 0, a ranking algorithm that has good uniform score stability also has good uniform loss stability with respect to ℓγ . [sent-197, score-1.214]

54 Then for every γ > 0, A has uniform loss stability βγ with respect to the ranking loss ℓγ , where for all m ∈ N, βγ (m) = 2ν(m)/γ . [sent-200, score-0.939]

55 Putting everything together, we thus get the following result which bounds the expected ranking error of a learned ranking function in terms of its empirical ℓγ -error for any ranking algorithm that has good uniform score stability. [sent-216, score-1.882]

56 m Proof The result follows by applying Theorem 6 to A with the ranking loss ℓγ (using Lemma 7), which satisfies 0 ≤ ℓγ ≤ M, and from the fact that R ≤ Rℓγ . [sent-220, score-0.611]

57 Stable Ranking Algorithms In this section we show (uniform) stability of certain ranking algorithms that select a ranking function by minimizing a regularized objective function. [sent-222, score-1.382]

58 2 we use this result to show stability of kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space (RKHS). [sent-226, score-0.83]

59 We show, in particular, stability of two such ranking algorithms, and apply the results of Section 4 to obtain generalization bounds for these algorithms. [sent-227, score-0.889]

60 We also use these stability results to give a consistency theorem for kernel-based ranking algorithms (Section 5. [sent-229, score-0.811]

61 Let ℓ be a ranking loss such that ℓ( f , (x, y), (x′ , y′ )) is convex in f , and let σ > 0 be such that ℓ is σ-admissible with respect to F . [sent-241, score-0.642]

62 The proof follows that of a similar result for regularization-based algorithms for classification and regression (Bousquet and Elisseeff, 2002); it is based crucially on the convexity of the ranking loss ℓ. [sent-250, score-0.641]

63 As we shall see below, the above result can be used to establish stability of certain regularizationbased ranking algorithms. [sent-252, score-0.814]

64 We show below that, if the kernel K is such that K(x, x) is bounded for all x ∈ X , then a ranking algorithm that minimizes an appropriate regularized error over F , with regularizer N as defined above, has good uniform score stability. [sent-262, score-0.678]

65 Let ℓ be a ranking loss such that ℓ( f , (x, y), (x′ , y′ )) is convex in f and ℓ is σ-admissible with respect to F . [sent-264, score-0.627]

66 Let λ > 0, and let A be a ranking algorithm that, given a training sample S, outputs a ranking function fS ∈ F that satisfies fS = arg min f ∈F { Rℓ ( f ; S) + λ‖ f ‖²_K }. [sent-265, score-1.187]

67 If A has uniform score stability ν and ℓ is a ranking loss that is σ-admissible with respect to F , then A has uniform loss stability β with respect to ℓ, where for all m ∈ N, β(m) = 2σν(m) . [sent-284, score-1.243]

68 In practice, the ranking losses ℓ minimized by kernel-based ranking algorithms tend to be larger than the loss ℓ1 , and therefore Corollary 12 tends to provide tighter bounds on the expected ranking error than Corollary 15. [sent-296, score-1.821]

69 Specifically, let Ah be a ranking algorithm which, given a training sample S ∈ (X × Y )^m , outputs a ranking function fS ∈ F that satisfies (for some fixed λ > 0) fS = arg min f ∈F { Rℓh ( f ; S) + λ‖ f ‖²_K }. [sent-302, score-1.187]

70 It can be verified that ℓh ( f , (x, y), (x′ , y′ )) is convex in f , and that ℓh is 1-admissible with respect to F (the proof of 1-admissibility is similar to the proof of Lemma 7 which, as noted earlier in Footnote 3, effectively shows (1/γ)-admissibility of the ranking loss ℓγ ). [sent-307, score-0.657]

71 Let Asq be a ranking algorithm that minimizes the regularized ℓsq -error in an RKHS F , that is, given a training sample S ∈ (X × Y )^m , the algorithm Asq outputs a ranking function fS ∈ F that satisfies (for some fixed λ > 0) fS = arg min f ∈F { Rℓsq ( f ; S) + λ‖ f ‖²_K }. [sent-312, score-1.196]
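
A concrete instance of Asq can be sketched with NumPy, assuming ℓsq( f , (x, y), (x′ , y′ )) = ((y − y′) − ( f (x) − f (x′)))² (this summary elides Eq. (9), so that definition, the Gaussian kernel, and all names below are assumptions). By the representer theorem the minimizer lies in the span of the kernel functions at the training points, which reduces the RKHS problem to a finite linear system:

```python
import numpy as np
from itertools import combinations

def gaussian_kernel(X, Z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); any bounded kernel would do."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_pairwise_least_squares_ranker(X, y, lam=0.1, gamma=1.0):
    """Minimize (1/p) * sum_{i<j} ((y_i - y_j) - (f(x_i) - f(x_j)))^2
    + lam * ||f||_K^2 over the RKHS of gaussian_kernel. The representer
    theorem gives f = sum_j alpha_j K(., x_j); setting the gradient to zero
    yields ((1/p) P^T P K + lam * I) alpha = (1/p) P^T P y, where P maps a
    score vector to its pairwise differences over the p = m(m-1)/2 pairs."""
    m = len(X)
    pairs = list(combinations(range(m), 2))
    P = np.zeros((len(pairs), m))
    for k, (i, j) in enumerate(pairs):
        P[k, i], P[k, j] = 1.0, -1.0
    K = gaussian_kernel(X, X, gamma)
    A = P.T @ P / len(pairs)
    alpha = np.linalg.solve(A @ K + lam * np.eye(m), A @ y)
    return lambda Xnew: gaussian_kernel(np.atleast_2d(Xnew), X, gamma) @ alpha
```

On a small monotone sample the learned scores preserve the ordering of the labels; only score differences matter, since adding a constant to f leaves every pairwise loss unchanged.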

72 Then the least squares ranking loss ℓsq , defined in Eq. [sent-325, score-0.626]

73 To show this, we shall need the following simple lemma: Lemma 17 Let f : X →R be a fixed ranking function, and let ℓ be a bounded ranking loss function such that 0 ≤ ℓ( f , (x, y), (x′ , y′ )) ≤ B for all f : X →R and (x, y), (x′ , y′ ) ∈ (X × Y ). [sent-339, score-1.226]

74 Let ℓ be a ranking loss such that ℓ( f , (x, y), (x′ , y′ )) is convex in f and ℓ is σ-admissible with respect to F . [sent-343, score-0.627]

75 ) Let λ > 0, and let A be a ranking algorithm that, given a training sample S, outputs a ranking function fS ∈ F that satisfies fS = arg min f ∈F { Rℓ ( f ; S) + λ‖ f ‖²_K }. [sent-346, score-1.187]

76 Such a comparison shows that the bounds we obtain here for ranking are very similar to those obtained for classification and regression, differing only in constants (and, of course, in the precise definition of stability, which is problem-dependent). [sent-360, score-0.623]

77 It is also instructive to compare our bounds with those obtained for bipartite ranking algorithms (Agarwal and Niyogi, 2005). [sent-363, score-0.623]

78 In particular, the bounds for kernel-based ranking algorithms in the bipartite setting differ from our bounds in that the sample size m in our case is replaced with the term mn/(m + n), where m, n denote the numbers of positive and negative examples, respectively, in the bipartite setting. [sent-364, score-0.882]

79 This suggests that in general, the effective sample size in ranking is the ratio between the number of pairs in the training sample (mn in the bipartite setting) and the total number of examples (m + n); indeed, in our setting, this ratio works out to m/2, which fits our bounds. [sent-365, score-0.719]

80 2 Comparison with Uniform Convergence Bounds While uniform convergence bounds for ranking have not been derived explicitly in the setting we consider, it is not hard to see that the techniques used to derive such bounds in the settings of Agarwal et al. [sent-367, score-0.778]

81 (2007), who study “magnitude-preserving” ranking algorithms—most of which perform regularization in an RKHS— and also use algorithmic stability to derive generalization bounds for such algorithms. [sent-376, score-0.933]

82 do not actually bound the expected ranking error R( fS ) (which in our view is the primary performance measure of a ranking function in our setting); their results simply give bounds on the expected ℓ-error Rℓ ( fS ), where ℓ again is the loss minimized by the algorithm. [sent-393, score-1.301]

83 2, the ℓ1 loss tends to be smaller than many of the loss functions minimized by kernel-based ranking algorithms in practice (indeed, it is never greater than the hinge ranking loss ℓh ), thus leading to tighter bounds for the same algorithms. [sent-397, score-1.328]

84 In particular, they assume that the function class F searched by a ranking algorithm is bounded in the sense that ∃ M ′ > 0 such that for all f ∈ F and all x ∈ X , | f (x) − f ∗ (x)| ≤ M ′ (where, as mentioned above, f ∗ : X →R is the truth function in their setting). [sent-401, score-0.639]

85 We expect that an approach similar to that of separating out the effective search space, as we have done in the case of the least squares ranking loss in Section 5. [sent-405, score-0.626]

86 , xm , each drawn randomly and independently according to some (unknown) distribution on X , together with the corresponding ranking preferences π(xi , x j ) for i, j ∈ {1, . [sent-417, score-0.674]

87 Their bound, expressed in terms of covering numbers, is a margin-based bound, which is of interest when the learned ranking function has zero empirical error on the training sample. [sent-423, score-0.609]

88 Noting that the labels yi , y j corresponding to a pair of instances xi , x j in our setting are used in our results only in the form of ranking preferences (yi − y j ), we can clearly view the setting of Rudin et al. [sent-424, score-0.818]

89 as a special case of ours: the ranking preferences (yi − y j ), which are random variables in our setting, get replaced by the deterministic preferences π(xi , x j ). [sent-425, score-0.69]

90 All quantities can be adapted accordingly; for example, the expected ranking error of a ranking function f : X →R in this setting (with respect to a distribution D on X ) becomes R( f ) = E_{(X,X ′ )∼D ×D} [ |π(X, X ′ )| ( I{π(X,X ′ )( f (X)− f (X ′ ))<0} + (1/2) I{ f (X)= f (X ′ )} ) ]. [sent-426, score-1.178]

91 It is interesting then to consider a model in which not only the instances in S but also the pairs in E for which the ranking preferences are provided may be chosen randomly. [sent-433, score-0.706]

92 Then we can show the following bound for ranking algorithms with good loss stability in this setting: Theorem 19 Let π : X × X →[−M, M] be a pair-wise truth function such that π(x, x′ ) = −π(x′ , x) for all x, x′ ∈ X . [sent-442, score-0.895]

93 Let A be a symmetric ranking algorithm in the pair-wise truth function setting whose output on a training sample T = (S, E, π|(S,E) ) we denote by fT , and let ℓ be a bounded ranking loss function such that 0 ≤ ℓ( f , x, x′ , r) ≤ B for all f : X →R, x, x′ ∈ X and r ∈ [−M, M]. [sent-443, score-1.31]

94 Discussion Our goal in this paper has been to study generalization properties of ranking algorithms in a setting where ranking preferences among instances are indicated by real-valued labels on the instances. [sent-471, score-1.318]

95 The main difference in the mathematical formulation of ranking problems as compared to classification or regression problems is that the loss function in ranking is ‘pair-wise’ rather than ‘pointwise’. [sent-475, score-1.19]

96 For the sake of simplicity and to keep the focus on the main ideas involved in applying stability techniques to ranking, we have focused in this paper on bounding the expected ranking error in terms of the empirical ranking error. [sent-486, score-1.381]

97 An open question concerns the analysis of other ranking algorithms using the algorithmic stability framework. [sent-493, score-0.817]

98 Finally, it is also an open question to analyze generalization properties of ranking algorithms in even more general settings of the ranking problem. [sent-497, score-1.164]

99 , xm ∈ X are drawn randomly and independently according to some distribution D on X , and then pair-wise ranking preferences ri j ∈ [−M, M] for i < j are drawn from a conditional distribution, conditioned on the corresponding pair of instances (xi , x j ). [sent-501, score-0.745]

100 Lemma 21 Let A be a symmetric ranking algorithm whose output on a training sample S ∈ (X × Y )m we denote by fS , and let ℓ be a ranking loss function. [sent-540, score-1.234]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ranking', 0.564), ('fs', 0.53), ('fsi', 0.356), ('stability', 0.23), ('agarwal', 0.152), ('iyogi', 0.114), ('eneralization', 0.107), ('ft', 0.101), ('mcdiarmid', 0.098), ('elisseeff', 0.096), ('tability', 0.091), ('anking', 0.084), ('ms', 0.081), ('bipartite', 0.077), ('disc', 0.077), ('lgorithms', 0.075), ('sgn', 0.075), ('instances', 0.071), ('bousquet', 0.07), ('fsk', 0.06), ('bounds', 0.059), ('preferences', 0.056), ('rkhs', 0.054), ('niyogi', 0.054), ('xm', 0.054), ('uniform', 0.051), ('cortes', 0.05), ('xi', 0.05), ('sq', 0.048), ('zm', 0.048), ('loss', 0.047), ('partha', 0.04), ('notions', 0.04), ('kutin', 0.04), ('rudin', 0.039), ('ym', 0.037), ('generalization', 0.036), ('asq', 0.034), ('sk', 0.033), ('truth', 0.033), ('wagner', 0.031), ('clemencon', 0.031), ('xk', 0.03), ('shivani', 0.028), ('penalty', 0.027), ('setting', 0.027), ('searched', 0.026), ('training', 0.025), ('em', 0.025), ('dk', 0.025), ('regularized', 0.024), ('inequality', 0.024), ('expected', 0.023), ('rogers', 0.023), ('yi', 0.023), ('score', 0.023), ('algorithmic', 0.023), ('pm', 0.022), ('bound', 0.021), ('regularization', 0.021), ('broken', 0.021), ('edge', 0.02), ('draw', 0.02), ('herbrich', 0.02), ('caen', 0.02), ('corollary', 0.02), ('shall', 0.02), ('ln', 0.02), ('learned', 0.02), ('incurred', 0.019), ('sample', 0.019), ('kx', 0.019), ('si', 0.018), ('ranked', 0.018), ('convergence', 0.018), ('lemma', 0.018), ('theorem', 0.017), ('ah', 0.017), ('freund', 0.017), ('learner', 0.017), ('ck', 0.017), ('ron', 0.016), ('devroye', 0.016), ('preference', 0.016), ('ties', 0.016), ('bounded', 0.016), ('convex', 0.016), ('ps', 0.015), ('kearns', 0.015), ('cossock', 0.015), ('rx', 0.015), ('thore', 0.015), ('squares', 0.015), ('reproducing', 0.015), ('proof', 0.015), ('let', 0.015), ('regression', 0.015), ('pairs', 0.015), ('degrees', 0.015), ('get', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

Author: Shivani Agarwal, Partha Niyogi

Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability

2 0.19875382 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve

3 0.10459022 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

Author: Ting Hu, Ding-Xuan Zhou

Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a H¨ lder space. For regression with least square or insensitive loss, learning rates are given o in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space

4 0.088487357 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

Author: Cynthia Rudin

Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval

5 0.078079008 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

6 0.067587845 67 jmlr-2009-Online Learning with Sample Path Constraints

7 0.059562393 13 jmlr-2009-Bounded Kernel-Based Online Learning

8 0.057817146 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

9 0.05634968 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

10 0.043081444 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

11 0.038756132 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)

12 0.037245546 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

13 0.036826354 23 jmlr-2009-Discriminative Learning Under Covariate Shift

14 0.036033046 78 jmlr-2009-Refinement of Reproducing Kernels

15 0.035765857 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

16 0.034810118 16 jmlr-2009-Classification with Gaussians and Convex Loss

17 0.033095248 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)

18 0.03265705 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures

19 0.032029741 29 jmlr-2009-Estimating Labels from Label Proportions

20 0.025340227 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.182), (1, 0.049), (2, 0.011), (3, -0.029), (4, -0.175), (5, 0.186), (6, -0.03), (7, 0.351), (8, 0.013), (9, 0.009), (10, -0.207), (11, 0.03), (12, 0.079), (13, -0.039), (14, 0.023), (15, 0.019), (16, 0.135), (17, -0.003), (18, -0.067), (19, 0.134), (20, -0.025), (21, 0.139), (22, -0.104), (23, -0.043), (24, 0.044), (25, -0.112), (26, 0.04), (27, 0.003), (28, -0.069), (29, 0.003), (30, -0.029), (31, 0.04), (32, -0.137), (33, -0.042), (34, 0.032), (35, 0.027), (36, 0.084), (37, 0.096), (38, 0.13), (39, -0.087), (40, -0.077), (41, 0.033), (42, -0.045), (43, 0.101), (44, 0.08), (45, 0.079), (46, -0.065), (47, 0.071), (48, -0.018), (49, -0.028)]
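The simValue scores in the list below are presumably similarities between such topic-weight vectors; cosine similarity is the usual choice for LSI, though the exact metric used here is an assumption. A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# A vector compared with itself scores 1.0 -- the "same-paper" entry
# below is close to 1 for the same reason.  Truncated example weights:
v = [0.182, 0.049, 0.011, -0.029, -0.175]
self_sim = cosine(v, v)
```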

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97415322 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

Author: Shivani Agarwal, Partha Niyogi

Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability
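The stability bounds this abstract refers to roughly follow the shape of Bousquet and Elisseeff's uniform-stability result, which the paper adapts from classification to ranking. The sketch below shows that shape only; the function name and constants are illustrative, not the paper's exact theorem:

```python
import math

def stability_bound(emp_risk, beta, m, M=1.0, delta=0.05):
    """Generalization bound of the Bousquet-Elisseeff form:
        R <= R_emp + 2*beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2m)),
    where beta is the algorithm's uniform (loss) stability, m the sample
    size, and M a bound on the loss.  For kernel algorithms regularized
    in an RKHS, beta typically scales like O(1/(lambda * m)), so the
    whole bound shrinks as O(1/sqrt(m))."""
    return emp_risk + 2.0 * beta + (4.0 * m * beta + M) * math.sqrt(
        math.log(1.0 / delta) / (2.0 * m))

# With beta = c/(lambda*m), more data tightens the bound.
loose = stability_bound(0.1, beta=1.0 / 100, m=100)
tight = stability_bound(0.1, beta=1.0 / 10000, m=10000)
```

The key point, also made in the abstract, is that this route needs only stability of the algorithm, not uniform convergence over the whole function class.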

2 0.71968901 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
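The area under the ROC curve that this abstract measures bipartite ranking by has a simple pairwise form: it is the fraction of positive/negative pairs the scorer orders correctly. A self-contained sketch (toy scores are illustrative):

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of correctly ordered
    positive/negative score pairs (ties count half).  1 - AUC is the
    bipartite pairwise ranking loss that RankBoost-style algorithms
    minimize."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# 6 pairs, one misordered (0.3 vs 0.7) -> AUC = 5/6.
score = auc([0.9, 0.8, 0.3], [0.7, 0.2])
```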

3 0.56489635 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

Author: Cynthia Rudin

Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓp-norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval

4 0.40000799 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

5 0.32048073 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

Author: Ting Hu, Ding-Xuan Zhou

Abstract: Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a Hölder space. For regression with least square or insensitive loss, learning rates are given in both the RKHS norm and the L2 norm. For classification with hinge loss and support vector machine q-norm loss, rates are explicitly stated with respect to the excess misclassification error. Keywords: sampling with non-identical distributions, online learning, classification with a general convex loss, regression with insensitive loss and least square loss, reproducing kernel Hilbert space
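The "fully online learning algorithm in an RKHS" this abstract analyzes follows a standard template: keep the hypothesis as a kernel expansion and take a regularized gradient step on each incoming sample. The sketch below shows that generic template with least-squares loss and a Gaussian kernel; the function name, step-size schedule, and constants are illustrative, not the paper's exact scheme:

```python
import math

def online_kernel_regression(stream, lam=0.1, eta=0.5, gamma=1.0):
    """Online least-squares learning in an RKHS: the hypothesis is
    f_t = sum_i alpha_i K(x_i, .), and on sample (x_t, y_t) we take
        f_{t+1} = (1 - eta_t*lam) f_t - eta_t * (f_t(x_t) - y_t) K(x_t, .).
    The setting in the abstract allows each (x_t, y_t) to come from a
    *different* distribution; the update itself is unchanged."""
    centers, alphas = [], []

    def k(a, b):
        return math.exp(-gamma * (a - b) ** 2)

    def f(x):
        return sum(al * k(c, x) for c, al in zip(centers, alphas))

    for t, (x, y) in enumerate(stream, start=1):
        step = eta / math.sqrt(t)                           # decaying step size
        err = f(x) - y                                      # loss gradient at x
        alphas = [(1 - step * lam) * al for al in alphas]   # shrink: regularizer
        centers.append(x)
        alphas.append(-step * err)                          # new expansion coefficient
    return f

# Repeatedly seeing the same target pulls the prediction toward it
# (up to the bias the regularizer introduces).
model = online_kernel_regression([(0.0, 1.0)] * 50)
```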

6 0.26578265 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)

7 0.26042792 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

8 0.25151616 67 jmlr-2009-Online Learning with Sample Path Constraints

9 0.24914889 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

10 0.24868549 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

11 0.23576005 13 jmlr-2009-Bounded Kernel-Based Online Learning

12 0.22976665 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

13 0.22803853 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

14 0.21893738 29 jmlr-2009-Estimating Labels from Label Proportions

15 0.19989713 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

16 0.1775509 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)

17 0.17733087 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

18 0.16673012 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)

19 0.16618085 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

20 0.16377468 16 jmlr-2009-Classification with Gaussians and Convex Loss


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.012), (11, 0.022), (19, 0.036), (21, 0.02), (26, 0.01), (38, 0.02), (47, 0.01), (51, 0.373), (52, 0.06), (55, 0.047), (58, 0.034), (66, 0.134), (90, 0.079), (96, 0.031)]
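Unlike the dense LSI vector above, the LDA weights here are stored sparsely as (topicId, weight) pairs. A sketch of similarity over that representation, again assuming cosine as the metric:

```python
import math

def sparse_cosine(u, v):
    """Cosine similarity for sparse {topicId: weight} vectors; only
    topic ids present in both vectors contribute to the dot product."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Truncated example weights from the list above; a document compared
# with itself scores 1.0, and disjoint topic supports score 0.0.
doc = {8: 0.012, 11: 0.022, 51: 0.373, 66: 0.134}
self_sim = sparse_cosine(doc, doc)
```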

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70053422 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

Author: Shivani Agarwal, Partha Niyogi

Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability

2 0.42497543 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
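The equivalence this abstract describes can be seen concretely for the hinge loss: the worst-case loss under an ℓ2-bounded perturbation of the input has a closed form in which the perturbation budget acts exactly like a norm penalty on w. A minimal numeric check (toy values are illustrative):

```python
import math

def hinge(w, x, y):
    """Standard hinge loss [1 - y<w,x>]_+."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def worst_case_hinge(w, x, y, c):
    """Adversary replaces x with x - delta for any ||delta||_2 <= c;
    the worst-case hinge loss is [1 - y<w,x> + c*||w||_2]_+, i.e.
    robustness to input perturbation acts like an l2 penalty on w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return max(0.0, 1.0 - margin + c * norm_w)

# The worst perturbation delta* = c * y * w / ||w|| attains the bound.
w, x, y, c = [0.6, -0.8], [1.0, 2.0], 1.0, 0.5
norm_w = math.sqrt(sum(wi * wi for wi in w))
delta = [c * y * wi / norm_w for wi in w]
attacked = hinge(w, [xi - di for xi, di in zip(x, delta)], y)
```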

3 0.41551617 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang

Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semi-inner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-inner-products, representer theorems

4 0.41526625 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve

5 0.41184261 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

Author: Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray, David Page

Abstract: A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze Skewing, a heuristic method that was developed to improve the ability of greedy decision tree algorithms to identify relevant variables of correlation immune Boolean functions, given examples drawn from the uniform distribution (Page and Ray, 2003). We present theoretical results revealing both the capabilities and limitations of skewing. Second, we explore the problem of identifying relevant variables in the Product Distribution Choice (PDC) learning model, a model in which the learner can choose product distributions and obtain examples from them. We prove a lemma establishing a property of Boolean functions that may be of independent interest. Using this lemma, we give two new algorithms for finding relevant variables of correlation immune functions in the PDC model. Keywords: correlation immune functions, skewing, relevant variables, Boolean functions, product distributions
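The abstract's example that parity is correlation immune is easy to verify by enumeration: under the uniform distribution, each input variable is individually independent of the output. A self-contained check (function names are illustrative):

```python
from itertools import product

def parity(bits):
    """Parity of a bit tuple: 1 iff an odd number of 1s."""
    return sum(bits) % 2

def is_correlation_immune(f, n):
    """Under the uniform distribution on {0,1}^n, each variable must be
    uncorrelated with the output: P(f=1 | x_i=1) == P(f=1 | x_i=0) for
    every i.  The two conditioning sets have equal size 2^(n-1), so it
    suffices to compare raw counts.  This is why greedy one-variable
    splits cannot see parity's relevant variables."""
    inputs = list(product([0, 1], repeat=n))
    for i in range(n):
        ones_when_set = sum(f(x) for x in inputs if x[i] == 1)
        ones_when_clear = sum(f(x) for x in inputs if x[i] == 0)
        if ones_when_set != ones_when_clear:
            return False
    return True
```

A dictator function such as f(x) = x_0, by contrast, is maximally correlated with its relevant variable and fails the check.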

6 0.41127941 78 jmlr-2009-Refinement of Reproducing Kernels

7 0.4101592 18 jmlr-2009-Consistency and Localizability

8 0.40995231 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

9 0.40950665 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

10 0.40866709 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

11 0.406537 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

12 0.40520814 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

13 0.40419877 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

14 0.40416953 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

15 0.40218455 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

16 0.40189582 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

17 0.40183917 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

18 0.40175951 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations

19 0.40123305 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

20 0.40047261 38 jmlr-2009-Hash Kernels for Structured Data