jmlr jmlr2009 jmlr2009-52 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cynthia Rudin, Robert E. Schapire
Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
Reference: text
sentIndex sentText sentNum sentScore
1 We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. [sent-6, score-0.325]
2 Our bound suggests that algorithms that maximize the ranking margin will generalize well. [sent-7, score-0.425]
3 Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. [sent-10, score-0.358]
4 This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. [sent-12, score-0.339]
5 A special case of the general ranking problem is the “bipartite” ranking problem, where there are only two classes: a positive class (good movies) and a negative class (bad movies). [sent-38, score-0.456]
6 Our ranking margin is defined in analogy with the classification margin, and the complexity measure for the hypothesis space is a “sloppy covering number,” which yields, as a corollary, a bound in terms of the L∞ covering number. [sent-45, score-0.54]
7 • Smooth margin ranking algorithm: We present a ranking algorithm in Section 4 designed to maximize the margin. [sent-47, score-0.624]
8 • An equivalence between AdaBoost and RankBoost: A remarkable property of AdaBoost is that it not only solves the classification problem, but simultaneously solves the same problem of bipartite ranking as its counterpart, RankBoost. [sent-49, score-0.311]
9 The limited bipartite case has this symmetry, but not the more general ranking problem that we have described. [sent-58, score-0.311]
10 Since AdaBoost (for the classification setting) and RankBoost (for the ranking setting) were derived analogously, and AdaBoost does not directly maximize the classification margin, RankBoost likewise does not directly maximize the ranking margin, and it does not necessarily increase the margin at every iteration. [sent-84, score-0.624]
11 We introduce a “smooth margin” ranking algorithm, and prove that it makes progress towards increasing the smooth margin for ranking at every iteration; this is the main step needed in proving convergence and convergence rates. [sent-86, score-0.713]
12 This algorithm is analogous to the smooth margin classification algorithm “approximate coordinate ascent boosting” (Rudin et al. [sent-87, score-0.36]
13 If the F-skew vanishes, AdaBoost minimizes the exponentiated ranking loss, which is the same loss that RankBoost explicitly minimizes; thus, the two algorithms will produce equally good solutions to the exponentiated problem. [sent-96, score-0.317]
14 Here, we discuss RankBoost, AdaBoost and other coordinate-based ranking algorithms, and introduce the smooth margin ranking algorithm. [sent-107, score-0.691]
15 In Section 5, we focus on the bipartite ranking problem, and discuss conditions for AdaBoost to act as a bipartite ranking algorithm by minimizing the exponentiated loss associated with the AUC. [sent-108, score-0.655]
16 In the case of the movie ranking problem, the xi ’s are the movies and X is the set of all possible movies. [sent-125, score-0.355]
17 The total number of crucial training pairs can be no larger than m(m − 1)/2 based on the rules of π, and should intuitively be of the order m^2 in order for us to perform ranking with sufficient accuracy. [sent-133, score-0.33]
18 The margin of classifier f is defined to be the minimum margin over all training instances, min_i y_i f(x_i). [sent-147, score-0.318]
19 The margin of crucial pair x_i, x_k (with respect to ranking function f) will be defined as f(x_i) − f(x_k). [sent-151, score-0.459]
20 The margin of ranking function f is defined to be the minimum margin over all crucial pairs: margin_f := µ_f := min_{i,k : π(x_i,x_k)=1} [f(x_i) − f(x_k)]. [sent-152, score-0.72]
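Below is a minimal sketch (not from the paper) of how this ranking margin could be computed for a scoring function over a set of crucial pairs; the names f, instances, and crucial_pairs, and the toy numbers, are illustrative assumptions.

```python
import numpy as np

def ranking_margin(f, instances, crucial_pairs):
    """Minimum of f(x_i) - f(x_k) over crucial pairs (i, k) with pi(x_i, x_k) = 1."""
    return min(f(instances[i]) - f(instances[k]) for i, k in crucial_pairs)

# Toy example: x_0 should be ranked above x_1, and x_1 above x_2.
instances = np.array([[3.0], [2.0], [0.5]])
crucial_pairs = [(0, 1), (1, 2)]
f = lambda x: 0.8 * x[0]
print(ranking_margin(f, instances, crucial_pairs))  # smallest score gap over the crucial pairs
```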
21 Intuitively, the margin tells us how much the ranking function can change before one of the crucial pairs is misranked. [sent-153, score-0.453]
22 Bipartite ranking is a subset of the general ranking framework we have introduced. [sent-156, score-0.456]
23 In the bipartite ranking problem, every training instance falls into one of two categories, the positive class Y+ and the negative class Y−. [sent-157, score-0.335]
24 Furthermore, the bipartite ranking problem does not have the same goal as classification even though the labels are {−1, +1}. [sent-166, score-0.311]
25 A change in the position of one example can change the bipartite ranking loss without changing the misclassification error and vice versa. [sent-168, score-0.311]
26 In this section, we provide a margin-based bound for ranking, which gives us an intuition for separable-case ranking and yields theoretical encouragement for margin-based ranking algorithms. [sent-171, score-0.485]
27 In particular, Corollary 4 implies that P_D{misrank_f} ≤ P_S{margin_f ≤ θ} + ε with probability at least 1 − δ if ε = sqrt( (8/(m E^2 θ^2)) · ( ln|H| · ln(4mE^2θ^2 / ln|H|) + 2 ln(2/δ) ) ). [sent-206, score-0.496]
28 Then we present the “smooth margin ranking” algorithm, and prove that it makes significant progress towards increasing this smooth ranking margin at each iteration, and converges to a maximum margin solution. [sent-230, score-0.779]
29 G(λ): For classification limited to the separable case, the algorithms “coordinate ascent boosting” and “approximate coordinate ascent boosting” are known to maximize the margin (Rudin et al.). [sent-238, score-0.382]
30 G̃(λ): For ranking limited to the separable case, “smooth margin ranking” is an approximate coordinate ascent algorithm that maximizes the ranking margin. [sent-243, score-0.773]
31 Formally, the elements of the two-dimensional matrix M are defined as follows, for index ik corresponding to crucial pair x_i, x_k: M_{ik,j} := h_j(x_i) − h_j(x_k). [sent-247, score-0.37]
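As a concrete illustration of this definition, here is a small sketch (not the paper's implementation) that builds M from a list of weak rankers and crucial pairs; the helper name build_pair_matrix, the toy instances, and the thresholded weak rankers are assumptions made for the example.

```python
import numpy as np

def build_pair_matrix(weak_rankers, instances, crucial_pairs):
    """M[ik, j] = h_j(x_i) - h_j(x_k) for crucial pair (i, k) and weak ranker h_j."""
    M = np.zeros((len(crucial_pairs), len(weak_rankers)))
    for row, (i, k) in enumerate(crucial_pairs):
        for j, h in enumerate(weak_rankers):
            M[row, j] = h(instances[i]) - h(instances[k])
    return M

# Toy example with {0, 1}-valued weak rankers, so entries of M lie in {-1, 0, +1}.
instances = np.array([[1.0], [0.2], [-0.5]])
crucial_pairs = [(0, 1), (0, 2), (1, 2)]
weak_rankers = [lambda x: float(x[0] > 0.0), lambda x: float(x[0] > 0.5)]
print(build_pair_matrix(weak_rankers, instances, crucial_pairs))
```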
32 The direction chosen at iteration t (corresponding to the choice of weak ranker j_t) in the “optimal” case (where the best weak ranker is chosen at each iteration) is given as follows. [sent-259, score-0.923]
33 The step size our coordinate descent algorithm chooses at iteration t is αt , where αt satisfies the following equation for the line search along direction jt . [sent-262, score-0.426]
34 Define I_t^+ := {ik : M_{ik,j_t} = 1}, and similarly, I_t^− := {ik : M_{ik,j_t} = −1}. [sent-263, score-0.56]
35 The line search is: 0 = −∂F̃(λ_t + α e_{j_t})/∂α |_{α=α_t} = Σ_{ik∈C_p} e^{−(M(λ_t + α_t e_{j_t}))_{ik}} M_{ik,j_t} = Σ_{ik∈I_t^+} e^{−(Mλ_t)_{ik}} e^{−α_t} − Σ_{ik∈I_t^−} e^{−(Mλ_t)_{ik}} e^{α_t}, that is, 0 = d_{t+} e^{−α_t} − d_{t−} e^{α_t}, which gives α_t = (1/2) ln(d_{t+}/d_{t−}). [sent-265, score-0.964]
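The closed-form step can be checked numerically; the sketch below (an illustration, not the paper's code) compares the finite-difference derivative of F̃ along e_{j_t} at α_t = (1/2) ln(d_{t+}/d_{t−}) with zero, assuming the chosen column of M contains only ±1 entries; the toy numbers are invented.

```python
import numpy as np

M_col = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0])   # column j_t of M (assumed +/-1)
scores = np.array([0.3, -0.1, 0.5, 1.2, -0.4, 0.2, 0.0, 0.8])    # stands in for (M lambda_t)_ik

w = np.exp(-scores)                                  # unnormalized weights e^{-(M lambda_t)_ik}
d_plus, d_minus = w[M_col == 1].sum(), w[M_col == -1].sum()
alpha_t = 0.5 * np.log(d_plus / d_minus)             # closed-form step from the line search

def F_tilde(alpha):
    return np.sum(np.exp(-(scores + alpha * M_col)))

eps = 1e-6
derivative = (F_tilde(alpha_t + eps) - F_tilde(alpha_t - eps)) / (2 * eps)
print(alpha_t, derivative)                           # derivative is ~0 at the closed-form alpha_t
```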
36 They both minimize the same objective F̃, but they differ in the ordering of steps: for coordinate descent RankBoost, j_t is calculated first, then α_t. [sent-272, score-0.401]
37 In other words, at each step RankBoost selects the weak ranker that yields the largest decrease in the loss function, whereas coordinate descent RankBoost selects the weak ranker of steepest slope. [sent-274, score-0.459]
38 Define the following for iteration t (eliminating the t subscript): I_{+j} := {ik : M_{ik,j} = 1}, I_{−j} := {ik : M_{ik,j} = −1}, I_{0j} := {ik : M_{ik,j} = 0}, d_{+j} := Σ_{ik∈I_{+j}} d_{t,ik}, d_{−j} := Σ_{ik∈I_{−j}} d_{t,ik}, d_{0j} := Σ_{ik∈I_{0j}} d_{t,ik}. [sent-276, score-0.362]
39 That is, RankBoost chooses j_t := argmin_j F̃(λ_t + (1/2) ln(d_{+j}/d_{−j}) e_j) = argmin_j Σ_{ik∈C_p} e^{−(Mλ_t)_{ik}} e^{−(1/2) M_{ik,j} ln(d_{+j}/d_{−j})}, which, after normalizing by F̃(λ_t), is the j achieving argmin_j [ 2(d_{+j} d_{−j})^{1/2} + d_{0j} ]. [sent-278, score-1.088]
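To make the contrast concrete, the sketch below evaluates both selection rules on a toy matrix; the matrix, the uniform weights, and the variable names are assumptions, chosen here so that the two rules happen to disagree.

```python
import numpy as np

# Two candidate weak rankers over 16 crucial pairs, entries in {-1, 0, +1} (illustrative only).
col_a = np.array([1]*5 + [-1]*1 + [0]*10, dtype=float)
col_b = np.array([1]*10 + [-1]*5 + [0]*1, dtype=float)
M = np.stack([col_a, col_b], axis=1)
d = np.full(M.shape[0], 1.0 / M.shape[0])            # normalized weights d_{t,ik}

d_plus  = np.array([d[M[:, j] ==  1].sum() for j in range(M.shape[1])])
d_minus = np.array([d[M[:, j] == -1].sum() for j in range(M.shape[1])])
d_zero  = 1.0 - d_plus - d_minus

j_rankboost = int(np.argmin(2.0 * np.sqrt(d_plus * d_minus) + d_zero))  # largest decrease in F_tilde
j_coordinate = int(np.argmax(d_plus - d_minus))                         # steepest slope (d_t^T M)_j
print(j_rankboost, j_coordinate)                     # the two rules can pick different weak rankers
```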
40 For t = 1, ..., t_max: (a) j_t ∈ argmax_j (d_t^T M)_j (the “optimal” case choice of weak classifier); (b) d_{t+} = Σ_{ik: M_{ik,j_t}=1} d_{t,ik}, d_{t−} = Σ_{ik: M_{ik,j_t}=−1} d_{t,ik}; (c) α_t = (1/2) ln(d_{t+}/d_{t−}); (d) d_{t+1,ik} = d_{t,ik} e^{−M_{ik,j_t} α_t} / normalization, [sent-289, score-1.399]
41 for each crucial pair ik in C_p; (e) λ_{t+1} = λ_t + α_t e_{j_t}, where e_{j_t} is 1 in position j_t and 0 elsewhere. [sent-290, score-1.104]
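A compact sketch of this loop, following steps (a)-(e) as reconstructed above; the function name, the random toy matrix, and the uniform initialization are assumptions, and degenerate cases (a chosen column with no −1 entries, ties in the argmax) are not handled.

```python
import numpy as np

def coordinate_descent_rankboost(M, t_max=50):
    """Sketch of steps (a)-(e) above; assumes every chosen column of M has both +1 and -1 entries."""
    n_pairs, n_weak = M.shape
    d = np.full(n_pairs, 1.0 / n_pairs)                       # uniform weights on crucial pairs
    lam = np.zeros(n_weak)
    for _ in range(t_max):
        j = int(np.argmax(d @ M))                             # (a) "optimal" choice of weak ranker
        d_plus = d[M[:, j] == 1].sum()                        # (b) weight on correctly ordered pairs
        d_minus = d[M[:, j] == -1].sum()
        alpha = 0.5 * np.log(d_plus / d_minus)                # (c) step size
        d = d * np.exp(-M[:, j] * alpha)                      # (d) reweight crucial pairs ...
        d /= d.sum()                                          #     ... and renormalize
        lam[j] += alpha                                       # (e) lambda_{t+1} = lambda_t + alpha_t e_{j_t}
    return lam

rng = np.random.default_rng(1)
M = rng.choice([-1.0, 1.0], size=(30, 5))                     # toy matrix, purely illustrative
print(coordinate_descent_rankboost(M))
```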
42 After we make the choice of j_t, we can plug back into the formula for α_t, yielding α_t = (1/2) ln(d_{+j_t}/d_{−j_t}). [sent-293, score-0.404]
43 Note that for AdaBoost’s objective function, choosing the weak classifier with the steepest slope (plain coordinate descent) yields the same as choosing the weak classifier with the largest decrease in the loss function: both yield AdaBoost. [sent-296, score-0.296]
44 Thus d_{0j} = 0, and from plain coordinate descent: j_t = argmax_j (d_{+j} − d_{−j}) = argmax_j (2d_{+j} − 1), that is, j_t = argmax_j d_{+j}. [sent-299, score-0.839]
45 It has been shown (Rudin et al., 2004) that it is possible for AdaBoost not to converge to a maximum margin solution, nor even to make progress towards increasing the margin at every iteration. [sent-307, score-0.316]
46 Theorem 6 (RankBoost does not always converge to a maximum margin solution) There exist matrices M for which RankBoost converges to a margin that is strictly less than the maximum margin. [sent-309, score-0.294]
47 It is rare in the separable case to be able to solve for the asymptotic margin that AdaBoost or RankBoost converges to; for this 8 × 8 example, AdaBoost’s weight vectors exhibit cyclic behavior, which allowed convergence of the margin to be completely determined. [sent-314, score-0.339]
48 In earlier work, we have introduced a smooth margin function, which one can maximize in order to achieve a maximum margin solution for the classification problem (Rudin et al.). [sent-317, score-0.403]
49 A coordinate ascent algorithm on this function makes progress towards increasing the smooth margin at every iteration. [sent-319, score-0.382]
50 Here, we present the analogous smooth ranking function and the smooth margin ranking algorithm. [sent-320, score-0.779]
51 The smooth ranking function G̃ is defined as follows: G̃(λ) := −ln F̃(λ) / ||λ||_1. [sent-322, score-0.44]
52 In other words, the smooth ranking margin is always less than the true margin, although the two quantities become closer as ||λ||_1 increases. [sent-331, score-0.484]
53 We now define the smooth margin ranking algorithm, which is approximately coordinate ascent on G̃. [sent-333, score-0.588]
54 At iteration t, we have calculated λ_t, at which point the following quantities can be calculated: g_t := G̃(λ_t); weights on crucial pairs d_{t,ik} := e^{−(Mλ_t)_{ik}} / F̃(λ_t); direction j_t = argmax_j (d_t^T M)_j; edge r_t := (d_t^T M)_{j_t}. [sent-337, score-0.995]
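The sketch below computes these four quantities for a given λ_t; it is an illustration only, and the toy M and λ values are assumptions.

```python
import numpy as np

def smooth_margin_quantities(M, lam):
    """g_t, d_t, j_t and r_t as listed above, for a given lambda_t (assumes ||lambda_t||_1 > 0)."""
    pair_scores = M @ lam                              # (M lambda_t)_ik over crucial pairs
    F_tilde = np.exp(-pair_scores).sum()               # F_tilde(lambda_t)
    g = -np.log(F_tilde) / np.abs(lam).sum()           # g_t = G_tilde(lambda_t)
    d = np.exp(-pair_scores) / F_tilde                 # weights d_{t,ik} on crucial pairs
    edges = d @ M                                      # (d_t^T M)_j for every weak ranker j
    j = int(np.argmax(edges))                          # direction j_t
    return g, d, j, float(edges[j])                    # last entry is the edge r_t

M = np.array([[1., -1.], [1., 1.], [-1., 1.], [1., 1.]])      # toy matrix, illustrative only
lam = np.array([0.4, 0.7])
print(smooth_margin_quantities(M, lam))
```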
55 The choice of j_t is the same as for coordinate descent RankBoost (also see Rudin et al.). [sent-338, score-0.401]
56 We also have s_t = ||λ_t||_1 and s_{t+1} = s_t + α_t, and g_{t+1} = G̃(λ_t + α_t e_{j_t}), where α_t has not yet been defined. [sent-341, score-0.576]
57 As before, I_t^+ := {i,k | M_{ik,j_t} = 1, π(x_i, x_k) = 1}, I_t^− := {i,k | M_{ik,j_t} = −1, π(x_i, x_k) = 1}, and now, I_t^0 := {i,k | M_{ik,j_t} = 0, π(x_i, x_k) = 1}. [sent-342, score-1.059]
58 F̃(λ_t + α e_{j_t}) = Σ_{{i,k | π(x_i,x_k)=1}} e^{−(Mλ_t)_{ik}} e^{−M_{ik,j_t} α} = F̃(λ_t) (d_{t+} e^{−α} + d_{t−} e^{α} + d_{t0}). [sent-348, score-0.56]
59 G̃(λ_t + α e_{j_t}) = −ln(F̃(λ_t + α e_{j_t})) / (s_t + α) = [−ln(F̃(λ_t)) − ln(d_{t+} e^{−α} + d_{t−} e^{α} + d_{t0})] / (s_t + α) = g_t s_t/(s_t + α) − ln(d_{t+} e^{−α} + d_{t−} e^{α} + d_{t0}) / (s_t + α). [sent-351, score-1.181]
60 For our algorithm, we set α = α_t in the above expression and use the notation defined earlier, with τ_t := d_{t+} e^{−α_t} + d_{t−} e^{α_t} + d_{t0}: g_{t+1} = g_t s_t/(s_t + α_t) − ln τ_t/(s_t + α_t), so that g_{t+1} − g_t = −[g_t α_t + ln τ_t] / (s_t + α_t). (9) [sent-352, score-2.315]
61 Now we have gathered enough notation to write the equation for α_t for smooth margin ranking. [sent-353, score-0.949]
62 For plain coordinate ascent, the update α* solves 0 = ∂G̃(λ_t + α e_{j_t})/∂α |_{α=α*} = ∂[−ln F̃(λ_t + α e_{j_t})/(s_t + α)]/∂α |_{α=α*}; carrying out the differentiation and multiplying through by (s_t + α*) gives the condition below. [sent-354, score-2.453]
63 0 = −G̃(λ_t + α* e_{j_t}) + [−∂F̃(λ_t + α e_{j_t})/∂α |_{α=α*}] / F̃(λ_t + α* e_{j_t}). (10) We could solve this equation numerically for α* to get a smooth margin coordinate ascent algorithm; however, we avoid this line search for α* in smooth margin ranking. [sent-355, score-1.303]
64 To get the update rule for smooth margin ranking, we set α_t to solve 0 = −G̃(λ_t) + [−∂F̃(λ_t + α e_{j_t})/∂α |_{α=α_t}] / F̃(λ_t + α_t e_{j_t}) = −g_t − τ_t′/τ_t, that is, g_t τ_t = −τ_t′ = d_{t+} e^{−α_t} − d_{t−} e^{α_t}. (11) [sent-357, score-1.091]
65 Solving this for α_t gives α_t = ln( [−g_t d_{t0} + sqrt(g_t^2 d_{t0}^2 + 4 d_{t+} d_{t−} (1 − g_t^2))] / (2 d_{t−} (1 + g_t)) ). (12) We are done defining the algorithm and in the process we have derived some useful recursive relationships. [sent-359, score-0.333]
66 In summary: smooth margin ranking is the same as described in Figure 1, except that (3c) is replaced by (12), where d_{t0} = 1 − d_{t+} − d_{t−} and g_t = G̃(λ_t). [sent-360, score-0.584]
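The replacement step can be written in closed form by solving g_t τ_t = −τ_t′ for α_t, as in equation (12) reconstructed above; the sketch below implements that formula, with the caveat that the exact form of (12) is reconstructed rather than copied from the paper, and the test numbers are invented. With g_t = 0 it reduces to the plain RankBoost step (1/2) ln(d_{t+}/d_{t−}).

```python
import numpy as np

def smooth_margin_step(d_plus, d_minus, d_zero, g):
    """Step size solving g * tau = -tau' (equation (12) as reconstructed above); needs d_minus > 0."""
    # Quadratic in u = exp(alpha):  d_minus*(1+g)*u^2 + g*d_zero*u - d_plus*(1-g) = 0.
    disc = (g * d_zero) ** 2 + 4.0 * d_plus * d_minus * (1.0 - g ** 2)
    u = (-g * d_zero + np.sqrt(disc)) / (2.0 * d_minus * (1.0 + g))
    return np.log(u)

print(smooth_margin_step(0.6, 0.3, 0.1, 0.0))   # equals 0.5*ln(0.6/0.3) when g = 0
print(0.5 * np.log(0.6 / 0.3))
print(smooth_margin_step(0.6, 0.3, 0.1, 0.2))   # a smaller, more conservative step when g > 0
```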
67 As usual, the weak learning algorithm must always achieve an edge r_t of at least ρ for the calculation to hold, where recall r_t = (d_t^T M)_{j_t} = d_{t+} − d_{t−}. [sent-366, score-0.441]
68 At every iteration, there is always a weak ranker which achieves edge at least ρ, so this requirement is always met in the “optimal case,” where we choose the best possible weak ranker at every iteration (i.e., j_t ∈ argmax_j (d_t^T M)_j). [sent-367, score-0.363]
69 Theorem 7 (Progress according to the smooth margin) For 0 ≤ g_t < r_t < 1 and 0 ≤ d_{t0} < (2/3)(1 − r_t)(1 − r_t^2), the algorithm makes progress at iteration t: g_{t+1} − g_t ≥ α_t (r_t − g_t) / (2 s_{t+1}). [sent-372, score-0.834]
70 This theorem tells us that the value of the smooth ranking margin increases significantly when the condition on d0 holds. [sent-374, score-0.491]
71 This theorem is the main step in proving convergence theorems, for example: Theorem 8 (Convergence for smooth margin ranking) If d_{t0} < (2/3)(1 − r_t)(1 − r_t^2) for all t, the smooth margin ranking algorithm converges to a maximum margin solution, that is, lim_{t→∞} g_t = ρ. [sent-375, score-1.082]
72 Besides Theorem 7, the only other key step in the proof of Theorem 8 is the following lemma, proved in Section 7: Lemma 9 (Step-size does not increase too quickly for smooth margin ranking) lim_{t→∞} α_t / s_{t+1} = 0. [sent-377, score-0.421]
73 Using this fact along with (11), we find: g_∞ = lim_{t→∞} g_t = lim inf_{t→∞} g_t = lim inf_{t→∞} r_t ≥ ρ. [sent-393, score-1.098]
74 Thus, the smooth ranking algorithm converges to a maximum margin solution. [sent-396, score-0.463]
75 In the bipartite ranking problem, the focus of this section, recall that every training instance falls into one of two categories, the positive class Y+ and the negative class Y− . [sent-404, score-0.335]
76 In the bipartite ranking problem, the function F̃ becomes an exponentiated version of the AUC; that is, since 1_{[x≤0]} ≤ e^{−x}, we have: |Y+||Y−|(1 − AUC(λ)) = Σ_{i∈Y+} Σ_{k∈Y−} 1_{[(Mλ)_{ik} ≤ 0]} ≤ Σ_{i∈Y+} Σ_{k∈Y−} e^{−(Mλ)_{ik}} = F̃(λ). [sent-409, score-0.557]
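A small sketch of this inequality on invented scores: the count of misranked positive-negative pairs (which determines the AUC) is upper-bounded by the pairwise exponential loss F̃.

```python
import numpy as np

pos_scores = np.array([2.1, 0.3, 1.5])        # f(x_i) for i in Y+ (illustrative values)
neg_scores = np.array([0.9, -0.2])            # f(x_k) for k in Y-

diffs = pos_scores[:, None] - neg_scores[None, :]      # (M lambda)_ik = f(x_i) - f(x_k)
misranked = int((diffs <= 0).sum())                    # |Y+||Y-| * (1 - AUC), ties counted as errors
F_tilde = float(np.exp(-diffs).sum())                  # exponentiated ranking loss

print("AUC:", 1.0 - misranked / diffs.size)
print("misranked pairs:", misranked, "<= F_tilde:", F_tilde)
```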
77 Theorem 10 (Equivalence between AdaBoost and RankBoost’s objectives) Let {λ_t}_{t=1}^∞ be any sequence for which AdaBoost’s objective is minimized, lim_{t→∞} F(λ_t) = inf_λ F(λ), (16) and lim_{t→∞} F-skew(λ_t) = 0. [sent-434, score-0.359]
78 The vector lim_{t→∞} 1_{[Mλ_t ≤ 0]} may not be uniquely defined; for some i, k pair we may have lim_{t→∞} (Mλ_t)_{ik} = 0, and in that case, lim_{t→∞} 1_{[(Mλ_t)_{ik} ≤ 0]} may take on the values 0, 1, or the limit may not exist, depending on the algorithm. [sent-450, score-0.483]
79 Then, if each positive example has a final score distinct from each negative example, that is, ∀ ik, lim_{t→∞} (Mλ_t)_{ik} ≠ 0 and lim_{t→∞} (Mλ_t′)_{ik} ≠ 0, then both sequences will asymptotically achieve the same AUC value. [sent-455, score-0.322]
80 That is: lim_{t→∞} Σ_{i∈Y+} Σ_{k∈Y−} 1_{[(Mλ_t)_{ik} ≤ 0]} = lim_{t→∞} Σ_{i∈Y+} Σ_{k∈Y−} 1_{[(Mλ_t′)_{ik} ≤ 0]}. [sent-456, score-0.748]
81 That is: lim_{t→∞} [ Σ_{i∈Y+} 1_{[(M^{Ada} λ_t^{corrected})_i ≤ 0]} + Σ_{k∈Y−} 1_{[(M^{Ada} λ_t^{corrected})_k ≤ 0]} ] = lim_{t→∞} [ Σ_{i∈Y+} 1_{[(M^{Ada} λ_t′^{corrected})_i ≤ 0]} + Σ_{k∈Y−} 1_{[(M^{Ada} λ_t′^{corrected})_k ≤ 0]} ]. [sent-473, score-0.322]
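The statements above compare quantities that depend on an intercept; the role of the intercept mentioned in the abstract can be illustrated with a small sketch (the particular "corrected" intercept of the paper is not reproduced here, and the scores and candidate intercepts below are invented). Shifting all scores by an intercept b changes which examples sign(f(x) + b) classifies as positive, and hence the misclassification error, but leaves the AUC unchanged.

```python
import numpy as np

pos = np.array([1.4, 0.2, 0.9, -0.6])          # scores of positive examples (illustrative)
neg = np.array([0.5, -0.8, -0.3])              # scores of negative examples (illustrative)

def auc(pos, neg):
    return float(np.mean((pos[:, None] - neg[None, :]) > 0))

def misclassification_error(pos, neg, b):
    errors = int((pos + b <= 0).sum()) + int((neg + b > 0).sum())
    return errors / (len(pos) + len(neg))

for b in (-1.0, 0.0, 1.0):
    print(f"b={b:+.1f}  error={misclassification_error(pos, neg, b):.3f}  AUC={auc(pos, neg):.3f}")
```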
82 Proof (of Lemma 9) There are two possibilities; either lim_{t→∞} s_t = ∞ or lim_{t→∞} s_t < ∞. [sent-639, score-0.296]
83 From (9), s_{t+1}(g_{t+1} − g_t) = −g_t α_t − ln τ_t, so s_t(g_{t+1} − g_t) = −g_t α_t − α_t(g_{t+1} − g_t) − ln τ_t = −α_t g_{t+1} − ln τ_t; hence s_t(g_{t+1} − g_t) + ln τ_t + α_t = α_t(1 − g_{t+1}) ≥ α_t(1 − ρ), and dividing by s_t(1 − ρ) gives (g_{t+1} − g_t)/(1 − ρ) + (ln τ_t + α_t)/(s_t(1 − ρ)) ≥ α_t/s_t ≥ α_t/s_{t+1}. [sent-641, score-2.466]
84 Since the g_t’s constitute a nondecreasing sequence bounded by 1, (g_{t+1} − g_t) → 0 as t → ∞, so the first term on the left vanishes. [sent-642, score-0.566]
85 The second term will vanish as long as we can bound ln τt + αt by a constant, since by assumption, st → ∞. [sent-643, score-0.301]
86 In order to bound ln τ_t + α_t, we use Equation (11): ln τ_t + α_t = ln(−τ_t′) − ln g_t + α_t = ln[d_{t+} e^{−α_t} − d_{t−} e^{α_t}] − ln g_t + α_t = ln[d_{t+} − d_{t−} e^{2α_t}] + ln e^{−α_t} − ln g_t + α_t ≤ ln d_{t+} − ln g_t ≤ ln 1 − ln g_1 = −ln g_1 < ∞. [sent-645, score-2.229]
87 Consider Σ_{t=1}^{T} α_t/s_{t+1} = Σ_{t=1}^{T} (s_{t+1} − s_t)/s_{t+1} ≤ Σ_{t=1}^{T} ∫_{s_t}^{s_{t+1}} (1/u) du = ∫_{s_1}^{s_{T+1}} (1/u) du = ln(s_{T+1}/s_1). [sent-648, score-0.568]
88 Using this relation at time t and incorporating the recursive equation, Equation (9), g_{t+1} − g_t = (1/s_{t+1})[−g_t α_t − ln τ_t] > (1/s_{t+1}) α_t [−g_t + (r_t + g_t)/2] = α_t (r_t − g_t) / (2 s_{t+1}). [sent-667, score-0.751]
89 To see this, assume first that we are in the separable case, that is, lim_{t→∞} F_+(λ_t) = lim_{t→∞} F_−(λ_t) = 0; thus lim_{t→∞} F̃(λ_t) = lim_{t→∞} F(λ_t) = 0 and we are done. [sent-677, score-0.689]
90 Thus, we have that for all crucial pairs i, k such that i ∈ Y+ and k ∈ Y−: lim_{t→∞} e^{−(Mλ_t)_{ik}} = lim_{t→∞} e^{−(Mλ_t′)_{ik}} = q̃*_{ik}. [sent-723, score-0.4]
91 For each crucial pair i, k, if q̃*_{ik} > 1 then lim_{t→∞} (Mλ_t)_{ik} < 0, that is, lim_{t→∞} 1_{[(Mλ_t)_{ik} ≤ 0]} = 1; and conversely, if q̃*_{ik} < 1 then lim_{t→∞} 1_{[(Mλ_t)_{ik} ≤ 0]} = 0. [sent-724, score-0.534]
92 Summing over i, k pairs yields: lim_{t→∞} Σ_{i∈Y+} Σ_{k∈Y−} 1_{[(Mλ_t)_{ik} ≤ 0]} = lim_{t→∞} Σ_{i∈Y+} Σ_{k∈Y−} 1_{[(Mλ_t′)_{ik} ≤ 0]}. [sent-727, score-0.775]
93 Using the same argument as in Theorem 12, substituting AdaBoost for RankBoost, we have that lim_{t→∞} e^{−(M^{Ada} λ_t^{corrected})_i} =: q*_i exists for all i and lim_{t→∞} e^{−(M^{Ada} λ_t^{corrected})_k} =: q*_k exists for all k. [sent-739, score-0.322]
94 Now, we have that for each example i, if q*_i > 1 then lim_{t→∞} (M^{Ada} λ_t^{corrected})_i < 0, that is, lim_{t→∞} 1_{[(M^{Ada} λ_t^{corrected})_i ≤ 0]} = 1; and conversely, if q*_i < 1 then lim_{t→∞} 1_{[(M^{Ada} λ_t^{corrected})_i ≤ 0]} = 0. [sent-740, score-0.483]
95 Summing over i and k yields: lim_{t→∞} [ Σ_{i∈Y+} 1_{[(M^{Ada} λ_t^{corrected})_i ≤ 0]} + Σ_{k∈Y−} 1_{[(M^{Ada} λ_t^{corrected})_k ≤ 0]} ] = lim_{t→∞} [ Σ_{i∈Y+} 1_{[(M^{Ada} λ_t′^{corrected})_i ≤ 0]} + Σ_{k∈Y−} 1_{[(M^{Ada} λ_t′^{corrected})_k ≤ 0]} ]. [sent-744, score-0.322]
96 The second main result is an algorithm, smooth margin ranking, that maximizes the ranking margin. [sent-749, score-0.463]
97 Our third result is that under very general conditions, AdaBoost solves classification and ranking problems simultaneously, performing just as well for the ranking problem as RankBoost. [sent-750, score-0.456]
98 We described a new ranking algorithm, smooth margin ranking, that maximizes the margin. [sent-757, score-0.463]
99 It would be natural to compare the empirical performance of the smooth margin ranking algorithm and RankBoost. [sent-758, score-0.463]
100 lim_{g→r} Γ(r, g, d_0) = lim_{g→r} [−ln τ(r, g, d_0) / α(r, g, d_0)] = lim_{g→r} [−(τ′(r, g, d_0)/τ(r, g, d_0)) · ∂α(r, g, d_0)/∂g] / [∂α(r, g, d_0)/∂g] = lim_{g→r} g = r. [sent-796, score-0.768]
wordName wordTfidf (topN-words)
[('rankboost', 0.428), ('adaboost', 0.305), ('jt', 0.28), ('ranking', 0.228), ('ik', 0.213), ('rudin', 0.211), ('gt', 0.209), ('lim', 0.161), ('oost', 0.151), ('st', 0.148), ('margin', 0.147), ('chapire', 0.144), ('botd', 0.137), ('dt', 0.137), ('mik', 0.129), ('ln', 0.124), ('tcorrected', 0.123), ('ada', 0.117), ('px', 0.117), ('da', 0.116), ('anking', 0.096), ('topd', 0.094), ('weak', 0.089), ('smooth', 0.088), ('mada', 0.087), ('misrank', 0.087), ('argin', 0.084), ('bipartite', 0.083), ('coordinate', 0.081), ('ranker', 0.08), ('xk', 0.073), ('bots', 0.072), ('ps', 0.07), ('covering', 0.068), ('movies', 0.066), ('argmax', 0.066), ('corrected', 0.06), ('auc', 0.058), ('tops', 0.055), ('pd', 0.053), ('crucial', 0.051), ('miada', 0.05), ('limt', 0.049), ('misranking', 0.049), ('rankers', 0.049), ('roc', 0.048), ('separable', 0.045), ('ascent', 0.044), ('sylvia', 0.043), ('ranked', 0.042), ('ex', 0.042), ('gn', 0.041), ('lemma', 0.041), ('descent', 0.04), ('cortes', 0.04), ('objective', 0.037), ('rt', 0.036), ('dtt', 0.036), ('misclassi', 0.035), ('mohri', 0.035), ('exp', 0.034), ('schapire', 0.034), ('exponentiated', 0.033), ('boosting', 0.033), ('qt', 0.033), ('xi', 0.033), ('instances', 0.033), ('clemencon', 0.033), ('sloppy', 0.033), ('cynthia', 0.03), ('mkada', 0.029), ('bound', 0.029), ('movie', 0.028), ('area', 0.028), ('theorem', 0.028), ('bt', 0.028), ('curve', 0.028), ('della', 0.027), ('pietra', 0.027), ('pairs', 0.027), ('fm', 0.026), ('mcdiarmid', 0.026), ('freund', 0.026), ('vanishes', 0.026), ('corinna', 0.025), ('iteration', 0.025), ('classi', 0.025), ('proof', 0.025), ('nonseparable', 0.025), ('training', 0.024), ('iv', 0.023), ('equally', 0.023), ('progress', 0.022), ('agarwal', 0.022), ('derivative', 0.022), ('limg', 0.022), ('quantity', 0.021), ('quantities', 0.021), ('maximize', 0.021), ('mehryar', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
2 0.28734815 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
Author: Cynthia Rudin
Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval
3 0.19875382 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
Author: Shivani Agarwal, Partha Niyogi
Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability
4 0.11373696 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
5 0.073927164 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
6 0.073853433 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
7 0.06694822 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
8 0.066818111 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
9 0.062434167 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
10 0.057611194 50 jmlr-2009-Learning When Concepts Abound
11 0.052017342 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
13 0.048735783 67 jmlr-2009-Online Learning with Sample Path Constraints
14 0.042675771 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
15 0.041708514 18 jmlr-2009-Consistency and Localizability
16 0.040871557 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
17 0.038856406 49 jmlr-2009-Learning Permutations with Exponential Weights
18 0.037884865 23 jmlr-2009-Discriminative Learning Under Covariate Shift
19 0.037636627 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
20 0.035637274 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
topicId topicWeight
[(0, 0.229), (1, -0.001), (2, 0.056), (3, -0.048), (4, -0.099), (5, 0.33), (6, -0.171), (7, 0.477), (8, -0.188), (9, -0.085), (10, -0.211), (11, 0.178), (12, 0.048), (13, 0.063), (14, 0.049), (15, 0.037), (16, 0.081), (17, 0.091), (18, -0.005), (19, 0.087), (20, -0.001), (21, -0.057), (22, 0.014), (23, 0.021), (24, -0.054), (25, 0.019), (26, -0.06), (27, -0.004), (28, -0.018), (29, -0.035), (30, -0.008), (31, 0.015), (32, -0.004), (33, -0.034), (34, 0.047), (35, 0.043), (36, 0.038), (37, 0.023), (38, 0.036), (39, 0.025), (40, 0.036), (41, 0.014), (42, -0.041), (43, 0.013), (44, 0.013), (45, 0.029), (46, -0.01), (47, 0.036), (48, -0.066), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.96126533 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
2 0.86491001 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
Author: Cynthia Rudin
Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval
3 0.71310967 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
Author: Shivani Agarwal, Partha Niyogi
Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability
4 0.36530811 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
5 0.33044216 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options
6 0.25100991 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
7 0.22625823 50 jmlr-2009-Learning When Concepts Abound
9 0.21137686 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
10 0.19047029 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
11 0.18843462 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
12 0.18470463 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
13 0.17918827 18 jmlr-2009-Consistency and Localizability
14 0.17234942 46 jmlr-2009-Learning Halfspaces with Malicious Noise
15 0.17099553 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
16 0.16841173 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
17 0.16677928 67 jmlr-2009-Online Learning with Sample Path Constraints
18 0.16578178 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
19 0.16495109 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
20 0.16388713 73 jmlr-2009-Prediction With Expert Advice For The Brier Game
topicId topicWeight
[(8, 0.015), (11, 0.021), (19, 0.031), (38, 0.023), (47, 0.029), (52, 0.061), (53, 0.307), (55, 0.044), (58, 0.028), (66, 0.127), (68, 0.051), (90, 0.154), (96, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.75763178 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
2 0.58226562 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
Author: Cynthia Rudin
Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval
3 0.55200279 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
Author: Luciana Ferrer, Kemal Sönmez, Elizabeth Shriberg
Abstract: We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average class-conditional covariance between the resulting scores and the scores produced by the existing systems, introducing a trade-off between such covariance and the system’s individual performance. That is, the new system “takes one for the team”, falling somewhat short of its best possible performance in order to increase the diversity of the ensemble. We report results on the NIST 2005 and 2006 speaker recognition evaluations (SREs) for a variety of subsystems. We show a gain of 19% on the equal error rate (EER) of a combination of four systems when applying the proposed method with respect to the performance obtained when the four systems are trained independently of each other. Keywords: system combination, ensemble diversity, multiple classifier systems, support vector machines, speaker recognition, kernel methods ∗. This author performed part of the work presented in this paper while at the Information Systems Laboratory, Department of Electrical Engineering, Stanford University. c 2009 Luciana Ferrer, Kemal S¨ nmez and Elizabeth Shriberg. o ¨ F ERRER , S ONMEZ AND S HRIBERG
4 0.54629004 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
5 0.5462665 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
6 0.53776014 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
7 0.52450955 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
10 0.51714492 18 jmlr-2009-Consistency and Localizability
11 0.5169487 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
12 0.51613253 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
13 0.5150972 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
14 0.51458436 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
15 0.51147473 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
16 0.51015937 78 jmlr-2009-Refinement of Reproducing Kernels
17 0.50884867 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
18 0.50667286 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
19 0.50585562 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
20 0.5052166 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)