On Equivalence Relationships Between Classification and Ranking Algorithms (JMLR 2011, paper jmlr2011-71)
Source: pdf
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
In this work, we show that a particular equivalence relationship exists between two algorithms: a ranking algorithm called the P-Norm Push (Rudin, 2009), which is a generalization of RankBoost (Freund et al., 2003), and a cost-sensitive classification algorithm that generalizes AdaBoost, which we call P-Classification. Specifically, we show that P-Classification not only optimizes its objective, but also optimizes the objective of the P-Norm Push (and vice versa, the P-Norm Push can be made to optimize P-Classification's objective function). Thus, P-Classification and the P-Norm Push perform equally well on both of their objectives; P-Classification can be used as a ranking algorithm, and the P-Norm Push can be made into a classification algorithm.
It is not clear that such an equivalence relationship holds between logistic regression and its ranking counterpart; in fact, our experiments indicate that no such relationship can hold. Bipartite ranking problems are similar to but distinct from binary classification problems. In both bipartite ranking and classification problems, the learner is given a training set of examples {(x_1, y_1), ..., (x_m, y_m)}. The goal of bipartite ranking is to learn a real-valued ranking function f : X → ℝ that ranks future positive instances higher than negative ones. Bipartite ranking algorithms optimize rank statistics, such as the AUC.
They showed that AdaBoost, which iteratively minimizes the exponential misclassification loss, also iteratively minimizes RankBoost's exponential ranking loss, and vice versa: RankBoost with trivial modifications can be made to minimize the exponential misclassification loss. The first result of our work, provided in Section 3, is to broaden that proof to handle more general ranking losses and classification losses. The more general ranking loss is that of the P-Norm Push, which concentrates on "pushing" negatives away from the top of the ranked list. Also in Section 3, we consider another simple cost-sensitive version of AdaBoost, and show the forward direction of its equivalence to RankBoost; in this case, the cost-sensitive version of AdaBoost minimizes RankBoost's objective no matter what the cost parameter is. In Section 4, we verify the equivalence relationship empirically and provide evidence that no such relationship holds for logistic regression and its ranking counterpart. In Section 5 we discuss the first two main benefits gained by the equivalence relationship described above, namely obtaining probability estimates, and computing solutions faster. In this work, the same set of features is used for both the classification and ranking problems without any modification. Another recent work that addresses the use of classification algorithms for solving ranking problems is that of Kotlowski et al. (2011), who show loss and regret bounds on the ranking performance of classifiers.
This y-intercept term is important in the equivalence proof; in order to turn the P-Norm Push into a classification algorithm, we need to place a decision boundary by adjusting the y-intercept.
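For concreteness, here is a minimal sketch (our own construction, not code from the paper) of this idea: given ranking scores, pick the y-intercept b so that sign(f(x) - b) makes the fewest mistakes on the training data. All names are ours.

```python
# Turn a ranking function into a classifier by choosing the intercept b
# that minimizes the 0-1 error of sign(f(x) - b) on the training set.

def best_intercept(scores, labels):
    """Return the intercept with the fewest training misclassifications."""
    srt = sorted(scores)
    # candidate cutoffs: below all scores, between adjacent scores, above all
    cands = [srt[0] - 1.0]
    cands += [(a + b) / 2.0 for a, b in zip(srt, srt[1:])]
    cands += [srt[-1] + 1.0]

    def errors(b):
        return sum(1 for s, y in zip(scores, labels)
                   if (1 if s > b else -1) != y)

    return min(cands, key=errors)
```

For a perfectly separating scorer, the chosen intercept sits between the lowest positive and highest negative score.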
The quality of a ranking function is often measured by the area under the ROC curve (AUC). The associated misranking loss, related to 1 − AUC, is the number of positive instances that are ranked below negative instances:

standard misranking loss(f) = Σ_{i=1}^{I} Σ_{k=1}^{K} 1_{[f(x_i) ≤ f(x̃_k)]}.    (1)

The ranking loss is zero when all negative instances are ranked below the positive instances.
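A small sketch of the 0-1 misranking loss in Equation (1): count positive-negative pairs in which the positive does not score strictly higher. Helper names are ours, not the paper's.

```python
# 0-1 misranking loss: pairs (i, k) with f(x_i) <= f(x~_k); ties count
# as misrankings, matching the indicator in Equation (1).

def misranking_loss(pos_scores, neg_scores):
    return sum(1 for fp in pos_scores for fn in neg_scores if fp <= fn)

def auc_complement(pos_scores, neg_scores):
    """Misranking loss normalized by the number of pairs (related to 1 - AUC)."""
    pairs = len(pos_scores) * len(neg_scores)
    return misranking_loss(pos_scores, neg_scores) / pairs
```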
The ranking counterpart of AdaBoost is RankBoost (Freund et al., 2003). We also investigate the equivalence relationship between logistic regression and its ranking counterpart, which we call "Logistic Regression Ranking" (LRR).
LRR's objective (6) is an upper bound on the 0-1 ranking loss in (1), using the logistic loss log(1 + e^{−z}) to upper bound the 0-1 loss 1_{[z ≤ 0]}. Unlike the P-Norm Push, it minimizes a weighted misclassification error (rather than a weighted misranking error), though we will show that minimization of either objective yields an equally good result for either problem.
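For concreteness, the three objectives discussed above can be sketched as follows, written as we read them from the text; the exact normalization (for instance the 1/p factor in P-Classification's loss) should be treated as our assumption. f_pos and f_neg hold the scores of positive and negative training examples.

```python
import math

# Sketch of the three objectives: P-Classification (per-example
# exponential losses), P-Norm Push (pairwise, with an l_p push on each
# negative), and LRR (pairwise logistic loss).

def r_pclassification(f_pos, f_neg, p):
    # exponential loss; negatives penalized through exp(p * f)
    return sum(math.exp(-f) for f in f_pos) + \
           (1.0 / p) * sum(math.exp(p * f) for f in f_neg)

def r_pnorm_push(f_pos, f_neg, p):
    # per-negative sum of exponentiated pairwise margins, raised to power p
    return sum(sum(math.exp(-(fi - fk)) for fi in f_pos) ** p
               for fk in f_neg)

def r_lrr(f_pos, f_neg):
    # logistic loss on pairwise margins, upper-bounding the 0-1 loss in (1)
    return sum(math.log(1.0 + math.exp(-(fi - fk)))
               for fi in f_pos for fk in f_neg)
```

At p = 1 the P-Norm Push objective factorizes into a product of one-sided exponential sums, which is RankBoost's exponential ranking loss.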
Lozano and Abe (2008) also use ℓ_p norms within a boosting-style algorithm, but for the problem of label ranking instead of example ranking. However, their objective is totally different from ours; for instance, it does not correspond directly to a 0-1 misranking loss like (1). We will now show that the P-Norm Push is equivalent to P-Classification, meaning that minimizers of P-Classification's objective are also minimizers of the P-Norm Push's objective, and that there is a trivial transformation of the P-Norm Push's output that will minimize P-Classification's objective.
The forward direction of the equivalence relationship is as follows:

Theorem 1 (P-Classification minimizes P-Norm Push's objective) If λ^PC ∈ argmin_λ R^PC(λ) (assuming minimizers exist), then λ^PC ∈ argmin_λ R^PN(λ).

For all j = 1, ..., n:

0 = ∂R^PC(λ)/∂λ_j |_{λ=λ^PC}
  = Σ_{k=1}^{K} e^{p Σ_j λ_j^PC h_j(x̃_k)} h_j(x̃_k) + Σ_{i=1}^{I} e^{−Σ_j λ_j^PC h_j(x_i)} (−h_j(x_i))
  = Σ_{k=1}^{K} v_k^p h_j(x̃_k) + Σ_{i=1}^{I} q_i (−h_j(x_i)),    (8)

where v_k := e^{f_{λ^PC}(x̃_k)} and q_i := e^{−f_{λ^PC}(x_i)}. Using this, we can derive the skew condition:

0 = Σ_{k=1}^{K} v_k^p − Σ_{i=1}^{I} q_i := p R_−^{PC}(λ^PC) − R_+^{PC}(λ^PC).    (skew condition) (9)

Step 3 is to characterize the conditions to be at a minimizer of the ranking loss. Using the skew condition (9), and then (8),

∂R^PN(λ)/∂λ_j |_{λ=λ^PC}
  = p (Σ_{i=1}^{I} q_i)^{p−1} [ (Σ_{i=1}^{I} q_i) Σ_{k=1}^{K} h_j(x̃_k) v_k^p − (Σ_{k=1}^{K} v_k^p) Σ_{i=1}^{I} h_j(x_i) q_i ]
  = p (Σ_{i=1}^{I} q_i)^p ∂R^PC(λ)/∂λ_j |_{λ=λ^PC} = 0.
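As a numeric sanity check of the gradient formula (8) (our own check, not from the paper), we can compare the analytic partial derivative of R^PC against a central finite difference. H_pos and H_neg hold feature values h_j evaluated on the positive and negative examples; all names here are ours.

```python
import math

# Finite-difference check of Equation (8):
# dR^PC/dlambda_j = sum_k v_k^p h_j(x~_k) - sum_i q_i h_j(x_i),
# with v_k = exp(f(x~_k)) and q_i = exp(-f(x_i)).

def scores(H, lam):
    return [sum(l * h for l, h in zip(lam, row)) for row in H]

def r_pc(H_pos, H_neg, lam, p):
    fp, fn = scores(H_pos, lam), scores(H_neg, lam)
    return sum(math.exp(-f) for f in fp) + \
           (1.0 / p) * sum(math.exp(p * f) for f in fn)

def grad_pc(H_pos, H_neg, lam, p, j):
    fp, fn = scores(H_pos, lam), scores(H_neg, lam)
    v = [math.exp(f) for f in fn]
    q = [math.exp(-f) for f in fp]
    return sum(v[k] ** p * H_neg[k][j] for k in range(len(v))) \
         - sum(q[i] * H_pos[i][j] for i in range(len(q)))

def finite_diff(fun, lam, j, eps=1e-6):
    up = list(lam); up[j] += eps
    dn = list(lam); dn[j] -= eps
    return (fun(up) - fun(dn)) / (2.0 * eps)
```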
The backward direction of the equivalence relationship is as follows:

Theorem 2 (The P-Norm Push can be trivially altered to minimize P-Classification's objective.)

Proof. We will first show that the skew condition is satisfied for the corrected vector λ^{PN,corr}. The condition we need to prove is:

Σ_{i=1}^{I} q_i^corr = Σ_{k=1}^{K} (v_k^corr)^p,    (skew condition) (13)

where v_k^corr := e^{f_{λ^{PN,corr}}(x̃_k)} and q_i^corr := e^{−f_{λ^{PN,corr}}(x_i)}. According to (10), at λ = λ^{PN,corr},

∂R^PN(λ)/∂λ_j |_{λ=λ^{PN,corr}}
  = p (Σ_{i=1}^{I} q_i^corr)^{p−1} [ (Σ_{i=1}^{I} q_i^corr) Σ_{k=1}^{K} h_j(x̃_k) (v_k^corr)^p − (Σ_{k=1}^{K} (v_k^corr)^p) Σ_{i=1}^{I} h_j(x_i) q_i^corr ].
To show this, we prove the forward direction of the equivalence relationship between Cost-Sensitive AdaBoost (for any C) and RankBoost. Here is the forward direction of the equivalence relationship:

Theorem 3 (Cost-Sensitive AdaBoost minimizes RankBoost's objective.)

At λ^CSA we have:

0 = ∂R^CSA(λ)/∂λ_j |_{λ=λ^CSA} = ∂R_+^{AB}(λ)/∂λ_j + C ∂R_−^{AB}(λ)/∂λ_j
  = Σ_{i=1}^{I} q_i (−h_j(x_i)) + C Σ_{k=1}^{K} v_k h_j(x̃_k),    (16)

where v_k := e^{f_{λ^CSA}(x̃_k)} and q_i := e^{−f_{λ^CSA}(x_i)}. Using this, the skew condition can be derived as follows:

0 = C Σ_{k=1}^{K} v_k − Σ_{i=1}^{I} q_i := C R_−^{AB}(λ^CSA) − R_+^{AB}(λ^CSA).    (skew condition)
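A toy check (ours, not from the paper) of this skew condition: minimize R^CSA = Σ_i e^{−f(x_i)} + C Σ_k e^{f(x̃_k)} by gradient descent for a model with an intercept feature, and verify that C Σ_k v_k ≈ Σ_i q_i at the minimizer. The data, step size and iteration count are arbitrary choices of ours; the intercept term is what forces the skew condition to hold.

```python
import math

# Fit f(x) = lam0 + lam1 * x by gradient descent on the Cost-Sensitive
# AdaBoost objective, then measure the skew gap C * sum_k v_k - sum_i q_i.

def fit_csa(x_pos, x_neg, C, lr=0.05, steps=4000):
    lam0, lam1 = 0.0, 0.0
    for _ in range(steps):
        q = [math.exp(-(lam0 + lam1 * x)) for x in x_pos]
        v = [math.exp(lam0 + lam1 * x) for x in x_neg]
        g0 = -sum(q) + C * sum(v)                    # intercept gradient
        g1 = -sum(qi * x for qi, x in zip(q, x_pos)) \
             + C * sum(vi * x for vi, x in zip(v, x_neg))
        lam0 -= lr * g0
        lam1 -= lr * g1
    return lam0, lam1

def skew_gap(x_pos, x_neg, C, lam0, lam1):
    q = sum(math.exp(-(lam0 + lam1 * x)) for x in x_pos)
    v = sum(math.exp(lam0 + lam1 * x) for x in x_neg)
    return C * v - q  # approximately zero at the minimizer
```

On a symmetric toy set with C = 2, the fitted intercept is close to −ln(2)/2 and the skew gap vanishes, as the derivation predicts.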
45 (skew condition) We now characterize the conditions to be at a minimizer of the ranking loss. [sent-204, score-0.259]
We further investigated whether a similar equivalence property holds for logistic regression and Logistic Regression Ranking ("LRR," defined in Section 2); the results suggest that such an equivalence relationship does not hold between these two algorithms. Coordinate descent was first suggested for logistic regression by Friedman et al.
Namely, we used the mapping:

x_i' = +1 if x_i > −2, and −1 otherwise;    x_i'' = +1 if x_i > −4, and −1 otherwise.
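This thresholding can be sketched in a few lines (helper name and default cutoffs mirror the mapping above):

```python
# Coarse binary indicator features obtained by thresholding a real-valued
# input at fixed cutoffs (-2 and -4, as in the mapping above).

def threshold_features(x, cutoffs=(-2.0, -4.0)):
    return [1 if x > c else -1 for c in cutoffs]
```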
50 75 0 10 20 30 40 50 Iterations Figure 2: Verifying the forward direction of the equivalence relationship for AdaBoost and RankBoost regime of convergence is reached. [sent-290, score-0.23]
We present empirical evidence to support the forward direction of the theoretical equivalence relationship for p = 1, on both the training and test splits, for all data sets described above. In Figure 2, {λ_r}_r and {λ_c}_c denote sequences of coefficients produced by RankBoost and AdaBoost respectively; the subscripts r and c stand for ranking and classification. One important distinction between training and testing phases is that the ranking loss is required to decrease monotonically on the training set, but not on the test set.
The next experiment verifies the backward direction of the equivalence relationship. We demonstrate that a sequence of corrected λ's minimizing the P-Norm Push's objective also minimizes P-Classification's objective. Note that the pseudocode for minimizing logistic regression's objective function would be similar, with the only change being that the definition of the matrix M is the same as in Figure 1.
Figure 6 provides evidence that no such equivalence relationship holds for logistic regression and Logistic Regression Ranking. For this experiment, {λ_r}_r and {λ_c}_c denote sequences of coefficients produced by LRR and logistic regression respectively.
Figure 3: Verifying the forward direction of the equivalence theorem for P-Classification and the P-Norm Push (p = 4).
This result relies on properties of the loss functions, including smoothness, and the equivalence relationship of Theorem 2. Even so, AdaBoost generally provides models with high classification and ranking accuracy.
Input: Examples {(x_i, y_i)}_{i=1}^{m}, where x_i ∈ X, y_i ∈ {−1, 1}, features {h_j}_{j=1}^{n}, h_j : X → ℝ, number of iterations t_max.
Proof. This proof (in some sense) generalizes results from the analysis of AdaBoost and logistic regression (see Friedman et al.).
Figure 6: Doubt on the equivalence relationship for logistic regression and LRR.

Solving this equation for P(y = 1|x),

P(y = 1|x) = 1 / (1 − l′(1, f*(x)) / l′(−1, f*(x))).

For P-Classification's loss, this expression becomes

P(y = 1|x) = 1 / (1 + e^{−f_λ(x)(1+p)}).

This expression was obtained for P-Classification, and extends to the P-Norm Push (with λ corrected) by the equivalence relationship of Theorem 2.
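The resulting probability estimate is a one-liner (helper name is ours); at p = 1 it reduces to a logistic sigmoid with slope 2.

```python
import math

# Conditional probability estimate implied by the derivation above:
# P(y = 1 | x) = 1 / (1 + exp(-f(x) * (1 + p))).

def prob_estimate(f_x, p):
    return 1.0 / (1.0 + math.exp(-f_x * (1.0 + p)))
```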
Runtime Performances

Faster runtime is a major practical benefit of the equivalence relationship proved in Section 3. When comparing how ranking algorithms and classification algorithms approach the minimizers of the misranking error, the ranking algorithms tend to converge more rapidly in terms of the number of iterations.
For RankBoost, note that a more efficient implementation is possible for bipartite ranking (see Freund et al., 2003), though a more efficient implementation has not previously been explored in general for the P-Norm Push; in fact, the equivalence relationship allows us to use P-Classification instead, making it somewhat redundant to derive one.
Table 2 presents the number of iterations and the amount of time required to train each of the algorithms (AdaBoost, RankBoost, P-Classification for p = 4, the P-Norm Push for p = 4, logistic regression, and Logistic Regression Ranking) in an experiment.
For each algorithm, we report the mean and variance (over 10 training sets) of the number of iterations and the time elapsed (in seconds) for the ranking loss to be within 0.05% of its asymptotic value. The asymptotic value was obtained using 200 iterations of the corresponding ranking algorithm (for AdaBoost and RankBoost, we used RankBoost; for P-Classification and the P-Norm Push, we used the P-Norm Push; and for logistic regression and LRR, we used LRR).
Note that logistic regression may never converge to within 0.05% of the ranking loss obtained by LRR (there is no established equivalence relationship), so "N/A" has been placed in the table when this occurs.
Comparing the runtime performances of classification and ranking algorithms in Table 2, AdaBoost and P-Classification yield dramatic improvements over their ranking counterparts. Thus, the equivalence relationship between classification and ranking algorithms enables us to pass the efficiency of classification algorithms to their ranking counterparts, which leads to significant speed improvements for ranking tasks.
Experiments on Prediction Performance

When evaluating the prediction performance of the P-Classification algorithm, we chose precision as our performance metric, motivated by a specific relationship between precision and P-Classification's objective that we derive in this section. In a classification task, 100% precision means that every instance labeled as belonging to the positive class does indeed belong to the positive class, whereas 0% precision means that none of the instances labeled as positive actually belong to the positive class. Through the equivalence of the P-Norm Push and P-Classification, a similar relationship with precision also exists for the P-Norm Push.
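Precision of a scorer thresholded at zero can be computed directly, matching the definition just given (the helper is ours; the paper relates precision to P-Classification's objective analytically):

```python
# Fraction of predicted positives (score > 0) that are truly positive.

def precision(scores, labels):
    predicted_pos = [y for s, y in zip(scores, labels) if s > 0]
    if not predicted_pos:
        return 0.0
    return sum(1 for y in predicted_pos if y == 1) / len(predicted_pos)
```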
Another metric for evaluating the performance of a ranking function is recall, which is defined as the number of true positives divided by the total number of positive examples. We chose training sets as follows: for MAGIC, 1000 randomly chosen examples; for Yeast, 500 randomly chosen examples; for Banana, 1000 randomly chosen examples; and for Letter Recognition, 1200 examples with 200 positives and 1000 negatives, to achieve an imbalance ratio of 1:5.
Using C > 1 is equivalent to assigning a higher misclassification penalty to the negative examples, which can result in a stronger downward push on these examples, raising precision.
In comparing P-Classification (p = 4) with logistic regression, P-Classification performed worse than logistic regression in only one comparison out of 12 (3 γ levels and 4 data sets). Considering their cost-sensitive variants with C = #neg/#pos, P-Classification and logistic regression generally outperformed their original (non-cost-sensitive) formulations (10 out of 12 comparisons for P-Classification vs. Cost-Sensitive P-Classification with p = 4, and 11 out of 12 comparisons for logistic regression vs. cost-sensitive logistic regression). Furthermore, Cost-Sensitive P-Classification performed worse than cost-sensitive logistic regression in only 2 out of 12 comparisons.
As shown in Section 4, logistic regression and LRR do not seem to exhibit the equivalence property that we have established for boosting-style algorithms. Consequently, neither logistic regression nor LRR may have the benefit of low classification loss and ranking loss simultaneously. This limitation can be mitigated to some degree by combining the benefits of logistic regression and LRR into a single hybrid algorithm that aims to solve both classification and ranking problems simultaneously. Setting β = 0 reduces the hybrid loss to that of logistic regression, whereas increasing β tilts the balance towards LRR.
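A sketch of this hybrid objective: logistic regression's classification loss plus β times LRR's pairwise ranking loss. The simple additive combination is our reading of the text, not a verbatim transcription of the paper's formula.

```python
import math

# Hybrid objective: per-example logistic loss plus beta times LRR's
# pairwise logistic ranking loss. beta = 0 recovers logistic regression.

def hybrid_loss(f_pos, f_neg, beta):
    lr = sum(math.log(1.0 + math.exp(-f)) for f in f_pos) + \
         sum(math.log(1.0 + math.exp(f)) for f in f_neg)
    lrr = sum(math.log(1.0 + math.exp(-(fi - fk)))
              for fi in f_pos for fk in f_neg)
    return lr + beta * lrr
```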
The trade-off between classification and ranking accuracy is shown explicitly in Figure 7, which presents the 0-1 classification loss and 0-1 ranking loss of this hybrid approach at various β settings, against the baseline performance of AdaBoost, logistic regression and LRR. For this particular experiment, logistic regression was able to achieve a better misclassification result than AdaBoost (see Figure 7(a)), but at the expense of a very large misranking error (see Figure 7(b)). The value of β should be chosen based on the desired performance criteria for the specific application, determining the balance between classification and ranking accuracy.
Conclusion

We showed an equivalence relationship between two algorithms for two different tasks, based on a relationship between the minimizers of their objective functions. This equivalence relationship provides an explanation for why these algorithms perform well with respect to multiple evaluation metrics; it allows us to compute conditional probability estimates for ranking algorithms, and it permits the solution of ranking problems an order of magnitude faster. We showed that our new classification algorithm is related to a performance metric used for ranking, and studied empirically how aspects of our new classification algorithm influence ranking performance. Finally, we presented a new algorithm inspired by logistic regression that solves a task that is somewhere between classification and ranking, with the goal of providing solutions to both problems.
References

Cynthia Rudin. The P-Norm Push: A simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 2009.

Cynthia Rudin and Robert E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 2009.
6 0.046001062 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
7 0.045632441 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
8 0.043462895 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
9 0.040776335 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.036501944 35 jmlr-2011-Forest Density Estimation
11 0.035447229 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
12 0.033812579 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
13 0.030264787 66 jmlr-2011-Multiple Kernel Learning Algorithms
14 0.028873261 55 jmlr-2011-Learning Multi-modal Similarity
15 0.028617267 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.028042641 28 jmlr-2011-Double Updating Online Learning
17 0.027384905 58 jmlr-2011-Learning from Partial Labels
18 0.02511367 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
19 0.024981763 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
20 0.02411432 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
topicId topicWeight
[(0, 0.143), (1, -0.011), (2, -0.019), (3, -0.038), (4, 0.036), (5, -0.079), (6, -0.068), (7, -0.008), (8, -0.074), (9, 0.033), (10, -0.199), (11, -0.006), (12, 0.006), (13, -0.103), (14, -0.02), (15, 0.164), (16, -0.107), (17, -0.045), (18, 0.08), (19, 0.029), (20, 0.082), (21, 0.089), (22, -0.258), (23, 0.08), (24, 0.175), (25, 0.01), (26, 0.142), (27, -0.098), (28, 0.039), (29, -0.128), (30, -0.026), (31, 0.23), (32, 0.313), (33, 0.203), (34, -0.006), (35, 0.036), (36, 0.171), (37, -0.016), (38, -0.091), (39, 0.03), (40, 0.007), (41, -0.027), (42, 0.056), (43, -0.102), (44, -0.007), (45, -0.025), (46, 0.047), (47, 0.15), (48, 0.116), (49, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.95394498 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
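The abstract contrasts two evaluation metrics, misclassification error and the area under the ROC curve. Both can be made concrete in a few lines; the function names are ours, and labels are assumed to be in {-1, +1}:

```python
def misclassification_error(scores, labels, threshold=0.0):
    """Fraction of examples whose thresholded score disagrees with its label."""
    wrong = sum(1 for s, y in zip(scores, labels)
                if (1 if s > threshold else -1) != y)
    return wrong / len(labels)

def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs ranked correctly by the scores, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    correct = sum(1.0 if p > n else (0.5 if p == n else 0.0)
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))
```

The pairwise form of the AUC is what ties it to bipartite ranking: a scoring function with high AUC orders positives above negatives, regardless of where the classification threshold sits.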
2 0.60936457 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Thus our result suggests that the minimum margin may not be crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
3 0.28571826 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
4 0.2600475 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program existing of O(n) constraints and O(n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O(n²) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets.
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis
5 0.24310152 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
6 0.20739686 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
7 0.20225547 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
8 0.19876806 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
9 0.18523282 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
10 0.16481204 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.1624673 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
12 0.15718478 6 jmlr-2011-A Simpler Approach to Matrix Completion
13 0.15342127 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
14 0.15272275 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
15 0.15117881 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
16 0.14740947 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
17 0.14010656 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
18 0.13792162 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
19 0.13758881 35 jmlr-2011-Forest Density Estimation
20 0.13679083 66 jmlr-2011-Multiple Kernel Learning Algorithms
topicId topicWeight
[(4, 0.039), (9, 0.025), (10, 0.038), (24, 0.047), (31, 0.072), (32, 0.045), (41, 0.021), (60, 0.011), (71, 0.012), (73, 0.029), (78, 0.175), (88, 0.376)]
simIndex simValue paperId paperTitle
Author: Jim C. Huang, Brendan J. Frey
Abstract: We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the ‘derivative-sum-product’ (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research. Keywords: graphical models, cumulative distribution function, message-passing algorithm, inference
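The abstract notes that in a CDN marginals are limits of the CDF and conditionals correspond to mixed derivatives. The density-from-CDF correspondence can be illustrated numerically with finite differences; the function name and test distribution are ours:

```python
def mixed_partial(F, x, y, h=1e-4):
    """Recover a bivariate density value from its joint CDF F(x, y)
    via the mixed partial d^2 F / dx dy, using central differences."""
    return (F(x + h, y + h) - F(x + h, y - h)
            - F(x - h, y + h) + F(x - h, y - h)) / (4.0 * h * h)
```

For two independent Uniform(0, 1) variables the joint CDF in the interior of the unit square is F(x, y) = x*y, whose mixed partial is the constant density 1, and the finite-difference estimate matches it closely.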
same-paper 2 0.73483741 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
3 0.47384152 22 jmlr-2011-Differentially Private Empirical Risk Minimization
Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
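The objective-perturbation idea — add a random linear term to the regularized objective before minimizing — can be sketched in a few lines. This is an illustrative simplification, not the paper's mechanism: the noise here is a Gaussian surrogate with an ad hoc scale, whereas the actual algorithm draws the perturbation from a specific ε-dependent distribution and requires convexity and differentiability conditions on the loss:

```python
import math
import random

def perturbed_logreg(X, y, lam=0.1, eps=1.0, steps=400, lr=0.2, seed=0):
    """Minimize (lam/2)||w||^2 + (1/n) sum log(1+exp(-y_i x_i.w)) + b.w/n
    by gradient descent, where b is random noise added to the objective.
    Noise scale 2/eps is an illustrative stand-in, not the paper's law."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    b = [rng.gauss(0.0, 2.0 / eps) for _ in range(d)]  # perturbation vector
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * w[j] + b[j] / n for j in range(d)]
        for xi, yi in zip(X, y):
            margin = yi * sum(xi[j] * w[j] for j in range(d))
            coef = -yi / (1.0 + math.exp(margin))  # d/dm of log(1+e^{-m})
            for j in range(d):
                grad[j] += coef * xi[j] / n
        for j in range(d):
            w[j] -= lr * grad[j]
    return w
```

With a large privacy budget the perturbation is negligible and the result approaches the ordinary regularized logistic regression solution; as eps shrinks, the random linear term pulls the minimizer away from it, which is the privacy/accuracy tradeoff the abstract describes.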
4 0.4694196 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
5 0.46120009 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering
Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of model-free methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, best-match learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse model-based methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse
6 0.45982361 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
7 0.44643912 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
8 0.44165021 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
9 0.44078568 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
10 0.44061038 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
11 0.43689695 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
12 0.43208879 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
13 0.43086571 91 jmlr-2011-The Sample Complexity of Dictionary Learning
14 0.42711365 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
15 0.42627898 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
16 0.42548645 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
17 0.42345351 104 jmlr-2011-X-Armed Bandits
18 0.42249942 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.42248088 36 jmlr-2011-Generalized TD Learning
20 0.42140788 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination