jmlr jmlr2011 jmlr2011-71 knowledge-graph by maker-knowledge-mining

71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms


Source: pdf

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Specifically, we provide such an equivalence relationship between a generalization of Freund et al. [sent-6, score-0.193]

2 There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. [sent-9, score-0.56]

3 Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression 1. [sent-10, score-0.193]

4 In this work, we show that a particular equivalence relationship exists between two algorithms: a ranking algorithm called the P-Norm Push (Rudin, 2009), which is a generalization of RankBoost (Freund et al. [sent-21, score-0.409]

5 Specifically, we show that P-Classification not only optimizes its objective, but also optimizes the objective of the P-Norm Push (and vice versa, the P-Norm Push can be made to optimize P-Classification’s objective function). [sent-23, score-0.148]

6 Thus, P-Classification and the P-Norm Push perform equally well on both of their objectives; P-Classification can be used as a ranking algorithm, and the P-Norm Push can be made into a classification algorithm. [sent-24, score-0.216]

7 It is not clear that such an equivalence relationship holds between logistic regression and its ranking counterpart; in fact, our experiments indicate that no such relationship can hold. [sent-27, score-0.631]

8 Bipartite ranking problems are similar to but distinct from binary classification problems. [sent-28, score-0.216]

9 In both bipartite ranking and classification problems, the learner is given a training set of examples {(x1 , y1 ), . [sent-29, score-0.293]

10 The goal of bipartite ranking is to learn a real-valued ranking function f : X → Ê that ranks future positive instances higher than negative ones. [sent-33, score-0.531]

11 Bipartite ranking algorithms optimize rank statistics, such as the AUC. [sent-34, score-0.234]

12 They showed that AdaBoost, which iteratively minimizes the exponential misclassification loss, also iteratively minimizes RankBoost’s exponential ranking loss, and vice versa, that RankBoost with trivial modifications can be made to minimize the exponential misclassification loss. [sent-39, score-0.276]

13 The first result of our work, provided in Section 3, is to broaden that proof to handle more general ranking losses and classification losses. [sent-40, score-0.216]

14 The more general ranking loss is that of the P-Norm Push, which concentrates on “pushing” negatives away from the top of the ranked list. [sent-41, score-0.319]

15 Also in Section 3, we consider another simple cost-sensitive version of AdaBoost, and show the forward direction of its equivalence to RankBoost; in this case, the cost-sensitive version of AdaBoost minimizes RankBoost’s objective no matter what the cost parameter is. [sent-43, score-0.254]

16 In Section 4, we will verify the equivalence relationship empirically and provide evidence that no such relationship holds for logistic regression and its ranking counterpart. [sent-44, score-0.631]

17 In Section 5 we discuss the first two main benefits gained by this equivalence relationship described above, namely obtaining probability estimates, and computing solutions faster. [sent-45, score-0.193]

18 , 2008), since in this work, the same set of features is used for both the classification and ranking problems without any modification. [sent-52, score-0.216]

19 Another recent work that addresses the use of classification algorithms for solving ranking problems is that of Kotlowski et al. [sent-53, score-0.216]

20 (2011), who show loss and regret bounds on the ranking performance of classifiers. [sent-54, score-0.259]

21 This y-intercept term is important in the equivalence proof; in order to turn the P-Norm Push into a classification algorithm, we need to place a decision boundary by adjusting the y-intercept. [sent-73, score-0.139]

22 The quality of a ranking function is often measured by the area under the ROC curve (AUC). [sent-75, score-0.216]

23 The associated misranking loss, related to 1 − AUC, is the number of positive instances that are ranked below negative instances: standard misranking loss( f ) = ∑_{i=1}^{I} ∑_{k=1}^{K} 1[ f(x_i) ≤ f(x̃_k) ]. (1) [sent-76, score-0.243]

24 The ranking loss is zero when all negative instances are ranked below the positive instances. [sent-77, score-0.148]
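For concreteness, the 0-1 misranking loss of Eq. (1) and its relation to the empirical AUC can be computed directly; the scores below are hypothetical:

```python
# Hypothetical scores for I = 3 positive and K = 4 negative examples.
pos_scores = [2.0, 1.0, 0.5]         # f(x_i) for positives
neg_scores = [1.5, 0.4, 0.1, -0.3]   # f(x~_k) for negatives

def misranking_loss(pos, neg):
    """0-1 misranking loss of Eq. (1): count pairs with f(x_i) <= f(x~_k)."""
    return sum(1 for p in pos for n in neg if p <= n)

def auc(pos, neg):
    """Empirical AUC: fraction of positive/negative pairs ranked correctly."""
    correct = sum(1 for p in pos for n in neg if p > n)
    return correct / (len(pos) * len(neg))

loss = misranking_loss(pos_scores, neg_scores)
print(loss)                          # 2 pairs misranked: (1.0, 1.5) and (0.5, 1.5)
print(auc(pos_scores, neg_scores))   # 10/12, i.e., 1 - loss / (I * K)
```

With no ties, loss / (I·K) equals 1 − AUC exactly, which is the "related to 1 − AUC" statement above.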

25 (3) The ranking counterpart of AdaBoost is RankBoost (Freund et al. [sent-82, score-0.216]

26 We also investigate the equivalence relationship between logistic regression and its ranking counterpart, which we call “Logistic Regression Ranking” (LRR). [sent-89, score-0.56]

27 LRR’s objective (6) is an upper bound on the 0-1 ranking loss in (1), using the logistic loss log(1 + e−z ) to upper bound the 0-1 loss 1z≤0 . [sent-94, score-0.512]
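The surrogate bound can be checked numerically. One caveat: with the natural log, log(1 + e^{−z}) equals ln 2 < 1 at z = 0, so the pointwise domination of 1_{z≤0} holds exactly for the base-2 logistic loss (the natural-log version dominates ln 2 times the 0-1 loss). The sketch below, on hypothetical scores, uses log base 2:

```python
import math

pos_scores = [1.2, 0.3]        # hypothetical f(x_i)
neg_scores = [0.8, -0.5, 0.0]  # hypothetical f(x~_k)

def misrank_count(pos, neg):
    return sum(1 for p in pos for n in neg if p <= n)

def lrr_objective(pos, neg, log=math.log):
    """LRR-style pairwise logistic loss: sum over positive/negative pairs."""
    return sum(log(1.0 + math.exp(-(p - n))) for p in pos for n in neg)

zero_one = misrank_count(pos_scores, neg_scores)
surrogate = lrr_objective(pos_scores, neg_scores, log=math.log2)
assert surrogate >= zero_one   # base-2 logistic loss dominates the 0-1 loss
```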

28 Unlike the P-Norm Push, it minimizes a weighted misclassification error (rather than a weighted misranking error), though we will show that minimization of either objective yields an equally good result for either problem. [sent-98, score-0.155]

29 Lozano and Abe (2008) also use ℓ p norms within a boosting-style algorithm, but for the problem of label ranking instead of example ranking. [sent-143, score-0.216]

30 However, their objective is entirely different from ours; for instance, it does not correspond directly to a 0-1 misranking loss like (1). [sent-144, score-0.168]

31 We will now show that the P-Norm Push is equivalent to P-Classification, meaning that minimizers of P-Classification’s objective are also minimizers of the P-Norm Push’s objective, and that there is a trivial transformation of the P-Norm Push’s output that will minimize P-Classification’s objective. [sent-145, score-0.181]

32 The forward direction of the equivalence relationship is as follows: Theorem 1 (P-Classification minimizes P-Norm Push’s objective) If λPC ∈ argminλ R PC (λ) (assuming minimizers exist), then λPC ∈ argminλ R PN (λ). [sent-152, score-0.318]

33 For j = 1, . . . , n:
0 = ∂R^PC(λ)/∂λ_j |_{λ=λ^PC} = ∑_{k=1}^{K} e^{p ∑_j λ_j^PC h_j(x̃_k)} h_j(x̃_k) + ∑_{i=1}^{I} e^{−∑_j λ_j^PC h_j(x_i)} (−h_j(x_i)) = ∑_{k=1}^{K} v_k^p h_j(x̃_k) + ∑_{i=1}^{I} q_i (−h_j(x_i)), (8)
where v_k := e^{f_{λ^PC}(x̃_k)} and q_i := e^{−f_{λ^PC}(x_i)}. [sent-159, score-0.233]

34 Using this, we can derive the skew condition:
0 = ∑_{k=1}^{K} v_k^p − ∑_{i=1}^{I} q_i := p R_−^PC(λ^PC) − R_+^PC(λ^PC). (skew condition) (9) [sent-162, score-0.196]

35 Step 3 is to characterize the conditions to be at a minimizer of the ranking loss. [sent-163, score-0.259]

36 Using the skew condition (9), and then (8),
∂R^PN(λ)/∂λ_j |_{λ=λ^PC} = p (∑_{i=1}^{I} q_i)^{p−1} [ (∑_{i=1}^{I} q_i) ∑_{k=1}^{K} h_j(x̃_k) v_k^p − (∑_{k=1}^{K} v_k^p) ∑_{i=1}^{I} h_j(x_i) q_i ] = p (∑_{i=1}^{I} q_i)^p ∂R^PC(λ)/∂λ_j |_{λ=λ^PC} = 0. [sent-167, score-0.347]
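This gradient factorization can be checked numerically. The sketch below assumes the P-Norm Push objective R^PN(λ) = ∑_k (∑_i e^{fλ(x̃_k) − fλ(x_i)})^p and hypothetical feature values H_pos, H_neg; finite differences confirm that the analytic gradient p(∑q)^{p−1}[(∑q)∑_k h_j(x̃_k)v_k^p − (∑_k v_k^p)∑_i h_j(x_i)q_i] matches, so under the skew condition ∑_k v_k^p = ∑_i q_i it is proportional to ∂R^PC/∂λ_j:

```python
import math

# Tiny hypothetical data: feature values h_j on positives and negatives.
H_pos = [[1.0, 0.2], [0.3, -0.4]]                # I = 2 positives, n = 2 features
H_neg = [[0.5, 1.0], [-0.2, 0.1], [0.8, -0.3]]   # K = 3 negatives
p = 4

def score(h, lam):
    return sum(hj * lj for hj, lj in zip(h, lam))

def R_PN(lam):
    """P-Norm Push objective (sketch): sum_k (sum_i e^{f(neg_k) - f(pos_i)})^p."""
    return sum(sum(math.exp(score(hk, lam) - score(hi, lam)) for hi in H_pos) ** p
               for hk in H_neg)

def grad_identity(lam):
    """Analytic gradient of R_PN via the factorization in the text."""
    v = [math.exp(score(hk, lam)) for hk in H_neg]
    q = [math.exp(-score(hi, lam)) for hi in H_pos]
    Sq, Svp = sum(q), sum(vk ** p for vk in v)
    grad = []
    for j in range(2):
        A = sum(hk[j] * vk ** p for hk, vk in zip(H_neg, v))
        B = sum(hi[j] * qi for hi, qi in zip(H_pos, q))
        grad.append(p * Sq ** (p - 1) * (Sq * A - Svp * B))
    return grad

lam, eps = [0.3, -0.1], 1e-6
for j in range(2):
    hi_ = list(lam); hi_[j] += eps
    lo_ = list(lam); lo_[j] -= eps
    numeric = (R_PN(hi_) - R_PN(lo_)) / (2 * eps)
    assert abs(numeric - grad_identity(lam)[j]) < 1e-4 * max(1.0, abs(numeric))
```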

37 The backwards direction of the equivalence relationship is as follows: Theorem 2 (The P-Norm Push can be trivially altered to minimize P-Classification’s objective. [sent-169, score-0.211]

38 Proof We will first show that the skew condition is satisfied for corrected vector λPN,corr . [sent-176, score-0.141]

39 The condition we need to prove is:
∑_{i=1}^{I} q_i^corr = ∑_{k=1}^{K} (v_k^corr)^p (skew condition), (13)
where v_k^corr := e^{f_{λ^{PN,corr}}(x̃_k)} and q_i^corr := e^{−f_{λ^{PN,corr}}(x_i)}. [sent-177, score-0.229]
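One concrete way to satisfy a skew condition of the form (13) is to add an intercept b to all scores: solving ∑_i e^{−(f(x_i)+b)} = ∑_k e^{p(f(x̃_k)+b)} gives the closed form b = ln(∑_i e^{−f(x_i)} / ∑_k e^{p f(x̃_k)}) / (p + 1). The scores below are hypothetical, and the theorem's exact correction may differ; this is only a sketch of the intercept adjustment:

```python
import math

p = 4
pos_scores = [0.9, 0.1, -0.3]   # f(x_i), hypothetical
neg_scores = [0.4, -0.8]        # f(x~_k), hypothetical

Sq  = sum(math.exp(-f) for f in pos_scores)
Svp = sum(math.exp(p * f) for f in neg_scores)

# Closed-form intercept that makes the shifted scores satisfy Eq. (13).
b = math.log(Sq / Svp) / (p + 1)

Sq_corr  = sum(math.exp(-(f + b)) for f in pos_scores)
Svp_corr = sum(math.exp(p * (f + b)) for f in neg_scores)
assert abs(Sq_corr - Svp_corr) < 1e-9 * Sq_corr   # skew condition holds
```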

40 According to (10), at λ = λ^{PN,corr},
∂R^PN(λ)/∂λ_j |_{λ=λ^{PN,corr}} = p (∑_i q_i^corr)^{p−1} [ (∑_i q_i^corr) ∑_k h_j(x̃_k)(v_k^corr)^p − (∑_k (v_k^corr)^p) ∑_i h_j(x_i) q_i^corr ]. [sent-182, score-0.176]

41 To show this, we prove the forward direction of the equivalence relationship between Cost-Sensitive AdaBoost (for any C) and RankBoost. [sent-194, score-0.23]

42 Here is the forward direction of the equivalence relationship: Theorem 3 (Cost-Sensitive AdaBoost minimizes RankBoost’s objective. [sent-196, score-0.189]

43 At λ^CSA we have:
0 = ∂R^CSA(λ)/∂λ_j |_{λ=λ^CSA} = ∂R_+^AB(λ)/∂λ_j + C ∂R_−^AB(λ)/∂λ_j = ∑_{i=1}^{I} q_i (−h_j(x_i)) + C ∑_{k=1}^{K} v_k h_j(x̃_k), (16)
where v_k := e^{f_{λ^CSA}(x̃_k)} and q_i := e^{−f_{λ^CSA}(x_i)}. [sent-200, score-0.248]

44 Using this, the skew condition can be derived as follows:
0 = C ∑_{k=1}^{K} v_k − ∑_{i=1}^{I} q_i := C R_−^AB(λ^CSA) − R_+^AB(λ^CSA). (skew condition) [sent-203, score-0.211]
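RankBoost's exponential ranking loss factors over positives and negatives, which is what makes these gradient manipulations work. The sketch below, on hypothetical features, checks the factorization R^RB(λ) = (∑_k v_k)(∑_i q_i) and the resulting gradient identity by finite differences:

```python
import math

H_pos = [[0.6, -0.2]]               # one positive example, hypothetical features
H_neg = [[0.1, 0.9], [-0.5, 0.3]]   # two negative examples

def f(h, lam):
    return sum(a * b for a, b in zip(h, lam))

def R_RB(lam):
    """RankBoost exponential ranking loss: sum over pairs of e^{f(neg)-f(pos)}.
    It factors as (sum_k v_k) * (sum_i q_i)."""
    return sum(math.exp(f(hk, lam) - f(hi, lam)) for hk in H_neg for hi in H_pos)

lam = [0.2, 0.4]
v = [math.exp(f(hk, lam)) for hk in H_neg]
q = [math.exp(-f(hi, lam)) for hi in H_pos]
assert abs(R_RB(lam) - sum(v) * sum(q)) < 1e-12 * R_RB(lam)

# Gradient identity: dR_RB/dlam_j = (sum q) sum_k h_j v_k - (sum v) sum_i h_j q_i.
eps = 1e-6
for j in range(2):
    hi_ = list(lam); hi_[j] += eps
    lo_ = list(lam); lo_[j] -= eps
    numeric = (R_RB(hi_) - R_RB(lo_)) / (2 * eps)
    analytic = (sum(q) * sum(hk[j] * vk for hk, vk in zip(H_neg, v))
                - sum(v) * sum(hp[j] * qi for hp, qi in zip(H_pos, q)))
    assert abs(numeric - analytic) < 1e-6 * max(1.0, abs(numeric))
```

At a Cost-Sensitive AdaBoost stationary point, (16) gives C ∑_k h_j(x̃_k)v_k = ∑_i h_j(x_i)q_i, and with the skew condition C ∑_k v_k = ∑_i q_i the identity above makes the RankBoost gradient vanish.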

45 We now characterize the conditions to be at a minimizer of the ranking loss. [sent-204, score-0.259]

46 We further investigated whether a similar equivalence property holds for logistic regression and Logistic Regression Ranking (“LRR,” defined in Section 2). [sent-214, score-0.273]

47 The results suggest that such an equivalence relationship does not hold between these two algorithms. [sent-216, score-0.193]

48 Coordinate descent was first suggested for logistic regression by Friedman et al. [sent-218, score-0.151]
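A minimal coordinate-descent sketch for logistic regression's objective is shown below, using a damped Newton step per coordinate with backtracking so the training loss never increases. The data are toy and hypothetical, and this is a stand-in for, not a copy of, the paper's pseudocode:

```python
import math

# Toy, hypothetical binary data: feature vectors and labels in {-1, +1}.
X = [[1.0, 0.5], [0.2, -1.0], [-0.8, 0.3], [-0.1, -0.6]]
Y = [1, 1, -1, -1]

def logistic_loss(lam):
    return sum(math.log(1.0 + math.exp(-y * sum(a * b for a, b in zip(x, lam))))
               for x, y in zip(X, Y))

lam = [0.0, 0.0]
losses = [logistic_loss(lam)]
for _ in range(20):                      # sweeps over all coordinates
    for j in range(len(lam)):
        g = h = 0.0                      # 1-D gradient / curvature for coordinate j
        for x, y in zip(X, Y):
            margin = y * sum(a * b for a, b in zip(x, lam))
            s = 1.0 / (1.0 + math.exp(margin))
            g += -y * x[j] * s
            h += x[j] ** 2 * s * (1.0 - s)
        step = -g / h if h > 1e-12 else 0.0
        base = logistic_loss(lam)
        # backtracking: halve the step until the objective does not increase
        while abs(step) > 1e-12:
            trial = list(lam)
            trial[j] += step
            if logistic_loss(trial) <= base:
                lam = trial
                break
            step *= 0.5
    losses.append(logistic_loss(lam))

# The training objective decreases monotonically, sweep after sweep.
assert all(b <= a + 1e-9 for a, b in zip(losses, losses[1:]))
assert losses[-1] < losses[0]
```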

49 Namely, we used the mapping:
x_i′ = +1 if x_i > −2, −1 otherwise, and x_i′′ = +1 if x_i > −4, −1 otherwise. [sent-244, score-0.176]
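The thresholding above can be written directly; the −1 branch of the second feature is not visible in the extracted text and is assumed symmetric here:

```python
def discretize(x):
    """Two binary features obtained by thresholding the raw value x."""
    x1 = +1 if x > -2 else -1
    x2 = +1 if x > -4 else -1   # the 'otherwise' branch assumed symmetric
    return x1, x2

assert discretize(0.0)  == (1, 1)
assert discretize(-3.0) == (-1, 1)
assert discretize(-5.0) == (-1, -1)
```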

50 Figure 2: Verifying the forward direction of the equivalence relationship for AdaBoost and RankBoost. …regime of convergence is reached. [sent-290, score-0.23]

51 We present empirical evidence to support the forward direction of the theoretical equivalence relationship for p = 1, on both the training and test splits, for all data sets described above. [sent-291, score-0.263]

52 In Figure 2, {λr }r and {λc }c denote sequences of coefficients produced by RankBoost and AdaBoost respectively; the subscripts r and c stand for ranking and classification. [sent-292, score-0.216]

53 One important distinction between training and testing phases is that the ranking loss is guaranteed to decrease monotonically on the training set, but not on the test set. [sent-304, score-0.308]

54 The next experiment verifies the backwards direction of the equivalence relationship. [sent-306, score-0.14]

55 We demonstrate that a sequence of corrected λ’s minimizing P-Norm Push’s objective also minimizes P-Classification’s objective. [sent-307, score-0.149]

56 Note that the pseudocode for minimizing logistic regression’s objective function would be similar, with the only change being that the definition of the matrix M is the same as in Figure 1. [sent-312, score-0.188]

57 Figure 6 provides evidence that no such equivalence relationship holds for logistic regression and Logistic Regression Ranking. [sent-313, score-0.344]

58 For this experiment, {λr }r and {λc }c denote sequences of coefficients produced by LRR and logistic regression respectively. [sent-314, score-0.151]

59 Figure 3: Verifying the forward direction of the equivalence theorem for P-Classification and the P-Norm Push (p=4). [sent-354, score-0.159]

60 This result relies on properties of the loss functions, including smoothness, and the equivalence relationship of Theorem 2. [sent-359, score-0.236]

61 , 2007), even though AdaBoost generally provides models with high classification and ranking accuracy (e. [sent-363, score-0.216]

62 Input: Examples {(x_i, y_i)}_{i=1}^{m}, where x_i ∈ X, y_i ∈ {−1, 1}, features {h_j}_{j=1}^{n}, h_j : X → R, number of iterations t_max. [sent-370, score-0.147]

63 Proof This proof (in some sense) generalizes results from the analysis of AdaBoost and logistic regression (see Friedman et al. [sent-398, score-0.151]

64 Figure 6: Doubt on equivalence relationship for logistic regression and LRR. …and solving this equation for P(y = 1|x): P(y = 1|x) = 1 / (1 + l′(1, f*(x)) / (−l′(−1, f*(x)))). [sent-436, score-0.344]

65 This gives P(y = 1|x) = 1 / (1 + e^{−f_λ(x)(1+p)}). This expression was obtained for P-Classification, and extends to the P-Norm Push (with λ corrected) by the equivalence relationship of Theorem 2. [sent-441, score-0.193]
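The probability estimate can be sanity-checked against the general recipe P(y = 1|x) = 1/(1 + l′(1, f*(x))/(−l′(−1, f*(x)))), instantiated with a P-Classification-style loss l(1, f) = e^{−f}, l(−1, f) = (1/p) e^{p f}. This is a sketch of that loss; the paper's exact normalization may differ, but the ratio of derivatives is what matters:

```python
import math

p = 4

def prob_direct(f):
    """Closed form from the text: P(y=1|x) = 1 / (1 + e^{-f(x)(1+p)})."""
    return 1.0 / (1.0 + math.exp(-f * (1 + p)))

def prob_from_loss(f):
    """General recipe P = 1/(1 + l'(1,f)/(-l'(-1,f))) with the sketched
    P-Classification loss l(1,f) = e^{-f}, l(-1,f) = (1/p) e^{pf}."""
    d_pos = -math.exp(-f)       # l'(1, f)
    d_neg = math.exp(p * f)     # l'(-1, f)
    return 1.0 / (1.0 + d_pos / (-d_neg))

for f in (-1.0, 0.0, 0.5, 2.0):
    assert abs(prob_direct(f) - prob_from_loss(f)) < 1e-12
```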

66 2 Runtime Performances Faster runtime is a major practical benefit of the equivalence relationship proved in Section 3. [sent-508, score-0.193]

67 3, when comparing how ranking algorithms and classification algorithms approach the minimizers of the misranking error, the ranking algorithms tend to converge more rapidly in terms of the number of iterations. [sent-511, score-0.55]

68 For RankBoost, note that a more efficient implementation is possible for bipartite ranking (see Section 3. [sent-516, score-0.258]

69 , 2003), though a more efficient implementation has not previously been explored in general for the P-Norm Push; in fact, the equivalence relationship allows us to use P-Classification instead, making it somewhat redundant to derive one. [sent-518, score-0.193]

70 Table 2 presents the number of iterations and the amount of time required to train each of the algorithms (AdaBoost, RankBoost, P-Classification for p=4, P-Norm Push for p=4, logistic regression, and Logistic Regression Ranking) in an experiment, using a 2. [sent-519, score-0.215]

71 For each algorithm, we report the mean and variance (over 10 training sets) of the number of iterations and the time elapsed (in seconds) for the ranking loss to be within 0.05% of its asymptotic value. [sent-525, score-0.348]

72 The asymptotic value was obtained using 200 iterations of the corresponding ranking algorithm (for AdaBoost and RankBoost, we used RankBoost; for P-Classification and the P-Norm Push, we used the P-Norm Push; and for logistic regression and LRR, we used LRR). [sent-527, score-0.44]

73 Note that logistic regression may never converge to within 0.05% of the ranking loss obtained by LRR (there is no established equivalence relationship), so “N/A” has been placed in the table when this occurs. [sent-528, score-0.151] [sent-529]

75 Comparing the runtime performances of classification and ranking algorithms in Table 2, AdaBoost and P-Classification yield dramatic improvement over their ranking counterparts. [sent-530, score-0.432]

76 Thus, the equivalence relationship between classification and ranking algorithms enables us to pass the efficiency of classification algorithms to their ranking counterparts, which leads to significant speed improvement for ranking tasks. [sent-541, score-0.841]

77 Experiments on Prediction Performance. When evaluating the prediction performance of the P-Classification algorithm, we chose precision as our performance metric, motivated by a specific relationship between precision and P-Classification’s objective that we derive in this section. [sent-543, score-0.268]

78 In a classification task, 100% precision means that every instance labeled as belonging to the positive class does indeed belong to the positive class, whereas 0% precision means that every instance labeled as positive is in fact negative. [sent-546, score-0.164]

79 Through the equivalence of the P-Norm Push and P-Classification, a similar relationship with precision also exists for the P-Norm Push. [sent-554, score-0.259]

80 Another metric for evaluating the performance of a ranking function is recall, which is defined as the number of true positives divided by the total number of positive examples. [sent-570, score-0.276]
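Precision and recall for a scoring function with a decision cutoff can be computed as follows, on hypothetical scores and labels:

```python
def precision_recall(scores, labels, cutoff=0.0):
    """Precision and recall of the classifier sign(f(x) - cutoff)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == -1)
    pos = sum(1 for y in labels if y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / pos if pos else 0.0
    return precision, recall

scores = [1.4, 0.9, 0.2, -0.1, -0.7]
labels = [1, -1, 1, 1, -1]
prec, rec = precision_recall(scores, labels)
print(prec, rec)   # 2 of 3 predicted positives are true; 2 of 3 positives recovered
```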

81 We chose training sets as follows: for MAGIC, 1000 randomly chosen examples, for Yeast, 500 randomly chosen examples, for Banana, 1000 randomly chosen examples and for Letter Recognition, 1200 examples with 200 positives and 1000 negatives to achieve an imbalance ratio of 1:5. [sent-577, score-0.14]

82 Using C > 1 is equivalent to giving higher misclassification penalty to the negative examples, which can result in a stronger downward push on these examples, raising precision. [sent-906, score-0.296]

83 In comparing P-Classification (p=4) with logistic regression, P-Classification performed worse than logistic regression in only one comparison out of 12 (3 γ levels and 4 data sets). [sent-909, score-0.253]

84 Considering their cost-sensitive variants with C = #neg/#pos, P-Classification and logistic regression generally outperformed their original (non-cost-sensitive) formulations (10 out of 12 comparisons for P-Classification vs. [sent-910, score-0.151]

85 Cost-Sensitive P-Classification with p=4, and 11 out of 12 comparisons for logistic regression vs cost-sensitive logistic regression). [sent-911, score-0.253]

86 Furthermore, Cost-Sensitive P-Classification performed worse than cost-sensitive logistic regression in only 2 out of 12 comparisons. [sent-912, score-0.151]

87 Logistic regression and LRR do not seem to exhibit the equivalence property that we have established for boosting-style algorithms. [sent-915, score-0.273]

88 Consequently, neither logistic regression or LRR may have the benefit of low classification loss and ranking loss simultaneously. [sent-916, score-0.453]

89 This limitation can be mitigated to some degree by combining the benefits of logistic regression and LRR into a single hybrid algorithm that aims to solve both classification and ranking problems simultaneously. [sent-917, score-0.394]

90 β = 0 reduces the hybrid loss to that of logistic regression, whereas increasing β tilts the balance towards LRR. [sent-919, score-0.172]
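One plausible form of such a hybrid, sketched below, is the logistic (classification) loss plus β times the pairwise LRR (ranking) loss; the paper's exact combination may differ. At β = 0 it reduces to logistic regression, as stated:

```python
import math

# Hypothetical scores under a candidate model f.
pos_scores = [0.8, 0.1]
neg_scores = [0.3, -0.6]

def lr_loss(pos, neg):
    """Plain logistic (classification) loss over all examples."""
    return (sum(math.log(1 + math.exp(-f)) for f in pos)
            + sum(math.log(1 + math.exp(f)) for f in neg))

def lrr_loss(pos, neg):
    """Pairwise logistic (ranking) loss, LRR-style."""
    return sum(math.log(1 + math.exp(-(p_ - n_))) for p_ in pos for n_ in neg)

def hybrid(pos, neg, beta):
    """Sketch of a hybrid objective: classification loss + beta * ranking loss."""
    return lr_loss(pos, neg) + beta * lrr_loss(pos, neg)

assert abs(hybrid(pos_scores, neg_scores, 0.0)
           - lr_loss(pos_scores, neg_scores)) < 1e-12
```

Increasing β tilts the objective toward LRR, matching the trade-off described in the text.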

91 The trade-off between classification and ranking accuracy is shown explicitly in Figure 7, which presents the 0-1 classification loss and 0-1 ranking loss of this hybrid approach at various β settings. [sent-920, score-0.545]

92 the baseline performance of AdaBoost, logistic regression and LRR. [sent-923, score-0.151]

93 For this particular experiment, logistic regression was able to achieve a better misclassification result than AdaBoost (see Figure 7(a)), but at the expense of a very large misranking error (see Figure 7(b)). [sent-924, score-0.211]

94 The value of β should be chosen based on the desired performance criteria for the specific application, determining the balance between desired classification vs ranking accuracy. [sent-926, score-0.216]

95 Conclusion. We showed an equivalence relationship between two algorithms for two different tasks, based on a relationship between the minimizers of their objective functions. [sent-928, score-0.387]

96 This equivalence relationship provides an explanation for why these algorithms perform well with respect to multiple evaluation metrics; it also allows us to compute conditional probability estimates for ranking algorithms, and permits the solution of ranking problems an order of magnitude faster. [sent-929, score-0.625]

97 We showed that our new classification algorithm is related to a performance metric used for ranking, and studied empirically how aspects of our new classification algorithm influence ranking performance. [sent-931, score-0.233]

98 Finally, we presented a new algorithm inspired by logistic regression that solves a task that is somewhere between classification and ranking, with the goal of providing solutions to both problems. [sent-933, score-0.151]

99 The P-Norm Push: A simple convex ranking algorithm that concentrates at the top of the list. [sent-1005, score-0.234]

100 Margin-based ranking and an equivalence between AdaBoost and RankBoost. [sent-1009, score-0.338]


similar papers computed by tfidf model


similar papers list:

simIndex simValue paperId paperTitle


2 0.10148596 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng

Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly ©2011 Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou and Jufu Feng. WANG , S UGIYAMA , J ING , YANG , Z HOU AND F ENG sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier

3 0.066254705 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

4 0.049933124 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel

Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP . The key insight is to relate ranking criteria as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the ’margin’ is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program existing of O (n) constraints and O (n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O (n2 ) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets. 
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis c

5 0.048508342 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

6 0.046001062 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

7 0.045632441 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

8 0.043462895 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

9 0.040776335 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

10 0.036501944 35 jmlr-2011-Forest Density Estimation

11 0.035447229 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

12 0.033812579 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

13 0.030264787 66 jmlr-2011-Multiple Kernel Learning Algorithms

14 0.028873261 55 jmlr-2011-Learning Multi-modal Similarity

15 0.028617267 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

16 0.028042641 28 jmlr-2011-Double Updating Online Learning

17 0.027384905 58 jmlr-2011-Learning from Partial Labels

18 0.02511367 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

19 0.024981763 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

20 0.02411432 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.143), (1, -0.011), (2, -0.019), (3, -0.038), (4, 0.036), (5, -0.079), (6, -0.068), (7, -0.008), (8, -0.074), (9, 0.033), (10, -0.199), (11, -0.006), (12, 0.006), (13, -0.103), (14, -0.02), (15, 0.164), (16, -0.107), (17, -0.045), (18, 0.08), (19, 0.029), (20, 0.082), (21, 0.089), (22, -0.258), (23, 0.08), (24, 0.175), (25, 0.01), (26, 0.142), (27, -0.098), (28, 0.039), (29, -0.128), (30, -0.026), (31, 0.23), (32, 0.313), (33, 0.203), (34, -0.006), (35, 0.036), (36, 0.171), (37, -0.016), (38, -0.091), (39, 0.03), (40, 0.007), (41, -0.027), (42, 0.056), (43, -0.102), (44, -0.007), (45, -0.025), (46, 0.047), (47, 0.15), (48, 0.116), (49, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95394498 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
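The two evaluation metrics the abstract contrasts, misclassification error and the area under the ROC curve, are both functions of the same real-valued scores. A minimal sketch of each (these are the standard definitions, not the paper's boosting algorithms):

```python
def misclassification_error(scores, labels, threshold=0.0):
    """Fraction of examples on the wrong side of the threshold; labels in {-1, +1}."""
    wrong = sum(1 for s, y in zip(scores, labels)
                if (s >= threshold) != (y == 1))
    return wrong / len(labels)

def auc(scores, labels):
    """Area under the ROC curve: the fraction of positive-negative pairs
    ranked correctly, with ties counted half. This is the bipartite
    ranking statistic the abstract refers to."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))
```

For example, the scores [2.0, 0.5, -0.5, -2.0] with labels [1, -1, 1, -1] give a misclassification error of 0.5 at threshold 0 but an AUC of 0.75, showing that the two metrics can disagree, which is why an algorithm performing well under both is notable.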

2 0.60936457 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng

Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments, however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Thus our result suggests that the minimum margin may not be crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
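The margin quantities this abstract contrasts can be computed directly from a voting classifier's weights and base predictions. A minimal sketch, using the standard normalized-margin definition (variable names are illustrative, not from the paper):

```python
def voting_margins(alphas, predictions, labels):
    """Normalized margins y_i * sum_t(alpha_t * h_t(x_i)) / sum_t(alpha_t)
    of a voting classifier. predictions[t][i] is base classifier t's
    +/-1 prediction on example i; labels are in {-1, +1}."""
    z = sum(alphas)
    n = len(labels)
    return [labels[i] * sum(a * h[i] for a, h in zip(alphas, predictions)) / z
            for i in range(n)]

def minimum_margin(alphas, predictions, labels):
    """The single quantity Breiman's bound depends on: the smallest margin."""
    return min(voting_margins(alphas, predictions, labels))
```

With weights [0.6, 0.4] and two base classifiers predicting [1, 1, -1] and [1, -1, -1] on labels [1, 1, -1], the margins are [1.0, 0.2, 1.0] and the minimum margin is 0.2; a measure like the Emargin instead looks at the whole distribution rather than this single smallest value.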

3 0.28571826 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

4 0.2600475 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel

Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high-dimensional data, as well as for estimating non-linear, additive and sparse models based on primal-dual kernel machines, is indicated. Given n observations, the present method solves a quadratic program consisting of O(n) constraints and O(n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O(n²) constraints or unknowns. We specify the MINLIP method for three different cases: the first concerns the preference learning problem. Secondly, it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis, where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context, as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets.
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis

5 0.24310152 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

6 0.20739686 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning

7 0.20225547 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

8 0.19876806 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

9 0.18523282 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

10 0.16481204 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

11 0.1624673 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

12 0.15718478 6 jmlr-2011-A Simpler Approach to Matrix Completion

13 0.15342127 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

14 0.15272275 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

15 0.15117881 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

16 0.14740947 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

17 0.14010656 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

18 0.13792162 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

19 0.13758881 35 jmlr-2011-Forest Density Estimation

20 0.13679083 66 jmlr-2011-Multiple Kernel Learning Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.039), (9, 0.025), (10, 0.038), (24, 0.047), (31, 0.072), (32, 0.045), (41, 0.021), (60, 0.011), (71, 0.012), (73, 0.029), (78, 0.175), (88, 0.376)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.75486481 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

Author: Jim C. Huang, Brendan J. Frey

Abstract: We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the ‘derivative-sum-product’ (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research. Keywords: graphical models, cumulative distribution function, message-passing algorithm, inference

same-paper 2 0.73483741 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression

3 0.47384152 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
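Of the two approaches the abstract compares, output perturbation is the simpler to sketch: noise is added to the non-private ERM minimizer after training. The sketch below assumes the common setup for L2-regularized ERM in which the noise norm scales like 2/(n·λ·ε); treat that scaling and all parameter names as assumptions of this illustration, not the paper's exact construction:

```python
import math
import random

def output_perturbation(weights, n, lam, eps, seed=0):
    """Output perturbation for ERM: add a noise vector b with density
    proportional to exp(-beta * ||b||), beta = n*lam*eps/2, to the
    trained weight vector. The direction of b is uniform on the sphere
    and its norm is Gamma(d, 1/beta) distributed. All scalings here are
    assumptions of this sketch."""
    rng = random.Random(seed)
    d = len(weights)
    beta = n * lam * eps / 2.0
    # sample a uniform direction via a normalized Gaussian vector
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    gn = math.sqrt(sum(v * v for v in g))
    # sample the norm: Gamma with shape d and scale 1/beta
    norm = rng.gammavariate(d, 1.0 / beta)
    return [w + norm * v / gn for w, v in zip(weights, g)]
```

Note the tradeoff the abstract discusses: the noise shrinks as n, λ, or ε grows, so with many samples the private weights stay close to the non-private ones.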

4 0.4694196 87 jmlr-2011-Stochastic Methods for ℓ1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1-regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

5 0.46120009 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of model-free methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, best-match learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse model-based methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

6 0.45982361 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

7 0.44643912 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.44165021 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

9 0.44078568 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

10 0.44061038 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

11 0.43689695 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

12 0.43208879 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

13 0.43086571 91 jmlr-2011-The Sample Complexity of Dictionary Learning

14 0.42711365 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

15 0.42627898 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

16 0.42548645 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

17 0.42345351 104 jmlr-2011-X-Armed Bandits

18 0.42249942 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

19 0.42248088 36 jmlr-2011-Generalized TD Learning

20 0.42140788 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination