jmlr jmlr2010 jmlr2010-21 knowledge-graph by maker-knowledge-mining

21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization


Source: pdf

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. [sent-7, score-0.739]

2 A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. [sent-8, score-0.186]

3 Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. [sent-9, score-0.915]

4 These bounds can be tightened by a generalized margin condition. [sent-10, score-0.153]

5 The impact of the results is illustrated on several commonly used surrogate loss functions. [sent-11, score-0.226]

6 Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option 1. [sent-12, score-0.827]

7 In such situations, a less specific response that reserves the right not to make a decision, sometimes referred to as a reject option (see, e.g., …), may be preferable. [sent-19, score-0.43]

8 But in the event that there are ambiguities, it would be more desirable to take a rejection option and seek more expensive studies to identify a subject’s disease status. [sent-24, score-0.322]

9 Similar approaches are often adopted in DNA sequencing or genotyping applications, where the rejection option is commonly referred to as a “no-call”. [sent-25, score-0.286]

10 To accommodate the reject option, we now seek a classification rule g : X → Ỹ, where Ỹ = {−1, 1, 0} is an augmented response space and g(X) = 0 indicates that no definitive classification will be made for X, that is, that a reject option is taken. [sent-29, score-0.759]

11 To measure the performance of a classification rule, we employ the following loss function, which generalizes the usual 0-1 loss to account for the reject option: ℓ[g(X), Y] = 1 if g(X) ≠ Y and g(X) ≠ 0, d if g(X) = 0, and 0 if g(X) = Y. [sent-30, score-0.546]

12 In other words, an ambiguous response (g(X) = 0) incurs a loss of d, whereas a misclassification incurs a loss of 1. [sent-31, score-0.441]
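As an illustration (not from the paper), here is a minimal Python sketch of this 0–d–1 loss and of the empirical risk R_n used later in this summary; the names reject_loss and empirical_risk and the encoding g(x) ∈ {−1, 0, 1}, y ∈ {−1, 1} are assumptions of the sketch.

```python
import numpy as np

def reject_loss(g_x, y, d):
    """0-d-1 loss: d for a reject (g(x) = 0), 1 for a misclassification, 0 otherwise."""
    if g_x == 0:
        return d
    return 0.0 if g_x == y else 1.0

def empirical_risk(g, X, Y, d):
    """R_n(g) = (1/n) * sum_i loss(g(X_i), Y_i)."""
    return float(np.mean([reject_loss(g(x), y, d) for x, y in zip(X, Y)]))
```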

13 Otherwise, rather than taking a rejection option with a loss d, we can always flip a fair coin to randomly assign ±1 as the value of g(X), which incurs an average loss of 1/2 ≤ d. [sent-33, score-0.62]

14 For this reason, we shall assume that d < 1/2 in what follows. [sent-34, score-0.104]
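A one-line version of the coin-flip argument above (assuming the coin is fair and independent of (X, Y)):

```latex
\mathbb{E}\,\ell[g(X),Y] \;=\; \Pr\{g(X)\neq Y\} \;=\; \tfrac{1}{2} \;\le\; d
\qquad \text{whenever } d \ge \tfrac{1}{2},
```

so rejecting (at cost d) can never strictly beat random guessing when d ≥ 1/2, which motivates the standing assumption d < 1/2.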

15 For any classification rule g : X → Ỹ, the risk function is then given by R(g) = E(ℓ[g(X), Y]), where the expectation is taken over the joint distribution of X and Y. [sent-35, score-0.281]

16 It is not hard to show that the optimal classification rule g∗ := arg min_g R(g) is given by g∗(X) = 1 if η(X) > 1 − d, −1 if η(X) < d, and 0 otherwise, where η(X) = P(Y = 1 | X) (see, e.g., …). [sent-36, score-0.168]

17 The corresponding risk is R∗ := inf R(g) = R(g∗ ) = E (min{η(X), 1 − η(X), d}) . [sent-39, score-0.287]
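A sketch (hypothetical helper names, not code from the paper) of this optimal rule together with a Monte Carlo estimate of R∗ = E(min{η(X), 1 − η(X), d}):

```python
import numpy as np

def bayes_rule(eta, d):
    """g*(x): predict 1 if eta(x) > 1 - d, -1 if eta(x) < d, otherwise reject (0)."""
    if eta > 1.0 - d:
        return 1
    if eta < d:
        return -1
    return 0

def bayes_risk(eta_samples, d):
    """Monte Carlo estimate of R* = E[min(eta(X), 1 - eta(X), d)]."""
    eta = np.asarray(eta_samples, dtype=float)
    return float(np.mean(np.minimum(np.minimum(eta, 1.0 - eta), d)))

# Example: eta(X) ~ Uniform(0, 1) and d = 0.2 give R* = d(1 - 2d) + d^2 = 0.16.
rng = np.random.default_rng(0)
print(bayes_risk(rng.uniform(size=100_000), 0.2))
```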

18 Thus, the performance of any classification rule g : X → Ỹ can be measured by the excess risk ∆R(g) := R(g) − R∗. [sent-40, score-0.499]

19 Appealing to the general empirical risk minimization strategy, one could attempt to derive a classification rule from the training data by minimizing the empirical risk Rn(g) = (1/n) ∑_{i=1}^n ℓ[g(Xi), Yi]. [sent-41, score-0.513]

20 A common remedy is to consider a surrogate convex loss function. [sent-43, score-0.276]

21 Denote by Q( f ) = E[φ(Y f (X))] the corresponding risk for a discriminant function f : X → R. [sent-45, score-0.232]

22 fˆn can be conveniently converted to a classification rule C( fˆn; δ) as follows: C( f(X); δ) = 1 if f(X) > δ, 0 if | f(X)| ≤ δ, and −1 if f(X) < −δ, where δ > 0 is a parameter that, as we shall see, plays a critical role in determining the performance of C( fˆn, δ). [sent-47, score-0.211]
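A direct transcription of the conversion rule C(f; δ) above (a sketch; the vectorized helper is an addition of this summary, not of the paper):

```python
import numpy as np

def convert(f_x, delta):
    """C(f(x); delta): reject when |f(x)| <= delta, otherwise predict the sign of f(x)."""
    if f_x > delta:
        return 1
    if f_x < -delta:
        return -1
    return 0

def convert_all(f_vals, delta):
    """Vectorized version for an array of discriminant values."""
    f = np.asarray(f_vals, dtype=float)
    return np.where(f > delta, 1, np.where(f < -delta, -1, 0))
```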

23 In this paper, we investigate the statistical properties of this general convex risk minimization technique. [sent-48, score-0.282]

24 To what extent C( fˆn, δ) mimics the optimal classification rule g∗ plays a critical role in the success of this technique. [sent-49, score-0.107]

25 We shall assume throughout the paper that f∗φ, the minimizer of Q, is uniquely defined. [sent-51, score-0.104]

26 A surrogate loss function φ that satisfies this property is often called infinite sample consistent (see, e. [sent-54, score-0.287]

27 A second question further concerns the relationship between the excess risk ∆R[C( f, δ)] and the excess φ-risk ∆Q( f ) = Q( f ) − inf_f Q( f ): can we find an increasing function ρ : R → R such that for all f, ∆R[C( f, δ)] ≤ ρ(∆Q( f ))? (1) [sent-59, score-0.925]

28 Clearly the infinite sample consistency of φ implies that ρ(0) = 0. [sent-60, score-0.14]

29 Such a bound on the excess risk provides useful tools in bounding the excess risk of fˆn . [sent-61, score-0.84]

30 In particular, (1) indicates that ∆R[C( fˆn, δ)] ≤ ρ(∆Q( fˆn)) = ρ(∆Q( f̄) + Q( fˆn) − Q( f̄)), where f̄ = arg min_{f∈F} Q( f ). [sent-62, score-0.089]

31 In the case when there is no reject option, these problems have been well investigated in recent years (Lin, 2002; Zhang, 2004; Bartlett, Jordan and McAuliffe, 2006). [sent-64, score-0.22]

32 In this paper, we establish similar results when there is a reject option. [sent-65, score-0.22]

33 The most significant difference between the two situations, with or without the reject option, is the role of δ. [sent-66, score-0.22]

34 As we shall see, for some loss functions such as least squares, exponential or logistic, a good choice of δ yields classifiers that are infinite sample consistent. [sent-67, score-0.316]

35 For other loss functions, however, such as the hinge loss, no matter how δ is chosen, the classification rule C( f , δ) cannot be infinite sample consistent. [sent-68, score-0.483]

36 We first examine in Section 2 the infinite sample consistency for classification with reject option. [sent-70, score-0.327]

37 After establishing a general result, we consider its implications for several commonly used loss functions. [sent-71, score-0.151]

38 In Section 3, we establish bounds on the excess risk in the form of (1), followed by applications to the popular loss functions. [sent-72, score-0.571]

39 We also show that under an additional assumption on the behavior of η(X) near d and 1 − d as in Herbei and Wegkamp (2006), generalizing the condition in the case of d = 1/2 of Mammen and Tsybakov (1999) and Tsybakov (2004), the bound (1) can be tightened considerably. [sent-73, score-0.105]

40 Section 4 discusses rates of convergence of the empirical risk minimizer fn that minimizes the empirical risk Qn ( f ) over a bounded class F . [sent-74, score-0.579]

41 Section 5 considers extension to asymmetric loss where one type of misclassification may be more costly than the other. [sent-75, score-0.181]

42 Infinite Sample Consistency We first give a general result on the infinite sample consistency of the classification rule C( f∗φ, δ). [sent-78, score-0.186]

43 Then the classification rule C( f∗φ, δ) for some δ > 0 is infinite sample consistent, that is, C( f∗φ, δ) = g∗, if and only if φ′(δ) and φ′(−δ) both exist, φ′(δ) < 0, and φ′(δ) / (φ′(δ) + φ′(−δ)) = d. (2) [sent-80, score-0.114]

44 When there is no reject option, it is known that the necessary and sufficient condition for infinite sample consistency is that φ is differentiable at 0 and φ′(0) < 0 (see, e.g., …). [sent-81, score-0.405]
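A numerical illustration of condition (2): for a differentiable surrogate φ, solve φ′(δ)/(φ′(δ) + φ′(−δ)) = d for δ. This is a sketch, not code from the paper; it assumes φ′ is available in closed form, that the ratio is continuous and crosses d on the chosen bracket, and that SciPy's brentq root finder is acceptable.

```python
import numpy as np
from scipy.optimize import brentq

def isc_threshold(phi_prime, d, lo=1e-8, hi=10.0):
    """Solve phi'(delta) / (phi'(delta) + phi'(-delta)) = d for delta in (lo, hi)."""
    ratio = lambda t: phi_prime(t) / (phi_prime(t) + phi_prime(-t)) - d
    return brentq(ratio, lo, hi)

# Exponential loss phi(z) = exp(-z), d = 0.2:
phi_prime = lambda z: -np.exp(-z)
print(isc_threshold(phi_prime, 0.2))   # numeric root
print(0.5 * np.log(1 / 0.2 - 1))       # matches the closed form 0.5 * log(1/d - 1)
```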

45 As indicated by Theorem 1, the differentiability of φ at ±δ plays a more prominent role in the general case when there is a reject option. [sent-84, score-0.283]

46 From Theorem 1 it is also evident that the infinite sample consistency depends on both φ and the choice of thresholding parameter δ. [sent-85, score-0.107]

47 1 Least Squares Loss We first examine the least squares loss φ(z) = (1 − z)². [sent-97, score-0.214]

48 3 Logistic Loss Logistic regression employs the loss φ(z) = ln(1 + exp(−z)). [sent-106, score-0.179]

49 4 Squared Hinge Loss Squared hinge loss, φ(z) = [(1 − z)₊]², is another popular choice, for which φ′(δ) / (φ′(δ) + φ′(−δ)) = (1 − δ)/2. [sent-110, score-0.218]

50 Corollary 6 For the squared hinge loss, C( f∗φ, 1 − 2d) = g∗. [sent-112, score-0.249]
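Corollary 6 can be checked directly against condition (2): for the squared hinge loss and 0 < δ < 1,

```latex
\varphi'(\delta) = -2(1-\delta), \qquad \varphi'(-\delta) = -2(1+\delta), \qquad
\frac{\varphi'(\delta)}{\varphi'(\delta)+\varphi'(-\delta)} = \frac{1-\delta}{2} = d
\;\Longleftrightarrow\; \delta = 1-2d .
```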

51 5 Distance Weighted Discrimination Marron, Todd and Ahn (2007) recently introduced the so-called distance weighted discrimination method, in which the following loss function is used (see, e.g., …). [sent-114, score-0.281]

52 In other words, we have the following result for the distance weighted discrimination loss. [sent-120, score-0.13]

53 Corollary 7 For the loss (3), C( f∗φ, [(1 − d)/d]^{1/2} γ) = g∗. [sent-121, score-0.151]

54 6 Hinge Loss The popular support vector machine employs the hinge loss, φ(z) = (1 − z)+ . [sent-123, score-0.246]

55 Motivated by this observation, Bartlett and Wegkamp (2008) introduce the following modification of the hinge loss: φ(z) = 1 − az if z ≤ 0, 1 − z if 0 < z ≤ 1, and 0 if z > 1, (4) where a > 1. [sent-128, score-0.218]

56 Corollary 8 For the modified hinge loss (4) and any δ < 1, if a = (1 − d)/d, then C( f∗φ, δ) = g∗. [sent-131, score-0.369]
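The threshold choices implied by the corollaries in this summary, collected in one helper. This is a sketch; the exponential and logistic entries are the values of δ solving condition (2) for those losses, as they also appear in the excess-risk corollaries quoted below.

```python
import numpy as np

def isc_delta(loss, d, gamma=1.0):
    """Threshold delta making C(f_phi*, delta) infinite sample consistent, 0 < d < 1/2."""
    if loss in ("least_squares", "squared_hinge"):
        return 1.0 - 2.0 * d
    if loss == "exponential":
        return 0.5 * np.log(1.0 / d - 1.0)
    if loss == "logistic":
        return np.log(1.0 / d - 1.0)
    if loss == "dwd":                      # distance weighted discrimination
        return np.sqrt((1.0 - d) / d) * gamma
    if loss == "modified_hinge":           # with a = (1 - d)/d, any delta < 1 works;
        return 0.5                         # delta = 1/2 gives the best constant (Corollary 17)
    raise ValueError(f"unknown loss: {loss}")
```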

57 For the modified hinge loss, by contrast, a range of choices of δ can serve the same purpose. [sent-133, score-0.218]

58 However, as we shall see in the next section, different choices of δ for the modified hinge loss may result in slightly different bounds on the excess risk, with δ = 1/2 appearing preferable in that it yields the smallest upper bound on the excess risk. [sent-134, score-1.139]

59 Excess Risk We now turn to the excess risk ∆R[C( f, δ)] and show how it can be bounded through the excess φ-risk ∆Q( f ) := Q( f ) − Q( f∗φ). [sent-136, score-0.84]

60 Recall that the infinite sample consistency established in the previous section means that ∆Q( f ) = 0 implies ∆R(C( f, δ)) = 0. [sent-137, score-0.14]

61 For brevity, we shall assume implicitly that δ is chosen in accordance with Theorem 1 to ensure infinite sample consistency. [sent-138, score-0.139]

62 In other words, consistency in terms of the φ-risk implies consistency in terms of the loss ℓ. [sent-146, score-0.53]

63 We can improve the bounds even further by the following margin condition. [sent-150, score-0.097]

64 This assumption was introduced in Herbei and Wegkamp (2006) and generalizes the margin condition of Mammen and Tsybakov (1999) and Tsybakov (2004). [sent-152, score-0.17]

65 We now examine the consequences of Theorems 9, 10 and 11 on several common loss functions. [sent-159, score-0.175]

66 1 Least Squares Note that for the least squares loss, ∆Qη( f ) = (2η − 1 − f )². [sent-161, score-0.214]
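Completing the square makes the identity above immediate:

```latex
Q_\eta(f) = \eta(1-f)^2 + (1-\eta)(1+f)^2
          = \bigl(f-(2\eta-1)\bigr)^2 + 1-(2\eta-1)^2 ,
\qquad
\Delta Q_\eta(f) = Q_\eta(f) - \min_z Q_\eta(z) = \bigl(2\eta-1-f\bigr)^2 .
```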

67 Furthermore, if the margin condition (6) holds, then ∆R[C( f, 1 − 2d)] ≤ K [∆Q( f )]^{(1+α)/(2+α)} for some constant K > 0. [sent-164, score-0.146]

68 Furthermore, if the margin condition (6) holds, then ∆R[C( f, (1/2) log(1/d − 1))] ≤ K [∆Q( f )]^{(1+α)/(2+α)} for some constant K > 0. [sent-171, score-0.173]

69 Furthermore, if the margin condition (6) holds, then ∆R[C( f, log(1/d − 1))] ≤ K [∆Q( f )]^{(1+α)/(2+α)} for some constant K > 0. [sent-176, score-0.173]

70 By Theorems 9 and 11, Corollary 15 For the squared hinge loss, ∆R[C( f, 1 − 2d)] ≤ [∆Q( f )]^{1/2}. [sent-180, score-0.249]

71 Furthermore, if the margin condition (6) holds, then ∆R[C( f, 1 − 2d)] ≤ K [∆Q( f )]^{(1+α)/(2+α)} for some constant K > 0. [sent-181, score-0.146]

72 From Theorems 9 and 11, we conclude that Corollary 16 For the distance weighted discrimination loss, ∆R[C( f, ((1 − d)/d)^{1/2} γ)] ≤ γ^{1/2} (1 − d)^{1/4} d^{1/4} [∆Q( f )]^{1/2}. [sent-190, score-0.13]

73 Furthermore, if the margin condition (6) holds, then ∆R[C( f, ((1 − d)/d)^{1/2} γ)] ≤ K [∆Q( f )]^{(1+α)/(2+α)} for some constant K > 0. [sent-191, score-0.146]

74 6 Hinge Loss with Rejection Option As shown by Bartlett and Wegkamp (2008), for the modified hinge loss (4), arg min_z Qη(z) = −1 if η ≤ d, 0 if d < η < 1 − d, and 1 if η > 1 − d. [sent-193, score-0.369]

75 Simple algebraic manipulations lead to ∆Qη(−δ) = (1 − δ)(d − η)/d if η ≤ d, (η − d)δ/d if d < η < 1 − d, and 1 − (1 − η)/d + (η − d)δ/d if η > 1 − d; and therefore ∆Qη(δ) = 1 − η/d + (1 − η − d)δ/d if η ≤ d, (1 − η − d)δ/d if d < η < 1 − d, and (1 − δ)(η − 1 + d)/d if η > 1 − d. [sent-194, score-0.155]

76 From Theorems 10 and 11, we conclude that Corollary 17 For the modified hinge loss and any δ < 1, ∆R[C( f, δ)] ≤ (d / min{δ, 1 − δ}) ∆Q( f ). (8) [sent-197, score-0.369]

77 Furthermore, if the margin condition (6) holds, then ∆R[C( f, δ)] ≤ K ∆Q( f ) (9) for some constant K > 0. [sent-198, score-0.146]
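The case formulas above can be sanity-checked numerically; the following sketch (not from the paper) compares Qη(±δ) − min_z Qη(z), computed on a grid, with the closed forms, for the modified hinge loss (4) with a = (1 − d)/d:

```python
import numpy as np

def phi_mod(z, a):
    """Modified hinge loss (4): 1 - a*z if z <= 0, 1 - z if 0 < z <= 1, 0 if z > 1."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1 - a * z, np.where(z <= 1, 1 - z, 0.0))

def Q(z, eta, a):
    """Conditional phi-risk Q_eta(z) = eta*phi(z) + (1 - eta)*phi(-z)."""
    return eta * phi_mod(z, a) + (1 - eta) * phi_mod(-z, a)

d, delta = 0.2, 0.3
a = (1 - d) / d
grid = np.linspace(-2.0, 2.0, 40001)
for eta in (0.1, 0.5, 0.9):
    q_min = Q(grid, eta, a).min()
    dq_m, dq_p = Q(-delta, eta, a) - q_min, Q(delta, eta, a) - q_min
    # Closed forms from the display above (eta <= d, d < eta < 1-d, eta > 1-d).
    if eta <= d:
        ref_m, ref_p = (1 - delta) * (d - eta) / d, 1 - eta / d + (1 - eta - d) * delta / d
    elif eta < 1 - d:
        ref_m, ref_p = (eta - d) * delta / d, (1 - eta - d) * delta / d
    else:
        ref_m, ref_p = 1 - (1 - eta) / d + (eta - d) * delta / d, (1 - delta) * (eta - 1 + d) / d
    print(eta, np.isclose(dq_m, ref_m, atol=1e-3), np.isclose(dq_p, ref_p, atol=1e-3))
```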

78 Notice that the corollary also suggests that δ = 1/2 yields the best constant 2d in the upper bound. [sent-199, score-0.093]

79 It is also interesting to see that (8) cannot be further improved by the generalized margin condition (6) as the bounds (8) and (9) only differ by a constant. [sent-201, score-0.146]

80 The analysis of the generalized hinge loss is complicated and is treated in detail in Wegkamp (2007) and Bartlett and Wegkamp (2008). [sent-204, score-0.369]

81 The other loss functions φ considered in this paper have in common that the modulus of convexity of Q, δ(ε) = inf { (Q( f ) + Q(g))/2 − Q(( f + g)/2) : E[( f − g)²(X)] ≥ ε² }, satisfies δ(ε) ≥ cε² for some c > 0, and that, for some L < ∞, |φ(x) − φ(x′)| ≤ L|x − x′| for all x, x′ ∈ R. [sent-205, score-0.267]

82 Furthermore, if the generalized margin condition (6) holds, then with probability at least 1 − γ, ∆R(C( fˆn, δ)) ≤ K [ inf_{f∈F} ∆Q( f ) + 3L/n + (L²/(2c) + 8LB/3) log(Nn/γ)/n ]^{1/(s+β−βs)} for some constant K > 0. [sent-209, score-0.231]

83 In many applications, however, one type of misclassification may incur a heavier loss than the other. [sent-217, score-0.151]

84 Such situations naturally arise in risk management or medical diagnosis. [sent-218, score-0.226]

85 To this end, the following loss function can be adopted in place of ℓ: ℓθ[g(X), Y] = 1 if g(X) = −1 and Y = 1, θ if g(X) = 1 and Y = −1, d if g(X) = 0, and 0 if g(X) = Y. [sent-219, score-0.151]

86 We shall assume that θ < 1 without loss of generality. [sent-220, score-0.255]
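A sketch of the asymmetric loss ℓθ above (the function name and the ±1 label encoding are assumptions of this summary, not of the paper):

```python
def asymmetric_reject_loss(g_x, y, d, theta):
    """ell_theta: 1 for a false negative, theta for a false positive, d for a reject."""
    if g_x == 0:
        return d
    if g_x == y:
        return 0.0
    return 1.0 if (g_x == -1 and y == 1) else theta

# Per the text above, the reject option is only useful when d < theta / (1 + theta);
# e.g. theta = 0.5 requires d < 1/3.
```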

87 It can be shown that the rejection option is only available if d < θ/(1 + θ) (see, e.g., Herbei and Wegkamp, 2006). [sent-221, score-0.262]

88 We shall assume this condition throughout the section. [sent-223, score-0.104]

89 For brevity, we shall abbreviate the dependence of η and fφ on X in the remainder of the proof when no confusion occurs. [sent-241, score-0.203]

90 From the previous discussion, f∗φ = arg min_z Qη(z) = − arg min_z Q1−η(z) > δ. [sent-255, score-0.178]

91 The infinite sample consistency implies that for any η > 1 − d, f∗φ > δ. [sent-266, score-0.14]

92 Hence Qη( f ) − Qη(−δ) ≥ [ηa− − (1 − η)b+]( f + δ) > 0, which implies that arg min_z Qη(z) ≥ −δ. [sent-279, score-0.122]

93 Following a similar argument as before, one can show that Q1−η ( f ) − Q1−η (δ) > 0 for any f > δ, which implies that arg min Qη (z) ≤ δ. [sent-284, score-0.122]

94 This again contradicts infinite sample consistency because 1 − η < 1 − d. [sent-285, score-0.132]

95 Also write ∆Qη ( f ) = Qη ( f ) − inf Qη ( f ) and ∆Rη ( f ) = Rη ( f ) − inf Rη ( f ). [sent-291, score-0.17]

96 For brevity, we shall abbreviate their dependence on X in what follows. [sent-295, score-0.147]

97 Let g = C( f, δ) be the classification rule with reject option based on f : X → R, and set g∗ = C( f∗φ, δ). [sent-333, score-0.482]

98 Since fn minimizes Qn( f ), we have Q( fn) − Q( f̄) = P h(Y fn(X)) = 2Pn h(Y fn(X)) + (P − 2Pn) h(Y fn(X)) ≤ 2Pn h(Y f̄(X)) + (P − 2Pn) h(Y fn(X)) ≤ sup_{f∈F} (P − 2Pn) h(Y f(X)), where P h(Y f(X)) = E[h(Y f(X))] and Pn h(Y f(X)) = (1/n) ∑_{i=1}^n h(Yi f(Xi)) for any f ∈ F. [sent-342, score-0.684]

99 Classification with a reject option using a hinge loss. [sent-361, score-0.621]

100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-409, score-0.324]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wegkamp', 0.389), ('egkamp', 0.28), ('ejection', 0.252), ('ption', 0.252), ('reject', 0.22), ('hinge', 0.218), ('excess', 0.218), ('risk', 0.202), ('uan', 0.199), ('herbei', 0.192), ('option', 0.183), ('mcauliffe', 0.168), ('loss', 0.151), ('bartlett', 0.128), ('lassification', 0.121), ('fn', 0.114), ('shall', 0.104), ('margin', 0.097), ('corollary', 0.093), ('inf', 0.085), ('mammen', 0.084), ('theorem', 0.08), ('ph', 0.08), ('rejection', 0.079), ('rule', 0.079), ('theorems', 0.079), ('discrimination', 0.075), ('surrogate', 0.075), ('consistency', 0.072), ('marten', 0.072), ('ming', 0.072), ('nn', 0.066), ('tsybakov', 0.063), ('squares', 0.063), ('classi', 0.062), ('minimizer', 0.061), ('qn', 0.061), ('misclassi', 0.058), ('marron', 0.056), ('tightened', 0.056), ('incurs', 0.056), ('cs', 0.055), ('met', 0.054), ('lb', 0.053), ('convex', 0.05), ('condition', 0.049), ('min', 0.047), ('jordan', 0.046), ('yuan', 0.045), ('abbreviate', 0.043), ('todd', 0.043), ('arg', 0.042), ('observe', 0.039), ('logistic', 0.038), ('brevity', 0.036), ('differentiability', 0.035), ('minimizers', 0.035), ('manipulations', 0.035), ('sample', 0.035), ('implies', 0.033), ('proof', 0.032), ('convexity', 0.031), ('algebraic', 0.031), ('squared', 0.031), ('exp', 0.031), ('discriminant', 0.03), ('minimization', 0.03), ('disease', 0.03), ('seek', 0.03), ('asymmetric', 0.03), ('risks', 0.03), ('distance', 0.03), ('differentiable', 0.029), ('cation', 0.029), ('furthermore', 0.028), ('plays', 0.028), ('exponent', 0.028), ('employs', 0.028), ('preferable', 0.028), ('ces', 0.027), ('modi', 0.027), ('log', 0.027), ('response', 0.027), ('exponential', 0.026), ('consistent', 0.026), ('contradicts', 0.025), ('weighted', 0.025), ('recall', 0.025), ('ce', 0.024), ('medical', 0.024), ('consequences', 0.024), ('generalizes', 0.024), ('therefore', 0.024), ('nite', 0.024), ('sequencing', 0.024), ('isye', 0.024), ('milton', 0.024), ('myuan', 0.024), ('reminder', 0.024), ('ripley', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

2 0.12783651 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

Author: Vladimir Koltchinskii

Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient

3 0.12044732 25 jmlr-2010-Composite Binary Losses

Author: Mark D. Reid, Robert C. Williamson

Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise

4 0.11510136 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off

5 0.10164212 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

6 0.10137545 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

7 0.068081677 22 jmlr-2010-Classification Using Geometric Level Sets

8 0.065333247 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.059251495 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

10 0.058934279 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

11 0.058211766 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

12 0.051804475 27 jmlr-2010-Consistent Nonparametric Tests of Independence

13 0.051762886 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

14 0.050314654 60 jmlr-2010-Learnability, Stability and Uniform Convergence

15 0.049448077 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

16 0.048903514 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

17 0.04862085 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

18 0.048610032 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

19 0.048262417 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

20 0.04736349 84 jmlr-2010-On Spectral Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.195), (1, -0.112), (2, 0.136), (3, 0.055), (4, -0.061), (5, 0.044), (6, 0.056), (7, 0.21), (8, -0.074), (9, -0.015), (10, 0.031), (11, -0.222), (12, -0.115), (13, -0.077), (14, -0.157), (15, -0.047), (16, -0.18), (17, -0.125), (18, -0.167), (19, 0.039), (20, -0.119), (21, -0.108), (22, -0.116), (23, 0.062), (24, 0.039), (25, -0.093), (26, 0.099), (27, -0.047), (28, 0.082), (29, 0.178), (30, 0.095), (31, 0.074), (32, 0.106), (33, -0.024), (34, -0.072), (35, -0.041), (36, 0.011), (37, -0.004), (38, -0.026), (39, -0.063), (40, 0.019), (41, -0.004), (42, 0.004), (43, 0.047), (44, 0.101), (45, 0.006), (46, -0.006), (47, -0.116), (48, -0.003), (49, 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95812577 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

2 0.7019434 25 jmlr-2010-Composite Binary Losses

Author: Mark D. Reid, Robert C. Williamson

Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise

3 0.61778182 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off

4 0.61029142 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

Author: Vladimir Koltchinskii

Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient

5 0.52583724 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

6 0.41289207 22 jmlr-2010-Classification Using Geometric Level Sets

7 0.38292027 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

8 0.34194478 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

9 0.31895131 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

10 0.31131509 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

11 0.2980417 77 jmlr-2010-Model-based Boosting 2.0

12 0.29632083 60 jmlr-2010-Learnability, Stability and Uniform Convergence

13 0.29129481 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

14 0.2666117 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

15 0.26505283 102 jmlr-2010-Semi-Supervised Novelty Detection

16 0.25280431 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

17 0.2519148 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

18 0.2507658 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

19 0.24974081 84 jmlr-2010-On Spectral Learning

20 0.24901235 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.507), (3, 0.026), (8, 0.023), (21, 0.018), (32, 0.098), (36, 0.024), (37, 0.046), (75, 0.111), (85, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79016459 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

2 0.5619064 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu

Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error bound, margin-based theory, pattern classification, tree decomposition, support vector machine, VC theory

3 0.36528713 60 jmlr-2010-Learnability, Stability and Uniform Convergence

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization

4 0.35334286 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.34811634 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory

6 0.34106129 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

7 0.33567804 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

8 0.33230138 102 jmlr-2010-Semi-Supervised Novelty Detection

9 0.32962143 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

10 0.32720968 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

11 0.32166868 25 jmlr-2010-Composite Binary Losses

12 0.32133418 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

13 0.32072926 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

14 0.31794515 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

15 0.31357482 84 jmlr-2010-On Spectral Learning

16 0.3133637 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

17 0.31286702 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

18 0.31116366 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

19 0.31075975 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

20 0.30889323 109 jmlr-2010-Stochastic Composite Likelihood