jmlr jmlr2010 jmlr2010-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
Reference: text
sentIndex sentText sentNum sentScore
1 Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. [sent-7, score-0.739]
2 A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. [sent-8, score-0.186]
3 Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. [sent-9, score-0.915]
4 These bounds can be tightened by a generalized margin condition. [sent-10, score-0.153]
5 The impact of the results is illustrated on several commonly used surrogate loss functions. [sent-11, score-0.226]
6 Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option 1. [sent-12, score-0.827]
7 In such situations, a less specific response that reserves the right of not making a decision, sometimes referred to as a reject option (see, e. [sent-19, score-0.43]
8 But in the event that there are ambiguities, it would be more desirable to take a rejection option and seek more expensive studies to identify a subject’s disease status. [sent-24, score-0.322]
9 Similar approaches are often adopted in DNA sequencing or genotyping applications, where the rejection option is commonly referred to as a “no-call”. [sent-25, score-0.286]
10 To accommodate the reject option, we now seek a classification rule g : X → Ỹ, where Ỹ = {−1, 1, 0} is an augmented response space and g(X) = 0 indicates that no definitive classification will be made for X, that is, a reject option is taken. [sent-29, score-0.759]
11 To measure the performance of a classification rule, we employ the following loss function that generalizes the usual 0-1 loss to account for the reject option: ℓ[g(X),Y] = 1 if g(X) ≠ Y and g(X) ≠ 0; d if g(X) = 0; 0 if g(X) = Y. [sent-30, score-0.546]
12 In other words, an ambiguous response (g(X) = 0) incurs a loss of d whereas a misclassification incurs a loss of 1. [sent-31, score-0.441]
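As an illustration (not from the paper), the loss above can be written in a few lines of Python; the {−1, 1} encoding for the classes and 0 for the reject decision follows the definition of Ỹ, and the function name is ours.

```python
def reject_option_loss(g_x: int, y: int, d: float) -> float:
    """Loss l[g(X), Y] with reject cost d (assumed to satisfy 0 < d < 1/2 below).

    g_x: output of the rule, in {-1, 0, 1}; 0 means "reject" / "no-call".
    y:   true label, in {-1, 1}.
    """
    if g_x == 0:      # reject: pay d
        return d
    if g_x != y:      # misclassification: pay 1
        return 1.0
    return 0.0        # correct classification: pay nothing
```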
13 Otherwise, that is, if d ≥ 1/2, rather than taking the rejection option at a loss of d, we could always flip a fair coin to randomly assign ±1 as the value of g(X), which incurs an average loss of only 1/2 ≤ d. [sent-33, score-0.62]
14 For this reason, we shall assume that d < 1/2 in what follows. [sent-34, score-0.104]
15 For any classification rule g : X → Ỹ, the risk function is then given by R(g) = E(ℓ[g(X),Y]), where the expectation is taken over the joint distribution of X and Y. [sent-35, score-0.281]
16 It is not hard to show that the optimal classification rule g∗ := arg min R(g) is given by (see, e. [sent-36, score-0.168]
17 The corresponding risk is R∗ := inf R(g) = R(g∗ ) = E (min{η(X), 1 − η(X), d}) . [sent-39, score-0.287]
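The explicit formula for g∗ is not reproduced in this extract (the truncated citation above points to Herbei and Wegkamp, 2006): it predicts the more likely label when η(X) is far enough from 1/2 and rejects when min{η(X), 1 − η(X)} > d. A small sketch of g∗ and of a Monte Carlo approximation of R∗, assuming the regression function η(X) = P(Y = 1 | X) is available on a sample:

```python
import numpy as np

def bayes_rule(eta: np.ndarray, d: float) -> np.ndarray:
    """g*(x): 1 if eta(x) > 1 - d, -1 if eta(x) < d, 0 (reject) otherwise."""
    g = np.zeros_like(eta)
    g[eta > 1.0 - d] = 1
    g[eta < d] = -1
    return g

def bayes_risk(eta: np.ndarray, d: float) -> float:
    """Approximation of R* = E[min{eta(X), 1 - eta(X), d}] by a sample average."""
    return float(np.mean(np.minimum(np.minimum(eta, 1.0 - eta), d)))
```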
18 Thus, the performance of any classification rule g : X → Ỹ can be measured by the excess risk ∆R(g) := R(g) − R∗. [sent-40, score-0.499]
19 Appealing to the general empirical risk minimization strategy, one could attempt to derive a classification rule from the training data by minimizing the empirical risk Rn(g) = (1/n) ∑_{i=1}^n ℓ[g(Xi),Yi]. [sent-41, score-0.513]
20 A common remedy is to consider a surrogate convex loss function. [sent-43, score-0.276]
21 Denote by Q( f ) = E[φ(Y f (X))] the corresponding risk for a discriminant function f : X → R. [sent-45, score-0.232]
22 fˆn can be conveniently converted to a classification rule C(fˆn; δ) as follows: C(f(X); δ) = 1 if f(X) > δ; 0 if |f(X)| ≤ δ; −1 if f(X) < −δ, where δ > 0 is a parameter that, as we shall see, plays a critical role in determining the performance of C(fˆn, δ). [sent-47, score-0.211]
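The thresholding step itself is a one-liner; a vectorized sketch (names ours):

```python
import numpy as np

def plug_in_rule(f_x: np.ndarray, delta: float) -> np.ndarray:
    """C(f(X); delta): 1 if f(X) > delta, -1 if f(X) < -delta, 0 (reject) if |f(X)| <= delta."""
    return np.where(f_x > delta, 1, np.where(f_x < -delta, -1, 0))
```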
23 In this paper, we investigate the statistical properties of this general convex risk minimization technique. [sent-48, score-0.282]
24 To what extent C(fˆn, δ) mimics the optimal classification rule g∗ plays a critical role in the success of this technique. [sent-49, score-0.107]
25 We shall assume throughout the paper that fφ∗ is uniquely defined. [sent-51, score-0.104]
26 A surrogate loss function φ that satisfies this property is often called infinite sample consistent (see, e. [sent-54, score-0.287]
27 A second question further concerns the relationship between the excess risk ∆R[C(f, δ)] and the excess φ-risk ∆Q(f) = Q(f) − inf Q(f): can we find an increasing function ρ : R → R such that, for all f, ∆R[C(f, δ)] ≤ ρ(∆Q(f))? (1) [sent-59, score-0.925]
28 Clearly the infinite sample consistency of φ implies that ρ(0) = 0. [sent-60, score-0.14]
29 Such a bound on the excess risk provides useful tools in bounding the excess risk of fˆn . [sent-61, score-0.84]
30 In particular, (1) indicates that ∆R[C(fˆn, δ)] ≤ ρ(∆Q(fˆn)) = ρ(∆Q(f¯) + Q(fˆn) − Q(f¯)), where f¯ = arg min_{f∈F} Q(f). [sent-62, score-0.089]
31 In the case when there is no reject option, these problems have been well investigated in recent years (Lin, 2002; Zhang, 2004; Bartlett, Jordan and McAuliffe, 2006). [sent-64, score-0.22]
32 In this paper, we establish similar results when there is a reject option. [sent-65, score-0.22]
33 The most significant difference between the two situations, with or without the reject option, is the role of δ. [sent-66, score-0.22]
34 As we shall see, for some loss functions such as least squares, exponential or logistic, a good choice of δ yields classifiers that are infinite sample consistent. [sent-67, score-0.316]
35 For other loss functions, however, such as the hinge loss, no matter how δ is chosen, the classification rule C( f , δ) cannot be infinite sample consistent. [sent-68, score-0.483]
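As a concrete and purely illustrative end-to-end sketch of the pipeline just described (not the authors' implementation): fit a linear score by gradient descent on the empirical logistic φ-risk, then threshold it at δ = log(1/d − 1), the infinite sample consistent choice for the logistic loss derived in Section 2.

```python
import numpy as np

def train_reject_classifier(X: np.ndarray, y: np.ndarray, d: float,
                            lr: float = 0.1, n_iter: int = 2000) -> np.ndarray:
    """Minimize Qn(f) = (1/n) sum_i log(1 + exp(-y_i f(x_i))) over linear scores
    f(x) = x @ w + b by gradient descent, then return the reject-option rule C(f; delta)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = np.clip(y * (X @ w + b), -30, 30)
        coef = -y / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-y f)) w.r.t. f
        w -= lr * (X.T @ coef) / n
        b -= lr * coef.mean()
    delta = np.log(1.0 / d - 1.0)             # consistent threshold for the logistic loss
    scores = X @ w + b
    return np.where(scores > delta, 1, np.where(scores < -delta, -1, 0))
```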
36 We first examine in Section 2 the infinite sample consistency for classification with reject option. [sent-70, score-0.327]
37 After establishing a general result, we consider its implication on several commonly used loss functions. [sent-71, score-0.151]
38 In Section 3, we establish bounds on the excess risk in the form of (1), followed by applications to the popular loss functions. [sent-72, score-0.571]
39 We also show that under an additional assumption on the behavior of η(X) near d and 1 − d as in Herbei and Wegkamp (2006), generalizing the condition in the case of d = 1/2 of Mammen and Tsybakov (1999) and Tsybakov (2004), the bound (1) can be tightened considerably. [sent-73, score-0.105]
40 Section 4 discusses rates of convergence of the empirical risk minimizer fn that minimizes the empirical risk Qn ( f ) over a bounded class F . [sent-74, score-0.579]
41 Section 5 considers extension to asymmetric loss where one type of misclassification may be more costly than the other. [sent-75, score-0.181]
42 Infinite Sample Consistency We first give a general result on the infinite sample consistency of the classification rule C(fφ∗, δ). [sent-78, score-0.186]
43 Then the classification rule C(fφ∗, δ) for some δ > 0 is infinite sample consistent, that is, C(fφ∗, δ) = g∗, if and only if φ′(δ) and φ′(−δ) both exist, φ′(δ) < 0, and φ′(δ)/(φ′(δ) + φ′(−δ)) = d. (2) [sent-80, score-0.114]
44 When there is no reject option, it is known that the necessary and sufficient condition for infinite sample consistency is that φ is differentiable at 0 and φ′(0) < 0 (see, e. [sent-81, score-0.405]
45 As indicated by Theorem 1, the differentiability of φ at ±δ plays a more prominent role in the general case when there is a reject option. [sent-84, score-0.283]
46 From Theorem 1 it is also evident that the infinite sample consistency depends on both φ and the choice of thresholding parameter δ. [sent-85, score-0.107]
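Condition (2) can also be checked or solved numerically for a given φ and d. The following sketch (ours, not the authors') finds δ by bisection and compares it with the closed-form choices reported below (δ = 1 − 2d for the least squares and squared hinge losses, δ = log(1/d − 1) for the logistic loss):

```python
import numpy as np

def consistent_delta(phi_prime, d: float, lo: float = 1e-9, hi: float = 1.0 - 1e-9,
                     tol: float = 1e-10) -> float:
    """Bisection solve of phi'(delta) / (phi'(delta) + phi'(-delta)) = d on (lo, hi).
    Assumes phi is differentiable on the bracket and the ratio crosses d exactly once."""
    ratio = lambda t: phi_prime(t) / (phi_prime(t) + phi_prime(-t)) - d
    a, b = lo, hi
    fa = ratio(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * ratio(m) <= 0:
            b = m
        else:
            a, fa = m, ratio(m)
    return 0.5 * (a + b)

d = 0.2
print(consistent_delta(lambda z: -2.0 * (1.0 - z), d), 1 - 2 * d)                           # least squares: both ~0.6
print(consistent_delta(lambda z: -1.0 / (1.0 + np.exp(z)), d, hi=20.0), np.log(1 / d - 1))  # logistic: both ~1.386
```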
47 1 Least Squares Loss We first examine the least squares loss φ(z) = (1 − z)². [sent-97, score-0.214]
48 3 Logistic Loss Logistic regression employs the loss φ(z) = ln(1 + exp(−z)). [sent-106, score-0.179]
49 4 Squared Hinge Loss The squared hinge loss, φ(z) = [(1 − z)+]², is another popular choice, for which φ′(δ)/(φ′(δ) + φ′(−δ)) = (1 − δ)/2. [sent-110, score-0.218]
50 Corollary 6 For the squared hinge loss, C(fφ∗, 1 − 2d) = g∗. [sent-112, score-0.249]
51 5 Distance Weighted Discrimination Marron, Todd and Ahn (2007) recently introduced the so-called distance weighted discrimination method where the following loss function (see, e. [sent-114, score-0.281]
52 In other words (for the case δ > γ), we have the following result for the distance weighted discrimination loss. [sent-120, score-0.13]
53 Corollary 7 For the loss (3), C(fφ∗, [(1 − d)/d]^{1/2} γ) = g∗. [sent-121, score-0.151]
54 6 Hinge Loss The popular support vector machine employs the hinge loss, φ(z) = (1 − z)+ . [sent-123, score-0.246]
55 Motivated by this observation, Bartlett and Wegkamp (2008) introduce the following modification of the hinge loss: φ(z) = 1 − az if z ≤ 0; 1 − z if 0 < z ≤ 1; 0 if z > 1, (4) where a > 1. [sent-128, score-0.218]
56 Corollary 8 For the modified hinge loss (4) and any δ < 1, if a = (1 − d)/d, then C(fφ∗, δ) = g∗. [sent-131, score-0.369]
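A sketch of the modified hinge loss (4) with the slope choice a = (1 − d)/d of Corollary 8 (illustrative code):

```python
import numpy as np

def modified_hinge(z: np.ndarray, d: float) -> np.ndarray:
    """phi(z) = 1 - a z for z <= 0, 1 - z for 0 < z <= 1, 0 for z > 1, with a = (1 - d)/d."""
    a = (1.0 - d) / d
    return np.where(z <= 0, 1.0 - a * z, np.where(z <= 1.0, 1.0 - z, 0.0))
```

With this choice of a, any threshold δ < 1 gives an infinite sample consistent rule C(f, δ); as discussed below, δ = 1/2 yields the smallest constant in the resulting excess-risk bound.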
57 For the modified hinge loss, by contrast, a whole range of choices of δ serves the same purpose. [sent-133, score-0.218]
58 However, as we shall see in the next section, different choices of δ for the modified hinge loss may result in slightly different bounds on the excess risk, with δ = 1/2 being preferable in that it yields the smallest upper bound on the excess risk. [sent-134, score-1.139]
59 Excess Risk We now turn to the excess risk ∆R[C(f, δ)] and show how it can be bounded through the excess φ-risk ∆Q(f) := Q(f) − Q(fφ∗). [sent-136, score-0.84]
60 Recall that the infinite sample consistency established in the previous section means that, throughout this section, ∆Q(f) = 0 implies ∆R(C(f, δ)) = 0. [sent-137, score-0.14]
61 For brevity, we shall assume implicitly that δ is chosen in accordance with Theorem 1 to ensure infinite sample consistency. [sent-138, score-0.139]
62 In other words, consistency in terms of the φ-risk implies consistency in terms of the loss ℓ. [sent-146, score-0.53]
63 We can improve the bounds even further by the following margin condition. [sent-150, score-0.097]
64 This assumption was introduced in Herbei and Wegkamp (2006) and generalizes the margin condition of Mammen and Tsybakov (1999) and Tsybakov (2004). [sent-152, score-0.17]
65 We now examine the consequences of Theorems 9, 10 and 11 on several common loss functions. [sent-159, score-0.175]
66 1 Least Squares Note that for the least squares loss, ∆Qη(f) = (2η − 1 − f)². [sent-161, score-0.214]
67 Furthermore, if the margin condition (6) holds, then ∆R[C(f, 1 − 2d)] ≤ K [∆Q(f)]^{(1+α)/(2+α)} for some constant K > 0. [sent-164, score-0.146]
68 Furthermore, if the margin condition (6) holds, then ∆R[C(f, (1/2) log(1/d − 1))] ≤ K [∆Q(f)]^{(1+α)/(2+α)} for some constant K > 0. [sent-171, score-0.173]
69 Furthermore, if the margin condition (6) holds, then ∆R[C(f, log(1/d − 1))] ≤ K [∆Q(f)]^{(1+α)/(2+α)} for some constant K > 0. [sent-176, score-0.173]
70 By Theorems 9 and 11, Corollary 15 For the squared hinge loss, ∆R[C(f, 1 − 2d)] ≤ [∆Q(f)]^{1/2}. [sent-180, score-0.249]
71 Furthermore, if the margin condition (6) holds, then ∆R[C(f, 1 − 2d)] ≤ K [∆Q(f)]^{(1+α)/(2+α)} for some constant K > 0. [sent-181, score-0.146]
72 From Theorems 9 and 11, we conclude that Corollary 16 For the distance weighted discrimination loss, ∆R[C(f, ((1 − d)/d)^{1/2} γ)] ≤ γ^{1/2} (1 − d)^{1/4} d^{1/4} [∆Q(f)]^{1/2}. [sent-190, score-0.13]
73 Furthermore, if the margin condition (6) holds, then ∆R[C(f, ((1 − d)/d)^{1/2} γ)] ≤ K [∆Q(f)]^{(1+α)/(2+α)} for some constant K > 0. [sent-191, score-0.146]
74 6 Hinge Loss with Rejection Option As shown by Bartlett and Wegkamp (2008), for the modified hinge loss (4), arg min_z Qη(z) = −1 if η ≤ d; 0 if d < η < 1 − d; 1 if η > 1 − d. [sent-193, score-0.369]
75 Simple algebraic manipulations lead to ∆Qη(−δ) = (1 − δ)(d − η)/d if η ≤ d; (η − d)δ/d if d < η < 1 − d; 1 − (1 − η)/d + (η − d)δ/d if η > 1 − d, and ∆Qη(δ) = 1 − η/d + (1 − η − d)δ/d if η ≤ d; (1 − η − d)δ/d if d < η < 1 − d. [sent-194, score-0.155]
76 From Theorems 10 and 11, we conclude that Corollary 17 For the modified hinge loss and any δ < 1, ∆R[C(f, δ)] ≤ (d / min{δ, 1 − δ}) ∆Q(f). (8) [sent-197, score-0.146]
77 Furthermore, if the margin condition (6) holds, then ∆R[C(f, δ)] ≤ K ∆Q(f) (9) for some constant K > 0. [sent-198, score-0.146]
78 Notice that the corollary also suggests that δ = 1/2 yields the best constant 2d in the upper bound. [sent-199, score-0.093]
79 It is also interesting to see that (8) cannot be further improved by the generalized margin condition (6) as the bounds (8) and (9) only differ by a constant. [sent-201, score-0.146]
80 The analysis of the generalized hinge loss is complicated and is treated in detail in Wegkamp (2007) and Bartlett and Wegkamp (2008). [sent-204, score-0.369]
81 The other loss functions φ considered in this paper have in common that the modulus of convexity of Q, δ(ε) = inf { [Q(f) + Q(g)]/2 − Q((f + g)/2) : E[(f − g)²(X)] ≥ ε² }, satisfies δ(ε) ≥ cε² for some c > 0 and that, for some L < ∞, |φ(x) − φ(x′)| ≤ L|x − x′| for all x, x′ ∈ R. [sent-205, score-0.267]
82 Furthermore, if the generalized margin condition (6) holds, then with probability at least 1 − γ, ∆R(C(fˆn, δ)) ≤ K [ inf_{f∈F} ∆Q(f) + 3L/n + (L²/(2c) + 8LB/3) log(Nn/γ)/n ]^{1/(s+β−βs)} for some constant K > 0. [sent-209, score-0.231]
83 In many applications, however, one type of misclassification may incur a heavier loss than the other. [sent-217, score-0.151]
84 Such situations naturally arise in risk management or medical diagnosis. [sent-218, score-0.226]
85 To this end, the following loss function can be adopted in place of ℓ: ℓθ[g(X),Y] = 1 if g(X) = −1 and Y = 1; θ if g(X) = 1 and Y = −1; d if g(X) = 0; 0 if g(X) = Y. [sent-219, score-0.322]
86 We shall assume that θ < 1 without loss of generality. [sent-220, score-0.255]
87 It can be shown that the rejection option is only available if d < θ/(1 + θ) (see, e. [sent-221, score-0.262]
88 , Herbei and Wegkamp, 2006), which we shall assume throughout the section. [sent-223, score-0.104]
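A sketch of the asymmetric loss ℓθ, including the availability check d < θ/(1 + θ) just mentioned (illustrative code, names ours):

```python
def asymmetric_reject_loss(g_x: int, y: int, d: float, theta: float) -> float:
    """l_theta[g(X), Y]: a false negative (g = -1, Y = 1) costs 1, a false positive
    (g = 1, Y = -1) costs theta < 1, and a rejection (g = 0) costs d."""
    if not d < theta / (1.0 + theta):
        raise ValueError("reject option only makes sense when d < theta / (1 + theta)")
    if g_x == 0:
        return d
    if g_x == -1 and y == 1:
        return 1.0
    if g_x == 1 and y == -1:
        return theta
    return 0.0
```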
89 For brevity, we shall abbreviate the dependence of η and fφ on X in the remainder of the proof when no confusion occurs. [sent-241, score-0.203]
90 From the previous discussion, fφ∗ = arg min_z Qη(z) = − arg min_z Q1−η(−z) > δ. [sent-255, score-0.178]
91 The infinite sample consistency implies that for any η > 1 − d, fφ∗ > δ. [sent-266, score-0.14]
92 Hence Qη(f) − Qη(−δ) ≥ [ηa− − (1 − η)b+](f + δ) > 0, which implies that arg min_z Qη(z) ≥ −δ. [sent-279, score-0.122]
93 Following a similar argument as before, one can show that Q1−η ( f ) − Q1−η (δ) > 0 for any f > δ, which implies that arg min Qη (z) ≤ δ. [sent-284, score-0.122]
94 This again contradicts infinite sample consistency because 1 − η < 1 − d. [sent-285, score-0.132]
95 Also write ∆Qη ( f ) = Qη ( f ) − inf Qη ( f ) and ∆Rη ( f ) = Rη ( f ) − inf Rη ( f ). [sent-291, score-0.17]
96 For brevity, we shall abbreviate their dependence on X in what follows. [sent-295, score-0.147]
97 Let g = C(f, δ) be the classification rule with reject option based on f : X → R and set g∗ = C(fφ∗, δ). [sent-333, score-0.482]
98 Since fn minimizes Qn(f), we have Q(fn) − Q(f¯) = Ph(Y fn(X)) = 2Pn h(Y fn(X)) + (P − 2Pn)h(Y fn(X)) ≤ 2Pn h(Y f¯(X)) + (P − 2Pn)h(Y fn(X)) ≤ sup_{f∈F} (P − 2Pn)h(Y f(X)), where Ph(Y f(X)) = E[h(Y f(X))] and Pn h(Y f(X)) = (1/n) ∑_{i=1}^n h(Yi f(Xi)) for any f ∈ F. [sent-342, score-0.684]
99 Classification with a reject option using a hinge loss. [sent-361, score-0.621]
100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-409, score-0.324]
wordName wordTfidf (topN-words)
[('wegkamp', 0.389), ('egkamp', 0.28), ('ejection', 0.252), ('ption', 0.252), ('reject', 0.22), ('hinge', 0.218), ('excess', 0.218), ('risk', 0.202), ('uan', 0.199), ('herbei', 0.192), ('option', 0.183), ('mcauliffe', 0.168), ('loss', 0.151), ('bartlett', 0.128), ('lassification', 0.121), ('fn', 0.114), ('shall', 0.104), ('margin', 0.097), ('corollary', 0.093), ('inf', 0.085), ('mammen', 0.084), ('theorem', 0.08), ('ph', 0.08), ('rejection', 0.079), ('rule', 0.079), ('theorems', 0.079), ('discrimination', 0.075), ('surrogate', 0.075), ('consistency', 0.072), ('marten', 0.072), ('ming', 0.072), ('nn', 0.066), ('tsybakov', 0.063), ('squares', 0.063), ('classi', 0.062), ('minimizer', 0.061), ('qn', 0.061), ('misclassi', 0.058), ('marron', 0.056), ('tightened', 0.056), ('incurs', 0.056), ('cs', 0.055), ('met', 0.054), ('lb', 0.053), ('convex', 0.05), ('condition', 0.049), ('min', 0.047), ('jordan', 0.046), ('yuan', 0.045), ('abbreviate', 0.043), ('todd', 0.043), ('arg', 0.042), ('observe', 0.039), ('logistic', 0.038), ('brevity', 0.036), ('differentiability', 0.035), ('minimizers', 0.035), ('manipulations', 0.035), ('sample', 0.035), ('implies', 0.033), ('proof', 0.032), ('convexity', 0.031), ('algebraic', 0.031), ('squared', 0.031), ('exp', 0.031), ('discriminant', 0.03), ('minimization', 0.03), ('disease', 0.03), ('seek', 0.03), ('asymmetric', 0.03), ('risks', 0.03), ('distance', 0.03), ('differentiable', 0.029), ('cation', 0.029), ('furthermore', 0.028), ('plays', 0.028), ('exponent', 0.028), ('employs', 0.028), ('preferable', 0.028), ('ces', 0.027), ('modi', 0.027), ('log', 0.027), ('response', 0.027), ('exponential', 0.026), ('consistent', 0.026), ('contradicts', 0.025), ('weighted', 0.025), ('recall', 0.025), ('ce', 0.024), ('medical', 0.024), ('consequences', 0.024), ('generalizes', 0.024), ('therefore', 0.024), ('nite', 0.024), ('sequencing', 0.024), ('isye', 0.024), ('milton', 0.024), ('myuan', 0.024), ('reminder', 0.024), ('ripley', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
2 0.12783651 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
3 0.12044732 25 jmlr-2010-Composite Binary Losses
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
4 0.11510136 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
5 0.10164212 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
6 0.10137545 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
7 0.068081677 22 jmlr-2010-Classification Using Geometric Level Sets
8 0.065333247 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.059251495 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
10 0.058934279 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.058211766 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
12 0.051804475 27 jmlr-2010-Consistent Nonparametric Tests of Independence
13 0.051762886 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
14 0.050314654 60 jmlr-2010-Learnability, Stability and Uniform Convergence
15 0.049448077 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
16 0.048903514 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
18 0.048610032 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
19 0.048262417 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
20 0.04736349 84 jmlr-2010-On Spectral Learning
topicId topicWeight
[(0, -0.195), (1, -0.112), (2, 0.136), (3, 0.055), (4, -0.061), (5, 0.044), (6, 0.056), (7, 0.21), (8, -0.074), (9, -0.015), (10, 0.031), (11, -0.222), (12, -0.115), (13, -0.077), (14, -0.157), (15, -0.047), (16, -0.18), (17, -0.125), (18, -0.167), (19, 0.039), (20, -0.119), (21, -0.108), (22, -0.116), (23, 0.062), (24, 0.039), (25, -0.093), (26, 0.099), (27, -0.047), (28, 0.082), (29, 0.178), (30, 0.095), (31, 0.074), (32, 0.106), (33, -0.024), (34, -0.072), (35, -0.041), (36, 0.011), (37, -0.004), (38, -0.026), (39, -0.063), (40, 0.019), (41, -0.004), (42, 0.004), (43, 0.047), (44, 0.101), (45, 0.006), (46, -0.006), (47, -0.116), (48, -0.003), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.95812577 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
2 0.7019434 25 jmlr-2010-Composite Binary Losses
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
3 0.61778182 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
4 0.61029142 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
Author: Vladimir Koltchinskii
Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient
5 0.52583724 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
6 0.41289207 22 jmlr-2010-Classification Using Geometric Level Sets
7 0.38292027 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
8 0.34194478 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
9 0.31895131 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
10 0.31131509 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
11 0.2980417 77 jmlr-2010-Model-based Boosting 2.0
12 0.29632083 60 jmlr-2010-Learnability, Stability and Uniform Convergence
13 0.29129481 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
14 0.2666117 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
15 0.26505283 102 jmlr-2010-Semi-Supervised Novelty Detection
16 0.25280431 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
17 0.2519148 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
18 0.2507658 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.24974081 84 jmlr-2010-On Spectral Learning
20 0.24901235 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
topicId topicWeight
[(1, 0.507), (3, 0.026), (8, 0.023), (21, 0.018), (32, 0.098), (36, 0.024), (37, 0.046), (75, 0.111), (85, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.79016459 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
2 0.5619064 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
Author: Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu
Abstract: To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy. Keywords: binary tree, generalization error ¨bound, margin-based theory, pattern classification, ı tree decomposition, support vector machine, VC theory
3 0.36528713 60 jmlr-2010-Learnability, Stability and Uniform Convergence
Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan
Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization
4 0.35334286 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.34811634 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
6 0.34106129 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
7 0.33567804 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.33230138 102 jmlr-2010-Semi-Supervised Novelty Detection
9 0.32962143 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
10 0.32720968 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.32166868 25 jmlr-2010-Composite Binary Losses
12 0.32133418 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
13 0.32072926 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
14 0.31794515 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
15 0.31357482 84 jmlr-2010-On Spectral Learning
16 0.3133637 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
18 0.31116366 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
19 0.31075975 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
20 0.30889323 109 jmlr-2010-Stochastic Composite Likelihood