jmlr jmlr2008 jmlr2008-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. [sent-8, score-0.05]
2 As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). [sent-9, score-0.185]
3 Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. [sent-10, score-0.109]
4 We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. [sent-12, score-0.097]
5 We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. [sent-14, score-0.09]
6 Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines. [sent-15, score-0.633]
7 A discriminant function f : X → R yields a classifier sgn( f (x)) ∈ {−1, +1} that represents our guess of the label Y of a future observation X and we err if the margin y · f (x) < 0. [sent-17, score-0.096]
8 The Bayes discriminant function P{Y = 1|X = x} − P{Y = −1|X = x} minimizes the probability of misclassification P{Y f (X) < 0}. [sent-18, score-0.078]
9 While it is our aim to classify the majority of future observations in an automatic way, it is often appropriate to instead report a warning for those observations that are hard to classify (the ones having conditional probability η(x) near the value 1/2). [sent-21, score-0.044]
10 This motivates the introduction of a reject option for classifiers, by allowing for a third decision (reject). [sent-22, score-0.392]
11 For instance, in clinical trials it is important to be able to reject a tumor diagnostic classification, since the consequences of misdiagnosis are severe and scientific expertise is required to make a reliable determination. [sent-26, score-0.234]
12 In the engineering community, on the other hand, this option is more common and has been empirically shown to effectively reduce the misclassification rate (Chow, 1970; Fumera and Roli, 2002, 2004; Fumera et al. [sent-28, score-0.158]
13 We propose to incorporate the reject option into our classification scheme by using a threshold value 0 ≤ δ < 1 as follows. [sent-34, score-0.392]
14 Given a discriminant function f : X → R, we report sgn( f (x)) ∈ {−1, 1} if | f (x)| > δ, but we withhold decision if | f (x)| ≤ δ and report the reject decision. [sent-35, score-0.047]
15 In this note, we assume that the cost of making a wrong decision is 1 and the cost of using the reject option is d > 0. [sent-36, score-0.442]
16 The appropriate risk function is then Ld,δ( f ) = E ℓd,δ(Y f (X)) = P{Y f (X) < −δ} + d P{|Y f (X)| ≤ δ} (1) for the discontinuous loss ℓd,δ(z) = 1 if z < −δ, d if |z| ≤ δ, and 0 otherwise. [sent-37, score-0.206]
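For concreteness, the decision rule and the empirical version of the risk in (1) can be written in a few lines; the sketch below is our own illustration (function names and the toy data are assumptions, not the authors' code), with the reject decision encoded as 0.

```python
import numpy as np

def reject_option_predict(f_values, delta):
    """Report sign(f(x)) when |f(x)| > delta and the reject decision (encoded as 0) otherwise."""
    f_values = np.asarray(f_values, dtype=float)
    preds = np.sign(f_values)
    preds[np.abs(f_values) <= delta] = 0.0
    return preds

def empirical_risk(f_values, y, d, delta):
    """Empirical version of L_{d,delta}(f) in (1): cost 1 for a wrong sign, cost d for a rejection."""
    margins = np.asarray(y, dtype=float) * np.asarray(f_values, dtype=float)
    wrong = margins < -delta
    rejected = np.abs(margins) <= delta
    return np.mean(wrong + d * rejected)

# toy usage with synthetic scores and labels
rng = np.random.default_rng(0)
f_vals, y = rng.normal(size=8), rng.choice([-1, 1], size=8)
print(reject_option_predict(f_vals, delta=0.5), empirical_risk(f_vals, y, d=0.2, delta=0.5))
```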
17 The classifier associated with the discriminant function fd∗(x) that minimizes the risk Ld,δ( f ) assigns −1, +1 or reject, depending on which of η(x), 1 − η(x) or d is smallest. [sent-38, score-0.214]
18 Since we never reject if d > 1/2, we restrict ourselves to the cases 0 ≤ d ≤ 1/2. [sent-39, score-0.234]
19 The generalized Bayes discriminant function fd∗(x) is then fd∗(x) = −1 if η(x) < d, 0 if d ≤ η(x) ≤ 1 − d, +1 if η(x) > 1 − d, (2) with risk Ld∗ = Ld,δ( fd∗ ) = E min{η(X), 1 − η(X), d}. [sent-40, score-1.563]
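A plug-in version of (2) only needs an estimate of η(x) and the two thresholds d and 1 − d; the following sketch assumes such an estimate is handed in as an array (names and the 0-for-reject encoding are illustrative choices, not the paper's).

```python
import numpy as np

def generalized_bayes_rule(eta, d):
    """Plug-in version of f*_d in (2): -1 if eta < d, 0 (reject) if d <= eta <= 1-d, +1 if eta > 1-d."""
    eta = np.asarray(eta, dtype=float)
    out = np.zeros_like(eta)          # 0 encodes the reject decision
    out[eta < d] = -1.0
    out[eta > 1 - d] = 1.0
    return out

def bayes_risk_estimate(eta, d):
    """Empirical version of L*_d = E min{eta(X), 1 - eta(X), d} over sampled X."""
    eta = np.asarray(eta, dtype=float)
    return np.mean(np.minimum(np.minimum(eta, 1 - eta), d))

# toy usage with assumed conditional probabilities
print(generalized_bayes_rule([0.1, 0.5, 0.9], d=0.2))   # [-1.  0.  1.]
print(bayes_risk_estimate([0.1, 0.5, 0.9], d=0.2))      # mean of (0.1, 0.2, 0.1)
```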
20 The case (δ, d) = (0, 1/2) reduces to the classical situation without the reject option. [sent-41, score-0.234]
21 We emphasize that the rejection cost d should be known a priori. [sent-42, score-0.058]
22 In a medical setting, when determining whether a disease is present or absent, the reject option often leads to quantifiable costs for additional tests and perhaps delays in treatment. [sent-43, score-0.392]
23 Plug-in classification rules replace the regression function η(x) by an estimate η̂(x) in the formula for fd∗(x) above. [sent-47, score-0.69]
24 It is shown by Herbei and Wegkamp (2006) that the rate of convergence of the risk (1) to the Bayes risk Ld∗ of a general plug-in rule with reject option depends on how well η̂(X) estimates η(X) and on the behavior of η(X) near the values d and 1 − d. [sent-48, score-0.664]
25 This condition on η(X) nicely generalizes the margin condition of Tsybakov (2004) from the classical setting (d = 1/2) to our more general framework (0 ≤ d ≤ 1/2). [sent-49, score-0.109]
26 Despite its attractive theoretical properties, the naive empirical risk minimization method is often hard to implement. [sent-53, score-0.161]
27 This paper addresses this pitfall by considering a convex surrogate for the loss function akin to the hinge loss that is used in SVMs. [sent-54, score-0.251]
28 In the engineering literature, there are recent encouraging empirical results on SVMs with a reject option (Bounsiar et al. [sent-55, score-0.392]
29 The next section introduces a piecewise linear loss function φ d (x) that generalizes the hinge loss function max{0, 1 − x} in that it allows for the reject option and φ d (x) = max{0, 1 − x} for d = 1/2. [sent-58, score-0.564]
30 We prove that fd∗ in (2) also minimizes the risk associated with this new loss and that the excess risk Ld,δ − Ld∗ can be bounded by 2d times the excess risk based on the piecewise linear loss φd if δ = 1/2. [sent-59, score-0.857]
31 Thus classifiers with small excess φd -risk automatically have small excess classification risk, providing theoretical justification of the more computationally appealing method. [sent-60, score-0.296]
32 In Section 3, we illustrate the computational convenience of the new loss, showing that the SVM classifier with reject option can be obtained by solving a standard convex optimization problem. [sent-61, score-0.441]
33 Finally, in Section 4, we show that fast rates (for instance, faster than n −1/2 ) of the SVM classifier with reject option are possible under the same noise conditions on η(X) used by Herbei and Wegkamp (2006). [sent-62, score-0.432]
34 As a side effect, for the standard SVM (the special case of d = 1/2), our results imply fast rates without an assumption that η(X) is unlikely to be near 0 and 1, a technical condition that has been imposed in the literature for that case (Blanchard et al. [sent-63, score-0.098]
35 Generalized Hinge Loss. Instead of the discontinuous loss ℓd,δ, we consider the convex surrogate loss φd(z) = 1 − az if z < 0, 1 − z if 0 ≤ z < 1, 0 otherwise, where a = (1 − d)/d ≥ 1. [sent-66, score-0.228]
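Because a = (1 − d)/d ≥ 1, the three branches collapse into a single maximum, φd(z) = max{0, 1 − z, 1 − az}, which reduces to the usual hinge loss when d = 1/2. A minimal sketch of this loss (our own helper, not code from the paper):

```python
import numpy as np

def phi_d(z, d):
    """Generalized hinge loss: 1 - a*z for z < 0, 1 - z for 0 <= z < 1, 0 otherwise, a = (1-d)/d."""
    a = (1 - d) / d
    z = np.asarray(z, dtype=float)
    # equivalent single-maximum form, valid because a >= 1
    return np.maximum.reduce([np.zeros_like(z), 1 - z, 1 - a * z])

# for d = 1/2 (a = 1) this is exactly the hinge loss max(0, 1 - z)
z = np.array([-1.0, 0.5, 2.0])
assert np.allclose(phi_d(z, 0.5), np.maximum(0, 1 - z))
```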
36 The next result states that the minimizers of the expectations of the discrete loss ℓd,δ(z) and the convex loss φd(z) coincide. [sent-67, score-0.162]
37 Proposition 1 The Bayes discriminant function (2) minimizes the risk Lφd ( f ) = Eφd (Y f (X)) over all measurable f : X → R. [sent-68, score-0.261]
38 Hence, for rη,φd(z) = ηφd(z) + (1 − η)φd(−z) (3) it suffices to show that z∗ = −1 if η < 1/(1 + a), 0 if 1/(1 + a) ≤ η ≤ a/(1 + a), +1 if η > a/(1 + a), minimizes rη,φd(z). [sent-71, score-0.031]
39 The function rη,φd(z) can be written as rη,φd(z) = η − aηz if z ≤ −1, 1 + z(1 − (1 + a)η) if −1 ≤ z ≤ 0, 1 + z(a(1 − η) − η) if 0 ≤ z ≤ 1, (1 − η)(1 + az) if z ≥ 1, and it is now a simple exercise to verify that z∗ indeed minimizes rη,φd(z). [sent-72, score-0.031]
40 Finally, since Lφd( f ) = E rη,φd( f (X)) and inf_z {ηφd(z) + (1 − η)φd(−z)} = ηφd(z∗) + (1 − η)φd(−z∗) = (η/d) 1[η < d] + 1[d ≤ η ≤ 1 − d] + ((1 − η)/d) 1[η > 1 − d], where 1[A] denotes the indicator function of a set A, we find that d Lφd( fd∗ ) = E[min(η(X), 1 − η(X), d)] = Ld∗. [sent-73, score-0.843]
41 The following comparison theorem shows that a relation like this holds not only for the risks, but for the excess risks as well. [sent-77, score-0.193]
42 Theorem 2 Let 0 ≤ d < 1/2 and a measurable function f be fixed. [sent-78, score-0.047]
43 For all 0 < δ ≤ 1/2, we have Ld,δ( f ) − Ld∗ ≤ (d/δ) (Lφd( f ) − Lφd∗), (4) where Lφd∗ = Lφd( fd∗ ). [sent-79, score-0.69]
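The bound is easy to probe numerically. The snippet below is a rough Monte-Carlo illustration (the synthetic η, the arbitrary score f, and δ = 1/2 are our assumptions, not an experiment from the paper): it estimates both excess risks from the conditional risks, using the fact from the proof of Proposition 1 that the minimal conditional φd-risk equals min{η, 1 − η, d}/d, and checks that the classification excess risk does not exceed (d/δ) times the φd-excess risk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta = 0.2, 0.5
a = (1 - d) / d

def phi_d(z):
    # generalized hinge loss in its equivalent max form (valid since a >= 1)
    return np.maximum.reduce([np.zeros_like(z), 1.0 - z, 1.0 - a * z])

# synthetic eta(X) and an arbitrary score function f(X); both are illustrative choices
eta = rng.uniform(0.0, 1.0, size=200_000)
f = rng.uniform(-2.0, 2.0, size=200_000)

# excess classification risk L_{d,delta}(f) - L*_d, via the conditional costs
cost = np.where(f < -delta, eta, np.where(f > delta, 1.0 - eta, d))
excess_class = np.mean(cost - np.minimum(np.minimum(eta, 1.0 - eta), d))

# excess phi_d-risk; the minimal conditional phi_d-risk is min(eta, 1-eta, d)/d
r_phi = eta * phi_d(f) + (1.0 - eta) * phi_d(-f)
excess_phi = np.mean(r_phi - np.minimum(np.minimum(eta, 1.0 - eta), d) / d)

# Theorem 2 with delta = 1/2: the first number should not exceed the second
print(excess_class, (d / delta) * excess_phi)
```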
44 Remark 3 The optimal multiplicative constant (d/δ or 1, depending on the value of δ) in front of the φd-excess risk is achieved at δ = 1/2. [sent-82, score-0.19]
45 For all d ≤ δ ≤ 1 − d, the multiplicative constant in front of the φd -excess risk does not exceed 1. [sent-84, score-0.19]
46 The choice δ = 1 − d corresponds to the largest value of δ for which the piecewise constant function ℓd,δ(z) is still majorized by the convex surrogate φd(z). [sent-86, score-0.151]
47 For δ = d we will reject less frequently than for δ = 1 − d, and δ = 1/2 can be seen as a compromise between these two extreme cases. [sent-87, score-0.234]
48 We define the functions ξ(η) = η 1[η < d] + d 1[d ≤ η ≤ 1 − d] + (1 − η) 1[η > 1 − d] and H(η) = inf_z {ηφd(z) + (1 − η)φd(−z)} = (η/d) 1[η < d] + 1[d ≤ η ≤ 1 − d] + ((1 − η)/d) 1[η > 1 − d]. [sent-90, score-0.153]
49 Furthermore, we define H−1(η) = inf_{z<−δ} (ηφd(z) + (1 − η)φd(−z)), Hr(η) = inf_{|z|≤δ} (ηφd(z) + (1 − η)φd(−z)), H1(η) = inf_{z>δ} (ηφd(z) + (1 − η)φd(−z)); ξ−1(η) = η − ξ(η), ξr(η) = d − ξ(η), ξ1(η) = 1 − η − ξ(η), where the subscripts −1, r and 1 refer to the decisions −1, reject and +1. [sent-93, score-0.459]
50 The proof is in the appendix. [sent-98, score-0.027]
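The restricted infima H−1, Hr and H1 can also be approximated by brute force, which is a convenient cross-check on the closed forms computed in the appendix; the helper below is our own construction (it uses the max form of φd and an illustrative grid), not part of the paper.

```python
import numpy as np

def conditional_phi_risk(eta, z, d):
    """Conditional phi_d-risk r_{eta,phi_d}(z) = eta*phi_d(z) + (1-eta)*phi_d(-z)."""
    a = (1 - d) / d
    phi = lambda t: np.maximum.reduce([np.zeros_like(t), 1 - t, 1 - a * t])
    return eta * phi(z) + (1 - eta) * phi(-z)

def H_pieces(eta, d, delta, z_max=5.0, n_grid=200_001):
    """Brute-force approximations of H_{-1}, H_r and H_1: infima of the conditional
    phi_d-risk over z < -delta, |z| <= delta and z > delta respectively."""
    z = np.linspace(-z_max, z_max, n_grid)
    r = conditional_phi_risk(eta, z, d)
    return (r[z < -delta].min(), r[np.abs(z) <= delta].min(), r[z > delta].min())

# example: for d <= eta <= 1-d the middle infimum H_r equals 1 (attained at z = 0)
print(H_pieces(eta=0.5, d=0.2, delta=0.5))
```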
51 Proof [Proof of Theorem 2] Recall that Ld,δ( f ) = P(η 1[ f < −δ] + d 1[−δ ≤ f ≤ δ] + (1 − η) 1[ f > δ]) and Lφd( f ) = P rη,φd( f ), with rη,φd defined in the proof of Proposition 1. [sent-99, score-0.027]
52 By linearity of ψ, we have for any measurable function f , ψ(Ld,δ( f ) − Ld∗) = P(1[ f < −δ] ψ(ξ−1(η)) + 1[−δ ≤ f ≤ δ] ψ(ξr(η)) + 1[ f > δ] ψ(ξ1(η))). [sent-103, score-0.047]
53 Invoke now Proposition 4 to deduce ψ(Ld,δ( f ) − Ld∗) ≤ P(1[ f < −δ] [H−1(η) − H(η)] + 1[−δ ≤ f ≤ δ] [Hr(η) − H(η)] + 1[ f > δ] [H1(η) − H(η)]) ≤ P(rη,φd( f ) − H(η)) and conclude the proof by observing that the term on the right of the previous inequality equals Lφd( f ) − Lφd∗. [sent-104, score-0.027]
54 SVM Classifiers with Reject Option. In this section, we consider an SVM-like classifier for classification with a reject option, and show that it can be obtained by solving a quadratically constrained quadratic program (QCQP). [sent-107, score-0.234]
55 Let K : X × X → R be the kernel of a reproducing kernel Hilbert space (RKHS) H , and let ‖ f ‖ denote the norm of f in H . [sent-108, score-0.096]
56 The SVM classifier with reject option is the minimizer of the empirical φ d -risk subject to a constraint on the RKHS norm. [sent-109, score-0.419]
57 The following theorem shows that this classifier is the solution to a QCQP, that is, it is the minimizer of a convex quadratic criterion on a convex subset of Euclidean space defined by quadratic inequalities. [sent-110, score-0.147]
58 It follows from Pythagoras' theorem in Hilbert space: the squared RKHS norm can be split into the squared norm of the component in the space spanned by the kernel basis functions x → K(xi, x) and that of the component in the orthogonal subspace. [sent-131, score-0.099]
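One way to make the QCQP concrete is to plug the finite kernel expansion f(x) = Σj αj K(xj, x) + b into the norm-constrained empirical φd-risk minimization and hand the problem to an off-the-shelf convex solver. The sketch below is an assumed formulation rather than the paper's exact program: the Gaussian kernel, the constraint level c, and all names are illustrative, and it relies on the max form φd(z) = max{0, 1 − z, 1 − az}.

```python
import numpy as np
import cvxpy as cp

def fit_reject_svm(X, y, d, c, sigma=1.0):
    """Minimize the empirical phi_d-risk over f(x) = sum_j alpha_j K(x_j, x) + b
    subject to the RKHS-norm constraint alpha^T K alpha <= c**2."""
    n = len(y)
    a = (1 - d) / d
    # Gaussian kernel Gram matrix (illustrative kernel choice)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / sigma**2)
    K = 0.5 * (K + K.T) + 1e-8 * np.eye(n)  # symmetrize and add jitter so the quadratic form is PSD

    alpha = cp.Variable(n)
    b = cp.Variable()
    margins = cp.multiply(y, K @ alpha + b)
    # phi_d(z) = max(0, 1 - z, 1 - a*z), so the empirical risk is convex in (alpha, b)
    losses = cp.maximum(0, 1 - margins, 1 - a * margins)
    prob = cp.Problem(cp.Minimize(cp.sum(losses) / n),
                      [cp.quad_form(alpha, K) <= c**2])
    prob.solve()
    return alpha.value, b.value
```

Any conic solver bundled with cvxpy can handle the resulting problem; in practice the level c (or an equivalent penalty) would be chosen by cross-validation.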
59 For instance, to analyze the SVM classifier with reject option, we could consider classes Fn = { f ∈ H : ‖ f ‖ ≤ cn } for some sequence of constants cn . [sent-141, score-0.312]
60 We are interested in bounds on the excess φd-risk, that is, the difference between the φd-risk of the estimator f̂ and the minimal φd-risk over all measurable functions, of the form E Lφd( f̂ ) − Lφd∗ ≤ 2 inf_{f ∈ F} (Lφd( f ) − Lφd∗) + εn . [sent-142, score-0.348]
61 Such bounds can be combined with an assumption on the rate of decrease of the approximation error inf_{f ∈ Fn} (Lφd( f ) − Lφd∗) for a sequence of classes Fn used by a method of sieves, and thus provide bounds on the rate of convergence of the risk Ld,δ( f̂ ) to the optimal Bayes risk Ld∗. [sent-143, score-0.425]
62 , 2008; Steinwart and Scovel, 2007; Tarigan and van de Geer, 2006; Tsybakov, 2004). [sent-146, score-0.031]
63 For plug-in rules, Herbei and Wegkamp (2006) showed an analogous result for classification with a reject option, where the corresponding condition concerns the probability that η(X) is close to the critical values of d and 1 − d. [sent-147, score-0.286]
64 In this section, we prove a bound on the excess φ d -risk of f that converges rapidly when a condition of this kind applies. [sent-148, score-0.178]
65 For d = 1/2, it is equivalent to the margin condition of Tsybakov (2004). [sent-150, score-0.079]
66 Definition 6 We say that η satisfies the margin condition at d with exponent α > 0 if there is a c ≥ 1 such that for all t > 0, P{|η(X) − d| ≤ t} ≤ ctα and P{|η(X) − (1 − d)| ≤ t} ≤ ctα. [sent-151, score-0.168]
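The two probabilities in Definition 6 are easy to estimate from samples of η(X); the sketch below estimates the curves t ↦ P{|η(X) − d| ≤ t} and t ↦ P{|η(X) − (1 − d)| ≤ t} by Monte Carlo. The uniform model for η is an assumption made only to illustrate the exponent α = 1.

```python
import numpy as np

def margin_condition_curves(eta_samples, d, ts):
    """Empirical versions of P{|eta(X)-d| <= t} and P{|eta(X)-(1-d)| <= t};
    the margin condition with exponent alpha asks that both stay below c * t**alpha."""
    eta_samples = np.asarray(eta_samples)
    p_low = np.array([np.mean(np.abs(eta_samples - d) <= t) for t in ts])
    p_high = np.array([np.mean(np.abs(eta_samples - (1 - d)) <= t) for t in ts])
    return p_low, p_high

# assumed toy model: eta(X) uniform on [0,1], for which both probabilities are ~2t (alpha = 1)
rng = np.random.default_rng(2)
eta = rng.uniform(0.0, 1.0, size=100_000)
print(margin_condition_curves(eta, d=0.2, ts=[0.01, 0.02, 0.05, 0.1]))
```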
67 The reason that conditions of this kind allow fast rates is related to the variance of the excess φd-loss, g f (x, y) = φd(y f (x)) − φd(y fd∗(x)), where fd∗ minimizes the φd-risk. [sent-152, score-1.599]
68 Notice that the expectation of g f is precisely the excess risk of f : E g f (X,Y ) = Lφd( f ) − Lφd∗. [sent-153, score-0.284]
69 We will show that when η satisfies the margin condition at d with exponent α, the variance of each g f is bounded in terms of its expectation, and thus approaches zero as the φ-risk of f approaches the minimal value. [sent-154, score-0.126]
70 We say that G has a Bernstein exponent β with respect to P if there exists a constant B for which G is a (β, B)-Bernstein class. [sent-157, score-0.047]
71 Lemma 8 If η satisfies the margin condition at d with exponent α, then for any class F of measurable uniformly bounded functions, the class G = {g f : f ∈ F } has a Bernstein exponent β = α/(1 + α). [sent-158, score-0.22]
72 The first shows that the excess φd-risk is at least linear in a certain pseudo-norm of the difference between f and fd∗. [sent-160, score-0.148]
73 For η ∈ [0, 1], define ρη( f , fd∗ ) = η| f − fd∗ | if η < d and f < −1, (1 − η)| f − fd∗ | if η > 1 − d and f > 1, | f − fd∗ | otherwise, and recall the definition of the conditional φd-risk in (3). [sent-163, score-2.76]
74 Lemma 9 For η ∈ [0, 1], d (rη,φd( f ) − rη,φd( fd∗ )) ≥ (|η − d| ∧ |η − (1 − d)|) ρη( f , fd∗ ). [sent-164, score-1.38]
75 Proof Since rη,φd is convex, rη,φd( f ) ≥ rη,φd( fd∗ ) + g( f − fd∗ ) for any g in the subgradient of rη,φd at fd∗. [sent-165, score-2.07]
76 In our case, rη,φd is piecewise linear, with four pieces, and the subgradients include elements of absolute value η(1 − d)/d at fd∗ = −1, |η − d|/d at fd∗ = −1 and 0, |1 − η − d|/d at fd∗ = 0 and 1, and (1 − η)(1 − d)/d at fd∗ = 1. [sent-166, score-2.796]
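As a sanity check on Lemma 9 (our own Monte-Carlo experiment, not part of the paper), one can draw random pairs (η, f), form fd∗ and ρη explicitly, and confirm the inequality numerically:

```python
import numpy as np

def check_lemma9(n_trials=100_000, d=0.2, seed=3):
    """Monte-Carlo check of Lemma 9:
    d * (r_eta(f) - r_eta(f*_d)) >= min(|eta - d|, |eta - (1-d)|) * rho_eta(f, f*_d)."""
    rng = np.random.default_rng(seed)
    a = (1 - d) / d
    phi = lambda t: np.maximum.reduce([np.zeros_like(t), 1 - t, 1 - a * t])
    eta = rng.uniform(0.0, 1.0, n_trials)
    f = rng.uniform(-3.0, 3.0, n_trials)
    fstar = np.where(eta < d, -1.0, np.where(eta > 1 - d, 1.0, 0.0))
    r = lambda z: eta * phi(z) + (1 - eta) * phi(-z)   # conditional phi_d-risk
    rho = np.abs(f - fstar)
    rho = np.where((eta < d) & (f < -1), eta * rho,
                   np.where((eta > 1 - d) & (f > 1), (1 - eta) * rho, rho))
    lhs = d * (r(f) - r(fstar))
    rhs = np.minimum(np.abs(eta - d), np.abs(eta - (1 - d))) * rho
    return bool(np.all(lhs >= rhs - 1e-12))

print(check_lemma9())  # expected: True
```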
77 Lemma 10 If ‖ f ‖∞ = B, then for η ∈ [0, 1], ρη( f , fd∗ ) ≤ | f − fd∗ |, and η|φd( f ) − φd( fd∗ )|² + (1 − η)|φd(− f ) − φd(− fd∗ )|² ≤ ((1 − d)/d)² (B + 1) ρη( f , fd∗ ). [sent-171, score-3.45]
78 To see the second, use the fact that φd is flat to the right of 1 to notice that η|φd( f ) − φd( fd∗ )|² + (1 − η)|φd(− f ) − φd(− fd∗ )|² equals η|φd( f ) − φd( fd∗ )|² if η < d and f < −1, and (1 − η)|φd(− f ) − φd(− fd∗ )|² if η > 1 − d and f > 1. [sent-173, score-2.07]
79 Proof [Proof of Lemma 8] By Lemma 9, we have Lφd( f ) − Lφd∗ ≥ (1/d) E[ρη( f , fd∗ ) (|η − (1 − d)| IE− + |η − d| IE+ )], with E− = {|η − (1 − d)| ≤ |η − d|}, E+ = {|η − (1 − d)| > |η − d|}. [sent-175, score-0.69]
80 By definition of f̂ , we have Lφd( f̂ ) − Lφd∗ = P g f̂ = 2Pn g f̂ + (P − 2Pn) g f̂ ≤ 2 inf_{f ∈ F} Pn g f + sup_{f ∈ F} (P − 2Pn) g f . [sent-179, score-2.938]
81 Taking expected values on both sides yields E Lφd( f̂ ) − Lφd∗ ≤ 2 inf_{f ∈ F} (Lφd( f ) − Lφd∗) + E sup_{f ∈ F} (P − 2Pn) g f . [sent-180, score-0.868]
82 Since |g f − g f ′ | ≤ | f − f ′ |(1 − d)/d, it follows that E sup_{f ∈ F} (P − 2Pn) g f ≤ ((1 − d)/d) εn + B P{ sup_{f ∈ Fn} (P − 2Pn) g f ≥ ((1 − d)/d) εn }, where Fn is a minimal εn-covering net of F with εn = M n−(1+α)/(2+p+α+pα) for some constant M to be selected later. [sent-181, score-0.05]
83 Conclude the proof by noting that exp(C εn^−p − c n εn^(2−β)) = exp(−(c/2) n εn^(2−β)), and by choosing the constant M in εn such that C εn^−p = c n εn^(2−β)/2 and exp(−n εn^(2−β)) = o(εn). [sent-183, score-0.048]
84 Remark 13 The constant 2 in front of the minimal excess risk on the right could be made closer to 1, at the expense of increasing C. [sent-184, score-0.311]
85 Theorem 12 discusses minimizers of the empirical risk Lφd over classes F of uniformly bounded functions. [sent-185, score-0.136]
86 Then, if the margin condition holds for α = +∞, we obtain from the proof of Theorem 12 rates of convergence of order log |F |/n. [sent-188, score-0.146]
87 Remark 15 The entropy condition is satisfied for many classes. [sent-190, score-0.03]
88 Let X be a bounded, convex subset of Rd and for every k = (k1 , . [sent-192, score-0.049]
89 Such classes have covering numbers (Kolmogorov and Tichomirov, 1961; van der Vaart and Wellner, 1996) log N(ε, L∞, F ) ≤ Cd (1/ε)^(d/β), for every ε > 0 and some constant Cd depending on the dimension d and the constants c1 and c2, but not on ε. [sent-210, score-0.031]
90 Applying the theorem with p = d/β, we obtain rates between n −β/(2β+d) (for α = 0) and n−β/(d+β) (for α = +∞). [sent-211, score-0.062]
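These two endpoints come from substituting p = d/β into the exponent of εn = M n−(1+α)/(2+p+α+pα) used in the proof of Theorem 12; a tiny helper (the function name is ours) makes the interpolation in α explicit.

```python
def rate_exponent(alpha, beta, dim):
    """Exponent r such that eps_n is of order n**(-r), from (1+alpha)/(2+p+alpha+p*alpha)
    with entropy exponent p = dim/beta (smoothness beta, input dimension dim)."""
    p = dim / beta
    return (1 + alpha) / (2 + p + alpha + p * alpha)

# endpoints quoted in the text: alpha = 0 gives beta/(2*beta+dim); large alpha approaches beta/(beta+dim)
print(rate_exponent(0, beta=2, dim=3))      # 2/7
print(rate_exponent(1e6, beta=2, dim=3))    # close to 2/5
```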
91 For instance, let H be the RKHS corresponding to the Gaussian kernel K(x, y) = exp(−‖x − y‖²/σ²) and let ‖ f ‖ denote the norm of f in H . [sent-213, score-0.05]
92 For F = FR = { f ∈ H : ‖ f ‖ ≤ R}, Zhou (2003) proves that, for X = [0, 1]d, fixed R and fixed scale parameter σ, the entropy bound log N(ε, L∞, F ) ≤ Cd log^(d+1)(1/ε) holds for some Cd < ∞, and the rates of convergence range between those for α = 0 and α = ∞. [sent-214, score-0.04]
93 Next, we compute Hr(η) = inf_{|z|≤δ} (ηφd(z) + (1 − η)φd(−z)) = (1 − δ + (δ/d)η) 1[η < d] + 1[d ≤ η ≤ 1 − d] + (1 − δ + (δ/d)(1 − η)) 1[η > 1 − d] and Hr(η) − H(η) = ((1 − δ)/d)(d − η) 1[η < d] + ((1 − δ)/d)(η − (1 − d)) 1[η > 1 − d]. [sent-220, score-0.153]
94 Finally, we find that H1(η) = inf_{z>δ} (ηφd(z) + (1 − η)φd(−z)) = ((1 − η)/d) 1[η > 1 − d] + (δ/d + 1 − δ − (δ/d)η) 1[η ≤ 1 − d] and consequently H1(η) − H(η) = (δ/d + 1 − δ − (δ/d)η − η/d) 1[η < d] + (δ/d − δ − (δ/d)η) 1[d ≤ η ≤ 1 − d]. [sent-222, score-0.153]
95 This concludes the proof of the second claim, since d ≤ δ ≤ 1 − d. [sent-239, score-0.027]
96 Fast learning rates for plug-in classifiers under margin conditions. [sent-246, score-0.089]
97 The interaction between classification and reject performance for distance-based reject-option classifiers. [sent-377, score-0.234]
98 Fast rates for support vector machines using gaussian kernels. [sent-391, score-0.04]
99 Reducing the classification cost of support vector classifiers through an ROC-based rejection rule. [sent-401, score-0.058]
100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-417, score-0.185]
wordName wordTfidf (topN-words)
[('fd', 0.69), ('ld', 0.305), ('reject', 0.234), ('option', 0.158), ('inf', 0.153), ('egkamp', 0.152), ('fumera', 0.152), ('excess', 0.148), ('risk', 0.136), ('eject', 0.133), ('ption', 0.133), ('inge', 0.113), ('oss', 0.113), ('wegkamp', 0.097), ('herbei', 0.095), ('pg', 0.093), ('lassification', 0.081), ('bartlett', 0.079), ('tsybakov', 0.072), ('bernstein', 0.072), ('surrogate', 0.066), ('fn', 0.064), ('ie', 0.058), ('logd', 0.057), ('roli', 0.057), ('dk', 0.052), ('hinge', 0.05), ('margin', 0.049), ('convex', 0.049), ('tarigan', 0.048), ('discriminant', 0.047), ('measurable', 0.047), ('exponent', 0.047), ('classi', 0.047), ('cd', 0.044), ('rkhs', 0.043), ('kolmogorov', 0.043), ('loss', 0.043), ('rates', 0.04), ('hb', 0.04), ('cn', 0.039), ('bounsiar', 0.038), ('golfarelli', 0.038), ('tichomirov', 0.038), ('blanchard', 0.037), ('boucheron', 0.037), ('piecewise', 0.036), ('tp', 0.033), ('rejection', 0.033), ('ers', 0.033), ('marten', 0.032), ('landgrebe', 0.032), ('guo', 0.032), ('kimeldorf', 0.032), ('bousquet', 0.031), ('minimizes', 0.031), ('van', 0.031), ('condition', 0.03), ('pn', 0.03), ('geer', 0.029), ('audibert', 0.029), ('cox', 0.029), ('qcqp', 0.029), ('unlikely', 0.028), ('er', 0.028), ('annals', 0.028), ('proposition', 0.028), ('front', 0.027), ('proof', 0.027), ('minimizer', 0.027), ('norm', 0.027), ('discontinuous', 0.027), ('multiplicative', 0.027), ('el', 0.026), ('minimization', 0.025), ('sup', 0.025), ('vaart', 0.025), ('hansen', 0.025), ('gy', 0.025), ('kd', 0.025), ('dl', 0.025), ('eg', 0.025), ('cost', 0.025), ('bayes', 0.025), ('svm', 0.025), ('recognition', 0.024), ('reproducing', 0.023), ('kernel', 0.023), ('remark', 0.023), ('risks', 0.023), ('xi', 0.023), ('classify', 0.022), ('claim', 0.022), ('critical', 0.022), ('theorem', 0.022), ('sgn', 0.022), ('misclassi', 0.022), ('yi', 0.021), ('ct', 0.021), ('exp', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
2 0.12522101 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
3 0.10911611 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
Author: Imhoi Koo, Rhee Man Kil
Abstract: This paper presents a new method of model selection for regression problems using the modulus of continuity. For this purpose, we suggest the prediction risk bounds of regression models using the modulus of continuity which can be interpreted as the complexity of functions. We also present the model selection criterion referred to as the modulus of continuity information criterion (MCIC) which is derived from the suggested prediction risk bounds. The suggested MCIC provides a risk estimate using the modulus of continuity for a trained regression model (or an estimation function) while other model selection criteria such as the AIC and BIC use structural information such as the number of training parameters. As a result, the suggested MCIC is able to discriminate the performances of trained regression models, even with the same structure of training models. To show the effectiveness of the proposed method, the simulation for function approximation using the multilayer perceptrons (MLPs) was conducted. Through the simulation for function approximation, it was demonstrated that the suggested MCIC provides a good selection tool for nonlinear regression models, even with the limited size of data. Keywords: regression models, multilayer perceptrons, model selection, information criteria, modulus of continuity
5 0.065339372 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
6 0.046353579 52 jmlr-2008-Learning from Multiple Sources
7 0.043721758 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
8 0.040900912 58 jmlr-2008-Max-margin Classification of Data with Absent Features
9 0.038753081 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
10 0.038402684 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
11 0.03772524 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
12 0.037509784 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
13 0.037135426 92 jmlr-2008-Universal Multi-Task Kernels
14 0.034573719 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.03418659 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
16 0.032075722 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
17 0.03137948 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
18 0.029036339 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.029014142 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
20 0.028623745 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
topicId topicWeight
[(0, 0.156), (1, -0.089), (2, 0.05), (3, -0.075), (4, 0.091), (5, -0.094), (6, 0.125), (7, -0.075), (8, 0.124), (9, -0.052), (10, 0.101), (11, -0.168), (12, -0.057), (13, -0.037), (14, -0.002), (15, -0.055), (16, -0.095), (17, -0.126), (18, 0.047), (19, 0.041), (20, -0.151), (21, 0.248), (22, 0.002), (23, 0.051), (24, -0.036), (25, -0.146), (26, -0.215), (27, -0.106), (28, 0.121), (29, 0.019), (30, 0.096), (31, -0.171), (32, -0.012), (33, -0.22), (34, -0.08), (35, -0.273), (36, 0.008), (37, -0.063), (38, 0.114), (39, 0.072), (40, -0.145), (41, -0.081), (42, 0.04), (43, -0.072), (44, 0.067), (45, -0.079), (46, 0.04), (47, -0.018), (48, 0.142), (49, -0.206)]
simIndex simValue paperId paperTitle
same-paper 1 0.94378871 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
2 0.62620205 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
3 0.36880443 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
Author: Imhoi Koo, Rhee Man Kil
Abstract: This paper presents a new method of model selection for regression problems using the modulus of continuity. For this purpose, we suggest the prediction risk bounds of regression models using the modulus of continuity which can be interpreted as the complexity of functions. We also present the model selection criterion referred to as the modulus of continuity information criterion (MCIC) which is derived from the suggested prediction risk bounds. The suggested MCIC provides a risk estimate using the modulus of continuity for a trained regression model (or an estimation function) while other model selection criteria such as the AIC and BIC use structural information such as the number of training parameters. As a result, the suggested MCIC is able to discriminate the performances of trained regression models, even with the same structure of training models. To show the effectiveness of the proposed method, the simulation for function approximation using the multilayer perceptrons (MLPs) was conducted. Through the simulation for function approximation, it was demonstrated that the suggested MCIC provides a good selection tool for nonlinear regression models, even with the limited size of data. Keywords: regression models, multilayer perceptrons, model selection, information criteria, modulus of continuity
5 0.26241311 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
6 0.20184007 52 jmlr-2008-Learning from Multiple Sources
7 0.18740806 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
8 0.16697502 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
9 0.16145587 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
10 0.14971823 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
11 0.1493963 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.14868687 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
13 0.14815816 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
14 0.14076446 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
15 0.14069894 92 jmlr-2008-Universal Multi-Task Kernels
16 0.13543475 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
17 0.12979098 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
18 0.12177083 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
19 0.1191879 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
20 0.11730763 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
topicId topicWeight
[(0, 0.095), (23, 0.354), (31, 0.015), (39, 0.018), (40, 0.032), (54, 0.036), (58, 0.04), (60, 0.012), (66, 0.05), (76, 0.018), (78, 0.014), (88, 0.101), (92, 0.053), (94, 0.035), (99, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.75792849 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
2 0.40635279 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
Author: Gerda Claeskens, Christophe Croux, Johan Van Kerckhoven
Abstract: Support vector machines for classification have the advantage that the curse of dimensionality is circumvented. It has been shown that a reduction of the dimension of the input space leads to even better results. For this purpose, we propose two information criteria which can be computed directly from the definition of the support vector machine. We assess the predictive performance of the models selected by our new criteria and compare them to existing variable selection techniques in a simulation study. The simulation results show that the new criteria are competitive in terms of generalization error rate while being much easier to compute. We arrive at the same findings for comparison on some real-world benchmark data sets. Keywords: information criterion, supervised classification, support vector machine, variable selection
4 0.40284169 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
5 0.37488645 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
6 0.37460214 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
7 0.37243938 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
8 0.37220275 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
9 0.36921477 57 jmlr-2008-Manifold Learning: The Price of Normalization
10 0.3645055 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
11 0.36167356 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
12 0.36093965 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
13 0.35918024 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
14 0.35878769 9 jmlr-2008-Active Learning by Spherical Subdivision
15 0.35680887 96 jmlr-2008-Visualizing Data using t-SNE
16 0.3555904 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
17 0.35240525 86 jmlr-2008-SimpleMKL
18 0.35116833 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.35043272 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
20 0.35032195 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning