JMLR 2011-69: Neyman-Pearson Classification, Convexity and Stochastic Constraints
Source: pdf
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii) it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed, and they have consequences for chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
With slight abuse of language, in verbal discussion we do not distinguish the type I/II error from the probability of type I/II error.
In the learning context, as the true errors are inaccessible, we cannot enforce the desired upper bound on the type I error almost surely.
The best we can hope for is that a data-dependent classifier has its type I error bounded with high probability.
The first is that the type I error of the classifier f̂ is bounded from above by a pre-specified level with pre-specified high probability; the second is that f̂ has good performance bounds for the excess type II error.
The proposed classifier takes a conservative attitude toward the type I error and has a good performance bound measured by the excess ϕ-type II error.
A parallel between binary classification and statistical hypothesis testing is drawn in Section 3, with emphasis on the NP paradigm in both frameworks.
Finally, Section 5 illustrates an application of our results to chance constrained optimization.
Clearly the indicator function is not convex, and for computational convenience, a common practice is to replace it with a convex surrogate (see, e.g., …).
Definition 1 A function ϕ : [−1, 1] → R+ is called a convex surrogate if it is non-decreasing, continuous and convex, and if ϕ(0) = 1.
Commonly used examples of convex surrogates are the hinge loss ϕ(x) = (1 + x)_+, the logit loss ϕ(x) = log_2(1 + e^x), and the exponential loss ϕ(x) = e^x.
In our subsequent analysis, this convex relaxation will also be the ground on which we analyze a stochastic convex optimization problem subject to stochastic constraints.
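The defining properties of Definition 1 can be checked numerically for the three surrogates above. This is an illustrative sketch, not code from the paper:

```python
import numpy as np

# The three convex surrogates named above; each should satisfy phi(0) = 1
# and be non-decreasing and convex on [-1, 1].
def hinge(x):
    return np.maximum(1.0 + x, 0.0)   # (1 + x)_+

def logit(x):
    return np.log2(1.0 + np.exp(x))   # log_2(1 + e^x)

def exponential(x):
    return np.exp(x)                  # e^x

xs = np.linspace(-1.0, 1.0, 201)
for phi in (hinge, logit, exponential):
    vals = phi(xs)
    assert abs(phi(0.0) - 1.0) < 1e-12          # phi(0) = 1
    assert np.all(np.diff(vals) >= -1e-12)      # non-decreasing
    assert np.all(np.diff(vals, 2) >= -1e-9)    # convex (non-negative 2nd differences)
```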
While the h_j's may have no satisfactory classifying power individually, for over two decades boosting-type algorithms have successfully exploited the idea that a suitable weighted majority vote among these classifiers may result in a low classification risk (Schapire, 1990).
Consequently, we restrict our search for classifiers to the set of functions consisting of convex combinations of the h_j's:
H^conv = { h_λ = ∑_{j=1}^M λ_j h_j , λ ∈ Λ },
where Λ denotes the flat simplex of R^M, defined by Λ = { λ ∈ R^M : λ_j ≥ 0, ∑_{j=1}^M λ_j = 1 }.
In effect, classification rules given by the sign of h ∈ H^conv are exactly the rules produced by weighted majority votes among the base classifiers h_1, ..., h_M.
By restricting our search to classifiers in H^conv, the best attainable ϕ-risk is called the oracle risk and is abusively denoted by Rϕ(H^conv).
As a result, we have Rϕ(h) ≥ Rϕ(H^conv) for any h ∈ H^conv, and a natural measure of performance for a classifier h ∈ H^conv is its excess risk, defined by Rϕ(h) − Rϕ(H^conv).
The excess risk of a data-driven classifier h_n is a random quantity, and we are interested in bounding it with high probability.
Formally, the statistical goal of binary classification is to construct a classifier h_n such that the oracle inequality
Rϕ(h_n) ≤ Rϕ(H^conv) + ∆_n(H^conv, δ)
holds with probability 1 − δ, where ∆_n(·, ·) should be as small as possible.
In the scope of this paper, we focus on candidate classifiers in the class H^conv.
Under this convention, the risk function in classical binary classification can be expressed as a convex combination of the type I error R^−(h) = IP(−Y h(X) ≥ 0 | Y = −1) and the type II error R^+(h) = IP(−Y h(X) > 0 | Y = 1):
R(h) = IP(Y = −1) R^−(h) + IP(Y = 1) R^+(h).
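This decomposition can be checked on simulated data; the Gaussian class-conditional model below is a hypothetical illustration, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
y = rng.choice([-1, 1], size=n, p=[0.5, 0.5])
x = y + rng.normal(size=n)   # X | Y = y  ~  N(y, 1)
h = x                        # score function; the label estimate is sign(h)

# R^-(h) = IP(-Y h(X) >= 0 | Y = -1)  and  R^+(h) = IP(-Y h(X) > 0 | Y = +1)
type1 = np.mean(-y[y == -1] * h[y == -1] >= 0)
type2 = np.mean(-y[y == 1] * h[y == 1] > 0)
risk = np.mean(y == -1) * type1 + np.mean(y == 1) * type2

# The convex combination above equals the overall misclassification rate.
direct = np.mean((-y * h >= 0) & (y == -1)) + np.mean((-y * h > 0) & (y == 1))
```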
While the goal of classical binary classification is min_{h∈H} R(h), where H is the set of candidate classifiers, NP classification targets
min R^+(h) subject to R^−(h) ≤ α.
Two kinds of errors arise: a type I error occurs when rejecting P^− when it is true, and a type II error occurs when accepting P^− when it is false.
The Neyman-Pearson paradigm in hypothesis testing amounts to choosing a test φ that solves the following constrained optimization problem:
maximize IE[φ(X) | Y = 1] subject to IE[φ(X) | Y = −1] ≤ α,
where α ∈ (0, 1) is the significance level of the test.
In other words, we specify a significance level α on the type I error and minimize the type II error.
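For a concrete instance of this program, consider testing P^− = N(0,1) against P^+ = N(μ,1): the likelihood ratio is monotone in x, so the level-α Neyman-Pearson test rejects when x exceeds the (1−α)-quantile of the null. This is a standard textbook example, not taken from the paper:

```python
from scipy.stats import norm

alpha, mu = 0.05, 1.0
threshold = norm.ppf(1 - alpha)       # reject P^- when X > threshold

type1 = norm.sf(threshold)            # equals alpha by construction
type2 = norm.cdf(threshold, loc=mu)   # probability of accepting P^- under P^+
```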
Conservativeness on Type I Error
While the binary classification problem has been extensively studied, theoretical propositions on how to implement the NP paradigm remain scarce.
Cannon et al. (2002) initiated the theoretical treatment of the NP classification paradigm, and an early empirical study can be found in Casasent and Chen (2003).
The result of Cannon et al. (2002) states that, simultaneously with high probability, the type II error R^+(ĥ) is bounded from above by R^+(h*) + ε1, for some ε1 > 0, and the type I error of ĥ is bounded from above by α + ε0.
However, it is our primary interest to make sure that R^−(ĥ) ≤ α with high probability, following the original principle of the Neyman-Pearson paradigm that the type I error should be controlled at a pre-specified level α.
As we follow an empirical risk minimization procedure, to control IP(R^−(ĥ) > α) it is necessary that ĥ be a solution to a program with a strengthened constraint on the empirical type I error.
However, we also want to evaluate the excess type II error.
Our conservative attitude toward the type I error raises new technical challenges, which we summarize here.
In the approaches of Cannon et al. (2002) and of Scott and Nowak (2005), the relaxed constraint on the type I error is constructed such that the constraint R^−(ĥ) ≤ α + ε0/2 on the type I error in (2) is satisfied by h* (defined in (3)) with high probability, and such that this classifier accommodates the excess type II error well.
As a result, the control of the type II error mainly follows as a standard exercise in controlling suprema of empirical processes.
Such methods have consequences not only for NP classification but also for chance constrained programming, as explained in Section 5.
Convexified NP Classifier
Concerned with computational feasibility, our proposed classifier is the solution to a convex program, namely an empirical form of the NP classification problem (1), in which the distribution of the observations is unknown.
The treatment of the convex constraint should be done carefully, and we proceed as follows.
For any classifier h and a given convex surrogate ϕ, define R̂^−_ϕ and R̂^+_ϕ to be the empirical counterparts of R^−_ϕ and R^+_ϕ respectively, by
R̂^−_ϕ(h) = (1/n^−) ∑_{i=1}^{n^−} ϕ(h(X_i^−))  and  R̂^+_ϕ(h) = (1/n^+) ∑_{i=1}^{n^+} ϕ(−h(X_i^+)).
Moreover, for any a > 0, let H^{ϕ,a} = { h ∈ H^conv : R^−_ϕ(h) ≤ a } be the set of classifiers in H^conv whose convexified type I errors are bounded from above by a, and let H^{ϕ,a}_{n^−} = { h ∈ H^conv : R̂^−_ϕ(h) ≤ a } be the set of classifiers in H^conv whose empirical convexified type I errors are bounded by a.
We are now in a position to construct a classifier in H^conv according to the Neyman-Pearson paradigm.
For any τ > 0 such that τ ≤ α√n^−, define the convexified NP classifier h̃^τ as any classifier that solves the following optimization problem:
min R̂^+_ϕ(h)  over  h ∈ H^conv  such that  R̂^−_ϕ(h) ≤ α − τ/√n^−. (4)
Note that this problem consists of minimizing a convex function subject to a convex constraint, and can therefore be solved by standard algorithms (see, e.g., …).
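A program of the form (4) can indeed be handed to an off-the-shelf solver. The sketch below minimizes the empirical exponential-surrogate type II error over the simplex, under the strengthened empirical type I constraint; the data, base classifiers, and the levels α and τ are hypothetical choices for illustration, not those of the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_minus = n_plus = 400
Xm = rng.normal(-1.0, 1.0, n_minus)      # sample from the Y = -1 class
Xp = rng.normal(+1.0, 1.0, n_plus)       # sample from the Y = +1 class

# Hypothetical base classifiers h_j(x) = sign(x - t_j).
thresholds = np.array([-0.5, 0.0, 0.5])
Hm = np.sign(Xm[:, None] - thresholds)   # h_j(X_i^-), shape (n^-, M)
Hp = np.sign(Xp[:, None] - thresholds)   # h_j(X_i^+), shape (n^+, M)

phi = np.exp                             # exponential surrogate, phi(0) = 1

alpha, tau = 0.9, 1.0                    # phi-scale level; tau <= alpha * sqrt(n^-)
level = alpha - tau / np.sqrt(n_minus)   # strengthened empirical constraint level

obj = lambda lam: phi(-Hp @ lam).mean()  # empirical phi-type II error
constraints = [
    {"type": "ineq", "fun": lambda lam: level - phi(Hm @ lam).mean()},
    {"type": "eq", "fun": lambda lam: lam.sum() - 1.0},   # lam on the simplex
]
res = minimize(obj, np.full(3, 1.0 / 3.0), bounds=[(0.0, 1.0)] * 3,
               constraints=constraints, method="SLSQP")
lam = res.x
```

The returned weight vector `lam` is feasible for the strengthened empirical constraint, which is exactly what Theorem 3 below requires of a candidate classifier.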
In the next section, we present a series of results on the type I and type II errors of classifiers that are more general than h̃^τ.
As the true type I and type II errors are usually the main concern in statistical learning, we also address the effect of convexification in terms of the excess type II error.
Interestingly, given that we want to be conservative on the type I error, neither working with the ϕ-errors nor working with the true errors leads to the most desirable type II error.
The price to pay for being conservative will be characterized explicitly.
Set κ = 4√2 L √(log(1/δ)) and ακ = α − κ/√n^−. Then for any (random) classifier h ∈ H^conv that satisfies R̂^−_ϕ(h) ≤ ακ, we have R^−(h) ≤ R^−_ϕ(h) ≤ α.
Simultaneous Control of the Two Errors
Theorem 3 guarantees that any classifier that satisfies the strengthened constraint on the empirical ϕ-type I error will have its ϕ-type I error and its true type I error bounded from above by α.
Indeed, an extremely small ακ would certainly ensure good control of the type I error, but would significantly deteriorate the best achievable ϕ-type II error.
This proposition ensures that if the convex surrogate ϕ is continuous, strengthening the constraint on the ϕ-type I error does not increase too much the best achievable ϕ-type II error.
This proposition has direct consequences for chance constrained programming, as discussed in Section 5.
Denote by h̃^κ_n the NP classifier obtained with this sampling scheme, and let the event F be defined by
F = { R^−_ϕ(h̃^κ_n) ≤ α } ∩ { R^+_ϕ(h̃^κ_n) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) ≤ 2κ/√n^+ + 4ϕ(1)κ/(α√n^−) }.
R^+_ϕ(h̃^κ_n) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) ≤ 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1−ε)α√(n(1−p))). (13)
Price to Pay for Being Conservative
We have shown that the computationally feasible classifier h̃^κ satisfies oracle inequalities which take the optimal ϕ-type II errors as the benchmark.
In this subsection, the excess type II error is measured, and we characterize the price to pay for being conservative on the type I error.
Denote by H_B^{ϕ,α} the set of convex combinations of the base classifiers (with values in [−B, B]) that have ϕ-type I error bounded from above by α.
Therefore, we have the following observation:
R^+(h̃^κ) − min_{R^−(h)≤α} R^+(h) ≤ T1 + T2 + T3,
where
T1 = R^+_ϕ(h̃^κ) − min_{h∈H_B^{ϕ,α}} R^+_ϕ(h),
T2 = min_{h∈H_B^{ϕ,α}} R^+_ϕ(h) − inf_{R^−(h)≤α, −B≤h≤B} R^+_ϕ(h),
T3 = inf_{R^−(h)≤α, −B≤h≤B} R^+_ϕ(h) − inf_{R^−(h)≤α, −B≤h≤B} R^+(h).
We can see that, with a fixed sample size, choosing a set of base classifiers with a smaller range results in a tighter bound on the excess ϕ-type II error.
However, if one is more concerned about the true type II error, choosing a smaller B is not necessarily a better option, because only the signs matter for the true type I and type II errors.
Note that H_B^{ϕ,α} ⊂ { h : R^−(h) ≤ α, −B ≤ h ≤ B }, so T2 reflects the price to pay for being conservative on the type I error.
It also reflects the bias incurred by choosing a specific candidate pool of classifiers, namely convex combinations of the base classifiers.
However, in our belief, the price to pay for being conservative is unavoidable.
Even if we do not resort to convexification, getting the best insurance on the type I error still demands a high premium on the type II error.
This is in line with Cannon et al. (2002), where it was claimed without justification that if we use α′ < α for the empirical program, "it seems unlikely that we can control the estimation error R^+(ĥ) − R^+(h*) in a distribution independent way".
In particular, the excess type II risk of h̃*(α − ε_{n^−}), ε_{n^−} > 0, does not converge to zero as the sample sizes increase, even if ε_{n^−} → 0.
In particular, with positive probability, the excess type II risk of ĥ(α − ε_{n^−}), ε_{n^−} > 0, does not converge to zero as the sample sizes increase, even if ε_{n^−} → 0.
In view of this negative result and our previous discussion, we have to accept the price to pay for being conservative on the type I error, and our classifier h̃^κ is no exception.
Chance Constrained Optimization
Implementing the Neyman-Pearson paradigm for convexified binary classification bears strong connections with chance constrained optimization.
A chance constrained optimization problem is of the following form:
min_{λ∈Λ} f(λ)  subject to  IP{ F(λ, ξ) ≤ 0 } ≥ 1 − α. (14)
Problem (14) may not be convex because the chance constraint { λ ∈ Λ : IP{F(λ, ξ) ≤ 0} ≥ 1 − α } is not convex in general, and thus may not be tractable.
Lagoa et al. (2005) derived sufficient conditions on the distribution of ξ for the chance constraint to be convex.
It amounts to solving the following convex optimization problem:
min_{λ∈Λ, t∈R^s} f(λ)  subject to  … (17)
Problem (17) provides a conservative convex approximation to (14), in the sense that every λ feasible for (17) is also feasible for (14).
Nemirovski and Shapiro (2006) considered a particular class of conservative convex approximations in which the key step is to replace IP{F(λ, ξ) ≥ 0} by IE ϕ(F(λ, ξ)) in (14), where ϕ is a nonnegative, nondecreasing, convex function that takes value 1 at 0.
The idea of a conservative convex approximation is also what we employ in this paper.
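The reason this substitution is conservative is that any such ϕ dominates the indicator 1{u ≥ 0}, so IP{F(λ, ξ) ≥ 0} ≤ IE ϕ(F(λ, ξ)). A Monte Carlo illustration with a hypothetical Gaussian constraint value:

```python
import numpy as np

rng = np.random.default_rng(3)
F = rng.normal(-1.0, 1.0, 100_000)   # hypothetical values of F(lam, xi) at a fixed lam

phi = np.exp                          # nonnegative, nondecreasing, convex, phi(0) = 1
violation_prob = (F >= 0).mean()      # IP{F >= 0}, roughly 0.16 here
surrogate_bound = phi(F).mean()       # IE e^F = e^{-1 + 1/2}, roughly 0.61

# Since e^u >= 1{u >= 0} for all u, the surrogate expectation upper-bounds
# the violation probability, so constraining it is conservative.
assert violation_prob <= surrogate_bound
```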
Replacing R^+(h_λ) by R^+_ϕ(h_λ) turns (18) into a standard chance constrained optimization problem:
min_{λ∈Λ} R^+_ϕ(h_λ)  subject to  R^−(h_λ) ≤ α. (19)
On the other hand, chance constrained optimization techniques in the previous literature assume knowledge of the distribution of the random vector ξ.
Given a finite sample, it is not feasible to construct a strictly conservative approximation to the constraint in (19).
In retrospect, our approach to (19) is an innovative hybrid between the analytical approach based on convex surrogates and the scenario approach.
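For contrast, the scenario approach replaces the chance constraint in (14) by its sampled counterparts F(λ, ξ_i) ≤ 0 for N i.i.d. draws of ξ, and solves the resulting deterministic program. A toy one-dimensional sketch with a hypothetical f and F, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
xi = rng.exponential(scale=1.0, size=500)   # N = 500 i.i.d. scenarios of xi

# Toy problem: maximize lam (i.e., f(lam) = -lam) over 0 <= lam <= 10,
# subject to F(lam, xi_i) = lam * xi_i - 1 <= 0 for every sampled scenario.
# All the scenario constraints reduce to lam <= 1 / max(xi), so the sampled
# program is solvable in closed form here.
lam_scenario = min(1.0 / xi.max(), 10.0)

# Every sampled constraint holds by construction.
assert np.all(lam_scenario * xi - 1.0 <= 1e-9)
```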
The following theorem summarizes our contribution to chance constrained optimization.
By a symmetrization and contraction argument (see, e.g., Koltchinskii, 2011, Chapter 2), we find
IE Φ( sup_λ |(P̄_n − P)(ϕ ∘ h_λ)| ) ≤ IE Φ( 2 sup_λ |R_n(ϕ ∘ h_λ)| ) ≤ IE Φ( 4L sup_λ |R_n(h_λ)| ).
Next, we have for any data-dependent classifier h ∈ H^conv such that R̂^−_ϕ(h) ≤ ακ:
R^−_ϕ(h) ≤ R̂^−_ϕ(h) + sup_{h∈H^conv} | R̂^−_ϕ(h) − R^−_ϕ(h) | ≤ α − κ/√n^− + sup_{h∈H^conv} | R̂^−_ϕ(h) − R^−_ϕ(h) |.
Lemma 10 implies that, with probability 1 − δ,
sup_{h∈H^conv} | R̂^−_ϕ(h) − R^−_ϕ(h) | = sup_λ |(P_{n^−} − P^−)(ϕ ∘ h_λ)| ≤ κ/√n^−.
Proof. First, it is clear that γ is a non-increasing function of α, because for α′ > α we have { h_λ ∈ H^conv : R^−_ϕ(h_λ) ≤ α } ⊂ { h_λ ∈ H^conv : R^−_ϕ(h_λ) ≤ α′ }.
Lemma 11, together with the assumption that H^{ϕ,α−ν0} ≠ ∅, implies that γ is a non-increasing convex real-valued function on [α − ν0, 1], so that
γ(α − ν) − γ(α) ≤ ν sup_{g∈∂γ(α−ν)} |g|,
where ∂γ(α − ν) denotes the sub-differential of γ at α − ν.
Moreover, since γ is a non-increasing convex function on [α − ν0, α − ν], it holds that
γ(α − ν0) − γ(α − ν) ≥ (ν0 − ν) sup_{g∈∂γ(α−ν)} |g|.
Proof of Theorem 5
Define the events E^− and E^+ by
E^− = { sup_{h∈H^conv} | R̂^−_ϕ(h) − R^−_ϕ(h) | ≤ κ/√n^− },  E^+ = { sup_{h∈H^conv} | R̂^+_ϕ(h) − R^+_ϕ(h) | ≤ κ/√n^+ }.
Lemma 10 implies IP(E^−) ∧ IP(E^+) ≥ 1 − δ.
The l.h.s. of (9) can be decomposed as
R^+_ϕ(h̃^κ) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) = A1 + A2 + A3,
where
A1 = R^+_ϕ(h̃^κ) − R̂^+_ϕ(h̃^κ) + min_{h∈H^{ϕ,ακ}_{n^−}} R̂^+_ϕ(h) − min_{h∈H^{ϕ,ακ}_{n^−}} R^+_ϕ(h),
A2 = min_{h∈H^{ϕ,ακ}_{n^−}} R^+_ϕ(h) − min_{h∈H^{ϕ,ακ}} R^+_ϕ(h),
A3 = min_{h∈H^{ϕ,ακ}} R^+_ϕ(h) − min_{h∈H^{ϕ,α}} R^+_ϕ(h),
and we used the fact that R̂^+_ϕ(h̃^κ) = min_{h∈H^{ϕ,ακ}_{n^−}} R̂^+_ϕ(h).
Therefore, on the event E^+ it holds A1 ≤ 2κ/√n^+.
The proof of (13) follows by observing that
{ R^+_ϕ(h̃^κ_n) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) > 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1−ε)α√(n(1−p))) } ⊂ (A1 ∩ A2^c) ∪ A2 ∪ A3,
where
A1 = { R^+_ϕ(h̃^κ_n) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) > … },  A2 = { N^− < n(1−p)/2 },  A3 = { N^+ < np/2 }.
Hence, we find
IP( R^+_ϕ(h̃^κ_n) − min_{h∈H^{ϕ,α}} R^+_ϕ(h) > 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1−ε)α√(n(1−p))) ) ≤ 2δ + e^{−n(1−p)²/2} + e^{−np²/2},
which completes the proof of the corollary.
Proof of Proposition 8
Let the base classifiers be defined as h_1(x) = −1 and h_2(x) = 1{x ≤ α} − 1{x > α} for all x ∈ [0, 1]. For any λ ∈ [0, 1], denote the convex combination of h_1 and h_2 by h_λ = λh_1 + (1 − λ)h_2, that is,
h_λ(x) = (1 − 2λ) 1{x ≤ α} − 1{x > α}.
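The closed form for h_λ can be verified numerically (a quick check with an arbitrary α, not part of the proof):

```python
import numpy as np

alpha = 0.3
x = np.linspace(0.0, 1.0, 101)
h1 = -np.ones_like(x)                  # h_1(x) = -1
h2 = np.where(x <= alpha, 1.0, -1.0)   # h_2(x) = 1{x <= alpha} - 1{x > alpha}

for lam in (0.0, 0.25, 0.7, 1.0):
    h_lam = lam * h1 + (1.0 - lam) * h2
    closed_form = (1.0 - 2.0 * lam) * (x <= alpha) - 1.0 * (x > alpha)
    assert np.allclose(h_lam, closed_form)
```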
The next lemma provides a lower bound on the probability that a binomial random variable exceeds its expectation.
It is not hard to show that
P_k(N ≥ k + 1) = IP( U_{(k+1)} ≤ k/n ) = n C(n−1, k) ∫_0^{k/n} t^k (1 − t)^{n−k−1} dt,
and, in the same manner,
P_{k−1}(N ≥ k) = IP( U_{(k)} ≤ (k−1)/n ) = n C(n−1, k−1) ∫_0^{(k−1)/n} t^{k−1} (1 − t)^{n−k} dt.
Note that
C(n−1, k−1) = [k/(n−k)] C(n−1, k),
so that (29) follows if we prove
k ∫_0^{(k−1)/n} t^{k−1} (1 − t)^{n−k} dt ≤ (n − k) ∫_0^{k/n} t^k (1 − t)^{n−k−1} dt.
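Both displays, and the inequality (29) they imply, can be checked numerically through the standard binomial/order-statistic duality IP(N ≥ k+1) = IP(U_{(k+1)} ≤ p) for N ∼ Bin(n, p); the particular n and k below are arbitrary:

```python
from scipy import stats

n, k = 30, 7

# P_k(N >= k+1) with N ~ Bin(n, k/n) equals the Beta(k+1, n-k) cdf at k/n,
# which is exactly IP(U_(k+1) <= k/n) for n i.i.d. uniforms.
lhs = stats.binom.sf(k, n, k / n)            # IP(N >= k + 1)
rhs = stats.beta.cdf(k / n, k + 1, n - k)    # IP(U_(k+1) <= k/n)
assert abs(lhs - rhs) < 1e-9

# The monotonicity (29): P_{k-1}(N >= k) <= P_k(N >= k+1).
prev = stats.binom.sf(k - 1, n, (k - 1) / n)
assert prev <= lhs
```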
wordName wordTfidf (topN-words)
[('ip', 0.441), ('conv', 0.336), ('np', 0.275), ('ie', 0.237), ('earson', 0.195), ('eyman', 0.195), ('igollet', 0.195), ('ers', 0.162), ('cannon', 0.152), ('convexi', 0.152), ('classi', 0.148), ('er', 0.135), ('lassification', 0.128), ('paradigm', 0.121), ('conservative', 0.118), ('excess', 0.117), ('minr', 0.114), ('hn', 0.106), ('convex', 0.106), ('hb', 0.103), ('chance', 0.102), ('dt', 0.102), ('ong', 0.099), ('type', 0.091), ('ii', 0.086), ('binomial', 0.086), ('scott', 0.069), ('princeton', 0.068), ('risk', 0.066), ('nq', 0.064), ('pn', 0.061), ('pq', 0.06), ('errors', 0.059), ('pay', 0.058), ('min', 0.055), ('sup', 0.053), ('nemirovski', 0.05), ('rigollet', 0.05), ('ir', 0.049), ('attitude', 0.049), ('cala', 0.049), ('conservativeness', 0.049), ('infr', 0.049), ('ird', 0.049), ('oracle', 0.046), ('shapiro', 0.046), ('surrogate', 0.045), ('event', 0.045), ('price', 0.045), ('hm', 0.043), ('strengthened', 0.042), ('irm', 0.042), ('cosh', 0.042), ('ore', 0.042), ('proposition', 0.041), ('error', 0.041), ('inf', 0.04), ('pk', 0.039), ('fix', 0.039), ('control', 0.038), ('feasible', 0.037), ('constraint', 0.037), ('philippe', 0.037), ('base', 0.036), ('constrained', 0.035), ('theorem', 0.035), ('treatment', 0.035), ('completes', 0.033), ('campi', 0.033), ('casasent', 0.033), ('howse', 0.033), ('infh', 0.033), ('lagoa', 0.033), ('novelty', 0.032), ('candidate', 0.032), ('proof', 0.032), ('empirical', 0.03), ('lemma', 0.029), ('consequences', 0.029), ('bernstein', 0.028), ('card', 0.028), ('surrogates', 0.028), ('xin', 0.028), ('probability', 0.027), ('concern', 0.026), ('rn', 0.026), ('binary', 0.026), ('nowak', 0.026), ('classical', 0.026), ('moreover', 0.025), ('lehmann', 0.025), ('bears', 0.025), ('hush', 0.025), ('anomaly', 0.025), ('valued', 0.025), ('solves', 0.024), ('copies', 0.024), ('unlikely', 0.024), ('testing', 0.023), ('initiated', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
2 0.10942667 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
3 0.069258414 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
4 0.062357657 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
5 0.057465132 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
6 0.055655804 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.054510839 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
8 0.051747199 105 jmlr-2011-lp-Norm Multiple Kernel Learning
9 0.047307111 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
10 0.046957009 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
11 0.041893087 58 jmlr-2011-Learning from Partial Labels
12 0.041763145 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
13 0.041526105 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
14 0.040824343 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
15 0.040776335 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
16 0.040599857 35 jmlr-2011-Forest Density Estimation
17 0.038636185 97 jmlr-2011-Union Support Recovery in Multi-task Learning
18 0.037812635 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
19 0.034639835 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
20 0.031588323 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
topicId topicWeight
[(0, 0.185), (1, -0.01), (2, 0.042), (3, 0.046), (4, 0.064), (5, -0.02), (6, -0.073), (7, -0.039), (8, -0.143), (9, -0.0), (10, -0.18), (11, -0.039), (12, -0.048), (13, -0.006), (14, -0.1), (15, 0.217), (16, -0.242), (17, -0.093), (18, -0.1), (19, -0.012), (20, 0.099), (21, -0.032), (22, 0.019), (23, -0.024), (24, 0.052), (25, -0.012), (26, -0.072), (27, 0.097), (28, 0.046), (29, 0.081), (30, 0.048), (31, -0.016), (32, -0.166), (33, 0.039), (34, 0.057), (35, -0.013), (36, 0.011), (37, -0.004), (38, 0.004), (39, 0.018), (40, 0.009), (41, 0.042), (42, -0.095), (43, 0.021), (44, 0.074), (45, -0.096), (46, -0.056), (47, 0.054), (48, -0.031), (49, -0.104)]
simIndex simValue paperId paperTitle
same-paper 1 0.95869267 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
2 0.72794467 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
3 0.57591969 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
Author: Huixin Wang, Xiaotong Shen, Wei Pan
Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning
4 0.53166753 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves and statistical information. We do this by systematically studying integral and variational representations of these objects, and in so doing identify their representation primitives, all of which are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.50924206 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
Author: Stefano Melacci, Mikhail Belkin
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop a better understanding of, and to improve the quality of, classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n^3) to O(kn^2), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real-world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization
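The abstract above trains LapSVM in the primal with preconditioned conjugate gradient. As a hedged illustration of that kind of inner solver, here is a generic Jacobi-preconditioned CG for a small symmetric positive definite system; the `pcg` name, the Jacobi preconditioner, and stopping via `max_iter` are our simplifications, not the authors' implementation:

```python
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=100):
    """Jacobi-preconditioned conjugate gradient for an SPD matrix A.
    Illustrative stand-in for the primal solver; capping `max_iter`
    plays the role of an early-stopping rule."""
    M_inv = 1.0 / np.diag(A)              # Jacobi (diagonal) preconditioner
    x = np.zeros_like(b)
    r = b - A @ x                         # initial residual
    z = M_inv * r                         # preconditioned residual
    p = z.copy()                          # initial search direction
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)        # step length along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:   # converged: residual small enough
            break
        z_new = M_inv * r_new
        beta = (r_new @ z_new) / (r @ z)  # Polak-Ribiere-style update
        p = z_new + beta * p
        r, z = r_new, z_new
    return x
```

Cutting the iteration count short trades accuracy for speed, which is exactly the knob the paper's prediction-based early stopping turns.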
6 0.47872576 58 jmlr-2011-Learning from Partial Labels
7 0.38639855 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
8 0.3769272 22 jmlr-2011-Differentially Private Empirical Risk Minimization
9 0.37487131 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
10 0.37485331 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
11 0.32901773 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.32105571 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
13 0.3106167 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
14 0.3098689 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
15 0.29948819 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
16 0.29622629 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
17 0.25902525 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
18 0.2532894 12 jmlr-2011-Bayesian Co-Training
19 0.25235516 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
20 0.25154456 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
topicId  topicWeight
4        0.067
6        0.017
9        0.059
10       0.029
24       0.041
31       0.126
32       0.035
41       0.034
60       0.010
64       0.372
65       0.012
73       0.021
78       0.084
90       0.012
simIndex simValue paperId paperTitle
same-paper 1 0.69556081 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its ϕ-type I error is below a pre-specified level, and (ii) its ϕ-type II error is close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
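The constrained program described in this abstract can be sketched, under heavy simplification, as follows: a hinge surrogate for ϕ, two base classifiers, and a grid search over their convex combinations. The `np_combine` name and all parameter choices here are our assumptions for illustration, not the paper's algorithm:

```python
def hinge(z):
    """Convex surrogate loss phi(z) = max(0, 1 + z)."""
    return max(0.0, 1.0 + z)

def phi_risk(scores, sign):
    """Empirical phi-risk: average hinge loss of sign * score."""
    return sum(hinge(sign * s) for s in scores) / len(scores)

def np_combine(f0, f1, x0, x1, alpha, grid=101):
    """Among convex combinations lam*f0 + (1-lam)*f1, return the (lam, risk)
    pair with the smallest empirical phi-type II error (on the class-1 sample
    x1) subject to the empirical phi-type I error (on the class-0 sample x0)
    being at most alpha. Returns None if no grid point is feasible."""
    best = None
    for i in range(grid):
        lam = i / (grid - 1)
        g = lambda x: lam * f0(x) + (1 - lam) * f1(x)
        s0 = [g(x) for x in x0]
        s1 = [g(x) for x in x1]
        type1 = phi_risk(s0, +1.0)   # penalize g > 0 on class-0 points
        type2 = phi_risk(s1, -1.0)   # penalize g < 0 on class-1 points
        if type1 <= alpha and (best is None or type2 < best[1]):
            best = (lam, type2)
    return best
```

The grid search stands in for the paper's convex program; the point it shares with the abstract is that the type I constraint is enforced on the surrogate risk itself, so any returned classifier respects it by construction.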
2 0.42290765 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
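A minimal numpy sketch of the two-step procedure this abstract describes, on 1-D data: estimate the density, keep the points above level t, then read the clusters off the graph Laplacian of a neighborhood graph. The Gaussian KDE, bandwidth, ε-radius, and eigenvalue threshold are our illustrative choices, not the authors' implementation:

```python
import numpy as np

def kde(points, x, h=0.5):
    """Gaussian kernel density estimate at x (1-D, bandwidth h assumed)."""
    return np.mean(np.exp(-0.5 * ((x - points) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

def level_set_clusters(points, t, h=0.5, eps=1.0):
    """Step 1: keep points whose estimated density exceeds t.
    Step 2: build an eps-neighborhood graph on the survivors and count
    clusters via the multiplicity of the zero eigenvalue of its Laplacian."""
    pts = np.asarray(points, dtype=float)
    keep = np.array([p for p in pts if kde(pts, p, h) > t])
    # Adjacency of the eps-neighborhood graph on the retained points.
    A = (np.abs(keep[:, None] - keep[None, :]) <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A          # unnormalized graph Laplacian
    evals = np.linalg.eigvalsh(L)
    # Each connected component contributes one zero eigenvalue, so the
    # number of (numerically) zero eigenvalues counts the level-t clusters.
    return int(np.sum(evals < 1e-8)), keep
```

Counting near-zero eigenvalues is the spectral reading of "connected components of the t-level set"; the paper's convergence result is about the empirical Laplacian operator behind exactly this construction.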
3 0.4208478 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1-regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.41322625 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
5 0.41313517 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/~gwtaylor/publications/jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
6 0.41173494 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
7 0.41154325 59 jmlr-2011-Learning with Structured Sparsity
8 0.41124564 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
9 0.40930012 12 jmlr-2011-Bayesian Co-Training
10 0.40713379 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
11 0.40549555 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
12 0.40549475 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
13 0.40546235 16 jmlr-2011-Clustering Algorithms for Chains
14 0.40489233 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
15 0.40412185 101 jmlr-2011-Variable Sparsity Kernel Learning
16 0.40181759 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
17 0.40155378 105 jmlr-2011-lp-Norm Multiple Kernel Learning
18 0.4012357 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
19 0.40052789 104 jmlr-2011-X-Armed Bandits
20 0.40008783 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation