jmlr jmlr2013 jmlr2013-6 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
Reference: text
sentIndex sentText sentNum sentScore
1 Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA. Editor: John Shawe-Taylor. Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. [sent-3, score-0.377]
2 It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. [sent-4, score-0.332]
3 NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. [sent-9, score-0.438]
4 As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. [sent-10, score-0.756]
5 Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. [sent-11, score-0.392]
6 Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection 1. [sent-12, score-0.459]
7 The risk function can be expressed as a convex combination of the type I and type II errors: R(h) = IP(Y = 0)R0(h) + IP(Y = 1)R1(h), (1) where R0(h) = IP(h(X) ≠ Y | Y = 0) denotes the type I error and R1(h) = IP(h(X) ≠ Y | Y = 1) denotes the type II error. [sent-28, score-0.375]
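To make the decomposition in (1) concrete, here is a small illustrative sketch (not from the paper) that computes the empirical type I error, type II error, and the weighted risk from labeled predictions; the arrays are made-up examples.

```python
import numpy as np

def empirical_errors(y_true, y_pred):
    """Empirical type I error R0 = P(h(X) != Y | Y = 0) and
    type II error R1 = P(h(X) != Y | Y = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    r0 = np.mean(y_pred[y_true == 0] != 0)
    r1 = np.mean(y_pred[y_true == 1] != 1)
    return r0, r1

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])
r0, r1 = empirical_errors(y_true, y_pred)
pi0, pi1 = np.mean(y_true == 0), np.mean(y_true == 1)
risk = pi0 * r0 + pi1 * r1      # convex combination as in (1)
print(r0, r1, risk)
```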
8 A certain classifier ĥ in the classical binary classification paradigm is good if the excess risk R(ĥ) − R∗ is small on average or with high probability. [sent-33, score-0.297]
9 The NP paradigm is irrelevant if we can achieve very small type I and type II errors simultaneously. [sent-37, score-0.377]
10 Clearly by (1), it is not feasible to have both type I error R0 and type II error R1 be smaller than 10%. [sent-41, score-0.3]
11 Moreover, even if a classifier φ̂ achieves a small risk, there is no guarantee that it attains desirable type I or type II errors. [sent-43, score-0.232]
12 Scott (2007) proposed performance measures for NP classification that weight the type I and type II errors in sensible ways. [sent-57, score-0.266]
13 There is a commonality in this line of literature: a relaxed empirical type I error constraint is used in the optimization program, and as a result, type I errors of the classifiers can only be shown to satisfy a relaxed upper bound. [sent-61, score-0.298]
14 They proposed the program min_{φ∈H, R̂0(φ) ≤ α + ε0/2} R̂1(φ), where R̂0 and R̂1 denote the empirical type I and type II errors, respectively. [sent-64, score-0.302]
15 It is shown that the solution φ̂ to the above program simultaneously satisfies, with high probability, that the type II error R1(φ̂) is bounded from above by R1(φ∗) + ε1 for some ε1 > 0, and the type I error R0(φ̂) is bounded from above by α + ε0. [sent-65, score-0.3]
16 establishing an explicit diminishing rate for the excess type II error R1(φ̂) − R1(φ∗). [sent-68, score-0.308]
17 A related framework that also addresses asymmetry in errors is cost-sensitive learning, which assigns different costs as weights of the type I and type II errors (see, e.g., [sent-84, score-0.32]
18 Two kinds of errors arise: type I error occurs when rejecting P− when it is true, and type II error occurs when not rejecting P− when it is false. [sent-101, score-0.332]
19 In other words, we specify a significance level α on type I error, and minimize type II error. [sent-103, score-0.298]
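In symbols, the NP paradigm and its oracle solution can be summarized as below; this display paraphrases the setup described in the abstract and the oracle classifier that appears later in the text, and is not a verbatim quotation of the paper.

```latex
\min_{h}\; R_1(h) \quad \text{subject to} \quad R_0(h) \le \alpha,
\qquad
\phi^*(x) \;=\; \mathbf{1}\!\left(\frac{p_1(x)}{p_0(x)} \ge C^*_\alpha\right),
\quad \text{with } C^*_\alpha \text{ chosen so that } R_0(\phi^*) = \alpha .
```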
20 It is worth noting that our plug-in approach to NP classification leads to problems related to density level set estimation (see Rigollet and Vert 2009 and references therein), where the task is to estimate {x : p(x) > λ}, for some level λ > 0. [sent-115, score-0.219]
21 Density level set estimation has applications in anomaly detection and unsupervised or semi-supervised classification. [sent-116, score-0.472]
22 Plug-in methods for density level set estimation, as opposed to direct methods, do not involve a complex optimization procedure, and only amount to thresholding the density estimate at a proper level. [sent-117, score-0.263]
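A minimal sketch of that plug-in recipe for density level set estimation, using a Gaussian kernel density estimate and an arbitrary illustrative level λ (the specific estimator, grid, and λ are assumptions for illustration, not the construction analyzed in the paper):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(size=(2, 500))      # 2-d data, shape (d, n) as gaussian_kde expects
p_hat = gaussian_kde(sample)            # plug-in density estimate

lam = 0.05                              # level lambda > 0
xx, yy = np.meshgrid(np.linspace(-3, 3, 80), np.linspace(-3, 3, 80))
grid = np.vstack([xx.ravel(), yy.ravel()])
level_set = p_hat(grid) > lam           # plug-in estimate of {x : p(x) > lambda}
print(level_set.mean())                 # fraction of grid points in the estimated set
```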
23 First, the threshold level in our current setup needs to be estimated, and secondly, we deal with density ratios rather than densities. [sent-119, score-0.237]
24 However, Audibert and Tsybakov (2007) combined a smoothness condition on the regression function with the margin assumption, and showed that plug-in classifiers 1I(η̂n ≥ 1/2) based on local polynomial estimators can achieve rates faster than O(1/n). [sent-124, score-0.236]
25 3 Application to Anomaly Detection NP classification is a useful framework to address anomaly detection problems. [sent-128, score-0.406]
26 In anomaly detection, the goal is to discover patterns that are different from usual outcomes or behaviors. [sent-129, score-0.264]
27 There are many approaches to anomaly detection; some serving a specific purpose while others are more generic. [sent-132, score-0.264]
28 A recent comprehensive review of anomaly detection literature is provided by Chandola et al. [sent-134, score-0.406]
29 When we have training data from the normal class, a common approach to anomaly detection is to estimate the normal class density p0 and try to threshold at a proper level, but this is inappropriate if the anomaly class is far from uniformly distributed. [sent-138, score-0.934]
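For concreteness, the common practice described above can be sketched as follows: fit a density estimate to normal training data and flag points whose estimated p0 falls below a threshold calibrated on held-out normal data. All distributions and tuning choices below are illustrative assumptions; this is the baseline being contrasted with the NP plug-in approach, not the paper's method.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
normal_train = rng.normal(size=(2, 1000))       # training data from the normal class only
p0_hat = gaussian_kde(normal_train)             # estimate of the normal class density p0

alpha = 0.05
normal_holdout = rng.normal(size=(2, 1000))
# Threshold so that roughly a fraction alpha of normal data would be flagged.
thresh = np.quantile(p0_hat(normal_holdout), alpha)

new_points = rng.normal(loc=2.0, size=(2, 10))  # points to screen
is_anomaly = p0_hat(new_points) < thresh        # flag low-density points as anomalies
print(is_anomaly)
```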
30 Our main results in NP classification will be adapted for anomaly detection applications, where the normal sample size n is much bigger than the anomaly sample size m. [sent-141, score-0.692]
31 The above condition for densities was first introduced in Polonik (1995), and its counterpart in classical binary classification was called the margin condition (Mammen and Tsybakov, 1999), from which we borrow the same terminology for discussion. [sent-188, score-0.215]
32 Recall that the set {x : η(x) = 1/2} is the decision boundary of the Bayes classifier in the classical paradigm, and the margin condition in the classical paradigm is a special case of Definition 2 by taking p = η and C∗ = 1/2. [sent-190, score-0.278]
33 Most importantly, we formulate some detection condition to detect the right threshold level in plug-in classifiers under the NP paradigm. [sent-195, score-0.29]
34 1 Class 0 Density p0 Known In this subsection, suppose that we know the class 0 density p0 , but have to estimate the class 1 density p1 . [sent-197, score-0.256]
35 In particular, with the threshold level Ĉα, we will establish oracle inequalities regarding the excess type I and type II errors. [sent-214, score-0.521]
36 Note that since Ĉα is constructed to meet the level α exactly, the excess type I error of φ̂ vanishes, that is, R0(φ̂) − R0(φ∗) = 0. [sent-215, score-0.349]
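A rough numerical sketch of this construction when p0 is known: estimate p1 by a kernel density estimate and set Ĉα so that the P0-probability of {p̂1/p0 ≥ Ĉα} is α. In the paper this calibration is exact; the Monte Carlo quantile below (with an illustrative standard normal playing the role of the known p0) is only a numerical stand-in.

```python
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

rng = np.random.default_rng(2)
d = 2
p0 = multivariate_normal(mean=np.zeros(d))          # known class 0 density (illustrative)
class1_sample = rng.normal(loc=1.5, size=(d, 400))  # sample from the unknown class 1 density
p1_hat = gaussian_kde(class1_sample)                # kernel estimate of p1

alpha = 0.05
x0 = p0.rvs(size=50_000, random_state=3).T          # Monte Carlo draws from P0, shape (d, N)
ratio_under_p0 = p1_hat(x0) / p0.pdf(x0.T)
C_alpha_hat = np.quantile(ratio_under_p0, 1 - alpha)  # approx. P0(p1_hat/p0 >= C) = alpha

def classify(x):
    """Plug-in NP classifier: predict class 1 iff p1_hat(x)/p0(x) >= C_alpha_hat."""
    return (p1_hat(x) / p0.pdf(x.T) >= C_alpha_hat).astype(int)

print(classify(rng.normal(size=(d, 5))))
```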
37 The following theorem addresses the excess type II error of φ̂: R1(φ̂) − R1(φ∗). [sent-222, score-0.307]
38 Note that the dependency of the upper bound for the excess type II error on parameters β, L, and L′ is incorporated into the constant C, whose explicit formula is given in Lemma 1, which has an important role in the proof. [sent-227, score-0.283]
39 Also, from the upper bound, we can see that the larger the parameter γ̄, the sharper the margin assumption, and the faster the rate of convergence for the excess type II error. [sent-240, score-0.335]
40 Proof. First note that the excess type II error can be represented by R1(φ̂) − R1(φ∗) = ∫_{G∗△Ĝ} |p1/p0 − C∗α| dP0, where G∗ = {p1/p0 < C∗α}, Ĝ = {p̂1/p0 < Ĉα}, and G∗△Ĝ = (G∗ ∩ Ĝc) ∪ (G∗c ∩ Ĝ) is the symmetric difference between G∗ and Ĝ. [sent-242, score-0.283]
41 Define an event regarding the sample S1: E = {‖p̂1 − p1‖∞ < δ1 µmin}, where δ1 = (2C/µ²min) √(log(m/δ)/(m h^d)), and C is the same as in Lemma 1 (with p replaced by p1). [sent-244, score-0.286]
42 2 Class 0 Density p0 Unknown Assume that both the class 0 density p0 and the class 1 density p1 are unknown. [sent-262, score-0.256]
43 Because data from class 0 is needed to estimate both the class 0 density and the threshold level, we split the class 0 data into two pieces. [sent-264, score-0.261]
44 First estimate p0 and p1 respectively from S0 and S1 by the kernel estimators p̂0(x) = (1/(n h_n^d)) ∑_{i=1}^n K((X_i^− − x)/h_n) and p̂1(x) = (1/(m h_m^d)) ∑_{i=1}^m K((X_i^+ − x)/h_m), where h_n and h_m denote the bandwidths. [sent-277, score-0.723]
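Written out as code, these two kernel density estimators look as follows; the Gaussian kernel K and the bandwidth values are illustrative assumptions, since the paper's conditions on K, h_n, and h_m are specified elsewhere in the text.

```python
import numpy as np

def kde(x, data, h):
    """(1 / (N h^d)) * sum_i K((X_i - x) / h) with a d-dimensional Gaussian kernel K.
    data: array of shape (N, d); x: array of shape (d,)."""
    N, d = data.shape
    u = (data - x) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (N * h**d)

rng = np.random.default_rng(4)
S0 = rng.normal(size=(500, 2))               # class 0 sample (the X_i^-)
S1 = rng.normal(loc=1.0, size=(200, 2))      # class 1 sample (the X_i^+)
h_n, h_m = 0.4, 0.5                          # bandwidths h_n and h_m (illustrative values)

x = np.array([0.5, 0.5])
p0_hat, p1_hat = kde(x, S0, h_n), kde(x, S1, h_m)
print(p0_hat, p1_hat, p1_hat / p0_hat)       # the ratio is what gets thresholded
```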
45 We know that having fast diminishing excess type II error demands a low noise condition, such as the margin assumption. [sent-281, score-0.394]
46 Approximating the optimal threshold level is not a problem in the classical setting, because in that setting the Bayes classifier is 1I(η(x) ≥ 1/2), and the threshold level 1/2 on the regression function η is known. [sent-283, score-0.282]
47 The following level α detection condition addresses this concern. [sent-285, score-0.263]
48 The next theorem addresses the excess type II error of φ̂. [sent-305, score-0.307]
49 Also assume that the likelihood ratio p1/p0 satisfies the level α detection condition for some γ ≥ γ̄. [sent-310, score-0.26]
50 In particular, there exists some positive C̄, such that for all n, m ≥ 1/δ, R1(φ̂) − R1(φ∗) ≤ C̄ [ (log n/n)^min{1/2, (1+γ̄)/(2γ), β(1+γ̄)/(2β+d)} + (log m/m)^(β(1+γ̄)/(2β+d)) ]. Proof. Denote G∗ = {p1/p0 < C∗α} and Ĝ = {p̂1/p̂0 < Ĉα}. [sent-312, score-0.224]
51 Therefore the excess type II error can be decomposed into two parts: P1(Ĝ) − P1(G∗) = ∫_{G∗△Ĝ} |p1/p0 − C∗α| dP0 + C∗α [P0(G∗c) − P0(Ĝc)]. [sent-315, score-0.283]
52 From the above decomposition, we see that to control the excess type II error, the type I error of φ̂ should be not only smaller than the level α, but also not far from α. [sent-317, score-0.487]
53 This is intuitively correct, because having a small type I error amounts to having a very tight constraint set, which leads to significant deterioration in achievable type II error. [sent-318, score-0.266]
54 By the level α detection condition, it is enough to take cn = (2dn/C1)^(1/γ); then by inequality (8), P0( p1(X)/p0(X) ≥ C∗α + (2dn/C1)^(1/γ) ) ≤ α − 2dn ≤ P0( · ). [sent-343, score-0.267]
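To spell out this step (under the assumption, made here only for illustration, that the level α detection condition takes the form P0(C∗α < p1(X)/p0(X) ≤ C∗α + δ) ≥ C1 δ^γ for small δ > 0):

```latex
c_n = \left(\frac{2 d_n}{C_1}\right)^{1/\gamma}
\;\Longrightarrow\;
P_0\!\left( C^*_\alpha < \frac{p_1(X)}{p_0(X)} \le C^*_\alpha + c_n \right)
\;\ge\; C_1\, c_n^{\gamma} \;=\; 2 d_n ,
```

so the band of width cn just above the optimal threshold carries at least 2dn of P0-mass, which is roughly what the argument needs in order to locate Ĉα within cn of C∗α.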
55 As in the p0 known setup, the coefficient γ̄ from the margin assumption influences the convergence rate of the excess type II error. [sent-356, score-0.335]
56 The larger the γ, the easier the classification problem, and hence the faster the convergence of the excess type II error. [sent-357, score-0.249]
57 Take it to the extreme γ → ∞ (keeping γ̄ fixed), which holds when the amount of data around the optimal threshold level goes to zero; then (log n/n)^((1+γ̄)/(2γ)) → (log n/n)^0 = 1. [sent-360, score-0.303]
58 In anomaly detection applications, let class 0 represent the normal class, and class 1 represent the anomaly class. [sent-362, score-0.774]
59 Intuitively, this says that if the normal class sample size n is large enough compared to the anomaly class sample size m, lack of precise knowledge of the normal class density p0 does not change the type II error rate bound, up to a multiplicative constant. [sent-370, score-0.761]
60 Recall that η̂ = ∑_{i=1}^m Yi K((Xi − x)/h) / ∑_{i=1}^m K((Xi − x)/h) can be written as f̂/p̂, where p̂(x) = (1/(m h^d)) ∑_{i=1}^m K((Xi − x)/h) and f̂(x) = (1/(m h^d)) ∑_{i=1}^m Yi K((Xi − x)/h), in which h is the bandwidth. [sent-391, score-0.322]
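A direct sketch of this regression-function estimate (a Gaussian kernel is assumed for illustration); note that the common factor 1/(m h^d) cancels in the ratio f̂/p̂.

```python
import numpy as np

def eta_hat(x, X, Y, h):
    """Nadaraya-Watson estimate sum_i Y_i K((X_i - x)/h) / sum_i K((X_i - x)/h),
    i.e. f_hat(x) / p_hat(x). X: (m, d) sample, Y: (m,) labels in {0, 1}."""
    u = (X - x) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1))   # unnormalized Gaussian kernel weights
    return np.dot(Y, K) / K.sum()

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 2))                 # contaminated (mixture) sample
Y = (rng.random(400) < 0.3).astype(float)     # class labels
print(eta_hat(np.zeros(2), X, Y, h=0.5))
```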
61 φ̃(x) = 1I(η̂(x) ≥ D̂α). (10) Since D̂α is constructed to meet the level α exactly, the excess type I error of φ̃ vanishes, that is, R0(φ̃) − R0(φ∗) = 0. [sent-396, score-0.349]
62 Then there exists a positive D, such that for any δ ∈ (0, 1) and any sample size m satisfying log(m/δ)/(m h^d) < 1, it holds with probability 1 − δ that R1(φ̃) − R1(φ∗) ≤ D (log(3m/δ)/(m h^d))^((1+γ̄)/2), with the bandwidth h = (log m/m)^(1/(2β+d)). [sent-401, score-0.322]
63 Proof. First note that the excess type II error satisfies R1(φ̃) − R1(φ∗) = P1(Ĝ) − P1(G∗) = ∫_{G∗△Ĝ} |p1/p0 − C∗α| dP0, where G∗ = {η < D∗α}, Ĝ = {η̂ < D̂α}, and G∗△Ĝ = (G∗ ∩ Ĝc) ∪ (G∗c ∩ Ĝ). [sent-403, score-0.283]
64 Unlike the previous setup where p0 is known, we now bound the excess type I error. [sent-436, score-0.282]
65 A similar level α detection condition can be formulated for the regression function, but we omit it, as C∗α is simply replaced by D∗α. [sent-439, score-0.263]
66 The next theorem addresses the excess type II error of φ̃: R1(φ̃) − R1(φ∗). [sent-440, score-0.283]
67 Assume Condition 3 holds and that the regression function η satisfies the level α detection condition for some γ (≥ γ̄). [sent-442, score-0.294]
68 Then there exists a positive constant C̄, such that for any δ ∈ (0, 1) and any m, n ≥ 1/δ, it holds with probability 1 − 2δ that R1(φ̃) − R1(φ∗) ≤ C̄ [ (log n/n)^min{1/2, (1+γ̄)/(2γ)} + (log m/m)^(β(1+γ̄)/(2β+d)) ]. [sent-444, score-0.224]
69 Proof. Let G∗ = {η < D∗α} and Ĝ = {η̂ < D̂α}; then the excess type II error of φ̃ can be decomposed as P1(Ĝ) − P1(G∗) = ∫_{G∗△Ĝ} |p1/p0 − C∗α| dP0 + C∗α [P0(G∗c) − P0(Ĝc)]. [sent-445, score-0.283]
70 By the level α detection condition, it is enough to take cn = (2dn/C1)^(1/γ) and then to bound P0( η(X) ≥ D∗α + (2dn/C1)^(1/γ) ). [sent-469, score-0.241]
71 In anomaly detection applications, normal samples are considered abundant, that is, n ≫ m, which implies that (log n/n)^(1/(2β+d)) ≤ (log m/m)^(1/(2β+d)). [sent-481, score-0.614]
72 Then the upper bounds for the excess type II errors in Theorem 2 and Theorem 3 are of the same order. [sent-482, score-0.281]
73 Having access to the mixture (contaminated) sample S̄ looks like a weaker condition than having access to a pure anomaly sample. [sent-483, score-0.295]
74 The essence is revealed by observing that the density ratio p1/p0 and the regression function η play the same role in the oracle NP classifier at level α: φ∗(x) = 1I(p1(x)/p0(x) ≥ C∗α) = 1I(η(x) ≥ D∗α). [sent-485, score-0.251]
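One way to see the equivalence (a hedged derivation, assuming π0 = IP(Y = 0) and π1 = IP(Y = 1) denote the class proportions in the mixture; the explicit formula for D∗α below is derived here rather than quoted from the paper):

```latex
\eta(x) = \mathbb{P}(Y=1 \mid X=x)
        = \frac{\pi_1 p_1(x)}{\pi_0 p_0(x) + \pi_1 p_1(x)}
        = \frac{\pi_1 r(x)}{\pi_0 + \pi_1 r(x)},
\qquad r(x) = \frac{p_1(x)}{p_0(x)},
\qquad\text{so}\qquad
\Bigl\{ r(x) \ge C^*_\alpha \Bigr\} = \Bigl\{ \eta(x) \ge D^*_\alpha \Bigr\},
\quad D^*_\alpha = \frac{\pi_1 C^*_\alpha}{\pi_0 + \pi_1 C^*_\alpha},
```

since t ↦ π1 t / (π0 + π1 t) is strictly increasing, thresholding the ratio and thresholding the regression function give the same classifier.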
75 Being able to estimate the anomaly density p1 is not of particular advantage, because only the ratio p1 /p0 matters. [sent-487, score-0.372]
76 , Xn }, where h nhd ∑i=1 log(n/ε) nhd is the bandwidth. [sent-500, score-0.972]
77 For any ε ∈ (0, 1), if the sample size n is such that IP( p − p ˆ ∞ < 1, it holds ≥ δ) ≤ ε , where δ = (32c2 d + 48dc1 ) √ ′ log(n/ε) dL 1 1 β/2 −2β β ˜ + 2Lc3 h + d + (L + C ∑ )d n , nhd nh nh s! [sent-501, score-0.75]
78 Let h = log n n 1 2β+d ∞ + |K| t , it is enough to take δ = C C = 32c2 d + β dt, log(n/ε) , nhd c3 = |K| t β dt, ˜ and C is such that where √ ˜ 48dc1 + 2Lc3 + dL′ + L + C 1 . [sent-503, score-0.579]
79 Note that for any δ > 0, IP(‖p̂ − p‖∞ ≥ δ) ≤ IP(M1 + M2 + M3 ≥ δ), (14) where M1 = sup_{‖x−x′‖ ≤ √d/M} (1/(n h^d)) |∑_{i=1}^n [K((Xi − x)/h) − K((Xi − x′)/h)]|, M2 = sup_{‖x−x′‖ ≤ √d/M} |p(x) − p(x′)|, and M3 = sup_{x∈G} |(1/(n h^d)) ∑_{i=1}^n K((Xi − x)/h) − p(x)|. [sent-508, score-1.071]
80 Note that because K is L′-Lipschitz, M1 ≤ sup_{‖x−x′‖ ≤ √d/M} (1/(n h^d)) ∑_{i=1}^n |K((Xi − x)/h) − K((Xi − x′)/h)| ≤ (1/(n h^d)) · n · L′ · √d/(M h) = √d L′/(M h^(d+1)). [sent-509, score-0.519]
81 To control M2, note that if β ≤ 1, |p(x) − p(x′)| = |p(x) − p_{x′}(x)| ≤ L ‖x − x′‖^β. [sent-510, score-0.418]
82 √ ′ dL 1 1 ˜ Define by t = δ − nhd nh − L + C ∑1≤|s|≤⌊β⌋ s! [sent-519, score-0.618]
83 Use a union bound to control the tail probability of M3: IP(M3 ≥ t) ≤ ∑_{x∈G} IP( |(1/(n h^d)) ∑_{i=1}^n K((Xi − x)/h) − p(x)| ≥ t ). [sent-522, score-0.529]
84 Because log 2 + d log(2M + 1) ≤ 6d log n, it is sufficient to have 6d log n − n h^d t² ≤ log ε. [sent-528, score-0.858]
85 Then we can take √ ′ log n dL 1 1 β ε ˜ + 2Lc3 h + d + L +C ∑ d nh nh nh s! [sent-532, score-0.489]
86 With this choice of h, √(log n/(n h^d)) = h^β = (log n/n)^(β/(2β+d)), which dominates d^(β/2) n^(−2β). [sent-533, score-0.672]
87 log n ε + 2Lc3 nhd log n √ ′ + dL nhd 1 log n ˜ + L +C ∑ nhd s! [sent-536, score-1.737]
88 1≤|s|≤⌊β⌋ log n ε , nhd √ √ 1 ˜ where C = 32c2 d + 48dc1 + 2Lc3 + dL′ + L + C ∑1≤|s|≤⌊β⌋ s! [sent-537, score-0.579]
89 Moreover, assume log(n/ε) nhd p ≥ µ′ (> 0) and the sample size n is such that min ˆ IP( η − η ∞ < 1. [sent-547, score-0.524]
90 Then for any ε > 0, ≥ δ) ≤ 3ε , for δ = 1 µ′ − δ′ min + 1 µ′ − δ′ min δ′ + (32d K ∞+ 12d K 2 p ∞) √ ′ 1 1 dL ˜ + L + C1 ∑ nhd nh s! [sent-548, score-0.694]
91 1≤|s|≤⌊β⌋ log(n/ε) + (c4 + c5 )hβ nhd d β/2 n2β , p where δ′ is the same as δ in Lemma 1, c4 = µ′ ∞ L 1 + µf′ ∞ |K(z)| · z β dz and c5 = L |K(z)| · min min ˜ ˜ z β dz, and C1 is such that C1 ≥ sup1≤|s|≤⌊β⌋ supx∈[−1,1]d |Ds p(x)|. [sent-549, score-0.597]
92 Proof. Recall that p̂(x) = (1/(n h^d)) ∑_{i=1}^n K((Xi − x)/h) and f̂(x) = (1/(n h^d)) ∑_{i=1}^n Yi K((Xi − x)/h), so |η̂ − η| = |f̂/p̂ − f/p| ≤ |f̂/p̂ − f/p̂| + |f/p̂ − f/p| = (1/p̂)|f̂ − f| + (|f|/(p̂ p))|p − p̂|. [sent-551, score-1.458]
93 x∈G The quantity M1 can be controlled as follows: M1 = sup √ x−x′ ≤ ≤ ≤ d M 1 nhd n ∑ Yi K( i=1 Xi − x 1 )− d h nh n ∑ Yi K( i=1 Xi − x′ ) h Xi − x Xi − x′ ) − K( ) ∑ h h ′ ≤ d i=1 x−x M √ ′ √ ′ dL 1 1 n dL = d . [sent-568, score-0.651]
94 d Mh nh nh nh sup √ 1 nhd n K( The quantity M2 can be controlled similarly as its counterpart in proof for Lemma 1, M2 ≤ ˜ L + C1 1 s! [sent-569, score-0.915]
95 T ONG 1 Let t = δ3 − nhd √ dL′ nh β 1 ˜ − L + C1 ∑1≤|s|≤⌊β⌋ s! [sent-571, score-0.618]
96 Note that IE[ (1/(n h^d)) ∑_{i=1}^n K((Xi − x)/h) η(Xi) ] − IE[ p̂(x) η(x) ] = (1/h^d) ∫ K((y − x)/h)(η(y) − η(x)) p(y) dy = ∫ K(z)[η(x + hz) − η(x)] p(x + hz) dz ≤ ‖p‖∞ ∫ |K(z)| · |η(x + hz) − η(x)| dz. [sent-583, score-0.972]
97 The above inequality together with (16) implies that | IE[ (1/(n h^d)) ∑_{i=1}^n K((Xi − x)/h) η(Xi) ] − IE[ p̂(x) · η(x) ] | ≤ c4 h^β, where c4 = (‖p‖∞ L/µ′min)(1 + ‖f‖∞/µ′min) ∫ |K(z)| · ‖z‖^β dz. [sent-587, score-0.623]
98 d β/2 n−2β , we have IP( fˆ − f Take δ2 = δ3 , µ′ −δ′ min ∞ ˜ ≥ δ3 ) ≤ IP(M1 ≥ t ) ≤ we have IP( fˆ − f ∞ log(n/ε) nhd ∑ IP x∈G 1 ˜ + (c4 + c5 )hβ , and δ3 = t + nhd √ ′ dL nh + ˜ |B1 (x)| ≥ t − (c4 + c5 )hβ ≤ ε . [sent-593, score-1.142]
99 An overview of anomaly detection techniques: Existing solutions and latest technological trends. [sent-691, score-0.406]
100 Measuring mass concentrations and estimating density contour clusters-an excess mass approach. [sent-695, score-0.22]
wordName wordTfidf (topN-words)
[('nhd', 0.486), ('ip', 0.439), ('anomaly', 0.264), ('ie', 0.219), ('earson', 0.173), ('eyman', 0.173), ('gc', 0.172), ('mhd', 0.161), ('np', 0.15), ('detection', 0.142), ('excess', 0.133), ('nh', 0.132), ('type', 0.116), ('paradigm', 0.113), ('dl', 0.107), ('hz', 0.105), ('lassification', 0.103), ('classi', 0.101), ('rigollet', 0.097), ('ong', 0.096), ('log', 0.093), ('density', 0.087), ('margin', 0.086), ('ii', 0.078), ('vert', 0.074), ('er', 0.07), ('level', 0.066), ('tsybakov', 0.063), ('dn', 0.056), ('ers', 0.054), ('xih', 0.053), ('oracle', 0.053), ('threshold', 0.051), ('event', 0.05), ('adn', 0.049), ('cannon', 0.049), ('malignant', 0.049), ('markou', 0.049), ('lemma', 0.049), ('el', 0.046), ('tumor', 0.044), ('densities', 0.043), ('class', 0.041), ('min', 0.038), ('regarding', 0.037), ('xi', 0.036), ('novelty', 0.035), ('dz', 0.035), ('ds', 0.034), ('kernel', 0.034), ('error', 0.034), ('setup', 0.033), ('cn', 0.033), ('sup', 0.033), ('legendre', 0.032), ('mammen', 0.032), ('estimators', 0.032), ('errors', 0.032), ('smoothness', 0.032), ('condition', 0.031), ('erm', 0.031), ('scott', 0.031), ('vc', 0.029), ('fc', 0.029), ('risk', 0.027), ('diagnosis', 0.027), ('inequality', 0.026), ('medical', 0.026), ('sd', 0.026), ('decompose', 0.025), ('agyemang', 0.025), ('casasent', 0.025), ('chandola', 0.025), ('hodge', 0.025), ('liml', 0.025), ('patcha', 0.025), ('tarigan', 0.025), ('wied', 0.025), ('xin', 0.025), ('diminishing', 0.025), ('gx', 0.025), ('addresses', 0.024), ('classical', 0.024), ('audibert', 0.024), ('readers', 0.024), ('regression', 0.024), ('estimator', 0.023), ('opposed', 0.023), ('zi', 0.023), ('control', 0.022), ('normal', 0.022), ('blanchard', 0.021), ('ird', 0.021), ('lehmann', 0.021), ('marshall', 0.021), ('stylized', 0.021), ('zadrozny', 0.021), ('hm', 0.021), ('ratio', 0.021), ('tail', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto
Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2 , Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaus∗. The authors would like to thank Bradley M. Bell for insightful discussions and helpful suggestions. The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013
3 0.058467742 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
4 0.057109442 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
5 0.056752097 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
6 0.053278219 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
7 0.048651982 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
8 0.048072398 114 jmlr-2013-The Rate of Convergence of AdaBoost
9 0.047334012 73 jmlr-2013-Multicategory Large-Margin Unified Machines
10 0.047048487 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
11 0.043320131 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
12 0.042302206 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
13 0.041126493 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
14 0.040137663 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
15 0.039218798 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
16 0.038205814 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
17 0.037168793 96 jmlr-2013-Regularization-Free Principal Curve Estimation
18 0.036409486 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
19 0.036279883 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
20 0.036128242 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
topicId topicWeight
[(0, -0.191), (1, 0.075), (2, 0.04), (3, 0.064), (4, 0.008), (5, -0.086), (6, 0.065), (7, 0.021), (8, 0.01), (9, -0.085), (10, -0.016), (11, -0.028), (12, 0.025), (13, -0.051), (14, -0.114), (15, -0.041), (16, -0.054), (17, -0.091), (18, -0.023), (19, -0.121), (20, -0.23), (21, -0.133), (22, 0.044), (23, -0.036), (24, -0.13), (25, -0.109), (26, 0.193), (27, 0.227), (28, -0.094), (29, -0.187), (30, 0.114), (31, -0.203), (32, -0.07), (33, 0.131), (34, 0.029), (35, -0.119), (36, -0.003), (37, 0.119), (38, -0.158), (39, 0.014), (40, -0.017), (41, -0.051), (42, -0.003), (43, 0.033), (44, 0.004), (45, -0.127), (46, -0.005), (47, -0.062), (48, -0.062), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.92352813 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto
Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2 , Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaus∗. The authors would like to thank Bradley M. Bell for insightful discussions and helpful suggestions. The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013
3 0.4782443 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
4 0.38522923 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
5 0.35330057 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.32523274 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
9 0.28156012 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
10 0.28129089 8 jmlr-2013-A Theory of Multiclass Boosting
11 0.2741085 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
12 0.26781553 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
13 0.25948176 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
14 0.25772214 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.25403899 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
16 0.2494853 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
17 0.24385777 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
18 0.24076703 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
19 0.23502769 33 jmlr-2013-Dimension Independent Similarity Computation
20 0.23272902 22 jmlr-2013-Classifying With Confidence From Incomplete Information
topicId topicWeight
[(0, 0.017), (5, 0.128), (6, 0.021), (10, 0.063), (20, 0.01), (23, 0.03), (53, 0.018), (68, 0.017), (70, 0.029), (75, 0.022), (85, 0.537), (87, 0.013)]
simIndex simValue paperId paperTitle
1 0.90531802 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
same-paper 2 0.90339386 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
3 0.55961484 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
4 0.53641605 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.51157302 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.49764732 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
7 0.47085544 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
8 0.46010843 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
9 0.4549354 21 jmlr-2013-Classifier Selection using the Predicate Depth
10 0.44993541 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
11 0.44723028 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
12 0.44491816 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
13 0.43473032 32 jmlr-2013-Differential Privacy for Functions and Functional Data
14 0.4323146 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
15 0.43071023 114 jmlr-2013-The Rate of Convergence of AdaBoost
16 0.42674115 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
17 0.42085508 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
18 0.42080656 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
19 0.41938269 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.41643685 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization