jmlr jmlr2013 jmlr2013-6 knowledge-graph by maker-knowledge-mining

6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification


Source: pdf

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Marshall Business School, University of Southern California, Los Angeles, CA 90089, USA. Editor: John Shawe-Taylor. Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. [sent-3, score-0.377]

2 It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. [sent-4, score-0.332]

3 NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. [sent-9, score-0.438]

4 As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. [sent-10, score-0.756]

5 Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. [sent-11, score-0.392]

6 Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection 1. [sent-12, score-0.459]

7 The risk function can be expressed as a convex combination of type I and type II errors: $R(h) = \mathbb{P}(Y=0)\,R_0(h) + \mathbb{P}(Y=1)\,R_1(h)$, (1) where $R_0(h) = \mathbb{P}(h(X) \neq Y \mid Y=0)$ denotes the type I error and $R_1(h) = \mathbb{P}(h(X) \neq Y \mid Y=1)$ denotes the type II error. [sent-28, score-0.375]
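To make the decomposition in (1) concrete, here is a minimal numpy sketch (not from the paper; the classifier, threshold, and data below are illustrative placeholders) that computes empirical type I and type II errors and the weighted risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: class 0 ~ N(0, 1), class 1 ~ N(2, 1), one-dimensional features.
n0, n1 = 5000, 5000
x = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(2.0, 1.0, n1)])
y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])

def classify(x, threshold=1.0):
    """A placeholder classifier h(x) = 1{x >= threshold}."""
    return (x >= threshold).astype(int)

h = classify(x)
R0 = np.mean(h[y == 0] != 0)      # empirical type I error:  P(h(X) != Y | Y = 0)
R1 = np.mean(h[y == 1] != 1)      # empirical type II error: P(h(X) != Y | Y = 1)
pi0, pi1 = np.mean(y == 0), np.mean(y == 1)
risk = pi0 * R0 + pi1 * R1        # empirical analogue of the convex combination in (1)

print(f"type I error R0 = {R0:.3f}, type II error R1 = {R1:.3f}, risk = {risk:.3f}")
```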

8 A certain classifier $\hat{h}$ in the classical binary classification paradigm is good if the excess risk $R(\hat{h}) - R^*$ is small on average or with high probability. [sent-33, score-0.297]

9 The NP paradigm is irrelevant if we can achieve very small type I and type II errors simultaneously. [sent-37, score-0.377]

10 Clearly by (1), it is not feasible to have both type I error R0 and type II error R1 be smaller than 10%. [sent-41, score-0.3]

11 Moreover, even if a classifier $\hat{\varphi}$ achieves a small risk, there is no guarantee on attaining desirable type I or type II errors. [sent-43, score-0.232]

12 Scott (2007) proposed performance measures for NP classification that weight type I and type II errors in sensible ways. [sent-57, score-0.266]

13 There is a commonality in this line of literature: a relaxed empirical type I error constraint is used in the optimization program, and as a result, type I errors of the classifiers can only be shown to satisfy a relaxed upper bound. [sent-61, score-0.298]

14 They proposed the program $\min_{\varphi \in \mathcal{H},\, \hat{R}_0(\varphi) \le \alpha + \varepsilon_0/2} \hat{R}_1(\varphi)$, where $\hat{R}_0$ and $\hat{R}_1$ denote the empirical type I and type II errors respectively. [sent-64, score-0.302]

15 It is shown that the solution $\hat{\varphi}$ to the above program satisfies simultaneously, with high probability, that the type II error $R_1(\hat{\varphi})$ is bounded from above by $R_1(\varphi^*) + \varepsilon_1$ for some $\varepsilon_1 > 0$, and the type I error $R_0(\hat{\varphi})$ is bounded from above by $\alpha + \varepsilon_0$. [sent-65, score-0.3]
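The relaxed empirical program can be illustrated on a finite candidate family; the sketch below is only a toy instance (the family of threshold rules, the data, and the choice of ε0 are hypothetical, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, eps0 = 0.05, 0.02

# Hypothetical training samples from class 0 and class 1.
x0 = rng.normal(0.0, 1.0, 2000)
x1 = rng.normal(2.0, 1.0, 2000)

# Finite candidate family H: threshold classifiers h_t(x) = 1{x >= t}.
thresholds = np.linspace(-2.0, 5.0, 400)

best_t, best_R1 = None, np.inf
for t in thresholds:
    R0_hat = np.mean(x0 >= t)    # empirical type I error of h_t
    R1_hat = np.mean(x1 < t)     # empirical type II error of h_t
    # Relaxed constraint on the empirical type I error: R0_hat <= alpha + eps0 / 2.
    if R0_hat <= alpha + eps0 / 2 and R1_hat < best_R1:
        best_t, best_R1 = t, R1_hat

print(f"selected threshold t = {best_t:.3f}, empirical type II error = {best_R1:.3f}")
```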

16 establishing an explicit diminishing rate for the excess type II error $R_1(\hat{\varphi}) - R_1(\varphi^*)$. [sent-68, score-0.308]

17 A related framework that also addresses asymmetry in errors is cost-sensitive learning, which assigns different costs as weights of type I and type II errors (see, e. [sent-84, score-0.32]

18 Two kinds of errors arise: type I error occurs when rejecting P− when it is true, and type II error occurs when not rejecting P− when it is false. [sent-101, score-0.332]

19 In other words, we specify a significance level α on type I error, and minimize type II error. [sent-103, score-0.298]
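Written out (this is the standard formulation consistent with the abstract, restated here rather than quoted from the paper), the NP problem and the Neyman-Pearson oracle are

$$\varphi^* \in \arg\min_{\varphi:\, R_0(\varphi) \le \alpha} R_1(\varphi), \qquad \varphi^*(x) = \mathbb{1}\!\left(\frac{p_1(x)}{p_0(x)} \ge C_\alpha^*\right),$$

where $C_\alpha^*$ is chosen so that $P_0\!\left(p_1(X)/p_0(X) \ge C_\alpha^*\right) = \alpha$; the exact-equality form assumes the distribution of $p_1(X)/p_0(X)$ under $P_0$ is continuous at the threshold.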

20 It is worth noting that our plug-in approach to NP classification leads to problems related to density level set estimation (see Rigollet and Vert 2009 and references therein), where the task is to estimate $\{x : p(x) > \lambda\}$, for some level $\lambda > 0$. [sent-115, score-0.219]

21 Density level set estimation has applications in anomaly detection and unsupervised or semi-supervised classification. [sent-116, score-0.472]

22 Plug-in methods for density level set estimation, as opposed to direct methods, do not involve a complex optimization procedure, and only amount to thresholding the density estimate at a proper level. [sent-117, score-0.263]
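A minimal sketch of this plug-in recipe for level sets (not the paper's code; the Gaussian kernel, rule-of-thumb bandwidth, grid, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 2000, 2, 0.05                 # sample size, dimension, target level lambda
X = rng.normal(0.0, 1.0, size=(n, d))     # sample from the unknown density p
h = n ** (-1.0 / (d + 4))                 # rule-of-thumb bandwidth (an assumption)

def kde(query, sample, h):
    """Gaussian-kernel density estimate of p at the query points."""
    d = sample.shape[1]
    sq = ((query ** 2).sum(1)[:, None] + (sample ** 2).sum(1)[None, :]
          - 2.0 * query @ sample.T)                       # squared distances
    k = np.exp(-sq / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d)
    return k.mean(axis=1)

# Plug-in level set estimate on a grid: {x : p_hat(x) > lambda}.
g = np.linspace(-3, 3, 40)
gx, gy = np.meshgrid(g, g)
grid = np.column_stack([gx.ravel(), gy.ravel()])
level_set = grid[kde(grid, X, h) > lam]
print(f"grid points in the estimated level set: {len(level_set)} / {len(grid)}")
```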

23 First, the threshold level in our current setup needs to be estimated, and second, we deal with density ratios rather than densities. [sent-119, score-0.237]

24 However, Audibert and Tsybakov (2007) combined a smoothness condition on the regression function with the margin assumption, and showed that plug-in classifiers $\mathbb{1}(\hat{\eta}_n \ge 1/2)$ based on local polynomial estimators can achieve rates faster than $O(1/n)$. [sent-124, score-0.236]

25 3 Application to Anomaly Detection NP classification is a useful framework to address anomaly detection problems. [sent-128, score-0.406]

26 In anomaly detection, the goal is to discover patterns that are different from usual outcomes or behaviors. [sent-129, score-0.264]

27 There are many approaches to anomaly detection; some serving a specific purpose while others are more generic. [sent-132, score-0.264]

28 A recent comprehensive review of anomaly detection literature is provided by Chandola et al. [sent-134, score-0.406]

29 When we have training data from the normal class, a common approach to anomaly detection is to estimate the normal class density p0 and try to threshold at a proper level, but this is inappropriate if the anomaly class is far from uniformly distributed. [sent-138, score-0.934]

30 Our main results in NP classification will be adapted for anomaly detection applications, where the normal sample size n is much bigger than the anomaly sample size m. [sent-141, score-0.692]

31 The above condition for densities was first introduced in Polonik (1995), and its counterpart in classical binary classification was called the margin condition (Mammen and Tsybakov, 1999), from which we borrow the same terminology for discussion. [sent-188, score-0.215]

32 Recall that the set {x : η(x) = 1/2} is the decision boundary of the Bayes classifier in the classical paradigm, and the margin condition in the classical paradigm is a special case of Definition 2 by taking p = η and C∗ = 1/2. [sent-190, score-0.278]
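For reference, the classical Mammen-Tsybakov margin condition, and the variant this discussion alludes to, can be written as follows; this is a reconstruction from the surrounding text, since the exact constants and admissible range of $\delta$ in Definition 2 are not reproduced in this summary:

$$P_X\bigl(|\eta(X) - 1/2| \le \delta\bigr) \le C_0\,\delta^{\bar{\gamma}} \quad \text{for all } 0 < \delta \le \delta_0 \qquad \text{(classical paradigm)},$$
$$P_X\bigl(|p(X) - C^*| \le \delta\bigr) \le C_0\,\delta^{\bar{\gamma}} \quad \text{for all } 0 < \delta \le \delta_0 \qquad \text{(generic function } p \text{ and level } C^*\text{)}.$$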

33 Most importantly, we formulate a detection condition to identify the right threshold level in plug-in classifiers under the NP paradigm. [sent-195, score-0.29]

34 1 Class 0 Density p0 Known In this subsection, suppose that we know the class 0 density p0 , but have to estimate the class 1 density p1 . [sent-197, score-0.256]

35 In particular, we will establish oracle inequalities regarding the excess type I and type II errors of the classifier thresholded at the estimated level $\hat{C}_\alpha$. [sent-214, score-0.521]

36 Note that since $\hat{C}_\alpha$ is constructed to meet the level $\alpha$ exactly, the excess type I error of $\hat{\varphi}$ vanishes, that is, $R_0(\hat{\varphi}) - R_0(\varphi^*) = 0$. [sent-215, score-0.349]
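To make the calibration of $\hat{C}_\alpha$ concrete when $p_0$ is known, here is an illustrative Monte Carlo sketch (not from the paper): the threshold is taken as the $(1-\alpha)$ quantile of $\hat{p}_1(X)/p_0(X)$ under a large sample drawn from the known $p_0$, so the type I error level is met only up to Monte Carlo error. Distributions, bandwidth, and names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.05

def p0(x):
    """Known class 0 density: standard normal (illustrative choice)."""
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

# Class 1 sample and a Gaussian kernel estimate p1_hat with bandwidth h (ad hoc).
m, h = 500, 0.3
X1 = rng.normal(2.0, 1.0, m)

def p1_hat(x):
    z = (np.asarray(x)[:, None] - X1[None, :]) / h
    return np.exp(-z ** 2 / 2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

# Calibrate C_alpha_hat so that P0(p1_hat(X) / p0(X) >= C_alpha_hat) is about alpha.
X0_mc = rng.normal(0.0, 1.0, 20000)               # Monte Carlo sample from the known p0
C_alpha_hat = np.quantile(p1_hat(X0_mc) / p0(X0_mc), 1 - alpha)

# Plug-in NP classifier: predict class 1 iff the estimated ratio exceeds the threshold.
x_test = np.array([-1.0, 0.5, 1.5, 3.0])
pred = (p1_hat(x_test) / p0(x_test) >= C_alpha_hat).astype(int)
print("C_alpha_hat =", round(float(C_alpha_hat), 3), "predictions:", pred.tolist())
```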

37 The following theorem addresses the excess type II error of $\hat{\varphi}$: $R_1(\hat{\varphi}) - R_1(\varphi^*)$. [sent-222, score-0.307]

38 Note that the dependency of the upper bound for the excess type II error on parameters β, L, and L′ is incorporated into the constant C, whose explicit formula is given in Lemma 1, which has an important role in the proof. [sent-227, score-0.283]

39 Also, from the upper bound we can see that the larger the parameter $\bar{\gamma}$, the sharper the margin assumption, and hence the faster the rate of convergence for the excess type II error. [sent-240, score-0.335]

40 Proof. First note that the excess type II error can be represented by $R_1(\hat{\varphi}) - R_1(\varphi^*) = \int_{G^* \triangle \hat{G}} \left|\frac{p_1}{p_0} - C_\alpha^*\right| dP_0$, where $G^* = \{p_1/p_0 < C_\alpha^*\}$, $\hat{G} = \{\hat{p}_1/p_0 < \hat{C}_\alpha\}$, and $G^* \triangle \hat{G} = (G^* \cap \hat{G}^c) \cup (G^{*c} \cap \hat{G})$ is the symmetric difference between $G^*$ and $\hat{G}$. [sent-242, score-0.283]

41 Define an event regarding the sample $S_1$: $E = \{\|\hat{p}_1 - p_1\|_\infty < \delta_1 \mu_{\min}\}$, where $\delta_1 = \frac{2C}{\mu_{\min}^2}\sqrt{\frac{\log(m/\delta)}{m h^d}}$ and $C$ is the same as in Lemma 1 (with $p$ replaced by $p_1$). [sent-244, score-0.286]

42 2 Class 0 Density p0 Unknown Assume that both the class 0 density p0 and the class 1 density p1 are unknown. [sent-262, score-0.256]

43 Because data from class 0 is needed to estimate both the class 0 density and the threshold level, we split the class 0 data into two pieces. [sent-264, score-0.261]

44 First estimate $p_0$ and $p_1$ respectively from $S_0$ and $S_1$ by kernel estimators, $\hat{p}_0(x) = \frac{1}{n h_n^d}\sum_{i=1}^{n} K\!\left(\frac{X_i^- - x}{h_n}\right)$ and $\hat{p}_1(x) = \frac{1}{m h_m^d}\sum_{i=1}^{m} K\!\left(\frac{X_i^+ - x}{h_m}\right)$, where $h_n$ and $h_m$ denote the bandwidths. [sent-277, score-0.723]
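These estimators translate directly into code. The sketch below (illustrative one-dimensional data, Gaussian kernel, ad hoc bandwidths; not the paper's implementation) also uses the held-out half of the class 0 data, as in the sample-splitting step above, to pick the threshold empirically:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.05

# Hypothetical samples: class 0 (normal behavior) and class 1 (anomalies).
X0 = rng.normal(0.0, 1.0, 4000)            # class 0 sample of size n
X1 = rng.normal(2.5, 1.0, 400)             # class 1 sample of size m

# Split class 0: one half to estimate p0, the other half to calibrate the threshold.
X0_dens, X0_thresh = X0[:2000], X0[2000:]

def kde_factory(sample, h):
    """Return a one-dimensional Gaussian kernel density estimator with bandwidth h."""
    def estimate(x):
        z = (np.asarray(x)[:, None] - sample[None, :]) / h
        return np.exp(-z ** 2 / 2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)
    return estimate

h_n = len(X0_dens) ** (-1 / 5)             # rule-of-thumb bandwidths (assumptions)
h_m = len(X1) ** (-1 / 5)
p0_hat = kde_factory(X0_dens, h_n)
p1_hat = kde_factory(X1, h_m)

# Empirical threshold: (1 - alpha) quantile of the estimated ratio on held-out class 0
# data, so the empirical type I error on that split is approximately alpha.
C_alpha_hat = np.quantile(p1_hat(X0_thresh) / p0_hat(X0_thresh), 1 - alpha)

def np_classifier(x):
    """Predict 1 (anomaly) iff the estimated density ratio exceeds the threshold."""
    return (p1_hat(x) / p0_hat(x) >= C_alpha_hat).astype(int)

print("empirical type I error on the held-out split:",
      float(np.mean(np_classifier(X0_thresh))))
```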

45 We know that having fast diminishing excess type II error demands a low noise condition, such as the margin assumption. [sent-281, score-0.394]

46 Approximating the optimal threshold level is not a problem in the classical setting, because in that setting the Bayes classifier is $\mathbb{1}(\eta(x) \ge 1/2)$, and the threshold level $1/2$ on the regression function $\eta$ is known. [sent-283, score-0.282]

47 The following level α detection condition addresses this concern. [sent-285, score-0.263]

48 The next theorem addresses the excess type II error of $\hat{\varphi}$. [sent-305, score-0.307]

49 Also assume that the likelihood ratio $p_1/p_0$ satisfies the level $\alpha$ detection condition for some $\underline{\gamma} \ge \bar{\gamma}$. [sent-310, score-0.26]

50 In particular, there exists some positive $\bar{C}$ such that for all $n, m \ge 1/\delta$, $$R_1(\hat{\varphi}) - R_1(\varphi^*) \le \bar{C}\left[\left(\frac{\log n}{n}\right)^{\min\left(\frac{1}{2},\, \frac{1+\bar{\gamma}}{2\underline{\gamma}},\, \frac{\beta(1+\bar{\gamma})}{2\beta+d}\right)} + \left(\frac{\log m}{m}\right)^{\frac{\beta(1+\bar{\gamma})}{2\beta+d}}\right].$$ Proof. Denote $G^* = \{p_1/p_0 < C_\alpha^*\}$ and $\hat{G} = \{\hat{p}_1/\hat{p}_0 < \hat{C}_\alpha\}$. [sent-312, score-0.224]

51 Therefore the excess type II error can be decomposed into two parts, $$P_1(\hat{G}) - P_1(G^*) = \int_{G^* \triangle \hat{G}} \left|\frac{p_1}{p_0} - C_\alpha^*\right| dP_0 + C_\alpha^*\left[P_0(G^{*c}) - P_0(\hat{G}^c)\right].$$ [sent-315, score-0.283]

52 From the above decomposition, we see that to control the excess type II error, the type I error of $\hat{\varphi}$ should be not only smaller than the level $\alpha$, but also not far from $\alpha$. [sent-317, score-0.487]

53 This is intuitively correct, because having a small type I error amounts to having a very tight constraint set, which leads to significant deterioration in achievable type II error. [sent-318, score-0.266]

54 By the level $\alpha$ detection condition, it is enough to take $c_n = (2d_n/C_1)^{1/\underline{\gamma}}$ in inequality (8): $P_0\!\left(\frac{p_1(X)}{p_0(X)} \ge C_\alpha^* + (2d_n/C_1)^{1/\underline{\gamma}}\right) \le \alpha - 2d_n$. [sent-343, score-0.267]

55 As in the $p_0$ known setup, the coefficient $\bar{\gamma}$ from the margin assumption influences the convergence rate of the excess type II error. [sent-356, score-0.335]

56 The larger the $\bar{\gamma}$, the easier the classification problem, and hence the faster the convergence of the excess type II error. [sent-357, score-0.249]

57 Take it to the extreme $\underline{\gamma} \to \infty$ (keeping $\bar{\gamma}$ fixed), which holds when the amount of data around the optimal threshold level goes to zero; then $\left(\frac{\log n}{n}\right)^{\frac{1+\bar{\gamma}}{2\underline{\gamma}}} \to \left(\frac{\log n}{n}\right)^{0} = 1$. [sent-360, score-0.303]

58 In anomaly detection applications, let class 0 represent the normal class, and class 1 represent the anomaly class. [sent-362, score-0.774]

59 Intuitively, this says that if the normal class sample size is large enough ($n \ge \log m$) compared to the anomaly class sample size, lack of precise knowledge of the normal class density $p_0$ does not change the type II error rate bound, up to a multiplicative constant. [sent-370, score-0.761]

60 Recall that $\hat{\eta}(x) = \sum_{i=1}^{m} Y_i K\!\left(\frac{X_i - x}{h}\right) \big/ \sum_{i=1}^{m} K\!\left(\frac{X_i - x}{h}\right)$ can be written as $\hat{f}/\hat{p}$, where $\hat{p}(x) = \frac{1}{mh^d}\sum_{i=1}^{m} K\!\left(\frac{X_i - x}{h}\right)$ and $\hat{f}(x) = \frac{1}{mh^d}\sum_{i=1}^{m} Y_i K\!\left(\frac{X_i - x}{h}\right)$, in which $h$ is the bandwidth. [sent-391, score-0.322]
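The estimator $\hat{\eta} = \hat{f}/\hat{p}$ is the Nadaraya-Watson regression estimator; a minimal sketch with a Gaussian kernel on made-up mixture data (bandwidth and distributions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical contaminated (mixture) sample (X_i, Y_i), with Y_i in {0, 1}.
m, h = 2000, 0.3
Y = rng.binomial(1, 0.3, m)
X = np.where(Y == 1, rng.normal(2.0, 1.0, m), rng.normal(0.0, 1.0, m))

def eta_hat(x, X=X, Y=Y, h=h):
    """Nadaraya-Watson estimate of eta(x) = P(Y = 1 | X = x), i.e., f_hat / p_hat."""
    w = np.exp(-0.5 * ((np.asarray(x)[:, None] - X[None, :]) / h) ** 2)
    f_hat = (w * Y[None, :]).sum(axis=1)   # proportional to f_hat(x) = (1/mh) sum Y_i K(.)
    p_hat = w.sum(axis=1)                  # proportional to p_hat(x); the common
    return f_hat / p_hat                   # normalizing factor 1/(mh) cancels in the ratio

print(eta_hat(np.array([-1.0, 0.0, 1.0, 2.0, 3.0])).round(3))
```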

61 Since $\tilde{D}_\alpha$ is constructed to meet the level $\alpha$ exactly, the excess type I error of $\tilde{\varphi}(x) = \mathbb{1}(\hat{\eta}(x) \ge \tilde{D}_\alpha)$ (10) vanishes, that is, $R_0(\tilde{\varphi}) - R_0(\varphi^*) = 0$. [sent-396, score-0.349]
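Calibrating the threshold on $\hat{\eta}$ against the class 0 distribution can be sketched as follows; this is a toy helper, not the paper's construction, and it assumes $\hat{\eta}$ values on a class 0 sample are already available (the placeholder values below are made up):

```python
import numpy as np

def np_threshold_classifier(eta_on_class0, alpha):
    """Take eta_hat values on a class 0 sample, return the (1 - alpha) empirical
    quantile D_alpha_hat and the classifier eta -> 1{eta >= D_alpha_hat}."""
    D_alpha_hat = np.quantile(np.asarray(eta_on_class0), 1 - alpha)
    def classify(eta_values):
        return (np.asarray(eta_values) >= D_alpha_hat).astype(int)
    return D_alpha_hat, classify

# Illustrative usage with placeholder eta_hat values evaluated on class 0 points.
rng = np.random.default_rng(6)
eta_on_class0 = rng.beta(2, 8, 5000)
D_hat, classify = np_threshold_classifier(eta_on_class0, alpha=0.05)
print("D_alpha_hat =", round(float(D_hat), 3),
      "empirical type I error =", float(np.mean(classify(eta_on_class0))))
```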

62 Then there exists a positive $D$ such that for any $\delta \in (0,1)$ and any sample size $m$ satisfying $\frac{\log(m/\delta)}{m h^d} < 1$, it holds with probability $1-\delta$ that $R_1(\tilde{\varphi}) - R_1(\varphi^*) \le D\left(\frac{\log(3m/\delta)}{m h^d}\right)^{\frac{1+\bar{\gamma}}{2}}$, with $h = \left(\frac{\log m}{m}\right)^{\frac{1}{2\beta+d}}$. [sent-401, score-0.322]

63 Proof. First note that the excess type II error is $R_1(\tilde{\varphi}) - R_1(\varphi^*) = P_1(\hat{G}) - P_1(G^*) = \int_{G^* \triangle \hat{G}} \left|\frac{p_1}{p_0} - C_\alpha^*\right| dP_0$, where $G^* = \{\eta < D_\alpha^*\}$, $\hat{G} = \{\hat{\eta} < \tilde{D}_\alpha\}$, and $G^* \triangle \hat{G} = (G^* \cap \hat{G}^c) \cup (G^{*c} \cap \hat{G})$. [sent-403, score-0.283]

64 Unlike the previous setup where $p_0$ is known, we now bound the excess type I error. [sent-436, score-0.282]

65 A similar level $\alpha$ detection condition can be formulated for the regression function, but we omit it, as $C_\alpha^*$ is simply replaced by $D_\alpha^*$. [sent-439, score-0.263]

66 The next theorem addresses the excess type II error of $\tilde{\varphi}$: $R_1(\tilde{\varphi}) - R_1(\varphi^*)$. [sent-440, score-0.283]

67 Assume Condition 3 and that the regression function $\eta$ satisfies the level $\alpha$ detection condition for some $\underline{\gamma}\,(\ge \bar{\gamma})$. [sent-442, score-0.294]

68 Then there exists a positive constant $\bar{C}$ such that for any $\delta \in (0,1)$ and any $m, n \ge 1/\delta$, it holds with probability $1 - 2\delta$ that $$R_1(\tilde{\varphi}) - R_1(\varphi^*) \le \bar{C}\left[\left(\frac{\log n}{n}\right)^{\min\left(\frac{1}{2},\, \frac{1+\bar{\gamma}}{2\underline{\gamma}}\right)} + \left(\frac{\log m}{m}\right)^{\frac{\beta(1+\bar{\gamma})}{2\beta+d}}\right].$$ [sent-444, score-0.224]

69 Proof. Let $G^* = \{\eta < D_\alpha^*\}$ and $\hat{G} = \{\hat{\eta} < \hat{D}_\alpha\}$; then the excess type II error of $\tilde{\varphi}$ can be decomposed as $$P_1(\hat{G}) - P_1(G^*) = \int_{G^* \triangle \hat{G}} \left|\frac{p_1}{p_0} - C_\alpha^*\right| dP_0 + C_\alpha^*\left[P_0(G^{*c}) - P_0(\hat{G}^c)\right].$$ [sent-445, score-0.283]

70 By the level $\alpha$ detection condition, it is enough to take $c_n = (2d_n/C_1)^{1/\underline{\gamma}}$, so that $P_0\!\left(\eta \ge D_\alpha^* + (2d_n/C_1)^{1/\underline{\gamma}}\right) \le \alpha - 2d_n$. [sent-469, score-0.241]

71 In anomaly detection applications, normal samples are considered abundant, that is, $n \gg m$, which implies that $\left(\frac{\log n}{n}\right)^{\frac{1}{2\beta+d}} \le \left(\frac{\log m}{m}\right)^{\frac{1}{2\beta+d}}$. [sent-481, score-0.614]

72 Then the upper bounds for the excess type II errors in Theorem 2 and Theorem 3 are of the same order. [sent-482, score-0.281]
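The comparison rests on an elementary monotonicity fact, stated here for completeness rather than quoted from the paper: since $x \mapsto (\log x)/x$ is decreasing for $x \ge e$,

$$n \ge m \;(\ge e) \;\Longrightarrow\; \frac{\log n}{n} \le \frac{\log m}{m} \;\Longrightarrow\; \left(\frac{\log n}{n}\right)^{\frac{1}{2\beta+d}} \le \left(\frac{\log m}{m}\right)^{\frac{1}{2\beta+d}},$$

which is the step used to conclude that the two bounds are of the same order when $n \gg m$.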

73 Having access to the mixture (contaminated) sample $\bar{S}$ looks like a weaker condition than having access to a pure anomaly sample. [sent-483, score-0.295]

74 The essence is revealed by observing that the density ratio $p_1/p_0$ and the regression function $\eta$ play the same role in the oracle NP classifier at level $\alpha$: $\varphi^*(x) = \mathbb{1}\!\left(p_1(x)/p_0(x) \ge C_\alpha^*\right) = \mathbb{1}\!\left(\eta(x) \ge D_\alpha^*\right)$. [sent-485, score-0.251]
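One way to see why the two thresholding rules coincide (a hedged derivation, assuming $\eta(x) = \mathbb{P}(Y=1 \mid X=x)$ and class priors $\pi_0 = \mathbb{P}(Y=0)$, $\pi_1 = \mathbb{P}(Y=1)$, so that the marginal density is $\pi_0 p_0 + \pi_1 p_1$):

$$\eta(x) = \frac{\pi_1 p_1(x)}{\pi_0 p_0(x) + \pi_1 p_1(x)} = \frac{\pi_1\, r(x)}{\pi_0 + \pi_1\, r(x)}, \qquad r(x) := \frac{p_1(x)}{p_0(x)},$$

which is strictly increasing in $r(x)$; hence $\{r(x) \ge C_\alpha^*\} = \{\eta(x) \ge D_\alpha^*\}$ with $D_\alpha^* = \pi_1 C_\alpha^* / (\pi_0 + \pi_1 C_\alpha^*)$.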

75 Being able to estimate the anomaly density p1 is not of particular advantage, because only the ratio p1 /p0 matters. [sent-487, score-0.372]

76 Suppose $\hat{p}$ is the kernel density estimator built from the i.i.d. sample $\{X_1, \ldots, X_n\}$, $\hat{p}(x) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)$, where $h$ is the bandwidth. [sent-500, score-0.972]

77 For any $\varepsilon \in (0, 1)$, if the sample size $n$ is such that $\frac{\log(n/\varepsilon)}{nh^d} < 1$, it holds that $\mathbb{P}(\|\hat{p} - p\|_\infty \ge \delta) \le \varepsilon$, where $$\delta = (32c_2 d + 48dc_1)\sqrt{\frac{\log(n/\varepsilon)}{nh^d}} + 2Lc_3 h^\beta + \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} + \left(L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}.$$ [sent-501, score-0.75]

78 Let $h = \left(\frac{\log n}{n}\right)^{\frac{1}{2\beta+d}}$; it is enough to take $\delta = C\sqrt{\frac{\log(n/\varepsilon)}{nh^d}}$, where $C = 32c_2 d + 48dc_1 + 2Lc_3 + \sqrt{d}L' + L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}$, $c_3 = \int |K(t)|\,\|t\|^\beta\, dt$, and $\tilde{C}$ is a constant bounding the partial derivatives $D^s p$ for $1 \le |s| \le \lfloor\beta\rfloor$. [sent-503, score-0.579]
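As a quick empirical sanity check of this sup-norm behavior, the simulation sketch below (not part of the paper; the smoothness value, kernel, and grid are assumptions made only for illustration) compares the grid sup-error of a kernel density estimate with the $\sqrt{\log(n)/(nh^d)}$ scale under the stated bandwidth choice:

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(-3, 3, 301)
true_density = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)   # p = N(0, 1), so d = 1

def kde_on_grid(sample, h):
    """Gaussian kernel density estimate evaluated on the fixed grid."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-z ** 2 / 2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

beta, d = 2, 1                                    # assumed smoothness and dimension
for n in [500, 2000, 8000, 32000]:
    h = (np.log(n) / n) ** (1 / (2 * beta + d))   # bandwidth choice from the lemma
    sample = rng.normal(0.0, 1.0, n)
    sup_err = np.max(np.abs(kde_on_grid(sample, h) - true_density))
    scale = np.sqrt(np.log(n) / (n * h ** d))     # the sqrt(log n / (n h^d)) scale
    print(f"n={n:6d}  sup error={sup_err:.4f}  sqrt(log n/(n h^d))={scale:.4f}")
```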

79 Note that for any $\delta > 0$, $\mathbb{P}(\|\hat{p} - p\|_\infty \ge \delta) \le \mathbb{P}(M_1 + M_2 + M_3 \ge \delta)$, (14) where $$M_1 = \sup_{\|x - x'\| \le \frac{\sqrt{d}}{M}} \frac{1}{nh^d}\left|\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) - K\!\left(\frac{X_i - x'}{h}\right)\right|, \quad M_2 = \sup_{\|x - x'\| \le \frac{\sqrt{d}}{M}} |p(x) - p(x')|, \quad M_3 = \sup_{x \in G}\left|\frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) - p(x)\right|.$$ [sent-508, score-1.071]

80 Note that because $K$ is $L'$-Lipschitz, $$M_1 \le \sup_{\|x-x'\|\le\frac{\sqrt{d}}{M}} \frac{1}{nh^d}\sum_{i=1}^{n}\left|K\!\left(\frac{X_i-x}{h}\right) - K\!\left(\frac{X_i-x'}{h}\right)\right| \le \frac{1}{nh^d}\cdot n\cdot\frac{L'}{h}\cdot\frac{\sqrt{d}}{M} = \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d}.$$ [sent-509, score-0.519]

81 To control $M_2$, note that if $\beta \le 1$, $|p(x) - p(x')| = |p(x) - p_{x'}(x)| \le L\|x - x'\|^\beta$. [sent-510, score-0.418]

82 Define $t = \delta - \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} - \left(L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}$. [sent-519, score-0.618]

83 Use a union bound to control the tail probability of $M_3$: $\mathbb{P}(M_3 \ge t) \le \sum_{x \in G}\mathbb{P}\!\left(\left|\frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) - p(x)\right| \ge t\right)$. [sent-522, score-0.529]

84 Because $\log 2 + d\log(2M + 1) \le 6d\log n$, it is sufficient to have $6d\log n - nh^d t^2 \le \log\varepsilon$. [sent-528, score-0.858]

85 Then we can take $\delta = (32c_2 d + 48dc_1)\sqrt{\frac{\log(n/\varepsilon)}{nh^d}} + 2Lc_3 h^\beta + \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} + \left(L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}$. [sent-532, score-0.489]

86 With this bandwidth, $\sqrt{\frac{\log n}{nh^d}} = h^\beta = \left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+d}} \ge d^{\beta/2} n^{-2\beta}$. [sent-533, score-0.672]

87 Hence each of the terms $2Lc_3 h^\beta$, $\frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d}$, and $\left(L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}$ is bounded by its coefficient times $\sqrt{\frac{\log(n/\varepsilon)}{nh^d}}$. [sent-536, score-1.737]

88 So it suffices to take $\delta = C\sqrt{\frac{\log(n/\varepsilon)}{nh^d}}$, where $C = 32c_2 d + 48dc_1 + 2Lc_3 + \sqrt{d}L' + L + \tilde{C}\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}$. [sent-537, score-0.579]

89 Moreover, assume $p \ge \mu'_{\min}\,(>0)$ and that the sample size $n$ is such that $\frac{\log(n/\varepsilon)}{nh^d} < 1$. [sent-547, score-0.524]

90 Then for any $\varepsilon > 0$, $\mathbb{P}(\|\hat{\eta} - \eta\|_\infty \ge \delta) \le 3\varepsilon$, for $$\delta = \frac{1}{\mu'_{\min} - \delta'}\,\delta' + \frac{1}{\mu'_{\min} - \delta'}\left[\left(32d\|K\|_\infty + 12d\|K\|_2^2\,\|p\|_\infty\right)\sqrt{\frac{\log(n/\varepsilon)}{nh^d}} + \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} + \left(L + \tilde{C}_1\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta} + (c_4 + c_5)h^\beta\right],$$ [sent-548, score-0.694]

91 where $\delta'$ is the same as $\delta$ in Lemma 1, $c_4 = \frac{\|p\|_\infty L}{\mu'_{\min}}\left(1 + \frac{\|f\|_\infty}{\mu'_{\min}}\right)\int |K(z)|\cdot\|z\|^\beta\, dz$, $c_5 = L\int |K(z)|\cdot\|z\|^\beta\, dz$, and $\tilde{C}_1$ is such that $\tilde{C}_1 \ge \sup_{1\le|s|\le\lfloor\beta\rfloor}\sup_{x\in[-1,1]^d} |D^s p(x)|$. [sent-549, score-0.597]

92 Proof. Recall that $\hat{p}(x) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)$ and $\hat{f}(x) = \frac{1}{nh^d}\sum_{i=1}^{n} Y_i K\!\left(\frac{X_i - x}{h}\right)$, so $$|\hat{\eta} - \eta| = \left|\frac{\hat{f}}{\hat{p}} - \frac{f}{p}\right| \le \left|\frac{\hat{f}}{\hat{p}} - \frac{f}{\hat{p}}\right| + \left|\frac{f}{\hat{p}} - \frac{f}{p}\right| = \frac{1}{|\hat{p}|}\,|\hat{f} - f| + \frac{|f|}{|\hat{p}|\,|p|}\,|p - \hat{p}|.$$ [sent-551, score-1.458]

93 The quantity $M_1$ can be controlled as follows: $$M_1 = \sup_{\|x-x'\|\le\frac{\sqrt{d}}{M}}\left|\frac{1}{nh^d}\sum_{i=1}^{n} Y_i K\!\left(\frac{X_i-x}{h}\right) - \frac{1}{nh^d}\sum_{i=1}^{n} Y_i K\!\left(\frac{X_i-x'}{h}\right)\right| \le \sup_{\|x-x'\|\le\frac{\sqrt{d}}{M}}\frac{1}{nh^d}\sum_{i=1}^{n}\left|K\!\left(\frac{X_i-x}{h}\right) - K\!\left(\frac{X_i-x'}{h}\right)\right| \le \frac{1}{nh^d}\cdot n\cdot\frac{L'}{h}\cdot\frac{\sqrt{d}}{M} = \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d}.$$ [sent-568, score-0.651]

94 The quantity $M_2$ can be controlled similarly to its counterpart in the proof of Lemma 1: $M_2 \le \left(L + \tilde{C}_1\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}$. [sent-569, score-0.915]

95 Let $\tilde{t} = \delta_3 - \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} - \left(L + \tilde{C}_1\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta}$. [sent-571, score-0.618]

96 Note that $$\left|\mathbb{E}\!\left[\frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i-x}{h}\right)\eta(X_i)\right] - \mathbb{E}[\hat{p}(x)\eta(x)]\right| = \left|\frac{1}{h^d}\int K\!\left(\frac{y-x}{h}\right)(\eta(y) - \eta(x))\,p(y)\,dy\right| = \left|\int K(z)\,[\eta(x + hz) - \eta(x)]\,p(x + hz)\,dz\right| \le \|p\|_\infty\int |K(z)|\cdot|\eta(x + hz) - \eta(x)|\,dz.$$ [sent-583, score-0.972]

97 The above inequality together with (16) implies that $\left|\mathbb{E}\!\left[\frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{X_i-x}{h}\right)\eta(X_i)\right] - \mathbb{E}[\hat{p}(x)\cdot\eta(x)]\right| \le c_4 h^\beta$, where $c_4 = \frac{\|p\|_\infty L}{\mu'_{\min}}\left(1 + \frac{\|f\|_\infty}{\mu'_{\min}}\right)\int |K(z)|\cdot\|z\|^\beta\, dz$. [sent-587, score-0.623]

98 Take $\delta_2 = \frac{\delta_3}{\mu'_{\min} - \delta'}$ and $\delta_3 = \tilde{t} + \frac{\sqrt{d}L'}{nh}\cdot\frac{1}{nh^d} + \left(L + \tilde{C}_1\sum_{1\le|s|\le\lfloor\beta\rfloor}\frac{1}{s!}\right) d^{\beta/2} n^{-2\beta} + (c_4 + c_5)h^\beta$; then we have $\mathbb{P}(\|\hat{f} - f\|_\infty \ge \delta_3) \le \sum_{x \in G}\mathbb{P}\!\left(|B_1(x)| \ge \tilde{t} - (c_4 + c_5)h^\beta\right) \le \varepsilon$. [sent-593, score-1.142]

99 An overview of anomaly detection techniques: Existing solutions and latest technological trends. [sent-691, score-0.406]

100 Measuring mass concentrations and estimating density contour clusters-an excess mass approach. [sent-695, score-0.22]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nhd', 0.486), ('ip', 0.439), ('anomaly', 0.264), ('ie', 0.219), ('earson', 0.173), ('eyman', 0.173), ('gc', 0.172), ('mhd', 0.161), ('np', 0.15), ('detection', 0.142), ('excess', 0.133), ('nh', 0.132), ('type', 0.116), ('paradigm', 0.113), ('dl', 0.107), ('hz', 0.105), ('lassification', 0.103), ('classi', 0.101), ('rigollet', 0.097), ('ong', 0.096), ('log', 0.093), ('density', 0.087), ('margin', 0.086), ('ii', 0.078), ('vert', 0.074), ('er', 0.07), ('level', 0.066), ('tsybakov', 0.063), ('dn', 0.056), ('ers', 0.054), ('xih', 0.053), ('oracle', 0.053), ('threshold', 0.051), ('event', 0.05), ('adn', 0.049), ('cannon', 0.049), ('malignant', 0.049), ('markou', 0.049), ('lemma', 0.049), ('el', 0.046), ('tumor', 0.044), ('densities', 0.043), ('class', 0.041), ('min', 0.038), ('regarding', 0.037), ('xi', 0.036), ('novelty', 0.035), ('dz', 0.035), ('ds', 0.034), ('kernel', 0.034), ('error', 0.034), ('setup', 0.033), ('cn', 0.033), ('sup', 0.033), ('legendre', 0.032), ('mammen', 0.032), ('estimators', 0.032), ('errors', 0.032), ('smoothness', 0.032), ('condition', 0.031), ('erm', 0.031), ('scott', 0.031), ('vc', 0.029), ('fc', 0.029), ('risk', 0.027), ('diagnosis', 0.027), ('inequality', 0.026), ('medical', 0.026), ('sd', 0.026), ('decompose', 0.025), ('agyemang', 0.025), ('casasent', 0.025), ('chandola', 0.025), ('hodge', 0.025), ('liml', 0.025), ('patcha', 0.025), ('tarigan', 0.025), ('wied', 0.025), ('xin', 0.025), ('diminishing', 0.025), ('gx', 0.025), ('addresses', 0.024), ('classical', 0.024), ('audibert', 0.024), ('readers', 0.024), ('regression', 0.024), ('estimator', 0.023), ('opposed', 0.023), ('zi', 0.023), ('control', 0.022), ('normal', 0.022), ('blanchard', 0.021), ('ird', 0.021), ('lehmann', 0.021), ('marshall', 0.021), ('stylized', 0.021), ('zadrozny', 0.021), ('hm', 0.021), ('ratio', 0.021), ('tail', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

2 0.11019552 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto

Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2 , Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaus∗. The authors would like to thank Bradley M. Bell for insightful discussions and helpful suggestions. The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013

3 0.058467742 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

4 0.057109442 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

5 0.056752097 8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

6 0.053278219 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

7 0.048651982 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

8 0.048072398 114 jmlr-2013-The Rate of Convergence of AdaBoost

9 0.047334012 73 jmlr-2013-Multicategory Large-Margin Unified Machines

10 0.047048487 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

11 0.043320131 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

12 0.042302206 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

13 0.041126493 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

14 0.040137663 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

15 0.039218798 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

16 0.038205814 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

17 0.037168793 96 jmlr-2013-Regularization-Free Principal Curve Estimation

18 0.036409486 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

19 0.036279883 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

20 0.036128242 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.191), (1, 0.075), (2, 0.04), (3, 0.064), (4, 0.008), (5, -0.086), (6, 0.065), (7, 0.021), (8, 0.01), (9, -0.085), (10, -0.016), (11, -0.028), (12, 0.025), (13, -0.051), (14, -0.114), (15, -0.041), (16, -0.054), (17, -0.091), (18, -0.023), (19, -0.121), (20, -0.23), (21, -0.133), (22, 0.044), (23, -0.036), (24, -0.13), (25, -0.109), (26, 0.193), (27, 0.227), (28, -0.094), (29, -0.187), (30, 0.114), (31, -0.203), (32, -0.07), (33, 0.131), (34, 0.029), (35, -0.119), (36, -0.003), (37, 0.119), (38, -0.158), (39, 0.014), (40, -0.017), (41, -0.051), (42, -0.003), (43, 0.033), (44, 0.004), (45, -0.127), (46, -0.005), (47, -0.062), (48, -0.062), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92352813 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

2 0.72733092 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto

Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2 , Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaus∗. The authors would like to thank Bradley M. Bell for insightful discussions and helpful suggestions. The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013

3 0.4782443 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

4 0.38522923 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

5 0.35330057 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.32523274 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

7 0.30444893 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

8 0.28965953 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

9 0.28156012 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

10 0.28129089 8 jmlr-2013-A Theory of Multiclass Boosting

11 0.2741085 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

12 0.26781553 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

13 0.25948176 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

14 0.25772214 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

15 0.25403899 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

16 0.2494853 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

17 0.24385777 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

18 0.24076703 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

19 0.23502769 33 jmlr-2013-Dimension Independent Similarity Computation

20 0.23272902 22 jmlr-2013-Classifying With Confidence From Incomplete Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (5, 0.128), (6, 0.021), (10, 0.063), (20, 0.01), (23, 0.03), (53, 0.018), (68, 0.017), (70, 0.029), (75, 0.022), (85, 0.537), (87, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90531802 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

Author: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence

same-paper 2 0.90339386 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

3 0.55961484 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity

4 0.53641605 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

5 0.51157302 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

6 0.49764732 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

7 0.47085544 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

8 0.46010843 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

9 0.4549354 21 jmlr-2013-Classifier Selection using the Predicate Depth

10 0.44993541 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

11 0.44723028 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

12 0.44491816 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

13 0.43473032 32 jmlr-2013-Differential Privacy for Functions and Functional Data

14 0.4323146 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

15 0.43071023 114 jmlr-2013-The Rate of Convergence of AdaBoost

16 0.42674115 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

17 0.42085508 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

18 0.42080656 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

19 0.41938269 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

20 0.41643685 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization