jmlr jmlr2011 jmlr2011-69 knowledge-graph by maker-knowledge-mining

69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints


Source: pdf

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. [sent-5, score-0.272]

2 The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. [sent-6, score-0.197]

3 New techniques to handle such problems are developed and they have consequences on chance constrained programming. [sent-7, score-0.166]

4 We also evaluate the price to pay in terms of type II error for being conservative on type I error. [sent-8, score-0.444]

5 Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization 1. [sent-9, score-0.314]

6 With slight abuse of language, in verbal discussion we do not distinguish type I/II error from probability of type I/II error. [sent-11, score-0.25]

7 In the learning context, as true errors are inaccessible, we cannot enforce the desired upper bound on the type I error almost surely. [sent-16, score-0.15]

8 The best we can hope for is that a data-dependent classifier has its type I error bounded with high probability. [sent-17, score-0.132]

9 The first is that the type I error of the classifier f̂ is bounded from above by a pre-specified level with pre-specified high probability; the second is that f̂ has good performance bounds for the excess type II error. [sent-19, score-0.34]

10 Our classifier takes a conservative attitude on type I error, and has a good performance bound measured by the excess ϕ-type II error. [sent-23, score-0.375]

11 A parallel between binary classification and statistical hypothesis testing is drawn in Section 3 with emphasis on the NP paradigm in both frameworks. [sent-27, score-0.17]

12 Finally, Section 5 illustrates an application of our results to chance constrained optimization. [sent-29, score-0.137]

13 Clearly the indicator function is not convex, and for computational convenience, a common practice is to replace it by a convex surrogate (see, e. [sent-39, score-0.151]

14 Definition 1 A function ϕ : [−1, 1] → R+ is called a convex surrogate if it is non-decreasing, continuous and convex and if ϕ(0) = 1. [sent-45, score-0.257]

15 Commonly used examples of convex surrogates are the hinge loss ϕ(x) = (1 + x)_+, the logit loss ϕ(x) = log_2(1 + e^x) and the exponential loss ϕ(x) = e^x. [sent-46, score-0.134]
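
As a quick concreteness check (not from the paper), the sketch below implements the three surrogates just listed and verifies the normalization ϕ(0) = 1 required by Definition 1; all function names are illustrative.

```python
# Illustrative implementations of the convex surrogates named above.
import numpy as np

def hinge(x):          # phi(x) = (1 + x)_+
    return np.maximum(1.0 + x, 0.0)

def logit(x):          # phi(x) = log_2(1 + e^x)
    return np.log1p(np.exp(x)) / np.log(2.0)

def exponential(x):    # phi(x) = e^x
    return np.exp(x)

# Each surrogate is non-decreasing, convex, and equals 1 at 0.
for phi in (hinge, logit, exponential):
    assert abs(phi(0.0) - 1.0) < 1e-12
```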

16 In our subsequent analysis, this convex relaxation will also be the ground on which we analyze a stochastic convex optimization problem subject to stochastic constraints. [sent-49, score-0.212]

17 While the h_j's may have no satisfactory classifying power individually, for over two decades boosting-type algorithms have successfully exploited the idea that a suitable weighted majority vote among these classifiers may result in low classification risk (Schapire, 1990). [sent-66, score-0.157]

18 Consequently, we restrict our search for classifiers to the set of functions consisting of convex combinations of the h_j's: H conv = {h_λ = ∑_{j=1}^M λ_j h_j , λ ∈ Λ}, where Λ denotes the flat simplex of IR^M and is defined by Λ = {λ ∈ IR^M : λ_j ≥ 0, ∑_{j=1}^M λ_j = 1}. [sent-67, score-0.442]

19 In effect, classification rules given by the sign of h ∈ H conv are exactly the set of rules produced by weighted majority votes among the base classifiers h_1, . . . , h_M. [sent-68, score-0.372]
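
For readers who want to see the class H conv in code, here is a minimal, hypothetical sketch: it evaluates h_λ = ∑_j λ_j h_j from a matrix of base-classifier outputs and recovers the weighted majority vote through the sign of h_λ; array names and shapes are assumptions, not the paper's.

```python
import numpy as np

def h_lambda(lam, base_outputs):
    """lam: (M,) weights on the flat simplex; base_outputs: (n, M) matrix
    whose column j holds h_j(X_1), ..., h_j(X_n), each value in [-1, 1]."""
    assert np.all(lam >= 0) and abs(lam.sum() - 1.0) < 1e-8
    return base_outputs @ lam              # values h_lambda(X_i), also in [-1, 1]

def majority_vote(lam, base_outputs):
    """Sign of h_lambda, i.e., the weighted majority vote among h_1, ..., h_M."""
    return np.sign(h_lambda(lam, base_outputs))
```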

20 When we restrict our search to classifiers in H conv, the best attainable ϕ-risk is called the oracle risk and is abusively denoted by Rϕ(H conv). [sent-72, score-0.784]

21 As a result, we have Rϕ (h) ≥ Rϕ (H conv ) for any h ∈ H conv and a natural measure of performance for a classifier h ∈ H conv is given by its excess risk defined by Rϕ (h) − Rϕ (H conv ). [sent-73, score-1.527]

22 The excess risk of a data driven classifier hn is a random quantity and we are interested in bounding it with high probability. [sent-74, score-0.289]

23 Formally, the statistical goal of binary classification is to construct a classifier h_n such that the oracle inequality Rϕ(h_n) ≤ Rϕ(H conv) + ∆_n(H conv, δ) holds with probability 1 − δ, where ∆_n(·, ·) should be as small as possible. [sent-75, score-0.877]

24 In the scope of this paper, we focus on candidate classifiers in the class H conv . [sent-76, score-0.368]

25 Under this convention, the risk function in classical binary classification can be expressed as a convex combination of the type I error R−(h) = IP(−Y h(X) ≥ 0 | Y = −1) and the type II error R+(h) = IP(−Y h(X) > 0 | Y = 1): R(h) = IP(Y = −1) R−(h) + IP(Y = 1) R+(h). [sent-82, score-0.488]
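
As a small illustration of this decomposition, here is a hedged sketch of the empirical versions of R−, R+ and R, assuming labels in {−1, +1} and an array scores holding the values h(X_i); all names are illustrative, not the authors'.

```python
import numpy as np

def empirical_errors(scores, labels):
    """scores: h(X_i); labels: Y_i in {-1, +1}."""
    neg, pos = labels == -1, labels == +1
    type_I  = np.mean(-labels[neg] * scores[neg] >= 0)   # R-(h) = IP(-Y h(X) >= 0 | Y = -1)
    type_II = np.mean(-labels[pos] * scores[pos] > 0)    # R+(h) = IP(-Y h(X) > 0  | Y = +1)
    risk = np.mean(neg) * type_I + np.mean(pos) * type_II  # R(h) = IP(Y=-1)R- + IP(Y=1)R+
    return type_I, type_II, risk
```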

26 While the goal of classical binary classification is min_{h∈H} R(h), where H is the set of candidate classifiers, NP classification targets min_{h∈H} R+(h) subject to R−(h) ≤ α. [sent-83, score-0.139]

27 Two kinds of errors arise: type I error occurs when rejecting P− when it is true, and type II error occurs when accepting P− when it is false. [sent-97, score-0.323]

28 The Neyman-Pearson paradigm in hypothesis testing amounts to choosing φ that solves the following constrained optimization problem: maximize IE[φ(X) | Y = 1] subject to IE[φ(X) | Y = −1] ≤ α, where α ∈ (0, 1) is the significance level of the test. [sent-98, score-0.203]

29 In other words, we specify a significance level α on type I error, and minimize type II error. [sent-99, score-0.182]

30 Conservativeness on Type I Error. While the binary classification problem has been extensively studied, theoretical propositions on how to implement the NP paradigm remain scarce. [sent-139, score-0.188]

31 Cannon et al. (2002) initiated the theoretical treatment of the NP classification paradigm, and an early empirical study can be found in Casasent and Chen (2003). [sent-141, score-0.209]

32 The result of Cannon et al. (2002) states that, simultaneously with high probability, the type II error R+(ĥ) is bounded from above by R+(h∗) + ε1, for some ε1 > 0, and the type I error of ĥ is bounded from above by α + ε0. [sent-148, score-0.264]

33 However, it is our primary interest to make sure that R−(ĥ) ≤ α with high probability, following the original principle of the Neyman-Pearson paradigm that type I error should be controlled at a pre-specified level α. [sent-161, score-0.253]

34 As we follow an empirical risk minimization procedure, to control IP(R−(ĥ) > α) it is necessary to have ĥ be a solution to some program with a strengthened constraint on the empirical type I error. [sent-162, score-0.334]

35 However, we also want to evaluate the excess type II error. [sent-164, score-0.208]

36 Our conservative attitude on type I error faces new technical challenges which we summarize here. [sent-165, score-0.299]

37 In the approaches of Cannon et al. (2002) and of Scott and Nowak (2005), the relaxed constraint on the type I error is constructed such that the constraint R̂−(h) ≤ α + ε0/2 on the empirical type I error in (2) is satisfied by h∗ (defined in (3)) with high probability, and that this classifier accommodates the excess type II error well. [sent-167, score-0.587]

38 As a result, the control of type II error mainly follows as a standard exercise to control suprema of empirical processes. [sent-168, score-0.238]

39 Such methods have consequences not only in NP classification but also in chance constrained programming, as explained in Section 5. [sent-170, score-0.168]

40 Convexified NP Classifier. With computational feasibility in mind, our proposed classifier is the solution to a convex program, which is an empirical form of the NP classification problem (1) in which the distribution of the observations is unknown. [sent-172, score-0.136]

41 The treatment of the convex constraint should be done carefully and we proceed as follows. [sent-174, score-0.178]

42 For any classifier h and a given convex surrogate ϕ, define R̂ϕ− and R̂ϕ+ to be the empirical counterparts of Rϕ− and Rϕ+ respectively, by R̂ϕ−(h) = (1/n−) ∑_{i=1}^{n−} ϕ(h(X_i−)) and R̂ϕ+(h) = (1/n+) ∑_{i=1}^{n+} ϕ(−h(X_i+)). [sent-175, score-0.181]
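
A hedged sketch of these two empirical quantities, using the hinge surrogate from Definition 1; X_minus and X_plus stand for the class-conditional samples and h is any scoring function (all names are illustrative).

```python
import numpy as np

def hinge(x):
    return np.maximum(1.0 + x, 0.0)       # one admissible convex surrogate

def phi_type_I(h, X_minus, phi=hinge):
    """Empirical phi-type I error: (1/n-) sum_i phi(h(X_i^-))."""
    return np.mean(phi(h(X_minus)))

def phi_type_II(h, X_plus, phi=hinge):
    """Empirical phi-type II error: (1/n+) sum_i phi(-h(X_i^+))."""
    return np.mean(phi(-h(X_plus)))
```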

43 Moreover, for any a > 0, let H^{ϕ,a} = {h ∈ H conv : Rϕ−(h) ≤ a} be the set of classifiers in H conv whose convexified type I errors are bounded from above by a, and let H^{ϕ,a}_{n−} = {h ∈ H conv : R̂ϕ−(h) ≤ a} be the set of classifiers in H conv whose empirical convexified type I errors are bounded by a. [sent-176, score-1.674]

44 We are now in a position to construct a classifier in H conv according to the Neyman-Pearson paradigm. [sent-178, score-0.336]

45 For any τ > 0 such that τ ≤ α√n−, define the convexified NP classifier h̃τ as any classifier that solves the following optimization problem: minimize R̂ϕ+(h) over h ∈ H conv subject to R̂ϕ−(h) ≤ α − τ/√n−. (4) [sent-179, score-0.415]

46 Note that this problem consists of minimizing a convex function subject to a convex constraint and can therefore be solved by standard algorithms (see, e. [sent-180, score-0.249]
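
To make program (4) concrete, here is a minimal sketch of one way to solve it with an off-the-shelf solver; the hinge surrogate, the use of SciPy's SLSQP method, and the matrix layout of the base-classifier outputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def hinge(x):
    return np.maximum(1.0 + x, 0.0)

def convexified_np_classifier(H_minus, H_plus, alpha, tau):
    """H_minus: (n-, M) base-classifier outputs on the class -1 sample;
    H_plus: (n+, M) outputs on the class +1 sample.  Returns simplex weights
    lambda for h_lambda = sum_j lambda_j h_j solving an instance of (4)."""
    n_minus, M = H_minus.shape
    level = alpha - tau / np.sqrt(n_minus)          # strengthened empirical level

    def phi_type_II(lam):                           # objective: empirical phi-type II error
        return hinge(-H_plus @ lam).mean()

    def type_I_slack(lam):                          # >= 0 iff empirical phi-type I error <= level
        return level - hinge(H_minus @ lam).mean()

    constraints = [{'type': 'ineq', 'fun': type_I_slack},
                   {'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0}]
    lam0 = np.full(M, 1.0 / M)                      # start from the uniform weights
    res = minimize(phi_type_II, lam0, method='SLSQP',
                   bounds=[(0.0, 1.0)] * M, constraints=constraints)
    return res.x
```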

47 In the next section, we present a series of results on type I and type II errors of classifiers that are more general than h̃τ. [sent-183, score-0.241]

48 As the true type I and type II errors are usually the main concern in statistical learning, we will also address the effect of convexification in terms of the excess type II error. [sent-188, score-0.475]

49 Interestingly, given that we want to be conservative on type I error, neither working with the ϕ-errors nor working with the true errors leads to the most desirable type II error. [sent-189, score-0.418]

50 The price to pay for being conservative will be characterized explicitly. [sent-190, score-0.221]

51 κ = 4√2 L √(log(1/δ)). Then for any (random) classifier h ∈ H conv that satisfies R̂ϕ−(h) ≤ ακ, we have R−(h) ≤ Rϕ−(h) ≤ α. [sent-196, score-0.336]
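
A tiny sketch of how the strengthened level of Theorem 3 is used in practice: κ is taken here as a given constant (its exact value is the one stated in the theorem), and a classifier is accepted only if its empirical ϕ-type I error falls below ακ = α − κ/√n−; the function names are hypothetical.

```python
import numpy as np

def strengthened_level(alpha, kappa, n_minus):
    """alpha_kappa = alpha - kappa / sqrt(n-)."""
    return alpha - kappa / np.sqrt(n_minus)

def certified_type_I(phi_type_I_hat, alpha, kappa, n_minus):
    """True if the empirical constraint of Theorem 3 holds; the theorem then
    gives R-(h) <= R_phi^-(h) <= alpha with probability at least 1 - delta."""
    return phi_type_I_hat <= strengthened_level(alpha, kappa, n_minus)
```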

52 Simultaneous Control of the Two Errors. Theorem 3 guarantees that any classifier that satisfies the strengthened constraint on the empirical ϕ-type I error will have its ϕ-type I error and its true type I error bounded from above by α. [sent-200, score-0.323]

53 Indeed, an extremely small ακ would certainly ensure a good control of type I error but would deteriorate significantly the best achievable ϕ-type II error. [sent-202, score-0.17]

54 This proposition ensures that if the convex surrogate ϕ is continuous, strengthening the constraint on the type I error (ϕ-type I error) does not increase the best achievable ϕ-type II error by too much. [sent-210, score-0.361]

55 This proposition has direct consequences on chance constrained programming as discussed in Section 5. [sent-212, score-0.207]

56 Denote the NP classifier obtained with this sampling scheme by h̃κn, and let the event F be defined by F = {R−(h̃κn) ≤ α} ∩ {Rϕ+(h̃κn) − min_{h∈H^{ϕ,α}} Rϕ+(h) ≤ 2κ/√N+ + 4ϕ(1)κ/((1 − ε)ᾱ √N−)}. [sent-248, score-0.206]

57 Rϕ+(h̃κn) − min_{h∈H^{ϕ,α}} Rϕ+(h) ≤ 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1 − ε)ᾱ √(n(1 − p)))   (13)

58 Price to Pay for Being Conservative. We have shown that the computationally feasible classifier h̃κ satisfies oracle inequalities which take the optimal ϕ-type II errors as the benchmark. [sent-257, score-0.142]

59 In this subsection, we measure the excess type II error and characterize the price to pay for being conservative on type I error. [sent-258, score-0.561]

60 Denote by HB the set of convex combinations of the base classifiers that have ϕ-type I error bounded from above by α. [sent-268, score-0.183]

61 Therefore, we have the following observation: R+(h̃κ) − min_{R−(h)≤α} R+(h) ≤ T1 + T2 + T3, where T1 = Rϕ+(h̃κ) − min_{h∈H_B^{ϕ,α}} Rϕ+(h), T2 = min_{h∈H_B^{ϕ,α}} Rϕ+(h) − inf_{R−(h)≤α, −B≤h≤B} Rϕ+(h), and T3 = inf_{R−(h)≤α, −B≤h≤B} Rϕ+(h) − inf_{R−(h)≤α} R+(h). [sent-269, score-0.285]

62 We can see that with a fixed sample size, choosing a set of base classifiers with smaller range will result in a tighter bound for the excess ϕ-type II error. [sent-272, score-0.153]

63 However, if one is more concerned about the true type II error, choosing a smaller B is not necessarily a better option, because only signs matter for the true type I and II errors. [sent-273, score-0.182]

64 Note that H_B^{ϕ,α} ⊂ {h : R−(h) ≤ α, −B ≤ h ≤ B}, so T2 reflects the price to pay for being conservative on type I error. [sent-280, score-0.312]

65 It also reflects the bias from choosing a specific candidate pool of classifiers, that is, convex combinations of the base classifiers. [sent-281, score-0.174]

66 However, in our belief, the price to pay for being conservative is unavoidable. [sent-283, score-0.221]

67 Even if we do not resort to convexification, getting the best insurance on type I error still demands a high premium on type II error. [sent-284, score-0.223]

68 This concern appears already in Cannon et al. (2002), where it was claimed without justification that if we use α′ < α for the empirical program, “it seems unlikely that we can control the estimation error R+(ĥ) − R+(h∗) in a distribution independent way”. [sent-286, score-0.133]

69 In particular, the excess type II risk of h∗(α − εn−), εn− > 0, does not converge to zero as sample sizes increase, even if εn− → 0. [sent-293, score-0.274]

70 In particular, with positive probability, the excess type II risk of ĥ(α − εn−), εn− > 0, does not converge to zero as sample sizes increase, even if εn− → 0. [sent-297, score-0.274]

71 In view of this negative result and our previous discussion, we have to accept the price to pay for being conservative on type I error, and our classifier h̃κ is no exception. [sent-303, score-0.312]

72 Chance Constrained Optimization. Implementing the Neyman-Pearson paradigm for convexified binary classification bears strong connections with chance constrained optimization. [sent-307, score-0.309]

73 A chance constrained optimization problem is of the following form: min_{λ∈Λ} f(λ) s.t. IP{F(λ, ξ) ≤ 0} ≥ 1 − α. (14) [sent-310, score-0.192]

74 Problem (14) may not be convex because the chance constraint {λ ∈ Λ : IP{F(λ, ξ) ≤ 0} ≥ 1 − α} is not convex in general and thus may not be tractable. [sent-320, score-0.351]

75 (2005) have derived sufficient conditions on the distribution of ξ for the chance constraint to be convex. [sent-322, score-0.139]

76 It amounts to solving the following convex optimization problem: min_{λ∈Λ, t∈R^s} f(λ) s. [sent-338, score-0.161]

77 The problem (17) provides a conservative convex approximation to (14), in the sense that every x feasible for (17) is also feasible for (14). [sent-341, score-0.298]

78 Nemirovski and Shapiro (2006) considered a particular class of conservative convex approximations where the key step is to replace IP{F(λ, ξ) ≥ 0} by IEϕ(F(λ, ξ)) in (14), where ϕ is a nonnegative, nondecreasing, convex function that takes value 1 at 0. [sent-342, score-0.33]
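
The same surrogate idea can be sketched for a generic chance constraint (an illustration only: the affine form F(λ, ξ) = ξ·λ − b, the scenario sample xi_samples, and the solver choice are assumptions). At the population level, IEϕ(F(λ, ξ)) ≤ α implies IP{F(λ, ξ) ≥ 0} ≤ α because ϕ dominates the indicator of {F ≥ 0}; with only a finite sample, the paper's point is that the empirical constraint must be strengthened, as in Theorem 3.

```python
import numpy as np
from scipy.optimize import minimize

def hinge(x):                      # nonnegative, nondecreasing, convex, phi(0) = 1
    return np.maximum(1.0 + x, 0.0)

def surrogate_chance_program(f, xi_samples, b, alpha, dim):
    """min f(lam) subject to (1/N) sum_i phi(F(lam, xi_i)) <= alpha, a convex
    stand-in for the chance constraint IP{F(lam, xi) > 0} <= alpha in (14)."""
    def slack(lam):
        return alpha - hinge(xi_samples @ lam - b).mean()
    constraints = [{'type': 'ineq', 'fun': slack}]
    return minimize(f, np.zeros(dim), method='SLSQP', constraints=constraints).x
```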

79 The idea of a conservative convex approximation is also what we employ in our paper. [sent-344, score-0.224]

80 Replacing R+(h_λ) by Rϕ+(h_λ) turns (18) into a standard chance constrained optimization problem: min_{λ∈Λ} Rϕ+(h_λ) s.t. R−(h_λ) ≤ α. [sent-350, score-0.192]

81 On the other hand, chance constrained optimization techniques in previous literature assume knowledge about the distribution of the random vector ξ. [sent-356, score-0.137]

82 Given a finite sample, it is not feasible to construct a strictly conservative approximation to the constraint in (19). [sent-358, score-0.192]

83 In retrospect, our approach to (19) is an innovative hybrid between the analytical approach based on convex surrogates and the scenario approach. [sent-360, score-0.134]

84 The following theorem summarizes our contribution to chance constrained optimization. [sent-377, score-0.172]

85 By standard symmetrization and contraction arguments (see, e.g., Koltchinskii, 2011, Chapter 2), we find IE Φ( sup_λ |(P̄n − P̄)(ϕ ◦ h_λ)| ) ≤ IE Φ( 2 sup_λ |Rn(ϕ ◦ h_λ)| ) ≤ IE Φ( 4L sup_λ |Rn(h_λ)| ). [sent-429, score-0.159]

86 Next, we have for any data-dependent classifier h ∈ H conv such that R̂ϕ−(h) ≤ ακ: Rϕ−(h) ≤ R̂ϕ−(h) + sup_{h∈H conv} |Rϕ−(h) − R̂ϕ−(h)| ≤ α − κ/√n− + sup_{h∈H conv} |Rϕ−(h) − R̂ϕ−(h)|. [sent-440, score-0.778]

87 Lemma 10 implies that, with probability 1 − δ, sup_{h∈H conv} |Rϕ−(h) − R̂ϕ−(h)| = sup_λ |(Pn− − P−)(ϕ ◦ h_λ)| ≤ κ/√n−. [sent-441, score-0.805]

88 Proof. First, it is clear that γ is a non-increasing function of α because, for α′ > α, {h_λ ∈ H conv : Rϕ−(h_λ) ≤ α} ⊂ {h_λ ∈ H conv : Rϕ−(h_λ) ≤ α′}. [sent-446, score-0.672]

89 Lemma 11 together with the assumption that H^{ϕ,α−ν0} ≠ ∅ imply that γ is a non-increasing convex real-valued function on [α − ν0, 1], so that γ(α − ν) − γ(α) ≤ ν sup_{g∈∂γ(α−ν)} |g|, where ∂γ(α − ν) denotes the sub-differential of γ at α − ν. [sent-459, score-0.159]

90 Moreover, since γ is a non-increasing convex function on [α − ν0, α − ν], it holds that γ(α − ν0) − γ(α − ν) ≥ (ν0 − ν) sup_{g∈∂γ(α−ν)} |g|. [sent-460, score-0.159]

91 Proof of Theorem 5. Define the events E− and E+ by E− = ∩_{h∈H conv} {|R̂ϕ−(h) − Rϕ−(h)| ≤ κ/√n−} and E+ = ∩_{h∈H conv} {|R̂ϕ+(h) − Rϕ+(h)| ≤ κ/√n+}. [sent-463, score-0.336]

92 Lemma 10 implies IP(E−) ∧ IP(E+) ≥ 1 − δ. [sent-464, score-0.336]

93 The l.h.s. of (9) can be decomposed as Rϕ+(h̃κ) − min_{h∈H^{ϕ,α}} Rϕ+(h) = A1 + A2 + A3, where A1 = Rϕ+(h̃κ) − R̂ϕ+(h̃κ) + R̂ϕ+(h̃κ) − min_{h∈H^{ϕ,ακ}_{n−}} Rϕ+(h), A2 = min_{h∈H^{ϕ,ακ}_{n−}} Rϕ+(h) − min_{h∈H^{ϕ,α_{2κ}}} Rϕ+(h), and A3 = min_{h∈H^{ϕ,α_{2κ}}} Rϕ+(h) − min_{h∈H^{ϕ,α}} Rϕ+(h). [sent-468, score-0.33]

94 Therefore, on the event E+ it holds that A1 ≤ 2κ/√n+. [sent-470, score-0.381]

95 The proof of (13) follows by observing that { Rϕ+(h̃κn) − min_{h∈H^{ϕ,α}} Rϕ+(h) > 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1 − ε)ᾱ √(n(1 − p))) } ⊂ (A1 ∩ A2^c) ∪ A2 ∪ A3, where A1 = { Rϕ+(h̃κn) − min_{h∈H^{ϕ,α}} Rϕ+(h) > 2κ/√N+ + 4ϕ(1)κ/((1 − ε)ᾱ √N−) }, A2 = {N− < n(1 − p)/2}, A3 = {N+ < np/2}. [sent-483, score-0.417]

96 Hence, we find IP( Rϕ+(h̃κn) − min_{h∈H^{ϕ,α}} Rϕ+(h) > 2√2 κ/√(np) + 4√2 ϕ(1)κ/((1 − ε)ᾱ √(n(1 − p))) ) ≤ 2δ + e^{−n(1−p)²/2} + e^{−np²/2}, which completes the proof of the corollary. [sent-486, score-0.395]

97 Proof of Proposition 8. Let the base classifiers be defined as h1(x) = −1 and h2(x) = 1I(x ≤ α) − 1I(x > α) for all x ∈ [0, 1]. For any λ ∈ [0, 1], denote the convex combination of h1 and h2 by h_λ = λh1 + (1 − λ)h2, that is, h_λ(x) = (1 − 2λ) 1I(x ≤ α) − 1I(x > α). [sent-500, score-0.142]

98 The next lemma provides a lower bound on the probability that a binomial distribution exceeds its expectation. [sent-530, score-0.142]

99 It is not hard to show that Pk(N ≥ k + 1) = IP(U_(k+1) ≤ k/n) = n (n−1 choose k) ∫_0^{k/n} t^k (1 − t)^{n−k−1} dt, and in the same manner, Pk−1(N ≥ k) = IP(U_(k) ≤ (k−1)/n) = n (n−1 choose k−1) ∫_0^{(k−1)/n} t^{k−1} (1 − t)^{n−k} dt. Note that (n−1 choose k−1) = [k/(n − k)] (n−1 choose k), [sent-567, score-0.204]

100 so that (29) follows if we prove k ∫_0^{(k−1)/n} t^{k−1} (1 − t)^{n−k} dt ≤ (n − k) ∫_0^{k/n} t^k (1 − t)^{n−k−1} dt. [sent-568, score-0.204]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ip', 0.441), ('conv', 0.336), ('np', 0.275), ('ie', 0.237), ('earson', 0.195), ('eyman', 0.195), ('igollet', 0.195), ('ers', 0.162), ('cannon', 0.152), ('convexi', 0.152), ('classi', 0.148), ('er', 0.135), ('lassification', 0.128), ('paradigm', 0.121), ('conservative', 0.118), ('excess', 0.117), ('minr', 0.114), ('hn', 0.106), ('convex', 0.106), ('hb', 0.103), ('chance', 0.102), ('dt', 0.102), ('ong', 0.099), ('type', 0.091), ('ii', 0.086), ('binomial', 0.086), ('scott', 0.069), ('princeton', 0.068), ('risk', 0.066), ('nq', 0.064), ('pn', 0.061), ('pq', 0.06), ('errors', 0.059), ('pay', 0.058), ('min', 0.055), ('sup', 0.053), ('nemirovski', 0.05), ('rigollet', 0.05), ('ir', 0.049), ('attitude', 0.049), ('cala', 0.049), ('conservativeness', 0.049), ('infr', 0.049), ('ird', 0.049), ('oracle', 0.046), ('shapiro', 0.046), ('surrogate', 0.045), ('event', 0.045), ('price', 0.045), ('hm', 0.043), ('strengthened', 0.042), ('irm', 0.042), ('cosh', 0.042), ('ore', 0.042), ('proposition', 0.041), ('error', 0.041), ('inf', 0.04), ('pk', 0.039), ('fix', 0.039), ('control', 0.038), ('feasible', 0.037), ('constraint', 0.037), ('philippe', 0.037), ('base', 0.036), ('constrained', 0.035), ('theorem', 0.035), ('treatment', 0.035), ('completes', 0.033), ('campi', 0.033), ('casasent', 0.033), ('howse', 0.033), ('infh', 0.033), ('lagoa', 0.033), ('novelty', 0.032), ('candidate', 0.032), ('proof', 0.032), ('empirical', 0.03), ('lemma', 0.029), ('consequences', 0.029), ('bernstein', 0.028), ('card', 0.028), ('surrogates', 0.028), ('xin', 0.028), ('probability', 0.027), ('concern', 0.026), ('rn', 0.026), ('binary', 0.026), ('nowak', 0.026), ('classical', 0.026), ('moreover', 0.025), ('lehmann', 0.025), ('bears', 0.025), ('hush', 0.025), ('anomaly', 0.025), ('valued', 0.025), ('solves', 0.024), ('copies', 0.024), ('unlikely', 0.024), ('testing', 0.023), ('initiated', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

2 0.10942667 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.069258414 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

4 0.062357657 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

Author: Stéphane Gaïffas, Guillaume Lecué

Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars

5 0.057465132 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

6 0.055655804 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.054510839 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

8 0.051747199 105 jmlr-2011-lp-Norm Multiple Kernel Learning

9 0.047307111 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

10 0.046957009 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

11 0.041893087 58 jmlr-2011-Learning from Partial Labels

12 0.041763145 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

13 0.041526105 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

14 0.040824343 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

15 0.040776335 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

16 0.040599857 35 jmlr-2011-Forest Density Estimation

17 0.038636185 97 jmlr-2011-Union Support Recovery in Multi-task Learning

18 0.037812635 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

19 0.034639835 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

20 0.031588323 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, -0.01), (2, 0.042), (3, 0.046), (4, 0.064), (5, -0.02), (6, -0.073), (7, -0.039), (8, -0.143), (9, -0.0), (10, -0.18), (11, -0.039), (12, -0.048), (13, -0.006), (14, -0.1), (15, 0.217), (16, -0.242), (17, -0.093), (18, -0.1), (19, -0.012), (20, 0.099), (21, -0.032), (22, 0.019), (23, -0.024), (24, 0.052), (25, -0.012), (26, -0.072), (27, 0.097), (28, 0.046), (29, 0.081), (30, 0.048), (31, -0.016), (32, -0.166), (33, 0.039), (34, 0.057), (35, -0.013), (36, 0.011), (37, -0.004), (38, 0.004), (39, 0.018), (40, 0.009), (41, 0.042), (42, -0.095), (43, 0.021), (44, 0.074), (45, -0.096), (46, -0.056), (47, 0.054), (48, -0.031), (49, -0.104)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95869267 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

2 0.72794467 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.57591969 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

4 0.53166753 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

5 0.50924206 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

Author: Stefano Melacci, Mikhail Belkin

Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization

6 0.47872576 58 jmlr-2011-Learning from Partial Labels

7 0.38639855 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

8 0.3769272 22 jmlr-2011-Differentially Private Empirical Risk Minimization

9 0.37487131 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

10 0.37485331 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

11 0.32901773 91 jmlr-2011-The Sample Complexity of Dictionary Learning

12 0.32105571 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

13 0.3106167 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

14 0.3098689 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

15 0.29948819 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

16 0.29622629 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

17 0.25902525 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

18 0.2532894 12 jmlr-2011-Bayesian Co-Training

19 0.25235516 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

20 0.25154456 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.067), (6, 0.017), (9, 0.059), (10, 0.029), (24, 0.041), (31, 0.126), (32, 0.035), (41, 0.034), (60, 0.01), (64, 0.372), (65, 0.012), (73, 0.021), (78, 0.084), (90, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69556081 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

2 0.42290765 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

3 0.4208478 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

4 0.41322625 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

5 0.41313517 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture

6 0.41173494 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

7 0.41154325 59 jmlr-2011-Learning with Structured Sparsity

8 0.41124564 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

9 0.40930012 12 jmlr-2011-Bayesian Co-Training

10 0.40713379 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

11 0.40549555 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

12 0.40549475 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

13 0.40546235 16 jmlr-2011-Clustering Algorithms for Chains

14 0.40489233 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

15 0.40412185 101 jmlr-2011-Variable Sparsity Kernel Learning

16 0.40181759 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

17 0.40155378 105 jmlr-2011-lp-Norm Multiple Kernel Learning

18 0.4012357 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

19 0.40052789 104 jmlr-2011-X-Armed Bandits

20 0.40008783 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation