jmlr jmlr2009 jmlr2009-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
Abstract: We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε2 / log(n/ε)). (The best previous result was Ω(ε/(n log(n/ε))1/4 ).) • Learning origin-centered halfspaces with respect to any isotropic log-concave distribution on Rn with malicious noise rate η = Ω(ε3 / log2 (n/ε)). This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε3 / log(1/ε)). In the adversarial label noise setting (or agnostic model), labels can be noisy, but not example points themselves. Previous results could handle η = Ω(ε) but had running time exponential in an unspecified function of 1/ε. Our analysis crucially exploits both concentration and anti-concentration properties of isotropic log-concave distributions. Our algorithms combine an iterative outlier removal procedure using Principal Component Analysis together with “smooth” boosting. Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. [sent-10, score-0.523]
2 Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. [sent-11, score-0.663]
3 We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε2 / log(n/ε)). [sent-12, score-0.585]
4 • Learning origin-centered halfspaces with respect to any isotropic logconcave distribution on Rn with malicious noise rate η = Ω(ε3 / log2 (n/ε)). [sent-14, score-0.837]
5 This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. [sent-15, score-0.573]
6 We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε3 / log(1/ε)). [sent-16, score-0.717]
7 Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers. [sent-21, score-0.465]
8 In the current paper we consider a much more challenging malicious noise model. [sent-37, score-0.333]
9 , 2008) or even simpler target functions (Mansour and Parnas, 1998) with malicious noise typically make strong assumptions about the underlying distribution D , and can learn to accuracy 1 − ε only for noise rates η much smaller than ε. [sent-44, score-0.475]
10 In this paper we consider learning under the uniform distribution on the unit ball in Rn , and more generally under any isotropic log-concave distribution. [sent-47, score-0.413]
11 Our algorithms can learn from malicious noise rates that are quite high, as we now describe. [sent-49, score-0.333]
12 We are not aware of any previous polynomial-time algorithms for learning under isotropic logconcave distributions in the presence of malicious noise. [sent-54, score-0.598]
13 The marginal distribution over Rn is assumed to be isotropic log-concave; so the idea is that an “adversary” chooses an η fraction of examples to mislabel, but unlike the malicious noise model she cannot change the (isotropic log-concave) distribution of the actual example points in Rn . [sent-59, score-0.75]
14 For the adversarial label noise model we prove: Theorem 3 There is a poly(n, 1/ε)-time algorithm that learns origin-centered halfspaces to accuracy 1 − ε with respect to any isotropic log-concave distribution over Rn and can tolerate adversarial label noise at rate η = Ω(ε3 / log(1/ε)). [sent-61, score-0.922]
15 1994) directly imply that halfspaces can be learned for any distribution over the domain in randomized poly(n,1/ε) time with malicious noise at a rate Ω(ε/n); the algorithm repeatedly picks a random subsample of the training data, hoping to miss all the noisy examples. [sent-67, score-0.581]
16 showed that if the distribution over the instances is uniform over the unit ball, the averaging algorithm tolerates adversarial label noise at a rate Ω(ε/ log(1/ε)) in poly(n,1/ε) time. [sent-76, score-0.373]
17 They also described an algorithm that fits low-degree polynomials and tolerates noise at a rate within an additive ε of the accuracy, but in poly(n^(1/ε^4)) time; for log-concave distributions, their algorithm took poly(n^(d(1/ε))) time, for an unspecified function d. [sent-78, score-0.359]
18 (2009) designed and analyzed an algorithm that performs principal component analysis when some of the examples are corrupted arbitrarily, as in the malicious noise model studied here. [sent-83, score-0.391]
19 Intuitively the “averaging” algorithm can only tolerate low malicious noise rates because the adversary can generate noisy examples which “pull” the average vector far from its true location. [sent-93, score-0.507]
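For reference, the "averaging" algorithm under discussion is simple enough to write down. The sketch below is our own illustration rather than the authors' code, and the Gaussian generator is only a stand-in for the uniform-ball and log-concave marginals considered in the paper; the hypothesis is sign(v · x) with v the average of y·x over the sample, the same vector v = (1/|S′|)·Σ yx analyzed in Lemma 12 below.

```python
import numpy as np

def averaging_learner(X, y):
    """Averaging algorithm: estimate the target normal as the mean of y_i * x_i."""
    return (y[:, None] * X).mean(axis=0)

def predict(v, X):
    """Origin-centered halfspace hypothesis h(x) = sign(v . x)."""
    return np.sign(X @ v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 20, 5000
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)                      # hidden target normal vector
    X = rng.normal(size=(m, n))                 # stand-in for an isotropic log-concave marginal
    y = np.sign(X @ u)                          # noiseless labels for this illustration
    v = averaging_learner(X, y)
    X_test = rng.normal(size=(2000, n))
    err = np.mean(predict(v, X_test) != np.sign(X_test @ u))
    print(f"test error of the averaging hypothesis: {err:.3f}")
```

A single adversarially placed cluster of far-out noisy points added to X would drag v away from u, which is exactly the failure mode described above.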
20 However, even if there were no noise in the data, the average of the positive examples under an isotropic log-concave 1. [sent-105, score-0.462]
21 To get around this, we show that after outlier removal the average of the positive examples gives a (real-valued) weak hypothesis that has some nontrivial predictive accuracy. [sent-110, score-0.355]
22 This is not entirely straightforward because boosting “skews” the distribution of examples; this has the undesirable effects of both increasing the effective malicious noise rate, and causing the distribution to no longer be isotropic log-concave. [sent-113, score-0.766]
23 Servedio (2002) did not consider a noisy scenario, and Servedio (2003) only considered the averaging algorithm without any outlier removal as the weak learner (and thus could only handle quite low rates of malicious noise in our isotropic log-concave setting). [sent-117, score-1.101]
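To make the outlier-removal idea concrete, here is a minimal sketch of a PCA-style pruning loop of the kind described above: repeatedly find the direction of largest empirical second moment and, if that direction looks suspiciously heavy, delete the points lying far out along it. This is our own illustrative code with placeholder thresholds passed in as parameters; in the paper the trigger and deletion thresholds are tied to the distribution (for the uniform ball they are on the order of 10 log(m)/n).

```python
import numpy as np

def remove_outliers(X, y, direction_threshold, point_threshold, max_rounds=50):
    """Iteratively prune points that stick out along high-variance directions.

    X: (m, n) array of examples, y: their labels.
    direction_threshold: trigger when the top eigenvalue of (1/m) X^T X exceeds it.
    point_threshold: delete points x with (a . x)^2 above it, where a is the
    top principal direction. Both thresholds are placeholders, not the paper's values.
    """
    for _ in range(max_rounds):
        second_moment = X.T @ X / len(X)
        eigvals, eigvecs = np.linalg.eigh(second_moment)   # eigenvalues in ascending order
        a = eigvecs[:, -1]                                  # direction of largest empirical variance
        if eigvals[-1] <= direction_threshold:
            break                                           # no direction looks corrupted; stop
        keep = (X @ a) ** 2 <= point_threshold
        if keep.all():
            break
        X, y = X[keep], y[keep]
    return X, y
```

Running the averaging step on the pruned (X, y) gives the weak learner whose analysis occupies the following sections.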
24 Tolerating Adversarial Label Noise: Finally, our results for learning under isotropic log-concave distributions with adversarial label noise are obtained using a similar approach. [sent-120, score-0.538]
25 A probability distribution over Rn is isotropic if the mean of the distribution is 0 and the covariance matrix is the identity, that is, E[xi x j ] = 1 for i = j and 0 otherwise. [sent-147, score-0.359]
26 It is well known that the uniform distribution over an arbitrary convex set, after a suitable linear transformation that puts it in isotropic position, is isotropic and log-concave. [sent-152, score-0.685]
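As a concrete illustration of putting a sample in approximately isotropic position, the standard whitening map x ↦ Σ^(−1/2)(x − μ) can be estimated from data; applied to a uniform sample from a skewed box (a convex set), the result has mean close to 0 and covariance close to the identity. This is our own sketch, assuming numpy, not anything from the paper.

```python
import numpy as np

def make_isotropic(X):
    """Whiten a sample: subtract the mean and multiply by Sigma^(-1/2)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Sigma)                      # Sigma is symmetric positive definite here
    Sigma_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return (X - mu) @ Sigma_inv_sqrt

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = np.array([[3.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.2]])
    X = rng.uniform(size=(20000, 3)) @ A              # uniform over a skewed parallelepiped
    Z = make_isotropic(X)
    print(np.round(Z.mean(axis=0), 3))                # approximately the zero vector
    print(np.round(np.cov(Z, rowvar=False), 3))       # approximately the identity matrix
```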
27 We will use the following facts: Lemma 4 (Lovász and Vempala 2007) Let D be an isotropic log-concave distribution over R^n and a ∈ S^(n−1) any direction. [sent-154, score-0.355]
28 Then for x drawn according to D , the distribution of a · x is an isotropic log-concave distribution over R. [sent-155, score-0.359]
29 Lemma 5 (Lovász and Vempala 2007) Any isotropic log-concave distribution D over R^n has light tails: Pr[||x|| > β√n] ≤ e^(−β+1). [sent-156, score-0.355]
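A quick numerical illustration of the flavor of Lemma 5 (our own check): it uses a standard Gaussian, which is just one particular isotropic log-concave distribution, so it probes plausibility rather than proving anything.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 200_000
X = rng.normal(size=(m, n))              # standard Gaussian: isotropic and log-concave
norms = np.linalg.norm(X, axis=1)
for beta in (1.5, 2.0, 2.5, 3.0):
    empirical = np.mean(norms > beta * np.sqrt(n))
    bound = np.exp(1 - beta)             # the e^(-beta+1) bound from Lemma 5
    print(f"beta={beta}:  empirical tail {empirical:.2e}   bound {bound:.2e}")
```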
30 Properties of the Clean Examples: In this subsection we establish properties of the clean examples that were sampled in Step 1 of Amu. [sent-188, score-0.524]
31 over a random draw of ℓ clean examples Sclean, we have max_{a∈S^(n−1)} (1/ℓ) ∑_{(x,y)∈Sclean} (a·x)² ≤ 1/n + O((n + log ℓ)/ℓ). [sent-192, score-0.671]
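The quantity maximized over directions in this lemma is exactly the largest eigenvalue of the empirical second-moment matrix (1/ℓ)·XᵀX, which is also how the outlier-removal step can locate the worst direction in practice. A small illustration of this identity (our own code, for points drawn uniformly on the unit sphere, where every direction has expected value 1/n):

```python
import numpy as np

rng = np.random.default_rng(3)
n, ell = 10, 4000
X = rng.normal(size=(ell, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform points on the unit sphere
M = X.T @ X / ell                                # empirical second-moment matrix
top = np.linalg.eigvalsh(M)[-1]                  # = max over unit a of (1/ell) sum (a.x)^2
print(f"max direction value {top:.4f}  vs  1/n = {1/n:.4f}")
```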
32 The next lemma says that in fact no direction has too many clean examples lying far out in that direction: Lemma 8 For any β > 0 and κ > 1, if Sclean is a random set of ℓ ≥ O(1)·n²β²·e^(β²n/2) / ((1+κ) ln(1+κ)) clean examples, then w. [sent-195, score-0.613]
33 we have max_{a∈S^(n−1)} (1/ℓ)·|{x ∈ Sclean : (a·x)² > β²}| ≤ (1+κ)·e^(−β²n/2). [sent-198, score-0.524]
34 What is Removed: In this section, we provide bounds on the number of clean and dirty examples removed in Step 3. [sent-202, score-0.912]
35 over the random draw of the m-element sample S, the number of clean examples removed during any one execution of Step 3 in Amu is at most 6n log m. [sent-207, score-0.733]
36 the number ℓ of clean examples is at least (say) m/2. [sent-212, score-0.524]
37 We would like to apply Lemma 8 with κ = 5ℓ⁴n log ℓ and β = √(10 log(m)/n), and indeed we may do this because we have O(1)·n²β²·e^(β²n/2) / ((1+κ) ln(1+κ)) ≤ O(1)·n(log m)·m⁵ / ((1+κ) ln(1+κ)) ≤ O(m/log m) ≤ m/2 ≤ ℓ for n sufficiently large. [sent-213, score-0.342]
38 Since clean points are only removed if they have (a·x)² > β², Lemma 8 gives us that the number of clean points removed is at most m(1+κ)·e^(−β²n/2) ≤ 6m⁵n log(ℓ)/m⁵ ≤ 6n log m. [sent-214, score-1.208]
39 It tells us that if examples are removed in Step 3, then there must be many dirty examples removed. [sent-216, score-0.466]
40 over the random draw of S, whenever Amu executes step 3, it removes at least 4m log(m)/n noisy examples from Sdirty, the set of dirty examples in S. [sent-221, score-0.649]
41 , for all a ∈ S^(n−1), the set Sclean of clean examples drawn in step 1 satisfies ∑_{(x,y)∈Sclean} (a·x)² ≤ 2m/n. (1) [sent-231, score-1.009]
42 Since S is reasonable, by (1) the contribution to the sum from the clean examples that survived to the current stage is at most 2m/n, so we must have ∑_{(x,y)∈Sdirty} (w·x)² ≥ 10m log(m)/n − 2m/n > 9m log(m)/n. [sent-235, score-0.549]
43 Since |N| ≤ |Sdirty| ≤ ηm (any dirty examples removed in earlier rounds will only reduce the size of Sdirty), we have ∑_{(x,y)∈N} (w·x)² ≤ (ηm)·10 log(m)/n, and so |F| ≥ ∑_{(x,y)∈F} (w·x)² ≥ 9m log(m)/n − (ηm)·10 log(m)/n ≥ 4m log(m)/n (the last inequality used the fact that η < 1/2). [sent-239, score-0.427]
44 over the draw of S, the halfspace with normal vector v = (1/|S′|)·∑_{(x,y)∈S′} yx has error rate O(√(η′ log m) + α·√n + √(n log(n)/m)). [sent-265, score-0.445]
45 Let vclean be the average of all the clean examples, v′ be the average of the dirty (noisy) examples that were not deleted (i. [sent-269, score-0.993]
46 , the dirty ′ examples in Sdirty ), and vdel be the average of the clean examples that were deleted. [sent-271, score-0.998]
47 Let us begin by exploiting the bound on the variance in every direction to bound the length of v′dirty. For any w ∈ S^(n−1) we know that [sent-273, score-0.52]
48 ∑_{(x,y)∈S′dirty} (w·x)² ≤ ∑_{(x,y)∈S′} (w·x)² ≤ 10m log(m)/n, since S′dirty ⊆ S′. [sent-274, score-0.44]
49 Taking w to be the unit vector in the direction of v′dirty, we have ||v′dirty|| = w · v′dirty = w · (1/|S′dirty|) ∑_{(x,y)∈S′dirty} yx ≤ (1/|S′dirty|) ∑_{(x,y)∈S′dirty} |w·x| ≤ √(10m log(m)/(|S′dirty| n)). [sent-276, score-1.234]
50 For the second term, Equation (2) gives us η′·||v′dirty|| ≤ √(10m(η′)² log(m)/(|S′dirty| n)) = √(10mη′ log(m)/(|S′| n)) ≤ √(20η′ log(m)/n), where for the last inequality we used |S′| ≥ m/2. [sent-280, score-0.668]
51 Similar to the above analysis, we again use Lemma 11 (but now the lower bound u · v ≥ Ω(1/√n)), Equation (2), and the fact that ||vdel|| ≤ 1. [sent-284, score-0.353]
52 Combining this with (4) and (3), we get Pr[hv ≠ f ] ≤ π·(√(20η′ log(m)/n) + α + O(√(log(n)/m))) / (c/√n − √(20η′ log(m)/n) − α) = O(√(η′ log m) + α·√n + √(n log(n)/m)). [sent-286, score-0.57]
53 each outlier removal stage removes at most 6n log m clean points. [sent-291, score-0.879]
54 Since, by Lemma 10, each outlier removal stage removes at least 4m log(m)/n noisy examples, there must be at most O(n/(log m)) such stages. [sent-292, score-0.47]
55 Consequently the total number of clean examples removed across all stages is O(n2 ). [sent-293, score-0.586]
56 the initial number of clean examples is at least 3m/4, this means that the final data set (on which the averaging algorithm is run) contains at least 3m/4 − O(n2 ) clean examples, and hence at least 3m/4 − O(n2 ) examples in total. [sent-297, score-1.085]
57 Consequently the value of α from Lemma 12 after the final outlier removal stage (the ratio of the total number of clean examples deleted, to the total number of surviving examples) is at most O(n²/m). [sent-299, score-0.824]
58 Isotropic Log-concave Distributions and Malicious Noise: Our algorithm Amlc that works for arbitrary isotropic log-concave distributions uses smooth boosting. [sent-312, score-0.362]
59 The Algorithm: Our algorithm for learning under isotropic log-concave distributions with malicious noise, Algorithm Amlc, applies the smooth booster from Lemma 13 with the following weak learner, which we call Algorithm Amlcw. [sent-331, score-0.642]
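The overall structure, a smooth booster driving the weak learner, can be sketched generically. The code below is our own simplified skeleton, not the booster of Lemma 13: it caps per-example weights at 1 and stops once the total weight drops below ε·m, which guarantees that the reweighted distribution handed to the weak learner is (1/ε)-smooth with respect to the uniform distribution over the sample, exactly the property that keeps the effective noise rate at η′ ≤ η/ε.

```python
import numpy as np

def smooth_boost(X, y, weak_learner, eps, rounds):
    """Generic smooth-boosting skeleton (illustrative only).

    weak_learner(X, y, p) must return a real-valued hypothesis h (a callable on
    arrays of examples) trained against the distribution p over the sample.
    """
    m = len(y)
    margins = np.zeros(m)
    hypotheses = []
    for _ in range(rounds):
        w = np.clip(1.0 - margins, 0.0, 1.0)   # capped weights: confident examples get weight ~0
        if w.sum() <= eps * m:                 # all but an eps fraction handled: stop boosting
            break
        p = w / w.sum()                        # while boosting runs, max(p) < 1/(eps*m): (1/eps)-smooth
        h = weak_learner(X, y, p)
        hypotheses.append(h)
        margins += y * h(X)                    # accumulate real-valued votes
    return lambda Xq: np.sign(sum(h(Xq) for h in hypotheses))
```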
60 an isotropic log-concave distribution D , η′ ≤ η/ε, and η ≤ ε2 Ω(ε3 / log2 (n/ε)). [sent-349, score-0.333]
61 Lemmas in Support of Lemma 14: In this section, let us consider a single call to the weak learner with an oracle EXη′(f, D′) where D′ is (1/ε)-smooth with respect to an isotropic log-concave distribution D and η′ ≤ η/ε. [sent-357, score-0.494]
62 all clean examples drawn in Step 1 of Algorithm Amlcw have ||x|| ≤ √(3n log m); indeed, for any given draw of x from D′, the probability that ||x|| > √(3n log m) is at most e/(εm³) by Lemma 5 together with the fact that D′ is 1/ε-smooth with respect to an i. [sent-362, score-0.785]
63 The following anticoncentration bound will be useful for proving that clean examples drawn from D ′ tend to be classified correctly with a large margin. [sent-388, score-0.551]
64 The next two lemmas are isotropic log-concave analogues of the uniform distribution Lemmas 7 and 8 respectively. [sent-394, score-0.376]
65 over a random draw of ℓ clean examples Sclean from D′, we have max_{a∈S^(n−1)} (1/ℓ) ∑_{(x,y)∈Sclean} (a·x)² ≤ O(1)·log²(1/ε) + n^(3/2)·log²(ℓ)/√ℓ. Proof. [sent-401, score-1.042]
66 The second lemma says that for a sufficiently large clean data set, w. [sent-406, score-0.552]
67 we have max_{a∈S^(n−1)} (1/ℓ)·|{x ∈ Sclean : |a·x| > β}| ≤ (1+κ)·(1/ε)·e^(−β+1), provided Sclean is a random set of ℓ ≥ O(1)·ε·e^β·(n ln(e^β/ε) + log m) / ((1+κ) ln(1+κ)) clean examples. [sent-412, score-0.524]
68 Lemma 5 implies that for the original isotropic log-concave distribution D , we have Pr [|a · x| > β] ≤ e−β+1 . [sent-414, score-0.333]
69 The following is an isotropic log-concave analogue of Corollary 9, establishing that not too many clean examples are removed in the outlier removal step: Corollary 19 W. [sent-417, score-1.126]
70 over the random draw of the m-element sample S from EXη′ ( f , D ′ ), the number of clean examples removed during any one execution of the outlier removal step (final substep of Step 2) in Algorithm Amlcw is at most 6mε3 /n4 . [sent-420, score-0.852]
71 the number ℓ of clean examples in S is at least (say) m/2. [sent-425, score-0.524]
72 We would like to apply Lemma 18 with κ = (n/ε)^(c0−4) and β = c0 log(n/ε), and we may do this since we have O(1)·ε·e^β·(n ln(e^β/ε) + log m) / ((1+κ) ln(1+κ)) ≤ O(1)·ε·(n/ε)^(c0)·n log(m) / ((n/ε)^(c0−4) log m) ≤ O(1)·n⁵/ε³ ≪ m/2 ≤ ℓ for a suitable fixed poly(n/ε) choice of m. [sent-426, score-0.362]
73 Since clean points are only removed if they have |a·x| ≥ β, Lemma 18 gives us that the number of clean points removed is at most m(1+κ)·(1/ε)·e^(−β+1) ≤ m·(6/ε)·(n/ε)^(c0−4)/(n/ε)^(c0) ≤ 6mε³/n⁴. [sent-427, score-1.094]
74 m(1 + κ) · e−β+1 ≤ m ε (n/ε)c0 The following lemma is an analogue of Lemma 10; it lower bounds the number of dirty examples that are removed in the outlier removal step. [sent-428, score-0.727]
75 for all a ∈ S^(n−1), the following holds for all the clean examples in the data before any examples were removed: ∑_{(x,y)∈Sclean} (a·x)² ≤ cm log²(1/ε) (6), where c is an absolute constant. [sent-442, score-0.563]
76 We will now show that, for any reasonable sample S, the number of noisy examples removed during the first execution of the outlier removal step of Amlcw is at least m/O(n). [sent-445, score-0.41]
77 Since S is reasonable, by (6) the contribution to the sum from the clean examples that have survived until this point is at most cm log²(1/ε), so we must have ∑_{(x,y)∈Sdirty} (w·x)² ≥ (c2 − c)·m·log²(n/ε). [sent-447, score-0.524]
78 Also as before, let vclean be the average of all the clean examples, v′dirty be the average of the dirty (noisy) examples that were not deleted, and vdel be the average of the clean examples that were deleted. [sent-461, score-1.66]
79 The expectation of vclean will play a special role in the analysis: v∗clean := Ex∼D′[f(x)x]. [sent-463, score-0.954]
80 Once again, we will demonstrate the limited effect of v′ dirty by bounding its length. [sent-464, score-0.344]
81 Applying this for the unit vector w in the direction of v′dirty as was done in Lemma 12, this implies ||v′dirty|| ≤ c0·log(n/ε)·√(m/|S′dirty|). [sent-466, score-0.712]
82 |Ex∼D′[f(x)(v′dirty · x)]| = |v′dirty · v∗clean| ≤ c0·log(n/ε)·√(m/|S′dirty|)·||v∗clean||. (8) [sent-468, score-1.463]
83 To show that vclean plays a relatively large role, it is helpful to lower bound the length of v∗clean. [sent-469, score-0.673]
84 We do this by lower bounding the length of its projection onto the unit normal vector u of the target as follows: v∗clean · u = Ex∼D′[(f(x)x) · u] = Ex∼D′[sgn(u·x)(x·u)] = Ex∼D′[|x·u|] ≥ ε/8 (9), by Lemma 16. [sent-470, score-1.044]
85 Armed with this bound, we can now lower bound the benefit imparted by vclean: Ez∼D′[f(z)(vclean · z)] = (1/|Sclean|) ∑_{(x,y)∈Sclean} Ez∼D′[y·f(z)(x·z)] = (1/|Sclean|) ∑_{(x,y)∈Sclean} (yx) · v∗clean. [sent-472, score-1.625]
86 Since E[(yx) · v∗clean] = ||v∗clean||², and (yx) · v∗clean ∈ [−3n log m, 3n log m], a Hoeffding bound implies that w. [sent-473, score-2.195]
87 Since the noise rate η′ is at most η/ε, and η is certainly less than ε/4 as discussed above, another Hoeffding bound gives that w. [sent-477, score-0.646]
88 |Sclean| is at least m/2; thus for a suitably large polynomial choice of m, using (9) we have Ez∼D′[f(z)(vclean · z)] ≥ ||v∗clean||² − O(n log^(3/2) m)/√(m/2) ≥ ||v∗clean||²/2. (10) [sent-480, score-0.999]
89 We bound each of the three contributions in turn. [sent-483, score-0.353]
90 First, using 1 − η′ ≥ 1/2 and (10), we have (1 − η′ + α)·E[f(x)(vclean · x)] ≥ ||v∗clean||²/4. Next, by (8), we have [sent-484, score-0.485]
91 Combining all these bounds, we get ||v∗clean||²/4 − ||v∗clean||²/8 − o(ε²) ≥ ε²/1024, by (9). [sent-488, score-0.97]
92 The Model: We now define the model of learning with adversarial label noise under isotropic log-concave distributions. [sent-493, score-0.517]
93 In this model the learning algorithm has access to an oracle that provides independent random examples drawn according to a fixed distribution P on Rn × {−1, 1}, where • the marginal distribution over Rn is isotropic log-concave, and • there is a halfspace f such that Pr(x,y)∼P[f(x) ≠ y] = η. [sent-494, score-0.528]
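For concreteness, a toy oracle with the shape of this model (our own sketch): it uses a Gaussian marginal as a stand-in for a general isotropic log-concave distribution, and random label flips as a crude stand-in for an adversarial choice of the η fraction of disagreements.

```python
import numpy as np

def label_noise_oracle(u, eta, m, rng):
    """Draw m examples (x, y) with isotropic marginal and disagreement rate ~ eta.

    u: unit normal of the target halfspace f(x) = sign(u . x).
    """
    n = len(u)
    X = rng.normal(size=(m, n))        # stand-in for an isotropic log-concave marginal
    y = np.sign(X @ u)
    flip = rng.random(m) < eta         # random flips; a real adversary could choose worse ones
    y[flip] *= -1
    return X, y

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    u = rng.normal(size=10)
    u /= np.linalg.norm(u)
    X, y = label_noise_oracle(u, eta=0.05, m=10000, rng=rng)
    print("observed disagreement rate:", np.mean(y != np.sign(X @ u)))
```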
94 Lemma 21 Suppose Algorithm Aalcw is run using P′ as the source of labeled examples, where P′ is a distribution that is (1/ε)-smooth with respect to a joint distribution P on Rn × {−1, 1} whose marginal D′ on Rn is isotropic and log-concave. [sent-509, score-0.359]
95 Furthermore, as we have assumed without loss of generality that ||x|| ≤ √(3n log m) for all examples in the training set, and therefore that ||v|| ≤ √(3n log m) (12), we have Ex∼D′[h(x)f(x)] = Ex∼D′[f(x)(x · v)]/(3n log m) (13), so we will work on bounding Ex∼D′[f(x)(x · v)]. [sent-536, score-0.36]
96 Let v∗dirty = E(x,y)∼P′dirty[yx] and v∗correct = Ex∼D′[f(x)x]. [sent-539, score-0.359]
97 Combining this with (16) and (14) we have that if η′ log(1/(η′ε)) ≤ cε² for a suitably small constant c, then Ex∼D′[f(x)(x · v)] is a sum of m i. [sent-552, score-0.355]
98 As a challenge for future work, we pose the following question: do there exist computationally efficient algorithms for learning halfspaces under arbitrary distributions in the presence of malicious noise? [sent-564, score-0.394]
99 We feel that even a small improvement in the malicious noise rate that can be handled for halfspaces would be a very interesting result. [sent-566, score-0.479]
100 Then if m ≥ c·(d log(1/α) + log(1/δ)) / (αK log K), where c is an absolute constant, we have Pr_{u∼D^m}[∃ f ∈ F : ED(f) ≤ α but Êu(f) > Kα] ≤ δ, where Êu(f) = (1/m)·∑_{i=1}^{m} f(ui). [sent-607, score-0.342]
wordName wordTfidf (topN-words)
[('clean', 0.485), ('dirty', 0.326), ('isotropic', 0.307), ('sdirty', 0.285), ('sclean', 0.252), ('malicious', 0.217), ('ex', 0.211), ('outlier', 0.144), ('vclean', 0.143), ('halfspaces', 0.128), ('noise', 0.116), ('log', 0.114), ('alicious', 0.109), ('ervedio', 0.109), ('livans', 0.109), ('vdel', 0.109), ('oise', 0.107), ('alfspaces', 0.101), ('pr', 0.1), ('poly', 0.1), ('removal', 0.089), ('halfspace', 0.084), ('yx', 0.082), ('noisy', 0.076), ('amlcw', 0.076), ('boosting', 0.074), ('sn', 0.073), ('servedio', 0.07), ('adversarial', 0.069), ('amu', 0.067), ('vcorrect', 0.067), ('lemma', 0.067), ('rn', 0.066), ('weak', 0.063), ('ong', 0.063), ('removed', 0.062), ('learner', 0.052), ('oracle', 0.046), ('aalcw', 0.042), ('surviving', 0.042), ('examples', 0.039), ('klivans', 0.038), ('hv', 0.038), ('unit', 0.038), ('averaging', 0.037), ('adversary', 0.036), ('smooth', 0.034), ('aweak', 0.034), ('ddirty', 0.034), ('lov', 0.034), ('vdirty', 0.034), ('draw', 0.033), ('kalai', 0.033), ('correct', 0.033), ('completes', 0.032), ('kearns', 0.031), ('suitably', 0.029), ('presence', 0.028), ('bound', 0.027), ('distribution', 0.026), ('earning', 0.026), ('label', 0.025), ('amlc', 0.025), ('blumer', 0.025), ('logconcave', 0.025), ('prx', 0.025), ('tolerates', 0.025), ('stage', 0.025), ('lemmas', 0.024), ('vempala', 0.023), ('tolerate', 0.023), ('agnostic', 0.023), ('hardness', 0.023), ('ball', 0.023), ('removes', 0.022), ('sz', 0.022), ('direction', 0.022), ('distributions', 0.021), ('perceptron', 0.021), ('hypothesis', 0.02), ('ln', 0.02), ('ez', 0.019), ('rocco', 0.019), ('uniform', 0.019), ('fraction', 0.019), ('principal', 0.019), ('bounding', 0.018), ('length', 0.018), ('rate', 0.018), ('variance', 0.018), ('sgn', 0.018), ('remove', 0.018), ('pca', 0.018), ('blum', 0.018), ('hoeffding', 0.017), ('arora', 0.017), ('dunagan', 0.017), ('dversarial', 0.017), ('erroneously', 0.017), ('pdirty', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
2 0.092302226 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
3 0.074447632 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
Author: Vitaly Feldman
Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning
4 0.052882899 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
5 0.048183657 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
6 0.034286946 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
8 0.033313509 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
9 0.032638334 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
10 0.030743968 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
11 0.030156719 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
12 0.030108608 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
13 0.026973514 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
14 0.026360091 29 jmlr-2009-Estimating Labels from Label Proportions
15 0.025145406 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
16 0.02488484 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
17 0.02382068 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
18 0.023165626 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)
19 0.023044394 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
20 0.022363689 81 jmlr-2009-Robust Process Discovery with Artificial Negative Events (Special Topic on Mining and Learning with Graphs and Relations)
topicId topicWeight
[(0, 0.122), (1, -0.022), (2, 0.023), (3, 0.02), (4, -0.104), (5, 0.033), (6, -0.079), (7, 0.051), (8, -0.044), (9, 0.076), (10, 0.114), (11, -0.073), (12, -0.184), (13, 0.128), (14, -0.184), (15, -0.244), (16, -0.044), (17, 0.112), (18, 0.134), (19, 0.086), (20, 0.113), (21, 0.04), (22, 0.116), (23, 0.083), (24, -0.081), (25, 0.065), (26, 0.05), (27, 0.015), (28, 0.085), (29, 0.048), (30, -0.04), (31, -0.016), (32, 0.079), (33, -0.09), (34, 0.156), (35, 0.138), (36, 0.17), (37, -0.238), (38, 0.145), (39, -0.018), (40, 0.108), (41, -0.057), (42, -0.104), (43, 0.052), (44, -0.201), (45, -0.11), (46, -0.009), (47, -0.058), (48, 0.257), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.9479472 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
2 0.5200147 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
3 0.46009475 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
Author: Vitaly Feldman
Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning
4 0.31893355 16 jmlr-2009-Classification with Gaussians and Convex Loss
Author: Dao-Hong Xiang, Ding-Xuan Zhou
Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation
5 0.25458893 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
7 0.19339384 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
8 0.17147385 48 jmlr-2009-Learning Nondeterministic Classifiers
9 0.1709004 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
10 0.15833297 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
11 0.14846273 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
12 0.14602795 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
13 0.14502808 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
14 0.14411145 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
15 0.14067234 50 jmlr-2009-Learning When Concepts Abound
16 0.14049222 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
17 0.1399336 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
18 0.12964539 29 jmlr-2009-Estimating Labels from Label Proportions
20 0.12014748 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
topicId topicWeight
[(8, 0.017), (11, 0.021), (19, 0.014), (26, 0.013), (38, 0.032), (39, 0.485), (47, 0.012), (52, 0.038), (55, 0.04), (58, 0.023), (66, 0.12), (68, 0.014), (90, 0.036), (93, 0.027), (96, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.68538445 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
2 0.56201428 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
Author: Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray, David Page
Abstract: A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze Skewing, a heuristic method that was developed to improve the ability of greedy decision tree algorithms to identify relevant variables of correlation immune Boolean functions, given examples drawn from the uniform distribution (Page and Ray, 2003). We present theoretical results revealing both the capabilities and limitations of skewing. Second, we explore the problem of identifying relevant variables in the Product Distribution Choice (PDC) learning model, a model in which the learner can choose product distributions and obtain examples from them. We prove a lemma establishing a property of Boolean functions that may be of independent interest. Using this lemma, we give two new algorithms for finding relevant variables of correlation immune functions in the PDC model. Keywords: correlation immune functions, skewing, relevant variables, Boolean functions, product distributions c 2009 Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray and David Page. H ELLERSTEIN , ROSELL , BACH , R AY AND PAGE
3 0.31420729 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
Author: Vitaly Feldman
Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning
4 0.27711987 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
5 0.27706113 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
6 0.27566427 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
7 0.27534845 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
8 0.27500793 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths
9 0.2741605 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
10 0.27306941 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
11 0.27232549 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
13 0.26971948 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
14 0.26837721 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
15 0.26808679 16 jmlr-2009-Classification with Gaussians and Convex Loss
16 0.26612666 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
17 0.26596805 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
18 0.26585475 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
19 0.26543084 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
20 0.2653181 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models