nips nips2011 nips2011-153 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Learning large-margin halfspaces with more malicious noise Rocco A. [sent-1, score-1.09]
2 Abstract We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). [sent-6, score-1.286]
3 Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). [sent-7, score-0.987]
4 Our algorithm does not work by optimizing a convex loss function. [sent-8, score-0.08]
5 We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. [sent-9, score-1.714]
6 In this paper we study the problem of learning an unknown γ-margin halfspace in the model of Probably Approximately Correct (PAC) learning with malicious noise at rate η. [sent-11, score-1.123]
7 More precisely, in this learning scenario the target function is an unknown origin-centered halfspace f (x) = sign(w · x) over the domain Rn (we may assume w. [sent-12, score-0.281]
8 (It may be helpful to think of the noisy examples as being constructed by an omniscient and malevolent adversary who has full knowledge of the state of the learning algorithm and previous draws from the oracle. [sent-19, score-0.265]
9 In particular, note that noisy examples need not satisfy the margin constraint and can lie arbitrarily close to, or on, the hyperplane w · x = 0. [sent-20, score-0.242]
10 (Because a success probability can be improved efficiently using standard repeat-and-test techniques [19], we follow the common practice of excluding this success probability from our analysis. [sent-22, score-0.062]
11 1 Introduced by Valiant in 1985 [30], the malicious noise model is a challenging one, as witnessed by the fact that learning algorithms can typically only withstand relatively low levels of malicious noise. [sent-24, score-1.428]
12 Interestingly, the original Perceptron algorithm [5, 26, 27] for learning a γ-margin halfspace can be shown to have relatively high tolerance to malicious noise. [sent-28, score-0.982]
13 Several researchers [14, 17] have established upper bounds on the number of mistakes that the Perceptron algorithm will make when run on a sequence of examples that are linearly separable with a margin except for some limited number of “noisy” data points. [sent-29, score-0.182]
14 2 of Auer and Cesa-Bianchi [3] yields a straightforward “PAC version” of the online Perceptron algorithm that can learn γ-margin halfspaces to accuracy 1 − ε in the presence of malicious noise provided that the malicious noise rate η is at most some value Θ(εγ). [sent-31, score-2.099]
15 We give a simple new algorithm for learning γ-margin halfspaces in the presence of malicious noise. [sent-35, score-1.009]
16 Like the earlier approaches, our algorithm runs in time poly(n, 1/γ, 1/ε); however, it goes beyond the Θ(εγ) malicious noise tolerance of previous approaches. [sent-36, score-0.898]
17 Our first main result is: Theorem 1 There is a poly(n, 1/γ, 1/ε)-time algorithm that can learn an unknown γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise at any rate η ≤ cεγ log(1/γ) whenever γ < 1/7, where c > 0 is a universal constant. [sent-37, score-1.319]
18 The algorithm of Theorem 1 is not based on convex optimization, and this is not a coincidence: our second main result is, roughly stated, the following. [sent-39, score-0.08]
19 Informal paraphrase of Theorem 2 Let A be any learning algorithm that chooses a hypothesis vector v so as to minimize a convex proxy for the binary misclassification error. [sent-40, score-0.309]
20 Then A cannot learn γ-margin halfspaces to accuracy 1 − ε in the presence of malicious noise at rate η ≥ cεγ, where c > 0 is a universal constant. [sent-41, score-1.309]
21 The algorithm of Theorem 1 is a modification of a boosting-based approach to learning halfspaces that is due to Balcan and Blum [7] (see also [6]). [sent-43, score-0.315]
22 [7] considers a weak learner which simply generates a random origin-centered halfspace sign(v · x) by taking v to be a uniform random unit vector. [sent-44, score-0.668]
23 The analysis of [7], which is for a noise-free setting, shows that such a random halfspace has probability Ω(γ) of having accuracy at least 1/2 + Ω(γ) with respect to D. [sent-45, score-0.365]
24 Given this, any boosting algorithm can be used to get a PAC algorithm for learning γ-margin halfspaces to accuracy 1 − ε. [sent-46, score-0.563]
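To make the Balcan–Blum fact above concrete, here is a small Monte-Carlo sketch (ours, not the paper's) that estimates how often a uniformly random halfspace attains a noticeable advantage on a synthetic γ-margin distribution; the particular synthetic distribution, dimension, and the γ/10 threshold are illustrative assumptions, since the paper's statement is distribution-free.

```python
import numpy as np

def random_unit_vectors(num, dim, rng):
    """Draw `num` vectors uniformly from the unit sphere in R^dim."""
    g = rng.normal(size=(num, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def synthetic_margin_sample(m, dim, gamma, rng):
    """Toy gamma-margin data: label points on the unit sphere by a fixed
    target w and keep only those with margin at least gamma (an assumption
    made purely for illustration)."""
    w = np.zeros(dim)
    w[0] = 1.0
    xs = random_unit_vectors(10 * m, dim, rng)
    xs = xs[np.abs(xs @ w) >= gamma][:m]
    return xs, np.sign(xs @ w)

def advantage(v, xs, ys):
    """Accuracy of sign(v . x) minus 1/2."""
    preds = np.where(xs @ v >= 0, 1.0, -1.0)
    return np.mean(preds == ys) - 0.5

rng = np.random.default_rng(0)
gamma = 0.05
xs, ys = synthetic_margin_sample(5000, 20, gamma, rng)
advs = np.array([advantage(v, xs, ys)
                 for v in random_unit_vectors(2000, 20, rng)])
# Empirical estimate of Pr[ random halfspace has advantage >= gamma/10 ].
print("fraction with advantage >= gamma/10:", np.mean(advs >= gamma / 10))
```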
25 Our algorithm is based on a modified weak learner which generates a collection of k = log(1/γ) independent random origin-centered halfspaces h1 = sign(v1 · x), . . . , hk = sign(vk · x) [sent-47, score-0.677]
26 and takes the majority vote H = Maj(h1 , . . . , hk ). [sent-50, score-0.142]
27 The crux of our analysis is to show that if there is no noise, then with probability at least (roughly) γ² the function H has accuracy at least 1/2 + Ω(γ√k) with respect to D (see Section 2, in particular Lemma 1). [sent-54, score-0.142]
28 By using this weak learner in conjunction with a “smooth” boosting algorithm as in [28], we get the overall malicious-noise-tolerant PAC learning algorithm of Theorem 1 (see Section 3). [sent-55, score-0.523]
29 For Theorem 2 we consider any algorithm that draws some number m of samples and minimizes a convex proxy for misclassification error. [sent-56, score-0.303]
30 We also establish the same fact about algorithms that use a regularizer from a class that includes the most popular regularizers based on p-norms. [sent-59, score-0.039]
31 As mentioned above, Servedio [28] gave a boosting-based algorithm that learns γ-margin halfspaces with malicious noise at rates up to η = Θ(εγ). [sent-61, score-1.133]
32 Khardon and Wachman [21] empirically studied the noise tolerance of variants of the Perceptron algorithm. [sent-62, score-0.245]
33 [22] showed that an algorithm that combines PCA-like techniques with smooth boosting can tolerate relatively high levels of malicious noise provided that the distribution D is sufficiently “nice” (uniform over the unit sphere or isotropic log-concave). [sent-64, score-1.186]
34 We previously [23] showed that any boosting algorithm that works by stagewise minimization of a convex “potential function” cannot tolerate random classification noise – this is a type of “benign” rather than malicious noise, which independently flips the label of each example with probability η. [sent-66, score-1.171]
35 A natural question is whether Theorem 2 follows from [23] by having the malicious noise simply simulate random classification noise; the answer is no, essentially because the ordering of quantifiers is reversed in the two results. [sent-67, score-0.83]
36 The construction and analysis from [23] crucially rely on the fact that in the setting of that paper, first the random misclassification noise rate η is chosen to take some particular value in (0, 1/2), and then the margin parameter γ is selected in a way that depends on η. [sent-68, score-0.29]
37 In contrast, in this paper the situation is reversed: in our setting first the margin parameter γ is selected, and then given this value we study how high a malicious noise rate η can be tolerated. [sent-69, score-0.898]
38 2 The basic weak learner for Theorem 1 Let f (x) = sign(w · x) be an unknown halfspace and D be an unknown distribution over the n-dimensional unit ball that has a γ margin with respect to f as described in Section 1. [sent-70, score-0.735]
39 For odd k ≥ 1 we let Ak denote the algorithm that works as follows: Ak generates k independent uniform random unit vectors v1 , . . . , vk in Rn [sent-71, score-0.173]
40 and outputs the hypothesis H(x) = Maj(sign(v1 · x), . . . , sign(vk · x)). [sent-74, score-0.178]
41 Note that Ak does not use any examples (and thus malicious noise does not affect its execution). [sent-78, score-0.876]
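A direct Python transcription of Ak as just defined — a sketch, under the assumption that ties sign(0) are broken towards +1, a detail the extraction does not specify.

```python
import numpy as np

def A_k(k, dim, rng):
    """The base weak learner A_k: draw k independent uniform random unit
    vectors v_1, ..., v_k and return H(x) = Maj(sign(v_1 . x), ..., sign(v_k . x)).
    A_k looks at no examples, so malicious noise cannot influence it."""
    vs = rng.normal(size=(k, dim))
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)

    def H(x):
        votes = np.where(vs @ np.asarray(x) >= 0, 1, -1)  # sign(0) -> +1 (assumption)
        return 1 if votes.sum() > 0 else -1               # k is odd, so no tie
    return H

# Example: one hypothesis with k = log(1/gamma) rounded up, as in the text.
rng = np.random.default_rng(1)
gamma = 0.05
H = A_k(int(np.ceil(np.log(1.0 / gamma))), dim=20, rng=rng)
print(H(np.ones(20) / np.sqrt(20)))
```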
42 A useful tail bound The following notation will be useful in analyzing algorithm Ak : let vote(γ, k) := Pr[X1 + · · · + Xk < k/2], where X1 , . . . , Xk are i.i.d. random variables that are each 1 with probability 1/2 + γ and 0 otherwise. [sent-81, score-0.045]
43 Clearly vote(γ, k) is the lower tail of a Binomial distribution, but for our purposes we need an upper bound on vote(γ, k) when k is very small relative to 1/γ² and the value of vote(γ, k) is close to but – crucially – less than 1/2. [sent-88, score-0.046]
44 Standard Chernoff-type bounds [10] do not seem to be useful here, so we give a simple self-contained proof of the bound we need (no attempt has been made to optimize constant factors below). [sent-89, score-0.048]
45 Lemma 2 For 0 < γ < 1/2 and odd k ≤ 1/(16γ²) we have vote(γ, k) ≤ 1/2 − γ√k/50. [sent-90, score-0.043]
46 Proof: The lemma is easily verified for k = 1, 3, 5, 7 so we assume k ≥ 9 below. [sent-91, score-0.033]
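A quick numeric companion to Lemma 2 (ours): it computes vote(γ, k) exactly as the lower tail of a Binomial(k, 1/2 + γ) — that Bernoulli(1/2 + γ) reading of the X_i is the assumption made in the reconstructed definition above — and prints it next to the bound 1/2 − γ√k/50.

```python
from math import comb, sqrt

def vote(gamma, k):
    """Exact Pr[ X_1 + ... + X_k < k/2 ] for i.i.d. X_i ~ Bernoulli(1/2 + gamma)."""
    p = 0.5 + gamma
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range((k + 1) // 2))

gamma = 0.05
for k in (1, 3, 9, 25):                      # odd k up to 1/(16*gamma^2) = 25
    bound = 0.5 - gamma * sqrt(k) / 50
    print(f"k={k:2d}  vote={vote(gamma, k):.4f}  bound={bound:.4f}")
```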
47 3 Proof of Theorem 1: smooth boosting the weak learner to tolerate malicious noise Our overall algorithm for learning γ-margin halfspaces with malicious noise, which we call Algorithm B, combines a weak learner derived from Section 2 with a “smooth” boosting algorithm. [sent-94, score-2.842]
48 Recall that boosting algorithms [15, 25] work by repeatedly running a weak learner on a sequence of carefully crafted distributions over labeled examples. [sent-95, score-0.537]
49 Given the initial distribution P over labeled examples (x, y), a distribution Pi over labeled examples is said to be κ-smooth if Pi [(x, y)] ≤ κ · P [(x, y)] for every (x, y) in the support of P. [sent-96, score-0.324]
50 Several boosting algorithms are known [9, 16, 28] that generate only 1/ε-smooth distributions when boosting to final accuracy 1 − ε. [sent-97, score-0.395]
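A small worked example (ours) of where 1/ε-smoothness comes from, using the κ-smoothness convention as reconstructed above (Pi[(x, y)] ≤ κ · P[(x, y)] pointwise).

```latex
% Conditioning on a not-too-small event is smooth: if P_i = P(\,\cdot \mid E)
% for an event E with P[E] \ge \varepsilon, then for every (x,y) in the support of P,
\[
  P_i[(x,y)] \;=\; \frac{P[(x,y)]\,\mathbf{1}\{(x,y)\in E\}}{P[E]}
             \;\le\; \frac{1}{\varepsilon}\,P[(x,y)],
\]
% so P_i is (1/\varepsilon)-smooth.  Smooth boosters guarantee the same kind of
% bounded pointwise reweighting for every distribution they hand to the weak learner.
```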
51 For concreteness we will use the MadaBoost algorithm of [9], which generates a (1 − ε)-accurate final hypothesis after O(1/(εγ²)) stages of calling the weak learner and runs in time poly(1/ε, 1/γ). [sent-98, score-0.488]
52 At a high level our analysis here is related to previous works [28, 22] that used smooth boosting to tolerate malicious noise. [sent-99, score-0.941]
53 The basic idea is that since a smooth booster does not increase the weight of any example by more than a 1/ε factor, it cannot “amplify” the malicious noise rate by more than this factor. [sent-100, score-0.929]
54 In [28] the weak learner only achieved advantage O(γ) so as long as the malicious noise rate was initially O(εγ), the “amplified” malicious noise rate of O(γ) could not completely “overcome” the advantage and boosting could proceed successfully. [sent-101, score-2.163]
55 Here we have a weak learner that achieves a higher advantage, so boosting can proceed successfully in the presence of more malicious noise. [sent-102, score-1.173]
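The bookkeeping behind this paragraph, written out as a schematic chain of inequalities (constants are suppressed, so this is our summary rather than the paper's exact calculation; η′ denotes the empirical fraction of noisy examples, as defined in the proof below).

```latex
% A 1/eps-smooth booster cannot inflate the total weight of the noisy examples S_N
% by more than an O(1/eps) factor:
\[
  P_t[S_N] \;\le\; O\!\Big(\tfrac{1}{\varepsilon}\Big)\, P[S_N]
           \;=\;   O\!\Big(\tfrac{\eta'}{\varepsilon}\Big)
           \;=\;   O\!\Big(\tfrac{\eta}{\varepsilon}\Big).
\]
% Weak learning under P_t therefore survives as long as this amplified noise rate is
% a small constant fraction of the weak learner's advantage:
\[
  O\!\Big(\tfrac{\eta}{\varepsilon}\Big) \;\le\; c \cdot \mathrm{adv}
  \quad\Longleftrightarrow\quad
  \eta \;\le\; c'\,\varepsilon \cdot \mathrm{adv},
\]
% which is exactly why the larger advantage of the new weak learner translates into
% a larger tolerable malicious noise rate.
```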
56 The weak learner W that B uses is a slight extension of algorithm Ak from Section 2 with k = log(1/γ) . [sent-104, score-0.332]
57 When invoked with distribution Pt over labeled examples, algorithm W makes some number (specified later) of calls to algorithm Ak with k = log(1/γ), generating candidate hypotheses H1 , H2 , . . . ; [sent-105, score-0.221]
58 it then uses M (specified later) independent examples drawn from Pt and outputs the Hj that makes the fewest errors on these examples. [sent-111, score-0.12]
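A sketch of W in the same style, reusing the A_k function from the earlier code block; the number of candidate hypotheses and the number M of evaluation examples are left as parameters because their values ("specified later") are elided in this extraction.

```python
import numpy as np

def weak_learner_W(xs, ys, weights, gamma, num_candidates, M, rng):
    """W: generate `num_candidates` majority-vote hypotheses via A_k with
    k = log(1/gamma) (rounded up), score each on M examples drawn from the
    booster's current distribution P_t (represented by `weights` over the
    sample), and return the candidate making the fewest errors."""
    k = int(np.ceil(np.log(1.0 / gamma)))
    idx = rng.choice(len(xs), size=M, p=weights)        # M draws from P_t
    eval_xs, eval_ys = xs[idx], ys[idx]

    best_H, best_errs = None, None
    for _ in range(num_candidates):
        H = A_k(k, xs.shape[1], rng)                    # A_k from the sketch above
        errs = sum(H(x) != y for x, y in zip(eval_xs, eval_ys))
        if best_errs is None or errs < best_errs:
            best_H, best_errs = H, errs
    return best_H
```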
59 Recall that we are assuming η ≤ cεγ log(1/γ); we will show that under this assumption, algorithm B outputs a final hypothesis h that satisfies Prx∼D [h(x) = f (x)] ≥ 1 − ε with probability at least 1/2. [sent-113, score-0.225]
60 First, let SN ⊆ S denote the noisy examples in S. [sent-114, score-0.126]
61 A standard Chernoff bound [10] implies that with probability at least 5/6 we have |SN |/|S| ≤ 2η; we henceforth write η′ to denote |SN |/|S|. [sent-115, score-0.058]
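The standard bound being invoked here, spelled out (a routine multiplicative Chernoff bound; we do not re-derive the specific 5/6 constant, which depends on the sample size used).

```latex
% |S_N| is a sum of m = |S| independent indicator variables, each equal to 1 with
% probability at most eta, so E[|S_N|] <= eta * m and the multiplicative Chernoff
% bound with deviation factor 2 gives
\[
  \Pr\big[\, |S_N| > 2\eta m \,\big] \;\le\; \exp\!\Big(-\tfrac{\eta m}{3}\Big),
\]
% i.e. eta' = |S_N|/|S| <= 2*eta except with probability exponentially small in eta*m.
```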
62 We will show below that with high probability, every time MadaBoost calls the weak learner W with a distribution Pt , W generates a weak hypothesis (call it ht ) that has Pr(x,y)∼Pt [ht (x) = y] ≥ 1/2 + Θ(γ log(1/γ)). [sent-116, score-0.707]
63 MadaBoost’s boosting guarantee then implies that the final hypothesis (call it h) of Algorithm B satisfies Pr(x,y)∼P [h(x) = y] ≥ 1 − ε/4. [sent-117, score-0.273]
64 Since h is correct on a (1 − ε/4) fraction of the points in the sample S and η′ ≤ 2η, h must be correct on at least a 1 − ε/4 − 2η fraction of the points in S \ SN , which is a noise-free sample of poly(n, 1/γ, 1/ε) labeled examples generated according to D. [sent-118, score-0.206]
65 Thus it remains to show that with high probability each time W is called on a distribution Pt , it indeed generates a weak hypothesis with advantage at least Ω(γ log(1/γ)). [sent-120, score-0.387]
66 Suppose R is the uniform distribution over the noisy points SN ⊆ S, and P ′ is the uniform distribution over the remaining points S \ SN (we may view P ′ as the “clean” version of P ). [sent-122, score-0.145]
67 Let Pt denote the distribution generated by MadaBoost during boosting stage t. [sent-124, score-0.217]
68 The smoothness of MadaBoost implies that Pt [SN ] ≤ 4η′/ε, so the noisy examples have total probability at most 4η′/ε under Pt . [sent-125, score-0.157]
69 By Lemma 1, each call to algorithm Ak (with k = log(1/γ)) yields a hypothesis (call it g) that, with probability at least Ω(γ²) over the random choice of g, satisfies errorPt (g) ≤ 1/2 − γ√(log(1/γ)/(100π)), where for any distribution Q we define errorQ (g) := Pr(x,y)∼Q [g(x) ≠ y]. [sent-127, score-0.151]
70 4 Convex optimization algorithms have limited malicious noise tolerance Given a sample S = {(x1 , y1 ), . . . , (xm , ym )} of labeled examples, [sent-132, score-0.897]
71 the number of examples misclassified by the hypothesis sign(v · x) is a nonconvex function of v, and thus it can be difficult to find a v that minimizes this error (see [12, 18] for theoretical results that support this intuition in various settings). [sent-135, score-0.312]
72 In an effort to bring the powerful tools of convex optimization to bear on various halfspace learning problems, a widely used approach is to instead minimize some convex proxy for misclassification error. [sent-136, score-0.491]
73 This definition allows algorithms to use regularization, but by setting the regularizer ψ to be the all-0 function it also covers algorithms that do not. [sent-138, score-0.039]
74 Definition 2 A function φ : R → R+ is a convex misclassification proxy if φ is convex, nonincreasing, differentiable, and satisfies φ′(0) < 0. [sent-139, score-0.183]
75 A function ψ : Rn → [0, ∞) is a componentwise regularizer if ψ(v) = τ (v1 ) + · · · + τ (vn ) for a convex, differentiable τ : R → [0, ∞) for which τ (0) = 0. [sent-140, score-0.076]
76 Given a sample of labeled examples S = {(x1 , y1 ), . . . , (xm , ym )} ∈ (Rn × {−1, 1})m , [sent-141, score-0.158]
77 the (φ,ψ)-loss of vector v on S is Lφ,ψ,S (v) := ψ(v) + φ(y1 (v · x1 )) + · · · + φ(ym (v · xm )). [sent-144, score-0.036]
78 A (φ,ψ)-minimizer is any learning algorithm that minimizes Lφ,ψ,S (v) whenever the minimum exists. [sent-145, score-0.057]
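A concrete instance of these definitions (ours, not the paper's setup): the logistic loss φ(z) = ln(1 + e^(−z)) is a valid convex misclassification proxy — convex, nonincreasing, differentiable, φ′(0) = −1/2 < 0 — and τ(t) = λt² gives a componentwise regularizer; the plain gradient-descent routine is only a sketch of a (φ,ψ)-minimizer, not the algorithm A of the theorem.

```python
import numpy as np

def phi(z):
    """Logistic proxy: convex, nonincreasing, differentiable, phi'(0) = -1/2 < 0."""
    return np.log1p(np.exp(-z))   # numerically fine for moderate margins (sketch only)

def phi_prime(z):
    return -1.0 / (1.0 + np.exp(z))

def loss(v, xs, ys, lam):
    """L_{phi,psi,S}(v) = psi(v) + sum_i phi(y_i (v . x_i)), with psi(v) = lam * sum_j v_j^2."""
    return lam * np.sum(v ** 2) + np.sum(phi(ys * (xs @ v)))

def minimize(xs, ys, lam=0.01, steps=5000, lr=0.01):
    """Plain gradient descent on the (phi, psi)-loss; step size may need tuning
    for larger samples."""
    v = np.zeros(xs.shape[1])
    for _ in range(steps):
        margins = ys * (xs @ v)
        grad = 2 * lam * v + xs.T @ (ys * phi_prime(margins))
        v -= lr * grad
    return v
```

Because φ and ψ are both convex, the loss is convex in v, so any reasonable first-order method finds its minimizer whenever one exists; this is exactly the class of algorithms Theorem 2 speaks about.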
79 Our main negative result shows that for any sample size, algorithms that minimize a regularized convex proxy for misclassification error will succeed with exponentially small probability for a malicious noise rate that is Θ(εγ), and therefore for any larger malicious noise rate. [sent-146, score-1.874]
80 Theorem 2 Fix φ to be any convex misclassification proxy and ψ to be any componentwise regularizer, and let algorithm A be a (φ,ψ)-minimizer. [sent-147, score-0.242]
81 Fix ε ∈ (0, 1/8] to be any error parameter, γ ∈ (0, 1/8] to be any margin parameter, and m ≥ 1 to be any sample size. [sent-148, score-0.077]
82 Proof: The analysis has two cases based on whether or not the number of examples m exceeds m0 := 1/(32εγ²). [sent-151, score-0.079]
83 (We emphasize that Case 2, in which n is taken to be just 2, is the case that is of primary interest, since in Case 1 the algorithm does not have enough examples to reliably learn a γ-margin halfspace even in a noiseless scenario. [sent-152, score-0.376]
84 Case 1 (m ≤ m0 ): Let n = 1/γ² and let e(i) ∈ Rn denote the unit vector with a 1 in the ith component. [sent-153, score-0.032]
85 The set E = {e(1) , . . . , e(n) } is shattered by the family F which consists of all 2^n halfspaces whose weight vectors are in {−γ, γ}n , and any distribution whose support is E is a γ-margin distribution for any such halfspace. [sent-157, score-0.343]
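A quick check (ours) of the two facts used in this step, treating 1/γ² as an integer.

```latex
% Every weight vector w in {-gamma, +gamma}^n with n = 1/gamma^2 is a unit vector:
\[
  \|w\|_2 \;=\; \sqrt{n\,\gamma^2} \;=\; \gamma\sqrt{n} \;=\; 1,
  \qquad\text{and}\qquad
  w \cdot e^{(i)} \;=\; \pm\gamma .
\]
% Hence each e^{(i)} is classified with margin exactly gamma, and because the n signs
% of w can be chosen independently, the 2^n halfspaces in F realize every labeling of
% E = {e^{(1)}, ..., e^{(n)}}, i.e. F shatters E while preserving a gamma margin.
```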
86 It is known (see, e.g., [31]) that O(1/(εγ²)) examples suffice to learn γ-margin n-dimensional halfspaces for any n if there is no noise, so noisy examples will play an important role in the construction in this case. [sent-162, score-0.523]
87 The target halfspace is f (x) = sign(√(1 − γ²) x1 + γx2 ). [sent-164, score-0.25]
88 When the malicious adversary is allowed to corrupt an example, with probability 1/2 it provides the point (1, 0) and mislabels it as negative, and with probability 1/2 it provides the point (0, 1) and mislabels it as negative. [sent-166, score-0.827]
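A sanity check (ours) that the adversary's two points are legitimate inputs for the reconstructed target f: both lie in the unit ball and are genuinely positive with margin at least γ, so the adversary's only power is the flipped label.

```latex
% The weight vector w = (sqrt(1 - gamma^2), gamma) has ||w||_2 = 1, and
\[
  w \cdot (0,1) \;=\; \gamma,
  \qquad
  w \cdot (1,0) \;=\; \sqrt{1-\gamma^2} \;\ge\; \gamma
  \quad (\text{using } \gamma \le 1/8),
\]
% so f labels both (1,0) and (0,1) positive with margin at least gamma.
```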
89 Let S = ((x1 , y1 ), . . . , (xm , ym )) be a sample of m examples drawn from EXη (f, D). [sent-167, score-0.1]
90 Using standard Chernoff bounds (see e.g. [10]) and a union bound we get Pr[pS,1 = 0 or pS,2 = 0 or pS,1 > 3ε or ηS,1 < η/4 or ηS,2 < η/4] ≤ (1 − 2ε(1 − η))^m + (1 − (1 − 2ε)(1 − η))^m + exp(−m/12) + 2 exp(−ηm/24) ≤ 2(1 − ε)^m + exp(−m/12) + 2 exp(−ηm/24) (since ε ≤ 1/4 and η ≤ 1/2) ≤ 2 exp(−1/(32γ²)) + exp(−1/(96γ²)) + 2 exp(−1/(48γ)). [sent-175, score-0.182]
91 32γ 2 96γ 2 48γ Since the theorem allows for a e−c/γ success probability for A, it suffices to consider the case in which pS,1 and pS,2 are both positive, pS,1 ≤ 3 , and min{ηS,1 , ηS,2 } ≥ η/4. [sent-176, score-0.072]
92 Since L is convex, this means that for each v2 ∈ R, the value of v1 that minimizes L(v1 , v2 ) is negative, i.e., v1 < 0. [sent-186, score-0.035]
93 So, if pS,1 · γ/√(1 − γ²) < ηS,1 , the linear classifier v output by Aφ has v1 ≤ 0; hence it misclassifies the point (γ/√(1 − γ²), 0), and thus has error rate at least 2ε with respect to D. [sent-187, score-0.072]
94 5 Conclusion It would be interesting to further improve on the malicious noise tolerance of efficient algorithms for PAC learning γ-margin halfspaces, or to establish computational hardness results for this problem. [sent-190, score-0.902]
95 Another goal for future work is to develop an algorithm that matches the noise tolerance of Theorem 1 but uses a single halfspace as its hypothesis representation. [sent-191, score-0.621]
96 Specification and simulation of statistical query algorithms for efficiency and noise tolerance. [sent-195, score-0.166]
97 Learning nested differences in the presence of malicious noise. [sent-199, score-0.694]
98 On-line learning with malicious noise and the closure algorithm. [sent-207, score-0.797]
99 A general lower bound on the number of examples needed for learning. [sent-251, score-0.079]
100 Generalization of a probability limit theorem of Cramér. [sent-264, score-0.072]
wordName wordTfidf (topN-words)
[('malicious', 0.631), ('halfspaces', 0.293), ('halfspace', 0.25), ('boosting', 0.169), ('noise', 0.166), ('learner', 0.162), ('madaboost', 0.16), ('weak', 0.148), ('poly', 0.14), ('misclassi', 0.126), ('proxy', 0.125), ('perceptron', 0.123), ('vote', 0.116), ('pt', 0.108), ('hypothesis', 0.104), ('pac', 0.103), ('tolerate', 0.094), ('prx', 0.092), ('pr', 0.091), ('examples', 0.079), ('tolerance', 0.079), ('sign', 0.077), ('ak', 0.075), ('sn', 0.071), ('errorpt', 0.068), ('presence', 0.063), ('draws', 0.063), ('servedio', 0.063), ('convex', 0.058), ('labeled', 0.058), ('accuracy', 0.057), ('margin', 0.056), ('adversary', 0.054), ('ex', 0.053), ('generates', 0.052), ('smooth', 0.047), ('noisy', 0.047), ('khardon', 0.046), ('rate', 0.045), ('odd', 0.043), ('outputs', 0.041), ('rn', 0.041), ('theorem', 0.041), ('booster', 0.04), ('maj', 0.04), ('mislabels', 0.04), ('bn', 0.04), ('kearns', 0.04), ('regularizer', 0.039), ('hyperplane', 0.037), ('componentwise', 0.037), ('klivans', 0.037), ('calls', 0.036), ('ym', 0.036), ('minimizes', 0.035), ('rocco', 0.034), ('log', 0.034), ('vk', 0.033), ('lemma', 0.033), ('reversed', 0.033), ('ht', 0.032), ('unit', 0.032), ('haussler', 0.031), ('nonincreasing', 0.031), ('colt', 0.031), ('unknown', 0.031), ('probability', 0.031), ('chernoff', 0.029), ('invoked', 0.029), ('blum', 0.029), ('hypotheses', 0.029), ('universal', 0.029), ('satis', 0.028), ('oracle', 0.027), ('least', 0.027), ('jmlr', 0.026), ('exp', 0.026), ('hardness', 0.026), ('hk', 0.026), ('learn', 0.025), ('bounds', 0.025), ('distribution', 0.025), ('xt', 0.025), ('uniform', 0.024), ('separating', 0.023), ('clean', 0.023), ('lie', 0.023), ('stage', 0.023), ('tail', 0.023), ('agnostic', 0.023), ('crucially', 0.023), ('proof', 0.023), ('recall', 0.023), ('dt', 0.023), ('cation', 0.022), ('algorithm', 0.022), ('auer', 0.022), ('learns', 0.021), ('xm', 0.021), ('sample', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
2 0.41072956 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
3 0.19114654 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.11714942 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
5 0.11214127 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
6 0.090060495 41 nips-2011-Autonomous Learning of Action Models for Planning
7 0.081149645 49 nips-2011-Boosting with Maximum Adaptive Sampling
8 0.079501912 299 nips-2011-Variance Penalizing AdaBoost
9 0.066448644 80 nips-2011-Efficient Online Learning via Randomized Rounding
10 0.065960392 22 nips-2011-Active Ranking using Pairwise Comparisons
11 0.062960483 100 nips-2011-Gaussian Process Training with Input Noise
12 0.062854417 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
13 0.059976179 162 nips-2011-Lower Bounds for Passive and Active Learning
14 0.059089154 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
15 0.058408633 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
16 0.057725646 28 nips-2011-Agnostic Selective Classification
17 0.056527432 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
18 0.056137886 186 nips-2011-Noise Thresholds for Spectral Clustering
19 0.055525817 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
20 0.0539735 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
topicId topicWeight
[(0, 0.164), (1, -0.086), (2, -0.047), (3, -0.1), (4, 0.035), (5, 0.082), (6, 0.026), (7, -0.135), (8, -0.289), (9, 0.003), (10, -0.027), (11, 0.028), (12, 0.047), (13, -0.07), (14, 0.123), (15, -0.085), (16, 0.115), (17, -0.113), (18, -0.201), (19, -0.241), (20, -0.003), (21, -0.008), (22, -0.076), (23, -0.069), (24, 0.097), (25, -0.099), (26, 0.1), (27, 0.041), (28, 0.057), (29, -0.022), (30, -0.056), (31, 0.047), (32, -0.071), (33, -0.007), (34, -0.007), (35, -0.015), (36, 0.03), (37, 0.169), (38, 0.028), (39, -0.038), (40, 0.033), (41, -0.015), (42, -0.074), (43, -0.026), (44, -0.053), (45, 0.017), (46, -0.006), (47, -0.088), (48, 0.083), (49, 0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.94458961 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
2 0.92656201 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
3 0.7417984 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.56876415 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
5 0.50937641 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
6 0.45417535 41 nips-2011-Autonomous Learning of Action Models for Planning
7 0.38823146 49 nips-2011-Boosting with Maximum Adaptive Sampling
8 0.33737418 162 nips-2011-Lower Bounds for Passive and Active Learning
9 0.32938275 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
10 0.32851303 95 nips-2011-Fast and Accurate k-means For Large Datasets
11 0.31656384 241 nips-2011-Scalable Training of Mixture Models via Coresets
12 0.31195268 21 nips-2011-Active Learning with a Drifting Distribution
13 0.29821426 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
14 0.29096308 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
15 0.2876305 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
16 0.27384368 73 nips-2011-Divide-and-Conquer Matrix Factorization
17 0.27369842 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
18 0.2710737 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
19 0.26826045 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
20 0.26617813 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
topicId topicWeight
[(0, 0.03), (4, 0.063), (20, 0.05), (26, 0.036), (31, 0.062), (43, 0.117), (45, 0.121), (57, 0.038), (65, 0.013), (74, 0.041), (83, 0.033), (87, 0.256), (99, 0.05)]
simIndex simValue paperId paperTitle
1 0.8832776 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
same-paper 2 0.7947101 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
3 0.74887097 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
4 0.73648322 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N Iρt (S; A)), where Iρt (S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information Iρt (S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when Iρt (S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
5 0.67134845 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.62286669 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
7 0.62199634 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
8 0.6175552 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
9 0.61686075 258 nips-2011-Sparse Bayesian Multi-Task Learning
10 0.61527449 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.61484802 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.61418444 80 nips-2011-Efficient Online Learning via Randomized Rounding
13 0.61407691 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
14 0.61335266 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
15 0.61298859 199 nips-2011-On fast approximate submodular minimization
16 0.61180794 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
17 0.6116876 265 nips-2011-Sparse recovery by thresholded non-negative least squares
18 0.61126167 186 nips-2011-Noise Thresholds for Spectral Clustering
19 0.61050111 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
20 0.60986197 109 nips-2011-Greedy Model Averaging