nips nips2011 nips2011-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Algorithms and hardness results for parallel large margin learning Rocco A. [sent-1, score-0.441]
2 Abstract We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. [sent-6, score-0.709]
3 Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. [sent-7, score-1.157]
4 We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). [sent-8, score-0.62]
5 In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. [sent-9, score-0.718]
6 (Throughout this paper we refer to such a combination of target halfspace f and distribution D as a γ-margin halfspace. [sent-13, score-0.307]
7 ) The learning algorithm is given access to labeled examples (x, f(x)) where each x is independently drawn from D, and it must with high probability output a hypothesis h : R^n → {−1, 1} that satisfies Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε. [sent-14, score-0.367]
8 Learning a large-margin halfspace is a fundamental problem in machine learning; indeed, one of the most famous algorithms in machine learning is the Perceptron algorithm [25] for this problem. [sent-15, score-0.307]
9 PAC algorithms based on the Perceptron [17] run in poly(n, 1/γ, 1/ε) time, use O(1/(εγ²)) labeled examples in R^n, and learn an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε. [sent-16, score-0.531]
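For reference, the sequential baseline is simple enough to sketch in a few lines. The following is a minimal Perceptron pass over a labeled sample, not the parallel algorithm of this paper; the function and parameter names are illustrative. The classical mistake bound guarantees at most 1/γ² updates when the sample has margin γ.

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Minimal Perceptron sketch. X is an (m, n) array of points in the unit
    ball, y is an (m,) array of +/-1 labels. Repeatedly updates on mistakes;
    stops once the sample is separated (or after max_passes)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:  # mistake: additive update
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w
```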
10 The last few years have witnessed a resurgence of interest in highly efficient parallel algorithms for a wide range of computational problems in many areas including machine learning [33, 32]. [sent-18, score-0.393]
11 So a natural goal is to develop an efficient parallel algorithm for learning γ-margin halfspaces that matches the performance of the Perceptron algorithm. [sent-19, score-0.476]
12 A well-established theoretical notion of efficient parallel computation is that an efficient parallel algorithm for a problem with input size N is one that uses poly(N ) processors and runs in parallel time polylog(N ), see e. [sent-20, score-1.483]
13 Main Question: Is there a learning algorithm that uses poly(n, 1/γ, 1/ε) processors and runs in time poly(log n, log(1/γ), log(1/ε)) to learn an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε? [sent-24, score-0.85]
14 (See [31] for a detailed definition of parallel learning algorithms; here we only recall that an efficient parallel learning algorithm’s hypothesis must be efficiently evaluatable in parallel. [sent-25, score-0.912]
15 Table 1 summarizes the running time and number of processors used by various parallel algorithms to learn a γ-margin halfspace over Rn . [sent-29, score-0.899]
16 We do not see how to obtain parallel time bounds better than O(1/γ²) from recent analyses of other algorithms based on gradient descent (such as [7, 8, 4]), some of which use assumptions incomparable in strength to the γ-margin condition studied here. [sent-31, score-0.393]
17 It boosts for O(1/γ²) stages over an O(1/γ²)-size sample; using one processor for each coordinate of each example, the running time bound is Õ(1/γ²) · log n, using poly(n, 1/γ) processors. [sent-33, score-0.302]
18 ) The third line of the table, included for comparison, is simply a standard sequential algorithm for learning a halfspace based on polynomial-time linear programming executed on one processor, see e. [sent-35, score-0.395]
19 Efficient parallel algorithms have been developed for some simpler PAC learning problems such as learning conjunctions, disjunctions, and symmetric Boolean functions [31]. [sent-38, score-0.393]
20 [6] gave efficient parallel PAC learning algorithms for some geometric constant-dimensional concept classes. [sent-39, score-0.418]
21 Our main positive result is a parallel algorithm that uses poly(n, 1/γ) processors to learn γ-margin halfspaces in parallel time Õ(1/γ) + O(log n) (see Table 1). [sent-43, score-1.092]
22 Our analysis can be modified to establish similar positive results for other formulations of the large-margin learning problem, including ones (see [28]) that have been tied closely to weak learnability (these modifications are not presented due to space constraints). [sent-45, score-0.34]
23 In contrast, our main negative result is an information-theoretic argument that suggests that such positive parallel learning results cannot be obtained by boosting alone. [sent-46, score-0.684]
24 We show that if the weak learner must be called as an oracle, boosting cannot be parallelized: any parallel booster must perform Ω(1/γ²) sequential stages of boosting a “black-box” γ-advantage weak learner in the worst case. [sent-47, score-2.404]
25 This extends an earlier lower bound of Freund [10] for standard (sequential) boosters that can only call the weak learner once per stage. [sent-48, score-0.602]
26 2 A parallel algorithm for learning γ-margin halfspaces over Bn Our parallel algorithm is an amalgamation of existing tools from high-dimensional geometry, convex optimization, parallel algorithms for linear algebra, and learning theory. [sent-49, score-1.35]
27 Roughly speaking, the algorithm works as follows: given a data set of m = Õ(1/γ²) labeled examples from Bn × {−1, 1}, it begins by randomly projecting the examples down to d = Õ(1/γ²) dimensions. [sent-50, score-0.223]
28 This essentially preserves the geometry so the resulting d-dimensional labeled examples are still linearly separable with margin Θ(γ). [sent-51, score-0.222]
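A minimal sketch of this projection step, assuming a dense Gaussian projection matrix (the paper's choice of random matrix A may differ); the names are ours. Each example x in Bn is mapped to (1/√d)·xA, which preserves norms and pairwise margins up to constant factors with high probability.

```python
import numpy as np

def random_project(X, d, seed=None):
    """Johnson-Lindenstrauss style projection: map the rows of X (shape (m, n))
    to d dimensions using a random Gaussian matrix A, scaled by 1/sqrt(d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.standard_normal((n, d))
    return (X @ A) / np.sqrt(d)
```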
29 The algorithm then uses a variant of a linear programming algorithm of Renegar [24, 21] which, roughly speaking, solves linear programs with m constraints to high accuracy using (essentially) √m stages of Newton’s method. [sent-52, score-0.26]
30 Within Renegar’s algorithm we employ fast parallel algorithms for linear algebra [22] to carry out each stage of Newton’s method in polylog(1/γ) parallel time steps. [sent-53, score-0.919]
31 This suffices to learn the unknown halfspace to high constant accuracy (say 9/10); to get a 1 − ε-accurate hypothesis we combine the above procedure with Freund’s approach [10] for boosting accuracy that was mentioned in the introduction. [sent-54, score-0.833]
32 In the rest of this section we address the necessary details in full and prove the following theorem: Theorem 1 There is a parallel algorithm with the following performance guarantee: Let f, D define an unknown γ-margin halfspace over Bn as described in the introduction. [sent-56, score-0.733]
33 The algorithm is given as input ε, δ > 0 and access to labeled examples (x, f(x)) that are drawn independently from D. [sent-57, score-0.216]
34 It runs in O(((1/γ)polylog(1/γ) + log(n)) log(1/ε) + log log(1/δ)) time, uses poly(n, 1/γ, 1/ε, log(1/δ)) processors, and with probability 1 − δ it outputs a hypothesis h satisfying Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε. [sent-58, score-0.244]
35 Given such an A and a unit vector w ∈ R^n (recall that the target halfspace f is f(x) = sign(w · x)), let w′ denote (1/√d)wA. [sent-64, score-0.307]
36 We will use the following lemma from [1]: Lemma 1 [1] Let f (x) = sign(w · x) and D define a γ-margin halfspace as described in the introduction. [sent-66, score-0.401]
37 Let F be the convex barrier function F(u) = Σ_{i=1}^{d} log(1/((u_i − a_i)(b_i − u_i))) (we specify the values of a_i, b_i below); its gradient g(u) has components g(u)_i = 1/(b_i − u_i) − 1/(u_i − a_i). [sent-70, score-0.251]
38 , b_d ∈ [−2, 2], |b_i − a_i| ≥ 2 for all i, and L is a subspace of R^d: minimize −u_1 such that u ∈ L and a_i ≤ u_i ≤ b_i for all i. [sent-82, score-0.26]
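A small sketch of the box barrier and its gradient as reconstructed above; the helper names are ours, and the subspace constraint u ∈ L (and everything else in Renegar's path-following scheme) is not handled here.

```python
import numpy as np

def barrier(u, a, b):
    """F(u) = sum_i log(1 / ((u_i - a_i)(b_i - u_i))), defined for a < u < b."""
    return -np.sum(np.log((u - a) * (b - u)))

def barrier_grad(u, a, b):
    """Gradient components g(u)_i = 1/(b_i - u_i) - 1/(u_i - a_i)."""
    return 1.0 / (b - u) - 1.0 / (u - a)
```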
39 The following key lemma shows that rounding intermediate solutions does not do too much harm: Lemma 4 For any k, if Fηk (w(k) ) ≤ optηk + 1/9, then Fηk (r(k) ) ≤ optηk + 1. [sent-106, score-0.219]
40 Recalling that g(u)_i = 1/(b_i − u_i) − 1/(u_i − a_i), this means that ||g_η(s)|| ≤ η + √d · 2^{2η+2d+1}, so that applying (5) we get F_η(r) − F_η(w) ≤ ||w − r|| · (η + √d · 2^{2η+2d+1}). [sent-132, score-0.152]
41 We will use an algorithm due to Reif [22]: Lemma 5 ([22]) There is a polylog(d, L)-time, poly(d, L)-processor parallel algorithm which, given as input a d × d matrix A with rational entries of total bit-length L, outputs A^{−1}. [sent-135, score-0.539]
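Reif's procedure itself is intricate. As a loose illustration of the underlying idea that matrix inversion can be reduced to a short sequence of matrix multiplications (each of which parallelizes well), here is the Newton-Schulz iteration; this is not Reif's algorithm and its guarantees are different.

```python
import numpy as np

def newton_schulz_inverse(A, iters=60):
    """Approximate A^{-1} via X_{k+1} = X_k (2I - A X_k). Starting from
    X_0 = A^T / (||A||_1 ||A||_inf) guarantees convergence for invertible A,
    and every step is just matrix multiplication."""
    n = A.shape[0]
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(iters):
        X = X @ (2 * I - A @ X)
    return X
```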
42 Let A be a parallel learning algorithm such that for all D′ with support(D′) ⊆ support(D), given draws (x, f(x)) from D′, with probability 9/10 A outputs a hypothesis with accuracy 9/10 (w. [sent-139, score-0.654]
43 Then there is a parallel algorithm B that with probability 1 − δ constructs a (1 − ε)-accurate hypothesis (w. [sent-143, score-0.568]
44 Given x it is easy to compute Φ_A(x) using O(n log(1/γ)/γ²) processors in O(log(n/γ)) time. [sent-156, score-0.223]
45 (6) The algorithm next draws m = c log(1/γ)/γ² labeled training examples (Φ_A(x), f(x)) from D′; this can be done in O(log(n/γ)) time using O(n) · poly(1/γ) processors as noted above. [sent-160, score-0.387]
46 It then applies Acpr to find a d-dimensional halfspace h that classifies all m examples correctly (more on this below). [sent-161, score-0.375]
47 Now the standard VC bound for halfspaces [30] applied to h and D′ implies that since h classifies all m examples correctly, with overall probability at least 9/10 its accuracy is at least 9/10 with respect to D′, i. [sent-169, score-0.219]
48 So the hypothesis h ◦ ΦA has accuracy 9/10 with respect to D with probability 9/10 as required by Lemma 6. [sent-172, score-0.201]
49 , (x_m, y_m) ∈ Bd × {−1, 1} satisfying the above conditions, we will apply algorithm Acpr to the following linear program, called LP, with α = γ′/2: “minimize −s such that y_t(v · x_t) − s_t = s and 0 ≤ s_t ≤ 2 for all t ∈ [m]; −1 ≤ v_i ≤ 1 for all i ∈ [d]; and −2 ≤ s ≤ 2. [sent-178, score-0.187]
50 ) Lemma 7 Given any d-dimensional linear program in the form (1), and an initial solution u ∈ L such that min{|u_i − a_i|, |u_i − b_i|} ≥ 1 for all i, Algorithm Acpr approximates the optimal solution to an additive ±α. [sent-183, score-0.147]
51 It runs in √d · polylog(d/α) parallel time and uses poly(1/α, d) processors. [sent-184, score-0.45]
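As a concrete but purely sequential stand-in for running Acpr on the linear program LP stated above, the following sketch sets it up for SciPy's generic LP solver; the variable ordering and function name are ours, and it of course provides none of the parallel-time guarantees of Lemma 7.

```python
import numpy as np
from scipy.optimize import linprog

def solve_margin_lp(X, y):
    """LP from the text: variables [s, v_1..v_d, s_1..s_m]; minimize -s
    subject to y_t (v . x_t) - s_t = s, 0 <= s_t <= 2, -1 <= v_i <= 1,
    -2 <= s <= 2. Returns the halfspace direction v and the margin value s."""
    m, d = X.shape
    nvar = 1 + d + m
    c = np.zeros(nvar)
    c[0] = -1.0                                       # objective: minimize -s
    A_eq = np.zeros((m, nvar))
    A_eq[:, 0] = -1.0                                 # -s
    A_eq[:, 1:1 + d] = y[:, None] * X                 # y_t (v . x_t)
    A_eq[np.arange(m), 1 + d + np.arange(m)] = -1.0   # -s_t
    b_eq = np.zeros(m)
    bounds = [(-2, 2)] + [(-1, 1)] * d + [(0, 2)] * m
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[1:1 + d], res.x[0]
```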
52 For k = 1, since initially mini {|ui − ai |, |ui − bi |} ≥ 1, we have F (u) ≤ 0, and, since η1 = 1 and u1 ≥ −1 we have Fη1 (u) ≤ 1 and optη1 ≥ −1. [sent-190, score-0.147]
53 3 Lower bound for parallel boosting in the oracle model Boosting is a widely used method for learning large-margin halfspaces. [sent-200, score-0.741]
54 In this section we consider the question of whether boosting algorithms can be efficiently parallelized. [sent-201, score-0.318]
55 We work in the original PAC learning setting [29, 16, 26] in which a weak learning algorithm is provided as an oracle that is called by the boosting algorithm, which must simulate a distribution over labeled examples for the weak learner. [sent-202, score-1.094]
56 Our main result for this setting is that boosting is inherently sequential; being able to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of sequential boosting stages that are required. [sent-203, score-2.036]
57 In fact we show this in a very strong sense, by proving that a boosting algorithm that runs arbitrarily many copies of the weak learner in parallel in each stage cannot save even one stage over a sequential booster that runs the weak learner just once in each stage. [sent-204, score-2.278]
58 Below we first define the parallel boosting framework and give some examples of parallel boosters. [sent-206, score-1.136]
59 We then state and prove our lower bound on the number of stages required by parallel boosters. [sent-207, score-0.552]
60 A consequence of our lower bound is that Ω(log(1/ε)/γ²) stages of parallel boosting are required in order to boost a γ-advantage weak learner to achieve classification accuracy 1 − ε no matter how many copies of the weak learner are used in parallel in each stage. [sent-208, score-2.282]
61 Our definition of weak learning is standard in PAC learning, except that for our discussion it suffices to consider a single target function f : X → {−1, 1} over a domain X. [sent-209, score-0.328]
62 Definition 1 A γ-advantage weak learner L is an algorithm that is given access to a source of independent random labeled examples drawn from an (unknown and arbitrary) probability distribution P over {(x, f(x))}_{x∈X}. [sent-210, score-0.712]
63 L must return a weak hypothesis h : X → {−1, 1} that satisfies Pr_{(x,f(x))←P}[h(x) = f(x)] ≥ 1/2 + γ. [sent-212, score-0.43]
64 Intuitively, a boosting algorithm runs the weak learner repeatedly on a sequence of carefully chosen distributions to obtain a sequence of weak hypotheses, and combines the weak hypotheses to obtain a final hypothesis that has high accuracy under P. [sent-220, score-1.736]
65 In stage t of a parallel booster the boosting algorithm may run the weak learner many times in parallel using different probability distributions. [sent-222, score-1.97]
66 No other dependence on x is allowed, since intuitively the only interface that the boosting algorithm should have with each data point is through its label and the values of the weak hypotheses from earlier stages. [sent-224, score-0.701]
67 We further observe that since the distribution P is the only source of labeled examples, a booster should construct the distributions at each stage by somehow “filtering” examples (x, f (x)) drawn from P based only on the value of f (x) and the values of the weak hypotheses from previous stages. [sent-225, score-0.847]
68 We thus define a parallel booster as follows: Definition 2 (Parallel booster) A T-stage parallel boosting algorithm with N-fold parallelism is defined by TN functions {αt,k}t∈[T],k∈[N] and a (randomized) Boolean function h, where αt,k : {−1, 1}^{(t−1)N+1} → [0, 1] and h : {−1, 1}^{TN} → {−1, 1}. [sent-226, score-1.384]
69 In the t-th stage of boosting the weak learner is run N times in parallel. [sent-227, score-0.894]
70 For each k ∈ [N ], the distribution Pt,k over labeled examples that is given to the k-th run of the weak learner is as follows: a draw from Pt,k is made by drawing (x, f (x)) from P and accepting (x, f (x)) as the output of the draw from Pt,k with probability px = αt,k (h1,1 (x), . [sent-228, score-0.836]
71 In stage t, for each k ∈ [N ] the booster gives the weak learner access to Pt,k as defined above and the weak learner generates a hypothesis ht,k that has advantage at least γ w. [sent-232, score-1.468]
72 After T stages, T N weak hypotheses {ht,k }t∈[T ],k∈[N ] have been obtained from the weak learner. [sent-236, score-0.69]
73 The final hypothesis of the booster is H(x) := h(h1,1 (x), . [sent-237, score-0.367]
, hT,N(x)), and its accuracy is min_{ht,k} Pr_{(x,f(x))←P}[H(x) = f(x)], where the min is taken over all sequences of TN weak hypotheses subject to the condition that each ht,k has advantage at least γ w. [sent-240, score-0.436]
75 The parameter N above corresponds to the number of processors that the parallel booster is using; we get a sequential booster when N = 1. [sent-244, score-1.157]
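To make Definition 2 concrete, here is a small simulation of one stage of an N-fold parallel booster in the oracle model. The booster is assumed to supply the filtering functions αt,k and the weak learner is a black box that takes an example oracle; all names are illustrative.

```python
import random

def draw_from_P_tk(draw_from_P, alpha_tk, past_hyps):
    """Rejection sampling as in Definition 2: draw (x, f(x)) from P and accept
    with probability p_x = alpha_tk(h_{1,1}(x), ..., h_{t-1,N}(x), f(x))."""
    while True:
        x, label = draw_from_P()                 # a draw (x, f(x)) from P
        bits = [h(x) for h in past_hyps] + [label]
        if random.random() <= alpha_tk(bits):
            return x, label

def run_parallel_stage(draw_from_P, alphas_t, past_hyps, weak_learner):
    """Stage t: call the weak learner N times, once per filtering function
    alpha_{t,k}; returns the list of weak hypotheses h_{t,1}, ..., h_{t,N}."""
    hyps = []
    for alpha_tk in alphas_t:                    # k = 1, ..., N
        oracle = lambda a=alpha_tk: draw_from_P_tk(draw_from_P, a, past_hyps)
        hyps.append(weak_learner(oracle))
    return hyps
```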
76 In [10] Freund gave a boosting algorithm and showed that after T stages of boosting, his algorithm generates a final hypothesis that is guaranteed to have error at most vote(γ, T) := Σ_{j=0}^{⌊T/2⌋} C(T, j) (1/2 + γ)^j (1/2 − γ)^{T−j} (see Theorem 2. [sent-246, score-0.533]
77 4) that any T-stage sequential booster must have error at least as large as vote(γ, T), and so consequently any sequential booster that generates a (1 − ε)-accurate final hypothesis must run for Ω(log(1/ε)/γ²) stages. [sent-249, score-0.751]
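For concreteness, vote(γ, T) as reconstructed above is just the probability that a majority vote over T independent predictions, each correct with probability 1/2 + γ, is wrong (ties counted as errors); a direct computation:

```python
from math import comb

def vote(gamma, T):
    """vote(gamma, T) = sum_{j=0}^{floor(T/2)} C(T, j) (1/2+gamma)^j (1/2-gamma)^(T-j)."""
    return sum(comb(T, j) * (0.5 + gamma) ** j * (0.5 - gamma) ** (T - j)
               for j in range(T // 2 + 1))
```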
78 Our Theorem 2 below extends this lower bound to parallel boosters. [sent-250, score-0.419]
79 Several parallel boosting algorithms have been given in the literature, including branching program [20, 13, 18, 19] and decision tree [15] boosters. [sent-251, score-0.684]
80 All of these boosters take Ω(log(1/ε)/γ²) stages to learn to accuracy 1 − ε; our theorem below implies that any parallel booster must run for Ω(log(1/ε)/γ²) stages no matter how many parallel calls to the weak learner are made per stage. [sent-252, score-1.968]
81 Theorem 2 Let B be any T -stage parallel boosting algorithm with N -fold parallelism. [sent-253, score-0.708]
82 Then for any 0 < γ < 1/2, when B is used to boost a γ-advantage weak learner the resulting final hypothesis may have error as large as vote(γ, T ) (see the discussion after Definition 2). [sent-254, score-0.624]
83 4 The usual definition of a weak learner would allow L to fail with probability δ. [sent-256, score-0.523]
84 The theorem is proved as follows: fix any 0 < γ < 1/2 and fix B to be any T-stage parallel boosting algorithm. [sent-258, score-0.708]
85 We will exhibit a target function f and a distribution P over {(x, f(x))}_{x∈X}, and describe a strategy that a weak learner W can use to generate weak hypotheses ht,k that each have advantage at least γ with respect to the distributions Pt,k. [sent-259, score-0.908]
86 We show that with this weak learner W , the resulting final hypothesis H that B outputs will have accuracy at most 1 − vote(γ, T ). [sent-260, score-0.71]
87 Note that under P, given the label z = f(x) of a labeled example (x, f(x)), each coordinate ωi of x is correct in predicting the value of f(x) = z with probability 1/2 + γ, independently of all other ωj’s. [sent-270, score-0.184]
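A sketch of this hard distribution, with illustrative names: the label z is a uniform ±1 bit and each of the T coordinates ωi independently agrees with z with probability 1/2 + γ.

```python
import random

def draw_from_P(gamma, T):
    """Draw (x, f(x)) from P: x = (z, omega), f(x) = z, where z is uniform in
    {-1, +1} and each omega_i equals z with probability 1/2 + gamma."""
    z = random.choice([-1, 1])
    omega = [z if random.random() < 0.5 + gamma else -z for _ in range(T)]
    return (z, omega), z
```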
88 We next describe a way that a weak learner W can generate a γ-advantage weak hypothesis each time it is invoked by B. [sent-271, score-0.953]
89 When W is invoked with Pt,k it replies as follows (recall that for x ∈ X we have x = (z, ω) as described above): (i) if Pr(x,f (x))←Pt,k [ωt = f (x)] ≥ 1/2 + γ then the weak hypothesis ht,k (x) is the function “ωt ,” i. [sent-273, score-0.455]
90 Otherwise, (ii) the weak hypothesis ht,k (x) is “z,” i. [sent-276, score-0.43]
91 (Note that since f (x) = z for all x, this weak hypothesis has zero error under any distribution. [sent-279, score-0.43]
92 ) It is clear that each weak hypothesis ht,k generated as described above indeed has advantage at least γ w. [sent-280, score-0.43]
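The weak learner's reply rule can be sketched as follows; here the advantage of ωt under Pt,k is estimated from samples, whereas the proof works with the exact probability, and the 0-based indexing of ω is an implementation convenience.

```python
def weak_learner_W(t, gamma, sample_from_Ptk, n_samples=20000):
    """Option (i): answer with the coordinate function "omega_t" if it has
    advantage at least gamma under P_{t,k}; otherwise option (ii): answer with
    the zero-error label function "z". Hypotheses are functions of x = (z, omega)."""
    correct = 0
    for _ in range(n_samples):
        (z, omega), label = sample_from_Ptk()
        correct += (omega[t] == label)
    if correct / n_samples >= 0.5 + gamma:
        return lambda x: x[1][t]   # option (i): predict with omega_t
    return lambda x: x[0]          # option (ii): predict with z
```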
93 Proof: If the weak learner never uses option (ii) then H depends only on variables ω1 , . [sent-285, score-0.525]
94 For any k ∈ [N ], since t = 1 there are no weak hypotheses from previous stages, so the value of px is determined by the bit f (x) = z (see Definition 2). [sent-306, score-0.49]
95 For b ∈ {−1, 1}, a draw of (x = (z, ω); f (x) = z) from Db is obtained by setting z = b and independently setting each coordinate ωi equal to z with probability 1/2 + γ. [sent-308, score-0.155]
96 The inductive hypothesis and the weak learner’s strategy together imply that for each labeled example (x = (z, ω), f(x) = z), since hs,ℓ(x) = ωs for s < t, the rejection sampling parameter px = αt,k(h1,1(x), . [sent-315, score-0.555]
97 Consequently the distribution Pt,k over labeled examples is some convex combination of 2^t distributions which we denote Db, where b ranges over {−1, 1}^t, corresponding to conditioning on all possible values for ω1, . [sent-325, score-0.173]
98 On the boosting ability of top-down decision tree learning algorithms. [sent-440, score-0.291]
99 O(log² n) time efficient parallel factorization of dense, sparse separable, and banded matrices. [sent-476, score-0.393]
100 On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. [sent-500, score-0.631]
wordName wordTfidf (topN-words)
[('parallel', 0.393), ('weak', 0.304), ('poly', 0.302), ('boosting', 0.291), ('halfspace', 0.283), ('booster', 0.241), ('processors', 0.223), ('learner', 0.194), ('polylog', 0.137), ('stages', 0.133), ('hypothesis', 0.126), ('lemma', 0.118), ('acpr', 0.118), ('opt', 0.117), ('perceptron', 0.095), ('bn', 0.091), ('abt', 0.086), ('hypotheses', 0.082), ('bi', 0.082), ('labeled', 0.081), ('stage', 0.08), ('boosters', 0.078), ('vote', 0.078), ('pac', 0.078), ('newton', 0.075), ('rounding', 0.073), ('ui', 0.07), ('log', 0.066), ('ai', 0.065), ('pr', 0.064), ('prx', 0.063), ('rational', 0.062), ('bit', 0.06), ('halfspaces', 0.059), ('examples', 0.059), ('sequential', 0.059), ('runs', 0.057), ('draw', 0.052), ('parallelization', 0.052), ('freund', 0.052), ('coordinate', 0.051), ('accuracy', 0.05), ('bd', 0.048), ('margin', 0.048), ('db', 0.046), ('boolean', 0.044), ('px', 0.044), ('naive', 0.042), ('parallelism', 0.042), ('renegar', 0.039), ('def', 0.036), ('learnability', 0.036), ('outputs', 0.036), ('yt', 0.035), ('st', 0.035), ('separable', 0.034), ('bt', 0.034), ('claim', 0.034), ('rd', 0.034), ('polynomial', 0.034), ('convex', 0.033), ('equals', 0.033), ('correctly', 0.033), ('unknown', 0.033), ('nal', 0.032), ('oracle', 0.031), ('rn', 0.031), ('interior', 0.031), ('ym', 0.031), ('tools', 0.031), ('rocco', 0.03), ('ii', 0.03), ('programming', 0.029), ('algebra', 0.029), ('ces', 0.029), ('constantly', 0.028), ('intermediate', 0.028), ('question', 0.027), ('bradley', 0.027), ('parallelized', 0.027), ('xt', 0.027), ('option', 0.027), ('independently', 0.027), ('projection', 0.026), ('bound', 0.026), ('claimed', 0.026), ('processor', 0.026), ('fix', 0.025), ('probability', 0.025), ('run', 0.025), ('invoked', 0.025), ('recalling', 0.025), ('lp', 0.025), ('progress', 0.025), ('gave', 0.025), ('access', 0.025), ('algorithm', 0.024), ('theorem', 0.024), ('henceforth', 0.024), ('target', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
2 0.41072956 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
3 0.31003037 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.19417436 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
5 0.12679851 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
6 0.1232925 299 nips-2011-Variance Penalizing AdaBoost
7 0.1186147 41 nips-2011-Autonomous Learning of Action Models for Planning
8 0.11732866 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
9 0.097594954 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
10 0.096141003 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
11 0.089561671 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
12 0.083742641 220 nips-2011-Prediction strategies without loss
13 0.076733723 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
14 0.072701424 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
15 0.072680183 162 nips-2011-Lower Bounds for Passive and Active Learning
16 0.067065917 241 nips-2011-Scalable Training of Mixture Models via Coresets
17 0.066471078 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
18 0.065411039 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
19 0.064959832 72 nips-2011-Distributed Delayed Stochastic Optimization
20 0.063347496 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
topicId topicWeight
[(0, 0.199), (1, -0.108), (2, -0.072), (3, -0.104), (4, 0.033), (5, 0.123), (6, 0.023), (7, -0.149), (8, -0.366), (9, -0.006), (10, -0.023), (11, 0.021), (12, 0.022), (13, -0.114), (14, 0.138), (15, -0.055), (16, 0.111), (17, -0.111), (18, -0.237), (19, -0.276), (20, 0.006), (21, -0.034), (22, -0.05), (23, -0.057), (24, 0.053), (25, -0.174), (26, 0.051), (27, -0.003), (28, 0.036), (29, 0.016), (30, -0.085), (31, 0.01), (32, -0.088), (33, -0.045), (34, -0.03), (35, 0.046), (36, 0.037), (37, 0.124), (38, -0.001), (39, -0.018), (40, 0.051), (41, -0.007), (42, -0.073), (43, -0.024), (44, -0.054), (45, 0.022), (46, -0.034), (47, -0.061), (48, 0.112), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.9626106 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
2 0.90164685 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
3 0.80424398 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/ε²)). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/ε)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/ε)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/ε), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.6261245 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
5 0.61709791 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
6 0.43028402 49 nips-2011-Boosting with Maximum Adaptive Sampling
7 0.41617206 41 nips-2011-Autonomous Learning of Action Models for Planning
8 0.36385655 95 nips-2011-Fast and Accurate k-means For Large Datasets
9 0.33952165 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
10 0.33693939 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
11 0.33056605 241 nips-2011-Scalable Training of Mixture Models via Coresets
12 0.30679286 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
13 0.29349643 162 nips-2011-Lower Bounds for Passive and Active Learning
14 0.2901082 64 nips-2011-Convergent Bounds on the Euclidean Distance
15 0.28816867 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
16 0.28662473 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
17 0.28043574 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
18 0.27947289 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
19 0.27724537 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
20 0.26384395 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
topicId topicWeight
[(0, 0.048), (4, 0.076), (20, 0.036), (26, 0.043), (31, 0.054), (43, 0.139), (45, 0.105), (57, 0.039), (65, 0.013), (74, 0.042), (78, 0.127), (83, 0.044), (87, 0.054), (99, 0.098)]
simIndex simValue paperId paperTitle
same-paper 1 0.8692497 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
2 0.84049618 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
3 0.83816576 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
4 0.79912978 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
5 0.79727936 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.79328185 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
7 0.7913115 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.78743386 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
9 0.7866134 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
10 0.78490239 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
11 0.78390557 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
12 0.78255069 258 nips-2011-Sparse Bayesian Multi-Task Learning
13 0.78126192 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
14 0.77940577 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
15 0.77930629 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
16 0.77900159 198 nips-2011-On U-processes and clustering performance
17 0.77806336 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
18 0.7771216 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
19 0.77690542 80 nips-2011-Efficient Online Learning via Randomized Rounding
20 0.77475661 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure