jmlr jmlr2013 jmlr2013-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
Editor: Yoav Freund
So a natural goal is to develop an efficient parallel algorithm for learning γ-margin halfspaces that matches the performance of the perceptron algorithm.
The standard notion of efficient parallel computation (see Greenlaw et al. (1995) and the many references therein) is that an efficient parallel algorithm for a problem with input size N is one that uses poly(N) processors and runs in parallel time polylog(N).
Also, as did Vitter and Lin (1992), we require that an efficient parallel learning algorithm's hypothesis must be efficiently evaluatable in parallel, since otherwise all the computation required to run any polynomial-time learning algorithm could be "offloaded" onto evaluating the hypothesis.
Main Question (simplified): Is there a learning algorithm that uses poly(n, 1/γ) processors and runs in time poly(log n, log(1/γ)) to learn an unknown n-dimensional γ-margin halfspace to accuracy 9/10?
Table 1: Bounds on various parallel algorithms for learning a γ-margin halfspace over R^n.
  Processors        Running time
  poly(n, 1/γ)      Õ(1/γ²) + O(log n)
  poly(n, 1/γ)      Õ(1/γ²) + O(log n)
  1                 poly(n, log(1/γ))
  n · poly(1/γ)     Õ(1/γ) + O(log n)
1.1 Relevant Prior Results
Table 1 summarizes the running time and number of processors used by various parallel algorithms to learn a γ-margin halfspace over R^n.
Since the examples are n-dimensional, this can be accomplished in O(log(n/γ)) time using O(n/γ²) processors; the mistake bound of the online perceptron algorithm is 1/γ², so this gives a running time bound of Õ(1/γ²) · log n.
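As a concrete illustration of the mistake bound just mentioned, here is a minimal sketch of the online perceptron (the two-dimensional data, target vector, and margin value below are invented for illustration; the paper's setting is n-dimensional):

```python
import math, random

def perceptron(examples):
    """Online perceptron on examples (x, y) with ||x|| <= 1 and y in {-1, +1}.
    Returns the final weight vector and the number of mistakes made."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for x, y in examples:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
            w = [wi + y * xi for wi, xi in zip(w, x)]       # standard update
            mistakes += 1
    return w, mistakes

# Illustrative data: labels come from a unit-norm target u, and every
# example satisfies the margin condition |<u, x>| >= gamma.
random.seed(0)
u, gamma = (0.6, 0.8), 0.1
pts = []
while len(pts) < 200:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    n = math.hypot(*x)
    if n > 1:
        x = (x[0] / n, x[1] / n)            # keep ||x|| <= 1
    dot = u[0] * x[0] + u[1] * x[1]
    if abs(dot) >= gamma:                   # enforce the gamma-margin
        pts.append((x, 1 if dot > 0 else -1))

w, m = perceptron(pts * 5)                  # several passes over the data
assert m <= 1 / gamma**2                    # classical mistake bound: 1/gamma^2
```

However the examples are ordered, the total number of updates never exceeds 1/γ², which is the quantity the running-time bound above depends on.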
We do not see how to obtain parallel time bounds better than Õ(1/γ²) from recent analyses of other algorithms based on gradient descent (Collins et al.).
At each stage of boosting this algorithm computes a real-valued weak hypothesis based on the vector average of the (normalized) examples weighted according to the current distribution; since the sample size is O(1/γ²) this can be done in O(log(n/γ)) time using poly(n, 1/γ) processors.
Since the boosting algorithm runs for O(1/γ²) stages, the overall running time bound is Õ(1/γ²) · log n.
Second, the main question allows the algorithm to use poly(n, 1/γ) processors and to run in poly(log n, log(1/γ)) time, whereas the hardness result of Vitter and Lin (1992) only rules out algorithms that use poly(n, log(1/γ)) processors and run in poly(log n, log log(1/γ)) time.
If n is fixed to a constant then the efficient parallel algorithm of Alon and Megiddo (1994) for linear programming in constant dimension can be used to learn a γ-margin halfspace using poly(1/γ) processors in polylog(1/γ) running time (see also Vitter and Lin, 1992, Theorem 3).
1.2 Our Results
We give positive and negative results on learning halfspaces in parallel that are inspired by the main question stated above.
1.2.1 POSITIVE RESULTS
Our main positive result is a parallel algorithm for learning large-margin halfspaces, based on a rapidly converging gradient method due to Nesterov (2004), which uses O(n · poly(1/γ)) processors to learn γ-margin halfspaces in parallel time Õ(1/γ) + O(log n) (see Table 1).
We are not aware of prior parallel algorithms that provably learn γ-margin halfspaces running in time polylogarithmic in n and subquadratic in 1/γ.
In contrast, our algorithm requires a linear number of processors as a function of n, and has parallel running time Õ(1/γ) + O(log n).
In contrast, our main negative result is an information-theoretic argument that suggests that such positive parallel learning results cannot be obtained by boosting alone.
An Algorithm Based on Nesterov's Algorithm
In this section we describe and analyze a parallel algorithm for learning a γ-margin halfspace.
Directly applying the basic Nesterov algorithm gives us an algorithm that uses O(n) processors, runs in parallel time O(log(n) · (1/γ)), and outputs a halfspace hypothesis that has constant accuracy.
By combining the basic algorithm with random projection and boosting we get the following stronger result:
Theorem 1 There is a parallel algorithm with the following performance guarantee: Let f, D define an unknown γ-margin halfspace over B_n.
It runs in O(((1/γ) polylog(1/γ) + log(n)) · log(1/ε) poly(log log(1/ε)) + log log(1/δ)) parallel time, uses n · poly(1/γ, 1/ε, log(1/δ)) processors, and with probability 1 − δ it outputs a hypothesis h satisfying Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε.
Let A be a parallel learning algorithm, and let c_δ and c_ε be absolute positive constants such that for all D′ the guarantee below holds. Such issues must be dealt with in order to fully establish our claimed bounds on the number of processors and the parallel running time of our algorithms.
Then there is a parallel algorithm B that, given access to independent labeled examples (x, f(x)) drawn from D, with probability 1 − δ constructs a (1 − ε)-accurate hypothesis (w.r.t. D).
We first describe the basic way that Nesterov's algorithm can be used to find a halfspace hypothesis that approximately minimizes a smooth loss function over a set of γ-margin labeled examples.
Then later we explain how this algorithm is used in the larger context of a parallel algorithm for halfspaces.
Set
– v̂_{k+1} = ẑ_k − r̂_k, and
– ẑ_{k+1} = v̂_{k+1} + ((√L − √µ)/(√L + √µ)) · (v̂_{k+1} − v̂_k).
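The two update rules above are the accelerated-gradient recursion, with r̂_k playing the role of a scaled gradient step and momentum coefficient (√L − √µ)/(√L + √µ). A minimal sketch under assumptions: the quadratic objective, L, and µ below are invented for illustration and are not the paper's loss:

```python
import math

def nesterov(grad, z0, L, mu, iters):
    """Accelerated gradient descent for an L-smooth, mu-strongly-convex f:
    v_{k+1} = z_k - (1/L) grad(z_k);  z_{k+1} = v_{k+1} + c (v_{k+1} - v_k),
    with momentum c = (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))."""
    c = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
    v, z = z0[:], z0[:]
    for _ in range(iters):
        v_new = [zi - gi / L for zi, gi in zip(z, grad(z))]   # gradient step
        z = [vn + c * (vn - vo) for vn, vo in zip(v_new, v)]  # momentum step
        v = v_new
    return v

# Toy quadratic f(x) = 0.5 * sum(a_i * x_i^2): L = max a_i, mu = min a_i.
a = [1.0, 100.0]
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]
x = nesterov(grad, [1.0, 1.0], L=100.0, mu=1.0, iters=400)
assert max(abs(xi) for xi in x) < 1e-6   # converges to the minimizer at 0
```

The point of the momentum term is the √(L/µ) (rather than L/µ) dependence of the iteration count, which is what gives the 1/γ (rather than 1/γ²) behavior exploited in this section.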
We discuss the details of exactly how this finite-precision algorithm is implemented, and the parallel running time required for such an implementation, at the end of this section.
Also, we have
||ẑ_{k+1} − z_{k+1}|| = || (2/(1 + √(µ/L))) (v̂_{k+1} − v_{k+1}) − ((√L − √µ)/(√L + √µ)) (v̂_k − v_k) ||
≤ 2 ||v̂_{k+1} − v_{k+1}|| + ||v̂_k − v_k|| ≤ 2β · 7^{k+1} + β · 7^k ≤ 3β · 7^{k+1},
completing the proof.
Thus far the algorithm has used O(log(n/γ)) parallel time and O(n log(1/γ)/γ²) many processors.
It remains to analyze the parallel time complexity of the algorithm.
We have already analyzed the parallel time complexity of the initial random projection stage, and shown that we may take the finite-precision iterative algorithm ANfp to run for O(1/γ) stages, so it suffices to analyze the parallel time complexity of each stage of ANfp.
We will show that each stage runs in parallel time polylog(1/γ) and thus establish the theorem.
The invariant we maintain throughout each iteration k of algorithm ANfp is that each coordinate of v̂_k is a poly(K)-bit rational number and each coordinate of ẑ_k is a poly(K)-bit rational number.
It remains to show that, given such values v̂_k and ẑ_k, the required updates can be carried out in parallel time polylog(1/γ) using log(1/γ) processors.
Lemma 13 There is an algorithm A_r that, given an L-bit positive rational number z and an L-bit positive rational number β as input, outputs A_r(z) for which |A_r(z) − √z| ≤ β in poly(log log(1/β), log L) parallel time using poly(log(1/β), L) processors.
Lemma 14 There is an algorithm A_p that, given an L-bit positive rational number z, and an L-bit positive rational number β ≤ 1/4, outputs A_p(z) for which |A_p(z) − φ′(z)| ≤ β in at most poly(log log(1/β), log L) parallel time using poly(log(1/β), L) processors.
To bound the numerators of the components of v_k and z_k, it suffices to bound the norms of v_k and z_k.
We work in the original PAC learning setting (Valiant, 1984; Kearns and Vazirani, 1994; Schapire, 1990) in which a weak learning algorithm is provided as an oracle that is called by the boosting algorithm, which must simulate a distribution over labeled examples for the weak learner.
Our main result for this setting is that boosting is inherently sequential; being able to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of sequential boosting stages that are required.
In fact we show this in a very strong sense, by proving that a boosting algorithm that runs arbitrarily many copies of the weak learner in parallel in each stage cannot save even one stage over a sequential booster that runs the weak learner just once in each stage.
Below we first define the parallel boosting framework and give some examples of parallel boosters.
We then state and prove our lower bound on the number of stages required by parallel boosters.
A consequence of our lower bound is that Ω(log(1/ε)/γ²) stages of parallel boosting are required in order to boost a γ-advantage weak learner to achieve classification accuracy 1 − ε, no matter how many copies of the weak learner are used in parallel in each stage.
Definition 15 A γ-advantage weak learner L is an algorithm that is given access to a source of independent random labeled examples drawn from an (unknown and arbitrary) probability distribution P over labeled examples {(x, f(x))}_{x∈X}.
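To make Definition 15 concrete, the following toy weak learner (entirely illustrative, not the paper's construction) returns the single-coordinate stump with the best empirical agreement; on the invented distribution below its advantage far exceeds any small γ:

```python
import random

def stump_weak_learner(draw, m=500):
    """Toy weak learner: draws m labeled examples (x, y) with x a tuple in
    {-1,+1}^n and y in {-1,+1}, and returns the coordinate stump
    h(x) = s * x[i] with the best empirical agreement."""
    sample = [draw() for _ in range(m)]
    n = len(sample[0][0])
    best = None
    for i in range(n):
        for s in (+1, -1):
            agree = sum(1 for x, y in sample if s * x[i] == y)
            if best is None or agree > best[0]:
                best = (agree, i, s)
    _, i, s = best
    return lambda x: s * x[i]

# Invented target: f(x) = x[0], so coordinate 0 is perfectly label-correlated.
random.seed(1)
def draw():
    x = tuple(random.choice((-1, 1)) for _ in range(5))
    return x, x[0]

h = stump_weak_learner(draw)
test = [draw() for _ in range(1000)]
advantage = sum(1 for x, y in test if h(x) == y) / len(test) - 0.5
assert advantage >= 0.1   # comfortably a gamma-advantage hypothesis here
```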
Intuitively, a boosting algorithm runs the weak learner repeatedly on a sequence of carefully chosen distributions P_1, P_2, . . . , and combines the weak hypotheses to obtain a final hypothesis h that has high accuracy under P.
In a normal (sequential) boosting algorithm, the probability weight that the (t + 1)st distribution P_{t+1} puts on a labeled example (x, f(x)) may depend on the values of all the previous weak hypotheses h_1(x), . . . , h_t(x).
No other dependence on x is allowed, since intuitively the only interface that the boosting algorithm should have with each data point is through its label and the values of the weak hypotheses.
We thus define a sequential booster as follows:
Definition 16 (Sequential booster) A T-stage sequential boosting algorithm is defined by a sequence of functions α_1, . . . , α_T and a (randomized) Boolean function h.
In the t-th stage of boosting, the distribution P_t over labeled examples that is given to the weak learner by the booster is obtained from P by doing rejection sampling according to α_t.
In stage t the booster gives the weak learner access to P_t as defined above, and the weak learner generates a hypothesis h_t that has advantage at least γ w.r.t. P_t.
Together with h_1, . . . , h_{t−1}, this h_t enables the booster to give the weak learner access to P_{t+1} in the next stage.
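The rejection-sampling step of Definition 16 can be sketched as follows; the particular reweighting rule alpha_t and the toy distribution below are assumptions for illustration, not the paper's choices:

```python
import random

def draw_from_Pt(draw_P, alpha_t, weak_hyps):
    """Rejection sampling: repeatedly draw (x, y) from the source distribution
    P and accept with probability alpha_t(h_1(x), ..., h_{t-1}(x), y)."""
    while True:
        x, y = draw_P()
        p = alpha_t([h(x) for h in weak_hyps] + [y])
        if random.random() < p:
            return x, y

# Illustrative reweighting (assumed): concentrate on examples that the
# current hypotheses' majority vote gets wrong.
def alpha_t(bits):
    *votes, y = bits
    return 1.0 if sum(votes) * y <= 0 else 0.1

def draw_P():                      # toy source distribution over (x, f(x))
    z = random.uniform(-1, 1)
    return (z,), (1 if z > 0 else -1)

random.seed(2)
h1 = lambda x: 1                   # a deliberately weak first hypothesis
sample = [draw_from_Pt(draw_P, alpha_t, [h1]) for _ in range(2000)]
frac_neg = sum(1 for _, y in sample if y == -1) / len(sample)
assert frac_neg > 0.7              # P_t over-weights the examples h1 gets wrong
```

Note that the acceptance probability depends on x only through the previous hypotheses' predictions and the label, exactly as the definition requires.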
All these boosters use Ω(log(1/ε)/γ²) stages of boosting to achieve 1 − ε accuracy, and indeed Freund (1995) has shown that any sequential booster must run for Ω(log(1/ε)/γ²) stages.
More precisely, Freund (1995) modeled the phenomenon of boosting using the majority function to combine weak hypotheses as an interactive game between a "weightor" and a "chooser" (see Freund, 1995, Section 2).
He showed that any T-stage sequential booster must have error at least as large as vote(γ, T), and so consequently any sequential booster that generates a (1 − ε)-accurate final hypothesis must run for Ω(log(1/ε)/γ²) stages.
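Under a standard reading of Freund's analysis (hedged: this is our interpretation, not a formula quoted from the paper), vote(γ, T) is the probability that a majority vote of T independent predictions, each correct with probability 1/2 + γ, is wrong:

```python
from math import comb

def vote(gamma, T):
    """Pr[Binomial(T, 1/2 + gamma) <= T/2]: the chance that at most half of T
    independent (1/2 + gamma)-accurate votes are correct, i.e. that the
    majority vote errs. (Our reading of vote(gamma, T), not verbatim.)"""
    p = 0.5 + gamma
    return sum(comb(T, i) * p**i * (1 - p)**(T - i) for i in range(T // 2 + 1))

assert abs(vote(0.1, 1) - 0.4) < 1e-12               # one vote errs w.p. 1/2 - gamma
assert vote(0.1, 101) < vote(0.1, 11) < vote(0.1, 1)  # error falls as T grows
```

Driving this quantity below ε requires T = Ω(log(1/ε)/γ²), which is the shape of the stage lower bound quoted above.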
Our Theorem 18 below extends this lower bound to parallel boosters.
In stage t of a parallel booster the boosting algorithm may simultaneously run the weak learner many times in parallel using different probability distributions.
The distributions that are used in stage t may depend on any of the weak hypotheses from earlier stages, but may not depend on any of the weak hypotheses generated by any of the calls to the weak learner in stage t.
Definition 17 (Parallel booster) A T-stage parallel boosting algorithm with N-fold parallelism is defined by TN functions {α_{t,k}}_{t∈[T],k∈[N]} and a (randomized) Boolean function h, where α_{t,k} : {−1, 1}^{(t−1)N+1} → [0, 1] and h : {−1, 1}^{TN} → {−1, 1}.
In the t-th stage of boosting the weak learner is run N times in parallel.
For each k ∈ [N], the distribution P_{t,k} over labeled examples that is given to the k-th run of the weak learner is as follows: a draw from P_{t,k} is made by drawing a labeled example (x, f(x)) from P, computing the value p_x := α_{t,k}(h_{1,1}(x), . . . , h_{t−1,N}(x), f(x)), and accepting (x, f(x)) with probability p_x.
In stage t, for each k ∈ [N] the booster gives the weak learner access to P_{t,k} as defined above and the weak learner generates a hypothesis h_{t,k} that has advantage at least γ w.r.t. P_{t,k}.
Together with the weak hypotheses {h_{s,j}}_{s∈[t−1],j∈[N]} obtained in earlier stages, these h_{t,k}'s enable the booster to give the weak learner access to each P_{t+1,k} in the next stage.
After T stages, TN weak hypotheses {h_{t,k}}_{t∈[T],k∈[N]} have been obtained from the weak learner.
The parameter N above corresponds to the number of processors that the parallel booster is using.
Parallel boosting algorithms that call the weak learner different numbers of times at different stages fit into our definition simply by taking N to be the max number of parallel calls made at any stage.
Several parallel boosting algorithms have been given in the literature; in particular, all boosters that construct branching program or decision tree hypotheses are of this type.
The number of stages of these boosting algorithms corresponds to the depth of the branching program or decision tree that is constructed, and the number of nodes at each depth corresponds to the parallelism parameter.
Our results in the next subsection will imply that any parallel booster must run for Ω(log(1/ε)/γ²) stages no matter how many parallel calls to the weak learner are made in each stage.
The Lower Bound and Its Proof
Our lower bound theorem for parallel boosting is the following:
Theorem 18 Let B be any T-stage parallel boosting algorithm with N-fold parallelism.
Then for any 0 < γ < 1/2, when B is used to boost a γ-advantage weak learner the resulting final hypothesis may have error as large as vote(γ, T) (see the discussion after Definition 17).
The theorem is proved as follows: fix any 0 < γ < 1/2 and fix B to be any T-stage parallel boosting algorithm.
We will exhibit a target function f and a distribution P over {(x, f(x))}_{x∈X}, and describe a strategy that a weak learner W can use to generate weak hypotheses h_{t,k} that all have advantage at least γ with respect to the distributions P_{t,k}.
We show that with this weak learner W, the resulting final hypothesis H that B outputs will have accuracy at most 1 − vote(γ, T).
We next describe a way that a weak learner W can generate a γ-advantage weak hypothesis each time it is invoked by B.
Fix any t ∈ [T] and any k ∈ [N], and recall that P_{t,k} is the distribution over labeled examples that is used for the k-th call to the weak learner in stage t.
It is also clear that if the weak learner ever uses option (ii) above at some invocation (t, k) then B may output a zero-error final hypothesis simply by taking H = h_{t,k} = f.
On the other hand, the following crucial lemma shows that if the weak learner never uses option (ii) for any (t, k) then the error of B is lower bounded by vote(γ, T):
Lemma 19 If W never uses option (ii) then Pr_{(x, f(x))∼P}[H(x) ≠ f(x)] ≥ vote(γ, T).
The inductive hypothesis and the weak learner's strategy together imply that for each labeled example (x = (z, ω), f(x) = z), since h_{s,ℓ}(x) = ω_s for s < t, the rejection sampling parameter p_x = α_{t,k}(h_{1,1}(x), . . .).
But it is not clear to us that such an algorithm must actually exist, and so another intriguing direction is to prove negative results giving evidence that parallel learning of large-margin halfspaces is computationally hard.
A stronger result would be that no such algorithm can even output a halfspace hypothesis which is consistent with 99% (or 51%) of the labeled examples.
Suppose we have an algorithm that achieves accuracy 1 − ε in parallel time T′′ with probability c_δ.
Then we can run O(log(1/δ)) copies of this algorithm in parallel, then test each of their hypotheses in parallel using O(log(1/δ)/ε) examples.
Finding the best hypothesis takes at most O(log log(1/δ)) parallel time (with polynomially many processors).
The total parallel time taken is then O(T′′ + log(1/ε) + log log(1/δ)).
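The confidence-amplification argument above can be sketched as follows; the simulated learner and all failure/error rates below are invented for illustration:

```python
import random

def amplify(run_copy, draw, k, m):
    """Run k independent copies of a possibly unreliable learner and return
    the hypothesis with the best empirical accuracy on m fresh test examples.
    (The k runs and the k evaluations could each be done in parallel.)"""
    hyps = [run_copy() for _ in range(k)]
    test = [draw() for _ in range(m)]
    return max(hyps, key=lambda h: sum(1 for x, y in test if h(x) == y))

random.seed(3)
def draw():                                   # toy distribution over (x, f(x))
    x = random.uniform(-1, 1)
    return x, (1 if x > 0 else -1)

def run_copy():
    # Each copy "succeeds" (error 0.01) only half the time, else fails badly.
    err = 0.01 if random.random() < 0.5 else 0.4
    return lambda x, e=err: (1 if x > 0 else -1) * (1 if random.random() > e else -1)

h = amplify(run_copy, draw, k=20, m=300)
holdout = [draw() for _ in range(2000)]
err = sum(1 for x, y in holdout if h(x) != y) / len(holdout)
assert err < 0.1    # with high probability a successful copy is selected
```

Boosting the success probability from a constant to 1 − δ this way costs only a log(1/δ) factor in processors and an additive log log(1/δ) in parallel time, matching the bound above.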
Algorithm B runs a parallel version of a slight variant of the "boosting-by-filtering" algorithm due to Freund (1995), using A′ as a weak learner.
In Freund's description of this algorithm, once the condition which causes the algorithm to output a random hypothesis is reached, the algorithm stops sampling, but, for a parallel version, it is convenient to sample all of the examples for a round in parallel.
So the parallel time taken is O(log(1/ε)) times the time taken in each iteration.
The rejection step also may be done in O(poly(log log(1/ε))) time in parallel for each example.
Using the fact that the initial value u_1 is an L-bit rational number, a straightforward analysis using (4) shows that for all k ≤ O(log L + log log(1/β)) the number u_k is a rational number with poly(L, log(1/β)) bits (if b_k is the number of bits required to represent u_k, then b_{k+1} ≤ 2b_k + O(L)).
Standard results on the parallel complexity of integer multiplication thus imply that for k ≤ O(log L + log log(1/β)) the exact value of u_k can be computed in the parallel time and processor bounds claimed by the lemma.
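A natural reading of the recurrence behind Lemma 13 (an assumption on our part, since equation (4) is not reproduced here) is Newton's iteration u_{k+1} = (u_k + z/u_k)/2 for √z. Run over exact rationals, its bit-length roughly doubles per step, mirroring b_{k+1} ≤ 2b_k + O(L), while the error squares, giving the O(log log(1/β)) iteration count:

```python
from fractions import Fraction
from math import isqrt

def newton_sqrt(z, iters):
    """Newton's iteration for sqrt(z) over exact rationals:
    u_{k+1} = (u_k + z / u_k) / 2.  Quadratic convergence means the number
    of correct digits roughly doubles with each iteration."""
    # Crude L-bit starting point above sqrt(z).
    u = Fraction(isqrt(z.numerator // max(1, z.denominator)) + 1)
    for _ in range(iters):
        u = (u + z / u) / 2
    return u

z = Fraction(2)
u = newton_sqrt(z, 6)
assert abs(float(u) - 2 ** 0.5) < 1e-12
# The exact numerator/denominator bit-lengths grow roughly like
# b_{k+1} <= 2 b_k + O(L), as in the analysis above.
```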
simIndex simValue paperId paperTitle
2 0.23938105 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
3 0.15733138 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.11258279 86 jmlr-2013-Parallel Vector Field Embedding
Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji
Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension
5 0.094988272 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
6 0.090789445 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
7 0.078946285 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
8 0.076077551 78 jmlr-2013-On the Learnability of Shuffle Ideals
9 0.072134316 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
10 0.060401596 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
11 0.054147005 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
12 0.049253386 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
13 0.048507582 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
14 0.044556063 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
15 0.042722773 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
16 0.041244917 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars
17 0.041128259 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
18 0.041126493 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
19 0.040727854 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
20 0.038102217 104 jmlr-2013-Sparse Single-Index Model
topicId topicWeight
[(0, -0.235), (1, 0.149), (2, 0.07), (3, 0.099), (4, -0.181), (5, -0.101), (6, 0.202), (7, -0.143), (8, -0.313), (9, 0.047), (10, 0.116), (11, 0.096), (12, -0.121), (13, -0.076), (14, -0.003), (15, -0.113), (16, 0.149), (17, 0.153), (18, -0.058), (19, -0.049), (20, 0.093), (21, -0.025), (22, -0.037), (23, -0.078), (24, -0.04), (25, 0.038), (26, 0.054), (27, 0.106), (28, 0.081), (29, 0.108), (30, -0.088), (31, 0.024), (32, 0.074), (33, 0.039), (34, -0.089), (35, -0.055), (36, 0.013), (37, -0.025), (38, 0.076), (39, -0.071), (40, 0.099), (41, -0.035), (42, 0.008), (43, -0.074), (44, 0.057), (45, -0.083), (46, -0.077), (47, 0.037), (48, 0.077), (49, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.95166725 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using n · poly(1/γ) processors and running in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
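The accelerated gradient iteration the abstract builds on is easy to state. Below is a minimal single-processor sketch of gradient descent with a Nesterov-style momentum term, applied to a smooth logistic surrogate of the margin loss; the surrogate, step size, and momentum constant are illustrative assumptions, not the paper's exact choices, and the paper's point is that each iteration parallelizes across processors.

```python
import numpy as np

def nesterov_halfspace(X, y, lr=0.1, momentum=0.9, iters=300):
    """Gradient descent with a Nesterov-style momentum term on a smooth
    logistic surrogate of the margin loss.  Sequential sketch with
    illustrative constants; the paper parallelizes this iteration."""
    w = np.zeros(X.shape[1])
    v = np.zeros(X.shape[1])              # momentum (velocity) term
    for _ in range(iters):
        look = w + momentum * v           # Nesterov look-ahead point
        margins = y * (X @ look)
        # gradient of the mean logistic loss log(1 + exp(-margin))
        g = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        v = momentum * v - lr * g
        w = w + v
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])                      # halfspace: sign of coordinate 0
w = nesterov_halfspace(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

The momentum term is what distinguishes this from plain gradient descent: the gradient is evaluated at the look-ahead point `w + momentum * v`, which is what yields the accelerated 1/γ (rather than 1/γ²) dependence the abstract contrasts with naive parallel algorithms.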
2 0.79240781 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
3 0.58512759 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
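The coordinate-descent view of AdaBoost that this analysis takes can be made concrete in a few lines. The sketch below uses threshold stumps as the weak hypotheses and a linear toy target; the stump class, round count, and data are illustrative choices, not tied to the paper's bounds.

```python
import numpy as np

def adaboost_stumps(X, y, rounds=40):
    """Minimal AdaBoost: each round greedily picks the threshold stump
    with the smallest weighted error (the coordinate-descent step on
    the exponential loss), then exponentially reweights the examples."""
    m, d = X.shape
    D = np.full(m, 1.0 / m)                   # example weights
    ensemble = []                             # (alpha, feature, thresh, sign)
    for _ in range(rounds):
        best = None
        for j in range(d):
            for t in X[:, j]:
                for s in (1.0, -1.0):
                    pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s, pred)
        err, j, t, s, pred = best
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against zero error
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, j, t, s))
        D *= np.exp(-alpha * y * pred)        # exponential-loss reweighting
        D /= D.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0)
                for a, j, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 0] + X[:, 1])    # a linear target single stumps cannot fit
ens = adaboost_stumps(X, y)
train_acc = np.mean(predict(ens, X) == y)
```

The quantity the paper bounds is how fast the exponential loss `mean(exp(-y * score))` of this ensemble approaches its infimum as `rounds` grows.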
4 0.36313656 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
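The aggressive flavour described above can be illustrated with a generic margin-based selection rule: always query the pool point the current hypothesis is least certain about, then refit. This is plain uncertainty sampling with a logistic refit, a hedged stand-in for illustration only, not the algorithm analyzed in the paper or its approximation guarantees.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, iters=200):
    """Plain gradient descent on the mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = -(X * (y / (1.0 + np.exp(y * (X @ w))))[:, None]).mean(axis=0)
        w -= lr * g
    return w

def active_learn(X, oracle, budget):
    """Aggressive pool-based selection: query the unlabeled point
    closest to the current separator (smallest |w . x|) and refit."""
    labeled, queried, i = {}, set(), 0
    while len(set(labeled.values())) < 2:     # seed until both classes seen
        labeled[i] = oracle(i); queried.add(i); i += 1
    for _ in range(budget):
        idx = sorted(labeled)
        w = fit_logistic(X[idx], np.array([labeled[k] for k in idx]))
        scores = np.abs(X @ w)
        scores[list(queried)] = np.inf        # never re-query a point
        pick = int(np.argmin(scores))
        labeled[pick] = oracle(pick); queried.add(pick)
    idx = sorted(labeled)
    return fit_logistic(X[idx], np.array([labeled[k] for k in idx])), len(labeled)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0])                          # true separating halfspace
w, n_labels = active_learn(X, lambda i: y[i], budget=20)
pool_acc = np.mean(np.sign(X @ w) == y)
```

The contrast with a "mellow" strategy is in the selection line: a mellow learner queries any point whose label is not yet determined by the version space, whereas the aggressive rule above spends each label on the single most informative point.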
5 0.32243463 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyses developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track the discrepancy down to a minor implementation error in the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
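λ policy iteration interpolates between the two classical algorithms: λ = 0 recovers value iteration and λ = 1 recovers policy iteration. The λ = 0 endpoint, whose geometric γ-contraction convergence the paper's rate bounds generalize, can be sketched on a toy MDP (the two-state chain below is an illustrative example, not taken from the paper).

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a finite MDP: the lambda = 0 member of the
    lambda-policy-iteration family.  P[a] is the |S|x|S| transition
    matrix of action a, R[a] the per-state reward vector.  The Bellman
    operator is a gamma-contraction, so the iterates converge to the
    optimal value function at the standard geometric rate."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V             # Q has shape (actions, states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# two-state, two-action toy MDP: action 0 stays put, action 1 switches
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],                 # staying in state 1 pays 1
              [0.0, 0.0]])                # switching pays nothing
V, pi = value_iteration(P, R)
```

Here the optimal policy switches out of state 0 and stays in state 1, giving V = (0.9/(1-0.9), 1/(1-0.9)) = (9, 10); the stopping test uses the contraction property, since a small sup-norm step implies the iterate is close to the fixed point.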
6 0.3073017 86 jmlr-2013-Parallel Vector Field Embedding
7 0.29642373 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
8 0.27175996 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
9 0.26323208 78 jmlr-2013-On the Learnability of Shuffle Ideals
10 0.26038828 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars
11 0.25486264 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
12 0.24410981 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
13 0.23683806 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
14 0.22885019 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
15 0.20720881 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
16 0.20667201 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
17 0.20582893 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
18 0.19142748 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
19 0.1896797 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
20 0.18560591 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
topicId topicWeight
[(0, 0.03), (5, 0.221), (6, 0.032), (9, 0.031), (10, 0.073), (20, 0.02), (23, 0.036), (53, 0.013), (65, 0.271), (68, 0.021), (70, 0.029), (75, 0.031), (85, 0.059), (87, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.82980722 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using n · poly(1/γ) processors and running in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
2 0.6642819 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
3 0.66104102 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
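The first algorithm above (split, solve locally, average) is simple enough to state in a few lines. The sketch below uses ordinary least squares as the local minimizer because its per-machine solve is a one-liner; the paper's analysis covers general empirical-risk minimizers, and the simulated data here is an illustrative assumption.

```python
import numpy as np

def average_mixture(X, y, m):
    """Averaging method for distributed estimation: split the N samples
    evenly across m machines, solve the local least-squares problem on
    each split, and return the average of the m local estimates."""
    estimates = []
    for Xs, ys in zip(np.array_split(X, m), np.array_split(y, m)):
        w_local, *_ = np.linalg.lstsq(Xs, ys, rcond=None)  # local solve
        estimates.append(w_local)
    return np.mean(estimates, axis=0)       # one round of communication

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(3000, 3))
y = X @ w_true + 0.1 * rng.normal(size=3000)
w_avg = average_mixture(X, y, m=10)
```

Only the m local estimates cross the network, which is the communication-efficiency claim; the MSE guarantee quoted in the abstract says this averaged estimate matches the centralized rate as long as m ≤ √N.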
4 0.66012108 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.65883124 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
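For reference, the point estimate that the Bayesian treatment generalizes is classical CCA, computable in closed form by whitening each view and taking an SVD of the cross-covariance. This is a textbook sketch on synthetic data with one shared latent signal; the regularization, the shared/view-specific decomposition, and the posterior inference that are the paper's contribution are all omitted.

```python
import numpy as np

def cca(X, Y):
    """Classical CCA: whiten each view via Cholesky factors of its
    covariance, then SVD the whitened cross-covariance.  The singular
    values are the canonical correlations; A and B map each view onto
    its canonical directions."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Lx = np.linalg.cholesky(X.T @ X / n)
    Ly = np.linalg.cholesky(Y.T @ Y / n)
    M = np.linalg.solve(Lx, X.T @ Y / n) @ np.linalg.inv(Ly).T
    U, rho, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U)            # canonical directions for X
    B = np.linalg.solve(Ly.T, Vt.T)         # canonical directions for Y
    return A, B, rho

rng = np.random.default_rng(4)
z = rng.normal(size=2000)                    # shared latent signal
X = np.outer(z, [1.0, 0.0, 0.0]) + 0.3 * rng.normal(size=(2000, 3))
Y = np.outer(z, [0.0, 1.0]) + 0.3 * rng.normal(size=(2000, 2))
A, B, rho = cca(X, Y)
```

With the noise level above, the top canonical correlation should sit near 1/(1 + 0.09) ≈ 0.92 while the second one is pure noise near zero; the small-sample, high-dimension regime where this point estimate degrades is exactly where the paper argues the Bayesian approach helps.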
6 0.65131414 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
7 0.65021759 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
8 0.64971197 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
9 0.6484043 73 jmlr-2013-Multicategory Large-Margin Unified Machines
10 0.64685065 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
11 0.64638335 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
12 0.645257 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
13 0.6431542 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
14 0.64041352 8 jmlr-2013-A Theory of Multiclass Boosting
15 0.64005274 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
16 0.63859338 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.6341207 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
19 0.63404369 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
20 0.63339365 68 jmlr-2013-Machine Learning with Operational Costs