jmlr jmlr2013 jmlr2013-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
Reference: text
sentIndex sentText sentNum sentScore
1 We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. [sent-10, score-0.371]
2 Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. [sent-12, score-0.738]
3 We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. [sent-14, score-0.838]
4 We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. [sent-15, score-0.339]
5 Introduction We consider pool-based active learning (McCallum and Nigam, 1998), in which a learner receives a pool of unlabeled examples, and can iteratively query a teacher for the labels of examples from the pool. [sent-17, score-0.816]
6 The number of queries used by the learner is termed its label complexity. [sent-19, score-0.342]
7 In particular, it has been shown that when the data is realizable (relative to some assumed hypothesis class), the mellow approach can obtain significant label complexity improvements. [sent-29, score-0.340]
8 An advantage of the mellow approach is its ability to obtain label complexity improvements in the agnostic setting, which allows an arbitrary and large labeling error (Balcan et al. [sent-35, score-0.51]
9 In this work we revisit the aggressive approach for the realizable case, and in particular for active learning of half-spaces in Euclidean space. [sent-40, score-0.346]
10 In the first part of this work we construct an efficient aggressive active learner for half-spaces in Euclidean space, which is approximately optimal, that is, achieves near-optimal label complexity, if the pool is separable with a margin. [sent-43, score-0.839]
11 Our algorithm for halfspaces is based on a greedy query selection approach as proposed in Tong and Koller (2002) and Dasgupta (2005). [sent-45, score-0.347]
12 We obtain improved target-dependent approximation guarantees for greedy selection in a general active learning setting. [sent-46, score-0.335]
13 In the second part of this work we compare the greedy approach to the mellow approach. [sent-48, score-0.348]
14 We prove that there are cases in which this highly aggressive greedy approach results in significantly better label complexity compared to the mellow approach. [sent-49, score-0.638]
15 We further demonstrate experimentally that substantial improvements in label complexity can be achieved compared to mellow approaches, for both realizable and low-error settings. [sent-50, score-0.43]
16 The first greedy query selection algorithm for learning halfspaces in Euclidean space was proposed by Tong and Koller (2002). [sent-51, score-0.372]
17 The greedy algorithm is based on the notion of a version space: the set of all hypotheses in the hypothesis class that are consistent with the labels currently known to the learner. [sent-52, score-0.339]
18 Tong and Koller proposed to query the example from the pool that splits the version space as evenly as possible. [sent-55, score-0.472]
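As an illustration of this selection rule, the following minimal sketch assumes the version space is approximated by a finite sample of candidate weight vectors; the function and variable names (`select_query`, `candidate_hypotheses`) are illustrative and not from the paper.

```python
import numpy as np

def select_query(pool, candidate_hypotheses):
    """Pick the pool example that splits the (sampled) version space most evenly.

    pool: array of shape (m, d) holding the unlabeled examples.
    candidate_hypotheses: array of shape (k, d); each row w stands for the
    halfspace x -> sign(<w, x>) and serves as a proxy for the version space.
    """
    best_x, best_balance = None, -1.0
    for x in pool:
        # Fraction of surviving hypotheses that would label x positively.
        p_pos = np.mean(candidate_hypotheses @ x > 0)
        balance = p_pos * (1.0 - p_pos)  # largest (1/4) when the split is even
        if balance > best_balance:
            best_x, best_balance = x, balance
    return best_x
```

Querying the returned example and discarding the hypotheses inconsistent with the received label is one step of this bisection-style strategy.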
19 The label complexity of greedy pool-based active learning algorithms can be analyzed by comparison with the best possible label complexity of any pool-based active learner on the same pool. [sent-60, score-0.914]
20 The worst-case label complexity of an active learner is the maximal number of queries it would make on the given pool, where the maximum is over all the possible classification rules that can be consistent with the pool according to the given hypothesis class. [sent-61, score-0.93]
21 The average-case label complexity of an active learner is the average number of queries it would make on the given pool, where the average is taken with respect to some fixed probability distribution P over the possible classifiers in the hypothesis class. [sent-62, score-0.637]
22 For each of these definitions, the optimal label complexity is the lowest label complexity that can be achieved by an active learner on the given pool. [sent-63, score-0.604]
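Written out, the quantities described in the preceding sentences are as follows (a compact paraphrase of the definitions, with c(A, h) denoting the number of queries A makes on the pool when the true labeling is induced by h):

```latex
c_{\max}(\mathcal{A}) = \max_{h \in \mathcal{H}} c(\mathcal{A}, h), \qquad
c_{\mathrm{avg}}(\mathcal{A}) = \mathbb{E}_{h \sim P}\!\left[c(\mathcal{A}, h)\right], \qquad
\mathrm{OPT}_{\max} = \min_{\mathcal{A}} c_{\max}(\mathcal{A}), \qquad
\mathrm{OPT}_{\mathrm{avg}} = \min_{\mathcal{A}} c_{\mathrm{avg}}(\mathcal{A}).
```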
23 Since implementing the optimal label complexity is usually computationally intractable, an alternative is to implement an efficient algorithm, and to guarantee a bounded factor of approximation on its label complexity, compared to the optimal label complexity. [sent-64, score-0.42]
24 The analysis presented above thus can result in poor approximation factors, since if there are instances in the pool that are very close to each other, then pmin might be very small. [sent-71, score-0.367]
25 (1961), we show that if the examples in the pool are stored as numbers of a finite accuracy 1/c, then pmin ≥ (c/d)^d, where d is the dimensionality of the space. [sent-74, score-0.404]
26 A noteworthy example is when the target hypothesis separates the pool with a margin of γ. [sent-79, score-0.517]
27 Then, an approximate majority vote over the version space, that is, a random rule which approximates the majority vote with high probability, can be used to determine the labels of the pool. [sent-85, score-0.42]
28 The assumption of separation with a margin can be relaxed if a lower bound on the total hinge-loss of the best separator for the pool can be assumed. [sent-92, score-0.461]
29 In the second part of this work, we compare the greedy approach to the mellow approach of CAL in the realizable case, both theoretically and experimentally. [sent-106, score-0.397]
30 In the simple learning setting of thresholds on the line, our margin-based approach is preferable to the mellow approach when the true margin of the target hypothesis is large. [sent-110, score-0.47]
31 There exists a distribution in Euclidean space such that the mellow approach cannot achieve a significant improvement in label complexity over passive learning for halfspaces, while the greedy approach achieves such an improvement using more unlabeled examples. [sent-112, score-0.758]
32 There exists a pool in Euclidean space such that the mellow approach requires exponentially more labels than the greedy approach. [sent-114, score-0.729]
33 It further suggests that aggressive approaches can be significantly better than mellow approaches in some practical settings. [sent-117, score-0.335]
34 On the Challenges in Active Learning for Halfspaces The approach we employ for active learning does not provide absolute guarantees for the label complexity of learning, but a relative guarantee instead, in comparison with the optimal label complexity. [sent-119, score-0.494]
35 In this example, to distinguish between the case in which all points have a negative label and the case in which one of the points has a positive label while the rest have a negative label, any active learning algorithm will have to query every point at least once. [sent-125, score-0.584]
36 On the other hand, the sample complexity of passive learning in this case is of order (1/ε) log(1/ε), hence no active learner can be significantly better than a passive learner on this distribution. [sent-127, score-0.664]
37 Consider a pool of m points in Rd , such that all the points are on the unit sphere, and for each pair of points x1 and x2 , ⟨x1 , x2 ⟩ ≤ 1 − 2γ. [sent-131, score-0.380]
38 Thus, if the correct labeling is all-positive, then all m examples need to be queried to label the pool correctly. [sent-138, score-0.61]
39 Yet another possible direction for pool-based active learning is to greedily select a query whose answer would determine the labels of the largest amount of pool examples. [sent-159, score-0.63]
40 The main challenge in this direction is how to analyze the label complexity of such an algorithm: it is unclear whether competitiveness with the optimal label complexity can be guaranteed in this case. [sent-160, score-0.336]
41 An active learning algorithm A obtains (X, L, T ) as input, where T is an integer which represents the label budget of A . [sent-171, score-0.332]
42 h∈H We denote the optimal worst-case label complexity for the given pool by OPTmax . [sent-194, score-0.461]
43 The optimal average label complexity for the given pool X and probability distribution P is defined as OPTavg = minA cavg (A ). [sent-197, score-0.49]
44 For a given active learner, we denote by Vt ⊆ H the version space of an active learner after t queries. [sent-198, score-0.499]
45 For a given pool example x ∈ X and a label j, denote by V^j_{t,x} the version space that would result if the algorithm now queried x and received label j. [sent-204, score-0.501]
46 For t = 1, . . . , T , the pool example x that A decides to query is one that splits the version space as evenly as possible. [sent-209, score-0.472]
47 For t = 1, . . . , T , the pool example x that A decides to query satisfies P(V^1_{t,x}) · P(V^{-1}_{t,x}) ≥ (1/α) · max_{x̃ ∈ X} P(V^1_{t,x̃}) · P(V^{-1}_{t,x̃}), and the output of the algorithm is (h(x1 ), . . . [sent-219, score-0.392]
48 We say that A outputs an approximate majority vote if whenever VT is pure enough, the algorithm outputs the majority vote on VT . [sent-233, score-0.386]
49 In the following theorem we provide the target-dependent label complexity bound, which holds for any approximate greedy algorithm that outputs an approximate majority vote. [sent-236, score-0.409]
50 For any algorithm alg, denote by Vt (alg, h) the version space induced by the first t labels it queries if the true labeling of the pool is consistent with h. [sent-250, score-0.638]
51 Denote the average version space reduction of alg after t queries by favg (alg,t) = 1 − Eh∼P [P(Vt (alg, h))]. [sent-251, score-0.412]
52 We show (see Appendix A) that for any hypothesis h ∈ H and any active learner alg, favg (opt, OPTmax ) − favg (alg,t) ≥ P(h)(P(Vt (alg, h)) − P(h)). [sent-254, score-0.654]
53 Moreover, our experimental results indicate that even when such an apriori bound is not known, using a majority vote is preferable to selecting an arbitrary random hypothesis from an impure version space (see Figure 1 in Section 6. [sent-262, score-0.33]
54 For any α > 1 there exists an α-approximately greedy algorithm A such that for any m > 0 there exists a pool X ⊆ [0, 1] of size m, and a threshold c such that P(hc ) = 1/2, while the label complexity of A for the labeling induced by hc is (m/⌈log(m)⌉) · OPTmax . [sent-271, score-0.456]
55 Proof For the hypothesis class Hline , the possible version spaces after a partial run of an active learner are all of the form [a, b] ⊆ [0, 1]. [sent-272, score-0.377]
56 First, it is easy to see that binary search on the pool can identify any hypothesis in [0, 1] using ⌈log(m)⌉ examples, thus OPTmax = ⌈log(m)⌉. [sent-273, score-0.371]
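For concreteness, a minimal sketch of the binary-search strategy used in this step; it assumes a sorted pool, realizable threshold labels (negative below the threshold, positive above), and a labeling oracle `query_label` — all illustrative assumptions rather than details taken from the paper.

```python
def binary_search_active(pool_sorted, query_label):
    """Locate the threshold within a sorted pool of points in [0, 1]
    using on the order of log2(m) label queries.

    Returns the index of the first positively labeled point
    (len(pool_sorted) if every point is negative).
    """
    lo, hi = 0, len(pool_sorted)        # invariant: first positive index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(pool_sorted[mid]) > 0:
            hi = mid                    # threshold is at or before mid
        else:
            lo = mid + 1                # threshold is strictly after mid
    return lo
```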
57 Now, consider an active learning algorithm that satisfies the following properties: • If the current version space is [a, b], it queries the smallest x that would still make the algorithm α-approximately greedy. [sent-274, score-0.354]
58 Now for a given pool size m ≥ 2, consider a pool of examples defined as follows. [sent-278, score-0.623]
59 Furthermore, suppose the true labeling is induced by h3/4 ; thus the only pool example with a positive label is x1 , and P(h3/4 ) = 1/2. [sent-282, score-0.522]
60 In this case, the algorithm we just defined will query all the pool examples x4 , x5 , . [sent-283, score-0.429]
61 , xm that does not include x3 , so the algorithm must query this entire pool to identify the correct labeling. [sent-291, score-0.445]
62 We can upper-bound log(1/P(h)) using the familiar notion of margin: For any hypothesis h ∈ W defined by some w ∈ Bd , let γ(h) be the maximal margin of the labeling of X by h, namely γ(h) = max_{v:‖v‖=1} min_{i∈[m]} h(xi ) ⟨v, xi ⟩ / ‖xi ‖. [sent-305, score-0.364]
63 . . . , 1 − c, 1}^2 for c = Θ(γ), and a target hypothesis h∗ ∈ W for which γ(h∗ ) = Ω(γ), such that there exists an exact greedy algorithm that requires Ω(ln(1/γ)) = Ω(ln(1/c)) labels in order to output a correct majority vote, while the optimal algorithm requires only O(log(log(1/γ))) queries. [sent-355, score-0.375]
64 Its inputs are the unlabeled sample X, the labeling oracle L, the maximal allowed number of label queries T , and the desired confidence δ ∈ (0, 1). [sent-362, score-0.408]
65 Denote by It the set of indices corresponding to the elements in the pool whose label was not queried yet (I0 = [m]). [sent-365, score-0.47]
66 To output an approximate majority vote from the final version space V , we would like to uniformly draw several hypotheses from V and label X according to a majority vote over these hypotheses. [sent-395, score-0.54]
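A rough sketch of this labeling step, assuming some external routine has already produced (approximately) uniform samples from the version space; representing hypotheses as weight vectors is an assumption of the sketch, and the names below are illustrative.

```python
import numpy as np

def majority_vote_labels(pool, sampled_hypotheses):
    """Label the pool by a majority vote over hypotheses sampled from the version space.

    pool: (m, d) array of examples; sampled_hypotheses: (k, d) array of weight vectors.
    Returns a vector of +1/-1 labels, one per pool example.
    """
    votes = np.sign(sampled_hypotheses @ pool.T)   # (k, m) matrix of individual predictions
    return np.sign(votes.sum(axis=0) + 1e-12)      # per-example majority label (ties -> +1)
```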
67 1 uses O(T m) volume estimations as a black-box procedure, where T is the budget of labels and m is the pool size. [sent-425, score-0.387]
68 In addition, by Hoeffding's inequality and a union bound over the examples in the pool and the iterations of the algorithm, P[∃x, |v_{x,j} − P[hi ∈ V^j_{t,x}]| ≥ λ] ≤ 2m exp(−2kλ²). [sent-449, score-0.330]
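Setting the right-hand side of the displayed bound to a target failure probability δ′ (an auxiliary symbol introduced here, not from the original text) and solving for k shows that the number of sampled hypotheses only needs to grow logarithmically with the pool size:

```latex
2m\exp(-2k\lambda^2) \le \delta'
\quad\Longleftrightarrow\quad
k \;\ge\; \frac{1}{2\lambda^2}\,\ln\!\left(\frac{2m}{\delta'}\right).
```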
69 Handling non-separable data and kernel representations: if the data pool X is not separable, but a small upper bound on the total hinge-loss of the best separator can be assumed, then ALuMA can be applied after a preprocessing step, which we describe in detail below. [sent-476, score-0.380]
70 With high probability, the new set of points will be separable with a margin that also depends on the original margin and on the amount of margin error. [sent-496, score-0.428]
71 ALuMA can be characterized by two properties: (1) its “objective” is to reduce the volume of the version space and (2) at each iteration, it aggressively selects an example from the pool so as to (approximately) minimize its objective as much as possible (in a greedy sense). [sent-553, score-0.484]
72 A pool-based active learner can be used to learn a classifier in the PAC model by first sampling a pool of m unlabeled examples from D, then applying the pool-based active learner to this pool, and finally running a standard passive learner on the labeled pool to obtain a classifier. [sent-575, score-1.421]
73 For the class of halfspaces, if we sample an unlabeled pool of m = Ω(d/ε) examples, then the learned classifier will have an error of at most ε (with high probability over the choice of the pool). [sent-576, score-0.349]
74 Compare two greedy pool-based active learners for Hline : The first follows a binary search procedure, greedily selecting the example that increases the number of known labels the most. [sent-578, score-0.373]
75 However, a better result holds for this simple case: Theorem 18 In the problem of thresholds on the line, for any pool with labeling L, the exact greedy algorithm requires at most O(log(1/γ(h))) labels. [sent-582, score-0.531]
76 This is also the label complexity of any approximate greedy algorithm that outputs a majority vote. [sent-583, score-0.409]
77 We now show that for every version space [a, b], at most two greedy queries suffice to either reduce the size of the version space by a factor of at least 2/3, or to determine the labels of all the points in the pool. [sent-586, score-0.462]
78 Assume for simplicity that the version space is [0, 1], and denote the pool of examples in the version space 1 by X. [sent-587, score-0.442]
79 This version space includes no more pool points, by the definition of β. [sent-602, score-0.349]
80 A similar argument can be carried out for ALuMA, using a smaller bound on α and more iterations due to the approximation, and noting that if the correct answer is in (α, 1 − α) then a majority vote over thresholds drawn randomly from the version space will label the examples correctly. [sent-606, score-0.382]
81 Dasgupta has demonstrated via this example that active learning can gain in label complexity from having significantly more unlabeled data. [sent-619, score-0.399]
82 In many applications, unlabeled examples are virtually free to sample, thus it can be worthwhile to allow the active learner to sample more examples than the passive sample complexity and use an aggressive strategy. [sent-621, score-0.675]
83 Note that in this example, a passive learner also requires access to all the points in the support of the distribution, thus CAL, passive learning, and optimal active learning all require the same size of a random unlabeled pool. [sent-636, score-0.579]
84 CAL inspects the unlabeled examples sequentially, and queries any example whose label cannot be inferred from previous labels. [sent-659, score-0.342]
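A minimal sketch of this mellow rule for the one-dimensional thresholds setting, where "cannot be inferred" simply means the point falls strictly between the largest known negative point and the smallest known positive point; the names and the streaming interface are illustrative assumptions, not details from the paper.

```python
def cal_thresholds(stream, query_label):
    """CAL-style selective sampling for thresholds on the line.

    stream: iterable of real-valued points, assumed to arrive in random order.
    query_label: oracle returning +1 or -1 for a point.
    Returns the number of labels actually queried.
    """
    max_neg, min_pos = float("-inf"), float("inf")   # disagreement region is (max_neg, min_pos)
    queries = 0
    for x in stream:
        if max_neg < x < min_pos:          # label of x is not implied by earlier answers
            queries += 1
            if query_label(x) > 0:
                min_pos = x
            else:
                max_neg = x
        # otherwise the label of x is inferred for free and no query is issued
    return queries
```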
85 These examples show that in some cases an aggressive approach is preferable to a mellow approach such as employed by CAL. [sent-672, score-0.405]
86 Our goal is twofold: first, to evaluate ALuMA in practice, and second, to compare the performance of aggressive strategies to that of mellow strategies. [sent-680, score-0.335]
87 In this experiment the pool of examples is taken to be the support of the distribution described in Example 21, with an additional dimension to account for halfspaces with a bias. [sent-742, score-0.443]
88 Table 1 (Octahedron: number of queries to achieve zero error): for d = 10, ALuMA 29, TK 156, QBC 50, CAL 308, ERM 1008; for d = 12, ALuMA 38, TK 735, QBC 113, CAL 862, ERM 3958; for d = 15, ALuMA 55, TK 959, QBC 150, CAL 2401, ERM > 20000. To summarize, in all of the experiments above, aggressive algorithms performed better than mellow ones. [sent-749, score-0.458]
89 Our theoretical results shed light on the relationship between the margin of the true separator and the number of active queries that the algorithm requires. [sent-791, score-0.466]
90 Lastly, our work shows that for low-error settings, the aggressive approach can be preferable to the mellow approach. [sent-798, score-0.368]
91 If f is adaptive monotone and adaptive submodular, and A is α-approximately greedy, then for any deterministic algorithm A∗ and for all positive integers t, k, favg (A, t) ≥ (1 − e^{−t/(αk)}) favg (A∗, k). [sent-826, score-0.358]
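One way to read this guarantee (a direct consequence of the displayed inequality, with ε an auxiliary accuracy parameter that does not appear in the original text): take A∗ to be an optimal deterministic worst-case learner, set k = OPT_max, and run the approximately greedy algorithm for

```latex
t \;=\; \big\lceil \alpha \,\mathrm{OPT}_{\max} \ln(1/\varepsilon) \big\rceil
\quad\text{rounds, which gives}\quad
f_{\mathrm{avg}}(A, t) \;\ge\; (1-\varepsilon)\, f_{\mathrm{avg}}(A^{*}, \mathrm{OPT}_{\max}),
```

since e^{−t/(α OPT_max)} ≤ ε for this choice of t.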
92 For any algorithm alg, denote by Vt (alg, h) the version space induced by the first t labels it queries if the true labeling of the pool is consistent with h. [sent-828, score-0.638]
93 The average version space reduction of alg after t queries is favg (alg, t) = 1 − Eh∼P [P(Vt (alg, h))]. [sent-830, score-0.412]
94 For any h ∈ H , any active learner A , and any t, favg (A ∗ , OPTmax ) − favg (A ,t) ≥ P(V (h|X )) (P(Vt (A , h)) − P(V (h|X ))) . [sent-841, score-0.576]
95 Proof Since A∗ achieves the optimal worst-case cost, the version space induced by the labels that A∗ queries within the first OPTmax iterations must be exactly the set of hypotheses which are consistent with the true labels of the sample. [sent-842, score-0.337]
96 By definition of favg , favg (A ∗ , OPTmax ) − favg (A ,t) = Eh∼P [P(Vt (A , h)) − P(VOPTmax (A ∗ , h))] = Eh∼P [P(Vt (A , h)) − P(V (h|X ))]. [sent-845, score-0.462]
97 It is shown (Balcan et al., 2006b) that if a set of m points is separable with margin η and k ≥ O((1 + H) ln(m/δ)/η²), then with probability 1 − δ, the projected points are separable with margin η/2. [sent-896, score-0.364]
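A small sketch of the kind of random projection such a statement concerns: a scaled Gaussian projection of the points to k dimensions. The choice of k described in the docstring follows the quoted bound only schematically; treat it as an assumption of the sketch rather than the paper's exact constant, and the function name as illustrative.

```python
import numpy as np

def random_project(X, k, seed=None):
    """Project the rows of X (shape (m, d)) to k dimensions with a scaled Gaussian matrix.

    Heuristically, taking k proportional to ln(m/delta)/eta**2 is the regime in which
    a set separable with margin eta stays separable with margin about eta/2
    with probability at least 1 - delta.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # entries ~ N(0, 1/k)
    return X @ R
```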
98 If the unlabeled sample contains only points from Sa , then an active learner has to query all the points in Sa to distinguish between a hypothesis that labels all of Sa positively and one that labels positively all but one point in Sa . [sent-966, score-0.685]
99 CAL examines the examples sequentially in a random order, and queries the label of any point whose label is not determined by previous examples. [sent-1037, score-0.412]
100 A bound on the label complexity of agnostic active learning. [sent-1256, score-0.369]
wordName wordTfidf (topN-words)
[('aluma', 0.566), ('pool', 0.293), ('optmax', 0.228), ('mellow', 0.213), ('active', 0.175), ('favg', 0.154), ('cal', 0.151), ('greedy', 0.135), ('label', 0.126), ('margin', 0.123), ('queries', 0.123), ('aggressive', 0.122), ('alfspaces', 0.118), ('onen', 0.118), ('qbc', 0.118), ('halfspaces', 0.113), ('passive', 0.113), ('vt', 0.113), ('labeling', 0.103), ('abato', 0.101), ('hwartz', 0.101), ('query', 0.099), ('learner', 0.093), ('sa', 0.092), ('vx', 0.092), ('vote', 0.087), ('sb', 0.083), ('alg', 0.079), ('fficient', 0.078), ('hypothesis', 0.078), ('majority', 0.076), ('pmin', 0.074), ('golovin', 0.069), ('dasgupta', 0.068), ('px', 0.066), ('labels', 0.063), ('bd', 0.061), ('balcan', 0.057), ('krause', 0.057), ('unlabeled', 0.056), ('xm', 0.053), ('ln', 0.053), ('queried', 0.051), ('vol', 0.051), ('realizable', 0.049), ('separator', 0.045), ('hline', 0.044), ('vxi', 0.044), ('cos', 0.043), ('complexity', 0.042), ('preprocessing', 0.042), ('sin', 0.041), ('earning', 0.041), ('bn', 0.038), ('rd', 0.037), ('opt', 0.037), ('iwal', 0.037), ('examples', 0.037), ('log', 0.035), ('hv', 0.034), ('querying', 0.034), ('utility', 0.033), ('preferable', 0.033), ('tk', 0.033), ('lemma', 0.032), ('uu', 0.032), ('hypotheses', 0.032), ('mnist', 0.032), ('sabato', 0.031), ('pac', 0.031), ('eh', 0.031), ('version', 0.031), ('budget', 0.031), ('tan', 0.031), ('outputs', 0.03), ('separable', 0.03), ('xi', 0.03), ('cavg', 0.029), ('cwc', 0.029), ('submodularity', 0.029), ('vxk', 0.029), ('points', 0.029), ('hc', 0.028), ('rm', 0.027), ('circle', 0.026), ('sgn', 0.026), ('halfspace', 0.026), ('agnostic', 0.026), ('item', 0.026), ('tong', 0.026), ('guarantees', 0.025), ('disagreement', 0.025), ('adaptive', 0.025), ('space', 0.025), ('vw', 0.024), ('evenly', 0.024), ('improvement', 0.024), ('target', 0.023), ('erm', 0.023), ('policy', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
2 0.10678474 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
3 0.090789445 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
4 0.085559323 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
5 0.082470715 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
6 0.06650281 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
7 0.065659843 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
8 0.045770388 76 jmlr-2013-Nonparametric Sparsity and Regularization
9 0.04413797 8 jmlr-2013-A Theory of Multiclass Boosting
10 0.043882262 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
11 0.042052258 78 jmlr-2013-On the Learnability of Shuffle Ideals
12 0.041920692 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
13 0.040314514 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
14 0.038171895 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
15 0.03720447 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
16 0.036279883 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
17 0.035623826 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
18 0.0351795 68 jmlr-2013-Machine Learning with Operational Costs
19 0.034871012 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
20 0.032857802 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
topicId topicWeight
[(0, -0.191), (1, 0.121), (2, 0.03), (3, 0.047), (4, -0.066), (5, -0.028), (6, 0.088), (7, -0.058), (8, -0.162), (9, -0.055), (10, 0.062), (11, -0.027), (12, -0.039), (13, 0.033), (14, -0.012), (15, 0.119), (16, 0.021), (17, 0.144), (18, -0.032), (19, -0.048), (20, -0.031), (21, 0.014), (22, 0.055), (23, 0.183), (24, 0.121), (25, -0.005), (26, -0.057), (27, -0.215), (28, 0.102), (29, 0.005), (30, 0.165), (31, 0.116), (32, -0.063), (33, 0.225), (34, 0.075), (35, -0.167), (36, -0.207), (37, 0.073), (38, -0.124), (39, -0.007), (40, 0.025), (41, -0.047), (42, 0.037), (43, 0.079), (44, -0.014), (45, -0.084), (46, -0.085), (47, 0.011), (48, -0.036), (49, 0.108)]
simIndex simValue paperId paperTitle
same-paper 1 0.91645402 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
2 0.62677532 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
3 0.46826455 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
Author: Joachim Niehren, Jérôme Champavère, Aurélien Lemay, Rémi Gilleron
Abstract: Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schemaguided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction. Keywords: XML information extraction, XML schemas, interactive learning, tree automata, grammatical inference
4 0.38255563 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
5 0.34485567 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
Author: Partha Niyogi
Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates
6 0.32869813 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
7 0.30793419 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.29617107 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
9 0.28228515 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
10 0.26222888 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
11 0.26155543 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
12 0.258524 68 jmlr-2013-Machine Learning with Operational Costs
13 0.23904802 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
14 0.23737523 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
15 0.23697937 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
16 0.23432669 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
17 0.22366948 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
18 0.22307725 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
19 0.21991436 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
20 0.21313374 78 jmlr-2013-On the Learnability of Shuffle Ideals
topicId topicWeight
[(0, 0.017), (5, 0.135), (6, 0.036), (10, 0.068), (20, 0.015), (23, 0.03), (50, 0.361), (53, 0.018), (68, 0.021), (70, 0.027), (75, 0.037), (85, 0.107), (87, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.6681155 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
2 0.4602606 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
3 0.45476973 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
4 0.44638565 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.43912086 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.43659282 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
7 0.43612057 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
8 0.43369067 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
9 0.42977101 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
10 0.42577854 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.42433247 73 jmlr-2013-Multicategory Large-Margin Unified Machines
12 0.42390996 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
13 0.42348003 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
15 0.42207852 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
16 0.42194712 114 jmlr-2013-The Rate of Convergence of AdaBoost
17 0.42098671 32 jmlr-2013-Differential Privacy for Functions and Functional Data
18 0.42082778 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
19 0.41898236 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
20 0.4182007 68 jmlr-2013-Machine Learning with Operational Costs