jmlr jmlr2012 jmlr2012-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sivan Sabato, Naftali Tishby
Abstract: In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. Keywords: multiple-instance learning, learning theory, sample complexity, PAC learning, supervised classification
Reference: text
sentIndex sentText sentNum sentScore
1 The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. [sent-8, score-1.416]
2 The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. [sent-15, score-0.586]
3 In MIL additional structure is assumed, whereby the examples are received as bags of instances, such that each bag is composed of several instances. [sent-23, score-0.756]
4 In classical MIL the label of a bag is the Boolean OR of the labels of the instances the bag contains. [sent-25, score-1.184]
5 S ABATO AND T ISBHY other than Boolean OR determines bag labels based on instance labels. [sent-31, score-0.572]
6 It is possible, in principle, to view MIL as a regular supervised classification task, where a bag is a single example, and the instances in a bag are merely part of its internal representation. [sent-34, score-1.149]
7 Thus, a molecule can be thought of as a bag of conformations, where each conformation is an instance in the bag representing the molecule. [sent-56, score-1.136]
8 d from a single distribution over instances, so that the instances in each bag are statistically independent. [sent-64, score-0.602]
9 The computational complexity of learning is polynomial in the bag size and in the sample size. [sent-74, score-0.643]
10 3000 M ULTI -I NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS The assumption of statistical independence of the instances in each bag is, however, very limiting, as it is irrelevant to many applications. [sent-75, score-0.602]
11 In the second setting one assumes that bags are drawn from an arbitrary distribution over bags, so that the instances within a bag may be statistically dependent. [sent-76, score-0.811]
12 This result implies that it is not possible to PAC-learn MIL on APRs using an algorithm which is efficient in both the bag size and the problem’s dimensionality. [sent-81, score-0.547]
13 Our sample complexity analysis shows that for binary hypothesis and thresholded real-valued hypotheses, the distribution-free sample complexity for generalized MIL grows only logarithmically with the maximal bag size. [sent-98, score-0.836]
14 These bound are useful when only the average bag size is bounded. [sent-101, score-0.573]
15 The construction is simple to implement, and provides a computationally efficient PAC-learning of MIL, with only a polynomial dependence of the run time on the bag size. [sent-105, score-0.571]
16 We point out 3001 S ABATO AND T ISBHY that it is not generally possible to find a low-error classification rule for instances based on a bag sample. [sent-108, score-0.602]
17 As a simple counter example, assume that the label of a bag is the Boolean OR of the labels of its instances, and that every bag includes both a positive instance and a negative instance. [sent-109, score-1.135]
18 In this case all bags are labeled as positive, and it is not possible to distinguish the two types of instances by observing only bag labels. [sent-110, score-0.856]
19 Distribution-dependent results for binary learning and real-valued learning based on the average bag size are presented in Section 6. [sent-116, score-0.564]
20 A bag is a finite ordered set of instances from X . [sent-133, score-0.602]
21 Every MIL problem is defined by a fixed bag-labeling function ψ : I (R) → I that determines the bag labels given the instance labels. [sent-152, score-0.572]
22 Formally, every instance hypothesis h : X → I defines a bag hypothesis, denoted by h : X (R) → I and defined by ∀¯ ∈ X (R) , x h(¯ ) x ψ(h(x[1]), . [sent-153, score-0.7]
23 We assume the labeled bags are drawn from a fixed distribution D over X (R) × {−1, +1}, where each pair drawn from D constitutes a bag and its binary label. [sent-160, score-0.818]
24 The goal of the learner is to return h that ˆ D) compared to the minimal loss that can be achieved with the bag hypothesis has a low loss ℓ(h, class, denoted by ℓ∗ (H , D) infh∈H ℓ(h, D). [sent-168, score-0.73]
25 1 Classes of Real-Valued bag-functions In classical MIL the bag function is the Boolean OR over binary labels, that is I = {−1, +1} and ψ = OR : {−1, +1}(R) → {−1, +1}. [sent-173, score-0.583]
26 We further consider two classes of bag functions over reals, each representing a different generalization of the max function, which conserves a different subset of its properties. [sent-175, score-0.563]
27 The second class of bag functions we consider generalizes the max function by noting that for bounded inputs, the max function can be seen as a variant of the infinity-norm z ∞ = max |z[i]|. [sent-190, score-0.595]
28 More generally, we treat the case where the hypotheses map into I = [−1, 1], and consider the class of bag functions inspired by a p-norm, defined as follows. [sent-192, score-0.607]
29 Definition 3 For p ∈ [1, ∞), the p-norm bag function ψ p : [−1, +1](R) → [−1, +1] is defined by: n ∀z ∈ R , ψ p (z) 1 n ∑ (z[i] + 1) p n i=1 For p = ∞, Define ψ∞ ≡ lim p→∞ ψ p . [sent-193, score-0.547]
30 This is formalized in the following definition of a Lipschitz bag function. [sent-200, score-0.547]
31 Definition 4 A bag function ψ : R(R) → R is c-Lipschitz with respect to the infinity norm for c > 0 if ∀n ∈ R, ∀a, b ∈ Rn , |ψ(a) − ψ(b)| ≤ c a − b ∞ . [sent-201, score-0.547]
32 The average bag-function and the max bag functions are 1-Lipschitz. [sent-202, score-0.563]
33 All p-norm bag functions are also 1-Lipschitz, as the following derivation shows: |ψ p (a) − ψ p (b)| = n−1/p · | a + 1 p− b + 1 p | ≤ n−1/p · a − b p ≤ a−b ∞. [sent-204, score-0.547]
34 Thus we bound the VC-dimension of H as a function of the maximal possible bag size r = max R, and of the VC-dimension of H . [sent-211, score-0.589]
35 3004 M ULTI -I NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS Lemma 5 For any R ⊆ N and any bag function ψ : {−1, +1}(R) → {−1, +1}, and for any hypothesis class H ⊆ {−1, +1}X and a finite set of bags S ⊆ X (R) , H |S ≤ H |S∪ . [sent-222, score-0.884]
36 Thus there exists at least one bag x ∈ S such that g2 (x) = g2 (x). [sent-228, score-0.547]
37 In this case dr < dr / log dr < 2d, thus d log dr = 2d log dr ≤ 2d log 2d. [sent-255, score-0.638]
38 , kd ] is defined as follows (see illustration in Table 1): For x ∈ X which is not an instance of any bag in S, h[k1 , . [sent-325, score-0.595]
39 Each line represents a bag in S, each column represents an instance in the bag. [sent-348, score-0.572]
40 The numbers next to the instances denote the bag to which an instance belongs, and match the sequence N defined in the proof. [sent-395, score-0.627]
41 Similarly, for a bag xi = k )r , define the bag s(¯ ,t) n )r . [sent-425, score-1.109]
42 Assume that the bag function is ψ = OR and the set of allowed bag sizes is R. [sent-451, score-1.094]
43 In Section 6 we provide sample complexity analysis for distribution-dependent binary MIL, as a function of the average bag size. [sent-493, score-0.636]
44 In this section we provide a lemma that relates the covering numbers of bag hypothesis classes with those of the underlying instance hypothesis class. [sent-498, score-0.877]
45 The following lemma bounds the covering number of bag hypothesis-classes whenever the bag function is Lipschitz with respect to the infinity norm (see Definition 4). [sent-508, score-1.143]
46 Let S ⊆ X (R) be a finite set of bags, and let r be the average size of a bag in S. [sent-511, score-0.57]
47 ar1/p ¯ Proof First, note that by the Lipschitz condition on ψ, for any bag x of size n and hypotheses h, g ∈ H , |h(¯ ) − g(¯ )| = |ψ(h(x[1]), . [sent-513, score-0.607]
48 , 2002) attempts to optimize an adaptation of the soft-margin SVM objective (Cortes and Vapnik, 1995) to MIL, in which the margin of a bag is the maximal margin achieved by any of its instances. [sent-537, score-0.607]
49 The objective of MI-SVM amounts to replacing the hypothesis class H of separating hyperplanes with the class of bag-hypotheses H where the bag function is ψ = max. [sent-542, score-0.693]
50 Thus, we provide an upper bound on the fat-shattering dimension 3011 S ABATO AND T ISBHY of MIL as a function of the fat-shattering dimension of the underlying hypothesis class, and of the maximal allowed bag size. [sent-546, score-0.701]
51 Let H ⊆ [0, B]X be a real-valued hypothesis class and assume that the bag function ψ : [0, B](R) → [0, aB] is a-lipschitz with respect to the infinity norm. [sent-551, score-0.675]
52 (5) This theorem shows that for margin learning as well, the dependence of the bag size on the sample complexity is poly-logarithmic. [sent-553, score-0.674]
53 Sample Complexity by Average Bag Size: Rademacher Complexity The upper bounds we have shown so far provide distribution-free sample complexity bounds, which depend only on the maximal possible bag size. [sent-592, score-0.619]
54 In this section we show that even if the bag size is unbounded, we can still have a sample complexity guarantee, if the average bag size for the input distribution is bounded. [sent-593, score-1.166]
55 Our sample complexity bound for binary MIL depends on the average ¯ bag size and the VC dimension of the instance hypothesis class. [sent-615, score-0.815]
56 Let r be the average bag size under distribution D over labeled bags. [sent-620, score-0.592]
57 (15) Let r(S) be the average bag size in the sample S, that is r(S) = |S∪ |/|S|. [sent-632, score-0.58]
58 m S ABATO AND T ISBHY We conclude that even when the bag size is not bounded, the sample complexity of binary MIL with a specific distribution depends only logarithmically on the average bag size in this distribution, and linearly on the VC-dimension of the underlying instance hypothesis class. [sent-638, score-1.336]
59 This bound will depend on the average bag size, and on the Rademacher complexity of the instance hypothesis class. [sent-641, score-0.765]
60 For the bag function, recall that all extensions of monotone Boolean functions are Lipschitz with respect to the infinity norm. [sent-643, score-0.582]
61 ˆ ˆ The following lemma provides a bound on the empirical Rademacher complexity of MIL, as a function of the average bag size in the sample and of the behavior of the worst-case Rademacher complexity over instances. [sent-648, score-0.699]
62 To avoid degenerate cases, we consider only losses such that there exists at least one labeled bag (¯ , y) ⊆ x X (R) × {−1, +1} and hypotheses h, g ∈ H such that hℓ (¯ , y) = 0 and gℓ (¯ , y) = 1. [sent-651, score-0.652]
63 Let R ⊆ N, and let the bag function ψ : R(R) → R be a1 -Lipschitz with respect to the infinity norm. [sent-654, score-0.57]
64 sup Let S be a labeled bag-sample of size m, with an average bag size r. [sent-658, score-0.634]
65 We now show that the assumption γ ≤ B does not restrict us: By the assumptions on ℓ, there x x x are h, g ∈ H and a labeled bag (¯ , y) such that hℓ (¯ , y) = 1 and gℓ (¯ , y) = 0. [sent-680, score-0.592]
66 √ a+b ≤ √ √ a + b for non-negative a and b, and from Based on Lemma 19, we will now bound the average Rademacher complexity of MIL, as a function of the worst-case Rademacher complexity over instances, and the expected bag size. [sent-693, score-0.651]
67 Since the number of instances in a bag sample of a certain size is not fixed, but depends on the bag sizes in sup the specific sample, we will need to consider the behavior of Rm (H ) for different values of m. [sent-694, score-1.224]
68 The resulting bound indicates that here too there is a poly-logarithmic dependence of the sample complexity on the average bag size. [sent-697, score-0.645]
69 Let R ⊆ N, and let the bag function ψ : R(R) → R be a1 -Lipschitz with respect to the infinity norm. [sent-700, score-0.57]
70 Rm (H ℓ , D) ≤ m Proof Let S be a labeled bag sample of size m, and let r be its average bag size. [sent-705, score-1.195]
71 m Now, for a given sample S denote its average bag size by r(S). [sent-729, score-0.58]
72 By Theorem 20 there exists a number N such that for any 1-Lipschitz bag-function ψ (such as max) and for any distribution D over labeled bags with an average bag size of r, we have Rm (H ℓ , D) ≤ 4 + 10 log(16eC2 rm2 ) (N +C ln(16m)) √ . [sent-740, score-0.801]
73 PAC-Learning for MIL In the previous sections we addressed the sample complexity of generalized MIL, showing that it grows only logarithmically with the bag size. [sent-743, score-0.619]
74 Furthermore, if A is efficient then the resulting Boosting algorithm is also efficient, with a polynomial dependence on the maximal bag size. [sent-751, score-0.571]
75 The algorithm MILearn, listed as Algorithm 1 below, accepts as input a bag sample S and a bounded bag-function ψ. [sent-849, score-0.58]
76 MILearn constructs a sample of instances SI from the instances that make up the bags in S, labeling each instance in SI with the label of the bag it came from. [sent-852, score-0.94]
77 The weights of the instances depend on whether the bag they came from was positive or negative, and on the boundedness properties of ψ. [sent-853, score-0.628]
78 h∈Ω(H ,S) Denote the weight of the positive bags in the input sample S by W+ = ∑i:yi =+1 wi and the weight of the negative bags by W− = ∑i:yi =−1 wi . [sent-879, score-0.605]
79 The next theorem provides a guarantee for MILearn that depends on the tightness of this inequality for the given bag function. [sent-890, score-0.572]
80 We subsequently show how these general guarantees translate to a specific result for the max function, and other bag functions with the same boundedness properties. [sent-893, score-0.589]
81 A with a weighted bag sample S of total weight 1, and let Consider running the algorithm MILearn h◦ be the hypothesis returned by MILearnA . [sent-898, score-0.747]
82 We then show that given an algorithm on instances that satisfies one of these definitions, we can construct an algorithm for MIL which approximately maximizes the margin on an input bag sample. [sent-977, score-0.632]
83 3027 S ABATO AND T ISBHY Specifically, if the input bag sample is realizable by H , then the MIL algorithm we propose will find a linear combination of bag hypotheses that classifies the sample with zero error, and with a positive margin. [sent-978, score-1.24]
84 We also show below that under certain conditions this linear combination induces a positive margin on the input bag sample with high probability. [sent-999, score-0.61]
85 Therefore, the computational complexity of MILearnOε,δ is polynomial in c(ε, δ) and in N, where N is the total number of instances in the input bag sample S. [sent-1002, score-0.698]
86 Therefore, by case (2) of Corollary 28, if MILearn ε,δ/k receives a weighted bag sample Sw , with probability 1 − δ/2k it returns a bag hypothesis h◦ ∈ H + such that Γ(h◦ , Sw ) ≥ suph∈Ω(H ,S) Γ(h, Sw ) − r2 ε 2r − 1 . [sent-1021, score-1.321]
87 (28) Due to the realizability assumption for D, there is an h ∈ Ω(H , S) that classifies correctly the bag sample S. [sent-1023, score-0.58]
88 We have shown that the dependence of the sample complexity of 3031 S ABATO AND T ISBHY generalized MIL on the number of instances in a bag is only poly-logarithmic, thus implying that the statistical performance of MIL is only mildly sensitive to the size of the bag. [sent-1068, score-0.674]
89 Our sample complexity results can be summarized as follows, where d is the VC dimension or pseudo-dimension of the underlying hypothesis class, and r is the maximal/average bag size. [sent-1070, score-0.747]
90 • For non-trivial bag functions, there are hypothesis classes such that the VC dimension of binary MIL is Ω(d log(r)). [sent-1072, score-0.692]
91 • The pseudo-dimension of binary MIL for bag functions that are extensions of monotone Boolean functions is O(d log(r)). [sent-1074, score-0.599]
92 • Covering numbers for MIL hypotheses with Lipschitz bag functions can be bounded by covering numbers for the single instance hypothesis class. [sent-1075, score-0.794]
93 • The fat-shattering dimension of real-valued MIL with Lipschitz bag-functions is poly-logarithmic in the bag size and quasilinear in the fat shattering dimension of the single instance hypothesis class. [sent-1076, score-0.811]
94 • The Rademacher complexity of binary MIL with a bounded average bag size is O( d log(r)/m) where m is the sample size. [sent-1077, score-0.636]
95 • The Rademacher complexity of real-valued MIL with a Lipschitz loss function and a Lipschitz bag function is upper bounded by a logarithmic dependence on the average bag size and a quasilinear dependence on the Rademacher complexity of the instance hypothesis class. [sent-1078, score-1.325]
96 Proof of Theorem 27 The first step in providing a guarantee for the edge achieved by MILearn, is to prove a guarantee for the edge achieved on the bag sample by the hypothesis returned by A in step (3) of the algorithm. [sent-1092, score-0.774]
97 Consider running the algorithm MILearn with a weighted bag sample S of total weight 1. [sent-1095, score-0.596]
98 (31) x∈¯ x ¯ Assume the input bag sample is S = {(wi , xi , yi )}i∈[m] . [sent-1117, score-0.617]
99 We have Γ(h, S) = x x ∑ wi h(¯ i ) − ∑ wi h(¯ i ) i∈I+ ≥ = i∈I− ∑ wi (a ∑ h(x) + b) − ∑ wi (c ∑ h(x) + d) i∈I+ x∈¯ i x i∈I− (32) x∈¯ i x ∑ wi a ∑ h(x) − ∑ wi c ∑ h(x) + ∑ wi b − ∑ wi d. [sent-1120, score-0.616]
100 As evident by steps (1,2) of MILearn, In the sample SI all instances from positive bags have weight α(+1) = a, and all instances from negative bags have weight α(−1) = c. [sent-1122, score-0.561]
wordName wordTfidf (topN-words)
[('mil', 0.615), ('bag', 0.547), ('bags', 0.209), ('milearn', 0.139), ('ln', 0.129), ('hypothesis', 0.128), ('fat', 0.111), ('abato', 0.111), ('isbhy', 0.111), ('lass', 0.111), ('nstance', 0.111), ('ypothesis', 0.111), ('zb', 0.104), ('boolean', 0.102), ('dr', 0.102), ('wi', 0.077), ('sx', 0.064), ('ulti', 0.063), ('hypotheses', 0.06), ('rc', 0.058), ('zn', 0.056), ('shattered', 0.055), ('hpos', 0.055), ('instances', 0.055), ('learner', 0.055), ('nm', 0.055), ('tf', 0.051), ('adaboost', 0.051), ('aprs', 0.05), ('sabato', 0.05), ('labeled', 0.045), ('hi', 0.045), ('rf', 0.045), ('sw', 0.044), ('rademacher', 0.043), ('sup', 0.042), ('si', 0.039), ('complexity', 0.039), ('zi', 0.038), ('er', 0.037), ('earning', 0.036), ('hw', 0.035), ('monotone', 0.035), ('covering', 0.034), ('edge', 0.033), ('sample', 0.033), ('weak', 0.032), ('log', 0.032), ('boosting', 0.032), ('margin', 0.03), ('agnostic', 0.029), ('co', 0.029), ('ny', 0.028), ('milearna', 0.028), ('nrm', 0.028), ('suph', 0.028), ('swt', 0.028), ('sign', 0.027), ('equation', 0.026), ('returns', 0.026), ('boundedness', 0.026), ('bound', 0.026), ('theorem', 0.025), ('instance', 0.025), ('receives', 0.024), ('andrews', 0.024), ('polynomial', 0.024), ('kd', 0.023), ('rm', 0.023), ('let', 0.023), ('maron', 0.022), ('sivan', 0.022), ('yi', 0.022), ('hn', 0.021), ('dm', 0.021), ('lipschitz', 0.02), ('realizable', 0.02), ('classical', 0.019), ('em', 0.019), ('nity', 0.019), ('vc', 0.018), ('separating', 0.018), ('binary', 0.017), ('schapire', 0.017), ('ak', 0.017), ('auer', 0.017), ('bartlett', 0.017), ('awi', 0.017), ('conformations', 0.017), ('improperly', 0.017), ('milearno', 0.017), ('molecule', 0.017), ('label', 0.016), ('max', 0.016), ('dudley', 0.016), ('weighted', 0.016), ('lemma', 0.015), ('ht', 0.015), ('mn', 0.015), ('xi', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
Author: Sivan Sabato, Naftali Tishby
Abstract: In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. Keywords: multiple-instance learning, learning theory, sample complexity, PAC learning, supervised classification
2 0.095551126 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
3 0.078046329 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
4 0.076114863 13 jmlr-2012-Active Learning via Perfect Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We discover a strong relation between two known learning models: stream-based active learning and perfect selective classification (an extreme case of ‘classification with a reject option’). For these models, restricted to the realizable case, we show a reduction of active learning to selective classification that preserves fast rates. Applying this reduction to recent results for selective classification, we derive exponential target-independent label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high dimensional mixture of Gaussians. Finally, we study the relation between the proposed technique and existing label complexity measures, including teaching dimension and disagreement coefficient. Keywords: classification with a reject option, perfect classification, selective classification, active learning, selective sampling, disagreement coefficient, teaching dimension, exploration vs. exploitation 1. Introduction and Related Work Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, potentially leading to significantly faster learning. In recent years it has been gaining considerable recognition as a vital technique for efficiently implementing inductive learning in many industrial applications where abundance of unlabeled data exists, and/or in cases where labeling costs are high. In this paper we expose a strong relation between active learning and selective classification, another known alternative learning model (Chow, 1970; El-Yaniv and Wiener, 2010). Focusing on binary classification in realizable settings we consider standard stream-based active learning, which is also referred to as online selective sampling (Atlas et al., 1990; Cohn et al., 1994). In this model the learner is given an error objective ε and then sequentially receives unlabeled examples. At each step, after observing an unlabeled example x, the learner decides whether or not to request the label of x. The learner should terminate the learning process and output a binary classifier whose true error is guaranteed to be at most ε with high probability. The penalty incurred by the learner is the number of label requests made and this number is called the label complexity. A label complexity bound of O(d log(d/ε)) for actively learning ε-good classifier from a concept class with VC-dimension d, provides an exponential speedup in terms of 1/ε relative to standard (passive) supervised learning where the sample complexity is typically O(d/ε). The study of (stream-based, realizable) active learning is paved with very interesting theoretical results. Initially, only a few cases were known where active learning provides significant advanc 2012 Ran El-Yaniv and Yair Wiener. E L -YANIV AND W IENER tage over passive learning. Perhaps the most favorable result was an exponential label complexity speedup for learning homogeneous linear classifiers where the (linearly separable) data is uniformly distributed over the unit sphere. This result was manifested by various authors using various analysis techniques, for a number of strategies that can all be viewed in hindsight as approximations or variations of the “CAL algorithm” of Cohn et al. (1994). Among these studies, the earlier theoretical results (Seung et al., 1992; Freund et al., 1993, 1997; Fine et al., 2002; Gilad-Bachrach, 2007) considered Bayesian settings and studied the speedup obtained by the Query by Committee (QBC) algorithm. The more recent results provided PAC style analyses (Dasgupta et al., 2009; Hanneke, 2007a, 2009). Lack of positive results for other non-toy problems, as well as various additional negative results that were discovered, led some researchers to believe that active learning is not necessarily advantageous in general. Among the striking negative results is Dasgupta’s negative example for actively learning general (non-homogeneous) linear classifiers (even in two dimensions) under the uniform distribution over the sphere (Dasgupta, 2005). A number of recent innovative papers proposed alternative models for active learning. Balcan et al. (2008) introduced a subtle modification of the traditional label complexity definition, which opened up avenues for new positive results. According to their new definition of “non-verifiable” label complexity, the active learner is not required to know when to stop the learning process with a guaranteed ε-good classifier. Their main result, under this definition, is that active learning is asymptotically better than passive learning in the sense that only o(1/ε) labels are required for actively learning an ε-good classifier from a concept class that has a finite VC-dimension. Another result they accomplished is an exponential label complexity speedup for (non-verifiable) active learning of non-homogeneous linear classifiers under the uniform distribution over the the unit sphere. Based on Hanneke’s characterization of active learning in terms of the “disagreement coefficient” (Hanneke, 2007a), Friedman (2009) recently extended the Balcan et al. results and proved that a target-dependent exponential speedup can be asymptotically achieved for a wide range of “smooth” learning problems (in particular, the hypothesis class, the instance space and the distribution should all be expressible by smooth functions). He proved that under such smoothness conditions, for any target hypothesis h∗ , Hanneke’s disagreement coefficient is bounded above in terms of a constant c(h∗ ) that depends on the unknown target hypothesis h∗ (and is independent of δ and ε). The resulting label complexity is O (c(h∗ ) d polylog(d/ε)) (Hanneke, 2011b). This is a very general result but the target-dependent constant involved in this bound is only guaranteed to be finite. With this impressive progress in the case of target-dependent bounds for active learning, the current state of affairs in the target-independent bounds for active learning arena leaves much to be desired. To date the most advanced result in this model, which was already essentially established by Seung et al. and Freund et al. more than fifteen years ago (Seung et al., 1992; Freund et al., 1993, 1997), is still a target-independent exponential speed up bound for homogeneous linear classifiers under the uniform distribution over the sphere. The other learning model we contemplate that will be shown to have strong ties to active learning, is selective classification, which is mainly known in the literature as ‘classification with a reject option.’ This old-timer model, that was already introduced more than fifty years ago (Chow, 1957, 1970), extends standard supervised learning by allowing the classifier to opt out from predictions in cases where it is not confident. The incentive is to increase classification reliability over instances that are not rejected by the classifier. Thus, using selective classification one can potentially achieve 256 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION a lower error rate using the same labeling “budget.” The main quantities that characterize a selective classifier are its (true) error and coverage rate (or its complement, the rejection rate). There is already substantial volume of research publications on selective classification, that kept emerging through the years. The main theme in many of these publications is the implementation of certain reject mechanisms for specific learning algorithms like support vector machines and neural networks. Among the few theoretical studies on selective classification, there are various excess risk bounds for ERM learning (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp, 2007), and certain coverage/risk guarantees for selective ensemble methods (Freund et al., 2004). In a recent work (El-Yaniv and Wiener, 2010) the trade-off between error and coverage was examined and in particular, a new extreme case of selective learning was introduced. In this extreme case, termed here “perfect selective classification,” the classifier is given m labeled examples and is required to instantly output a classifier whose true error is perfectly zero with certainty. This is of course potentially doable only if the classifier rejects a sufficient portion of the instance space. A non-trivial result for perfect selective classification is a high probability lower bound on the classifier coverage (or equivalently, an upper bound on its rejection rate). Such bounds have recently been presented in El-Yaniv and Wiener (2010). In Section 3 we present a reduction of active learning to perfect selective classification that preserves “fast rates.” This reduction enables the luxury of analyzing dynamic active learning problems as static problems. Relying on a recent result on perfect selective classification from El-Yaniv and Wiener (2010), in Section 4 we then apply our reduction and conclude that general (non-homogeneous) linear classifiers are actively learnable at exponential (in 1/ε) label complexity rate when the data distribution is an arbitrary unknown finite mixture of high dimensional Gaussians. While we obtain exponential label complexity speedup in 1/ε, we incur exponential slowdown in d 2 , where d is the problem dimension. Nevertheless, in Section 5 we prove a lower bound of Ω((log m)(d−1)/2 (1 + o(1)) on the label complexity, when considering the class of unrestricted linear classifiers under a Gaussian distribution. Thus, an exponential slowdown in d is unavoidable in such settings. Finally, in Section 6 we relate the proposed technique to other complexity measures for active learning. Proving and using a relation to the teaching dimension (Goldman and Kearns, 1995) we show, by relying on a known bound for the teaching dimension, that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke’s disagreement coefficient and show that the coverage of perfect selective classification can be bounded below using the disagreement coefficient. Conversely, we show that the disagreement coefficient can be bounded above using any coverage bound for perfect selective classification. Consequently, the results here imply that the disagreement coefficient can be sufficiently bounded to ensure fast active learning for the case of linear classifiers under a mixture of Gaussians. 2. Active Learning and Perfect Selective Classification In binary classification the goal is to learn an accurate binary classifier, h : X → {±1}, from a finite labeled training sample. Here X is some instance space and the standard assumption is that the training sample, Sm = {(xi , yi )}m , containing m labeled examples, is drawn i.i.d. from some i=1 unknown distribution P(X,Y ) defined over X × {±1}. The classifier h is chosen from some hypothesis class H . In this paper we focus on the realizable setting whereby labels are defined by 257 E L -YANIV AND W IENER some unknown target hypothesis h∗ ∈ H . Thus, the underlying distribution reduces to P(X). The performance of a classifier h is quantified by its true zero-one error, R(h) Pr{h(X) = h∗ (X)}. A positive result for a classification problem (H , P) is a learning algorithm that given an error target ε and a confidence parameter δ can output, based on Sm , an hypothesis h whose error R(h) ≤ ε, with probability of at least 1 − δ. A bound B(ε, δ) on the size m of labeled training sample sufficient for achieving this is called the sample complexity of the learning algorithm. A classical result is that any consistent learning algorithm has sample complexity of O( 1 (d log( 1 ) + log( 1 ))), where d is ε ε δ the VC-dimension of H (see, e.g., Anthony and Bartlett, 1999). 2.1 Active Learning We consider the following standard active learning model. In this model the learner sequentially observes unlabeled instances, x1 , x2 , . . ., that are sampled i.i.d. from P(X). After receiving each xi , the learning algorithm decides whether or not to request its label h∗ (xi ), where h∗ ∈ H is an unknown target hypothesis. Before the start of the game the algorithm is provided with some desired error rate ε and confidence level δ. We say that the learning algorithm actively learned the problem instance (H , P) if at some point it can terminate this process, after observing m instances and requesting k labels, and output an hypothesis h ∈ H whose error R(h) ≤ ε, with probability of at least 1 − δ. The quality of the algorithm is quantified by the number k of requested labels, which is called the label complexity. A positive result for a learning problem (H , P) is a learning algorithm that can actively learn this problem for any given ε and δ, and for every h∗ , with label complexity bounded above by L(ε, δ, h∗ ). If there is a label complexity bound that is O(polylog(1/ε)) we say that the problem is actively learnable at exponential rate. 2.2 Selective Classification Following the formulation in El-Yaniv and Wiener (2010) the goal in selective classification is to learn a pair of functions (h, g) from a labeled training sample Sm (as defined above for passive learning). The pair (h, g), which is called a selective classifier, consists of a binary classifier h ∈ H , and a selection function, g : X → {0, 1}, which qualifies the classifier h as follows. For any sample x ∈ X , the output of the selective classifier is (h, g)(x) h(x) iff g(x) = 1, and (h, g)(x) abstain iff g(x) = 0. Thus, the function g is a filter that determines a sub-domain of X over which the selective classifier will abstain from classifications. A selective classifier is thus characterized by its coverage, Φ(h, g) EP {g(x)}, which is the P-weighted volume of the sub-domain of X that is not filtered out, and its error, R(h, g) = E{I(h(X) = h∗ (X)) · g(X)}/Φ(h, g), which is the zero-one loss restricted to the covered sub-domain. Note that this is a “smooth” generalization of passive learning and, in particular, R(h, g) reduces to R(h) (standard classification) if g(x) ≡ 1. We expect to see a trade-off between R(h, g) and Φ(h, g) in the sense that smaller error should be obtained by compromising the coverage. A major issue in selective classification is how to optimally control this trade-off. In this paper we are concerned with an extreme case of this trade-off whereby (h, g) is required to achieve a perfect score of zero error with certainty. This extreme learning objective is termed perfect learning in El-Yaniv and Wiener (2010). Thus, for a perfect selective classifier (h, g) we always have R(h, g) = 0, and its quality is determined by its guaranteed coverage. A positive result for (perfect) selective classification problem (H , P) is a learning algorithm that uses a labeled training sample Sm (as in passive learning) to output a perfect selective classifier (h, g) for which Φ(h, g) ≥ BΦ (H , δ, m) with probability of at least 1 − δ, for any given δ. The bound 258 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION BΦ = BΦ (H , δ, m) is called a coverage bound (or coverage rate) and its complement, 1 − BΦ , is called a rejection bound (or rate). A coverage rate BΦ = 1 − O( polylog(m) ) (and the corresponding m 1 − BΦ rejection rate) are qualified as fast. 2.3 The CAL Algorithm and the Consistent Selective Strategy (CSS) The major players in active learning and in perfect selective classification are the CAL algorithm and the consistent selective strategy (CSS), respectively. To define them we need the following definitions. Definition 1 (Version space, Mitchell, 1977) Given an hypothesis class H and a training sample Sm , the version space V SH ,Sm is the set of all hypotheses in H that classify Sm correctly. Definition 2 (Disagreement set, Hanneke, 2007a; El-Yaniv and Wiener, 2010) Let G ⊂ H . The disagreement set w.r.t. G is defined as DIS(G ) {x ∈ X : ∃h1 , h2 ∈ G The agreement set w.r.t. G is AGR(G ) s.t. h1 (x) = h2 (x)} . X \ DIS(G ). The main strategy for active learning in the realizable setting (Cohn et al., 1994) is to request labels only for instances belonging to the disagreement set and output any (consistent) hypothesis belonging to the version space. This strategy is often called the CAL algorithm. A related strategy for perfect selective classification was proposed in El-Yaniv and Wiener (2010) and termed consistent selective strategy (CSS). Given a training set Sm , CSS takes the classifier h to be any hypothesis in V SH ,Sm (i.e., a consistent learner), and takes a selection function g that equals one for all points in the agreement set with respect to V SH ,Sm , and zero otherwise. 3. From Coverage Bound to Label Complexity Bound In this section we present a reduction from stream-based active learning to perfect selective classification. Particularly, we show that if there exists for H a perfect selective classifier with a fast rejection rate of O(polylog(m)/m), then the CAL algorithm will actively learn H with exponential label complexity rate of O(polylog(1/ε)). Lemma 3 Let Sm = {(x1 , y1 ), . . . , (xm , ym )} be a sequence of m labeled samples drawn i.i.d. from an unknown distribution P(X) and let Si = {(x1 , y1 ), . . . , (xi , yi )} be the i-prefix of Sm . Then, with probability of at least 1 − δ over random choices of Sm , the following bound holds simultaneously for all i = 1, . . . , m − 1, Pr xi+1 ∈ DIS(V SH ,Si )|Si ≤ 1 − BΦ H , δ , 2⌊log2 (i)⌋ , log2 (m) where BΦ (H , δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class H , confidence δ and sample size m . 259 E L -YANIV AND W IENER Proof For j = 1, . . . , m, abbreviate DIS j DIS(V SH ,S j ) and AGR j AGR(V SH ,S j ). By definition, DIS j = X \ AGR j . By the definitions of a coverage bound and agreement/disagreement sets, with probability of at least 1 − δ over random choices of S j BΦ (H , δ, j) ≤ Pr{x ∈ AGR j |S j } = Pr{x ∈ DIS j |S j } = 1 − Pr{x ∈ DIS j |S j }. Applying the union bound we conclude that the following inequality holds simultaneously with high probability for t = 0, . . . , ⌊log2 (m)⌋ − 1, Pr{x2t +1 ∈ DIS2t |S2t } ≤ 1 − BΦ H , δ , 2t . log2 (m) (1) For all j ≤ i, S j ⊆ Si , so DISi ⊆ DIS j . Therefore, since the samples in Sm are all drawn i.i.d., for any j ≤ i, Pr {xi+1 ∈ DISi |Si } ≤ Pr xi+1 ∈ DIS j |S j = Pr x j+1 ∈ DIS j |S j . The proof is complete by setting j = 2⌊log2 (i)⌋ ≤ i, and applying inequality (1). Lemma 4 (Bernstein’s inequality Hoeffding, 1963) Let X1 , . . . , Xn be independent zero-mean random variables. Suppose that |Xi | ≤ M almost surely, for all i. Then, for all positive t, n 2 /2 t . Pr ∑ Xi > t ≤ exp − 2 + Mt/3 i=1 ∑E Xj Lemma 5 Let Zi , i = 1, . . . , m, be independent Bernoulli random variables with success probabilities pi . Then, for any 0 < δ < 1, with probability of at least 1 − δ, m ∑ (Zi − E{Zi }) ≤ 2 ln i=1 Proof Define Wi 1 2 1 ∑ pi + 3 ln δ . δ Zi − E{Zi } = Zi − pi . Clearly, E{Wi } = 0, |Wi | ≤ 1, E{Wi2 } = pi (1 − pi ). Applying Bernstein’s inequality (Lemma 4) on the Wi , n t 2 /2 t 2 /2 = exp − Pr ∑ Wi > t ≤ exp − ∑ pi (1 − pi ) + t/3 i=1 ∑ E W j2 + t/3 ≤ exp − t 2 /2 . ∑ pi + t/3 Equating the right-hand side to δ and solving for t, we have t 2 /2 1 = ln δ ∑ pi + t/3 ⇐⇒ 2 1 1 t 2 − t · ln − 2 ln ∑ pi = 0, 3 δ δ 260 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION and the positive solution of this quadratic equation is t= 1 1 ln + 3 δ 1 21 1 2 1 ln + 2 ln ∑ pi < ln + 9 δ δ 3 δ 2 ln 1 pi . δ∑ Lemma 6 Let Z1 , Z2 , . . . , Zm be a high order Markov sequence of dependent binary random variables defined in the same probability space. Let X1 , X2 , . . . , Xm be a sequence of independent random variables such that, Pr {Zi = 1|Zi−1 , . . . , Z1 , Xi−1 , . . . , X1 } = Pr {Zi = 1|Xi−1 , . . . , X1 } . Define P1 Pr {Z1 = 1}, and for i = 2, . . . , m, Pi Pr {Zi = 1|Xi−1 , . . . , X1 } . Let b1 , b2 . . . bm be given constants independent of X1 , X2 , . . . , Xm .1 Assume that Pi ≤ bi simultaneously for all i with probability of at least 1 − δ/2, δ ∈ (0, 1). Then, with probability of at least 1 − δ, m m 2 2 2 ∑ Zi ≤ ∑ bi + 2 ln δ ∑ bi + 3 ln δ . i=1 i=1 We proceed with a direct proof of Lemma 6. An alternative proof of this lemma, using supermartingales, appears in Appendix B. Proof For i = 1, . . . , m, let Wi be binary random variables satisfying bi + I(Pi ≤ bi ) · (Pi − bi ) , Pi bi − Pi ,0 , Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } max 1 − Pi Pr{Wi = 1|Wi−1 , . . . ,W1 , Xi−1 , . . . , X1 } = Pr{Wi = 1|Xi−1 , . . . , X1 }. Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } We notice that Pr{Wi = 1|Xi−1 , . . . , X1 } = Pr{Wi = 1, Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1, Zi = 0|Xi−1 , . . . , X1 } = Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } Pr{Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } Pr{Zi = 0|Xi−1 , . . . , X1 } = Pi + bi −Pii (1 − Pi ) = bi , Pi ≤ bi ; 1−P bi · Pi + 0 = bi , else. Pi Hence the distribution of each Wi is independent of Xi−1 , . . . , X1 , and the Wi are independent Bernoulli random variables with success probabilities bi . By construction if Pi ≤ bi then Pr{Wi = 1|Zi = 1} = X Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } = 1. 1. Precisely we require that each of the bi were selected before Xi are chosen 261 E L -YANIV AND W IENER By assumption Pi ≤ bi for all i simultaneously with probability of at least 1−δ/2. Therefore, Zi ≤ Wi simultaneously with probability of at least 1 − δ/2. We now apply Lemma 5 on the Wi . The proof is then completed using the union bound. Theorem 7 Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P. Then with probability of at least 1 − δ over choices of Sm , the number of label requests k by the CAL algorithm is bounded by k ≤ Ψ(H , δ, m) + where Ψ(H , δ, m) 2 2 2 2 ln Ψ(H , δ, m) + ln , δ 3 δ m ∑ i=1 1 − BΦ H , δ , 2⌊log2 (i)⌋ 2 log2 (m) and BΦ (H , δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class H , confidence δ and sample size m . Proof According to CAL, the label of sample xi will be requested iff xi ∈ DIS(V SH ,Si−1 ). For i = 1, . . . , m, let Zi be binary random variables such that Zi 1 iff CAL requests a label for sample xi . Applying Lemma 3 we get that for all i = 2, . . . , m, with probability of at least 1 − δ/2 Pr{Zi = 1|Si−1 } = Pr xi ∈ DIS(V SH ,Si−1 )|Si−1 ≤ 1 − BΦ H , δ , 2⌊log2 (i−1)⌋ . 2 log2 (m) For i = 1, BΦ (H , δ, 1) = 0 and the above inequality trivially holds. An application of Lemma 6 on the variables Zi completes the proof. Theorem 7 states an upper bound on the label complexity expressed in terms of m, the size of the sample provided to CAL. This upper bound is very convenient for directly analyzing the active learning speedup relative to supervised learning. A standard label complexity upper bound, which depends on 1/ε, can be extracted using the following simple observation. Lemma 8 (Hanneke, 2009; Anthony and Bartlett, 1999) Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P. Let H be a hypothesis class whose finite VC dimension is d, and let ε and δ be given. If m≥ 4 2 12 d ln + ln , ε ε δ then, with probability of at least 1 − δ, CAL will output a classifier whose true error is at most ε. Proof Hanneke (2009) observed that since CAL requests a label whenever there is a disagreement in the version space, it is guaranteed that after processing m examples, CAL will output a classifier that is consistent with all the m examples introduced to it. Therefore, CAL is a consistent learner. A classical result (Anthony and Bartlett, 1999, Theorem 4.8) is that any consistent learner will achieve, with probability of at least 1 − δ, a true error not exceeding ε after observing at most 12 2 4 ε d ln ε + ln δ labeled examples. 262 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Theorem 9 Let H be a hypothesis class whose finite VC dimension is d. If the rejection rate of CSS polylog( m ) δ (see definition in Section 2.3) is O , then (H , P) is actively learnable with exponential m label complexity speedup. Proof Plugging this rejection rate into Ψ (defined in Theorem 7) we have, m m polylog δ Ψ(H , δ, m) ∑ 1 − BΦ (H , , 2⌊log2 (i)⌋ ) = ∑ O log2 (m) i i=1 i=1 Applying Lemma 41 we get Ψ(H , δ, m) = O polylog By Theorem 7, k = O polylog m δ m log(m) δ i log(m) δ . . , and an application of Lemma 8 concludes the proof. 4. Label Complexity Bounding Technique and Its Applications In this section we present a novel technique for deriving target-independent label complexity bounds for active learning. The technique combines the reduction of Theorem 7 and a general datadependent coverage bound for selective classification from El-Yaniv and Wiener (2010). For some learning problems it is a straightforward technical exercise, involving VC-dimension calculations, to arrive with exponential label complexity bounds. We show a few applications of this technique resulting in both reproductions of known label complexity exponential rates as well as a new one. The following definitions (El-Yaniv and Wiener, 2010) are required for introducing the technique. Definition 10 (Version space compression set) For any hypothesis class H , let Sm be a labeled sample of m points inducing a version space V SH ,Sm . The version space compression set, S′ ⊆ Sm , ˆ ˆ is a smallest subset of Sm satisfying V SH ,Sm = V SH ,S′ . The (unique) number n = n(H , Sm ) = |S′ | is called the version space compression set size. Remark 11 Our ”version space compression set” is precisely Hanneke’s ”minimum specifying set” (Hanneke, 2007b) for f on U with respect to V , where, f = h∗ , U = Sm , V = H [Sm ] (see Definition 23). Definition 12 (Characterizing hypothesis) For any subset of hypotheses G ⊆ H , the characterizing hypothesis of G , denoted fG (x), is a binary hypothesis over X (not restricted to H ) obtaining positive values over the agreement set AGR(G ) (Definition 2), and zero otherwise. Definition 13 (Order-n characterizing set) For each n, let Σn be the set of all possible labeled samples of size n (all n-subsets, each with all 2n possible labelings). The order-n characterizing set of H , denoted Fn , is the set of all characterizing hypotheses fG (x), where G ⊆ H is a version space induced by some member of Σn . 263 E L -YANIV AND W IENER Definition 14 (Characterizing set complexity) Let Fn be the order-n characterizing set of H . The order-n characterizing set complexity of H , denoted γ (H , n), is the VC-dimension of Fn . The following theorem, credited to (El-Yaniv and Wiener, 2010, Theorem 21), is a powerful data-dependent coverage bound for perfect selective learning, expressed in terms of the version space compression set size and the characterizing set complexity. Theorem 15 (Data-dependent coverage guarantee) For any m, let a1 , a2 , . . . , am ∈ R be given, such that ai ≥ 0 and ∑m ai ≤ 1. Let (h, g) be perfect selective classifier (CSS, see Section 2.3). i=1 Then, R(h, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ, Φ(h, g) ≥ 1 − 2 γ (H , n) ln+ ˆ m 2em 2 , + ln an δ γ (H , n) ˆ ˆ where n is the size of the version space compression set and γ (H , n) is the order-n characterizing ˆ ˆ ˆ set complexity of H . Given an hypothesis class H , our recipe to deriving active learning label complexity bounds for H is: (i) calculate both n and γ (H , n); (ii) apply Theorem 15, obtaining a bound BΦ for the ˆ ˆ coverage; (iii) plug BΦ in Theorem 7 to get a label complexity bound expressed as a summation; (iv) Apply Lemma 41 to obtain a label complexity bound in a closed form. 4.1 Examples In the following example we derive a label complexity bound for the concept class of thresholds (linear separators in R). Although this is a toy example (for which an exponential rate is well known) it does exemplify the technique, and in many other cases the application of the technique is not much harder. Let H be the class of thresholds. We first show that the corresponding version space compression set size n ≤ 2. Assume w.l.o.g. that h∗ (x) I(x > w) for some w ∈ (0, 1). Let ˆ x− max{xi ∈ Sm |yi = −1} and x+ min(xi ∈ Sm |yi = +1). At least one of x− or x+ exist. Let ′ ′ Sm = {(x− , −1), (x+ , +1)}. Then V SH ,Sm = V SH ,Sm , and n = |Sm | ≤ 2. Now, γ (H , 2) = 2, because ˆ ′ the order-2 characterizing set of H is the class of intervals in R whose VC-dimension is 2. Plugging these numbers in Theorem 15, and using the assignment a1 = a2 = 1/2, BΦ (H , δ, m) = 1 − 2 4 ln (m/δ) 2 ln (em) + ln = 1−O . m δ m Next we plug BΦ in Theorem 7 obtaining a raw label complexity m Ψ(H , δ, m) = ∑ 1 − BΦ H , i=1 δ , 2⌊log2 (i)⌋ 2 log2 (m) m = ∑O i=1 ln (log2 (m) · i/δ) . i Finally, by applying Lemma 41, with a = 1 and b = log2 m/δ, we conclude that Ψ(H , δ, m) = O ln2 m δ . Thus, H is actively learnable with exponential speedup, and this result applies to any distribution. In Table 1 we summarize the n and γ (H , n) values we calculated for four other hypothesis classes. The ˆ ˆ 264 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Hypothesis class Distribution n ˆ γ (H , n) ˆ Linear separators in R Intervals in R Linear separators in R2 any any (target-dependent)2 any distribution on the unit circle (target-dependent)2 2 4 4 2 4 4 Linear separators in Rd Balanced axis-aligned rectangles in Rd mixture of Gaussians product distribution O (log m)d−1 /δ O (log (dm/δ)) O nd/2+1 ˆ O (d n log n) ˆ ˆ Table 1: The n and γ of various hypothesis spaces achieving exponential rates. ˆ last two cases are fully analyzed in Sections 4.2 and 6.1, respectively. For the other classes, where γ and n are constants, it is clear (Theorem 15) that exponential rates are obtained. We emphasize that ˆ the bounds for these two classes are target-dependent as they require that Sm include at least one sample from each class. 4.2 Linear Separators in Rd Under Mixture of Gaussians In this section we state and prove our main example, an exponential label complexity bound for linear classifiers in Rd . Theorem 16 Let H be the class of all linear binary classifiers in Rd , and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then, with probability of at least 1 − δ over choices of Sm , the number of label requests k by CAL is bounded by 2 k=O (log m)d +1 δ(d+3)/2 . Therefore by Lemma 8 we get k = O (poly(1/δ) · polylog(1/ε)) . Proof The following is a coverage bound for linear classifiers in d dimensions that holds in our setting with probability of at least 1 − δ (El-Yaniv and Wiener, 2010, Corollary 33),3 2 Φ(h, g) ≥ 1 − O 1 (log m)d · (d+3)/2 m δ . (2) 2. Target-dependent with at least one sample in each class. 3. This bound uses the fact that for linear classifiers in d dimensions n = O (log m)d−1 /δ (El-Yaniv and Wiener, 2010, ˆ Lemma 32), and that γ (H , n) = O nd/2+1 (El-Yaniv and Wiener, 2010, Lemma 27). ˆ ˆ 265 E L -YANIV AND W IENER Plugging this bound in Theorem 7 we obtain, Ψ(H , δ, m) = m ∑ i=1 1 − BΦ H , m = ∑O i=1 = O δ , 2⌊log2 (i)⌋ 2 log2 (m) 2 log2 (m) (log i)d · i δ log2 (m) δ d+3 2 d+3 2 m (log(i))d ·∑ i i=1 2 . Finally, an application of Lemma 41 with a = d 2 and b = 1 completes the proof. 5. Lower Bound on Label Complexity In the previous section we have derived an upper bound on the label complexity of CAL for various classifiers and distributions. In the case of linear classifiers in Rd we have shown an exponential speed up in terms of 1/ε but also an exponential slow down in terms of the dimension d. In passive learning there is a linear dependency in the dimension while in our case (active learning using CAL) there is an exponential one. Is it an artifact of our bounding technique or a fundamental phenomenon? To answer this question we derive an asymptotic lower bound on the label complexity. We show that the exponential dependency in d is unavoidable (at least asymptotically) for every bounding technique when considering linear classifier even under a single Gaussian (isotropic) distribution. The argument is obtained by the observation that CAL has to request a label to any point on the convex hull of a sample Sm . The bound is obtained using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution. Definition 17 (Gaussian polytope) Let X1 , ..., Xm be i.i.d. random points in Rd with common stan1 dard normal distribution (with zero mean and covariance matrix 2 Id ). A Gaussian polytope Pm is the convex hull of these random points. Denote by fk (Pm ) the number of k-faces in the Gaussian polytope Pm . Note that f0 (Pm ) is the number of vertices in Pm . The following two Theorems asymptotically bound the average and variance of fk (Pm ). Theorem 18 (Hug et al., 2004, Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution. Then E fk (Pm ) = c(k,d) (log m) d−1 2 · (1 + o(1)) as m → ∞, where c(k,d) is a constant depending only on k and d. 266 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Theorem 19 (Hug and Reitzner, 2005, Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution. Then there exists a positive constant cd , depending only on the dimension, such that d−1 Var ( fk (Pm )) ≤ cd (log m) 2 for all k ∈ {0, . . . , d − 1}. We can now use Chebyshev’s inequality to lower bound the number of vertices in Pm ( f0 (Pm )) with high probability. Theorem 20 Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution and δ > 0 be given. Then with probability of at least 1 − δ, f0 (Pm ) ≥ cd (log m) d−1 2 d−1 cd ˜ − √ (log m) 4 δ · (1 + o(1)) as m → ∞, where cd and cd are constants depending only on d. ˜ Proof Using Chebyshev’s inequality (in the second inequality), as well as Theorem 19 we get Pr ( f0 (Pm ) > E f0 (Pm ) − t) = 1 − Pr ( f0 (Pm ) ≤ E f0 (Pm ) − t) ≥ 1 − Pr (| f0 (Pm ) − E f0 (Pm )| ≥ t) d−1 cd Var ( f0 (Pm )) ≥ 1 − 2 (log m) 2 . ≥ 1− 2 t t Equating the RHS to 1 − δ and solving for t we get t= (log m) cd δ d−1 2 . Applying Theorem 18 completes the proof. Theorem 21 (Lower bound) Let H be the class of linear binary classifiers in Rd , and let the underlying distribution be standard normal distribution in Rd . Then there exists a target hypothesis such that, with probability of at least 1 − δ over choices of Sm , the number of label requests k by CAL is bounded by d−1 cd k ≥ (log m) 2 · (1 + o(1)). 2 as m → ∞, where cd is a constant depending only on d. Proof Let us look at the Gaussian polytope Pm induced by the random sample Sm . As long as all labels requested by CAL have the same value (the case of minuscule minority class) we note that every vertex of Pm falls in the region of disagreement with respect to any subset of Sm that do not include that specific vertex. Therefore, CAL will request label at least for each vertex of Pm . For sufficiently large m, in particular, 4 2cd d−1 ˜ √ log m ≥ , cd δ we conclude the proof by applying Theorem 20. 267 E L -YANIV AND W IENER 6. Relation to Existing Label Complexity Measures A number of complexity measures to quantify the speedup in active learning have been proposed. In this section we show interesting relations between our techniques and two well known measures, namely the teaching dimension (Goldman and Kearns, 1995) and the disagreement coefficient (Hanneke, 2009). Considering first the teaching dimension, we prove in Lemma 26 that the version space compression set size is bounded above, with high probability, by the extended teaching dimension growth function (introduced by Hanneke, 2007b). Consequently, it follows that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke’s disagreement coefficient and show in Theorem 34 that the coverage of CSS can be bounded below using the disagreement coefficient. Conversely, in Corollary 39 we show that the disagreement coefficient can be bounded above using any coverage bound for CSS. Consequently, the results here imply that the disagreement coefficient, θ(ε) grows slowly with 1/ε for the case of linear classifiers under a mixture of Gaussians. 6.1 Teaching Dimension The teaching dimension is a label complexity measure proposed by Goldman and Kearns (1995). The dimension of the hypothesis class H is the minimum number of examples required to present to any consistent learner in order to uniquely identify any hypothesis in the class. We now define the following variation of the extended teaching dimension (Heged¨ s, 1995) u due to Hanneke. Throughout we use the notation h1 (S) = h2 (S) to denote the fact that the two hypotheses agree on the classification of all instances in S. ¨ Definition 22 (Extended Teaching Dimension, Hegedus, 1995; Hanneke, 2007b) Let V ⊆ H , m ≥ m, 0, U ∈ X ∀f ∈ H , XT D( f ,V,U) = inf {t | ∃R ⊆ U : | {h ∈ V : h(R) = f (R)} | ≤ 1 ∧ |R| ≤ t} . Definition 23 (Hanneke, 2007b) For V ⊆ H , V [Sm ] denotes any subset of V such that ∀h ∈ V, | h′ ∈ V [Sm ] : h′ (Sm ) = h(Sm ) | = 1. Claim 24 Let Sm be a sample of size m, H an hypothesis class, and n = n(H , Sm ), the version space ˆ compression set size. Then, XT D(h∗ , H [Sm ], Sm ) = n. ˆ Proof Let Sn ⊆ Sm be a version space compression set. Assume, by contradiction, that there exist ˆ two hypotheses h1 , h2 ∈ H [Sm ], each of which agrees on the given classifications of all examples in Sn . Therefore, h1 , h2 ∈ V SH ,Sn , and by the definition of version space compression set, we get ˆ ˆ h1 , h2 ∈ V SH ,Sm . Hence, | h ∈ H [Sm ] : h(Sm ) = h∗ (Sm ) | ≥ 2, which contradicts definition 23. Therefore, | h ∈ H [Sm ] : h(Sn ) = h∗ (Sn ) | ≤ 1, ˆ ˆ 268 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION and XT D(h∗ , H [Sm ], Sm ) ≤ |Sn | = n. ˆ ˆ Let R ⊂ Sm be any subset of size |R| < n. Consequently, V SH ,Sm ⊂ V SH ,R , and there exist hypothesis, ˆ ′ ∈ VS h H ,R , that agrees with all labeled examples in R, but disagrees with at least one example in Sm . Thus, h′ (Sm ) = h∗ (Sm ), and according to definition 23, there exist hypotheses h1 , h2 ∈ H [Sm ] such that h1 (Sm ) = h′ (Sm ) = h∗ (Sm ) = h2 (Sm ). But h1 (R) = h2 (R) = h∗ (R), so | {h ∈ V [Sm ] : h(R) = h∗ (R)} | ≥ 2. It follows that XT D(h∗ , H [Sm ], Sm ) ≥ n. ˆ Definition 25 (XTD Growth Function, Hanneke, 2007b) For m ≥ 0, V ⊆ H , δ ∈ [0, 1], XT D(V, P, m, δ) = inf t|∀h ∈ H , Pr {XT D(h,V [Sm ], Sm ) > t} ≤ δ . Lemma 26 Let H be an hypothesis class, P an unknown distribution, and δ > 0. Then, with probability of at least 1 − δ, n ≤ XT D(H , P, m, δ). ˆ Proof According to Definition 25, with probability of at least 1 − δ, XT D(h∗ , H [Sm ], Sm ) ≤ XT D(H , P, m, δ). Applying Claim 24 completes the proof. Lemma 27 (Balanced Axis-Aligned Rectangles, Hanneke, 2007b, Lemma 4) If P is a product distribution on Rd with continuous CDF, and H is the set of axis-aligned rectangles such that ∀h ∈ H , PrX∼P {h(X) = +1} ≥ λ, then, XT D(H , P, m, δ) ≤ O d2 dm log . λ δ Lemma 28 Blumer et al., 1989, Lemma 3.2.3 Let F be a binary hypothesis class of finite VC dimension d ≥ 1. For all k ≥ 1, define the k-fold union, Fk∪ Then, for all k ≥ 1, ∪k f i : f i ∈ F , 1 ≤ i ≤ k . i=1 VC(Fk∪ ) ≤ 2dk log2 (3k). Lemma 29 (order-n characterizing set complexity) Let H be the class of axis-aligned rectangles in Rd . Then, γ(H , n) ≤ O (dn log n) . 269 E L -YANIV AND W IENER − + Proof Let Sn = Sk ∪ Sn−k be a sample of size n composed of k negative examples, {x1 , x2 , . . . xk }, and n − k positive ones. Let H be the class of axis-aligned rectangles. We define, ∀1 ≤ i ≤ k, + Sn−k ∪ {(xi , −1)} . Ri Notice that V SH ,Ri includes all axis aligned rectangles that classify all samples in S+ as positive, and xi as negative. Therefore, the agreement region of V SH ,Ri is composed of two components as depicted in Figure 1. The first component is the smallest rectangle that bounds the positive samples, and the second is an unbounded convex polytope defined by up to d hyperplanes intersecting at xi . Let AGRi be the agreement region of V SH ,Ri and AGR the agreement region of V SH ,Sn . Clearly, Ri ⊆ Sn , so V SH ,Sn ⊆ V SH ,Ri , and AGRi ⊆ AGR, and it follows that k i=1 AGRi ⊆ AGR. Assume, by contradiction, that x ∈ AGR but x ∈ k AGRi . Therefore, for any 1 ≤ i ≤ k, there exist i=1 (i) (i) (i) (i) two hypotheses h1 , h2 ∈ V SH ,Ri , such that, h1 (x) = h2 (x). Assume, without loss of generality, (i) that h1 (x) = 1. We define k h1 (i) h1 k and (i) h2 , h2 i=1 i=1 (i) meaning that h1 classifies a sample as positive if and only if all hypotheses h1 classify it as positive. Noting that the intersection of axis-aligned rectangles is itself an axis-aligned rectangle, we know (i) (i) that h1 , h2 ∈ H . Moreover, for any xi we have, h1 (xi ) = h2 (xi ) = −1, so also h1 (xi ) = h2 (xi ) = −1, and h1 , h2 ∈ V SH ,Sn . But h1 (x) = h2 (x). Contradiction. Therefore, k AGRi . AGR = i=1 It is well known that the VC dimension of a hyper-rectangle in Rd is 2d. The VC dimension of AGRi is bounded by the VC dimension of the union of two hyper-rectangles in Rd . Furthermore, the VC dimension of AGR is bounded by the VC dimension of the union of all AGRi . Applying Lemma 28 twice we get, VCdim {AGR} ≤ 42dk log2 (3k) ≤ 42dn log2 (3n). If k = 0 then the entire sample is positive and the region of agreement is an hyper-rectangle. Therefore, VCdim {AGR} = 2d. If k = n then the entire sample is negative and the region of agreement is the points of the samples themselves. Hence, VCdim {AGR} = n. Overall we get that in all cases, VCdim {AGR} ≤ 42dn log2 (3n) = O (dn log n) . 270 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Figure 1: Agreement region of V SH ,Ri . Corollary 30 (Balanced Axis-Aligned Rectangles) Under the same conditions of Lemma 27, the class of balanced axis-aligned rectangles in Rd can be perfectly selectively learned with fast coverage rate. Proof Applying Lemmas 26 and 27 we get that with probability of at least 1 − δ, dm d2 log . λ δ n≤O ˆ Any balanced axis-aligned rectangle belongs to the class of all axis-aligned rectangles. Therefore, the coverage of CSS for the class of balanced axis-aligned rectangles is bounded bellow by the coverage of the class of axis-aligned rectangles. Applying Lemma 29, and assuming m ≥ d, we obtain, γ (H , n) ≤ O d ˆ d2 d2 dm dm log log log λ δ λ δ ≤O dm d3 log2 . λ λδ Applying Theorem 15 completes the proof. 6.2 Disagreement Coefficient In this section we show interesting relations between the disagreement coefficient and coverage bounds in perfect selective classification. We begin by defining, for an hypothesis h ∈ H , the set of all hypotheses that are r-close to h. Definition 31 (Hanneke, 2011b, p.337) For any hypothesis h ∈ H , distribution P over X , and r > 0, define the set B(h, r) of all hypotheses that reside in a ball of radius r around h, B(h, r) h′ ∈ H : Pr X∼P h′ (X) = h(X) ≤ r . Theorem 32 (Vapnik and Chervonenkis, 1971; Anthony and Bartlett, 1999, p.53) Let H be a hypothesis class with VC-dimension d. For any probability distribution P on X × {±1}, with probability of at least 1 − δ over the choice of Sm , any hypothesis h ∈ H consistent with Sm satisfies R(h) ≤ η(d, m, δ) 2 2em 2 + ln . d ln m d δ 271 E L -YANIV AND W IENER For any G ⊆ H and distribution P we denote by ∆G the volume of the disagreement region of G, ∆G Pr {DIS(G)} . Definition 33 (Disagreement coefficient, Hanneke, 2009) Let ε ≥ 0. The disagreement coefficient of the hypothesis class H with respect to the target distribution P is θ(ε) θh∗ (ε) = sup r>ε ∆B(h∗ , r) . r The following theorem formulates an intimate relation between active learning (disagreement coefficient) and selective classification. Theorem 34 Let H be an hypothesis class with VC-dimension d, P an unknown distribution, ε ≥ 0, and θ(ε), the corresponding disagreement coefficient. Let (h, g) be a perfect selective classifier (CSS, see Section 2.3). Then, R(h, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ, Φ(h, g) ≥ 1 − θ(ε) · max {η(d, m, δ), ε} . Proof Clearly, R(h, g) = 0, and it remains to prove the coverage bound. By Theorem 32, with probability of at least 1 − δ, ∀h ∈ V SH ,Sm R(h) ≤ η(d, m, δ) ≤ max {η(d, m, δ), ε} . Therefore, V SH ,Sm ⊆ B (h∗ , max {η(d, m, δ), ε}) , ∆V SH ,Sm ≤ ∆B (h∗ , max {η(d, m, δ), ε}) . By Definition 33, for any r′ > ε, ∆B(h∗ , r′ ) ≤ θ(ε)r′ . Thus, the proof is complete by recalling that Φ(h, g) = 1 − ∆V SH ,Sm . Theorem 34 tells us that whenever our learning problem (specified by the pair (H , P)) has a disagreement coefficient that grows slowly with respect to 1/ε , it can be (perfectly) selectively learned with a “fast” coverage bound. Consequently, through Theorem 9 we also know that in each case where there exists a disagreement coefficient that grows slowly with respect to 1/ε, active learning with a fast rate can also be deduced directly through a reduction from perfect selective classification. It follows that as far as fast rates in active learning are concerned, whatever can be accomplished by bounding the disagreement coefficient, can be accomplished also using perfect selective classification. This result is summarized in the following corollary. Corollary 35 Let H be an hypothesis class with VC-dimension d, P an unknown distribution, and θ(ε), the corresponding disagreement coefficient. If θ(ε) = O(polylog(1/ε)), there exists a coverage bound such that an application of Theorem 7 ensures that (H , P) is actively learnable with exponential label complexity speedup. 272 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Proof The proof is established by straightforward applications of Theorems 34 with ε = 1/m and 9. The following result, due to Hanneke (2011a), implies a coverage upper bound for CSS. Lemma 36 (Hanneke, 2011a, Proof of Lemma 47) Let H be an hypothesis class, P an unknown distribution, and r ∈ (0, 1). Then, EP ∆Dm ≥ (1 − r)m ∆B (h∗ , r) , where Dm V SH ,Sm ∩ B (h∗ , r) . (3) Theorem 37 (Coverage upper bound) Let H be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1). Then, for any r ∈ (0, 1), 1 > α > δ, BΦ (H , δ, m) ≤ 1 − where BΦ (H , δ, m) is any coverage bound. (1 − r)m − α ∆B (h∗ , r) , 1−α Proof Recalling the definition of Dm (3), clearly Dm ⊆ V SH ,Sm and Dm ⊆ B(h∗ , r). These inclusions imply (respectively), by the definition of disagreement set, ∆Dm ≤ ∆V SH ,Sm , and ∆Dm ≤ ∆B(h∗ , r). (4) Using Markov’s inequality (in inequality (5) of the following derivation) and applying (4) (in equality (6)), we thus have, (1 − r)m − α (1 − r)m − α ∆B (h∗ , r) ≤ Pr ∆Dm ≤ ∆B (h∗ , r) 1−α 1−α 1 − (1 − r)m Pr ∆B (h∗ , r) − ∆Dm ≥ ∆B (h∗ , r) 1−α 1 − (1 − r)m Pr |∆B (h∗ , r) − ∆Dm | ≥ ∆B (h∗ , r) 1−α E {|∆B (h∗ , r) − ∆Dm |} (1 − α) · (1 − (1 − r)m ) ∆B (h∗ , r) ∆B (h∗ , r) − E∆Dm . (1 − α) · (1 − (1 − r)m ) ∆B (h∗ , r) Pr ∆V SH ,Sm ≤ = ≤ ≤ = Applying Lemma 36 we therefore obtain, ≤ (1 − α) · ∆B (h∗ , r) − (1 − r)m ∆B(h∗ , r) = 1 − α < 1 − δ. (1 − (1 − r)m ) ∆B (h∗ , r) Observing that for any coverage bound, Pr ∆V SH ,Sm ≤ 1 − BΦ (H , δ, m) ≥ 1 − δ, completes the proof. 273 (5) (6) E L -YANIV AND W IENER Corollary 38 Let H be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1/8). Then for any m ≥ 2, 1 1 , BΦ (H , δ, m) ≤ 1 − ∆B h∗ , 7 m where BΦ (H , δ, m) is any coverage bound. Proof The proof is established by a straightforward application of Theorem 37 with α = 1/8 and r = 1/m. With Corollary 38 we can bound the disagreement coefficient for settings whose coverage bound is known. Corollary 39 Let H be an hypothesis class, P an unknown distribution, and BΦ (H , δ, m) a coverage bound. Then the disagreement coefficient is bounded by, θ(ε) ≤ max sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 . r Proof Applying Corollary 38 we get that for any r ∈ (0, 1/2), 1 − BΦ (H , 1/9, ⌊1/r⌋) ∆B(h∗ , r) ∆B(h∗ , 1/⌊1/r⌋) ≤ ≤ 7· . r r r Therefore, θ(ε) = sup r>ε ∆B(h∗ , r) ≤ max r sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 . r Corollary 40 Let H be the class of all linear binary classifiers in Rd , and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then θ(ε) ≤ O polylog 1 ε . Proof Applying Corollary 39 together with inequality 2 we get that θ(ε) ≤ max sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 r 7 ≤ max sup ·O r∈(ε,1/2) r 2 d+3 (log ⌊1/r⌋)d ·9 2 ⌊1/r⌋ 274 ,2 ≤O 1 log ε d2 . ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION 7. Concluding Remarks For quite a few years, since its inception, the theory of target-independent bounds for noise-free active learning managed to handle relatively simple settings, mostly revolving around homogeneous linear classifiers under the uniform distribution over the sphere. It is likely that this distributional uniformity assumption was often adapted to simplify analyses. However, it was shown by Dasgupta (2005) that under this distribution, exponential speed up cannot be achieved when considering general (non homogeneous) linear classifiers. The reason for this behavior is related to the two tasks that a good active learner should successfully accomplish: exploration and exploitation. Intuitively (and oversimplifying things) exploration is the task of obtaining at least one sample in each class, and exploitation is the process of refining the decision boundary by requesting labels of points around the boundary. Dasgupta showed that exploration cannot be achieved fast enough under the uniform distribution on the sphere. The source of this difficulty is the fact that under this distribution all training points reside on their convex hull. In general, the speed of exploration (using linear classifiers) depends on the size (number of vertices) of the convex hull of the training set. When using homogeneous linear classifiers, exploration is trivially achieved (under the uniform distribution) and exploitation can achieve exponential speedup. So why in the non-verifiable model (Balcan et al., 2008) it is possible to achieve exponential speedup even when using non homogeneous linear classifiers under the uniform distribution? The answer is that in the non-verifiable model, label complexity attributed to exploration is encapsulated in a target-dependent “constant.” Specifically, in Balcan et al. (2008) this constant is explicitly defined to be the probability mass of the minority class. Indeed, in certain noise free settings using linear classifiers, where the minority class is large enough, exploration is a non issue. In general, however, exploration is a major bottleneck in practical active learning (Baram et al., 2004; Begleiter et al., 2008). The present results show how exponential speedup can be achieved, including exploration, when using different (and perhaps more natural) distributions. With these good news, a somewhat pessimistic picture arises from the lower bound we obtained for the exponential dependency on the dimension d. This negative result is not restricted to streambased active learning and readily applies also to the pool-based model. While the bound is only asymptotic, we conjecture that it also holds for finite samples. Moreover, we believe that within the stream- or pool-based settings a similar statement should hold true for any active learning method (and not necessarily CAL-based querying strategies). This result indicates that when performing noise free active learning of linear classifiers, aggressive feature selection is beneficial for exploration speedup. We note, however, that it remains open whether a slowdown exponent of d (rather than d 2 ) is achievable. We have exposed interesting relations of the present technique to well known complexity measures for active learning, namely, the teaching dimension and the disagreement coefficient. These developments were facilitated by observations made by Hanneke on the teaching dimension and the disagreement coefficient. These relations gave rise to further observations on active learning, which are discussed in Section 6 and include exponential speedup for balanced axis-aligned rectangles. Finally, we note that the intimate relation between selective classification and the disagreement coefficient was recently exposed in another result for selective classification where the disagreement coefficient emerged as a dominating factor in a coverage bound for agnostic selective classification (El-Yaniv and Wiener, 2011). 275 E L -YANIV AND W IENER Acknowledgments We thank the anonymous reviewers for their good comments. This paper particularly benefited from insightful observations made by one of the reviewers, which are summarized in Section 6, including the proof of Theorem 37 and the link between our n and the extended teaching dimension ˆ (Lemmas 26 and 27). Appendix A. Lemma 41 For any m ≥ 3, a ≥ 1, b ≥ 1 we get lna (bi) i m ∑ i=1 Proof Setting f (x) lna (bx) x , < 4 a+1 ln (b(m + 1)). a we have lna−1 (bx) df = (a − ln bx) · . dx x2 Therefore, f is monotonically increasing when x < ea /b, monotonically decreasing function when x ≥ ea /b and its attains its maximum at x = ea /b. Consequently, for i < ea /b − 1, or i ≥ ea /b + 1, i+1 f (i) ≤ f (x)dx. x=i−1 For ea /b − 1 ≤ i < ea /b + 1, f (i) ≤ f (ea /b) = b a e a ≤ aa . (7) Therefore, if m < ea − 1 we have, m ∑ i=1 m f (i) = lna (b) + ∑ f (i) < 2 · i=2 m+1 x=1 f (x)dx ≤ 2 lna+1 (b(m + 1)). a+1 Otherwise, m ≥ ea /b, in which case we overcome the change of slope by adding twice the (upper bound on the) maximal value (7), m ∑ f (i) < i=1 ≤ 2 2 2 lna+1 (b(m + 1)) + 2aa = lna+1 (b(m + 1)) + aa+1 a+1 a+1 a 2 4 2 lna+1 (b(m + 1)) + lna+1 bm ≤ lna+1 (b(m + 1)). a+1 a a 276 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Appendix B. Alternative Proof of Lemma 6 Using Super Martingales Define Wk ∑k (Zi − bi ). We assume that with probability of at least 1 − δ/2, i=1 Pr{Zi |Z1 , . . . , Zi−1 } ≤ bi , simultaneously for all i. Since Zi is a binary random variable it is easy to see that (w.h.p.), EZi {Wi |Z1 , . . . , Zi−1 } = Pr{Zi |Z1 , . . . , Zi−1 } − bi +Wi−1 ≤ Wi−1 , and the sequence W1m W1 , . . . ,Wm is a super-martingale with high probability. We apply the following theorem by McDiarmid that refers to martingales (but can be shown to apply to supermartingales, by following its original proof). Theorem 42 (McDiarmid, 1998, Theorem 3.12) Let Y1 , . . . ,Yn be a martingale difference sequence with −ak ≤ Yk ≤ 1 − ak for each k; let A = 1 ∑ ak . Then, for any ε > 0, n Pr ∑ Yk ≥ Anε ≤ exp (−[(1 + ε) ln(1 + ε) − ε]An) ≤ exp − Anε2 . 2(1 + ε/3) In our case, Yk = Wk −Wk−1 = Zk − bk ≤ 1 − bk and we apply the (revised) theorem with ak and An ∑ bk B. We thus obtain, for any 0 < ε < 1, Pr ∑ Zk ≥ B + Bε ≤ exp − bk Bε2 . 2(1 + ε/3) Equating the right-hand side to δ/2, we obtain ε = 2 2 ln ± 3 δ 4 22 2 ln + 8B ln 9 δ δ ≤ 1 2 ln + 3 δ 1 22 ln + 9 δ = 2 2 ln + 3 δ 2B ln 2 δ 2B ln /2B 2 δ /B /B. Applying the union bound completes the proof. References M. Anthony and P.L. Bartlett. Neural Network Learning; Theoretical Foundations. Cambridge University Press, 1999. L. Atlas, D. Cohn, R. Ladner, A.M. El-Sharkawi, and R.J. Marks. Training connectionist networks with queries and selective sampling. In Neural Information Processing Systems (NIPS), pages 566–573, 1990. M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In 21st Annual Conference on Learning Theory (COLT), pages 45–56, 2008. 277 E L -YANIV AND W IENER Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255–291, 2004. P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008. R. Begleiter, R. El-Yaniv, and D. Pechyony. Repairing self-confident active-transductive learners using systematic exploration. Pattern Recognition Letters, 29(9):1245–1251, 2008. A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Chervonenkis dimension. Journal of the ACM, 36, 1989. Learnability and the Vapnik- C.K. Chow. An optimum character recognition system using decision function. IEEE Transactions on Computers, 6(4):247–254, 1957. C.K. Chow. On optimum recognition error and reject trade-off. IEEE Transactions on Information Theory, 16:41–36, 1970. D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994. S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, pages 235–242, 2005. S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009. R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010. R. El-Yaniv and Y. Wiener. Agnostic selective classification. In Neural Information Processing Systems (NIPS), 2011. S. Fine, R. Gilad-Bachrach, and E. Shamir. Query by committee, linear separation and random walks. Theoretical Computer Science, 284(1):25–51, 2002. Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and Query by Committee. In Advances in Neural Information Processing Systems (NIPS) 5, pages 483–490, 1993. Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997. Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004. E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, the Hebrew University of Jerusalem, 2007. S. Goldman and M. Kearns. On the complexity of teaching. JCSS: Journal of Computer and System Sciences, 50, 1995. 278 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 353–360, 2007a. S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), volume 4539 of Lecture Notes in Artificial Intelligence, pages 66–81, 2007b. S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009. S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. CoRR, abs/1108.1766, 2011a. URL http://arxiv.org/abs/1108.1766. informal publication. S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 37(1):333–361, 2011b. T. Heged¨ s. Generalized teaching dimensions and the query complexity of learning. In COLT: u Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1995. R. Herbei and M.H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963. D. Hug and M. Reitzner. Gaussian polytopes: variances and limit theorems, June 2005. D. Hug, G. O. Munsonious, and M. Reitzner. Asymptotic mean values of Gaussian polytopes. Beitr¨ ge Algebra Geom., 45:531–548, 2004. a C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195– 248. Springer-Verlag, 1998. T. Mitchell. Version spaces: a candidate elimination approach to rule learning. In IJCAI’77: Proceedings of the 5th international joint conference on Artificial Intelligence, pages 305–310, 1977. H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning theory (COLT), pages 287–294, 1992. V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971. M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1: 155–168, 2007. 279
5 0.071486622 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
6 0.068583801 80 jmlr-2012-On Ranking and Generalization Bounds
7 0.064299665 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
8 0.051551647 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
9 0.048809469 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
10 0.048330929 73 jmlr-2012-Multi-task Regression using Minimal Penalties
11 0.044406738 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
12 0.043452036 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
13 0.042455681 111 jmlr-2012-Structured Sparsity and Generalization
14 0.03807598 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
15 0.037869498 91 jmlr-2012-Plug-in Approach to Active Learning
16 0.037613343 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
17 0.034546304 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
18 0.033720594 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
19 0.03121593 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
20 0.030862629 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
topicId topicWeight
[(0, -0.167), (1, 0.034), (2, -0.111), (3, 0.056), (4, -0.077), (5, -0.09), (6, 0.111), (7, 0.081), (8, 0.008), (9, 0.013), (10, -0.055), (11, -0.028), (12, 0.03), (13, -0.032), (14, -0.073), (15, -0.165), (16, 0.127), (17, -0.065), (18, -0.022), (19, -0.022), (20, 0.07), (21, -0.088), (22, 0.041), (23, 0.018), (24, -0.044), (25, 0.012), (26, -0.094), (27, 0.061), (28, -0.104), (29, 0.031), (30, 0.079), (31, -0.076), (32, -0.166), (33, 0.031), (34, -0.093), (35, 0.12), (36, 0.077), (37, -0.124), (38, -0.048), (39, -0.136), (40, -0.018), (41, 0.015), (42, -0.15), (43, 0.132), (44, -0.069), (45, 0.143), (46, 0.181), (47, 0.032), (48, 0.034), (49, -0.144)]
simIndex simValue paperId paperTitle
same-paper 1 0.90503287 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
Author: Sivan Sabato, Naftali Tishby
Abstract: In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. Keywords: multiple-instance learning, learning theory, sample complexity, PAC learning, supervised classification
2 0.59972996 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
3 0.44168797 13 jmlr-2012-Active Learning via Perfect Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We discover a strong relation between two known learning models: stream-based active learning and perfect selective classification (an extreme case of ‘classification with a reject option’). For these models, restricted to the realizable case, we show a reduction of active learning to selective classification that preserves fast rates. Applying this reduction to recent results for selective classification, we derive exponential target-independent label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high dimensional mixture of Gaussians. Finally, we study the relation between the proposed technique and existing label complexity measures, including teaching dimension and disagreement coefficient. Keywords: classification with a reject option, perfect classification, selective classification, active learning, selective sampling, disagreement coefficient, teaching dimension, exploration vs. exploitation 1. Introduction and Related Work Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, potentially leading to significantly faster learning. In recent years it has been gaining considerable recognition as a vital technique for efficiently implementing inductive learning in many industrial applications where abundance of unlabeled data exists, and/or in cases where labeling costs are high. In this paper we expose a strong relation between active learning and selective classification, another known alternative learning model (Chow, 1970; El-Yaniv and Wiener, 2010). Focusing on binary classification in realizable settings we consider standard stream-based active learning, which is also referred to as online selective sampling (Atlas et al., 1990; Cohn et al., 1994). In this model the learner is given an error objective ε and then sequentially receives unlabeled examples. At each step, after observing an unlabeled example x, the learner decides whether or not to request the label of x. The learner should terminate the learning process and output a binary classifier whose true error is guaranteed to be at most ε with high probability. The penalty incurred by the learner is the number of label requests made and this number is called the label complexity. A label complexity bound of O(d log(d/ε)) for actively learning ε-good classifier from a concept class with VC-dimension d, provides an exponential speedup in terms of 1/ε relative to standard (passive) supervised learning where the sample complexity is typically O(d/ε). The study of (stream-based, realizable) active learning is paved with very interesting theoretical results. Initially, only a few cases were known where active learning provides significant advanc 2012 Ran El-Yaniv and Yair Wiener. E L -YANIV AND W IENER tage over passive learning. Perhaps the most favorable result was an exponential label complexity speedup for learning homogeneous linear classifiers where the (linearly separable) data is uniformly distributed over the unit sphere. This result was manifested by various authors using various analysis techniques, for a number of strategies that can all be viewed in hindsight as approximations or variations of the “CAL algorithm” of Cohn et al. (1994). Among these studies, the earlier theoretical results (Seung et al., 1992; Freund et al., 1993, 1997; Fine et al., 2002; Gilad-Bachrach, 2007) considered Bayesian settings and studied the speedup obtained by the Query by Committee (QBC) algorithm. The more recent results provided PAC style analyses (Dasgupta et al., 2009; Hanneke, 2007a, 2009). Lack of positive results for other non-toy problems, as well as various additional negative results that were discovered, led some researchers to believe that active learning is not necessarily advantageous in general. Among the striking negative results is Dasgupta’s negative example for actively learning general (non-homogeneous) linear classifiers (even in two dimensions) under the uniform distribution over the sphere (Dasgupta, 2005). A number of recent innovative papers proposed alternative models for active learning. Balcan et al. (2008) introduced a subtle modification of the traditional label complexity definition, which opened up avenues for new positive results. According to their new definition of “non-verifiable” label complexity, the active learner is not required to know when to stop the learning process with a guaranteed ε-good classifier. Their main result, under this definition, is that active learning is asymptotically better than passive learning in the sense that only o(1/ε) labels are required for actively learning an ε-good classifier from a concept class that has a finite VC-dimension. Another result they accomplished is an exponential label complexity speedup for (non-verifiable) active learning of non-homogeneous linear classifiers under the uniform distribution over the the unit sphere. Based on Hanneke’s characterization of active learning in terms of the “disagreement coefficient” (Hanneke, 2007a), Friedman (2009) recently extended the Balcan et al. results and proved that a target-dependent exponential speedup can be asymptotically achieved for a wide range of “smooth” learning problems (in particular, the hypothesis class, the instance space and the distribution should all be expressible by smooth functions). He proved that under such smoothness conditions, for any target hypothesis h∗ , Hanneke’s disagreement coefficient is bounded above in terms of a constant c(h∗ ) that depends on the unknown target hypothesis h∗ (and is independent of δ and ε). The resulting label complexity is O (c(h∗ ) d polylog(d/ε)) (Hanneke, 2011b). This is a very general result but the target-dependent constant involved in this bound is only guaranteed to be finite. With this impressive progress in the case of target-dependent bounds for active learning, the current state of affairs in the target-independent bounds for active learning arena leaves much to be desired. To date the most advanced result in this model, which was already essentially established by Seung et al. and Freund et al. more than fifteen years ago (Seung et al., 1992; Freund et al., 1993, 1997), is still a target-independent exponential speed up bound for homogeneous linear classifiers under the uniform distribution over the sphere. The other learning model we contemplate that will be shown to have strong ties to active learning, is selective classification, which is mainly known in the literature as ‘classification with a reject option.’ This old-timer model, that was already introduced more than fifty years ago (Chow, 1957, 1970), extends standard supervised learning by allowing the classifier to opt out from predictions in cases where it is not confident. The incentive is to increase classification reliability over instances that are not rejected by the classifier. Thus, using selective classification one can potentially achieve 256 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION a lower error rate using the same labeling “budget.” The main quantities that characterize a selective classifier are its (true) error and coverage rate (or its complement, the rejection rate). There is already substantial volume of research publications on selective classification, that kept emerging through the years. The main theme in many of these publications is the implementation of certain reject mechanisms for specific learning algorithms like support vector machines and neural networks. Among the few theoretical studies on selective classification, there are various excess risk bounds for ERM learning (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp, 2007), and certain coverage/risk guarantees for selective ensemble methods (Freund et al., 2004). In a recent work (El-Yaniv and Wiener, 2010) the trade-off between error and coverage was examined and in particular, a new extreme case of selective learning was introduced. In this extreme case, termed here “perfect selective classification,” the classifier is given m labeled examples and is required to instantly output a classifier whose true error is perfectly zero with certainty. This is of course potentially doable only if the classifier rejects a sufficient portion of the instance space. A non-trivial result for perfect selective classification is a high probability lower bound on the classifier coverage (or equivalently, an upper bound on its rejection rate). Such bounds have recently been presented in El-Yaniv and Wiener (2010). In Section 3 we present a reduction of active learning to perfect selective classification that preserves “fast rates.” This reduction enables the luxury of analyzing dynamic active learning problems as static problems. Relying on a recent result on perfect selective classification from El-Yaniv and Wiener (2010), in Section 4 we then apply our reduction and conclude that general (non-homogeneous) linear classifiers are actively learnable at exponential (in 1/ε) label complexity rate when the data distribution is an arbitrary unknown finite mixture of high dimensional Gaussians. While we obtain exponential label complexity speedup in 1/ε, we incur exponential slowdown in d 2 , where d is the problem dimension. Nevertheless, in Section 5 we prove a lower bound of Ω((log m)(d−1)/2 (1 + o(1)) on the label complexity, when considering the class of unrestricted linear classifiers under a Gaussian distribution. Thus, an exponential slowdown in d is unavoidable in such settings. Finally, in Section 6 we relate the proposed technique to other complexity measures for active learning. Proving and using a relation to the teaching dimension (Goldman and Kearns, 1995) we show, by relying on a known bound for the teaching dimension, that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke’s disagreement coefficient and show that the coverage of perfect selective classification can be bounded below using the disagreement coefficient. Conversely, we show that the disagreement coefficient can be bounded above using any coverage bound for perfect selective classification. Consequently, the results here imply that the disagreement coefficient can be sufficiently bounded to ensure fast active learning for the case of linear classifiers under a mixture of Gaussians. 2. Active Learning and Perfect Selective Classification In binary classification the goal is to learn an accurate binary classifier, h : X → {±1}, from a finite labeled training sample. Here X is some instance space and the standard assumption is that the training sample, Sm = {(xi , yi )}m , containing m labeled examples, is drawn i.i.d. from some i=1 unknown distribution P(X,Y ) defined over X × {±1}. The classifier h is chosen from some hypothesis class H . In this paper we focus on the realizable setting whereby labels are defined by 257 E L -YANIV AND W IENER some unknown target hypothesis h∗ ∈ H . Thus, the underlying distribution reduces to P(X). The performance of a classifier h is quantified by its true zero-one error, R(h) Pr{h(X) = h∗ (X)}. A positive result for a classification problem (H , P) is a learning algorithm that given an error target ε and a confidence parameter δ can output, based on Sm , an hypothesis h whose error R(h) ≤ ε, with probability of at least 1 − δ. A bound B(ε, δ) on the size m of labeled training sample sufficient for achieving this is called the sample complexity of the learning algorithm. A classical result is that any consistent learning algorithm has sample complexity of O( 1 (d log( 1 ) + log( 1 ))), where d is ε ε δ the VC-dimension of H (see, e.g., Anthony and Bartlett, 1999). 2.1 Active Learning We consider the following standard active learning model. In this model the learner sequentially observes unlabeled instances, x1 , x2 , . . ., that are sampled i.i.d. from P(X). After receiving each xi , the learning algorithm decides whether or not to request its label h∗ (xi ), where h∗ ∈ H is an unknown target hypothesis. Before the start of the game the algorithm is provided with some desired error rate ε and confidence level δ. We say that the learning algorithm actively learned the problem instance (H , P) if at some point it can terminate this process, after observing m instances and requesting k labels, and output an hypothesis h ∈ H whose error R(h) ≤ ε, with probability of at least 1 − δ. The quality of the algorithm is quantified by the number k of requested labels, which is called the label complexity. A positive result for a learning problem (H , P) is a learning algorithm that can actively learn this problem for any given ε and δ, and for every h∗ , with label complexity bounded above by L(ε, δ, h∗ ). If there is a label complexity bound that is O(polylog(1/ε)) we say that the problem is actively learnable at exponential rate. 2.2 Selective Classification Following the formulation in El-Yaniv and Wiener (2010) the goal in selective classification is to learn a pair of functions (h, g) from a labeled training sample Sm (as defined above for passive learning). The pair (h, g), which is called a selective classifier, consists of a binary classifier h ∈ H , and a selection function, g : X → {0, 1}, which qualifies the classifier h as follows. For any sample x ∈ X , the output of the selective classifier is (h, g)(x) h(x) iff g(x) = 1, and (h, g)(x) abstain iff g(x) = 0. Thus, the function g is a filter that determines a sub-domain of X over which the selective classifier will abstain from classifications. A selective classifier is thus characterized by its coverage, Φ(h, g) EP {g(x)}, which is the P-weighted volume of the sub-domain of X that is not filtered out, and its error, R(h, g) = E{I(h(X) = h∗ (X)) · g(X)}/Φ(h, g), which is the zero-one loss restricted to the covered sub-domain. Note that this is a “smooth” generalization of passive learning and, in particular, R(h, g) reduces to R(h) (standard classification) if g(x) ≡ 1. We expect to see a trade-off between R(h, g) and Φ(h, g) in the sense that smaller error should be obtained by compromising the coverage. A major issue in selective classification is how to optimally control this trade-off. In this paper we are concerned with an extreme case of this trade-off whereby (h, g) is required to achieve a perfect score of zero error with certainty. This extreme learning objective is termed perfect learning in El-Yaniv and Wiener (2010). Thus, for a perfect selective classifier (h, g) we always have R(h, g) = 0, and its quality is determined by its guaranteed coverage. A positive result for (perfect) selective classification problem (H , P) is a learning algorithm that uses a labeled training sample Sm (as in passive learning) to output a perfect selective classifier (h, g) for which Φ(h, g) ≥ BΦ (H , δ, m) with probability of at least 1 − δ, for any given δ. The bound 258 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION BΦ = BΦ (H , δ, m) is called a coverage bound (or coverage rate) and its complement, 1 − BΦ , is called a rejection bound (or rate). A coverage rate BΦ = 1 − O( polylog(m) ) (and the corresponding m 1 − BΦ rejection rate) are qualified as fast. 2.3 The CAL Algorithm and the Consistent Selective Strategy (CSS) The major players in active learning and in perfect selective classification are the CAL algorithm and the consistent selective strategy (CSS), respectively. To define them we need the following definitions. Definition 1 (Version space, Mitchell, 1977) Given an hypothesis class H and a training sample Sm , the version space V SH ,Sm is the set of all hypotheses in H that classify Sm correctly. Definition 2 (Disagreement set, Hanneke, 2007a; El-Yaniv and Wiener, 2010) Let G ⊂ H . The disagreement set w.r.t. G is defined as DIS(G ) {x ∈ X : ∃h1 , h2 ∈ G The agreement set w.r.t. G is AGR(G ) s.t. h1 (x) = h2 (x)} . X \ DIS(G ). The main strategy for active learning in the realizable setting (Cohn et al., 1994) is to request labels only for instances belonging to the disagreement set and output any (consistent) hypothesis belonging to the version space. This strategy is often called the CAL algorithm. A related strategy for perfect selective classification was proposed in El-Yaniv and Wiener (2010) and termed consistent selective strategy (CSS). Given a training set Sm , CSS takes the classifier h to be any hypothesis in V SH ,Sm (i.e., a consistent learner), and takes a selection function g that equals one for all points in the agreement set with respect to V SH ,Sm , and zero otherwise. 3. From Coverage Bound to Label Complexity Bound In this section we present a reduction from stream-based active learning to perfect selective classification. Particularly, we show that if there exists for H a perfect selective classifier with a fast rejection rate of O(polylog(m)/m), then the CAL algorithm will actively learn H with exponential label complexity rate of O(polylog(1/ε)). Lemma 3 Let Sm = {(x1 , y1 ), . . . , (xm , ym )} be a sequence of m labeled samples drawn i.i.d. from an unknown distribution P(X) and let Si = {(x1 , y1 ), . . . , (xi , yi )} be the i-prefix of Sm . Then, with probability of at least 1 − δ over random choices of Sm , the following bound holds simultaneously for all i = 1, . . . , m − 1, Pr xi+1 ∈ DIS(V SH ,Si )|Si ≤ 1 − BΦ H , δ , 2⌊log2 (i)⌋ , log2 (m) where BΦ (H , δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class H , confidence δ and sample size m . 259 E L -YANIV AND W IENER Proof For j = 1, . . . , m, abbreviate DIS j DIS(V SH ,S j ) and AGR j AGR(V SH ,S j ). By definition, DIS j = X \ AGR j . By the definitions of a coverage bound and agreement/disagreement sets, with probability of at least 1 − δ over random choices of S j BΦ (H , δ, j) ≤ Pr{x ∈ AGR j |S j } = Pr{x ∈ DIS j |S j } = 1 − Pr{x ∈ DIS j |S j }. Applying the union bound we conclude that the following inequality holds simultaneously with high probability for t = 0, . . . , ⌊log2 (m)⌋ − 1, Pr{x2t +1 ∈ DIS2t |S2t } ≤ 1 − BΦ H , δ , 2t . log2 (m) (1) For all j ≤ i, S j ⊆ Si , so DISi ⊆ DIS j . Therefore, since the samples in Sm are all drawn i.i.d., for any j ≤ i, Pr {xi+1 ∈ DISi |Si } ≤ Pr xi+1 ∈ DIS j |S j = Pr x j+1 ∈ DIS j |S j . The proof is complete by setting j = 2⌊log2 (i)⌋ ≤ i, and applying inequality (1). Lemma 4 (Bernstein’s inequality Hoeffding, 1963) Let X1 , . . . , Xn be independent zero-mean random variables. Suppose that |Xi | ≤ M almost surely, for all i. Then, for all positive t, n 2 /2 t . Pr ∑ Xi > t ≤ exp − 2 + Mt/3 i=1 ∑E Xj Lemma 5 Let Zi , i = 1, . . . , m, be independent Bernoulli random variables with success probabilities pi . Then, for any 0 < δ < 1, with probability of at least 1 − δ, m ∑ (Zi − E{Zi }) ≤ 2 ln i=1 Proof Define Wi 1 2 1 ∑ pi + 3 ln δ . δ Zi − E{Zi } = Zi − pi . Clearly, E{Wi } = 0, |Wi | ≤ 1, E{Wi2 } = pi (1 − pi ). Applying Bernstein’s inequality (Lemma 4) on the Wi , n t 2 /2 t 2 /2 = exp − Pr ∑ Wi > t ≤ exp − ∑ pi (1 − pi ) + t/3 i=1 ∑ E W j2 + t/3 ≤ exp − t 2 /2 . ∑ pi + t/3 Equating the right-hand side to δ and solving for t, we have t 2 /2 1 = ln δ ∑ pi + t/3 ⇐⇒ 2 1 1 t 2 − t · ln − 2 ln ∑ pi = 0, 3 δ δ 260 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION and the positive solution of this quadratic equation is t= 1 1 ln + 3 δ 1 21 1 2 1 ln + 2 ln ∑ pi < ln + 9 δ δ 3 δ 2 ln 1 pi . δ∑ Lemma 6 Let Z1 , Z2 , . . . , Zm be a high order Markov sequence of dependent binary random variables defined in the same probability space. Let X1 , X2 , . . . , Xm be a sequence of independent random variables such that, Pr {Zi = 1|Zi−1 , . . . , Z1 , Xi−1 , . . . , X1 } = Pr {Zi = 1|Xi−1 , . . . , X1 } . Define P1 Pr {Z1 = 1}, and for i = 2, . . . , m, Pi Pr {Zi = 1|Xi−1 , . . . , X1 } . Let b1 , b2 . . . bm be given constants independent of X1 , X2 , . . . , Xm .1 Assume that Pi ≤ bi simultaneously for all i with probability of at least 1 − δ/2, δ ∈ (0, 1). Then, with probability of at least 1 − δ, m m 2 2 2 ∑ Zi ≤ ∑ bi + 2 ln δ ∑ bi + 3 ln δ . i=1 i=1 We proceed with a direct proof of Lemma 6. An alternative proof of this lemma, using supermartingales, appears in Appendix B. Proof For i = 1, . . . , m, let Wi be binary random variables satisfying bi + I(Pi ≤ bi ) · (Pi − bi ) , Pi bi − Pi ,0 , Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } max 1 − Pi Pr{Wi = 1|Wi−1 , . . . ,W1 , Xi−1 , . . . , X1 } = Pr{Wi = 1|Xi−1 , . . . , X1 }. Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } We notice that Pr{Wi = 1|Xi−1 , . . . , X1 } = Pr{Wi = 1, Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1, Zi = 0|Xi−1 , . . . , X1 } = Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } Pr{Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } Pr{Zi = 0|Xi−1 , . . . , X1 } = Pi + bi −Pii (1 − Pi ) = bi , Pi ≤ bi ; 1−P bi · Pi + 0 = bi , else. Pi Hence the distribution of each Wi is independent of Xi−1 , . . . , X1 , and the Wi are independent Bernoulli random variables with success probabilities bi . By construction if Pi ≤ bi then Pr{Wi = 1|Zi = 1} = X Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } = 1. 1. Precisely we require that each of the bi were selected before Xi are chosen 261 E L -YANIV AND W IENER By assumption Pi ≤ bi for all i simultaneously with probability of at least 1−δ/2. Therefore, Zi ≤ Wi simultaneously with probability of at least 1 − δ/2. We now apply Lemma 5 on the Wi . The proof is then completed using the union bound. Theorem 7 Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P. Then with probability of at least 1 − δ over choices of Sm , the number of label requests k by the CAL algorithm is bounded by k ≤ Ψ(H , δ, m) + where Ψ(H , δ, m) 2 2 2 2 ln Ψ(H , δ, m) + ln , δ 3 δ m ∑ i=1 1 − BΦ H , δ , 2⌊log2 (i)⌋ 2 log2 (m) and BΦ (H , δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class H , confidence δ and sample size m . Proof According to CAL, the label of sample xi will be requested iff xi ∈ DIS(V SH ,Si−1 ). For i = 1, . . . , m, let Zi be binary random variables such that Zi 1 iff CAL requests a label for sample xi . Applying Lemma 3 we get that for all i = 2, . . . , m, with probability of at least 1 − δ/2 Pr{Zi = 1|Si−1 } = Pr xi ∈ DIS(V SH ,Si−1 )|Si−1 ≤ 1 − BΦ H , δ , 2⌊log2 (i−1)⌋ . 2 log2 (m) For i = 1, BΦ (H , δ, 1) = 0 and the above inequality trivially holds. An application of Lemma 6 on the variables Zi completes the proof. Theorem 7 states an upper bound on the label complexity expressed in terms of m, the size of the sample provided to CAL. This upper bound is very convenient for directly analyzing the active learning speedup relative to supervised learning. A standard label complexity upper bound, which depends on 1/ε, can be extracted using the following simple observation. Lemma 8 (Hanneke, 2009; Anthony and Bartlett, 1999) Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P. Let H be a hypothesis class whose finite VC dimension is d, and let ε and δ be given. If m≥ 4 2 12 d ln + ln , ε ε δ then, with probability of at least 1 − δ, CAL will output a classifier whose true error is at most ε. Proof Hanneke (2009) observed that since CAL requests a label whenever there is a disagreement in the version space, it is guaranteed that after processing m examples, CAL will output a classifier that is consistent with all the m examples introduced to it. Therefore, CAL is a consistent learner. A classical result (Anthony and Bartlett, 1999, Theorem 4.8) is that any consistent learner will achieve, with probability of at least 1 − δ, a true error not exceeding ε after observing at most 12 2 4 ε d ln ε + ln δ labeled examples. 262 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Theorem 9 Let H be a hypothesis class whose finite VC dimension is d. If the rejection rate of CSS polylog( m ) δ (see definition in Section 2.3) is O , then (H , P) is actively learnable with exponential m label complexity speedup. Proof Plugging this rejection rate into Ψ (defined in Theorem 7) we have, m m polylog δ Ψ(H , δ, m) ∑ 1 − BΦ (H , , 2⌊log2 (i)⌋ ) = ∑ O log2 (m) i i=1 i=1 Applying Lemma 41 we get Ψ(H , δ, m) = O polylog By Theorem 7, k = O polylog m δ m log(m) δ i log(m) δ . . , and an application of Lemma 8 concludes the proof. 4. Label Complexity Bounding Technique and Its Applications In this section we present a novel technique for deriving target-independent label complexity bounds for active learning. The technique combines the reduction of Theorem 7 and a general datadependent coverage bound for selective classification from El-Yaniv and Wiener (2010). For some learning problems it is a straightforward technical exercise, involving VC-dimension calculations, to arrive with exponential label complexity bounds. We show a few applications of this technique resulting in both reproductions of known label complexity exponential rates as well as a new one. The following definitions (El-Yaniv and Wiener, 2010) are required for introducing the technique. Definition 10 (Version space compression set) For any hypothesis class H , let Sm be a labeled sample of m points inducing a version space V SH ,Sm . The version space compression set, S′ ⊆ Sm , ˆ ˆ is a smallest subset of Sm satisfying V SH ,Sm = V SH ,S′ . The (unique) number n = n(H , Sm ) = |S′ | is called the version space compression set size. Remark 11 Our ”version space compression set” is precisely Hanneke’s ”minimum specifying set” (Hanneke, 2007b) for f on U with respect to V , where, f = h∗ , U = Sm , V = H [Sm ] (see Definition 23). Definition 12 (Characterizing hypothesis) For any subset of hypotheses G ⊆ H , the characterizing hypothesis of G , denoted fG (x), is a binary hypothesis over X (not restricted to H ) obtaining positive values over the agreement set AGR(G ) (Definition 2), and zero otherwise. Definition 13 (Order-n characterizing set) For each n, let Σn be the set of all possible labeled samples of size n (all n-subsets, each with all 2n possible labelings). The order-n characterizing set of H , denoted Fn , is the set of all characterizing hypotheses fG (x), where G ⊆ H is a version space induced by some member of Σn . 263 E L -YANIV AND W IENER Definition 14 (Characterizing set complexity) Let Fn be the order-n characterizing set of H . The order-n characterizing set complexity of H , denoted γ (H , n), is the VC-dimension of Fn . The following theorem, credited to (El-Yaniv and Wiener, 2010, Theorem 21), is a powerful data-dependent coverage bound for perfect selective learning, expressed in terms of the version space compression set size and the characterizing set complexity. Theorem 15 (Data-dependent coverage guarantee) For any m, let a1 , a2 , . . . , am ∈ R be given, such that ai ≥ 0 and ∑m ai ≤ 1. Let (h, g) be perfect selective classifier (CSS, see Section 2.3). i=1 Then, R(h, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ, Φ(h, g) ≥ 1 − 2 γ (H , n) ln+ ˆ m 2em 2 , + ln an δ γ (H , n) ˆ ˆ where n is the size of the version space compression set and γ (H , n) is the order-n characterizing ˆ ˆ ˆ set complexity of H . Given an hypothesis class H , our recipe to deriving active learning label complexity bounds for H is: (i) calculate both n and γ (H , n); (ii) apply Theorem 15, obtaining a bound BΦ for the ˆ ˆ coverage; (iii) plug BΦ in Theorem 7 to get a label complexity bound expressed as a summation; (iv) Apply Lemma 41 to obtain a label complexity bound in a closed form. 4.1 Examples In the following example we derive a label complexity bound for the concept class of thresholds (linear separators in R). Although this is a toy example (for which an exponential rate is well known) it does exemplify the technique, and in many other cases the application of the technique is not much harder. Let H be the class of thresholds. We first show that the corresponding version space compression set size n ≤ 2. Assume w.l.o.g. that h∗ (x) I(x > w) for some w ∈ (0, 1). Let ˆ x− max{xi ∈ Sm |yi = −1} and x+ min(xi ∈ Sm |yi = +1). At least one of x− or x+ exist. Let ′ ′ Sm = {(x− , −1), (x+ , +1)}. Then V SH ,Sm = V SH ,Sm , and n = |Sm | ≤ 2. Now, γ (H , 2) = 2, because ˆ ′ the order-2 characterizing set of H is the class of intervals in R whose VC-dimension is 2. Plugging these numbers in Theorem 15, and using the assignment a1 = a2 = 1/2, BΦ (H , δ, m) = 1 − 2 4 ln (m/δ) 2 ln (em) + ln = 1−O . m δ m Next we plug BΦ in Theorem 7 obtaining a raw label complexity m Ψ(H , δ, m) = ∑ 1 − BΦ H , i=1 δ , 2⌊log2 (i)⌋ 2 log2 (m) m = ∑O i=1 ln (log2 (m) · i/δ) . i Finally, by applying Lemma 41, with a = 1 and b = log2 m/δ, we conclude that Ψ(H , δ, m) = O ln2 m δ . Thus, H is actively learnable with exponential speedup, and this result applies to any distribution. In Table 1 we summarize the n and γ (H , n) values we calculated for four other hypothesis classes. The ˆ ˆ 264 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Hypothesis class Distribution n ˆ γ (H , n) ˆ Linear separators in R Intervals in R Linear separators in R2 any any (target-dependent)2 any distribution on the unit circle (target-dependent)2 2 4 4 2 4 4 Linear separators in Rd Balanced axis-aligned rectangles in Rd mixture of Gaussians product distribution O (log m)d−1 /δ O (log (dm/δ)) O nd/2+1 ˆ O (d n log n) ˆ ˆ Table 1: The n and γ of various hypothesis spaces achieving exponential rates. ˆ last two cases are fully analyzed in Sections 4.2 and 6.1, respectively. For the other classes, where γ and n are constants, it is clear (Theorem 15) that exponential rates are obtained. We emphasize that ˆ the bounds for these two classes are target-dependent as they require that Sm include at least one sample from each class. 4.2 Linear Separators in Rd Under Mixture of Gaussians In this section we state and prove our main example, an exponential label complexity bound for linear classifiers in Rd . Theorem 16 Let H be the class of all linear binary classifiers in Rd , and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then, with probability of at least 1 − δ over choices of Sm , the number of label requests k by CAL is bounded by 2 k=O (log m)d +1 δ(d+3)/2 . Therefore by Lemma 8 we get k = O (poly(1/δ) · polylog(1/ε)) . Proof The following is a coverage bound for linear classifiers in d dimensions that holds in our setting with probability of at least 1 − δ (El-Yaniv and Wiener, 2010, Corollary 33),3 2 Φ(h, g) ≥ 1 − O 1 (log m)d · (d+3)/2 m δ . (2) 2. Target-dependent with at least one sample in each class. 3. This bound uses the fact that for linear classifiers in d dimensions n = O (log m)d−1 /δ (El-Yaniv and Wiener, 2010, ˆ Lemma 32), and that γ (H , n) = O nd/2+1 (El-Yaniv and Wiener, 2010, Lemma 27). ˆ ˆ 265 E L -YANIV AND W IENER Plugging this bound in Theorem 7 we obtain, Ψ(H , δ, m) = m ∑ i=1 1 − BΦ H , m = ∑O i=1 = O δ , 2⌊log2 (i)⌋ 2 log2 (m) 2 log2 (m) (log i)d · i δ log2 (m) δ d+3 2 d+3 2 m (log(i))d ·∑ i i=1 2 . Finally, an application of Lemma 41 with a = d 2 and b = 1 completes the proof. 5. Lower Bound on Label Complexity In the previous section we have derived an upper bound on the label complexity of CAL for various classifiers and distributions. In the case of linear classifiers in Rd we have shown an exponential speed up in terms of 1/ε but also an exponential slow down in terms of the dimension d. In passive learning there is a linear dependency in the dimension while in our case (active learning using CAL) there is an exponential one. Is it an artifact of our bounding technique or a fundamental phenomenon? To answer this question we derive an asymptotic lower bound on the label complexity. We show that the exponential dependency in d is unavoidable (at least asymptotically) for every bounding technique when considering linear classifier even under a single Gaussian (isotropic) distribution. The argument is obtained by the observation that CAL has to request a label to any point on the convex hull of a sample Sm . The bound is obtained using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution. Definition 17 (Gaussian polytope) Let X1 , ..., Xm be i.i.d. random points in Rd with common stan1 dard normal distribution (with zero mean and covariance matrix 2 Id ). A Gaussian polytope Pm is the convex hull of these random points. Denote by fk (Pm ) the number of k-faces in the Gaussian polytope Pm . Note that f0 (Pm ) is the number of vertices in Pm . The following two Theorems asymptotically bound the average and variance of fk (Pm ). Theorem 18 (Hug et al., 2004, Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution. Then E fk (Pm ) = c(k,d) (log m) d−1 2 · (1 + o(1)) as m → ∞, where c(k,d) is a constant depending only on k and d. 266 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Theorem 19 (Hug and Reitzner, 2005, Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution. Then there exists a positive constant cd , depending only on the dimension, such that d−1 Var ( fk (Pm )) ≤ cd (log m) 2 for all k ∈ {0, . . . , d − 1}. We can now use Chebyshev’s inequality to lower bound the number of vertices in Pm ( f0 (Pm )) with high probability. Theorem 20 Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution and δ > 0 be given. Then with probability of at least 1 − δ, f0 (Pm ) ≥ cd (log m) d−1 2 d−1 cd ˜ − √ (log m) 4 δ · (1 + o(1)) as m → ∞, where cd and cd are constants depending only on d. ˜ Proof Using Chebyshev’s inequality (in the second inequality), as well as Theorem 19 we get Pr ( f0 (Pm ) > E f0 (Pm ) − t) = 1 − Pr ( f0 (Pm ) ≤ E f0 (Pm ) − t) ≥ 1 − Pr (| f0 (Pm ) − E f0 (Pm )| ≥ t) d−1 cd Var ( f0 (Pm )) ≥ 1 − 2 (log m) 2 . ≥ 1− 2 t t Equating the RHS to 1 − δ and solving for t we get t= (log m) cd δ d−1 2 . Applying Theorem 18 completes the proof. Theorem 21 (Lower bound) Let H be the class of linear binary classifiers in Rd , and let the underlying distribution be standard normal distribution in Rd . Then there exists a target hypothesis such that, with probability of at least 1 − δ over choices of Sm , the number of label requests k by CAL is bounded by d−1 cd k ≥ (log m) 2 · (1 + o(1)). 2 as m → ∞, where cd is a constant depending only on d. Proof Let us look at the Gaussian polytope Pm induced by the random sample Sm . As long as all labels requested by CAL have the same value (the case of minuscule minority class) we note that every vertex of Pm falls in the region of disagreement with respect to any subset of Sm that do not include that specific vertex. Therefore, CAL will request label at least for each vertex of Pm . For sufficiently large m, in particular, 4 2cd d−1 ˜ √ log m ≥ , cd δ we conclude the proof by applying Theorem 20. 267 E L -YANIV AND W IENER 6. Relation to Existing Label Complexity Measures A number of complexity measures to quantify the speedup in active learning have been proposed. In this section we show interesting relations between our techniques and two well known measures, namely the teaching dimension (Goldman and Kearns, 1995) and the disagreement coefficient (Hanneke, 2009). Considering first the teaching dimension, we prove in Lemma 26 that the version space compression set size is bounded above, with high probability, by the extended teaching dimension growth function (introduced by Hanneke, 2007b). Consequently, it follows that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke’s disagreement coefficient and show in Theorem 34 that the coverage of CSS can be bounded below using the disagreement coefficient. Conversely, in Corollary 39 we show that the disagreement coefficient can be bounded above using any coverage bound for CSS. Consequently, the results here imply that the disagreement coefficient, θ(ε) grows slowly with 1/ε for the case of linear classifiers under a mixture of Gaussians. 6.1 Teaching Dimension The teaching dimension is a label complexity measure proposed by Goldman and Kearns (1995). The dimension of the hypothesis class H is the minimum number of examples required to present to any consistent learner in order to uniquely identify any hypothesis in the class. We now define the following variation of the extended teaching dimension (Heged¨ s, 1995) u due to Hanneke. Throughout we use the notation h1 (S) = h2 (S) to denote the fact that the two hypotheses agree on the classification of all instances in S. ¨ Definition 22 (Extended Teaching Dimension, Hegedus, 1995; Hanneke, 2007b) Let V ⊆ H , m ≥ m, 0, U ∈ X ∀f ∈ H , XT D( f ,V,U) = inf {t | ∃R ⊆ U : | {h ∈ V : h(R) = f (R)} | ≤ 1 ∧ |R| ≤ t} . Definition 23 (Hanneke, 2007b) For V ⊆ H , V [Sm ] denotes any subset of V such that ∀h ∈ V, | h′ ∈ V [Sm ] : h′ (Sm ) = h(Sm ) | = 1. Claim 24 Let Sm be a sample of size m, H an hypothesis class, and n = n(H , Sm ), the version space ˆ compression set size. Then, XT D(h∗ , H [Sm ], Sm ) = n. ˆ Proof Let Sn ⊆ Sm be a version space compression set. Assume, by contradiction, that there exist ˆ two hypotheses h1 , h2 ∈ H [Sm ], each of which agrees on the given classifications of all examples in Sn . Therefore, h1 , h2 ∈ V SH ,Sn , and by the definition of version space compression set, we get ˆ ˆ h1 , h2 ∈ V SH ,Sm . Hence, | h ∈ H [Sm ] : h(Sm ) = h∗ (Sm ) | ≥ 2, which contradicts definition 23. Therefore, | h ∈ H [Sm ] : h(Sn ) = h∗ (Sn ) | ≤ 1, ˆ ˆ 268 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION and XT D(h∗ , H [Sm ], Sm ) ≤ |Sn | = n. ˆ ˆ Let R ⊂ Sm be any subset of size |R| < n. Consequently, V SH ,Sm ⊂ V SH ,R , and there exist hypothesis, ˆ ′ ∈ VS h H ,R , that agrees with all labeled examples in R, but disagrees with at least one example in Sm . Thus, h′ (Sm ) = h∗ (Sm ), and according to definition 23, there exist hypotheses h1 , h2 ∈ H [Sm ] such that h1 (Sm ) = h′ (Sm ) = h∗ (Sm ) = h2 (Sm ). But h1 (R) = h2 (R) = h∗ (R), so | {h ∈ V [Sm ] : h(R) = h∗ (R)} | ≥ 2. It follows that XT D(h∗ , H [Sm ], Sm ) ≥ n. ˆ Definition 25 (XTD Growth Function, Hanneke, 2007b) For m ≥ 0, V ⊆ H , δ ∈ [0, 1], XT D(V, P, m, δ) = inf t|∀h ∈ H , Pr {XT D(h,V [Sm ], Sm ) > t} ≤ δ . Lemma 26 Let H be an hypothesis class, P an unknown distribution, and δ > 0. Then, with probability of at least 1 − δ, n ≤ XT D(H , P, m, δ). ˆ Proof According to Definition 25, with probability of at least 1 − δ, XT D(h∗ , H [Sm ], Sm ) ≤ XT D(H , P, m, δ). Applying Claim 24 completes the proof. Lemma 27 (Balanced Axis-Aligned Rectangles, Hanneke, 2007b, Lemma 4) If P is a product distribution on Rd with continuous CDF, and H is the set of axis-aligned rectangles such that ∀h ∈ H , PrX∼P {h(X) = +1} ≥ λ, then, XT D(H , P, m, δ) ≤ O d2 dm log . λ δ Lemma 28 Blumer et al., 1989, Lemma 3.2.3 Let F be a binary hypothesis class of finite VC dimension d ≥ 1. For all k ≥ 1, define the k-fold union, Fk∪ Then, for all k ≥ 1, ∪k f i : f i ∈ F , 1 ≤ i ≤ k . i=1 VC(Fk∪ ) ≤ 2dk log2 (3k). Lemma 29 (order-n characterizing set complexity) Let H be the class of axis-aligned rectangles in Rd . Then, γ(H , n) ≤ O (dn log n) . 269 E L -YANIV AND W IENER − + Proof Let Sn = Sk ∪ Sn−k be a sample of size n composed of k negative examples, {x1 , x2 , . . . xk }, and n − k positive ones. Let H be the class of axis-aligned rectangles. We define, ∀1 ≤ i ≤ k, + Sn−k ∪ {(xi , −1)} . Ri Notice that V SH ,Ri includes all axis aligned rectangles that classify all samples in S+ as positive, and xi as negative. Therefore, the agreement region of V SH ,Ri is composed of two components as depicted in Figure 1. The first component is the smallest rectangle that bounds the positive samples, and the second is an unbounded convex polytope defined by up to d hyperplanes intersecting at xi . Let AGRi be the agreement region of V SH ,Ri and AGR the agreement region of V SH ,Sn . Clearly, Ri ⊆ Sn , so V SH ,Sn ⊆ V SH ,Ri , and AGRi ⊆ AGR, and it follows that k i=1 AGRi ⊆ AGR. Assume, by contradiction, that x ∈ AGR but x ∈ k AGRi . Therefore, for any 1 ≤ i ≤ k, there exist i=1 (i) (i) (i) (i) two hypotheses h1 , h2 ∈ V SH ,Ri , such that, h1 (x) = h2 (x). Assume, without loss of generality, (i) that h1 (x) = 1. We define k h1 (i) h1 k and (i) h2 , h2 i=1 i=1 (i) meaning that h1 classifies a sample as positive if and only if all hypotheses h1 classify it as positive. Noting that the intersection of axis-aligned rectangles is itself an axis-aligned rectangle, we know (i) (i) that h1 , h2 ∈ H . Moreover, for any xi we have, h1 (xi ) = h2 (xi ) = −1, so also h1 (xi ) = h2 (xi ) = −1, and h1 , h2 ∈ V SH ,Sn . But h1 (x) = h2 (x). Contradiction. Therefore, k AGRi . AGR = i=1 It is well known that the VC dimension of a hyper-rectangle in Rd is 2d. The VC dimension of AGRi is bounded by the VC dimension of the union of two hyper-rectangles in Rd . Furthermore, the VC dimension of AGR is bounded by the VC dimension of the union of all AGRi . Applying Lemma 28 twice we get, VCdim {AGR} ≤ 42dk log2 (3k) ≤ 42dn log2 (3n). If k = 0 then the entire sample is positive and the region of agreement is an hyper-rectangle. Therefore, VCdim {AGR} = 2d. If k = n then the entire sample is negative and the region of agreement is the points of the samples themselves. Hence, VCdim {AGR} = n. Overall we get that in all cases, VCdim {AGR} ≤ 42dn log2 (3n) = O (dn log n) . 270 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Figure 1: Agreement region of V SH ,Ri . Corollary 30 (Balanced Axis-Aligned Rectangles) Under the same conditions of Lemma 27, the class of balanced axis-aligned rectangles in Rd can be perfectly selectively learned with fast coverage rate. Proof Applying Lemmas 26 and 27 we get that with probability of at least 1 − δ, dm d2 log . λ δ n≤O ˆ Any balanced axis-aligned rectangle belongs to the class of all axis-aligned rectangles. Therefore, the coverage of CSS for the class of balanced axis-aligned rectangles is bounded bellow by the coverage of the class of axis-aligned rectangles. Applying Lemma 29, and assuming m ≥ d, we obtain, γ (H , n) ≤ O d ˆ d2 d2 dm dm log log log λ δ λ δ ≤O dm d3 log2 . λ λδ Applying Theorem 15 completes the proof. 6.2 Disagreement Coefficient In this section we show interesting relations between the disagreement coefficient and coverage bounds in perfect selective classification. We begin by defining, for an hypothesis h ∈ H , the set of all hypotheses that are r-close to h. Definition 31 (Hanneke, 2011b, p.337) For any hypothesis h ∈ H , distribution P over X , and r > 0, define the set B(h, r) of all hypotheses that reside in a ball of radius r around h, B(h, r) h′ ∈ H : Pr X∼P h′ (X) = h(X) ≤ r . Theorem 32 (Vapnik and Chervonenkis, 1971; Anthony and Bartlett, 1999, p.53) Let H be a hypothesis class with VC-dimension d. For any probability distribution P on X × {±1}, with probability of at least 1 − δ over the choice of Sm , any hypothesis h ∈ H consistent with Sm satisfies R(h) ≤ η(d, m, δ) 2 2em 2 + ln . d ln m d δ 271 E L -YANIV AND W IENER For any G ⊆ H and distribution P we denote by ∆G the volume of the disagreement region of G, ∆G Pr {DIS(G)} . Definition 33 (Disagreement coefficient, Hanneke, 2009) Let ε ≥ 0. The disagreement coefficient of the hypothesis class H with respect to the target distribution P is θ(ε) θh∗ (ε) = sup r>ε ∆B(h∗ , r) . r The following theorem formulates an intimate relation between active learning (disagreement coefficient) and selective classification. Theorem 34 Let H be an hypothesis class with VC-dimension d, P an unknown distribution, ε ≥ 0, and θ(ε), the corresponding disagreement coefficient. Let (h, g) be a perfect selective classifier (CSS, see Section 2.3). Then, R(h, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ, Φ(h, g) ≥ 1 − θ(ε) · max {η(d, m, δ), ε} . Proof Clearly, R(h, g) = 0, and it remains to prove the coverage bound. By Theorem 32, with probability of at least 1 − δ, ∀h ∈ V SH ,Sm R(h) ≤ η(d, m, δ) ≤ max {η(d, m, δ), ε} . Therefore, V SH ,Sm ⊆ B (h∗ , max {η(d, m, δ), ε}) , ∆V SH ,Sm ≤ ∆B (h∗ , max {η(d, m, δ), ε}) . By Definition 33, for any r′ > ε, ∆B(h∗ , r′ ) ≤ θ(ε)r′ . Thus, the proof is complete by recalling that Φ(h, g) = 1 − ∆V SH ,Sm . Theorem 34 tells us that whenever our learning problem (specified by the pair (H , P)) has a disagreement coefficient that grows slowly with respect to 1/ε , it can be (perfectly) selectively learned with a “fast” coverage bound. Consequently, through Theorem 9 we also know that in each case where there exists a disagreement coefficient that grows slowly with respect to 1/ε, active learning with a fast rate can also be deduced directly through a reduction from perfect selective classification. It follows that as far as fast rates in active learning are concerned, whatever can be accomplished by bounding the disagreement coefficient, can be accomplished also using perfect selective classification. This result is summarized in the following corollary. Corollary 35 Let H be an hypothesis class with VC-dimension d, P an unknown distribution, and θ(ε), the corresponding disagreement coefficient. If θ(ε) = O(polylog(1/ε)), there exists a coverage bound such that an application of Theorem 7 ensures that (H , P) is actively learnable with exponential label complexity speedup. 272 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Proof The proof is established by straightforward applications of Theorems 34 with ε = 1/m and 9. The following result, due to Hanneke (2011a), implies a coverage upper bound for CSS. Lemma 36 (Hanneke, 2011a, Proof of Lemma 47) Let H be an hypothesis class, P an unknown distribution, and r ∈ (0, 1). Then, EP ∆Dm ≥ (1 − r)m ∆B (h∗ , r) , where Dm V SH ,Sm ∩ B (h∗ , r) . (3) Theorem 37 (Coverage upper bound) Let H be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1). Then, for any r ∈ (0, 1), 1 > α > δ, BΦ (H , δ, m) ≤ 1 − where BΦ (H , δ, m) is any coverage bound. (1 − r)m − α ∆B (h∗ , r) , 1−α Proof Recalling the definition of Dm (3), clearly Dm ⊆ V SH ,Sm and Dm ⊆ B(h∗ , r). These inclusions imply (respectively), by the definition of disagreement set, ∆Dm ≤ ∆V SH ,Sm , and ∆Dm ≤ ∆B(h∗ , r). (4) Using Markov’s inequality (in inequality (5) of the following derivation) and applying (4) (in equality (6)), we thus have, (1 − r)m − α (1 − r)m − α ∆B (h∗ , r) ≤ Pr ∆Dm ≤ ∆B (h∗ , r) 1−α 1−α 1 − (1 − r)m Pr ∆B (h∗ , r) − ∆Dm ≥ ∆B (h∗ , r) 1−α 1 − (1 − r)m Pr |∆B (h∗ , r) − ∆Dm | ≥ ∆B (h∗ , r) 1−α E {|∆B (h∗ , r) − ∆Dm |} (1 − α) · (1 − (1 − r)m ) ∆B (h∗ , r) ∆B (h∗ , r) − E∆Dm . (1 − α) · (1 − (1 − r)m ) ∆B (h∗ , r) Pr ∆V SH ,Sm ≤ = ≤ ≤ = Applying Lemma 36 we therefore obtain, ≤ (1 − α) · ∆B (h∗ , r) − (1 − r)m ∆B(h∗ , r) = 1 − α < 1 − δ. (1 − (1 − r)m ) ∆B (h∗ , r) Observing that for any coverage bound, Pr ∆V SH ,Sm ≤ 1 − BΦ (H , δ, m) ≥ 1 − δ, completes the proof. 273 (5) (6) E L -YANIV AND W IENER Corollary 38 Let H be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1/8). Then for any m ≥ 2, 1 1 , BΦ (H , δ, m) ≤ 1 − ∆B h∗ , 7 m where BΦ (H , δ, m) is any coverage bound. Proof The proof is established by a straightforward application of Theorem 37 with α = 1/8 and r = 1/m. With Corollary 38 we can bound the disagreement coefficient for settings whose coverage bound is known. Corollary 39 Let H be an hypothesis class, P an unknown distribution, and BΦ (H , δ, m) a coverage bound. Then the disagreement coefficient is bounded by, θ(ε) ≤ max sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 . r Proof Applying Corollary 38 we get that for any r ∈ (0, 1/2), 1 − BΦ (H , 1/9, ⌊1/r⌋) ∆B(h∗ , r) ∆B(h∗ , 1/⌊1/r⌋) ≤ ≤ 7· . r r r Therefore, θ(ε) = sup r>ε ∆B(h∗ , r) ≤ max r sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 . r Corollary 40 Let H be the class of all linear binary classifiers in Rd , and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then θ(ε) ≤ O polylog 1 ε . Proof Applying Corollary 39 together with inequality 2 we get that θ(ε) ≤ max sup 7 · r∈(ε,1/2) 1 − BΦ (H , 1/9, ⌊1/r⌋) ,2 r 7 ≤ max sup ·O r∈(ε,1/2) r 2 d+3 (log ⌊1/r⌋)d ·9 2 ⌊1/r⌋ 274 ,2 ≤O 1 log ε d2 . ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION 7. Concluding Remarks For quite a few years, since its inception, the theory of target-independent bounds for noise-free active learning managed to handle relatively simple settings, mostly revolving around homogeneous linear classifiers under the uniform distribution over the sphere. It is likely that this distributional uniformity assumption was often adapted to simplify analyses. However, it was shown by Dasgupta (2005) that under this distribution, exponential speed up cannot be achieved when considering general (non homogeneous) linear classifiers. The reason for this behavior is related to the two tasks that a good active learner should successfully accomplish: exploration and exploitation. Intuitively (and oversimplifying things) exploration is the task of obtaining at least one sample in each class, and exploitation is the process of refining the decision boundary by requesting labels of points around the boundary. Dasgupta showed that exploration cannot be achieved fast enough under the uniform distribution on the sphere. The source of this difficulty is the fact that under this distribution all training points reside on their convex hull. In general, the speed of exploration (using linear classifiers) depends on the size (number of vertices) of the convex hull of the training set. When using homogeneous linear classifiers, exploration is trivially achieved (under the uniform distribution) and exploitation can achieve exponential speedup. So why in the non-verifiable model (Balcan et al., 2008) it is possible to achieve exponential speedup even when using non homogeneous linear classifiers under the uniform distribution? The answer is that in the non-verifiable model, label complexity attributed to exploration is encapsulated in a target-dependent “constant.” Specifically, in Balcan et al. (2008) this constant is explicitly defined to be the probability mass of the minority class. Indeed, in certain noise free settings using linear classifiers, where the minority class is large enough, exploration is a non issue. In general, however, exploration is a major bottleneck in practical active learning (Baram et al., 2004; Begleiter et al., 2008). The present results show how exponential speedup can be achieved, including exploration, when using different (and perhaps more natural) distributions. With these good news, a somewhat pessimistic picture arises from the lower bound we obtained for the exponential dependency on the dimension d. This negative result is not restricted to streambased active learning and readily applies also to the pool-based model. While the bound is only asymptotic, we conjecture that it also holds for finite samples. Moreover, we believe that within the stream- or pool-based settings a similar statement should hold true for any active learning method (and not necessarily CAL-based querying strategies). This result indicates that when performing noise free active learning of linear classifiers, aggressive feature selection is beneficial for exploration speedup. We note, however, that it remains open whether a slowdown exponent of d (rather than d 2 ) is achievable. We have exposed interesting relations of the present technique to well known complexity measures for active learning, namely, the teaching dimension and the disagreement coefficient. These developments were facilitated by observations made by Hanneke on the teaching dimension and the disagreement coefficient. These relations gave rise to further observations on active learning, which are discussed in Section 6 and include exponential speedup for balanced axis-aligned rectangles. Finally, we note that the intimate relation between selective classification and the disagreement coefficient was recently exposed in another result for selective classification where the disagreement coefficient emerged as a dominating factor in a coverage bound for agnostic selective classification (El-Yaniv and Wiener, 2011). 275 E L -YANIV AND W IENER Acknowledgments We thank the anonymous reviewers for their good comments. This paper particularly benefited from insightful observations made by one of the reviewers, which are summarized in Section 6, including the proof of Theorem 37 and the link between our n and the extended teaching dimension ˆ (Lemmas 26 and 27). Appendix A. Lemma 41 For any m ≥ 3, a ≥ 1, b ≥ 1 we get lna (bi) i m ∑ i=1 Proof Setting f (x) lna (bx) x , < 4 a+1 ln (b(m + 1)). a we have lna−1 (bx) df = (a − ln bx) · . dx x2 Therefore, f is monotonically increasing when x < ea /b, monotonically decreasing function when x ≥ ea /b and its attains its maximum at x = ea /b. Consequently, for i < ea /b − 1, or i ≥ ea /b + 1, i+1 f (i) ≤ f (x)dx. x=i−1 For ea /b − 1 ≤ i < ea /b + 1, f (i) ≤ f (ea /b) = b a e a ≤ aa . (7) Therefore, if m < ea − 1 we have, m ∑ i=1 m f (i) = lna (b) + ∑ f (i) < 2 · i=2 m+1 x=1 f (x)dx ≤ 2 lna+1 (b(m + 1)). a+1 Otherwise, m ≥ ea /b, in which case we overcome the change of slope by adding twice the (upper bound on the) maximal value (7), m ∑ f (i) < i=1 ≤ 2 2 2 lna+1 (b(m + 1)) + 2aa = lna+1 (b(m + 1)) + aa+1 a+1 a+1 a 2 4 2 lna+1 (b(m + 1)) + lna+1 bm ≤ lna+1 (b(m + 1)). a+1 a a 276 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION Appendix B. Alternative Proof of Lemma 6 Using Super Martingales Define Wk ∑k (Zi − bi ). We assume that with probability of at least 1 − δ/2, i=1 Pr{Zi |Z1 , . . . , Zi−1 } ≤ bi , simultaneously for all i. Since Zi is a binary random variable it is easy to see that (w.h.p.), EZi {Wi |Z1 , . . . , Zi−1 } = Pr{Zi |Z1 , . . . , Zi−1 } − bi +Wi−1 ≤ Wi−1 , and the sequence W1m W1 , . . . ,Wm is a super-martingale with high probability. We apply the following theorem by McDiarmid that refers to martingales (but can be shown to apply to supermartingales, by following its original proof). Theorem 42 (McDiarmid, 1998, Theorem 3.12) Let Y1 , . . . ,Yn be a martingale difference sequence with −ak ≤ Yk ≤ 1 − ak for each k; let A = 1 ∑ ak . Then, for any ε > 0, n Pr ∑ Yk ≥ Anε ≤ exp (−[(1 + ε) ln(1 + ε) − ε]An) ≤ exp − Anε2 . 2(1 + ε/3) In our case, Yk = Wk −Wk−1 = Zk − bk ≤ 1 − bk and we apply the (revised) theorem with ak and An ∑ bk B. We thus obtain, for any 0 < ε < 1, Pr ∑ Zk ≥ B + Bε ≤ exp − bk Bε2 . 2(1 + ε/3) Equating the right-hand side to δ/2, we obtain ε = 2 2 ln ± 3 δ 4 22 2 ln + 8B ln 9 δ δ ≤ 1 2 ln + 3 δ 1 22 ln + 9 δ = 2 2 ln + 3 δ 2B ln 2 δ 2B ln /2B 2 δ /B /B. Applying the union bound completes the proof. References M. Anthony and P.L. Bartlett. Neural Network Learning; Theoretical Foundations. Cambridge University Press, 1999. L. Atlas, D. Cohn, R. Ladner, A.M. El-Sharkawi, and R.J. Marks. Training connectionist networks with queries and selective sampling. In Neural Information Processing Systems (NIPS), pages 566–573, 1990. M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In 21st Annual Conference on Learning Theory (COLT), pages 45–56, 2008. 277 E L -YANIV AND W IENER Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255–291, 2004. P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008. R. Begleiter, R. El-Yaniv, and D. Pechyony. Repairing self-confident active-transductive learners using systematic exploration. Pattern Recognition Letters, 29(9):1245–1251, 2008. A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Chervonenkis dimension. Journal of the ACM, 36, 1989. Learnability and the Vapnik- C.K. Chow. An optimum character recognition system using decision function. IEEE Transactions on Computers, 6(4):247–254, 1957. C.K. Chow. On optimum recognition error and reject trade-off. IEEE Transactions on Information Theory, 16:41–36, 1970. D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994. S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, pages 235–242, 2005. S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009. R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010. R. El-Yaniv and Y. Wiener. Agnostic selective classification. In Neural Information Processing Systems (NIPS), 2011. S. Fine, R. Gilad-Bachrach, and E. Shamir. Query by committee, linear separation and random walks. Theoretical Computer Science, 284(1):25–51, 2002. Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and Query by Committee. In Advances in Neural Information Processing Systems (NIPS) 5, pages 483–490, 1993. Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997. Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004. E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, the Hebrew University of Jerusalem, 2007. S. Goldman and M. Kearns. On the complexity of teaching. JCSS: Journal of Computer and System Sciences, 50, 1995. 278 ACTIVE L EARNING VIA P ERFECT S ELECTIVE C LASSIFICATION S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 353–360, 2007a. S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), volume 4539 of Lecture Notes in Artificial Intelligence, pages 66–81, 2007b. S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009. S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. CoRR, abs/1108.1766, 2011a. URL http://arxiv.org/abs/1108.1766. informal publication. S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 37(1):333–361, 2011b. T. Heged¨ s. Generalized teaching dimensions and the query complexity of learning. In COLT: u Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1995. R. Herbei and M.H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963. D. Hug and M. Reitzner. Gaussian polytopes: variances and limit theorems, June 2005. D. Hug, G. O. Munsonious, and M. Reitzner. Asymptotic mean values of Gaussian polytopes. Beitr¨ ge Algebra Geom., 45:531–548, 2004. a C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195– 248. Springer-Verlag, 1998. T. Mitchell. Version spaces: a candidate elimination approach to rule learning. In IJCAI’77: Proceedings of the 5th international joint conference on Artificial Intelligence, pages 305–310, 1977. H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning theory (COLT), pages 287–294, 1992. V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971. M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1: 155–168, 2007. 279
4 0.41443253 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
5 0.39338332 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
6 0.39000317 97 jmlr-2012-Regularization Techniques for Learning with Matrices
7 0.3597638 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.32118133 80 jmlr-2012-On Ranking and Generalization Bounds
9 0.31790349 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
10 0.31073615 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
11 0.2937898 111 jmlr-2012-Structured Sparsity and Generalization
12 0.29033929 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
13 0.2693266 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
14 0.25096485 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
15 0.2438266 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
16 0.22888361 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
17 0.22502412 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
18 0.21926063 73 jmlr-2012-Multi-task Regression using Minimal Penalties
19 0.20497087 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
20 0.1974102 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
topicId topicWeight
[(7, 0.033), (21, 0.032), (26, 0.056), (27, 0.011), (29, 0.035), (30, 0.341), (44, 0.011), (49, 0.031), (56, 0.016), (59, 0.021), (69, 0.012), (75, 0.081), (79, 0.018), (81, 0.011), (92, 0.09), (96, 0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.66104138 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
Author: Sivan Sabato, Naftali Tishby
Abstract: In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. Keywords: multiple-instance learning, learning theory, sample complexity, PAC learning, supervised classification
2 0.42809445 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
3 0.42682755 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.42613763 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.42361459 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan
Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs can not directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C -learning,” and we present efficient coordinate descent algorithms for the training of regularized C -learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C learning
6 0.42186657 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
7 0.42186177 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
8 0.41548634 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
9 0.4140577 97 jmlr-2012-Regularization Techniques for Learning with Matrices
10 0.4128474 82 jmlr-2012-On the Necessity of Irrelevant Variables
11 0.41248113 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
12 0.41162625 13 jmlr-2012-Active Learning via Perfect Selective Classification
13 0.41108188 111 jmlr-2012-Structured Sparsity and Generalization
14 0.41020131 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
15 0.40981483 80 jmlr-2012-On Ranking and Generalization Bounds
16 0.4087522 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
17 0.40774992 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
18 0.40751094 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
19 0.4062269 103 jmlr-2012-Sampling Methods for the Nyström Method
20 0.40588012 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection