
109 jmlr-2012-Stability of Density-Based Clustering


Source: pdf

Author: Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman

Abstract: High density clusters can be characterized by the connected components of a level set L(λ) = {x : p(x) > λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree $T = \bigcup_{\lambda} L(\lambda)$. In this paper, we study the behavior of a density level set estimate $\hat{L}(\lambda)$ and cluster tree estimate $\hat{T}$ based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of $\hat{L}(\lambda)$ and $\hat{T}$ as a function of h, and investigate the theoretical properties of these instability measures. Keywords: clustering, density estimation, level sets, stability, model selection

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we study the behavior of a density level set estimate L(λ) and cluster tree estimate T based on a kernel density estimator with kernel bandwidth h. [sent-10, score-0.694]

2 We define two notions of instability to measure the variability of L(λ) and T as a function of h, and investigate the theoretical properties of these instability measures. [sent-11, score-0.486]

3 Introduction A common approach to identifying high density clusters is based on using level sets of the density function (see, for instance, Hartigan, 1975; Rigollet and Vert, 2009). [sent-13, score-0.362]

4 We call the collection of clusters $T = \{C_\lambda\}_{\lambda \ge 0}$ the cluster tree of the density p. [sent-24, score-0.326]

5 If the density does not contain any jumps or flat parts, then there is a one-to-one correspondence between the level sets indexed by the density level and the probability content. [sent-32, score-0.436]

6 Figure 1: The cluster tree for a Gaussian kernel density estimate (normal reference rule bandwidth) of a sample from the mixture (4/7)N(0, 1) + (2/7)N(3. [sent-44, score-0.368]

7 09 and of the total variation instability Γn (h) (bottom) for the mixture distribution in Figure 1 as functions of the bandwidth h. [sent-68, score-0.365]

8 • We consider plug-in estimates of the level sets L(λ) corresponding to fixed density levels λ and also to the level sets L(λα ) corresponding to fixed probability contents α using kernel density estimators. [sent-70, score-0.459]

9 We construct an estimator of the cluster instability and analyze its performance as n becomes large, and argue that stability can provide guidance on the optimal choice of the bandwidth parameter. [sent-73, score-0.519]

10 • We formulate and analyze a stronger notion of cluster stability that is based on the total variation distance between kernel density estimates computed over different data subsamples. [sent-76, score-0.395]

11 In Section 3 we construct plug-in estimates L(λ) of the level set L(λ), T of the cluster tree T , and M(α) of the level set indexed by probability content M(α). [sent-84, score-0.387]

12 2 Assumptions We will use the following assumptions on the density p and its local behavior around a given density level λ. [sent-114, score-0.342]

13 (A2) Local density regularity at λ- For a given density level of interest λ, there exist constants 0 < κ1 ≤ κ2 < ∞ and 0 < ε0 such that, for all ε < ε0 , κ1 ε ≤ P({x : |p(x) − λ| ≤ ε}) ≤ κ2 ε. [sent-117, score-0.339]

14 Assumptions (A1) and (A2) impose some mild regularity conditions on the density: (A1) implies that the density cannot change drastically anywhere, while (A2) implies that the density cannot be too flat or steep locally around the level set. [sent-120, score-0.339]

15 Note that ph is the Lebesgue density of the probability measure Ph = P ∗ Kh , where ∗ denotes convolution of probability measures and Kh denotes the probability measure of a random variable with density $K_h(z) = h^{-d} K(z/h)$, $z \in \mathbb{R}^d$. [sent-139, score-0.825]

16 We note that the compactness of K and assumption (A0) on p imply that the support of Ph is compact, while assumption (A1) on p further yields that ph ∈ Σ(A), both statements holding for all h ≥ 0 (for a formal proof of the second claim, see the end of the proof of Lemma 5). [sent-140, score-0.516]

17 (B2) Local density regularity at λ - For a given density level λ, there exist positive constants $\kappa'_1 \le \kappa'_2$, $\varepsilon_0$ and H bounded away from 0 and ∞, such that, for all $0 \le \varepsilon < \varepsilon_0$, $\kappa'_1 \varepsilon \le \inf_{0 \le h \le H} P(\{x : |p_h(x) - \lambda| \le \varepsilon\}) \le \sup_{0 \le h \le H} P(\{x : |p_h(x) - \lambda| \le \varepsilon\}) \le \kappa'_2 \varepsilon$. [sent-142, score-0.386]

18 (B3) Local density regularity at α - For a given probability value α, there exist positive constants $\kappa_3$, $\eta_0$ and H bounded away from 0 and ∞, such that, for all $|\eta| < \eta_0$, $\sup_{0 \le h \le H} d_\infty(M_h(\alpha), M_h(\alpha + \eta)) \le \kappa_3 |\eta|$, where $M_h(\alpha) = \{u : p_h(u) > \lambda_\alpha\}$. [sent-143, score-0.711]

19 Assumption (B3) characterizes the regularity of the level sets of ph and essentially states that the boundary of these level sets is well-behaved and not space-filling (see Tsybakov, 1997; Singh et al. [sent-151, score-0.663]

20 Our analysis depends crucially on the quantity $\|\hat{p}_{h,X} - p_h\|_\infty = \sup_{u \in \mathbb{R}^d} |\hat{p}_{h,X}(u) - p_h(u)|$, for which we use a probabilistic upper bound established by Giné and Guillou (2002), to which the reader is referred for details. [sent-154, score-1.032]

21 For every ε > 0 and h > 0, there exists n(ε, h) such that, for all n ≥ n(ε, h) PX ( ph,X − ph 2. [sent-161, score-0.516]

22 911 ∞ 1 > εn ) ≤ , n (1) (2) K3 log n , nhd n (3) R INALDO , S INGH , N UGENT AND WASSERMAN The numbers n(ε, h) and n0 depend also on the VC characteristic of K and on B. [sent-165, score-0.474]

23 … parameter $h_n$ is only allowed to vanish at a slower rate than $(\log n)/n$. As a result, our measures of instability defined in Sections 4. [sent-170, score-0.342]

24 Estimating the Level Set and Cluster Tree For a given density level λ and kernel bandwidth h, the estimated level set is Lh,X (λ) = {x : ph,X (x) > λ}. [sent-175, score-0.376]
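
The plug-in level set estimate in sentence 24 can be sketched in a few lines of Python. This is a minimal one-dimensional illustration rather than the authors' code: the Epanechnikov kernel (chosen here only because it has compact support), the helper names `kde` and `level_set_on_grid`, the toy sample and the evaluation grid are all assumptions made for the example.

```python
def epanechnikov(z):
    """Compactly supported kernel K(z) = 0.75 * (1 - z^2) on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def kde(u, data, h):
    """One-dimensional kernel density estimate at point u with bandwidth h."""
    return sum(epanechnikov((u - x) / h) for x in data) / (len(data) * h)


def level_set_on_grid(grid, data, h, lam):
    """Grid points whose estimated density strictly exceeds the level lam."""
    return [u for u in grid if kde(u, data, h) > lam]


# Hypothetical toy sample with two separated groups, evaluated on a coarse grid.
sample = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1]
grid = [i * 0.1 for i in range(-20, 71)]
estimated_level_set = level_set_on_grid(grid, sample, h=0.5, lam=0.05)
```

Restricting the estimate to a grid is a convenience of the sketch; the set in the text is defined over all x.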

25 The performance of plug-in estimators of density level sets has been studied earlier, but we state the results here in a form that provides insights into the performance of instability measures proposed in the next section. [sent-179, score-0.438]

26 For any sequence $h_n = \omega((\log n/n)^{1/d})$, let $\varepsilon_n = \sqrt{K_3 \log n / (n h_n^d)}$ and $r_{h_n,\varepsilon_n,\lambda} = P(\{u : |p(u) - \lambda| < A D h_n + \varepsilon_n\})$. [sent-181, score-0.614]

27 The level set estimator indexed by the probability content α ∈ (0, 1) is given as $M_{h,X}(\alpha) = L_{h,X}(\lambda_{h,\alpha,X})$, where $\lambda_{h,\alpha,X} = \sup\{\lambda : P_X(\{u : \hat{p}_{h,X}(u) > \lambda\}) \ge \alpha\}$ (4) and $\hat{p}_{h,X}$ is the kernel density estimate computed using the data X with bandwidth h. [sent-188, score-0.453]
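
The level in (4) can be computed directly when P_X is read as the empirical measure of the sample, which is an assumption of this sketch (it is consistent with the 1/n slack used later in the proof of Lemma 5). Sorting the density values at the sample points in decreasing order, the supremum equals the k-th largest value with k = ⌈αn⌉. The kernel and the function names below are illustrative choices, not taken from the paper.

```python
import math


def epanechnikov(z):
    """Compactly supported kernel K(z) = 0.75 * (1 - z^2) on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def kde(u, data, h):
    """One-dimensional kernel density estimate at point u with bandwidth h."""
    return sum(epanechnikov((u - x) / h) for x in data) / (len(data) * h)


def empirical_level(data, h, alpha):
    """sup{lam : fraction of sample points with kde > lam is at least alpha}.

    The fraction of points with kde > lam is at least alpha exactly when
    lam is below the k-th largest density value, k = ceil(alpha * n).
    """
    values = sorted((kde(x, data, h) for x in data), reverse=True)
    k = max(1, math.ceil(alpha * len(values)))
    return values[k - 1]
```

Because the inequality defining the level set is strict, the supremum itself is not attained: at the returned level fewer than ⌈αn⌉ sample points lie strictly above it, while any slightly smaller level includes at least ⌈αn⌉ of them.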

28 Lemma 5 Assume that the true density satisfies (A0)-(A1) and the density level sets of ph corresponding to probability content α satisfy (B3). [sent-197, score-0.893]

29 … $+\, 8n e^{-n\varepsilon^2/32}$, (5) Using Lemma 4 and Lemma 5, we immediately obtain the following bound on the deviation of the estimated level λh,α,X from the true density level λα corresponding to probability content α. [sent-199, score-0.326]

30 M(α)∆Mh,X (α) Theorem 7 Assume that the density p satisfies conditions (A0) and (A1) and the level set of ph indexed by probability content α satisfies (B3). [sent-202, score-0.793]

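31 For any sequence $h_n = \omega((\log n/n)^{1/d})$, let $\varepsilon_n = \sqrt{K_3 \log n / (n h_n^d)}$ and set $C_{1,n} = A D h_n + \varepsilon_n$, $C_{2,n} = A D h_n + (A\kappa_3 + 1)\varepsilon_n + A\kappa_3/n$ and $r_{h_n,\varepsilon_n,\alpha} = P(\{u : |p(u) - \lambda_\alpha| \le C_{1,n} + C_{2,n}\})$. [sent-203, score-0.614]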

32 The first measure of cluster stability we analyze is the level set stability, which we denote, for a fixed density level λ and a varying bandwidth value h, with Ξλ,n (h). [sent-220, score-0.523]

33 The reason is that, as h gets larger, ph (x) decreases. [sent-232, score-0.516]

34 Every time we reach a value of h such that a mode of ph has height λ, Ξλ,n (h) will increase. [sent-233, score-0.537]

35 Unlike the level set stability, the total variation stability is a global measure of cluster stability in the sense that it takes into account the difference between ph,X and ph,Y over all measurable sets, not just over the level sets. [sent-244, score-0.445]
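
As a rough numerical illustration of a total variation comparison between two kernel density estimates, the sketch below approximates half the L1 distance between them with a midpoint Riemann sum. The exact definition (7) is not reproduced in this excerpt, so treating Γn(h) as half the L1 distance is an assumption (it is consistent with the later bound of Γn(h) by μ(S)/2 times a sup-norm difference). The kernel, the function names and the integration range are also illustrative.

```python
def epanechnikov(z):
    """Compactly supported kernel K(z) = 0.75 * (1 - z^2) on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def kde(u, data, h):
    """One-dimensional kernel density estimate at point u with bandwidth h."""
    return sum(epanechnikov((u - x) / h) for x in data) / (len(data) * h)


def tv_instability(x, y, h, lo, hi, m=2000):
    """Midpoint-rule approximation of 0.5 * integral of |kde_x - kde_y| over [lo, hi]."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m):
        u = lo + (i + 0.5) * step
        total += abs(kde(u, x, h) - kde(u, y, h))
    return 0.5 * step * total
```

The interval [lo, hi] should cover the supports of both estimates; with a compactly supported kernel that means extending the data range by at least h on each side.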

36 In this case we recommend choosing h to be the smallest bandwidth value h∗ for which the instability is no larger than a prespecified probability value β ∈ (0, 1), that is Γn (h∗ ) ≤ β. [sent-249, score-0.337]
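
The selection rule in sentence 36 amounts to a scan over candidate bandwidths. The sketch below assumes the instability values have already been computed on a finite grid of bandwidths; the function name and the example numbers are hypothetical.

```python
def smallest_stable_bandwidth(bandwidths, instabilities, beta):
    """Return the smallest h whose instability is at most beta, or None if none qualifies."""
    for h, gamma in sorted(zip(bandwidths, instabilities)):
        if gamma <= beta:
            return h
    return None


# Hypothetical instability values on a small bandwidth grid.
candidate_h = [0.5, 0.1, 1.0]
gamma_values = [0.2, 0.6, 0.05]
h_star = smallest_stable_bandwidth(candidate_h, gamma_values, beta=0.25)
```

On a finite grid the result is only as fine as the grid itself, so in practice the candidate bandwidths would need to be reasonably dense around the point where the instability first drops below β.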

37 In regions where it is small, it also behaves like $1/\sqrt{nh^d}$. [sent-260, score-0.474]

38 1 Level Set Stability For the analysis of the level set stability we focus on a single level set indexed by some density level value λ ≥ 0. [sent-262, score-0.455]

39 3, for values of $h \ll (\log n)/n$, the kernel density estimate $\hat{p}_h$ is no longer a reliable estimate of $p_h$. [sent-284, score-1.225]

40 In particular, the fluctuations of rh,ε as a function of h are related to the values of h for which the critical points of ph are in the interval [λ − ε, λ + ε]. [sent-303, score-0.516]

41 Notice that, while Ah,ε remains bounded away from ∞ for any sequence εn → 0 and $h_n = \omega(n^{-1/d})$, the same is not true for Ah,ε , which remains bounded away from 0 as long as $\varepsilon_n = \Theta(1/\sqrt{nh^d})$ and $h_n = \omega(n^{-1/d})$. [sent-307, score-0.672]

42 Figure 3: Top plots and left bottom plot: two densities ph corresponding to the mixture distribution of Figure 1 for h = 0, the true density (in black) and h = 4. [sent-310, score-0.702]

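43 Then, for all h ≤ h(δ, ε), $A_{h,\varepsilon} \le 2\left(1 - \Phi\left(-\sqrt{nh^d}\,\varepsilon\sqrt{\frac{2v_d}{3\lambda}}\right) + \frac{C(\delta,\lambda)}{\sqrt{nh^d}}\right)^2$ and $A_{h,\varepsilon} \ge 2\left(1 - \Phi\left(\sqrt{nh^d}\,\varepsilon\sqrt{\frac{2v_d}{\delta\lambda}}\right) - \frac{C(\delta,\lambda)}{\sqrt{nh^d}}\right)^2$, where Φ denotes the cumulative distribution function of a standard normal random variable and $C(\delta,\lambda) = \frac{33}{4}\sqrt{\frac{2}{\delta v_d \lambda}}$. [sent-318, score-1.896]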

44 δvd λ S TABILITY OF D ENSITY-BASED C LUSTERING The dips in Figure 2 correspond to values for which ph does not have a mode at height λ. [sent-319, score-0.537]

45 Next we investigate the extent to which Ξλ,n (h) is concentrated around its mean ξλ,n (h) = E[Ξλ,n (h)]. [sent-322, score-0.583]

46 The previous results highlight the interesting feature that the empirical instability will be less variable around the values of h for which the expected instability is very small (close to 0) or very large (close to 1/2). [sent-325, score-0.486]

47 To this end, for a fixed h > 0, define the level set of ph Lh (λ) = {u : ph (u) > λ} and recall its estimator based on the kernel density estimator ph,X : Lh,X (λ) = {u : ph,X (u) > λ}. [sent-353, score-1.293]

48 If the level λ is (h, ε)-stable, then the cluster tree estimate at level λ is an accurate estimate of the true cluster tree, in a sense made precise by the following result, whose proof follows easily from the proofs of our previous results and Lemma 2 in Rinaldo and Wasserman (2010). [sent-373, score-0.403]

49 Conversely, if ph is smooth (which is the case if, for instance, the kernel or p are smooth) and infu∈Uλ,h,ε ∇ph (u) > δ, then λ is (h, ε)-stable for a small enough ε. [sent-386, score-0.554]

50 The above result has a somewhat limited practical value, because the notion of a (h, ε)-stable λ depends on the unknown density ph . [sent-387, score-0.645]

51 For a fixed bandwidth h, we define the instability of the density cluster tree as the random function Th,n : R≥0 → [0, 1] given by λ → PZ (Lh,X (λ)∆Lh,Y (λ)) and denote its expectation by τh,n (λ) = EX,Y,Z [Th,n (λ)]. [sent-394, score-0.608]
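
A sample version of the quantity in sentence 51 can be sketched as follows: estimate the density separately on two subsamples X and Y, then report the fraction of points in a third subsample Z on which the two estimated level sets disagree. Reading P_Z as the empirical measure of a held-out subsample is an assumption of this sketch, as are the kernel and the function names.

```python
def epanechnikov(z):
    """Compactly supported kernel K(z) = 0.75 * (1 - z^2) on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0


def kde(u, data, h):
    """One-dimensional kernel density estimate at point u with bandwidth h."""
    return sum(epanechnikov((u - x) / h) for x in data) / (len(data) * h)


def level_set_instability(x, y, z, h, lam):
    """Fraction of points of z lying in exactly one of the two estimated level sets."""
    disagreements = sum(
        (kde(u, x, h) > lam) != (kde(u, y, h) > lam) for u in z
    )
    return disagreements / len(z)
```

Evaluating this function over a range of λ for fixed h traces out the instability of the tree, while fixing λ and varying h gives a level set instability curve as a function of the bandwidth.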

52 For any λ > 0, the expected cluster tree instability can be expressed as $\tau_{h,n}(\lambda) = 2\int \pi_\lambda(u)(1 - \pi_\lambda(u))\,dP(u)$. [sent-400, score-0.402]
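
The identity in sentence 52 follows from a short calculation, sketched here under assumptions that are not spelled out in this excerpt: X and Y are independent samples of equal size from P, Z is independent of both, and $\pi_\lambda(u) = P_X(\hat{p}_{h,X}(u) > \lambda)$ denotes the probability that u falls in the estimated level set.

```latex
\begin{align*}
\tau_{h,n}(\lambda)
  &= E_{X,Y,Z}\big[P_Z\big(L_{h,X}(\lambda)\,\Delta\,L_{h,Y}(\lambda)\big)\big] \\
  &= \int P_{X,Y}\big(\mathbf{1}\{\hat{p}_{h,X}(u) > \lambda\}
       \neq \mathbf{1}\{\hat{p}_{h,Y}(u) > \lambda\}\big)\,dP(u) \\
  &= \int \big[\pi_\lambda(u)(1 - \pi_\lambda(u))
       + (1 - \pi_\lambda(u))\,\pi_\lambda(u)\big]\,dP(u) \\
  &= 2\int \pi_\lambda(u)\,(1 - \pi_\lambda(u))\,dP(u).
\end{align*}
```

The third line uses independence of X and Y: a point u lies in the symmetric difference exactly when it belongs to one estimated level set and not the other, and each of the two orderings has probability $\pi_\lambda(u)(1 - \pi_\lambda(u))$.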

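53 Then, $A_{\lambda,\varepsilon} \le 2\left(1 - \Phi\left(-\sqrt{nh^d}\,\varepsilon\sqrt{\frac{2v_d}{3\lambda}}\right) + \frac{C(\delta,\lambda)}{\sqrt{nh^d}}\right)^2$ and $A_{\lambda,\varepsilon} \ge 2\left(1 - \Phi\left(\sqrt{nh^d}\,\varepsilon\sqrt{\frac{2v_d}{\delta\lambda}}\right) - \frac{C(\delta,\lambda)}{\sqrt{nh^d}}\right)^2$, where Φ denotes the cumulative distribution function of a standard normal random variable and $C(\delta,\lambda) = \frac{33}{4}\sqrt{\frac{2}{\delta v_d \lambda}}$. [sent-406, score-1.896]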

54 Collectively, the results above show that the cluster tree of ph can be estimated more accurately for values of λ for which the quantity rλ,ε remains small, with ε a term vanishing in n. [sent-410, score-0.675]

55 In particular, the level sets λ with larger instability are then the ones that are close to a critical level of ph or for which the gradient of ph is not defined, vanishes or has infinite norm for some points in {x : ph (x) = λ}. [sent-411, score-1.923]

56 More precisely, we consider the stronger notion of instability corresponding to the total variation stability as defined in (7). [sent-419, score-0.371]

57 .) Suppose that K is the spherical kernel and that the probability distribution P satisfies the conditions $a_1 h^d v_d \le \inf_{u \in S} P(B(u,h)) \le \sup_{u \in S} P(B(u,h)) \le a_2 h^d v_d$, $\forall h > 0$, (10) for some positive constants a1 < a2 , where S denotes the support of P. [sent-438, score-0.609]

58 There exists a t, depending on δ but not on h, such that, for all h < h∗ and for n ≥ n0 ≡ n0 (a, a1 , a2 , h, δ), $P_{X,Y}\left(\Gamma_n(h) \ge t\sqrt{\frac{1}{nh^d}}\right) > 1 - \delta$. [sent-440, score-0.474]

59 50), the instability quickly drops as h increases and then oscillates as h approaches values that correspond to density estimates with uncertainty at those levels. [sent-554, score-0.427]

60 Note that around h = 3, we have very low instability values for almost all values of α, and hence this value of kernel bandwidth would be a good choice that yields stable clustering. [sent-577, score-0.358]

61 Increasing the number of bins improves the approximation to the kernel density estimate; the use of two hundred bins was found to give almost identical results to the original kernel density estimate (results not shown). [sent-593, score-0.347]

62 The distribution of the instability measures for each value of h is also plotted using density strips (see Jackson, 2008); on the grey-scale, darker colors indicate more common instability values. [sent-604, score-0.615]

63 Discussion We have investigated the properties of the density level set and cluster tree estimator based on kernel density estimates, and we have proposed and analyzed various measures of instability for these quantities. [sent-643, score-0.778]

64 Also observe that Assumption (A1) implies that, for any h > 0, the sup-norm density approximation error can be bounded as $\|p_h - p\|_\infty = \sup_x \left|\int \frac{1}{h^d} K\left(\frac{x-y}{h}\right) p(y)\,dy - p(x)\right| \le \sup_x \int \frac{1}{h^d} K\left(\frac{x-y}{h}\right) A\|x - y\|\,dy = ADh$. [sent-675, score-0.819]

65 Proof of Lemma 4: Using (A1) and the fact that $\int_{\mathbb{R}^d} K(z)\,dz = 1$, Equation (11) states that for any h > 0, $\|p_h - p\|_\infty \le ADh$. [sent-681, score-0.516]

66 Then, for any α ∈ (0, 1) and h > 0, {u : p(u) > λh,α + ADh} ⊆ {u : ph (u) > λh,α } ⊆ {u : p(u) > λh,α − ADh}. [sent-682, score-0.516]

67 And as a result, P({u : p(u) > λh,α + ADh}) ≤ P({u : ph (u) > λh,α }) ≤ P({u : p(u) > λh,α − ADh}). [sent-683, score-0.516]

68 Since P({u : p(u) > λα }) = α = P({u : ph (u) > λh,α }), we have P({u : p(u) > λh,α + ADh}) ≤ P({u : p(u) > λα }) ≤ P({u : p(u) > λh,α − ADh}). [sent-684, score-0.516]

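69 Proof of Lemma 5: Let $\mathcal{C}_h = \{\{u : p_h(u) > \lambda\}, \lambda > 0\}$ denote the class of level sets of ph and define the events $P_{h,\varepsilon} = \{\sup_{C \in \mathcal{C}_h} |P_X(C) - P(C)| \le \varepsilon\}$ and $A_{h,\varepsilon} = \{\|\hat{p}_{h,X} - p_h\|_\infty \le \varepsilon\}$. [sent-687, score-1.648]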

70 Then, on Ah,ε , we obtain {u : ph (u) > λ + ε} ⊆ {u : ph,X (u) > λ} ⊆ {u : ph (u) > λ − ε}, ∀λ > 0. [sent-691, score-1.032]

71 Thus, on Ah,ε , PX ({u : ph (u) > λ + ε}) ≤ PX ({u : ph,X (u) > λ}) ≤ PX ({u : ph (u) > λ − ε}), uniformly over all λ > 0. [sent-692, score-1.032]

72 Recalling that, by definition, $|P_X(\{u : \hat{p}_{h,X}(u) > \lambda_{h,\alpha,X}\}) - \alpha| \le 1/n$, we obtain, on the events Ph,ε and Ah,ε , $P(\{u : p_h(u) > \lambda_{h,\alpha,X} + \varepsilon\}) - \frac{1}{n} - \varepsilon \le \alpha \le P(\{u : p_h(u) > \lambda_{h,\alpha,X} - \varepsilon\}) + \frac{1}{n} + \varepsilon$. (15) [sent-694, score-1.032]

73 We will now show that, for level sets of ph indexed by α satisfying (B3), and for any η ∈ (−η0 , η0 ) and 0 < h ≤ H, $|\lambda_{h,\alpha+\eta} - \lambda_{h,\alpha}| \le A\kappa_3|\eta|$. (16) [sent-697, score-0.611]

74 Recalling that ε + 1/n < η0 , Equations (15) and (16) will then imply $\lambda_{h,\alpha} - A\kappa_3\left(\varepsilon + \frac{1}{n}\right) - \varepsilon \le \lambda_{h,\alpha,X} \le \lambda_{h,\alpha} + A\kappa_3\left(\varepsilon + \frac{1}{n}\right) + \varepsilon$, on the events Ph,ε and Ah,ε , for level sets of ph indexed by α satisfying (B3) and with 0 < h ≤ H. [sent-698, score-0.611]

75 Then, notice that, because ph is Lipschitz and hence continuous, for every x ∈ ∂Mh (α), ph (x) = λh,α and, for every y ∈ ∂Mh (α + η), ph (y) = λh,α+η . [sent-701, score-1.56]

76 Thus, for |η| < η0 , $\|x - y\| \le d_\infty(M_h(\alpha), M_h(\alpha + \eta)) \le \kappa_3|\eta|$, where the last inequality follows for level sets of ph indexed by α that satisfy (B3) and 0 < h ≤ H. [sent-703, score-0.611]

77 Therefore, $|\lambda_{h,\alpha+\eta} - \lambda_{h,\alpha}| = |p_h(y) - p_h(x)| \le A\|x - y\| \le A\kappa_3|\eta|$, where in the first inequality we used the fact that, by (A1), ph is Lipschitz with constant A. [sent-704, score-1.032]

78 Indeed, for any x ≠ y, using the Lipschitz assumption (A1) on p, $|p_h(x) - p_h(y)| \le \int_{\mathbb{R}^d} |p(x + zh) - p(y + zh)|\,K(z)\,dz \le A\|x - y\| \int_{\mathbb{R}^d} K(z)\,dz = A\|x - y\|$. [sent-705, score-0.516]

79 Therefore, $E_Z[\Xi_{\lambda,n}(h) \mid X, Y] \le 2 p_{\max} v_d\, n h^d$ and hence it follows that $\xi_{\lambda,n}(h) = E_{X,Y,Z}[\Xi_{\lambda,n}(h)] \le 2 p_{\max} v_d\, n h^d = O(h^d)$, as h → 0. [sent-735, score-1.222]

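80 Let Ah,ε denote the event $\|p_h - \hat{p}_{h,X}\|_\infty \vee \|p_h - \hat{p}_{h,Y}\|_\infty \le \varepsilon$. By (1), $P_{X,Y}(A^c_{h,\varepsilon}) \le 2K_1 e^{-K_2 n h^d \varepsilon^2}$. [sent-742, score-1.166]

81 Thus, the previous expression for ξλ,n (h) is upper bounded by $2\int_{U_{h,\varepsilon}} P_{X,Y}(\{\hat{p}_{h,X}(u) > \lambda, \hat{p}_{h,Y}(u) \le \lambda\} \cap A_{h,\varepsilon})\,dP(u) + 2K_1 e^{-K_2 n h^d \varepsilon^2}$ which, using independence, is no larger than $2\int_{U_{h,\varepsilon}} \pi_h(u)(1 - \pi_h(u))\,dP(u) + 2K_1 e^{-K_2 n h^d \varepsilon^2} \le P(U_{h,\varepsilon})\,A_{h,\varepsilon} + 2K_1 e^{-K_2 n h^d \varepsilon^2}$. [sent-745, score-0.327]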

82 Let σ2 (u, h) = Var(Bi (u)) and µ3 (u, h) = E|Bi (u) − µ(u, h)|3 where µ(u, h) = E(Bi (u)) = ph (u). [sent-749, score-0.516]

83 Then, $\sigma^2(u,h) = \frac{p_{u,h}(1 - p_{u,h})}{(h^d v_d)^2}$ (21) and $\mu_3(u,h) = \frac{p_{u,h}(1 - p_{u,h})\left[(1 - p_{u,h})^2 + p_{u,h}^2\right]}{(h^d v_d)^3} \le \frac{p_{u,h}(1 - p_{u,h})}{(h^d v_d)^3}$, where the last inequality holds since $(1 - p_{u,h})^2 + p_{u,h}^2 \le 1$, for all u and h. [sent-751, score-0.411]

84 Thus, $\frac{a_1}{h^d} \le \sigma^2(u,h) \le \frac{a_2}{h^d}$, where $a_1 = \frac{\delta\lambda}{2v_d}$ and $a_2 = \frac{3\lambda}{2v_d}$, (22) uniformly over u ∈ Uh,ε . [sent-758, score-0.486]

85 78), we obtain $\sup_t \left|P\left(\frac{\sqrt{nh^d}\,(\hat{p}_{h,X}(u) - p_h(u))}{a(u,h)} \le t\right) - \Phi(t)\right| \le \frac{33}{4}\,\frac{\mu_3(u,h)}{\sigma^3(u,h)\sqrt{n}} \le \frac{C(\delta,\lambda)}{\sqrt{nh^d}}$, where Φ is the cumulative distribution function of the standard Normal distribution. [sent-760, score-1.498]

86 Now, $\pi_h(u) = P_X(\hat{p}_{h,X}(u) > \lambda) = P_X\left(\frac{\sqrt{nh^d}\,(\hat{p}_{h,X}(u) - p_h(u))}{a(u,h)} > \frac{\sqrt{nh^d}\,(\lambda - p_h(u))}{a(u,h)}\right)$. [sent-761, score-1.98]

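87 Hence, $1 - \Phi\left(\frac{\sqrt{nh^d}\,(\lambda - p_h(u))}{a(u,h)}\right) - \frac{C(\delta,\lambda)}{\sqrt{nh^d}} \le \pi_h(u) \le 1 - \Phi\left(\frac{\sqrt{nh^d}\,(\lambda - p_h(u))}{a(u,h)}\right) + \frac{C(\delta,\lambda)}{\sqrt{nh^d}}$. [sent-762, score-2.454]

88 Using the fact that u ∈ Uh,ε , and taking advantage of the uniform bounds a1 ≤ a(u, h) ≤ a2 , the previous inequalities imply $1 - \Phi\left(\frac{\sqrt{nh^d}\,\varepsilon}{a_1}\right) - \frac{C(\delta,\lambda)}{\sqrt{nh^d}} \le \pi_h(u) \le 1 - \Phi\left(-\frac{\sqrt{nh^d}\,\varepsilon}{a_2}\right) + \frac{C(\delta,\lambda)}{\sqrt{nh^d}}$. [sent-763, score-1.422]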

89 Let Ah,ε denote the event max || ph,X − ph ||∞ , |λh,α − λh,α,X |, || ph,Y − ph ||∞ , |λh,α − λh,α,Y | ≤ ε, (29) where ε = ε(Aκ3 + 1) + Aκ3 /n. [sent-789, score-1.057]

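90 Because of this and the fact that, on Ah,ε , |ph (u) − λh,α,X | ≤ 3ε for all u ∈ Uh,2ε,α , the same Berry-Esseen arguments used in the proof of Lemma 11 yield $1 - \Phi\left(\frac{3\varepsilon\sqrt{nh^d}}{a_1}\right) - \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}} \le \pi_{h,\alpha,\varepsilon}(u) \le 1 - \Phi\left(-\frac{3\varepsilon\sqrt{nh^d}}{a_2}\right) + \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}}$, [sent-801, score-0.964]

91 where $\pi_{h,\alpha,\varepsilon}(u) = P_X(\{\hat{p}_{h,X}(u) > \lambda_{h,\alpha,X}\} \cap A_{h,\varepsilon})$, $a_1 = \delta\lambda_{h,\alpha}/(2v_d)$, $a_2 = 3\lambda_{h,\alpha}/(2v_d)$, and $C(\delta,\lambda_{h,\alpha}) = \frac{33}{4}\sqrt{\frac{2}{\delta v_d \lambda_{h,\alpha}}}$. [sent-802, score-0.583]

92 Now notice that $\pi_{h,\alpha}(u) \ge \pi_{h,\alpha,\varepsilon}(u) \ge 1 - \Phi\left(\frac{3\varepsilon\sqrt{nh^d}}{a_1}\right) - \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}}$ and $\pi_{h,\alpha}(u) \le \pi_{h,\alpha,\varepsilon}(u) + P(A^c_{h,\varepsilon}) \le 1 - \Phi\left(-\frac{3\varepsilon\sqrt{nh^d}}{a_2}\right) + \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}} + C(h,\varepsilon,n)$, where C(h, ε, n) is defined in (30). [sent-803, score-0.486]

93 Therefore, $A_{h,\varepsilon,\alpha} \le 2\left(1 - \Phi\left(-\frac{3\varepsilon\sqrt{nh^d}}{a_2}\right) + \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}} + C(h,\varepsilon,n)\right)^2$ [sent-804, score-1.896]

94 and $A_{h,\varepsilon,\alpha} \ge 2\left(1 - \Phi\left(\frac{3\varepsilon\sqrt{nh^d}}{a_1}\right) - \frac{C(\delta,\lambda_{h,\alpha})}{\sqrt{nh^d}} - C(h,\varepsilon,n)\right)^2$. [sent-805, score-1.422]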

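95 Therefore, we obtain the inequality $\Gamma_n(h) \le \frac{\mu(S)}{2}\|\hat{p}_{h,X} - \hat{p}_{h,Y}\|_\infty \le \frac{\mu(S)}{2}\|\hat{p}_{h,X} - p_h\|_\infty + \frac{\mu(S)}{2}\|\hat{p}_{h,Y} - p_h\|_\infty \stackrel{d}{=} \mu(S)\|\hat{p}_{h,X} - p_h\|_\infty$. [sent-808, score-1.548]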

96 The variance of D(u) is $\mathrm{Var}\left(\sqrt{nh^d}\,(\hat{p}_{h,X}(u) - \hat{p}_{h,Y}(u))\right) = nh^d\left(\mathrm{Var}(\hat{p}_{h,X}(u)) + \mathrm{Var}(\hat{p}_{h,Y}(u))\right) = 2nh^d\,\mathrm{Var}(\hat{p}_{h,X}(u)) = 2nh^d\,\mathrm{Var}\left(\frac{1}{nh^d v_d}\sum_{i=1}^n I(\|X_i - u\| \le h)\right) = \frac{2n^2h^d}{n^2h^{2d}v_d^2}\,\mathrm{Var}(I(\|X_i - u\| \le h)) = \frac{2}{h^d v_d^2}\,P(B(u,h))(1 - P(B(u,h)))$. [sent-814, score-1.665]

97 Now, for u ∈ S, by (10), $P(B(u,h))(1 - P(B(u,h))) \le P(B(u,h)) \le a_2 h^d v_d$ and $P(B(u,h))(1 - P(B(u,h))) \ge P(B(u,h))\,\delta \ge a_1 h^d v_d\,\delta$. [sent-815, score-0.729]

98 Now, for any u, $D(u) = D_1(u) - D_2(u) \equiv \sqrt{nh^d}\,(P_n - P)(f_u) - \sqrt{nh^d}\,(Q_n - P)(f_u)$ where Pn is the empirical measure based on X1 , . [sent-818, score-0.974]

99 We can regard $\{\sqrt{nh^d}\,(P_n - P)(f) : f \in \mathcal{F}\}$ as an empirical process, where $\mathcal{F} = \{f_u : u \in S\}$ and similarly for $\{\sqrt{nh^d}\,(Q_n - P)(f) : f \in \mathcal{F}\}$. [sent-826, score-0.961]

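100 Hence $P_{X,Y}\left(\Gamma_n(h) \ge t\sqrt{\frac{1}{nh^d}}\right) \ge P_{X,Y}\left(\Gamma_{n,S}(h) \ge t\sqrt{\frac{1}{nh^d}}\right) = P_{X,Y}\left(\sqrt{nh^d}\,\Gamma_{n,S}(h) \ge t\right) = P_{X,Y}\left(\frac{1}{2}\int_S |D(u)|\,du \ge t\right) = P\left(\frac{1}{2}\int_S |G(u)|\,du \ge t\right) + o(1)$, where the last probability is the law of the Gaussian process G. [sent-831, score-1.439]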


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ph', 0.516), ('nhd', 0.474), ('ah', 0.385), ('instability', 0.243), ('vd', 0.137), ('inaldo', 0.13), ('ingh', 0.13), ('ugent', 0.13), ('uh', 0.13), ('density', 0.129), ('tability', 0.124), ('nh', 0.109), ('wasserman', 0.109), ('hd', 0.106), ('px', 0.105), ('hn', 0.099), ('stability', 0.099), ('lustering', 0.096), ('cluster', 0.086), ('pz', 0.083), ('phn', 0.077), ('bandwidth', 0.077), ('tree', 0.073), ('ahn', 0.071), ('level', 0.066), ('adhn', 0.065), ('rh', 0.063), ('rinaldo', 0.059), ('adh', 0.053), ('dp', 0.045), ('rhn', 0.041), ('mh', 0.041), ('clusters', 0.038), ('kernel', 0.038), ('content', 0.036), ('sup', 0.034), ('rd', 0.034), ('lh', 0.03), ('epanechnikov', 0.03), ('nugent', 0.03), ('variation', 0.029), ('indexed', 0.029), ('bottom', 0.028), ('ez', 0.027), ('du', 0.027), ('dz', 0.026), ('kh', 0.025), ('event', 0.025), ('clustering', 0.025), ('moons', 0.024), ('bands', 0.023), ('drops', 0.023), ('vc', 0.022), ('var', 0.022), ('ui', 0.022), ('spherical', 0.021), ('height', 0.021), ('bn', 0.021), ('chaudhuri', 0.02), ('heat', 0.02), ('cuevas', 0.02), ('zi', 0.02), ('connected', 0.019), ('stuetzle', 0.018), ('behavior', 0.018), ('aarti', 0.018), ('ntc', 0.018), ('oscillates', 0.018), ('rigollet', 0.018), ('top', 0.017), ('probability', 0.017), ('lemma', 0.016), ('singh', 0.016), ('mixture', 0.016), ('red', 0.016), ('th', 0.015), ('ch', 0.015), ('rodr', 0.015), ('split', 0.015), ('regularity', 0.015), ('estimator', 0.014), ('estimates', 0.014), ('von', 0.014), ('pmax', 0.014), ('supu', 0.014), ('ib', 0.014), ('bivariate', 0.013), ('tsybakov', 0.013), ('inf', 0.013), ('left', 0.013), ('sample', 0.013), ('fu', 0.013), ('fixed', 0.013), ('lebesgue', 0.013), ('estimate', 0.013), ('luxburg', 0.013), ('cmu', 0.012), ('notice', 0.012), ('bound', 0.012), ('cadre', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 109 jmlr-2012-Stability of Density-Based Clustering


2 0.06955424 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

Author: Steve Hanneke

Abstract: We study the theoretical advantages of active learning over passive learning. Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions. We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient. We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over the known results for passive learning. Keywords: active learning, selective sampling, sequential design, statistical learning theory, PAC learning, sample complexity 1. Introduction and Background The recent rapid growth in data sources has spawned an equally rapid expansion in the number of potential applications of machine learning methodologies to extract useful concepts from these data. However, in many cases, the bottleneck in the application process is the need to obtain accurate annotation of the raw data according to the target concept to be learned. For instance, in webpage classification, it is straightforward to rapidly collect a large number of webpages, but training an accurate classifier typically requires a human expert to examine and label a number of these webpages, which may require significant time and effort. For this reason, it is natural to look for ways to reduce the total number of labeled examples required to train an accurate classifier. In the traditional machine learning protocol, here referred to as passive learning, the examples labeled by the expert are sampled independently at random, and the emphasis is on designing learning algorithms that make the most effective use of the number of these labeled examples available. 
However, it is possible to go beyond such methods by altering the protocol itself, allowing the learning algorithm to sequentially select the examples to be labeled, based on its observations of the labels of previously-selected examples; this interactive protocol is referred to as active learning. The objective in designing this selection mechanism is to focus the expert’s efforts toward labeling only the most informative data for the learning process, thus eliminating some degree of redundancy in the information content of the labeled examples. ∗. Some of these (and related) results previously appeared in the author’s doctoral dissertation (Hanneke, 2009b). c 2012 Steve Hanneke. H ANNEKE It is now well-established that active learning can sometimes provide significant practical and theoretical advantages over passive learning, in terms of the number of labels required to obtain a given accuracy. However, our current understanding of active learning in general is still quite limited in several respects. First, since we are lacking a complete understanding of the potential capabilities of active learning, we are not yet sure to what standards we should aspire for active learning algorithms to meet, and in particular this challenges our ability to characterize how a “good” active learning algorithm should behave. Second, since we have yet to identify a complete set of general principles for the design of effective active learning algorithms, in many cases the most effective known active learning algorithms have problem-specific designs (e.g., designed specifically for linear separators, or decision trees, etc., under specific assumptions on the data distribution), and it is not clear what components of their design can be abstracted and transferred to the design of active learning algorithms for different learning problems (e.g., with different types of classifiers, or different data distributions). 
Finally, we have yet to fully understand the scope of the relative benefits of active learning over passive learning, and in particular the conditions under which such improvements are achievable, as well as a general characterization of the potential magnitudes of these improvements. In the present work, we take steps toward closing this gap in our understanding of the capabilities, general principles, and advantages of active learning. Additionally, this work has a second theme, motivated by practical concerns. To date, the machine learning community has invested decades of research into constructing solid, reliable, and well-behaved passive learning algorithms, and into understanding their theoretical properties. We might hope that an equivalent amount of effort is not required in order to discover and understand effective active learning algorithms. In particular, rather than starting from scratch in the design and analysis of active learning algorithms, it seems desirable to leverage this vast knowledge of passive learning, to whatever extent possible. For instance, it may be possible to design active learning algorithms that inherit certain desirable behaviors or properties of a given passive learning algorithm. In this way, we can use a given passive learning algorithm as a reference point, and the objective is to design an active learning algorithm with performance guarantees strictly superior to those of the passive algorithm. Thus, if the passive learning algorithm has proven effective in a variety of common learning problems, then the active learning algorithm should be even better for those same learning problems. This approach also has the advantage of immediately supplying us with a collection of theoretical guarantees on the performance of the active learning algorithm: namely, improved forms of all known guarantees on the performance of the given passive learning algorithm. 
Due to its obvious practical advantages, this general line of informal thinking dominates the existing literature on empirically-tested heuristic approaches to active learning, as most of the published heuristic active learning algorithms make use of a passive learning algorithm as a subroutine (e.g., SVM, logistic regression, k-NN, etc.), constructing sets of labeled examples and feeding them into the passive learning algorithm at various times during the execution of the active learning algorithm (see the references in Section 7). Below, we take a more rigorous look at this general strategy. We develop a reduction-style framework for studying this approach to the design of active learning algorithms relative to a given passive learning algorithm. We then proceed to develop and analyze a variety of such methods, to realize this approach in a very general sense. Specifically, we explore the following fundamental questions. 1470 ACTIVIZED L EARNING • Is there a general procedure that, given any passive learning algorithm, transforms it into an active learning algorithm requiring significantly fewer labels to achieve a given accuracy? • If so, how large is the reduction in the number of labels required by the resulting active learning algorithm, compared to the number of labels required by the original passive algorithm? • What are sufficient conditions for an exponential reduction in the number of labels required? • To what extent can these methods be made robust to imperfect or noisy labels? In the process of exploring these questions, we find that for many interesting learning problems, the techniques in the existing literature are not capable of realizing the full potential of active learning. Thus, exploring this topic in generality requires us to develop novel insights and entirely new techniques for the design of active learning algorithms. We also develop corresponding natural complexity quantities to characterize the performance of such algorithms. 
Several of the results we establish here are more general than any related results in the existing literature, and in many cases the algorithms we develop use significantly fewer labels than any previously published methods. 1.1 Background The term active learning refers to a family of supervised learning protocols, characterized by the ability of the learning algorithm to pose queries to a teacher, who has access to the target concept to be learned. In practice, the teacher and queries may take a variety of forms: a human expert, in which case the queries may be questions or annotation tasks; nature, in which case the queries may be scientific experiments; a computer simulation, in which case the queries may be particular parameter values or initial conditions for the simulator; or a host of other possibilities. In our present context, we will specifically discuss a protocol known as pool-based active learning, a type of sequential design based on a collection of unlabeled examples; this seems to be the most common form of active learning in practical use today (e.g., Settles, 2010; Baldridge and Palmer, 2009; Gangadharaiah, Brown, and Carbonell, 2009; Hoi, Jin, Zhu, and Lyu, 2006; Luo, Kramer, Goldgof, Hall, Samson, Remsen, and Hopkins, 2005; Roy and McCallum, 2001; Tong and Koller, 2001; McCallum and Nigam, 1998). We will not discuss alternative models of active learning, such as online (Dekel, Gentile, and Sridharan, 2010) or exact (Hegedüs, 1995). In the pool-based active learning setting, the learning algorithm is supplied with a large collection of unlabeled examples (the pool), and is allowed to select any example from the pool to request that it be labeled. After observing the label of this example, the algorithm can then select another unlabeled example from the pool to request that it be labeled. 
This continues sequentially for a number of rounds until some halting condition is satisfied, at which time the algorithm returns a function intended to approximately mimic and generalize the observed labeling behavior. This setting contrasts with passive learning, where the learning algorithm is supplied with a collection of labeled examples without any interaction. Supposing the labels received agree with some true target concept, the objective is to use this returned function to approximate the true target concept on future (previously unobserved) data points. The hope is that, by carefully selecting which examples should be labeled, the algorithm can achieve improved accuracy while using fewer labels compared to passive learning.

The motivation for this setting is simple. For many modern machine learning problems, unlabeled examples are inexpensive and available in abundance, while annotation is time-consuming or expensive. For instance, this is the case in the aforementioned webpage classification problem, where the pool would be the set of all webpages, and labeling a webpage requires a human expert to examine the website content. Settles (2010) surveys a variety of other applications for which active learning is presently being used.

To simplify the discussion, in this work we focus specifically on binary classification, in which there are only two possible labels. The results generalize naturally to multiclass classification as well. As the above description indicates, when studying the advantages of active learning, we are primarily interested in the number of label requests sufficient to achieve a given accuracy, a quantity referred to as the label complexity (Definition 1 below).

Although active learning has been an active topic in the machine learning literature for many years now, our theoretical understanding of this topic was largely lacking until very recently.
However, within the past few years, there has been an explosion of progress. These advances can be grouped into two categories: namely, the realizable case and the agnostic case.

1.1.1 The Realizable Case

In the realizable case, we are interested in a particularly strict scenario, where the true label of any example is determined by a function of the features (covariates), and where that function has a specific known form (e.g., linear separator, decision tree, union of intervals, etc.); the set of classifiers having this known form is referred to as the concept space. The natural formalization of the realizable case is very much analogous to the well-known PAC model for passive learning (Valiant, 1984). In the realizable case, there are obvious examples of learning problems where active learning can provide a significant advantage compared to passive learning; for instance, in the problem of learning threshold classifiers on the real line (Example 1 below), a kind of binary search strategy for selecting which examples to request labels for naturally leads to exponential improvements in label complexity compared to learning from random labeled examples (passive learning). As such, there is a natural attraction to determine how general this phenomenon is. This leads us to think about general-purpose learning strategies (i.e., which can be instantiated for more than merely threshold classifiers on the real line), which exhibit this binary search behavior in various special cases.

The first such general-purpose strategy to emerge in the literature was a particularly elegant strategy proposed by Cohn, Atlas, and Ladner (1994), typically referred to as CAL after its discoverers (Meta-Algorithm 2 below). The strategy behind CAL is the following.
The algorithm examines each example in the unlabeled pool in sequence, and if there are two classifiers in the concept space consistent with all previously-observed labels, but which disagree on the label of this next example, then the algorithm requests that label, and otherwise it does not. For this reason, below we refer to the general family of algorithms inspired by CAL as disagreement-based methods. Disagreement-based methods are sometimes referred to as “mellow” active learning, since in some sense this is the least we can expect from a reasonable active learning algorithm; it never requests the label of an example whose label it can infer from information already available, but otherwise makes no attempt to seek out particularly informative examples to request the labels of. That is, the notion of informativeness implicit in disagreement-based methods is a binary one, so that an example is either informative or not informative, but there is no further ranking of the informativeness of examples. The disagreement-based strategy is quite general, and obviously leads to algorithms that are at least reasonable, but Cohn, Atlas, and Ladner (1994) did not study the label complexity achieved by their strategy in any generality.

In a Bayesian variant of the realizable setting, Freund, Seung, Shamir, and Tishby (1997) studied an algorithm known as query by committee (QBC), which in some sense represents a Bayesian variant of CAL. However, QBC does distinguish between different levels of informativeness beyond simple disagreement, based on the amount of disagreement on a random unlabeled example. They were able to analyze the label complexity achieved by QBC in terms of a type of information gain, and found that when the information gain is lower bounded by a positive constant, the algorithm achieves a label complexity exponentially smaller than the known results for passive learning.
In particular, this is the case for the threshold learning problem, and also for the problem of learning higher-dimensional (nearly balanced) linear separators when the data satisfy a certain (uniform) distribution. Below, we will not discuss this analysis further, since it is for a slightly different (Bayesian) setting. However, the results below in our present setting do have interesting implications for the Bayesian setting as well, as discussed in the recent work of Yang, Hanneke, and Carbonell (2011). The first general analysis of the label complexity of active learning in the (non-Bayesian) realizable case came in the breakthrough work of Dasgupta (2005). In that work, Dasgupta proposed a quantity, called the splitting index, to characterize the label complexities achievable by active learning. The splitting index analysis is noteworthy for several reasons. First, one can show it provides nearly tight bounds on the minimax label complexity for a given concept space and data distribution. In particular, the analysis matches the exponential improvements known to be possible for threshold classifiers, as well as generalizations to higher-dimensional homogeneous linear separators under near-uniform distributions (as first established by Dasgupta, Kalai, and Monteleoni, 2005, 2009). Second, it provides a novel notion of informativeness of an example, beyond the simple binary notion of informativeness employed in disagreement-based methods. Specifically, it describes the informativeness of an example in terms of the number of pairs of well-separated classifiers for which at least one out of each pair will be contradicted, supposing the least-favorable label. Finally, unlike any other existing work on active learning (present work included), it provides an elegant description of the trade-off between the number of label requests and the number of unlabeled examples needed by the learning algorithm. 
Another interesting byproduct of Dasgupta’s work is a better understanding of the nature of the improvements achievable by active learning in the general case. In particular, his work clearly illustrates the need to study the label complexity as a quantity that varies depending on the particular target concept and data distribution. We will see this issue arise in many of the examples below.

Coming from a slightly different perspective, Hanneke (2007a) later analyzed the label complexity of active learning in terms of an extension of the teaching dimension (Goldman and Kearns, 1995). Related quantities were previously used by Hegedüs (1995) and Hellerstein, Pillaipakkamnatt, Raghavan, and Wilkins (1996) to tightly characterize the number of membership queries sufficient for Exact learning; Hanneke (2007a) provided a natural generalization to the PAC learning setting. At this time, it is not clear how this quantity relates to the splitting index. From a practical perspective, in some instances it may be easier to calculate (see the work of Nowak, 2008 for a discussion related to this), though in other cases the opposite seems true.

The next progress toward understanding the label complexity of active learning came in the work of Hanneke (2007b), who introduced a quantity called the disagreement coefficient (Definition 9 below), accompanied by a technique for analyzing disagreement-based active learning algorithms. In particular, implicit in that work, and made explicit in the later work of Hanneke (2011), was the first general characterization of the label complexities achieved by the original CAL strategy for active learning in the realizable case, stated in terms of the disagreement coefficient. The results of the present work are direct descendants of that 2007 paper, and we will discuss the disagreement coefficient, and results based on it, in substantial detail below.
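As a concrete illustration of the disagreement-based strategy behind CAL, the following Python sketch specializes it to threshold classifiers on the real line. It is a minimal sketch under stated assumptions, not the algorithm as analyzed in the works cited above: the function name `cal_thresholds`, the convention that a threshold t labels x as +1 exactly when x ≥ t, and the representation of the set of consistent thresholds by an interval (lo, hi] are all choices made here for illustration.

```python
import random


def cal_thresholds(pool, label_of):
    """CAL-style disagreement-based learning for thresholds (illustrative sketch).

    Assumed convention: classifier h_t(x) = +1 if x >= t else -1.
    The thresholds consistent with all labels seen so far form the
    interval (lo, hi], where lo is the largest point known to be -1
    and hi is the smallest point known to be +1.
    """
    lo, hi = float("-inf"), float("inf")
    requests = 0
    for x in pool:
        # Two consistent thresholds disagree on x exactly when lo < x < hi.
        if lo < x < hi:
            y = label_of(x)  # request the label
            requests += 1
            if y == +1:
                hi = x
            else:
                lo = x
        # Otherwise the label of x is already implied, so no request is made.
    return lo, hi, requests


if __name__ == "__main__":
    rng = random.Random(0)
    t_star = 0.37  # hypothetical target threshold
    pool = [rng.random() for _ in range(2000)]
    lo, hi, requests = cal_thresholds(pool, lambda x: +1 if x >= t_star else -1)
    print(f"requested {requests} of {len(pool)} labels; target lies in ({lo:.4f}, {hi:.4f}]")
```

On a pool drawn uniformly at random, most points fall outside the shrinking interval, so the number of requests is typically far smaller than the pool size. This is the binary-search-like behavior noted above for thresholds; the sketch makes no claim about other concept spaces.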
Disagreement-based active learners such as CAL are known to be sometimes suboptimal relative to the splitting index analysis, and therefore the disagreement coefficient analysis sometimes results in larger label complexity bounds than the splitting index analysis. However, in many cases the label complexity bounds based on the disagreement coefficient are surprisingly good considering the simplicity of the methods. Furthermore, as we will see below, the disagreement coefficient has the practical benefit of often being fairly straightforward to calculate for a variety of learning problems, particularly when there is a natural geometric interpretation of the classifiers and the data distribution is relatively smooth. As we discuss below, it can also be used to bound the label complexity of active learning in noisy settings. For these reasons (simplicity of algorithms, ease of calculation, and applicability beyond the realizable case), subsequent work on the label complexity of active learning has tended to favor the disagreement-based approach, making use of the disagreement coefficient to bound the label complexity (Dasgupta, Hsu, and Monteleoni, 2007; Friedman, 2009; Beygelzimer, Dasgupta, and Langford, 2009; Wang, 2009; Balcan, Hanneke, and Vaughan, 2010; Hanneke, 2011; Koltchinskii, 2010; Beygelzimer, Hsu, Langford, and Zhang, 2010; Mahalanabis, 2011; Wang, 2011). A significant part of the present paper focuses on extending and generalizing the disagreement coefficient analysis, while still maintaining the relative ease of calculation that makes the disagreement coefficient so useful. In addition to many positive results, Dasgupta (2005) also pointed out several negative results, even for very simple and natural learning problems. In particular, for many problems, the minimax label complexity of active learning will be no better than that of passive learning. 
In fact, Balcan, Hanneke, and Vaughan (2010) later showed that, for a certain type of active learning algorithm— namely, self-verifying algorithms, which themselves adaptively determine how many label requests they need to achieve a given accuracy—there are even particular target concepts and data distributions for which no active learning algorithm of that type can outperform passive learning. Since all of the above label complexity analyses (splitting index, teaching dimension, disagreement coefficient) apply to certain respective self-verifying learning algorithms, these negative results are also reflected in all of the existing general label complexity analyses. While at first these negative results may seem discouraging, Balcan, Hanneke, and Vaughan (2010) noted that if we do not require the algorithm to be self-verifying, instead simply measuring the number of label requests the algorithm needs to find a good classifier, rather than the number needed to both find a good classifier and verify that it is indeed good, then these negative results vanish. In fact, (shockingly) they were able to show that for any concept space with finite VC dimension, and any fixed data distribution, for any given passive learning algorithm there is an active learning algorithm with asymptotically superior label complexity for every nontrivial target concept! A positive result of this generality and strength is certainly an exciting advance in our understanding of the advantages of active learning. But perhaps equally exciting are the unresolved questions raised by that work, as there are potential opportunities to strengthen, generalize, simplify, and elaborate on this result. 
First, note that the above statement allows the active learning algorithm to be specialized to the particular distribution according to which the (unlabeled) data are sampled, and indeed the active learning method used by Balcan, Hanneke, and Vaughan (2010) in their proof has a rather strong direct dependence on the data distribution (which cannot be removed by simply replacing some calculations with data-dependent estimators). One interesting question is whether an alternative approach might avoid this direct distribution-dependence in the algorithm, so that the claim can be strengthened to say that the active algorithm is superior to the passive algorithm for all nontrivial target concepts and data distributions. This question is interesting both theoretically, in order to obtain the strongest possible theorem on the advantages of active learning, as well as practically, since direct access to the distribution from which the data are sampled is typically not available in practical learning scenarios. A second question left open by Balcan, Hanneke, and Vaughan (2010) regards the magnitude of the gap between the active and passive label complexities. Specifically, although they did find particularly nasty learning problems where the label complexity of active learning will be close to that of passive learning (though always better), they hypothesized that for most natural learning problems, the improvements over passive learning should typically be exponentially large (as is the case for threshold classifiers); they gave many examples to illustrate this point, but left open the problem of characterizing general sufficient conditions for these exponential improvements to be achievable, even when they are not achievable by self-verifying algorithms.
Another question left unresolved by Balcan, Hanneke, and Vaughan (2010) is whether this type of general improvement guarantee might be realized by a computationally efficient active learning algorithm. Finally, they left open the question of whether such general results might be further generalized to settings that involve noisy labels. The present work picks up where Balcan, Hanneke, and Vaughan (2010) left off in several respects, making progress on each of the above questions, in some cases completely resolving the question.

1.1.2 The Agnostic Case

In addition to the above advances in our understanding of active learning in the realizable case, there has also been wonderful progress in making these methods robust to imperfect teachers, feature space underspecification, and model misspecification. This general topic goes by the name agnostic active learning, from its roots in the agnostic PAC model (Kearns, Schapire, and Sellie, 1994). In contrast to the realizable case, in the agnostic case, there is not necessarily a perfect classifier of a known form, and indeed there may even be label noise so that there is no perfect classifier of any form. Rather, we have a given set of classifiers (e.g., linear separators, or depth-limited decision trees, etc.), and the objective is to identify a classifier whose accuracy is not much worse than the best classifier of that type. Agnostic learning is strictly more general, and often more difficult, than realizable learning; this is true for both passive learning and active learning. However, for a given agnostic learning problem, we might still hope that active learning can achieve a given accuracy using fewer labels than required for passive learning.

The general topic of agnostic active learning got its first taste of real progress from Balcan, Beygelzimer, and Langford (2006a, 2009) with the publication of the A² (agnostic active) algorithm.
This method is a noise-robust disagreement-based algorithm, which can be applied with essentially arbitrary types of classifiers under arbitrary noise distributions. It is interesting both for its effectiveness and (as with CAL) its elegance. The original work of Balcan, Beygelzimer, and Langford (2006a, 2009) showed that, in some special cases (thresholds, and homogeneous linear separators under a uniform distribution), the A² algorithm does achieve improved label complexities compared to the known results for passive learning.

Using a different type of general active learning strategy, Hanneke (2007a) found that the teaching dimension analysis (discussed above for the realizable case) can be extended beyond the realizable case, arriving at general bounds on the label complexity under arbitrary noise distributions. These bounds improve over the known results for passive learning in many cases. However, the algorithm requires direct access to a certain quantity that depends on the noise distribution (namely, the noise rate, defined in Section 6 below), which would not be available in many real-world learning problems.

Later, Hanneke (2007b) established a general characterization of the label complexities achieved by A², expressed in terms of the disagreement coefficient. The result holds for arbitrary types of classifiers (of finite VC dimension) and arbitrary noise distributions, and represents the natural generalization of the aforementioned realizable-case analysis of CAL. In many cases, this result shows improvements over the known results for passive learning. Furthermore, because of the simplicity of the disagreement coefficient, the bound can be calculated for a variety of natural learning problems.

Soon after this, Dasgupta, Hsu, and Monteleoni (2007) proposed a new active learning strategy, which is also effective in the agnostic setting. Like A², the new algorithm is a noise-robust disagreement-based method.
The work of Dasgupta, Hsu, and Monteleoni (2007) is significant for at least two reasons. First, they were able to establish a general label complexity bound for this method based on the disagreement coefficient. The bound is similar in form to the previous label complexity bound for A² by Hanneke (2007b), but improves the dependence of the bound on the disagreement coefficient. Second, the proposed method of Dasgupta, Hsu, and Monteleoni (2007) set a new standard for computational and aesthetic simplicity in agnostic active learning algorithms. This work has since been followed by related methods of Beygelzimer, Dasgupta, and Langford (2009) and Beygelzimer, Hsu, Langford, and Zhang (2010). In particular, Beygelzimer, Dasgupta, and Langford (2009) develop a method capable of learning under an essentially arbitrary loss function; they also show label complexity bounds similar to those of Dasgupta, Hsu, and Monteleoni (2007), but applicable to a larger class of loss functions, and stated in terms of a generalization of the disagreement coefficient for arbitrary loss functions.

While the above results are encouraging, the guarantees reflected in these label complexity bounds essentially take the form of (at best) constant factor improvements; specifically, in some cases the bounds improve the dependence on the noise rate factor (defined in Section 6 below), compared to the known results for passive learning. In fact, Kääriäinen (2006) showed that any label complexity bound depending on the noise distribution only via the noise rate cannot do better than this type of constant-factor improvement. This raised the question of whether, with a more detailed description of the noise distribution, one can show improvements in the asymptotic form of the label complexity compared to passive learning.
Toward this end, Castro and Nowak (2008) studied a certain refined description of the noise conditions, related to the margin conditions of Mammen and Tsybakov (1999), which are well-studied in the passive learning literature. Specifically, they found that in some special cases, under certain restrictions on the noise distribution, the asymptotic form of the label complexity can be improved compared to passive learning, and in some cases the improvements can even be exponential in magnitude; to achieve this, they developed algorithms specifically tailored to the types of classifiers they studied (threshold classifiers and boundary fragment classes). Balcan, Broder, and Zhang (2007) later extended this result to general homogeneous linear separators under a uniform distribution. Following this, Hanneke (2009a, 2011) generalized these results, showing that both of the published general agnostic active learning algorithms (Balcan, Beygelzimer, and Langford, 2009; Dasgupta, Hsu, and Monteleoni, 2007) can also achieve these types of improvements in the asymptotic form of the label complexity; he further proved general bounds on the label complexities of these methods, again based on the disagreement coefficient, which apply to arbitrary types of classifiers, and which reflect these types of improvements (under conditions on the disagreement coefficient). Wang (2009) later bounded the label complexity of A² under somewhat different noise conditions, in particular identifying weaker noise conditions sufficient for these improvements to be exponential in magnitude (again, under conditions on the disagreement coefficient). Koltchinskii (2010) has recently improved on some of Hanneke’s results, refining certain logarithmic factors and simplifying the proofs, using a slightly different algorithm based on similar principles.
Though the present work discusses only classes of finite VC dimension, most of the above references also contain results for various types of nonparametric classes with infinite VC dimension.

At present, all of the published bounds on the label complexity of agnostic active learning also apply to self-verifying algorithms. As mentioned, in the realizable case, it is typically possible to achieve significantly better label complexities if we do not require the active learning algorithm to be self-verifying, since the verification of learning may be more difficult than the learning itself (Balcan, Hanneke, and Vaughan, 2010). We might wonder whether this is also true in the agnostic case, and whether agnostic active learning algorithms that are not self-verifying might possibly achieve significantly better label complexities than the existing label complexity bounds described above. We investigate this in depth below.

1.2 Summary of Contributions

In the present work, we build on and extend the above results in a variety of ways, resolving a number of open problems. The main contributions of this work can be summarized as follows.

• We formally define a notion of a universal activizer, a meta-algorithm that transforms any passive learning algorithm into an active learning algorithm with asymptotically strictly superior label complexities for all nontrivial distributions and target concepts in the concept space.
• We analyze the existing strategy of disagreement-based active learning from this perspective, precisely characterizing the conditions under which this strategy can lead to a universal activizer for VC classes in the realizable case.
• We propose a new type of active learning algorithm, based on shatterable sets, and construct universal activizers for all VC classes in the realizable case based on this idea; in particular, this overcomes the issue of distribution-dependence in the existing results mentioned above.
• We present a novel generalization of the disagreement coefficient, along with a new asymptotic bound on the label complexities achievable by active learning in the realizable case; this new bound is often significantly smaller than the existing results in the published literature.
• We state new concise sufficient conditions for exponential improvements over passive learning to be achievable in the realizable case, including a significant weakening of known conditions in the published literature.
• We present a new general-purpose active learning algorithm for the agnostic case, based on the aforementioned idea involving shatterable sets.
• We prove a new asymptotic bound on the label complexities achievable by active learning in the presence of label noise (the agnostic case), often significantly smaller than any previously published results.
• We formulate a general conjecture on the theoretical advantages of active learning over passive learning in the presence of arbitrary types of label noise.

1.3 Outline of the Paper

The paper is organized as follows. In Section 2, we introduce the basic notation used throughout, formally define the learning protocol, and formally define the label complexity. We also define the notion of an activizer, which is a procedure that transforms a passive learning algorithm into an active learning algorithm with asymptotically superior label complexity. In Section 3, we review the established technique of disagreement-based active learning, and prove a new result precisely characterizing the scenarios in which disagreement-based active learning can be used to construct an activizer. In particular, we find that in many scenarios, disagreement-based active learning is not powerful enough to provide the desired improvements. In Section 4, we move beyond disagreement-based active learning, developing a new type of active learning algorithm based on shatterable sets of points.
We apply this technique to construct a simple 3-stage procedure, which we then prove is a universal activizer for any concept space of finite VC dimension. In Section 5, we begin by reviewing the known results for bounding the label complexity of disagreement-based active learning in terms of the disagreement coefficient; we then develop a somewhat more involved procedure, again based on shatterable sets, which takes full advantage of the sequential nature of active learning. In addition to being an activizer, we show that this procedure often achieves label complexities dramatically superior to those achievable by passive learning. In particular, we define a novel generalization of the disagreement coefficient, and use it to bound the label complexity of this procedure. This also provides us with concise sufficient conditions for obtaining exponential improvements over passive learning. Continuing in Section 6, we extend our framework to allow for label noise (the agnostic case), and discuss the possibility of extending the results from previous sections to these noisy learning problems. We first review the known results for noise-robust disagreement-based active learning, and characterizations of its label complexity in terms of the disagreement coefficient and Mammen-Tsybakov noise parameters. We then proceed to develop a new type of noise-robust active learning algorithm, again based on shatterable sets, and prove bounds on its label complexity in terms of our aforementioned generalization of the disagreement coefficient. Additionally, we present a general conjecture concerning the existence of activizers for certain passive learning algorithms in the agnostic case. We conclude in Section 7 with a host of enticing open problems for future investigation.

2. Definitions and Notation

For most of the paper, we consider the following formal setting.
There is a measurable space $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$, where $\mathcal{X}$ is called the instance space; for simplicity, we suppose this is a standard Borel space (Srivastava, 1998) (e.g., $\mathbb{R}^m$ under the usual Borel $\sigma$-algebra), though most of the results generalize. A classifier is any measurable function $h : \mathcal{X} \to \{-1, +1\}$. There is a set $C$ of classifiers called the concept space. In the realizable case, the learning problem is characterized as follows. There is a probability measure $P$ on $\mathcal{X}$, and a sequence $Z_X = \{X_1, X_2, \ldots\}$ of independent $\mathcal{X}$-valued random variables, each with distribution $P$. We refer to these random variables as the sequence of unlabeled examples; although in practice, this sequence would typically be large but finite, to simplify the discussion and focus strictly on counting labels, we will suppose this sequence is inexhaustible. There is additionally a special element $f \in C$, called the target function, and we denote by $Y_i = f(X_i)$; we further denote by $Z = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$ the sequence of labeled examples, and for $m \in \mathbb{N}$ we denote by $Z_m = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$ the finite subsequence consisting of the first $m$ elements of $Z$. For any classifier $h$, we define the error rate $\mathrm{er}(h) = P(x : h(x) \neq f(x))$. Informally, the learning objective in the realizable case is to identify some $h$ with small $\mathrm{er}(h)$ using elements from $Z$, without direct access to $f$.

An active learning algorithm $A$ is permitted direct access to the $Z_X$ sequence (the unlabeled examples), but to gain access to the $Y_i$ values it must request them one at a time, in a sequential manner. Specifically, given access to the $Z_X$ values, the algorithm selects any index $i \in \mathbb{N}$, requests to observe the $Y_i$ value, then having observed the value of $Y_i$, selects another index $i'$, observes the value of $Y_{i'}$, etc.
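The setting just described can be mirrored in a few lines of Python. This is only a sketch under assumptions introduced here for illustration: the class name `LabelOracle`, the function `estimate_error_rate`, and the use of a finite list in place of the inexhaustible sequence of unlabeled examples are not part of the formal setting. The oracle hides the labels Y_i = f(X_i) and reveals them one request at a time, and the error rate er(h) = P(x : h(x) ≠ f(x)) is approximated by Monte Carlo sampling from P.

```python
import random


class LabelOracle:
    """Hides the labels Y_i = f(X_i) and reveals them one request at a time."""

    def __init__(self, xs, f):
        self._xs = xs          # unlabeled examples X_1, X_2, ... (finite here)
        self._f = f            # target function, never exposed directly
        self.requested = []    # indices, in the order they were requested

    def request(self, i):
        self.requested.append(i)
        return self._f(self._xs[i])


def estimate_error_rate(h, f, sample_x, trials=20000, seed=0):
    """Monte Carlo estimate of er(h) = P(x : h(x) != f(x)).

    sample_x(rng) is assumed to draw one point from the distribution P.
    """
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(trials):
        x = sample_x(rng)
        if h(x) != f(x):
            mistakes += 1
    return mistakes / trials


if __name__ == "__main__":
    # Hypothetical example: P uniform on [0, 1], thresholds at 0.5 and 0.6.
    f = lambda x: +1 if x >= 0.5 else -1
    h = lambda x: +1 if x >= 0.6 else -1
    print(estimate_error_rate(h, f, lambda rng: rng.random()))
```

In this hypothetical example the two thresholds disagree exactly on the interval [0.5, 0.6), so the true error rate is 0.1 and the estimate should be close to that value.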
The algorithm is given as input an integer $n$, called the label budget, and is permitted to observe at most $n$ labels total before eventually halting and returning a classifier $\hat{h}_n = A(n)$; that is, by definition, an active learning algorithm never attempts to access more than the given budget of $n$ labels. We will then study the values of $n$ sufficient to guarantee $\mathbb{E}[\mathrm{er}(\hat{h}_n)] \le \varepsilon$, for any given value $\varepsilon \in (0, 1)$. We refer to this as the label complexity. We will be particularly interested in the asymptotic dependence on $\varepsilon$ in the label complexity, as $\varepsilon \to 0$. Formally, we have the following definition.

Definition 1 An active learning algorithm $A$ achieves label complexity $\Lambda(\cdot, \cdot, \cdot)$ if, for every target function $f$, distribution $P$, $\varepsilon \in (0, 1)$, and integer $n \ge \Lambda(\varepsilon, f, P)$, we have $\mathbb{E}[\mathrm{er}(A(n))] \le \varepsilon$.

This definition of label complexity is similar to one originally studied by Balcan, Hanneke, and Vaughan (2010). It has a few features worth noting. First, the label complexity has an explicit dependence on the target function $f$ and distribution $P$. As noted by Dasgupta (2005), we need this dependence if we are to fully understand the range of label complexities achievable by active learning; we further illustrate this issue in the examples below. The second feature to note is that the label complexity, as defined here, is simply a sufficient budget size to achieve the specified accuracy. That is, here we are asking only how many label requests are required for the algorithm to achieve a given accuracy (in expectation). However, as noted by Balcan, Hanneke, and Vaughan (2010), this number might not be sufficiently large to detect that the algorithm has indeed achieved the required accuracy based only on the observed data.
That is, because the number of labeled examples used in active learning can be quite small, we come across the problem that the number of labels needed to learn a concept might be significantly smaller than the number of labels needed to verify that we have successfully learned the concept. As such, this notion of label complexity is most useful in the design of effective learning algorithms, rather than for predicting the number of labels an algorithm should request in any particular application. Specifically, to design effective active learning algorithms, we should generally desire small label complexity values, so that (in the extreme case) if some algorithm $A$ has smaller label complexity values than some other algorithm $A'$ for all target functions and distributions, then (all other factors being equal) we should clearly prefer algorithm $A$ over algorithm $A'$; this is true regardless of whether we have a means to detect (verify) how large the improvements offered by algorithm $A$ over algorithm $A'$ are for any particular application. Thus, in our present context, performance guarantees in terms of this notion of label complexity play a role analogous to concepts such as universal consistency or admissibility, which are also generally useful in guiding the design of effective algorithms, but are not intended to be informative in the context of any particular application. See the work of Balcan, Hanneke, and Vaughan (2010) for a discussion of this issue, as it relates to a definition of label complexity similar to that above, as well as other notions of label complexity from the active learning literature (some of which include a verification requirement).

We will be interested in the performance of active learning algorithms, relative to the performance of a given passive learning algorithm. In this context, a passive learning algorithm $A$ takes as input a finite sequence of labeled examples $L \in \bigcup_n (\mathcal{X} \times \{-1, +1\})^n$, and returns a classifier $\hat{h} = A(L)$.
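To illustrate how the two kinds of algorithm are compared at the same budget n, the following Python sketch contrasts a passive learner, which receives the first n labeled examples, with an active learner, which may request at most n labels from a pool. It is a hedged illustration rather than any procedure from this paper: it assumes threshold classifiers with P uniform on [0, 1] (so that the error rate of a threshold t is exactly |t − t*|), uses the binary search strategy mentioned in the background discussion, and the names `passive_threshold`, `active_threshold`, and `expected_errors` are hypothetical.

```python
import random


def passive_threshold(labeled):
    """Passive learner A(L): midpoint of the thresholds consistent with L."""
    lo = max((x for x, y in labeled if y == -1), default=0.0)
    hi = min((x for x, y in labeled if y == +1), default=1.0)
    return (lo + hi) / 2


def active_threshold(pool, request_label, n):
    """Active learner A(n): binary search over the sorted pool, at most n requests."""
    xs = sorted(pool)
    left, right = 0, len(xs)   # invariant: xs[:left] are -1, xs[right:] are +1
    used = 0
    while left < right and used < n:
        mid = (left + right) // 2
        used += 1
        if request_label(xs[mid]) == +1:
            right = mid
        else:
            left = mid + 1
    lo = xs[left - 1] if left > 0 else 0.0
    hi = xs[right] if right < len(xs) else 1.0
    return (lo + hi) / 2, used


def expected_errors(n, pool_size=2000, reps=200, seed=0):
    """Average error of both learners at label budget n over random problems."""
    rng = random.Random(seed)
    passive_total = active_total = 0.0
    for _ in range(reps):
        t_star = rng.random()                       # random target threshold
        f = lambda x: +1 if x >= t_star else -1
        xs = [rng.random() for _ in range(pool_size)]
        t_passive = passive_threshold([(x, f(x)) for x in xs[:n]])
        t_active, used = active_threshold(xs, f, n)
        assert used <= n                            # the budget is never exceeded
        passive_total += abs(t_passive - t_star)    # er(h_t) = |t - t*| under uniform P
        active_total += abs(t_active - t_star)
    return passive_total / reps, active_total / reps


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, expected_errors(n))
```

Averaging over repetitions gives a rough empirical stand-in for the expected error at budget n; the smallest n at which that average drops below a given ε is then an empirical analogue of the label complexity for this one toy problem.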
We allow both active and passive learning algorithms to be randomized: that is, to have independent internal randomness, in addition to the given random data. We define the label complexity for a passive learning algorithm as follows. Definition 2 A passive learning algorithm A achieves label complexity Λ(·, ·, ·) if, for every target function f , distribution P, ε ∈ (0, 1), and integer n ≥ Λ(ε , f , P), we have E [er (A (Zn ))] ≤ ε . Although technically some algorithms may be able to achieve a desired accuracy without any observations, to make the general results easier to state (namely, those in Section 5), unless otherwise stated we suppose label complexities (both passive and active) take strictly positive values, among N ∪ {∞}; note that label complexities (both passive and active) can be infinite, indicating that the corresponding algorithm might not achieve expected error rate ε for any n ∈ N. Both the passive and active label complexities are defined as a number of labels sufficient to guarantee the expected error rate is at most ε . It is also common in the literature to discuss the number of label requests sufficient to guarantee the error rate is at most ε with high probability 1 − δ (e.g., Balcan, Hanneke, and Vaughan, 2010). In the present work, we formulate our results in terms of the expected error rate because it simplifies the discussion of asymptotics, in that we need only study the behavior of the label complexity as the single argument ε approaches 0, rather than the more complicated behavior of a function of ε and δ as both ε and δ approach 0 at various relative rates. However, we note that analogous results for these high-probability guarantees on the error rate can be extracted from the proofs below without much difficulty, and in several places we explicitly state results of this form. Below we employ the standard notation from asymptotic analysis, including O(·), o(·), Ω(·), ω (·), Θ(·), ≪, and ≫. 
In all contexts below not otherwise specified, the asymptotics are always considered as ε → 0 when considering a function of ε, and as n → ∞ when considering a function of n; also, in any expression of the form “x → 0,” we always mean the limit from above (i.e., x ↓ 0). For instance, when considering nonnegative functions of ε, λ_a(ε) and λ_p(ε), the above notations are defined as follows. We say λ_a(ε) = o(λ_p(ε)) when $\lim_{\varepsilon \to 0} \frac{\lambda_a(\varepsilon)}{\lambda_p(\varepsilon)} = 0$, and this is equivalent to writing λ_p(ε) = ω(λ_a(ε)), λ_a(ε) ≪ λ_p(ε), or λ_p(ε) ≫ λ_a(ε). We say λ_a(ε) = O(λ_p(ε)) when $\limsup_{\varepsilon \to 0} \frac{\lambda_a(\varepsilon)}{\lambda_p(\varepsilon)} < \infty$, which can equivalently be expressed as λ_p(ε) = Ω(λ_a(ε)). Finally, we write λ_a(ε) = Θ(λ_p(ε)) to mean that both λ_a(ε) = O(λ_p(ε)) and λ_a(ε) = Ω(λ_p(ε)) are satisfied. We also use the standard notation for the limit of a sequence of sets, such as $\lim_{r \to 0} A_r$, defined by the property $\mathbb{1}_{\lim_{r \to 0} A_r} = \lim_{r \to 0} \mathbb{1}_{A_r}$ (if the latter exists), where $\mathbb{1}_A$ is the indicator function for the set A.

Define the class of functions Polylog(1/ε) as those g : (0, 1) → [0, ∞) such that, for some k ∈ [0, ∞), g(ε) = O(log^k(1/ε)). For a label complexity Λ, also define the set Nontrivial(Λ) as the collection of all pairs (f, P) of a classifier and a distribution such that, ∀ε > 0, Λ(ε, f, P) < ∞, and ∀g ∈ Polylog(1/ε), Λ(ε, f, P) = ω(g(ε)).

In this context, an active meta-algorithm is a procedure A_a taking as input a passive algorithm A_p and a label budget n, such that for any passive algorithm A_p, A_a(A_p, ·) is an active learning algorithm. We define an activizer for a given passive algorithm as follows.

Definition 3 We say an active meta-algorithm A_a activizes a passive algorithm A_p for a concept space C if the following holds.
For any label complexity Λ p achieved by A p , the active learning algorithm Aa (A p , ·) achieves a label complexity Λa such that, for every f ∈ C and every distribution P on X with ( f , P) ∈ Nontrivial(Λ p ), there exists a constant c ∈ [1, ∞) such that Λa (cε , f , P) = o (Λ p (ε , f , P)) . In this case, Aa is called an activizer for A p with respect to C, and the active learning algorithm Aa (A p , ·) is called the Aa -activized A p . We also refer to any active meta-algorithm Aa that activizes every passive algorithm A p for C as a universal activizer for C. One of the main contributions of this work is establishing that such universal activizers do exist for any VC class C. A bit of explanation is in order regarding Definition 3. We might interpret it as follows: an activizer for A p strongly improves (in a little-o sense) the label complexity for all nontrivial target functions and distributions. Here, we seek a meta-algorithm that, when given A p as input, results in an active learning algorithm with strictly superior label complexities. However, there is a sense in which some distributions P or target functions f are trivial relative to A p . For instance, perhaps A p has a default classifier that it is naturally biased toward (e.g., with minimal P(x : h(x) = +1), as in the Closure algorithm of Helmbold, Sloan, and Warmuth, 1990), so that when this default classifier is the target function, A p achieves a constant label complexity. In these trivial scenarios, we cannot hope to improve over the behavior of the passive algorithm, but instead can only hope to compete with it. The sense in which we wish to compete may be a subject of some controversy, but the implication of Definition 3 is that the label complexity of the activized algorithm should be strictly better than every nontrivial upper bound on the label complexity of the passive algorithm. 
For instance, if Λ_p(ε, f, P) ∈ Polylog(1/ε), then we are guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε) as well, but if Λ_p(ε, f, P) = O(1), we are still only guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε). This serves the purpose of defining a framework that can be studied without requiring too much obsession over small additive terms in trivial scenarios, thus focusing the analyst’s efforts toward nontrivial scenarios where A_p has relatively large label complexity, which are precisely the scenarios for which active learning is truly needed. In our proofs, we find that in fact Polylog(1/ε) can be replaced with O(log(1/ε)), giving a slightly broader definition of “nontrivial,” for which all of the results below still hold. Section 7 discusses open problems regarding this issue of trivial problems. The definition of Nontrivial(·) also only requires the activized algorithm to be effective in scenarios where the passive learning algorithm has reasonable behavior (i.e., finite label complexities); this is only intended to keep with the reduction-based style of the framework, and in fact this restriction can easily be lifted using a trick from Balcan, Hanneke, and Vaughan (2010) (aggregating the activized algorithm with another algorithm that is always reasonable). Finally, we also allow a constant factor c loss in the ε argument to Λ_a. We allow this to be an arbitrary constant, again in the interest of allowing the analyst to focus only on the most significant aspects of the problem; for most reasonable passive learning algorithms, we typically expect Λ_p(ε, f, P) = Poly(1/ε), in which case c can be set to 1 by adjusting the leading constant factors of Λ_a. A careful inspection of our proofs reveals that c can always be set arbitrarily close to 1 without affecting the theorems below (and in fact, we can even get c = (1 + o(1)), a function of ε).
Throughout this work, we will adopt the usual notation for probabilities, such as $P(\mathrm{er}(\hat{h}) > \varepsilon)$, and as usual we interpret this as measuring the corresponding event in the (implicit) underlying probability space. In particular, we make the usual implicit assumption that all sets involved in the analysis are measurable; where this assumption does not hold, we may turn to outer probabilities, though we will not make further mention of these technical details. We will also use the notation P^k(·) to represent k-dimensional product measures; for instance, for a measurable set A ⊆ X^k, P^k(A) = P((X′_1, …, X′_k) ∈ A), for independent P-distributed random variables X′_1, …, X′_k. Additionally, to simplify notation, we will adopt the convention that X^0 = {∅} and P^0(X^0) = 1. Throughout, we will denote by $\mathbb{1}_A(z)$ the indicator function for a set A, which has the value 1 when z ∈ A and 0 otherwise; additionally, at times it will be more convenient to use the bipolar indicator function, defined as $\mathbb{1}^{\pm}_A(z) = 2\,\mathbb{1}_A(z) - 1$.

We will require a few additional definitions for the discussion below. For any classifier h : X → {−1, +1} and finite sequence of labeled examples $L \in \bigcup_m (X \times \{-1,+1\})^m$, define the empirical error rate $\mathrm{er}_L(h) = |L|^{-1} \sum_{(x,y) \in L} \mathbb{1}_{\{-y\}}(h(x))$; for completeness, define er_∅(h) = 0. Also, for L = Z_m, the first m labeled examples in the data sequence, abbreviate this as er_m(h) = er_{Z_m}(h). For any probability measure P on X, set of classifiers H, classifier h, and r > 0, define B_{H,P}(h, r) = {g ∈ H : P(x : h(x) ≠ g(x)) ≤ r}; when P is the distribution of the unlabeled examples, and is clear from the context, we abbreviate this as B_H(h, r) = B_{H,P}(h, r); furthermore, when additionally H = C, the concept space, and both P and C are clear from the context, we abbreviate this as B(h, r) = B_{C,P}(h, r).
Also, for any set of classifiers H, and any sequence of labeled examples $L \in \bigcup_m (X \times \{-1,+1\})^m$, define H[L] = {h ∈ H : er_L(h) = 0}; for any (x, y) ∈ X × {−1, +1}, abbreviate H[(x, y)] = H[{(x, y)}] = {h ∈ H : h(x) = y}.

We also adopt the usual definition of “shattering” used in learning theory (e.g., Vapnik, 1998). Specifically, for any set of classifiers H, k ∈ N, and S = (x_1, …, x_k) ∈ X^k, we say H shatters S if, ∀(y_1, …, y_k) ∈ {−1, +1}^k, ∃h ∈ H such that ∀i ∈ {1, …, k}, h(x_i) = y_i; equivalently, H shatters S if $\exists \{h_1, \ldots, h_{2^k}\} \subseteq H$ such that for each i, j ∈ {1, …, 2^k} with i ≠ j, ∃ℓ ∈ {1, …, k} with h_i(x_ℓ) ≠ h_j(x_ℓ). To simplify notation, we will also say that H shatters ∅ if and only if H ≠ {}. As usual, we define the VC dimension of C, denoted d, as the largest integer k such that ∃S ∈ X^k shattered by C (Vapnik and Chervonenkis, 1971; Vapnik, 1998). To focus on nontrivial problems, we will only consider concept spaces C with d > 0 in the results below. Generally, any such concept space C with d < ∞ is called a VC class.

2.1 Motivating Examples

Throughout this paper, we will repeatedly refer to a few canonical examples. Although themselves quite toy-like, they represent the boiled-down essence of some important distinctions between various types of learning problems. In some sense, the process of grappling with the fundamental distinctions raised by these types of examples has been a driving force behind much of the recent progress in understanding the label complexity of active learning. The first example is perhaps the most classic, and is clearly the first that comes to mind when considering the potential for active learning to provide strong improvements over passive learning.

Example 1 In the problem of learning threshold classifiers, we consider X = [0, 1] and $C = \{h_z(x) = \mathbb{1}^{\pm}_{[z,1]}(x) : z \in (0,1)\}$.

There is a simple universal activizer for threshold classifiers, based on a kind of binary search. Specifically, suppose n ∈ N and that A_p is any given passive learning algorithm. Consider the points in {X_1, X_2, …, X_m}, for m = 2^{n−1}, and sort them in increasing order: X_(1), X_(2), …, X_(m). Also initialize ℓ = 0 and u = m + 1, and define X_(0) = 0 and X_(m+1) = 1. Now request the label of X_(i) for i = ⌊(ℓ + u)/2⌋ (i.e., the median point between ℓ and u); if the label is −1, let ℓ = i, and otherwise let u = i; repeat this (requesting this median point, then updating ℓ or u accordingly) until we have u = ℓ + 1. Finally, let $\hat{z} = X_{(u)}$, construct the labeled sequence $\hat{L} = \{(X_1, h_{\hat{z}}(X_1)), \ldots, (X_m, h_{\hat{z}}(X_m))\}$, and return the classifier $\hat{h} = A_p(\hat{L})$. Since each label request at least halves the set of integers between ℓ and u, the total number of label requests is at most log_2(m) + 1 = n. Supposing f ∈ C is the target function, this procedure maintains the invariant that f(X_(ℓ)) = −1 and f(X_(u)) = +1. Thus, once we reach u = ℓ + 1, since f is a threshold, it must be some h_z with z ∈ (X_(ℓ), X_(u)]; therefore every X_(j) with j ≤ ℓ has f(X_(j)) = −1, and likewise every X_(j) with j ≥ u has f(X_(j)) = +1; in particular, this means $\hat{L}$ equals Z_m, the true labeled sequence. But this means $\hat{h} = A_p(Z_m)$. Since n = log_2(m) + 1, this active learning algorithm will achieve an equivalent error rate to what A_p achieves with m labeled examples, but using only log_2(m) + 1 label requests. In particular, this implies that if A_p achieves label complexity Λ_p, then this active learning algorithm achieves label complexity Λ_a such that Λ_a(ε, f, P) ≤ log_2 Λ_p(ε, f, P) + 2; as long as 1 ≪ Λ_p(ε, f, P) < ∞, this is o(Λ_p(ε, f, P)), so that this procedure activizes A_p for C.
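The binary-search activizer for thresholds described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `threshold_activizer`, `query_label`, and `passive_learner` are hypothetical, the pool is assumed to contain distinct points, and `query_label` stands in for the label oracle.

```python
def threshold_activizer(xs, query_label, passive_learner):
    """Sketch of the binary-search activizer for thresholds on [0, 1].

    xs              -- unlabeled pool X_1..X_m (assumed distinct)
    query_label     -- hypothetical label oracle, returns -1 or +1
    passive_learner -- hypothetical passive algorithm A_p taking labeled pairs
    Returns (output of the passive learner, number of label requests).
    """
    order = sorted(xs)                 # X_(1) <= ... <= X_(m)
    m = len(order)
    lo, hi = 0, m + 1                  # sentinels X_(0) = 0 and X_(m+1) = 1
    queries = 0
    while hi > lo + 1:
        i = (lo + hi) // 2             # median index, always in 1..m
        queries += 1
        if query_label(order[i - 1]) == -1:
            lo = i
        else:
            hi = i
    z_hat = order[hi - 1] if hi <= m else 1.0
    # Infer every pool label from the estimated threshold z_hat.
    labeled = [(x, 1 if x >= z_hat else -1) for x in xs]
    return passive_learner(labeled), queries
```

With a pool of m = 2^{n−1} points, the loop makes at most n label requests, and the inferred labels agree with the target on every pool point, so the passive learner effectively sees m correctly labeled examples.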
The second example we consider is almost equally simple (only increasing the VC dimension from 1 to 2), but is far more subtle in terms of how we must approach its analysis in active learning.

Example 2 In the problem of learning interval classifiers, we consider X = [0, 1] and $C = \{h_{[a,b]}(x) = \mathbb{1}^{\pm}_{[a,b]}(x) : 0 < a \le b < 1\}$.

For the intervals problem, we can also construct a universal activizer, though slightly more complicated. Specifically, suppose again that n ∈ N and that A_p is any given passive learning algorithm. We first request the labels {Y_1, Y_2, …, Y_⌈n/2⌉} of the first ⌈n/2⌉ examples in the sequence. If every one of these labels is −1, then we immediately return the all-negative constant classifier $\hat{h}(x) = -1$. Otherwise, consider the points {X_1, X_2, …, X_m}, for $m = \max\{2^{\lfloor n/4 \rfloor - 1}, n\}$, and sort them in increasing order X_(1), X_(2), …, X_(m). For some value i ∈ {1, …, ⌈n/2⌉} with Y_i = +1, let j_+ denote the corresponding index j such that X_(j) = X_i. Also initialize ℓ_1 = 0, u_1 = ℓ_2 = j_+, and u_2 = m + 1, and define X_(0) = 0 and X_(m+1) = 1. Now if ℓ_1 + 1 < u_1, request the label of X_(i) for i = ⌊(ℓ_1 + u_1)/2⌋ (the median point between ℓ_1 and u_1); if the label is −1, let ℓ_1 = i, and otherwise let u_1 = i; repeat this (requesting this median point, then updating ℓ_1 or u_1 accordingly) until we have u_1 = ℓ_1 + 1. Now if ℓ_2 + 1 < u_2, request the label of X_(i) for i = ⌊(ℓ_2 + u_2)/2⌋ (the median point between ℓ_2 and u_2); if the label is −1, let u_2 = i, and otherwise let ℓ_2 = i; repeat this (requesting this median point, then updating u_2 or ℓ_2 accordingly) until we have u_2 = ℓ_2 + 1. Finally, let $\hat{a} = X_{(u_1)}$ and $\hat{b} = X_{(\ell_2)}$, construct the labeled sequence $\hat{L} = \{(X_1, h_{[\hat{a},\hat{b}]}(X_1)), \ldots, (X_m, h_{[\hat{a},\hat{b}]}(X_m))\}$, and return the classifier $\hat{h} = A_p(\hat{L})$.
Since each label request in the second phase halves the set of values between either ℓ_1 and u_1 or ℓ_2 and u_2, the total number of label requests is at most min{m, ⌈n/2⌉ + 2 log_2(m) + 2} ≤ n. Suppose f ∈ C is the target function, and let w(f) = P(x : f(x) = +1). If w(f) = 0, then with probability 1 the algorithm will return the constant classifier $\hat{h}(x) = -1$, which has $\mathrm{er}(\hat{h}) = 0$ in this case. Otherwise, if w(f) > 0, then for any $n \ge \frac{2}{w(f)} \ln \frac{1}{\varepsilon}$, with probability at least 1 − ε, there exists i ∈ {1, …, ⌈n/2⌉} with Y_i = +1. Let H_+ denote the event that such an i exists. Supposing this is the case, the algorithm will make it into the second phase. In this case, the procedure maintains the invariant that f(X_(ℓ_1)) = −1, f(X_(u_1)) = f(X_(ℓ_2)) = +1, and f(X_(u_2)) = −1, where ℓ_1 < u_1 ≤ ℓ_2 < u_2. Thus, once we have u_1 = ℓ_1 + 1 and u_2 = ℓ_2 + 1, since f is an interval, it must be some h_[a,b] with a ∈ (X_(ℓ_1), X_(u_1)] and b ∈ [X_(ℓ_2), X_(u_2)); therefore, every X_(j) with j ≤ ℓ_1 or j ≥ u_2 has f(X_(j)) = −1, and likewise every X_(j) with u_1 ≤ j ≤ ℓ_2 has f(X_(j)) = +1; in particular, this means $\hat{L}$ equals Z_m, the true labeled sequence. But this means $\hat{h} = A_p(Z_m)$. Supposing A_p achieves label complexity Λ_p, and that $n \ge \max\left\{ 8 + 4 \log_2 \Lambda_p(\varepsilon, f, P),\ \frac{2}{w(f)} \ln \frac{1}{\varepsilon} \right\}$, then $m \ge 2^{\lfloor n/4 \rfloor - 1} \ge \Lambda_p(\varepsilon, f, P)$ and

$$E[\mathrm{er}(\hat{h})] \le E[\mathrm{er}(\hat{h})\, \mathbb{1}_{H_+}] + (1 - P(H_+)) \le E[\mathrm{er}(A_p(Z_m))] + \varepsilon \le 2\varepsilon.$$

In particular, this means this active learning algorithm achieves label complexity Λ_a such that, for any f ∈ C with w(f) = 0, Λ_a(2ε, f, P) = 0, and for any f ∈ C with w(f) > 0, $\Lambda_a(2\varepsilon, f, P) \le \max\left\{ 8 + 4 \log_2 \Lambda_p(\varepsilon, f, P),\ \frac{2}{w(f)} \ln \frac{1}{\varepsilon} \right\}$. If (f, P) ∈ Nontrivial(Λ_p), then $\frac{2}{w(f)} \ln \frac{1}{\varepsilon} = o(\Lambda_p(\varepsilon, f, P))$ and 8 + 4 log_2 Λ_p(ε, f, P) = o(Λ_p(ε, f, P)), so that Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). Therefore, this procedure activizes A_p for C.
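The two-phase procedure for intervals can be sketched in the same style. Again this is a hedged illustration rather than a definitive implementation: `interval_activizer`, `query_label`, and `passive_learner` are hypothetical names, the pool points are assumed distinct, and the caller is assumed to supply a pool of the size m used in the text.

```python
def interval_activizer(xs, n, query_label, passive_learner):
    """Sketch of the interval activizer: wait for a positive, then two binary searches.

    xs -- pool X_1..X_m in arrival order (assumed distinct, len(xs) >= ceil(n/2))
    Returns (result, number of label requests); result is a constant -1
    classifier if no positive label shows up in the first phase.
    """
    first = -(-n // 2)                                # ceil(n/2)
    labels = [query_label(x) for x in xs[:first]]     # phase 1: plain sampling
    queries = first
    if all(y == -1 for y in labels):
        return (lambda x: -1), queries
    order = sorted(xs)
    m = len(order)
    j_plus = order.index(xs[labels.index(+1)]) + 1    # 1-based rank of a positive point

    def search(lo, hi, label_at_lo):
        # Binary search keeping: point at rank lo has label_at_lo, rank hi has the other.
        nonlocal queries
        while hi > lo + 1:
            i = (lo + hi) // 2
            queries += 1
            if query_label(order[i - 1]) == label_at_lo:
                lo = i
            else:
                hi = i
        return lo, hi

    _, u1 = search(0, j_plus, -1)                     # left end-point
    l2, _ = search(j_plus, m + 1, +1)                 # right end-point
    a_hat, b_hat = order[u1 - 1], order[l2 - 1]
    labeled = [(x, 1 if a_hat <= x <= b_hat else -1) for x in xs]
    return passive_learner(labeled), queries
```

The sketch re-queries points whose labels were already seen in the first phase; a real implementation could cache them, which only lowers the count.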
This example also brings to light some interesting phenomena in the analysis of the label complexity of active learning. Note that unlike the thresholds example, we have a much stronger dependence on the target function in these label complexity bounds, via the w( f ) quantity. This issue is fundamental to the problem, and cannot be avoided. In particular, when P([0, x]) is continuous, this is the very issue that makes the minimax label complexity for this problem (i.e., minΛa max f ∈C Λa (ε , f , P)) no better than passive learning (Dasgupta, 2005). Thus, this problem emphasizes the need for any informative label complexity analysis of active learning to explicitly describe the dependence of the label complexity on the target function, as advocated by Dasgupta (2005). This example also highlights the unverifiability phenomenon explored by Balcan, Hanneke, and Vaughan (2010), since in the case of w( f ) = 0, the error rate of the returned classifier is zero, but (for nondegenerate P) there is no way for the algorithm to verify this fact based only on the finite number of labels it observes. In fact, Balcan, Hanneke, and Vaughan (2010) have shown that under continuous P, for any f ∈ C with w( f ) = 0, the number of labels required to both find a classifier of small error rate and verify that the error rate is small based only on observable quantities is essentially no better than for passive learning. These issues are present to a small degree in the intervals example, but were easily handled in a very natural way. The target-dependence shows up only in an initial phase of waiting for a positive example, and the always-negative classifiers were handled by setting a default return value. However, we can amplify these issues so that they show up in more subtle and involved ways. Specifically, consider the following example, studied by Balcan, Hanneke, and Vaughan (2010). 
Example 3 In the problem of learning unions of i intervals, we consider X = [0, 1] and $C = \left\{ h_z(x) = \mathbb{1}^{\pm}_{\bigcup_{j=1}^{i} [z_{2j-1}, z_{2j}]}(x) : 0 < z_1 \le z_2 \le \ldots \le z_{2i} < 1 \right\}$.

The challenge of this problem is that, because sometimes z_j = z_{j+1} for some j values, we do not know how many intervals are required to minimally represent the target function: only that it is at most i. This issue will be made clearer below. We can essentially think of any effective strategy here as having two components: one component that searches (perhaps randomly) with the purpose of identifying at least one example from each decision region, and another component that refines our estimates of the end-points of the regions the first component identifies. Later, we will go through the behavior of a universal activizer for this problem in detail.

3. Disagreement-Based Active Learning

At present, perhaps the best-understood active learning algorithms are those choosing their label requests based on disagreement among a set of remaining candidate classifiers. The canonical algorithm of this type, a version of which we discuss below in Section 5.1, was proposed by Cohn, Atlas, and Ladner (1994). Specifically, for any set H of classifiers, define the region of disagreement: DIS(H) = {x ∈ X : ∃h_1, h_2 ∈ H s.t. h_1(x) ≠ h_2(x)}. The basic idea of disagreement-based algorithms is that, at any given time in the algorithm, there is a subset V ⊆ C of remaining candidates, called the version space, which is guaranteed to contain the target f. When deciding whether to request a particular label Y_i, the algorithm simply checks whether X_i ∈ DIS(V): if so, the algorithm requests Y_i, and otherwise it does not. This general strategy is reasonable, since for any X_i ∉ DIS(V), the label agreed upon by V must be f(X_i), so that we would get no information by requesting Y_i; that is, for X_i ∉ DIS(V), we can accurately infer Y_i based on information already available.
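For a finite set of classifiers and a finite pool of points, the region of disagreement and the request rule just described can be computed directly. The sketch below is illustrative only; the function names are hypothetical, and a finite class is an assumption made so that the set can be enumerated.

```python
def region_of_disagreement(classifiers, points):
    """Points in the pool on which at least two classifiers disagree."""
    return [x for x in points if len({h(x) for h in classifiers}) > 1]


def version_space(classifiers, labeled):
    """Classifiers consistent with every labeled example (x, y)."""
    return [h for h in classifiers if all(h(x) == y for x, y in labeled)]


def request_or_infer(V, x, query_label):
    """Request the label only if x is in DIS(V); otherwise infer it from V.

    Returns (label, requested_flag). V is assumed non-empty.
    """
    predictions = {h(x) for h in V}
    if len(predictions) > 1:
        return query_label(x), True
    return predictions.pop(), False
```

For example, with two threshold classifiers at 0.2 and 0.4, only pool points in [0.2, 0.4) fall in the region of disagreement, so labels elsewhere are inferred without a request.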
This type of algorithm has recently received substantial attention, not only for its obvious elegance and simplicity, but also because (as we discuss in Section 6) there are natural ways to extend the technique to the general problem of learning with label noise and model misspecification (the agnostic setting). The details of disagreement-based algorithms can vary in how they update the set V and how frequently they do so, but it turns out almost all disagreement-based algorithms share many of the same fundamental properties, which we describe below. 3.1 A Basic Disagreement-Based Active Learning Algorithm In Section 5.1, we discuss several known results on the label complexities achievable by these types of active learning algorithms. However, for now let us examine a very basic algorithm of this type. The following is intended to be a simple representative of the family of disagreement-based active learning algorithms. It has been stripped down to the bare essentials of what makes such algorithms work. As a result, although the gap between its label complexity and that achieved by passive learning is not necessarily as large as those achieved by the more sophisticated disagreement-based active learning algorithms of Section 5.1, it has the property that whenever those more sophisticated methods have label complexities asymptotically superior to those achieved by passive learning, that guarantee will also be true for this simpler method, and vice versa. The algorithm operates in only 2 phases. In the first, it uses one batch of label requests to reduce the version space V to a subset of C; in the second, it uses another batch of label requests, this time only requesting labels for points in DIS(V ). Thus, we have isolated precisely that aspect of disagreement-based active learning that involves improvements due to only requesting the labels of examples in the region of disagreement. 
The procedure is formally defined as follows, in terms of an estimator $\hat{P}_n(\mathrm{DIS}(V))$ specified below.

Meta-Algorithm 0
Input: passive algorithm A_p, label budget n
Output: classifier $\hat{h}$
0. Request the first ⌊n/2⌋ labels {Y_1, …, Y_⌊n/2⌋}, and let t ← ⌊n/2⌋
1. Let V = {h ∈ C : er_⌊n/2⌋(h) = 0}
2. Let $\hat{\Delta} \leftarrow \hat{P}_n(\mathrm{DIS}(V))$
3. Let L ← {}
4. For m = ⌊n/2⌋ + 1, …, ⌊n/2⌋ + $\lfloor n/(4\hat{\Delta}) \rfloor$
5. If X_m ∈ DIS(V) and t < n, request the label Y_m of X_m, and let $\hat{y} \leftarrow Y_m$ and t ← t + 1
6. Else let $\hat{y} \leftarrow h(X_m)$ for an arbitrary h ∈ V
7. Let L ← L ∪ {(X_m, $\hat{y}$)}
8. Return A_p(L)

Meta-Algorithm 0 depends on a data-dependent estimator $\hat{P}_n(\mathrm{DIS}(V))$ of P(DIS(V)), which we can define in a variety of ways using only unlabeled examples. In particular, for the theorems below, we will take the following definition for $\hat{P}_n(\mathrm{DIS}(V))$, designed to be a confidence upper bound on P(DIS(V)). Let $U_n = \{X_{n^2+1}, \ldots, X_{2n^2}\}$. Then define

$$\hat{P}_n(\mathrm{DIS}(V)) = \max\left\{ \frac{2}{n^2} \sum_{x \in U_n} \mathbb{1}_{\mathrm{DIS}(V)}(x),\ \frac{4}{n} \right\}. \qquad (1)$$

Meta-Algorithm 0 is divided into two stages: one stage where we focus on reducing V, and a second stage where we construct the sample L for the passive algorithm. This might intuitively seem somewhat wasteful, as one might wish to use the requested labels from the first stage to augment those in the second stage when constructing L, thus feeding all of the observed labels into the passive algorithm A_p. Indeed, this can improve the label complexity in some cases (albeit only by a constant factor); however, in order to get the general property of being an activizer for all passive algorithms A_p, we construct the sample L so that the conditional distribution of the X components in L given |L| is P^{|L|}, so that it is (conditionally) an i.i.d. sample, which is essential to our analysis.
The choice of the number of (unlabeled) examples to process in the second stage guarantees (by a Chernoff bound) that the “t < n” constraint in Step 5 is redundant; this is a trick we will employ in several of the methods below. As explained above, because f ∈ V, this implies that every (x, y) ∈ L has y = f(x).

To give some basic intuition for how this algorithm behaves, consider the example of learning threshold classifiers (Example 1); to simplify the explanation, for now we ignore the fact that $\hat{P}_n$ is only an estimate, as well as the “t < n” constraint in Step 5 (both of which will be addressed in the general analysis below). In this case, suppose the target function is f = h_z. Let a = max{X_i : X_i < z, 1 ≤ i ≤ ⌊n/2⌋} and b = min{X_i : X_i ≥ z, 1 ≤ i ≤ ⌊n/2⌋}. Then V = {h_z′ : a < z′ ≤ b} and DIS(V) = (a, b), so that the second phase of the algorithm only requests labels for a number of points in the region (a, b). With probability 1 − ε, the probability mass in this region is at most O(log(1/ε)/n), so that |L| ≥ ℓ_{n,ε} = Ω(n²/log(1/ε)); also, since the labels in L are all correct, and the X_m values in L are conditionally i.i.d. (with distribution P) given |L|, we see that the conditional distribution of L given |L| = ℓ is the same as the (unconditional) distribution of Z_ℓ. In particular, if A_p achieves label complexity Λ_p, and $\hat{h}_n$ is the classifier returned by Meta-Algorithm 0 applied to A_p, then for any $n = \Omega\left(\sqrt{\Lambda_p(\varepsilon, f, P) \log(1/\varepsilon)}\right)$ chosen so that ℓ_{n,ε} ≥ Λ_p(ε, f, P), we have

$$E[\mathrm{er}(\hat{h}_n)] \le \varepsilon + \sup_{\ell \ge \ell_{n,\varepsilon}} E[\mathrm{er}(A_p(Z_\ell))] \le \varepsilon + \sup_{\ell \ge \Lambda_p(\varepsilon, f, P)} E[\mathrm{er}(A_p(Z_\ell))] \le 2\varepsilon.$$

This indicates the active learning algorithm achieves label complexity Λ_a with $\Lambda_a(2\varepsilon, f, P) = O\left(\sqrt{\Lambda_p(\varepsilon, f, P) \log(1/\varepsilon)}\right)$. In particular, if ∞ > Λ_p(ε, f, P) = ω(log(1/ε)), then Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). Therefore, Meta-Algorithm 0 is a universal activizer for the space of threshold classifiers.
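To make the threshold case concrete, the following Python sketch simulates Meta-Algorithm 0 for threshold classifiers under a uniform distribution on [0, 1]. It is a simplified illustration under stated assumptions: the function name and arguments are hypothetical, fresh independent draws replace the indexed data sequence (including the block U_n used by estimator (1)), and the version space is tracked through its two end-points a and b.

```python
import random


def meta_algorithm_0_thresholds(n, z, passive_learner, rng):
    """Simulate Meta-Algorithm 0 for the target h_z under uniform P on [0, 1].

    Returns (output of passive_learner, labels requested t, constructed sample L).
    """
    f = lambda x: 1 if x >= z else -1
    half = n // 2
    # Steps 0-1: first batch of labels; V = {h_z' : a < z' <= b}.
    first = [rng.random() for _ in range(half)]
    a = max((x for x in first if f(x) == -1), default=0.0)
    b = min((x for x in first if f(x) == +1), default=1.0)
    in_dis = lambda x: a < x < b                      # DIS(V) = (a, b)
    # Step 2: estimator (1), from n^2 unlabeled points.
    unlabeled = [rng.random() for _ in range(n * n)]
    delta_hat = max(2.0 * sum(in_dis(x) for x in unlabeled) / (n * n), 4.0 / n)
    # Steps 3-7: build L, requesting labels only inside DIS(V).
    t, labeled = half, []
    for _ in range(int(n / (4 * delta_hat))):
        x = rng.random()
        if in_dis(x) and t < n:
            y = f(x)                                  # requested label
            t += 1
        else:
            y = 1 if x >= b else -1                   # label given by h_b, a member of V
        labeled.append((x, y))
    # Step 8: hand the constructed sample to the passive algorithm.
    return passive_learner(labeled), t, labeled
```

Running it with, say, `rng = random.Random(0)` and increasing n gives a feel for how the size of L grows relative to the number of labels actually requested.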
In contrast, consider the problem of learning interval classifiers (Example 2). In this case, suppose the target function f has P(x : f(x) = +1) = 0, and that P is uniform in [0, 1]. Since (with probability one) every Y_i = −1, we have V = {h_[a,b] : {X_1, …, X_⌊n/2⌋} ∩ [a, b] = ∅}. But this contains classifiers h_[a,a] for every a ∈ (0, 1) \ {X_1, …, X_⌊n/2⌋}, so that DIS(V) = (0, 1) \ {X_1, …, X_⌊n/2⌋}. Thus, P(DIS(V)) = 1, and |L| = O(n); that is, A_p gets run with no more labeled examples than simple passive learning would use. This indicates we should not expect Meta-Algorithm 0 to be a universal activizer for interval classifiers. Below, we formalize this by constructing a passive learning algorithm A_p that Meta-Algorithm 0 does not activize in scenarios of this type.

3.2 The Limiting Region of Disagreement

In this subsection, we generalize the examples from the previous subsection. Specifically, we prove that the performance of Meta-Algorithm 0 is intimately tied to a particular limiting set, referred to as the disagreement core. A similar definition was given by Balcan, Hanneke, and Vaughan (2010) (there referred to as the boundary, for reasons that will become clear below); it is also related to certain quantities in the work of Hanneke (2007b, 2011) described below in Section 5.1.

Definition 4 Define the disagreement core of a classifier f with respect to a set of classifiers H and probability measure P as $\partial_{H,P} f = \lim_{r \to 0} \mathrm{DIS}(B_{H,P}(f, r))$.

When P is the data distribution on X, and is clear from the context, we abbreviate this as ∂_H f = ∂_{H,P} f; if additionally H = C, the full concept space, which is clear from the context, we further abbreviate this as ∂f = ∂_C f = ∂_{C,P} f. As we will see, disagreement-based algorithms often tend to focus their label requests around the disagreement core of the target function. As such, the concept of the disagreement core will be essential in much of our discussion below.
We therefore go through a few examples to build intuition about this concept and its properties. Perhaps the simplest example to start with is C as the class of threshold classifiers (Example 1), under P uniform on [0, 1]. For any h_z ∈ C and sufficiently small r > 0, B(h_z, r) = {h_z′ : |z′ − z| ≤ r}, and DIS(B(h_z, r)) = [z − r, z + r). Therefore, $\partial h_z = \lim_{r \to 0} \mathrm{DIS}(B(h_z, r)) = \lim_{r \to 0} [z - r, z + r) = \{z\}$. Thus, in this case, the disagreement core of h_z with respect to C and P is precisely the decision boundary of the classifier. As a slightly more involved example, consider again the example of interval classifiers (Example 2), again under P uniform on [0, 1]. Now for any h_[a,b] ∈ C with b − a > 0, for any sufficiently small r > 0, B(h_[a,b], r) = {h_[a′,b′] : |a − a′| + |b − b′| ≤ r}, and DIS(B(h_[a,b], r)) = [a − r, a + r) ∪ (b − r, b + r]. Therefore, $\partial h_{[a,b]} = \lim_{r \to 0} \mathrm{DIS}(B(h_{[a,b]}, r)) = \lim_{r \to 0} [a - r, a + r) \cup (b - r, b + r] = \{a, b\}$. Thus, in this case as well, the disagreement core of h_[a,b] with respect to C and P is again the decision boundary of the classifier.

As the above two examples illustrate, ∂f often corresponds to the decision boundary of f in some geometric interpretation of X and f. Indeed, under fairly general conditions on C and P, the disagreement core of f does correspond to (a subset of) the set of points dividing the two label regions of f; for instance, Friedman (2009) derives sufficient conditions under which this is the case. In these cases, the behavior of disagreement-based active learning algorithms can often be interpreted in the intuitive terms of seeking label requests near the decision boundary of the target function, to refine an estimate of that boundary. However, in some more subtle scenarios this is no longer the case, for interesting reasons. To illustrate this, let us continue the example of interval classifiers from above, but now consider h_[a,a] (i.e., h_[a,b] with a = b).
This time, for any r ∈ (0, 1) we have B(h_[a,a], r) = {h_[a′,b′] ∈ C : b′ − a′ ≤ r}, and DIS(B(h_[a,a], r)) = (0, 1). Therefore, $\partial h_{[a,a]} = \lim_{r \to 0} \mathrm{DIS}(B(h_{[a,a]}, r)) = \lim_{r \to 0} (0, 1) = (0, 1)$. This example shows that in some cases, the disagreement core does not correspond to the decision boundary of the classifier, and indeed has P(∂f) > 0. Intuitively, as in the above example, this typically happens when the decision surface of the classifier is in some sense simpler than it could be. For instance, consider the space C of unions of two intervals (Example 3 with i = 2) under uniform P. The classifiers f ∈ C with P(∂f) > 0 are precisely those representable (up to probability zero differences) as a single interval. The others (with 0 < z_1 < z_2 < z_3 < z_4 < 1) have ∂h_z = {z_1, z_2, z_3, z_4}. In these examples, the f ∈ C with P(∂f) > 0 are not only simpler than other nearby classifiers in C, but they are also in some sense degenerate relative to the rest of C; however, it turns out this is not always the case, as there exist scenarios (C, P), even with d = 2, and even with countable C, for which every f ∈ C has P(∂f) > 0; in these cases, every classifier is in some important sense simpler than some other subset of nearby classifiers in C.

In Section 3.3, we show that the label complexity of disagreement-based active learning is intimately tied to the disagreement core. In particular, scenarios where P(∂f) > 0, such as those mentioned above, lead to the conclusion that disagreement-based methods are sometimes insufficient for activized learning. This motivates the design of more sophisticated methods in Section 4, which overcome this deficiency, along with a corresponding refinement of the definition of “disagreement core” in Section 5.2 that eliminates the above issue with “simple” classifiers.
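The contrast between a proper interval and a degenerate one can be checked numerically on a discretized version of the intervals class. The sketch below is an illustration under stated assumptions, not part of the original analysis: end-points are restricted to an integer grid 1..99 (in units of 1/100), distances are measured in the same units under the uniform distribution, and the function names are hypothetical.

```python
def sym_diff_length(i1, i2):
    """Length of the symmetric difference of two closed intervals (a, b) and (c, d)."""
    (a, b), (c, d) = i1, i2
    overlap = max(0, min(b, d) - max(a, c))
    return (b - a) + (d - c) - 2 * overlap


def dis_of_ball(center, r, grid):
    """Grid points in DIS(B(h_center, r)) for the discretized intervals class."""
    ball = [(c, d) for c in grid for d in grid
            if c <= d and sym_diff_length(center, (c, d)) <= r]
    # x is in the region of disagreement if some interval in the ball
    # contains it and some other interval in the ball does not.
    return [x for x in grid if len({c <= x <= d for (c, d) in ball}) > 1]


grid = list(range(1, 100))
near_boundary = dis_of_ball((30, 60), 4, grid)   # clusters around the end-points 30 and 60
degenerate = dis_of_ball((50, 50), 4, grid)      # covers every grid point
```

Shrinking r makes the first region collapse toward the two end-points, while the second stays equal to the whole grid for every r, mirroring ∂h_[a,b] = {a, b} versus ∂h_[a,a] = (0, 1).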
3.3 Necessary and Sufficient Conditions for Disagreement-Based Activized Learning

In the specific case of Meta-Algorithm 0, for large $n$ we may intuitively expect it to focus its second batch of label requests in and around the disagreement core of the target function. Thus, whenever $P(\partial f) = 0$, we should expect the label requests to be quite focused, and therefore the algorithm should achieve smaller label complexity compared to passive learning. On the other hand, if $P(\partial f) > 0$, then the label requests will not become focused beyond a constant fraction of the space, so that the improvements achieved by Meta-Algorithm 0 over passive learning should be, at best, a constant factor. This intuition is formalized in the following general theorem, the proof of which is included in Appendix A.

Theorem 5 For any VC class $C$, Meta-Algorithm 0 is a universal activizer for $C$ if and only if every $f \in C$ and distribution $P$ has $P(\partial_{C,P} f) = 0$.

While the formal proof is given in Appendix A, the general idea is simple. As we always have $f \in V$, any $y$ inferred in Step 6 must equal $f(x)$, so that all of the labels in $L$ are correct. Also, as $n$ grows large, classic results on passive learning imply the diameter of the set $V$ will become small, shrinking to zero as $n \to \infty$ (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989). Therefore, as $n \to \infty$, $\mathrm{DIS}(V)$ should converge to a subset of $\partial f$, so that in the case $P(\partial f) = 0$, we have $\hat{\Delta} \to 0$; thus $|L| \gg n$, which implies an asymptotic strict improvement in label complexity over the passive algorithm $A_p$ that $L$ is fed into in Step 8. On the other hand, since $\partial f$ is defined by classifiers arbitrarily close to $f$, it is unlikely that any finite sample of correctly labeled examples can contradict enough classifiers to make $\mathrm{DIS}(V)$ significantly smaller than $\partial f$, so that we always have $P(\mathrm{DIS}(V)) \ge P(\partial f)$.
Therefore, if $P(\partial f) > 0$, then $\hat{\Delta}$ converges to some nonzero constant, so that $|L| = O(n)$, representing only a constant factor improvement in label complexity. In fact, as is implied from this sketch (and is proven in Appendix A), the targets $f$ and distributions $P$ for which Meta-Algorithm 0 achieves asymptotic strict improvements for all passive learning algorithms (for which $f$ and $P$ are nontrivial) are precisely those (and only those) for which $P(\partial_{C,P} f) = 0$.

There are some general conditions under which the zero-probability disagreement cores condition of Theorem 5 will hold. For instance, it is not difficult to show this will always hold when $\mathcal{X}$ is countable. Furthermore, with some effort one can show it will hold for most classes having VC dimension one (e.g., any countable $C$ with $d = 1$). However, as we have seen, not all spaces $C$ satisfy this zero-probability disagreement cores property. In particular, for the interval classifiers studied in Section 3.2, we have $P(\partial h_{[a,a]}) = P((0, 1)) = 1$. Indeed, the aforementioned special cases aside, for most nontrivial spaces $C$, one can construct distributions $P$ that in some sense make $C$ mimic the intervals problem, so that we should typically expect disagreement-based methods will not be activizers. For detailed discussions of various scenarios where the $P(\partial_{C,P} f) = 0$ condition is (or is not) satisfied for various $C$, $P$, and $f$, see the works of Hanneke (2009b), Hanneke (2007b), Hanneke (2011), Balcan, Hanneke, and Vaughan (2010), Friedman (2009), Wang (2009) and Wang (2011).

4. Beyond Disagreement: A Basic Activizer

Since the zero-probability disagreement cores condition of Theorem 5 is not always satisfied, we are left with the question of whether there could be other techniques for active learning, beyond simple disagreement-based methods, which could activize every passive learning algorithm for every VC class.
In this section, we present an entirely new type of active learning algorithm, unlike anything in the existing literature, and we show that indeed it is a universal activizer for any class $C$ of finite VC dimension.

4.1 A Basic Activizer

As mentioned, the case $P(\partial f) = 0$ is already handled nicely by disagreement-based methods, since the label requests made in the second stage of Meta-Algorithm 0 will become focused into a small region, and $L$ therefore grows faster than $n$. Thus, the primary question we are faced with is what to do when $P(\partial f) > 0$. Since (loosely speaking) we have $\mathrm{DIS}(V) \to \partial f$ in Meta-Algorithm 0, $P(\partial f) > 0$ corresponds to scenarios where the label requests of Meta-Algorithm 0 will not become focused beyond a certain extent; specifically, as we show in Appendix B (Lemmas 35 and 36), $P(\mathrm{DIS}(V) \oplus \partial f) \to 0$ almost surely, where $\oplus$ is the symmetric difference, so that we expect Meta-Algorithm 0 will request labels for at least some constant fraction of the examples in $L$.

On the one hand, this is definitely a major problem for disagreement-based methods, since it prevents them from improving over passive learning in those cases. On the other hand, if we do not restrict ourselves to disagreement-based methods, we may actually be able to exploit properties of this scenario, so that it works to our advantage. In particular, in addition to the fact that $P(\mathrm{DIS}(V) \oplus \partial_C f) \to 0$, we show in Appendix B (Lemma 35) that $P(\partial_V f \oplus \partial_C f) = 0$ (almost surely) in Meta-Algorithm 0; this implies that for sufficiently large $n$, a random point $x_1$ in $\mathrm{DIS}(V)$ is likely to be in $\partial_V f$. We can exploit this fact by using $x_1$ to split $V$ into two subsets: $V[(x_1, +1)]$ and $V[(x_1, -1)]$. Now, if $x_1 \in \partial_V f$, then (by definition of the disagreement core) $\inf_{h \in V[(x_1,+1)]} \mathrm{er}(h) = \inf_{h \in V[(x_1,-1)]} \mathrm{er}(h) = 0$. Therefore, for almost every point $x \notin \mathrm{DIS}(V[(x_1, +1)])$, the label agreed upon for $x$ by classifiers in $V[(x_1, +1)]$ should be $f(x)$.
Likewise, for almost every point $x \notin \mathrm{DIS}(V[(x_1, -1)])$, the label agreed upon for $x$ by classifiers in $V[(x_1, -1)]$ should be $f(x)$. Thus, we can accurately infer the label of any point $x \notin \mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$ (except perhaps a zero-probability subset). With these sets $V[(x_1, +1)]$ and $V[(x_1, -1)]$ in hand, there is no longer a need to request the labels of points for which either of them has agreement about the label, and we can focus our label requests to the region $\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$, which may be much smaller than $\mathrm{DIS}(V)$. Now if $P(\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])) \to 0$, then the label requests will become focused to a shrinking region, and by the same reasoning as for Theorem 5 we can asymptotically achieve strict improvements over passive learning by a method analogous to Meta-Algorithm 0 (with the above changes).

Already this provides a significant improvement over disagreement-based methods in many cases; indeed, in some cases (such as intervals) this fully addresses the nonzero-probability disagreement core issue in Theorem 5. In other cases (such as unions of two intervals), it does not completely address the issue, since for some targets we do not have $P(\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])) \to 0$. However, by repeatedly applying this same reasoning, we can address the issue in full generality. Specifically, if $P(\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])) \nrightarrow 0$, then $\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$ essentially converges to a region $\partial_{C[(x_1,+1)]} f \cap \partial_{C[(x_1,-1)]} f$, which has nonzero probability, and is nearly equivalent to $\partial_{V[(x_1,+1)]} f \cap \partial_{V[(x_1,-1)]} f$. Thus, for sufficiently large $n$, a random $x_2$ in $\mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$ will likely be in $\partial_{V[(x_1,+1)]} f \cap \partial_{V[(x_1,-1)]} f$.
In this case, we can repeat the above argument, this time splitting $V$ into four sets ($V[(x_1, +1)][(x_2, +1)]$, $V[(x_1, +1)][(x_2, -1)]$, $V[(x_1, -1)][(x_2, +1)]$, and $V[(x_1, -1)][(x_2, -1)]$), each with infimum error rate equal to zero, so that for a point $x$ in the region of agreement of any of these four sets, the agreed-upon label will (almost surely) be $f(x)$, so that we can infer that label. Thus, we need only request the labels of those points in the intersection of all four regions of disagreement. We can further repeat this process as many times as needed, until we get a partition of $V$ with shrinking probability mass in the intersection of the regions of disagreement, which (as above) can then be used to obtain asymptotic improvements over passive learning.

Note that the above argument can be written more concisely in terms of shattering. That is, any $x \in \mathrm{DIS}(V)$ is simply an $x$ such that $V$ shatters $\{x\}$; a point $x \in \mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$ is simply one for which $V$ shatters $\{x_1, x\}$, and for any $x \notin \mathrm{DIS}(V[(x_1, +1)]) \cap \mathrm{DIS}(V[(x_1, -1)])$, the label $y$ we infer about $x$ has the property that the set $V[(x, -y)]$ does not shatter $\{x_1\}$. This continues for each repetition of the above idea, with $x$ in the intersection of the four regions of disagreement simply being one for which $V$ shatters $\{x_1, x_2, x\}$, and so on. In particular, this perspective makes it clear that we need only repeat this idea at most $d$ times to get a shrinking intersection region, since no set of $d + 1$ points is shatterable. Note that there may be unobservable factors (e.g., the target function) determining the appropriate number of iterations of this idea sufficient to have a shrinking probability of requesting a label, while maintaining the accuracy of inferred labels. To address this, we can simply try all $d + 1$ possibilities, and then select one of the resulting $d + 1$ classifiers via a kind of tournament of pairwise comparisons.
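The shattering view of this argument can be made concrete on a toy problem. The Python sketch below uses a finite version space of interval classifiers on a small grid, with an all-negative target; the grid size, the sample positions, and the helper names `label`, `shatters`, `restrict`, and `infer_label` are hypothetical choices made only for illustration, and the inference rule returns the first consistent label rather than the probability-weighted vote used by the actual method.

```python
GRID = 20  # cells 0..GRID-1; classifier (a, b) labels cell x as +1 iff a <= x < b


def label(h, x):
    a, b = h
    return 1 if a <= x < b else -1


def shatters(V, points):
    """True iff the classifiers in V realize all 2^|points| labelings of points."""
    patterns = {tuple(label(h, x) for x in points) for h in V}
    return len(patterns) == 2 ** len(points)


def restrict(V, x, y):
    """V[(x, y)]: the classifiers in V that label x as y."""
    return [h for h in V if label(h, x) == y]


def infer_label(V, x1, x):
    """Return None if a label request is needed, else an inferred label for x."""
    if shatters(V, [x1, x]):
        return None  # x lies in DIS(V[(x1,+1)]) ∩ DIS(V[(x1,-1)])
    for y in (-1, +1):
        # Infer a y such that V[(x, -y)] does not shatter {x1}.
        if not shatters(restrict(V, x, -y), [x1]):
            return y
    return None


# Target: the all-negative (degenerate) interval; these cells were observed as -1.
negatives = [2, 5, 9, 14, 17]
V = [(a, b) for a in range(GRID + 1) for b in range(a, GRID + 1)
     if all(label((a, b), s) == -1 for s in negatives)]

dis_cells = [x for x in range(GRID) if shatters(V, [x])]
x1 = 11  # a point in DIS(V) used to split the version space
requests = [x for x in range(GRID) if x != x1 and shatters(V, [x1, x])]
inferred = {x: infer_label(V, x1, x)
            for x in range(GRID) if x != x1 and x not in requests}

if __name__ == "__main__":
    print(len(dis_cells), requests, set(inferred.values()))
```

In this toy run the plain region of disagreement covers every unobserved cell, whereas after splitting on `x1` only the cells in the same gap between observed points still need label requests, and every inferred label agrees with the all-negative target.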
Also, in order to reduce the probability of a mistaken inference due to $x_1 \notin \partial_V f$ (or similarly for later $x_i$), we can replace each single $x_i$ with multiple samples, and then take a majority vote over whether to infer the label, and which label to infer if we do so; generally, we can think of this as estimating certain probabilities, and below we write these estimators as $\hat{P}_m$, and discuss the details of their implementation later. Combining Meta-Algorithm 0 with the above reasoning motivates a new type of active learning algorithm, referred to as Meta-Algorithm 1 below, and stated as follows.

Meta-Algorithm 1
Input: passive algorithm $A_p$, label budget $n$
Output: classifier $\hat{h}$
0. Request the first $m_n = \lfloor n/3 \rfloor$ labels, $\{Y_1, \ldots, Y_{m_n}\}$, and let $t \leftarrow m_n$
1. Let $V = \{h \in C : \mathrm{er}_{m_n}(h) = 0\}$
2. For $k = 1, 2, \ldots, d + 1$
3.   $\hat{\Delta}^{(k)} \leftarrow \hat{P}_{m_n}\left(x : \hat{P}\left(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{x\} \mid V \text{ shatters } S\right) \ge 1/2\right)$
4.   Let $L_k \leftarrow \{\}$
5.   For $m = m_n + 1, \ldots, m_n + \lfloor n/(6 \cdot 2^k \hat{\Delta}^{(k)}) \rfloor$
6.     If $\hat{P}_m\left(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S\right) \ge 1/2$ and $t < \lfloor 2n/3 \rfloor$
7.       Request the label $Y_m$ of $X_m$, and let $\hat{y} \leftarrow Y_m$ and $t \leftarrow t + 1$
8.     Else, let $\hat{y} \leftarrow \mathrm{argmax}_{y \in \{-1,+1\}} \hat{P}_m\left(S \in \mathcal{X}^{k-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S\right)$
9.     Let $L_k \leftarrow L_k \cup \{(X_m, \hat{y})\}$
10. Return ActiveSelect$(\{A_p(L_1), A_p(L_2), \ldots, A_p(L_{d+1})\}, \lfloor n/3 \rfloor, \{X_{m_n + \max_k |L_k| + 1}, \ldots\})$

Subroutine: ActiveSelect
Input: set of classifiers $\{h_1, h_2, \ldots, h_N\}$, label budget $m$, sequence of unlabeled examples $U$
Output: classifier $\hat{h}$
0. For each $j, k \in \{1, 2, \ldots, N\}$ s.t. $j < k$,
1.   Let $R_{jk}$ be the first $\lfloor m / (j(N - j) \ln(eN)) \rfloor$ points in $U \cap \{x : h_j(x) \ne h_k(x)\}$ (if such values exist)
2.   Request the labels for $R_{jk}$ and let $Q_{jk}$ be the resulting set of labeled examples
3.   Let $m_{kj} = \mathrm{er}_{Q_{jk}}(h_k)$
4. Return $h_{\hat{k}}$, where $\hat{k} = \max\{k \in \{1, \ldots, N\} : \max_{j}$ […]

[…]

Now suppose $j \in \{k^{**} + 1, \ldots, N\}$ has $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$. In particular, this implies $\mathrm{er}(h_j \mid \{x : h_{k^{**}}(x) \ne h_j(x)\}) > 2/3$ and $P(x : h_j(x) \ne h_{k^{**}}(x)) > 0$, which again means (with probability one) $|\{X_M, X_{M+1}, \ldots\} \cap \{x : h_j(x) \ne h_{k^{**}}(x)\}| \ge M_{k^{**}}$. By Hoeffding's inequality, we have that

$$P\left(m_{jk^{**}} \le 7/12\right) \le \exp\{-M_{k^{**}}/72\} \le \exp\{1 - m/(72 k^{*} N \ln(eN))\}.$$

By a union bound, we have that

$$P\left(\exists j > k^{**} : \mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}}) \text{ and } m_{jk^{**}} \le 7/12\right) \le (N - k^{**}) \cdot \exp\{1 - m/(72 k^{*} N \ln(eN))\}.$$

In particular, when $\hat{k} \ge k^{**}$, and $m_{jk^{**}} > 7/12$ for all $j > k^{**}$ with $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$, it must be true that $\mathrm{er}(h_{\hat{k}}) \le 2\,\mathrm{er}(h_{k^{**}}) \le 2\,\mathrm{er}(h_{k^{*}})$. So, by a union bound, with probability $\ge 1 - eN \cdot \exp\{-m/(72 k^{*} N \ln(eN))\}$, the $\hat{k}$ chosen by ActiveSelect has $\mathrm{er}(h_{\hat{k}}) \le 2\,\mathrm{er}(h_{k^{*}})$.

The next two lemmas describe the limiting behavior of $\mathcal{S}^k(V_m^\star)$. In particular, we see that its limiting value is precisely $\partial_C^k f$ (up to zero-probability differences). Lemma 35 establishes that $\mathcal{S}^k(V_m^\star)$ does not decrease below $\partial_C^k f$ (except for a zero-probability set), and Lemma 36 establishes that its limit is not larger than $\partial_C^k f$ (again, except for a zero-probability set).

Lemma 35 There is an event $H'$ with $P(H') = 1$ such that on $H'$, $\forall m \in \mathbb{N}$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, for any $H$ with $V_m^\star \subseteq H \subseteq C$,

$$P^k\left(\mathcal{S}^k(H) \mid \partial_C^k f\right) = P^k\left(\partial_H^k f \mid \partial_C^k f\right) = 1,$$

and $\forall i \in \mathbb{N}$, $\mathbb{1}_{\partial_H^k f}\left(S_i^{(k+1)}\right) = \mathbb{1}_{\partial_C^k f}\left(S_i^{(k+1)}\right)$. Also, on $H'$, every such $H$ has $P^k(\partial_H^k f) = P^k(\partial_C^k f)$, and $M_\ell^{(k)}(H) \to \infty$ as $\ell \to \infty$.

Proof We will show the first claim for the set $V_m^\star$, and the result will then hold for $H$ by monotonicity. In particular, we will show this for any fixed $k \in \{0, \ldots, \tilde{d}_f - 1\}$ and $m \in \mathbb{N}$, and the existence of $H'$ then holds by a union bound. Fix any set $S \in \partial_C^k f$. Suppose $B_{V_m^\star}(f, r)$ does not shatter $S$ for some $r > 0$. There is an infinite sequence of sets $\{\{h_1^{(i)}, h_2^{(i)}, \ldots, h_{2^k}^{(i)}\}\}_i$ with $\forall j \le 2^k$, $P(x : h_j^{(i)}(x) \ne f(x)) \downarrow 0$, such that each $\{h_1^{(i)}, \ldots, h_{2^k}^{(i)}\} \subseteq B(f, r)$ and shatters $S$.
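Since $B_{V_m^\star}(f, r)$ does not shatter $S$,

$$1 = \inf_i \mathbb{1}\left[\exists j : h_j^{(i)} \notin B_{V_m^\star}(f, r)\right] = \inf_i \mathbb{1}\left[\exists j \le 2^k, \ell \le m : h_j^{(i)}(X_\ell) \ne f(X_\ell)\right].$$

But

$$\begin{aligned}
P\left(\inf_i \mathbb{1}\left[\exists j \le 2^k, \ell \le m : h_j^{(i)}(X_\ell) \ne f(X_\ell)\right] = 1\right)
&\le \inf_i P\left(\exists j \le 2^k, \ell \le m : h_j^{(i)}(X_\ell) \ne f(X_\ell)\right) \\
&\le \lim_{i \to \infty} \sum_{j \le 2^k} m\, P\left(x : h_j^{(i)}(x) \ne f(x)\right)
= \sum_{j \le 2^k} m \lim_{i \to \infty} P\left(x : h_j^{(i)}(x) \ne f(x)\right) = 0,
\end{aligned}$$

where the second inequality follows by a union bound. Therefore, $\forall r > 0$, $P\left(S \notin \mathcal{S}^k(B_{V_m^\star}(f, r))\right) = 0$. Furthermore, since $\bar{\mathcal{S}}^k(B_{V_m^\star}(f, r))$ is monotonic in $r$, the dominated convergence theorem gives us that

$$P\left(S \notin \partial_{V_m^\star}^k f\right) = E\left[\lim_{r \to 0} \mathbb{1}_{\bar{\mathcal{S}}^k(B_{V_m^\star}(f, r))}(S)\right] = \lim_{r \to 0} P\left(S \notin \mathcal{S}^k(B_{V_m^\star}(f, r))\right) = 0.$$

This implies that (letting $S \sim P^k$ be independent from $V_m^\star$)

$$\begin{aligned}
P\left(P^k\left(\bar{\partial}_{V_m^\star}^k f \mid \partial_C^k f\right) > 0\right)
&= P\left(P^k\left(\bar{\partial}_{V_m^\star}^k f \cap \partial_C^k f\right) > 0\right)
= \lim_{\xi \to 0} P\left(P^k\left(\bar{\partial}_{V_m^\star}^k f \cap \partial_C^k f\right) > \xi\right) \\
&\le \lim_{\xi \to 0} \frac{1}{\xi} E\left[P^k\left(\bar{\partial}_{V_m^\star}^k f \cap \partial_C^k f\right)\right] && \text{(Markov)} \\
&= \lim_{\xi \to 0} \frac{1}{\xi} E\left[\mathbb{1}_{\partial_C^k f}(S)\, P\left(S \notin \partial_{V_m^\star}^k f \mid S\right)\right] && \text{(Fubini)} \\
&= \lim_{\xi \to 0} 0 = 0.
\end{aligned}$$

This establishes the first claim for $V_m^\star$, on an event of probability 1, and monotonicity extends the claim to any $H \supseteq V_m^\star$. Also note that, on this event,

$$P^k\left(\partial_H^k f\right) \ge P^k\left(\partial_H^k f \cap \partial_C^k f\right) = P^k\left(\partial_H^k f \mid \partial_C^k f\right) P^k\left(\partial_C^k f\right) = P^k\left(\partial_C^k f\right),$$

where the last equality follows from the first claim. Noting that for $H \subseteq C$, $\partial_H^k f \subseteq \partial_C^k f$, we must have

$$P^k\left(\partial_H^k f\right) = P^k\left(\partial_C^k f\right).$$

This establishes the third claim. From the first claim, for any given value of $i \in \mathbb{N}$ the second claim holds for $S_i^{(k+1)}$ (with $H = V_m^\star$) on an additional event of probability 1; taking a union bound over all $i \in \mathbb{N}$ extends this claim to every $S_i^{(k+1)}$ on an event of probability 1. Monotonicity then implies

$$\mathbb{1}_{\partial_C^k f}\left(S_i^{(k+1)}\right) = \mathbb{1}_{\partial_{V_m^\star}^k f}\left(S_i^{(k+1)}\right) \le \mathbb{1}_{\partial_H^k f}\left(S_i^{(k+1)}\right) \le \mathbb{1}_{\partial_C^k f}\left(S_i^{(k+1)}\right),$$

extending the result to general $H$.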
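Also, as $k < \tilde{d}_f$, we know $P^k(\partial_C^k f) > 0$, and since we also know $V_m^\star$ is independent from $W_2$, the strong law of large numbers implies the final claim (for $V_m^\star$) on an additional event of probability 1; again, monotonicity extends this claim to any $H \supseteq V_m^\star$. Intersecting the above events over values $m \in \mathbb{N}$ and $k < \tilde{d}_f$ gives the event $H'$, and as each of the above events has probability 1 and there are countably many such events, a union bound implies $P(H') = 1$.

Note that one specific implication of Lemma 35, obtained by taking $k = 0$, is that on $H'$, $V_m^\star \ne \emptyset$ (even if $f \in \mathrm{cl}(C) \setminus C$). This is because, for $f \in \mathrm{cl}(C)$, we have $\partial_C^0 f = \mathcal{X}^0$ so that $P^0(\partial_C^0 f) = 1$, which means $P^0(\partial_{V_m^\star}^0 f) = 1$ (on $H'$), so that we must have $\partial_{V_m^\star}^0 f = \mathcal{X}^0$, which implies $V_m^\star \ne \emptyset$. In particular, this also means $f \in \mathrm{cl}(V_m^\star)$.

Lemma 36 There is a monotonic function $q(r) = o(1)$ (as $r \to 0$) such that, on event $H'$, for any $k \in \{0, \ldots, \tilde{d}_f - 1\}$, $m \in \mathbb{N}$, $r > 0$, and set $H$ such that $V_m^\star \subseteq H \subseteq B(f, r)$,

$$P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(H)\right) \le q(r).$$

In particular, for $\tau \in \mathbb{N}$ and $\delta > 0$, on $H_\tau(\delta) \cap H'$ (where $H_\tau(\delta)$ is from Lemma 29), every $m \ge \tau$ and $k \in \{0, \ldots, \tilde{d}_f - 1\}$ has $P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(V_m^\star)\right) \le q(\phi(\tau; \delta))$.

Proof Fix any $k \in \{0, \ldots, \tilde{d}_f - 1\}$. By Lemma 35, we know that on event $H'$,

$$P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(H)\right) = \frac{P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(H)\right)}{P^k\left(\mathcal{S}^k(H)\right)} \le \frac{P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(H)\right)}{P^k\left(\partial_H^k f\right)} = \frac{P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(H)\right)}{P^k\left(\partial_C^k f\right)} \le \frac{P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r))\right)}{P^k\left(\partial_C^k f\right)}.$$

Define $q_k(r)$ as this latter quantity. Since $P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r))\right)$ is monotonic in $r$,

$$\lim_{r \to 0} \frac{P^k\left(\bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r))\right)}{P^k\left(\partial_C^k f\right)} = \frac{P^k\left(\bar{\partial}_C^k f \cap \lim_{r \to 0} \mathcal{S}^k(B(f, r))\right)}{P^k\left(\partial_C^k f\right)} = \frac{P^k\left(\bar{\partial}_C^k f \cap \partial_C^k f\right)}{P^k\left(\partial_C^k f\right)} = 0.$$

This proves $q_k(r) = o(1)$. Defining $q(r) = \max\{q_k(r) : k \in \{0, 1, \ldots, \tilde{d}_f - 1\}\} = o(1)$ completes the proof of the first claim. For the final claim, simply recall that by Lemma 29, on $H_\tau(\delta)$, every $m \ge \tau$ has $V_m^\star \subseteq V_\tau^\star \subseteq B(f, \phi(\tau; \delta))$.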
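Lemma 37 For $\zeta \in (0, 1)$, define $r_\zeta = \sup\{r \in (0, 1) : q(r) < \zeta\}/2$. On $H'$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, $\forall \zeta \in (0, 1)$, $\forall m \in \mathbb{N}$, for any set $H$ such that $V_m^\star \subseteq H \subseteq B(f, r_\zeta)$,

$$P\left(x : P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H)\right) > \zeta\right) = P\left(x : P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f\right) > \zeta\right) = 0. \tag{16}$$

In particular, for $\delta \in (0, 1)$, defining $\tau(\zeta; \delta) = \min\{\tau \in \mathbb{N} : \sup_{m \ge \tau} \phi(m; \delta) \le r_\zeta\}$, $\forall \tau \ge \tau(\zeta; \delta)$, and $\forall m \ge \tau$, on $H_\tau(\delta) \cap H'$, (16) holds for $H = V_m^\star$.

Proof Fix $k, m, H$ as described above, and suppose $q = P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(H)\right) < \zeta$; by Lemma 36, this happens on $H'$. Since $\partial_H^k f \subseteq \mathcal{S}^k(H)$, we have that $\forall x \in \mathcal{X}$,

$$\begin{aligned}
P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H)\right)
&= P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f\right) P^k\left(\partial_H^k f \mid \mathcal{S}^k(H)\right) \\
&\quad + P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \cap \bar{\partial}_H^k f\right) P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H)\right).
\end{aligned}$$

Since all probability values are bounded by 1, we have

$$P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H)\right) \le P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f\right) + P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H)\right). \tag{17}$$

Isolating the right-most term in (17), by basic properties of probabilities we have

$$\begin{aligned}
P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H)\right)
&= P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \bar{\partial}_C^k f\right) P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(H)\right) + P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f\right) P^k\left(\partial_C^k f \mid \mathcal{S}^k(H)\right) \\
&\le P^k\left(\bar{\partial}_C^k f \mid \mathcal{S}^k(H)\right) + P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f\right).
\end{aligned} \tag{18}$$

By assumption, the left term in (18) equals $q$. Examining the right term in (18), we see that

$$P^k\left(\bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f\right) = \frac{P^k\left(\mathcal{S}^k(H) \cap \bar{\partial}_H^k f \mid \partial_C^k f\right)}{P^k\left(\mathcal{S}^k(H) \mid \partial_C^k f\right)} \le \frac{P^k\left(\bar{\partial}_H^k f \mid \partial_C^k f\right)}{P^k\left(\partial_H^k f \mid \partial_C^k f\right)}. \tag{19}$$

By Lemma 35, on $H'$ the denominator in (19) is 1 and the numerator is 0. Thus, combining this fact with (17) and (18), we have that on $H'$,

$$P\left(x : P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H)\right) > \zeta\right) \le P\left(x : P^k\left(\bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f\right) > \zeta - q\right). \tag{20}$$

Note that proving the right side of (20) equals zero will suffice to establish the result, since it upper bounds both the first expression of (16) (as just established) and the second expression of (16) (by monotonicity of measures).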
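Letting $X \sim P$ be independent from the other random variables $(Z, W_1, W_2)$, by Markov's inequality, the right side of (20) is at most

$$\frac{1}{\zeta - q} E\left[P^k\left(\bar{\mathcal{S}}^k(H[(X, f(X))]) \mid \partial_H^k f\right) \,\Big|\, H\right] = \frac{E\left[P^k\left(\bar{\mathcal{S}}^k(H[(X, f(X))]) \cap \partial_H^k f\right) \,\Big|\, H\right]}{(\zeta - q)\, P^k\left(\partial_H^k f\right)},$$

and by Fubini's theorem, this is (letting $S \sim P^k$ be independent from the other random variables)

$$\frac{E\left[\mathbb{1}_{\partial_H^k f}(S)\, P\left(x : S \notin \mathcal{S}^k(H[(x, f(x))])\right) \,\Big|\, H\right]}{(\zeta - q)\, P^k\left(\partial_H^k f\right)}.$$

Lemma 35 implies this equals

$$\frac{E\left[\mathbb{1}_{\partial_H^k f}(S)\, P\left(x : S \notin \mathcal{S}^k(H[(x, f(x))])\right) \,\Big|\, H\right]}{(\zeta - q)\, P^k\left(\partial_C^k f\right)}. \tag{21}$$

For any fixed $S \in \partial_H^k f$, there is an infinite sequence of sets $\{\{h_1^{(i)}, h_2^{(i)}, \ldots, h_{2^k}^{(i)}\}\}_{i \in \mathbb{N}}$ with $\forall j \le 2^k$, $P(x : h_j^{(i)}(x) \ne f(x)) \downarrow 0$, such that each $\{h_1^{(i)}, \ldots, h_{2^k}^{(i)}\} \subseteq H$ and shatters $S$. If $H[(x, f(x))]$ does not shatter $S$, then

$$1 = \inf_i \mathbb{1}\left[\exists j : h_j^{(i)} \notin H[(x, f(x))]\right] = \inf_i \mathbb{1}\left[\exists j : h_j^{(i)}(x) \ne f(x)\right].$$

In particular,

$$\begin{aligned}
P\left(x : S \notin \mathcal{S}^k(H[(x, f(x))])\right)
&\le P\left(x : \inf_i \mathbb{1}\left[\exists j : h_j^{(i)}(x) \ne f(x)\right] = 1\right)
= P\left(\bigcap_i \left\{x : \exists j : h_j^{(i)}(x) \ne f(x)\right\}\right) \\
&\le \inf_i P\left(x : \exists j \text{ s.t. } h_j^{(i)}(x) \ne f(x)\right)
\le \lim_{i \to \infty} \sum_{j \le 2^k} P\left(x : h_j^{(i)}(x) \ne f(x)\right) \\
&= \sum_{j \le 2^k} \lim_{i \to \infty} P\left(x : h_j^{(i)}(x) \ne f(x)\right) = 0.
\end{aligned}$$

Thus (21) is zero, which establishes the result. The final claim is then implied by Lemma 29 and monotonicity of $V_m^\star$ in $m$: that is, on $H_\tau(\delta)$, $V_m^\star \subseteq V_\tau^\star \subseteq B(f, \phi(\tau; \delta)) \subseteq B(f, r_\zeta)$.

Lemma 38 For any $\zeta \in (0, 1)$, there are values $\{\Delta_n^{(\zeta)}(\varepsilon) : n \in \mathbb{N}, \varepsilon \in (0, 1)\}$ such that, for any $n \in \mathbb{N}$ and $\varepsilon > 0$, on event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, letting $V = V_{\lfloor n/3 \rfloor}^\star$,

$$P\left(x : P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \ge \zeta\right) \le \Delta_n^{(\zeta)}(\varepsilon),$$

and for any $\mathbb{N}$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon) = o(1)$.

Proof Throughout, we suppose the event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, and fix some $\zeta \in (0, 1)$.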
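We have $\forall x$,

$$\begin{aligned}
&P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \\
&= P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f\right) P^{\tilde{d}_f - 1}\left(\partial_C^{\tilde{d}_f - 1} f \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \\
&\quad + P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \bar{\partial}_C^{\tilde{d}_f - 1} f\right) P^{\tilde{d}_f - 1}\left(\bar{\partial}_C^{\tilde{d}_f - 1} f \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \\
&\le P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f\right) + P^{\tilde{d}_f - 1}\left(\bar{\partial}_C^{\tilde{d}_f - 1} f \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right).
\end{aligned} \tag{22}$$

By Lemma 35, the left term in (22) equals

$$\begin{aligned}
&P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f\right) P^{\tilde{d}_f - 1}\left(\mathcal{S}^{\tilde{d}_f - 1}(V) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right) \\
&= P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right),
\end{aligned}$$

and by Lemma 36, the right term in (22) is at most $q(\phi(\lfloor n/3 \rfloor; \varepsilon/2))$. Thus, we have

$$\begin{aligned}
&P\left(x : P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \ge \zeta\right) \\
&\le P\left(x : P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right) \ge \zeta - q(\phi(\lfloor n/3 \rfloor; \varepsilon/2))\right).
\end{aligned} \tag{23}$$

For $n < 3\tau(\zeta/2; \varepsilon/2)$ (for $\tau(\cdot; \cdot)$ defined in Lemma 37), we define $\Delta_n^{(\zeta)}(\varepsilon) = 1$. Otherwise, suppose $n \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $q(\phi(\lfloor n/3 \rfloor; \varepsilon/2)) < \zeta/2$, and thus (23) is at most

$$P\left(x : P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right) \ge \zeta/2\right).$$

By Lemma 29, this is at most

$$P\left(x : P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right) \ge \zeta/2\right).$$

Letting $X \sim P$, by Markov's inequality this is at most

$$\begin{aligned}
&\frac{2}{\zeta} E\left[P^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{X\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \,\Big|\, \partial_C^{\tilde{d}_f - 1} f\right)\right] \\
&= \frac{2}{\zeta \tilde{\delta}_f} P^{\tilde{d}_f}\left(S \cup \{x\} \in \mathcal{X}^{\tilde{d}_f} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \text{ and } S \in \partial_C^{\tilde{d}_f - 1} f\right) \\
&\le \frac{2}{\zeta \tilde{\delta}_f} P^{\tilde{d}_f}\left(\mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2)))\right).
\end{aligned} \tag{24}$$

Thus, defining $\Delta_n^{(\zeta)}(\varepsilon)$ as (24) for $n \ge 3\tau(\zeta/2; \varepsilon/2)$ establishes the first claim. It remains only to prove the second claim. Let $N(\varepsilon) = \omega(\log(1/\varepsilon))$.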
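Since $\tau(\zeta/2; \varepsilon/2) \le \frac{4}{r_{\zeta/2}}\left(d \ln\frac{4e}{r_{\zeta/2}} + \ln\frac{4}{\varepsilon}\right) = O(\log(1/\varepsilon))$, we have that for all sufficiently small $\varepsilon > 0$, $N(\varepsilon) \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon)$ equals (24) (with $n = N(\varepsilon)$). Furthermore, since $\tilde{\delta}_f > 0$, while $P^{\tilde{d}_f}\left(\partial_C^{\tilde{d}_f} f\right) = 0$, and $\phi(\lfloor N(\varepsilon)/3 \rfloor; \varepsilon/2) = o(1)$, by continuity of probability measures we know (24) is $o(1)$ when $n = N(\varepsilon)$, so that we generally have $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon) = o(1)$.

For any $m \in \mathbb{N}$, define $\tilde{M}(m) = m^3 \tilde{\delta}_f / 2$.

Lemma 39 There is a $(C, P, f)$-dependent constant $c^{(i)} \in (0, \infty)$ such that, for any $\tau \in \mathbb{N}$ there is an event $H_\tau^{(i)} \subseteq H'$ with

$$P\left(H_\tau^{(i)}\right) \ge 1 - c^{(i)} \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}$$

such that on $H_\tau^{(i)}$, if $\tilde{d}_f \ge 2$, then $\forall k \in \{2, \ldots, \tilde{d}_f\}$, $\forall m \ge \tau$, $\forall \ell \in \mathbb{N}$, for any set $H$ such that $V_\ell^\star \subseteq H \subseteq C$,

$$M_m^{(k)}(H) \ge \tilde{M}(m).$$

Proof On $H'$, Lemma 35 implies every $\mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \ge \mathbb{1}_{\partial_H^{k-1} f}\left(S_i^{(k)}\right) = \mathbb{1}_{\partial_C^{k-1} f}\left(S_i^{(k)}\right)$, so we focus on showing $\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| \ge \tilde{M}(m)$ on an appropriate event. We know

$$\begin{aligned}
&P\left(\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau, \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| \ge \tilde{M}(m)\right) \\
&= 1 - P\left(\exists k \in \{2, \ldots, \tilde{d}_f\}, m \ge \tau : \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| < \tilde{M}(m)\right) \\
&\ge 1 - \sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} P\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| < \tilde{M}(m)\right),
\end{aligned}$$

where the last line follows by a union bound. Thus, we will focus on bounding

$$\sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} P\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| < \tilde{M}(m)\right). \tag{25}$$

Fix any $k \in \{2, \ldots, \tilde{d}_f\}$, and integer $m \ge \tau$. Since

$$E\left[\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right|\right] = P^{k-1}\left(\partial_C^{k-1} f\right) m^3 \ge \tilde{\delta}_f m^3,$$

a Chernoff bound implies that

$$P\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| < \tilde{M}(m)\right) \le \exp\left\{-m^3 P^{k-1}\left(\partial_C^{k-1} f\right)/8\right\} \le \exp\left\{-m^3 \tilde{\delta}_f / 8\right\}.$$

Thus, we have that (25) is at most

$$\begin{aligned}
\sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \exp\left\{-m^3 \tilde{\delta}_f/8\right\}
&\le \sum_{m \ge \tau} \tilde{d}_f \cdot \exp\left\{-m^3 \tilde{\delta}_f/8\right\}
\le \sum_{m \ge \tau^3} \tilde{d}_f \cdot \exp\left\{-m \tilde{\delta}_f/8\right\} \\
&\le \tilde{d}_f \cdot \exp\left\{-\tilde{M}(\tau)/4\right\} + \tilde{d}_f \cdot \int_{\tau^3}^{\infty} \exp\left\{-x \tilde{\delta}_f/8\right\} dx \\
&= \tilde{d}_f \cdot \left(1 + 8/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}
\le \left(9 \tilde{d}_f / \tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}.
\end{aligned}$$

Note that since $P(H') = 1$, defining

$$H_\tau^{(i)} = \left\{\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau, \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial_C^{k-1} f\right| \ge \tilde{M}(m)\right\} \cap H'$$

has the required properties.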
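Lemma 40 For any $\tau \in \mathbb{N}$, there is an event $G_\tau^{(i)}$ with

$$P\left(H_\tau^{(i)} \setminus G_\tau^{(i)}\right) \le \left(121 \tilde{d}_f / \tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}$$

such that, on $G_\tau^{(i)}$, if $\tilde{d}_f \ge 2$, then for every integer $s \ge \tau$ and $k \in \{2, \ldots, \tilde{d}_f\}$, $\forall r \in (0, r_{1/6}]$,

$$M_s^{(k)}(B(f, r)) \le (3/2) \left|\left\{S_i^{(k)} : i \le s^3\right\} \cap \partial_C^{k-1} f\right|.$$

Proof Fix integers $s \ge \tau$ and $k \in \{2, \ldots, \tilde{d}_f\}$, and let $r = r_{1/6}$. Define the set $\hat{\mathcal{S}}^{k-1} = \left\{S_i^{(k)} : i \le s^3\right\} \cap \mathcal{S}^{k-1}(B(f, r))$. Note $\left|\hat{\mathcal{S}}^{k-1}\right| = M_s^{(k)}(B(f, r))$ and the elements of $\hat{\mathcal{S}}^{k-1}$ are conditionally i.i.d. given $M_s^{(k)}(B(f, r))$, each with conditional distribution equivalent to the conditional distribution of $S_1^{(k)}$ given $\left\{S_1^{(k)} \in \mathcal{S}^{k-1}(B(f, r))\right\}$. In particular,

$$E\left[\left|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\right| \,\Big|\, M_s^{(k)}(B(f, r))\right] = P^{k-1}\left(\partial_C^{k-1} f \mid \mathcal{S}^{k-1}(B(f, r))\right) M_s^{(k)}(B(f, r)).$$

Define the event

$$G_\tau^{(i)}(k, s) = \left\{\left|\hat{\mathcal{S}}^{k-1}\right| \le (3/2) \left|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\right|\right\}.$$

By Lemma 36 (indeed by definition of $q(r)$ and $r_{1/6}$) we have

$$\begin{aligned}
&1 - P\left(G_\tau^{(i)}(k, s) \,\Big|\, M_s^{(k)}(B(f, r))\right) \\
&= P\left(\left|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\right| < (2/3) M_s^{(k)}(B(f, r)) \,\Big|\, M_s^{(k)}(B(f, r))\right) \\
&\le P\left(\left|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\right| < (4/5)(1 - q(r)) M_s^{(k)}(B(f, r)) \,\Big|\, M_s^{(k)}(B(f, r))\right) \\
&\le P\left(\left|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\right| < (4/5) P^{k-1}\left(\partial_C^{k-1} f \mid \mathcal{S}^{k-1}(B(f, r))\right) M_s^{(k)}(B(f, r)) \,\Big|\, M_s^{(k)}(B(f, r))\right).
\end{aligned} \tag{26}$$

By a Chernoff bound, (26) is at most

$$\exp\left\{-M_s^{(k)}(B(f, r))\, P^{k-1}\left(\partial_C^{k-1} f \mid \mathcal{S}^{k-1}(B(f, r))\right)/50\right\} \le \exp\left\{-M_s^{(k)}(B(f, r))(1 - q(r))/50\right\} \le \exp\left\{-M_s^{(k)}(B(f, r))/60\right\}.$$

Thus, by Lemma 39,

$$\begin{aligned}
P\left(H_\tau^{(i)} \setminus G_\tau^{(i)}(k, s)\right)
&\le P\left(\left\{M_s^{(k)}(B(f, r)) \ge \tilde{M}(s)\right\} \setminus G_\tau^{(i)}(k, s)\right) \\
&= E\left[\left(1 - P\left(G_\tau^{(i)}(k, s) \,\Big|\, M_s^{(k)}(B(f, r))\right)\right) \mathbb{1}_{[\tilde{M}(s), \infty)}\left(M_s^{(k)}(B(f, r))\right)\right] \\
&\le E\left[\exp\left\{-M_s^{(k)}(B(f, r))/60\right\} \mathbb{1}_{[\tilde{M}(s), \infty)}\left(M_s^{(k)}(B(f, r))\right)\right]
\le \exp\left\{-\tilde{M}(s)/60\right\}.
\end{aligned}$$

Now defining $G_\tau^{(i)} = \bigcap_{s \ge \tau} \bigcap_{k=2}^{\tilde{d}_f} G_\tau^{(i)}(k, s)$, a union bound implies

$$\begin{aligned}
P\left(H_\tau^{(i)} \setminus G_\tau^{(i)}\right)
&\le \sum_{s \ge \tau} \tilde{d}_f \cdot \exp\left\{-\tilde{M}(s)/60\right\}
\le \tilde{d}_f \left(\exp\left\{-\tilde{M}(\tau)/60\right\} + \int_{\tau^3}^{\infty} \exp\left\{-x \tilde{\delta}_f/120\right\} dx\right) \\
&= \tilde{d}_f \left(1 + 120/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}
\le \left(121 \tilde{d}_f / \tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}.
\end{aligned}$$

This completes the proof for $r = r_{1/6}$. Monotonicity extends the result to any $r \in (0, r_{1/6}]$.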
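Lemma 41 There exist $(C, P, f, \gamma)$-dependent constants $\tau^* \in \mathbb{N}$ and $c^{(ii)} \in (0, \infty)$ such that, for any integer $\tau \ge \tau^*$, there is an event $H_\tau^{(ii)} \subseteq G_\tau^{(i)}$ with

$$P\left(H_\tau^{(i)} \setminus H_\tau^{(ii)}\right) \le c^{(ii)} \cdot \exp\left\{-\tilde{M}(\tau)^{1/3}/60\right\} \tag{27}$$

such that, on $H_\tau^{(i)} \cap H_\tau^{(ii)}$, $\forall s, m, \ell, k \in \mathbb{N}$ with $\ell < m$ and $k \le \tilde{d}_f$, for any set of classifiers $H$ with $V_\ell^\star \subseteq H$, if either $k = 1$, or $s \ge \tau$ and $H \subseteq B(f, r_{(1-\gamma)/6})$, then

$$\hat{\Delta}_s^{(k)}(X_m, W_2, H) < \gamma \implies \hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H) < \hat{\Gamma}_s^{(k)}(X_m, f(X_m), W_2, H).$$

In particular, for $\delta \in (0, 1)$ and $\tau \ge \max\{\tau((1 - \gamma)/6; \delta), \tau^*\}$, on $H_\tau(\delta) \cap H_\tau^{(i)} \cap H_\tau^{(ii)}$, this is true for $H = V_\ell^\star$ for every $k, \ell, m, s \in \mathbb{N}$ satisfying $\tau \le \ell < m$, $\tau \le s$, and $k \le \tilde{d}_f$.

Proof Let $\tau^* = (6/(1 - \gamma)) \cdot (2/\tilde{\delta}_f)^{1/3}$, and consider any $\tau, k, \ell, m, s, H$ as described above. If $k = 1$, the result clearly holds. In particular, Lemma 35 implies that on $H_\tau^{(i)}$, $H[(X_m, f(X_m))] \supseteq V_m^\star \ne \emptyset$, so that some $h \in H$ has $h(X_m) = f(X_m)$, and therefore

$$\hat{\Gamma}_s^{(1)}(X_m, -f(X_m), W_2, H) = \mathbb{1}_{\bigcap_{h \in H}\{h(X_m)\}}(-f(X_m)) = 0,$$

and since $\hat{\Delta}_s^{(1)}(X_m, W_2, H) = \mathbb{1}_{\mathrm{DIS}(H)}(X_m)$, if $\hat{\Delta}_s^{(1)}(X_m, W_2, H) < \gamma$, then since $\gamma < 1$ we have $X_m \notin \mathrm{DIS}(H)$, so that

$$\hat{\Gamma}_s^{(1)}(X_m, f(X_m), W_2, H) = \mathbb{1}_{\bigcap_{h \in H}\{h(X_m)\}}(f(X_m)) = 1.$$

Otherwise, suppose $2 \le k \le \tilde{d}_f$. Note that on $H_\tau^{(i)} \cap G_\tau^{(i)}$, $\forall m \in \mathbb{N}$, and any $H$ with $V_\ell^\star \subseteq H \subseteq B(f, r_{(1-\gamma)/6})$ for some $\ell \in \mathbb{N}$,

$$\begin{aligned}
\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H)
&= \frac{1}{M_s^{(k)}(H)} \sum_{i=1}^{s^3} \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(H[(X_m, f(X_m))])}\left(S_i^{(k)}\right) \mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \\
&\le \frac{1}{\left|\left\{S_i^{(k)} : i \le s^3\right\} \cap \partial_H^{k-1} f\right|} \sum_{i=1}^{s^3} \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(V_m^\star)}\left(S_i^{(k)}\right) \mathbb{1}_{\mathcal{S}^{k-1}(B(f, r_{(1-\gamma)/6}))}\left(S_i^{(k)}\right) && \text{(monotonicity)} \\
&\le \frac{1}{\left|\left\{S_i^{(k)} : i \le s^3\right\} \cap \partial_H^{k-1} f\right|} \sum_{i=1}^{s^3} \mathbb{1}_{\bar{\partial}_{V_m^\star}^{k-1} f}\left(S_i^{(k)}\right) \mathbb{1}_{\mathcal{S}^{k-1}(B(f, r_{(1-\gamma)/6}))}\left(S_i^{(k)}\right) && \text{(monotonicity)} \\
&= \frac{1}{\left|\left\{S_i^{(k)} : i \le s^3\right\} \cap \partial_C^{k-1} f\right|} \sum_{i=1}^{s^3} \mathbb{1}_{\bar{\partial}_C^{k-1} f}\left(S_i^{(k)}\right) \mathbb{1}_{\mathcal{S}^{k-1}(B(f, r_{(1-\gamma)/6}))}\left(S_i^{(k)}\right) && \text{(Lemma 35)} \\
&\le \frac{3}{2 M_s^{(k)}(B(f, r_{(1-\gamma)/6}))} \sum_{i=1}^{s^3} \mathbb{1}_{\bar{\partial}_C^{k-1} f}\left(S_i^{(k)}\right) \mathbb{1}_{\mathcal{S}^{k-1}(B(f, r_{(1-\gamma)/6}))}\left(S_i^{(k)}\right). && \text{(Lemma 40)}
\end{aligned}$$

For brevity, let $\hat{\Gamma}$ denote this last quantity, and let $M_{ks} = M_s^{(k)}\left(B\left(f, r_{(1-\gamma)/6}\right)\right)$.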
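By Hoeffding's inequality, we have

$$P\left((2/3)\hat{\Gamma} > P^{k-1}\left(\bar{\partial}_C^{k-1} f \mid \mathcal{S}^{k-1}\left(B\left(f, r_{(1-\gamma)/6}\right)\right)\right) + M_{ks}^{-1/3} \,\Big|\, M_{ks}\right) \le \exp\left\{-2 M_{ks}^{1/3}\right\}.$$

Thus, by Lemmas 36, 39, and 40,

$$\begin{aligned}
&P\left(\left\{(2/3)\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H) > q\left(r_{(1-\gamma)/6}\right) + \tilde{M}(s)^{-1/3}\right\} \cap H_\tau^{(i)} \cap G_\tau^{(i)}\right) \\
&\le P\left(\left\{(2/3)\hat{\Gamma} > P^{k-1}\left(\bar{\partial}_C^{k-1} f \mid \mathcal{S}^{k-1}\left(B\left(f, r_{(1-\gamma)/6}\right)\right)\right) + \tilde{M}(s)^{-1/3}\right\} \cap H_\tau^{(i)}\right) \\
&\le P\left(\left\{(2/3)\hat{\Gamma} > P^{k-1}\left(\bar{\partial}_C^{k-1} f \mid \mathcal{S}^{k-1}\left(B\left(f, r_{(1-\gamma)/6}\right)\right)\right) + M_{ks}^{-1/3}\right\} \cap \left\{M_{ks} \ge \tilde{M}(s)\right\}\right) \\
&= E\left[P\left((2/3)\hat{\Gamma} > P^{k-1}\left(\bar{\partial}_C^{k-1} f \mid \mathcal{S}^{k-1}\left(B\left(f, r_{(1-\gamma)/6}\right)\right)\right) + M_{ks}^{-1/3} \,\Big|\, M_{ks}\right) \mathbb{1}_{[\tilde{M}(s), \infty)}(M_{ks})\right] \\
&\le E\left[\exp\left\{-2 M_{ks}^{1/3}\right\} \mathbb{1}_{[\tilde{M}(s), \infty)}(M_{ks})\right] \le \exp\left\{-2 \tilde{M}(s)^{1/3}\right\}.
\end{aligned}$$

Thus, there is an event $H_\tau^{(ii)}(k, s)$ with $P\left(H_\tau^{(i)} \cap G_\tau^{(i)} \setminus H_\tau^{(ii)}(k, s)\right) \le \exp\left\{-2 \tilde{M}(s)^{1/3}\right\}$ such that

$$\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H) \le (3/2)\left(q\left(r_{(1-\gamma)/6}\right) + \tilde{M}(s)^{-1/3}\right)$$

holds for these particular values of $k$ and $s$.

To extend to the full range of values, we simply take $H_\tau^{(ii)} = G_\tau^{(i)} \cap \bigcap_{s \ge \tau} \bigcap_{k \le \tilde{d}_f} H_\tau^{(ii)}(k, s)$. Since $\tau \ge (2/\tilde{\delta}_f)^{1/3}$, we have $\tilde{M}(\tau) \ge 1$, so a union bound implies

$$\begin{aligned}
P\left(H_\tau^{(i)} \cap G_\tau^{(i)} \setminus H_\tau^{(ii)}\right)
&\le \sum_{s \ge \tau} \tilde{d}_f \cdot \exp\left\{-2 \tilde{M}(s)^{1/3}\right\}
\le \tilde{d}_f \cdot \left(\exp\left\{-2 \tilde{M}(\tau)^{1/3}\right\} + \int_\tau^\infty \exp\left\{-2 \tilde{M}(x)^{1/3}\right\} dx\right) \\
&= \tilde{d}_f \left(1 + 2^{-2/3} \tilde{\delta}_f^{-1/3}\right) \cdot \exp\left\{-2 \tilde{M}(\tau)^{1/3}\right\}
\le 2 \tilde{d}_f \tilde{\delta}_f^{-1/3} \cdot \exp\left\{-2 \tilde{M}(\tau)^{1/3}\right\}.
\end{aligned}$$

Then Lemma 40 and a union bound imply

$$P\left(H_\tau^{(i)} \setminus H_\tau^{(ii)}\right) \le 2 \tilde{d}_f \tilde{\delta}_f^{-1/3} \cdot \exp\left\{-2 \tilde{M}(\tau)^{1/3}\right\} + 121 \tilde{d}_f \tilde{\delta}_f^{-1} \cdot \exp\left\{-\tilde{M}(\tau)/60\right\} \le 123 \tilde{d}_f \tilde{\delta}_f^{-1} \cdot \exp\left\{-\tilde{M}(\tau)^{1/3}/60\right\}.$$

On $H_\tau^{(i)} \cap H_\tau^{(ii)}$, every such $s, m, \ell, k$ and $H$ satisfy

$$\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H) \le (3/2)\left(q\left(r_{(1-\gamma)/6}\right) + \tilde{M}(s)^{-1/3}\right) < (3/2)\left((1 - \gamma)/6 + (1 - \gamma)/6\right) = (1 - \gamma)/2, \tag{28}$$

where the second inequality follows by definition of $r_{(1-\gamma)/6}$ and $s \ge \tau \ge \tau^*$.

If $\hat{\Delta}_s^{(k)}(X_m, W_2, H) < \gamma$, then

$$1 - \gamma < 1 - \hat{\Delta}_s^{(k)}(X_m, W_2, H) = \frac{1}{M_s^{(k)}(H)} \sum_{i=1}^{s^3} \mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \mathbb{1}_{\bar{\mathcal{S}}^k(H)}\left(S_i^{(k)} \cup \{X_m\}\right). \tag{29}$$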
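Finally, noting that we always have

$$\mathbb{1}_{\bar{\mathcal{S}}^k(H)}\left(S_i^{(k)} \cup \{X_m\}\right) \le \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(H[(X_m, f(X_m))])}\left(S_i^{(k)}\right) + \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(H[(X_m, -f(X_m))])}\left(S_i^{(k)}\right),$$

we have that, on the event $H_\tau^{(i)} \cap H_\tau^{(ii)}$, if $\hat{\Delta}_s^{(k)}(X_m, W_2, H) < \gamma$, then

$$\begin{aligned}
\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H)
&< (1 - \gamma)/2 = -(1 - \gamma)/2 + (1 - \gamma) && \text{by (28)} \\
&< -(1 - \gamma)/2 + \frac{1}{M_s^{(k)}(H)} \sum_{i=1}^{s^3} \mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \mathbb{1}_{\bar{\mathcal{S}}^k(H)}\left(S_i^{(k)} \cup \{X_m\}\right) && \text{by (29)} \\
&\le -(1 - \gamma)/2 + \frac{1}{M_s^{(k)}(H)} \sum_{i=1}^{s^3} \mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(H[(X_m, f(X_m))])}\left(S_i^{(k)}\right) \\
&\quad + \frac{1}{M_s^{(k)}(H)} \sum_{i=1}^{s^3} \mathbb{1}_{\mathcal{S}^{k-1}(H)}\left(S_i^{(k)}\right) \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(H[(X_m, -f(X_m))])}\left(S_i^{(k)}\right) \\
&= -(1 - \gamma)/2 + \hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, H) + \hat{\Gamma}_s^{(k)}(X_m, f(X_m), W_2, H) \\
&< \hat{\Gamma}_s^{(k)}(X_m, f(X_m), W_2, H). && \text{by (28)}
\end{aligned}$$

The final claim in the lemma statement is then implied by Lemma 29, since we have $V_\ell^\star \subseteq V_\tau^\star \subseteq B(f, \phi(\tau; \delta)) \subseteq B\left(f, r_{(1-\gamma)/6}\right)$ on $H_\tau(\delta)$.

For any $k, \ell, m \in \mathbb{N}$, and any $x \in \mathcal{X}$, define

$$\hat{p}_x(k, \ell, m) = \hat{\Delta}_m^{(k)}(x, W_2, V_\ell^\star), \qquad p_x(k, \ell) = P^{k-1}\left(S \in \mathcal{X}^{k-1} : S \cup \{x\} \in \mathcal{S}^k(V_\ell^\star) \,\Big|\, \mathcal{S}^{k-1}(V_\ell^\star)\right).$$

Lemma 42 For any $\zeta \in (0, 1)$, there is a $(C, P, f, \zeta)$-dependent constant $c^{(iii)}(\zeta) \in (0, \infty)$ such that, for any $\tau \in \mathbb{N}$, there is an event $H_\tau^{(iii)}(\zeta)$ with

$$P\left(H_\tau^{(i)} \setminus H_\tau^{(iii)}(\zeta)\right) \le c^{(iii)}(\zeta) \cdot \exp\left\{-\zeta^2 \tilde{M}(\tau)\right\}$$

such that on $H_\tau^{(i)} \cap H_\tau^{(iii)}(\zeta)$, $\forall k, \ell, m \in \mathbb{N}$ with $\tau \le \ell \le m$ and $k \le \tilde{d}_f$, for any $x \in \mathcal{X}$,

$$P\left(x : |p_x(k, \ell) - \hat{p}_x(k, \ell, m)| > \zeta\right) \le \exp\left\{-\zeta^2 \tilde{M}(m)\right\}.$$

Proof Fix any $k, \ell, m \in \mathbb{N}$ with $\tau \le \ell \le m$ and $k \le \tilde{d}_f$. Recall our convention that $\mathcal{X}^0 = \{\emptyset\}$ and $P^0(\mathcal{X}^0) = 1$; thus, if $k = 1$, $\hat{p}_x(k, \ell, m) = \mathbb{1}_{\mathrm{DIS}(V_\ell^\star)}(x) = \mathbb{1}_{\mathcal{S}^1(V_\ell^\star)}(x) = p_x(k, \ell)$, so the result clearly holds for $k = 1$.

For the remaining case, suppose $2 \le k \le \tilde{d}_f$. To simplify notation, let $\tilde{m} = M_m^{(k)}(V_\ell^\star)$, $X = X_{\ell+1}$, $p_x = p_x(k, \ell)$ and $\hat{p}_x = \hat{p}_x(k, \ell, m)$. Consider the event

$$H^{(iii)}(k, \ell, m, \zeta) = \left\{P\left(x : |p_x - \hat{p}_x| > \zeta\right) \le \exp\left\{-\zeta^2 \tilde{M}(m)\right\}\right\}.$$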
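We have

$$\begin{aligned}
&P\left(H_\tau^{(i)} \setminus H^{(iii)}(k, \ell, m, \zeta) \,\Big|\, V_\ell^\star\right) && (30) \\
&\le P\left(\left\{\tilde{m} \ge \tilde{M}(m)\right\} \setminus H^{(iii)}(k, \ell, m, \zeta) \,\Big|\, V_\ell^\star\right) && \text{(by Lemma 39)} \\
&= P\left(\left\{\tilde{m} \ge \tilde{M}(m)\right\} \cap \left\{P\left(e^{s \tilde{m} |p_X - \hat{p}_X|} > e^{s \tilde{m} \zeta} \,\Big|\, W_2, V_\ell^\star\right) > e^{-\zeta^2 \tilde{M}(m)}\right\} \,\Big|\, V_\ell^\star\right), && (31)
\end{aligned}$$

for any value $s > 0$. Proceeding as in Chernoff's bounding technique, by Markov's inequality (31) is at most

$$\begin{aligned}
&P\left(\left\{\tilde{m} \ge \tilde{M}(m)\right\} \cap \left\{e^{-s \tilde{m} \zeta} E\left[e^{s \tilde{m} |p_X - \hat{p}_X|} \,\Big|\, W_2, V_\ell^\star\right] > e^{-\zeta^2 \tilde{M}(m)}\right\} \,\Big|\, V_\ell^\star\right) \\
&\le P\left(\left\{\tilde{m} \ge \tilde{M}(m)\right\} \cap \left\{e^{-s \tilde{m} \zeta} E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, W_2, V_\ell^\star\right] > e^{-\zeta^2 \tilde{M}(m)}\right\} \,\Big|\, V_\ell^\star\right) \\
&= E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, P\left(e^{-s \tilde{m} \zeta} E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, W_2, V_\ell^\star\right] > e^{-\zeta^2 \tilde{M}(m)} \,\Big|\, \tilde{m}, V_\ell^\star\right) \,\Big|\, V_\ell^\star\right].
\end{aligned}$$

By Markov's inequality, this is at most

$$\begin{aligned}
&E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, e^{\zeta^2 \tilde{M}(m)} E\left[e^{-s \tilde{m} \zeta} E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, W_2, V_\ell^\star\right] \,\Big|\, \tilde{m}, V_\ell^\star\right] \,\Big|\, V_\ell^\star\right] \\
&= E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, e^{\zeta^2 \tilde{M}(m)} e^{-s \tilde{m} \zeta} E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, \tilde{m}, V_\ell^\star\right] \,\Big|\, V_\ell^\star\right] \\
&= E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, e^{\zeta^2 \tilde{M}(m)} e^{-s \tilde{m} \zeta} E\left[E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] \,\Big|\, \tilde{m}, V_\ell^\star\right] \,\Big|\, V_\ell^\star\right].
\end{aligned} \tag{32}$$

The conditional distribution of $\tilde{m} \hat{p}_X$ given $(X, \tilde{m}, V_\ell^\star)$ is $\mathrm{Binomial}(\tilde{m}, p_X)$, so letting $\{B_j(p_X)\}_{j=1}^{\infty}$ denote a sequence of random variables, conditionally independent given $(X, \tilde{m}, V_\ell^\star)$, with the conditional distribution of each $B_j(p_X)$ being $\mathrm{Bernoulli}(p_X)$ given $(X, \tilde{m}, V_\ell^\star)$, we have

$$\begin{aligned}
&E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} + e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] \\
&= E\left[e^{s \tilde{m} (p_X - \hat{p}_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] + E\left[e^{s \tilde{m} (\hat{p}_X - p_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] \\
&= E\left[\prod_{i=1}^{\tilde{m}} e^{s (p_X - B_i(p_X))} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] + E\left[\prod_{i=1}^{\tilde{m}} e^{s (B_i(p_X) - p_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right] \\
&= E\left[e^{s (p_X - B_1(p_X))} \,\Big|\, X, \tilde{m}, V_\ell^\star\right]^{\tilde{m}} + E\left[e^{s (B_1(p_X) - p_X)} \,\Big|\, X, \tilde{m}, V_\ell^\star\right]^{\tilde{m}}.
\end{aligned} \tag{33}$$

It is known that for $B \sim \mathrm{Bernoulli}(p)$, $E\left[e^{s(B - p)}\right]$ and $E\left[e^{s(p - B)}\right]$ are at most $e^{s^2/8}$ (see, e.g., Lemma 8.1 of Devroye, Györfi, and Lugosi, 1996). Thus, taking $s = 4\zeta$, (33) is at most $2 e^{2 \tilde{m} \zeta^2}$, and (32) is at most

$$E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, 2 e^{\zeta^2 \tilde{M}(m)} e^{-4 \tilde{m} \zeta^2} e^{2 \tilde{m} \zeta^2} \,\Big|\, V_\ell^\star\right] = E\left[\mathbb{1}_{[\tilde{M}(m), \infty)}(\tilde{m})\, 2 e^{\zeta^2 \tilde{M}(m)} e^{-2 \tilde{m} \zeta^2} \,\Big|\, V_\ell^\star\right] \le 2 \exp\left\{-\zeta^2 \tilde{M}(m)\right\}.$$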
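The Bernoulli moment generating function bound used in this Chernoff-style argument is easy to sanity-check numerically. The Python sketch below evaluates $E[e^{s(B-p)}]$ exactly for $B \sim \mathrm{Bernoulli}(p)$ over a grid of $p$ and $s$ values (negative $s$ covers $E[e^{s(p-B)}]$) and compares it with $e^{s^2/8}$; it also spells out the arithmetic behind the choice $s = 4\zeta$. The function names, grid sizes, and tolerance are illustrative assumptions, and a grid check is of course not a proof.

```python
import math


def centered_mgf(p, s):
    """E[exp(s * (B - p))] for B ~ Bernoulli(p), computed exactly."""
    return p * math.exp(s * (1.0 - p)) + (1.0 - p) * math.exp(-s * p)


def worst_ratio(num_p=101, num_s=161, s_max=8.0):
    """Largest value of E[exp(s(B - p))] / exp(s^2 / 8) over a grid of (p, s)."""
    worst = 0.0
    for i in range(num_p):
        p = i / (num_p - 1)
        for j in range(num_s):
            s = -s_max + 2.0 * s_max * j / (num_s - 1)
            ratio = centered_mgf(p, s) / math.exp(s * s / 8.0)
            worst = max(worst, ratio)
    return worst


def two_sided_bound(m_tilde, zeta):
    """Bound 2 * exp(m_tilde * s^2 / 8) with s = 4 * zeta, i.e. 2 * exp(2 * m_tilde * zeta^2)."""
    s = 4.0 * zeta
    return 2.0 * math.exp(m_tilde * s * s / 8.0)


if __name__ == "__main__":
    print(worst_ratio())            # should not exceed 1 (up to rounding)
    print(two_sided_bound(10, 0.1))
```

With $s = 4\zeta$ the exponent $\tilde{m} s^2 / 8$ becomes $2\tilde{m}\zeta^2$, which is exactly what cancels against $e^{-s\tilde{m}\zeta} = e^{-4\tilde{m}\zeta^2}$ to leave $e^{-2\tilde{m}\zeta^2}$.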
Since this bound holds for (30), the law of total probability implies (i) (i) P Hτ \ H (iii) (k, ℓ, m, ζ ) = E P Hτ \ H (iii) (k, ℓ, m, ζ ) Vℓ⋆ 1548 ˜ ≤ 2 · exp −ζ 2 M(m) . ACTIVIZED L EARNING d˜ (iii) f Defining Hτ (ζ ) = ℓ≥τ m≥ℓ k=2 H (iii) (k, ℓ, m, ζ ), we have the required property for the claimed ranges of k, ℓ and m, and a union bound implies (iii) (i) P Hτ \ Hτ (ζ ) ≤ ≤ 2d˜f · = 2d˜f · ℓ≥τ ℓ≥τ ℓ≥τ m≥ℓ ˜ 2d˜f · exp −ζ 2 M(m) ˜ exp −ζ 2 M(ℓ) + ∞ ℓ3 ˜ exp −xζ 2 δ f /2 dx ˜ ˜ 1 + 2ζ −2 δ f−1 · exp −ζ 2 M(ℓ) ˜ ˜ ≤ 2d˜f · 1 + 2ζ −2 δ f−1 · exp −ζ 2 M(τ ) + ˜ = 2d˜f · 1 + 2ζ −2 δ f−1 2 ∞ τ3 ˜ exp −xζ 2 δ f /2 dx ˜ · exp −ζ 2 M(τ ) ˜ ˜ ≤ 18d˜f ζ −4 δ f−2 · exp −ζ 2 M(τ ) . For k, ℓ, m ∈ N and ζ ∈ (0, 1), define ˆ pζ (k, ℓ, m) = P (x : px (k, ℓ, m) ≥ ζ ) . ¯ (34) √ (i) Lemma 43 For any α , ζ , δ ∈ (0, 1), β ∈ 0, 1 − α , and integer τ ≥ τ (β ; δ ), on Hτ (δ ) ∩ Hτ ∩ (iii) Hτ (β ζ ), for any k, ℓ, ℓ′ , m ∈ N with τ ≤ ℓ ≤ ℓ′ ≤ m and k ≤ d˜f , ˜ pζ (k, ℓ′ , m) ≤ P (x : px (k, ℓ) ≥ αζ ) + exp −β 2 ζ 2 M(m) . ¯ (35) √ Proof Fix any α , ζ , δ ∈ (0, 1), β ∈ 0, 1 − α , τ , k, ℓ, ℓ′ , m ∈ N with τ (β ; δ ) ≤ τ ≤ ℓ ≤ ℓ′ ≤ m and k ≤ d˜f . If k = 1, the result clearly holds. In particular, we have pζ (1, ℓ′ , m) = P (DIS (Vℓ⋆ )) ≤ P (DIS (Vℓ⋆ )) = P (x : px (1, ℓ) ≥ αζ ) . ¯ ′ Otherwise, suppose 2 ≤ k ≤ d˜f . By a union bound, pζ (k, ℓ′ , m) = P x : px (k, ℓ′ , m) ≥ ζ ¯ ˆ √ √ ≤ P x : px (k, ℓ′ ) ≥ αζ + P x : px (k, ℓ′ ) − px (k, ℓ′ , m) > (1 − α )ζ . ˆ (36) Since √ ˆ P x : px (k, ℓ′ ) − px (k, ℓ′ , m) > (1 − α )ζ ≤ P x : px (k, ℓ′ ) − px (k, ℓ′ , m) > β ζ , ˆ (i) (iii) Lemma 42 implies that, on Hτ ∩ Hτ (β ζ ), √ ˜ ˆ P x : px (k, ℓ′ ) − px (k, ℓ′ , m) > (1 − α )ζ ≤ exp −β 2 ζ 2 M(m) . 1549 (37) H ANNEKE It remains only to examine the first term on the right side of (36). 
For this, if P k−1 S k−1 Vℓ⋆ = 0, ′ then the first term is 0 by our aforementioned convention, and thus (35) holds; otherwise, since ∀x ∈ X , S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ⊆ S k−1 (Vℓ⋆ ) , ′ ′ we have = P x : P k−1 √ αζ = P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) S k−1 (Vℓ⋆ ) ≥ ′ ′ √ S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 S k−1 (Vℓ⋆ ) . ′ ′ P x : px (k, ℓ′ ) ≥ √ αζ (38) (i) By Lemma 35 and monotonicity, on Hτ ⊆ H ′ , (38) is at most √ k−1 P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 ∂C f ′ , and monotonicity implies this is at most P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ √ k−1 αζ P k−1 ∂C f . (39) (i) By Lemma 36, for τ ≥ τ (β ; δ ), on Hτ (δ ) ∩ Hτ , √ ¯ k−1 P k−1 ∂C f S k−1 (Vℓ⋆ ) ≤ q(φ (τ ; δ )) < β ≤ 1 − α , which implies k−1 k−1 P k−1 ∂C f ≥ P k−1 ∂C f ∩ S k−1 (Vℓ⋆ ) ¯ k−1 = 1 − P k−1 ∂C f S k−1 (Vℓ⋆ ) P k−1 S k−1 (Vℓ⋆ ) ≥ √ k−1 k−1 ⋆ αP S (Vℓ ) . (i) Altogether, for τ ≥ τ (β ; δ ), on Hτ (δ ) ∩ Hτ , (39) is at most P x : P k−1 S ∈ X k−1 : S∪{x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 S k−1 (Vℓ⋆ ) = P (x : px (k, ℓ) ≥ αζ ), which, combined with (36) and (37), establishes (35). (iv) Lemma 44 There are events Hτ : τ ∈ N with (iv) P Hτ ≥ 1 − 3d˜f · exp {−2τ } s.t. for any ξ ∈ (0, γ /16], δ ∈ (0, 1), letting τ (iv) (ξ ; δ ) = max τ (4ξ /γ ; δ ), 4 ˜ δf ξ2 ln 4 ˜ δf ξ2 1/3 , (i) (iii) (iv) for any integer τ ≥ τ (iv) (ξ ; δ ), on Hτ (δ ) ∩ Hτ ∩ Hτ (ξ ) ∩ Hτ , ∀k ∈ 1, . . . , d˜f , ∀ℓ ∈ N with ℓ ≥ τ, ˆ (k) ˜ P x : px (k, ℓ) ≥ γ /2 + exp −γ 2 M(ℓ)/256 ≤ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ P (x : px (k, ℓ) ≥ γ /8) + 4ℓ−1 . 1550 (40) (41) ACTIVIZED L EARNING Proof For any k, ℓ ∈ N, by Hoeffding’s inequality and the law of total probability, on an event G(iv) (k, ℓ) with P G(iv) (k, ℓ) ≥ 1 − 2 exp {−2ℓ}, we have ℓ3 pγ /4 (k, ℓ, ℓ) − ℓ ¯ (iv) Define the event Hτ = (iv) 1 − P Hτ −3 i=1 ˆ ½[γ /4,∞) ∆(k) (wi ,W2 ,Vℓ⋆ ) ≤ ℓ−1 . ℓ d˜f (iv) (k, ℓ). k=1 G ℓ≥τ ≤ 2d˜f · ℓ≥τ (42) By a union bound, we have exp {−2ℓ} ≤ 2d˜f · exp {−2τ } + ∞ τ exp {−2x} dx = 3d˜f · exp {−2τ } . 
Now fix any ℓ ≥ τ and k ∈ 1, . . . , d˜f . By a union bound, P (x : px (k, ℓ) ≥ γ /2) ≤ P (x : px (k, ℓ, ℓ) ≥ γ /4) + P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > γ /4) . ˆ ˆ (i) (43) (iii) By Lemma 42, on Hτ ∩ Hτ (ξ ), ˜ P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > γ /4) ≤ P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > ξ ) ≤ exp −ξ 2 M(ℓ) . (44) ˆ ˆ (iv) Also, on Hτ , (42) implies P (x : px (k, ℓ, ℓ) ≥ γ /4) = pγ /4 (k, ℓ, ℓ) ˆ ¯ ℓ3 ≤ℓ −1 +ℓ −3 ˆ ½[γ /4,∞) ∆(k) (wi ,W2 ,Vℓ⋆ ) ℓ i=1 ˆ (k) = ∆ℓ (W1 ,W2 ,Vℓ⋆ ) − ℓ−1 . (45) Combining (43) with (44) and (45) yields (k) ˆ ˜ P (x : px (k, ℓ) ≥ γ /2) ≤ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) − ℓ−1 + exp −ξ 2 M(ℓ) . (46) ˜ ˜ For τ ≥ τ (iv) (ξ ; δ ), exp −ξ 2 M(ℓ) − ℓ−1 ≤ − exp −γ 2 M(ℓ)/256 , so that (46) implies the first inequality of the lemma: namely (40). (iv) For the second inequality (i.e., (41)), on Hτ , (42) implies we have ˆ (k) ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ pγ /4 (k, ℓ, ℓ) + 3ℓ−1 . ¯ (47) √ Also, by Lemma 43 (with α = 1/2, ζ = γ /4, β = ξ /ζ < 1 − α ), for τ ≥ τ (iv) (ξ ; δ ), on Hτ (δ ) ∩ (iii) (i) Hτ ∩ Hτ (ξ ), ˜ pγ /4 (k, ℓ, ℓ) ≤ P (x : px (k, ℓ) ≥ γ /8) + exp −ξ 2 M(ℓ) . ¯ (48) Thus, combining (47) with (48) yields ˆ (k) ˜ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ P (x : px (k, ℓ) ≥ γ /8) + 3ℓ−1 + exp −ξ 2 M(ℓ) . 1551 H ANNEKE ˜ For τ ≥ τ (iv) (ξ ; δ ), we have exp −ξ 2 M(ℓ) ≤ ℓ−1 , which establishes (41). For n ∈ N and k ∈ {1, . . . , d + 1}, define the set (k) (k) ˆ Un = mn + 1, . . . , mn + n/ 6 · 2k ∆mn (W1 ,W2 ,V ) , (k) where mn = ⌊n/3⌋; Un represents the set of indices processed in the inner loop of Meta-Algorithm 1 for the specified value of k. ˆ ˆ Lemma 45 There are ( f , C, P, γ )-dependent constants c1 , c2 ∈ (0, ∞) such that, for any ε ∈ (0, 1) ˆ and integer n ≥ c1 ln(c2 /ε ), on an event Hn (ε ) with ˆ ˆ ˆ P(Hn (ε )) ≥ 1 − (3/4)ε , (49) ⋆ we have, for V = Vmn , (k) (k) ˆ m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ ∀k ∈ 1, . . . 
, d˜f , ≤ n/ 3 · 2k , (50) (γ /8) ˆ (d ) ∆mnf (W1 ,W2 ,V ) ≤ ∆n (ε ) + 4m−1 , n ˜ (d˜f ) and ∀m ∈ Un (51) , ˜ ˜ ˜ ˆ (d ) ˆ (d ) ˆ (d ) ∆m f (Xm ,W2 ,V ) < γ ⇒ Γm f (Xm , − f (Xm ),W2 ,V ) < Γm f (Xm , f (Xm ),W2 ,V ). (52) Proof Suppose n ≥ c1 ln(c2 /ε ), where ˆ ˆ ˜ c1 = max ˆ 2d f +12 24 24 , , 3τ ∗ , ˜ f γ 2 r(1/16) r(1−γ )/6 δ and c2 = max 4 c(i) + c(ii) + c(iii) (γ /16) + 6d˜f , 4 ˆ 4e r(1/16) d ,4 4e d r(1−γ )/6 . In particular, we have chosen c1 and c2 large enough so that ˆ ˆ mn ≥ max τ (1/16; ε /2), τ (iv) (γ /16; ε /2), τ ((1 − γ )/6; ε /2), τ ∗ . We begin with (50). By Lemmas 43 and 44, on the event (iii) (i) (iv) ˆ (1) Hn (ε ) = Hmn (ε /2) ∩ Hmn ∩ Hmn (γ /16) ∩ Hmn , (k) ∀m ∈ Un , ∀k ∈ 1, . . . , d˜f , ˜ pγ (k, mn , m) ≤ P (x : px (k, mn ) ≥ γ /2) + exp −γ 2 M(m)/256 ¯ (k) ˆ ˜ ≤ P (x : px (k, mn ) ≥ γ /2) + exp −γ 2 M(mn )/256 ≤ ∆mn (W1 ,W2 ,V ) . 1552 (53) ACTIVIZED L EARNING Recall that (k) (k) ˆ is a sample of size n/(6 · 2k ∆mn (W1 ,W2 ,V )) , conditionally i.i.d. Xm : m ∈ Un (1) ˆ (given (W1 ,W2 ,V )) with conditional distributions P. Thus, ∀k ∈ 1, . . . , d˜f , on Hn (ε ), P ≤P (k) (k) ˆ m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ > n/ 3 · 2k (k) (k) ˆ (k) m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ (k) > 2 Un (k) (k) ˆ ≤ P B |Un |, ∆mn (W1 ,W2 ,V ) > 2 Un W1 ,W2 ,V ˆ (k) ∆mn (W1 ,W2 ,V ) W1 ,W2 ,V ˆ (k) ∆mn (W1 ,W2 ,V ) W1 ,W2 ,V , (54) where this last inequality follows from (53), and B(u, p) ∼ Binomial(u, p) is independent from W1 ,W2 ,V (for any fixed u and p). By a Chernoff bound, (54) is at most (k) ˆ exp − n/ 6 · 2k ∆mn (W1 ,W2 ,V ) ˆ (k) ∆mn (W1 ,W2 ,V )/3 ≤ exp 1 − n/ 18 · 2k . ˆ (2) By the law of total probability and a union bound, there exists an event Hn with ˆ (2) ≤ d˜f · exp 1 − n/ 18 · 2d˜f ˆ (1) P Hn (ε ) \ Hn ˆ (1) ˆ (2) such that, on Hn (ε ) ∩ Hn , (50) holds. ˆ (1) Next, by Lemma 44, on Hn (ε ), ˜ ˆ (d ) ∆mnf (W1 ,W2 ,V ) ≤ P x : px d˜f , mn ≥ γ /8 + 4m−1 , n (γ /8) (1) ˆ and by Lemma 38, on Hn (ε ), this is at most ∆n (ε ) + 4m−1 , which establishes (51). 
n (˜ (1) (ii) ˆ n (ε ) ∩ Hmn , ∀m ∈ Und f ) , (52) holds. Finally, Lemma 41 implies that on H Thus, defining (ii) ˆ (1) ˆ (2) ˆ Hn (ε ) = Hn (ε ) ∩ Hn ∩ Hmn , it remains only to establish (49). By a union bound, we have (i) (i) ˆ 1 − P Hn ≤ (1 − P (Hmn (ε /2))) + 1 − P Hmn (i) (iii) (ii) + P Hmn \ Hmn (iv) + P Hmn \ Hmn (γ /16) + 1 − P Hmn ˆ (1) ˆ (2) + P Hn (ε ) \ Hn . ˜ ˜ ≤ ε /2 + c(i) · exp −M(mn )/4 + c(ii) · exp −M(mn )1/3 /60 ˜ + c(iii) (γ /16) · exp −M(mn )γ 2 /256 + 3d˜f · exp {−2mn } ˜ + d˜f · exp 1 − n/ 18 · 2d f ˜ ˜ ≤ ε /2 + c(i) + c(ii) + c(iii) (γ /16) + 6d˜f · exp −nδ f γ 2 2−d f −12 . We have chosen n large enough so that (55) is at most (3/4)ε , which establishes (49). The following result is a slightly stronger version of Theorem 6. 1553 (55) H ANNEKE Lemma 46 For any passive learning algorithm A p , if A p achieves a label complexity Λ p with ∞ > Λ p (ε , f , P) = ω (log(1/ε )), then Meta-Algorithm 1, with A p as its argument, achieves a label complexity Λa such that Λa (3ε , f , P) = o(Λ p (ε , f , P)). Proof Suppose A p achieves label complexity Λ p with ∞ > Λ p (ε , f , P) = ω (log(1/ε )). Let ε ∈ (γ /8) ˜ (0, 1), define L(n; ε ) = n/ 6 · 2d f ∆n (ε ) + 4m−1 n max {n ∈ N : L(n; ε ) < m} (for any m ∈ (0, ∞)). Define c1 = max c1 , 2 · 63 (d + 1)d˜f ln(e(d + 1)) ˆ (for any n ∈ N), and let L−1 (m; ε ) = and c2 = max {c2 , 4e(d + 1)} , ˆ and suppose n ≥ max c1 ln(c2 /ε ), 1 + L−1 (Λ p (ε , f , P); ε ) . Consider running Meta-Algorithm 1 with A p and n as inputs, while f is the target function and P is the data distribution. ˆ Letting hn denote the classifier returned from Meta-Algorithm 1, Lemma 34 implies that on an ˆ ˆ event En with P(En ) ≥ 1 − e(d + 1) · exp −⌊n/3⌋/(72d˜f (d + 1) ln(e(d + 1))) ≥ 1 − ε /4, we have ˆ er(hn ) ≤ 2 er A p Ld˜f . ˆ ˆ ˆ ˆ By a union bound, the event Gn (ε ) = En ∩ Hn (ε ) has P Gn (ε ) ≥ 1 − ε . 
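Thus,

$$\begin{aligned}
\mathbb{E}\big[\mathrm{er}(\hat h_n)\big]
&\le \mathbb{E}\Big[\mathbb{1}_{\hat G_n(\varepsilon)}\,\mathbb{1}\big[|\mathcal{L}_{\tilde d_f}| \ge \Lambda_p(\varepsilon,f,\mathcal{P})\big]\,\mathrm{er}(\hat h_n)\Big]
+ \mathbb{P}\Big(\hat G_n(\varepsilon) \cap \big\{|\mathcal{L}_{\tilde d_f}| < \Lambda_p(\varepsilon,f,\mathcal{P})\big\}\Big)
+ \mathbb{P}\big(\hat G_n(\varepsilon)^c\big) \\
&\le \mathbb{E}\Big[\mathbb{1}_{\hat G_n(\varepsilon)}\,\mathbb{1}\big[|\mathcal{L}_{\tilde d_f}| \ge \Lambda_p(\varepsilon,f,\mathcal{P})\big]\,2\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde d_f})\big)\Big]
+ \mathbb{P}\Big(\hat G_n(\varepsilon) \cap \big\{|\mathcal{L}_{\tilde d_f}| < \Lambda_p(\varepsilon,f,\mathcal{P})\big\}\Big)
+ \varepsilon. \qquad (56)
\end{aligned}$$

On $\hat G_n(\varepsilon)$, (51) of Lemma 45 implies $|\mathcal{L}_{\tilde d_f}| \ge L(n;\varepsilon)$, and we chose $n$ large enough so that $L(n;\varepsilon) \ge \Lambda_p(\varepsilon,f,\mathcal{P})$. Thus, the second term in (56) is zero, and we have

$$\begin{aligned}
\mathbb{E}\big[\mathrm{er}(\hat h_n)\big]
&\le 2\cdot\mathbb{E}\Big[\mathbb{1}_{\hat G_n(\varepsilon)}\,\mathbb{1}\big[|\mathcal{L}_{\tilde d_f}| \ge \Lambda_p(\varepsilon,f,\mathcal{P})\big]\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde d_f})\big)\Big] + \varepsilon \\
&= 2\cdot\mathbb{E}\Big[\mathbb{E}\Big[\mathbb{1}_{\hat G_n(\varepsilon)}\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde d_f})\big)\,\Big|\,|\mathcal{L}_{\tilde d_f}|\Big]\,\mathbb{1}\big[|\mathcal{L}_{\tilde d_f}| \ge \Lambda_p(\varepsilon,f,\mathcal{P})\big]\Big] + \varepsilon. \qquad (57)
\end{aligned}$$

Note that for any $\ell$ with $\mathbb{P}(|\mathcal{L}_{\tilde d_f}| = \ell) > 0$, the conditional distribution of $\{X_m : m \in U_n^{(\tilde d_f)}\}$ given $\{|\mathcal{L}_{\tilde d_f}| = \ell\}$ is simply the product $\mathcal{P}^\ell$ (i.e., conditionally i.i.d.), which is the same as the distribution of $\{X_1, X_2, \ldots, X_\ell\}$. Furthermore, on $\hat G_n(\varepsilon)$, (50) implies that the $t < \lfloor 2n/3 \rfloor$ condition is always satisfied in Step 6 of Meta-Algorithm 1 while $k \le \tilde d_f$, and (52) implies that the inferred labels from Step 8 for $k = \tilde d_f$ are all correct. Therefore, for any such $\ell$ with $\ell \ge \Lambda_p(\varepsilon,f,\mathcal{P})$, we have

$$\mathbb{E}\Big[\mathbb{1}_{\hat G_n(\varepsilon)}\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde d_f})\big)\,\Big|\,\big\{|\mathcal{L}_{\tilde d_f}| = \ell\big\}\Big] \le \mathbb{E}\big[\mathrm{er}(\mathcal{A}_p(Z_\ell))\big] \le \varepsilon.$$

In particular, this means (57) is at most $3\varepsilon$. This implies that Meta-Algorithm 1, with $\mathcal{A}_p$ as its argument, achieves a label complexity $\Lambda_a$ such that

$$\Lambda_a(3\varepsilon,f,\mathcal{P}) \le \max\big\{c_1 \ln(c_2/\varepsilon),\; 1 + L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon)\big\}.$$

Since $\Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon)) \Rightarrow c_1\ln(c_2/\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$, it remains only to show that $L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$. Note that $\forall \varepsilon \in (0,1)$, $L(1;\varepsilon) = 0$ and $L(n;\varepsilon)$ is diverging in $n$. Furthermore, by Lemma 38, we know that for any $\mathbb{N}$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, we have $\Delta^{(\gamma/8)}_{N(\varepsilon)}(\varepsilon) = o(1)$, which implies $L(N(\varepsilon);\varepsilon) = \omega(N(\varepsilon))$. Thus, since $\Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon))$, Lemma 31 implies $L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$, as desired. This establishes the result for an arbitrary $\gamma \in (0,1)$.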
To specialize to the specific procedure stated as Meta-Algorithm 1, we simply take $\gamma = 1/2$.

Proof [Theorem 6] Theorem 6 now follows immediately from Lemma 46. Specifically, we have proven Lemma 46 for an arbitrary distribution $\mathcal{P}$ on $\mathcal{X}$, an arbitrary $f \in \mathrm{cl}(\mathbb{C})$, and an arbitrary passive algorithm $\mathcal{A}_p$. Therefore, it will certainly hold for every $\mathcal{P}$ and $f \in \mathbb{C}$, and since every $(f,\mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$ has $\infty > \Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon))$, the implication that Meta-Algorithm 1 activizes every passive algorithm $\mathcal{A}_p$ for $\mathbb{C}$ follows.

Careful examination of the proofs above reveals that the "3" in Lemma 46 can be set to any arbitrary constant strictly larger than 1, by an appropriate modification of the "7/12" threshold in ActiveSelect. In fact, if we were to replace Step 4 of ActiveSelect by instead selecting $\hat k = \mathrm{argmin}_k \max_{j \neq k} m_{kj}$ (where $m_{kj} = \mathrm{er}_{Q_{kj}}(h_k)$ when $k < j$), then we could even make this a certain $(1 + o(1))$ function of $\varepsilon$, at the expense of larger constant factors in $\Lambda_a$.

Appendix C. The Label Complexity of Meta-Algorithm 2

As mentioned, Theorem 10 is essentially implied by the details of the proof of Theorem 16 in Appendix D below. Here we present a proof of Theorem 13, along with two useful related lemmas. The first, Lemma 47, lower bounds the expected number of label requests Meta-Algorithm 2 would make while processing a given number of random unlabeled examples. The second, Lemma 48, bounds the amount by which each label request is expected to reduce the probability mass in the region of disagreement. Although we will only use Lemma 48 in our proof of Theorem 13, Lemma 47 may be of independent interest, as it provides additional insights into the behavior of disagreement-based methods, as related to the disagreement coefficient, and is included for this reason.
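The kind of quantity these two lemmas control can be illustrated with a small Monte Carlo sketch. It is not part of the paper: it assumes a hypothetical toy setting (threshold classifiers $h_t(x) = +1$ iff $x \ge t$ on $[0,1]$, the uniform distribution, and target threshold $0.5$), in which the region of disagreement of the consistent thresholds within distance $r$ of the target is an interval whose mass can be computed directly. The function names are illustrative only. The sketch compares the simulated expected mass after $\ell$ labeled points with the lower bound $(1-r)^{\ell}\, \mathcal{P}(\mathrm{DIS}(B(f,r)))$ that Lemma 47 states formally below.

```python
import random


def disagreement_mass(sample, t0, r):
    """Mass of DIS(V* ∩ B(f, r)) for thresholds on [0, 1], uniform P (toy assumption)."""
    # Nearest sample points on each side of the target threshold t0;
    # thresholds strictly between them remain consistent with all labels.
    left = max((x for x in sample if x < t0), default=0.0)
    right = min((x for x in sample if x >= t0), default=1.0)
    # Intersect with the ball B(f, r) = thresholds within r of t0.
    lo = max(left, t0 - r)
    hi = min(right, t0 + r)
    return hi - lo


def estimate_expected_mass(ell, t0=0.5, r=0.1, trials=20000, seed=0):
    """Monte Carlo estimate of E[P(DIS(V*_ell ∩ B(f, r)))] in the toy setting."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.random() for _ in range(ell)]
        total += disagreement_mass(sample, t0, r)
    return total / trials


def lemma47_lower_bound(ell, r):
    """(1 - r)^ell * P(DIS(B(f, r))), with P(DIS(B(f, r))) = 2r in the toy setting."""
    return (1 - r) ** ell * 2 * r


if __name__ == "__main__":
    for ell in (1, 5, 10, 20):
        print(ell, estimate_expected_mass(ell), lemma47_lower_bound(ell, 0.1))
```

In this toy setting the bound is loose, which is expected: the lemma holds for arbitrary classes and distributions, so it cannot exploit the one-dimensional structure of thresholds.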
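Throughout, we fix an arbitrary class $\mathbb{C}$, a target function $f \in \mathbb{C}$, and a distribution $\mathcal{P}$, and we continue using the notational conventions of the proofs above, such as $V^\star_m = \{h \in \mathbb{C} : \forall i \le m,\ h(X_i) = f(X_i)\}$ (with $V^\star_0 = \mathbb{C}$). Additionally, for $t \in \mathbb{N}$, define the random variable

$$M(t) = \min\Big\{ m \in \mathbb{N} : \sum_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}(V^\star_{\ell-1})}(X_\ell) = t \Big\},$$

which represents the index of the $t$-th unlabeled example Meta-Algorithm 2 would request the label of (assuming it has not yet halted). The two aforementioned lemmas are formally stated as follows.

Lemma 47 For any $r \in (0,1)$ and $\ell \in \mathbb{N}$,

$$\mathbb{E}\big[\mathcal{P}\big(\mathrm{DIS}(V^\star_\ell \cap B(f,r))\big)\big] \ge (1-r)^{\ell}\, \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big),$$

and

$$\mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/r \rceil} \mathbb{1}_{\mathrm{DIS}(V^\star_{m-1} \cap B(f,r))}(X_m)\Bigg] \ge \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$

Lemma 48 For any $r \in (0,1)$ and $n \in \mathbb{N}$,

$$\mathbb{E}\big[\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(n)} \cap B(f,r))\big)\big] \ge \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - nr.$$

Note these results immediately imply that

$$\mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/r \rceil} \mathbb{1}_{\mathrm{DIS}(V^\star_{m-1})}(X_m)\Bigg] \ge \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{2r}
\quad\text{and}\quad
\mathbb{E}\big[\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(n)})\big)\big] \ge \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - nr,$$

which are then directly relevant to the expected number of label requests made by Meta-Algorithm 2 among the first $m$ data points, and the probability Meta-Algorithm 2 requests the label of the next point, after already making $n$ label requests, respectively.

Before proving these lemmas, let us first mention their relevance to the disagreement coefficient analysis. Specifically, for any $\varepsilon \in (0, r]$, we have

$$\mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/\varepsilon \rceil} \mathbb{1}_{\mathrm{DIS}(V^\star_{m-1})}(X_m)\Bigg] \ge \mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/r \rceil} \mathbb{1}_{\mathrm{DIS}(V^\star_{m-1})}(X_m)\Bigg] \ge \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$

In particular, maximizing over $r > \varepsilon$, we have

$$\mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/\varepsilon \rceil} \mathbb{1}_{\mathrm{DIS}(V^\star_{m-1})}(X_m)\Bigg] \ge \theta_f(\varepsilon)/2.$$

Thus, the expected number of label requests among the first $\lceil 1/\varepsilon \rceil$ unlabeled examples processed by Meta-Algorithm 2 is at least $\theta_f(\varepsilon)/2$ (assuming it does not halt first). Similarly, for any $\varepsilon \in (0, r]$, for any $n \le \mathcal{P}(\mathrm{DIS}(B(f,r)))/(2r)$, Lemma 48 implies

$$\mathbb{E}\big[\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(n)})\big)\big] \ge \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/2 \ge \mathcal{P}\big(\mathrm{DIS}(B(f,\varepsilon))\big)/2.$$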
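Maximizing over $r > \varepsilon$, we see that

$$n \le \theta_f(\varepsilon)/2 \implies \mathbb{E}\big[\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(n)})\big)\big] \ge \mathcal{P}\big(\mathrm{DIS}(B(f,\varepsilon))\big)/2.$$

In other words, for Meta-Algorithm 2 to arrive at a region of disagreement with expected probability mass less than $\mathcal{P}(\mathrm{DIS}(B(f,\varepsilon)))/2$ requires a budget $n$ of at least $\theta_f(\varepsilon)/2$.

We now present proofs of Lemmas 47 and 48.

Proof [Lemma 47] Let $D_m = \mathrm{DIS}(V^\star_m \cap B(f,r))$. Since

$$\mathbb{E}\Bigg[\sum_{m=1}^{\lceil 1/r \rceil} \mathbb{1}_{D_{m-1}}(X_m)\Bigg]
= \sum_{m=1}^{\lceil 1/r \rceil} \mathbb{E}\big[\mathbb{P}\big(X_m \in D_{m-1} \,\big|\, V^\star_{m-1}\big)\big]
= \sum_{m=1}^{\lceil 1/r \rceil} \mathbb{E}\big[\mathcal{P}(D_{m-1})\big], \qquad (58)$$

we focus on lower bounding $\mathbb{E}[\mathcal{P}(D_m)]$ for $m \in \mathbb{N} \cup \{0\}$. Note that for any $x \in \mathrm{DIS}(B(f,r))$, there exists some $h_x \in B(f,r)$ with $h_x(x) \neq f(x)$, and if this $h_x \in V^\star_m$, then $x \in D_m$ as well. This means $\forall x$,

$$\mathbb{1}_{D_m}(x) \ge \mathbb{1}_{\mathrm{DIS}(B(f,r))}(x) \cdot \mathbb{1}_{V^\star_m}(h_x) = \mathbb{1}_{\mathrm{DIS}(B(f,r))}(x) \cdot \prod_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}(\{h_x,f\})^c}(X_\ell).$$

Therefore,

$$\begin{aligned}
\mathbb{E}\big[\mathcal{P}(D_m)\big] &= \mathbb{P}(X_{m+1} \in D_m) = \mathbb{E}\big[\mathbb{E}\big[\mathbb{1}_{D_m}(X_{m+1}) \,\big|\, X_{m+1}\big]\big] \\
&\ge \mathbb{E}\Bigg[\mathbb{E}\Bigg[\mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1}) \cdot \prod_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}(\{h_{X_{m+1}},f\})^c}(X_\ell) \,\Bigg|\, X_{m+1}\Bigg]\Bigg] \\
&= \mathbb{E}\Bigg[\prod_{\ell=1}^{m} \mathbb{P}\big(h_{X_{m+1}}(X_\ell) = f(X_\ell) \,\big|\, X_{m+1}\big)\, \mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1})\Bigg] \qquad (59) \\
&\ge \mathbb{E}\big[(1-r)^m\, \mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1})\big] = (1-r)^m\, \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big), \qquad (60)
\end{aligned}$$

where the equality in (59) is by conditional independence of the $\mathbb{1}_{\mathrm{DIS}(\{h_{X_{m+1}},f\})^c}(X_\ell)$ indicators, given $X_{m+1}$, and the inequality in (60) is due to $h_{X_{m+1}} \in B(f,r)$. This indicates (58) is at least

$$\sum_{m=1}^{\lceil 1/r \rceil} (1-r)^{m-1}\, \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)
= \Big(1 - (1-r)^{\lceil 1/r \rceil}\Big) \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{r}
\ge \Big(1 - \frac{1}{e}\Big) \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{r}
\ge \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$

Proof [Lemma 48] For each $m \in \mathbb{N} \cup \{0\}$, let $D_m = \mathrm{DIS}(B(f,r) \cap V^\star_m)$. For convenience, let $M(0) = 0$. We prove the result by induction. We clearly have $\mathbb{E}[\mathcal{P}(D_{M(0)})] = \mathbb{E}[\mathcal{P}(D_0)] = \mathcal{P}(\mathrm{DIS}(B(f,r)))$, which serves as our base case. Now fix any $n \in \mathbb{N}$ and take as the inductive hypothesis that

$$\mathbb{E}\big[\mathcal{P}(D_{M(n-1)})\big] \ge \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - (n-1)r.$$

As in the proof of Lemma 47, for any $x \in D_{M(n-1)}$, there exists $h_x \in B(f,r) \cap V^\star_{M(n-1)}$ with $h_x(x) \neq f(x)$; unlike the proof of Lemma 47, here $h_x$ is a random variable, determined by $V^\star_{M(n-1)}$. If $h_x$ is also in $V^\star_{M(n)}$, then $x \in D_{M(n)}$ as well.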
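Thus, $\forall x$,

$$\mathbb{1}_{D_{M(n)}}(x) \ge \mathbb{1}_{D_{M(n-1)}}(x) \cdot \mathbb{1}_{V^\star_{M(n)}}(h_x) = \mathbb{1}_{D_{M(n-1)}}(x) \cdot \mathbb{1}_{\mathrm{DIS}(\{h_x,f\})^c}(X_{M(n)}),$$

where this last equality is due to the fact that every $m \in \{M(n-1)+1, \ldots, M(n)-1\}$ has $X_m \notin \mathrm{DIS}(V^\star_{m-1})$, so that in particular $h_x(X_m) = f(X_m)$. Therefore, letting $X \sim \mathcal{P}$ be independent of the data $Z$,

$$\begin{aligned}
\mathbb{E}\big[\mathcal{P}(D_{M(n)})\big] &= \mathbb{E}\big[\mathbb{1}_{D_{M(n)}}(X)\big] \ge \mathbb{E}\big[\mathbb{1}_{D_{M(n-1)}}(X) \cdot \mathbb{1}_{\mathrm{DIS}(\{h_X,f\})^c}(X_{M(n)})\big] \\
&= \mathbb{E}\Big[\mathbb{1}_{D_{M(n-1)}}(X) \cdot \mathbb{P}\big(h_X(X_{M(n)}) = f(X_{M(n)}) \,\big|\, X, V^\star_{M(n-1)}\big)\Big]. \qquad (61)
\end{aligned}$$

The conditional distribution of $X_{M(n)}$ given $V^\star_{M(n-1)}$ is merely $\mathcal{P}$ but with support restricted to $\mathrm{DIS}(V^\star_{M(n-1)})$ and renormalized to a probability measure: that is, $\mathcal{P}(\cdot \mid \mathrm{DIS}(V^\star_{M(n-1)}))$. Thus, since any $x \in D_{M(n-1)}$ has $\mathrm{DIS}(\{h_x,f\}) \subseteq \mathrm{DIS}(V^\star_{M(n-1)})$, we have

$$\mathbb{P}\big(h_x(X_{M(n)}) \neq f(X_{M(n)}) \,\big|\, V^\star_{M(n-1)}\big) = \frac{\mathcal{P}\big(\mathrm{DIS}(\{h_x,f\})\big)}{\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(n-1)})\big)} \le \frac{r}{\mathcal{P}(D_{M(n-1)})},$$

where the inequality follows from $h_x \in B(f,r)$ and $D_{M(n-1)} \subseteq \mathrm{DIS}(V^\star_{M(n-1)})$. Therefore, (61) is at least

$$\begin{aligned}
\mathbb{E}\Big[\mathbb{1}_{D_{M(n-1)}}(X) \cdot \Big(1 - \frac{r}{\mathcal{P}(D_{M(n-1)})}\Big)\Big]
&= \mathbb{E}\Big[\mathbb{P}\big(X \in D_{M(n-1)} \,\big|\, D_{M(n-1)}\big) \cdot \Big(1 - \frac{r}{\mathcal{P}(D_{M(n-1)})}\Big)\Big] \\
&= \mathbb{E}\Big[\mathcal{P}(D_{M(n-1)}) \cdot \Big(1 - \frac{r}{\mathcal{P}(D_{M(n-1)})}\Big)\Big] = \mathbb{E}\big[\mathcal{P}(D_{M(n-1)})\big] - r.
\end{aligned}$$

By the inductive hypothesis, this is at least $\mathcal{P}(\mathrm{DIS}(B(f,r))) - nr$.

With Lemma 48 in hand, we are ready for the proof of Theorem 13.

Proof [Theorem 13] Let $\mathbb{C}$, $f$, $\mathcal{P}$, and $\lambda$ be as in the theorem statement. For $m \in \mathbb{N}$, let $\lambda^{-1}(m) = \inf\{\varepsilon > 0 : \lambda(\varepsilon) \le m\}$, or $1$ if this is not defined. We define $\mathcal{A}_p$ as a randomized algorithm such that, for $m \in \mathbb{N}$ and $\mathcal{L} \in (\mathcal{X} \times \{-1,+1\})^m$, $\mathcal{A}_p(\mathcal{L})$ returns $f$ with probability $1 - \lambda^{-1}(|\mathcal{L}|)$ and returns $-f$ with probability $\lambda^{-1}(|\mathcal{L}|)$ (independent of the contents of $\mathcal{L}$). Note that, for any integer $m \ge \lambda(\varepsilon)$, $\mathbb{E}[\mathrm{er}(\mathcal{A}_p(Z_m))] = \lambda^{-1}(m) \le \lambda^{-1}(\lambda(\varepsilon)) \le \varepsilon$. Therefore, $\mathcal{A}_p$ achieves some label complexity $\Lambda_p$ with $\Lambda_p(\varepsilon,f,\mathcal{P}) = \lambda(\varepsilon)$ for all $\varepsilon > 0$.

If $\theta_f(\lambda(\varepsilon)^{-1}) \neq \omega(1)$, then monotonicity implies $\theta_f(\lambda(\varepsilon)^{-1}) = O(1)$, and since every label complexity $\Lambda_a$ is $\Omega(1)$, the result clearly holds. Otherwise, suppose $\theta_f(\lambda(\varepsilon)^{-1}) = \omega(1)$; in particular, this means $\exists \varepsilon_0 \in (0,1/2)$ such that $\theta_f(\lambda(2\varepsilon_0)^{-1}) \ge 12$.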
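The construction of $\mathcal{A}_p$ above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes $\lambda$ is nonincreasing and that only $\varepsilon \in (0,1]$ matters, so that $\lambda^{-1}(m)$ can be approximated by bisection, and it uses the hypothetical choice $\lambda(\varepsilon) = 1/\varepsilon$ in the usage example. All function names are made up for illustration.

```python
import random


def lam_inverse(lam, m, tol=1e-9):
    """Approximate inf{eps in (0, 1] : lam(eps) <= m}; return 1.0 if no such eps.

    Assumes lam is nonincreasing on (0, 1] (an assumption of this sketch).
    """
    if lam(1.0) > m:
        return 1.0  # the "or 1 if this is not defined" convention
    lo, hi = 0.0, 1.0  # invariant: lam(hi) <= m
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lam(mid) <= m:
            hi = mid
        else:
            lo = mid
    return hi


def make_passive_learner(f, lam, rng):
    """Return a learner that ignores label contents and only uses the sample size."""
    def learner(labeled):
        eps = lam_inverse(lam, len(labeled))
        if rng.random() < eps:
            return lambda x: -f(x)  # the classifier -f, which errs everywhere
        return f                    # the target itself, with zero error
    return learner


if __name__ == "__main__":
    target = lambda x: 1 if x >= 0.5 else -1
    learner = make_passive_learner(target, lambda e: 1.0 / e, random.Random(0))
    data = [(i / 10, target(i / 10)) for i in range(10)]
    wrong = sum(learner(data)(0.3) != target(0.3) for _ in range(20000))
    print(wrong / 20000)  # empirical error rate, close to lam_inverse = 0.1
```

Because the returned classifier is either $f$ (error 0) or $-f$ (error 1), the expected error rate equals the flip probability $\lambda^{-1}(m)$, which is the identity the proof uses.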
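Fix any $\varepsilon \in (0, \varepsilon_0)$, let $r > \lambda(2\varepsilon)^{-1}$ be such that $\frac{\mathcal{P}(\mathrm{DIS}(B(f,r)))}{r} \ge \theta_f(\lambda(2\varepsilon)^{-1})/2$, and let $n \in \mathbb{N}$ satisfy $n \le \theta_f(\lambda(2\varepsilon)^{-1})/4$. Consider running Meta-Algorithm 2 with arguments $\mathcal{A}_p$ and $n$, and let $\hat{\mathcal{L}}$ denote the final value of the set $\mathcal{L}$, and let $\check m$ denote the value of $m$ upon reaching Step 6. Note that any $m < \lambda(2\varepsilon)$ and $\mathcal{L} \in (\mathcal{X} \times \{-1,+1\})^m$ has $\mathrm{er}(\mathcal{A}_p(\mathcal{L})) = \lambda^{-1}(m) \ge \inf\{\varepsilon' > 0 : \lambda(\varepsilon') < \lambda(2\varepsilon)\} \ge 2\varepsilon$. Therefore, we have

$$\begin{aligned}
\mathbb{E}\big[\mathrm{er}\big(\mathcal{A}_p(\hat{\mathcal{L}})\big)\big] &\ge 2\varepsilon\, \mathbb{P}\big(|\hat{\mathcal{L}}| < \lambda(2\varepsilon)\big) = 2\varepsilon\, \mathbb{P}\Big(\big\lfloor n/(6\hat\Delta) \big\rfloor < \lambda(2\varepsilon)\Big) \\
&= 2\varepsilon\, \mathbb{P}\Big(\hat\Delta > \frac{n}{6\lambda(2\varepsilon)}\Big) = 2\varepsilon \Big(1 - \mathbb{P}\Big(\hat\Delta \le \frac{n}{6\lambda(2\varepsilon)}\Big)\Big). \qquad (62)
\end{aligned}$$

Since $n \le \theta_f(\lambda(2\varepsilon)^{-1})/4 \le \mathcal{P}(\mathrm{DIS}(B(f,r)))/(2r) < \lambda(2\varepsilon)\,\mathcal{P}(\mathrm{DIS}(B(f,r)))/2$, we have

$$\begin{aligned}
\mathbb{P}\Big(\hat\Delta \le \frac{n}{6\lambda(2\varepsilon)}\Big) &\le \mathbb{P}\Big(\hat\Delta < \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/12\Big) \\
&\le \mathbb{P}\Big(\Big\{\mathcal{P}\big(\mathrm{DIS}(V^\star_{\check m})\big) < \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/12\Big\} \cup \Big\{\hat\Delta < \mathcal{P}\big(\mathrm{DIS}(V^\star_{\check m})\big)\Big\}\Big). \qquad (63)
\end{aligned}$$

Since $\check m \le M(\lceil n/2 \rceil)$, monotonicity and a union bound imply this is at most

$$\mathbb{P}\Big(\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)})\big) < \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/12\Big) + \mathbb{P}\Big(\hat\Delta < \mathcal{P}\big(\mathrm{DIS}(V^\star_{\check m})\big)\Big). \qquad (64)$$

Markov's inequality implies

$$\begin{aligned}
&\mathbb{P}\Big(\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)})\big) < \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/12\Big) \\
&\quad= \mathbb{P}\Big(\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - \mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)})\big) > \tfrac{11}{12}\,\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)\Big) \\
&\quad\le \mathbb{P}\Big(\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - \mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)} \cap B(f,r))\big) > \tfrac{11}{12}\,\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)\Big) \\
&\quad\le \frac{\mathbb{E}\Big[\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big) - \mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)} \cap B(f,r))\big)\Big]}{\tfrac{11}{12}\,\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)} \\
&\quad= \frac{12}{11}\Bigg(1 - \frac{\mathbb{E}\Big[\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)} \cap B(f,r))\big)\Big]}{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}\Bigg).
\end{aligned}$$

Lemma 48 implies this is at most

$$\frac{12}{11} \cdot \frac{\lceil n/2 \rceil\, r}{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)} \le \frac{12}{11} \Bigg\lceil \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{4r} \Bigg\rceil \frac{r}{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}.$$

Since any $a \ge 3/2$ has $\lceil a \rceil \le (3/2)a$, and $\theta_f(\lambda(2\varepsilon)^{-1}) \ge 12$ implies $\frac{\mathcal{P}(\mathrm{DIS}(B(f,r)))}{4r} \ge 3/2$, we have $\Big\lceil \frac{\mathcal{P}(\mathrm{DIS}(B(f,r)))}{4r} \Big\rceil \le \frac{3}{8} \cdot \frac{\mathcal{P}(\mathrm{DIS}(B(f,r)))}{r}$, so that

$$\frac{12}{11} \Bigg\lceil \frac{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)}{4r} \Bigg\rceil \frac{r}{\mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)} \le \frac{9}{22}.$$

Combining the above, we have

$$\mathbb{P}\Big(\mathcal{P}\big(\mathrm{DIS}(V^\star_{M(\lceil n/2 \rceil)})\big) < \mathcal{P}\big(\mathrm{DIS}(B(f,r))\big)/12\Big) \le \frac{9}{22}. \qquad (65)$$

Examining the second term in (64), Hoeffding's inequality and the definition of $\hat\Delta$ from (13) imply

$$\mathbb{P}\Big(\hat\Delta < \mathcal{P}\big(\mathrm{DIS}(V^\star_{\check m})\big)\Big) = \mathbb{E}\Big[\mathbb{P}\Big(\hat\Delta < \mathcal{P}\big(\mathrm{DIS}(V^\star_{\check m})\big) \,\Big|\, V^\star_{\check m}, \check m\Big)\Big] \le \mathbb{E}\big[e^{-8\check m}\big] \le e^{-8} < 1/11.$$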
(66)

Combining (62), (63), (64), (65), and (66) implies

$$\mathbb{E}\big[\mathrm{er}\big(\mathcal{A}_p(\hat{\mathcal{L}})\big)\big] > 2\varepsilon\Big(1 - \frac{9}{22} - \frac{1}{11}\Big) = \varepsilon.$$

Thus, for any label complexity $\Lambda_a$ achieved by running Meta-Algorithm 2 with $\mathcal{A}_p$ as its argument, we must have $\Lambda_a(\varepsilon,f,\mathcal{P}) > \theta_f(\lambda(2\varepsilon)^{-1})/4$. Since this is true for all $\varepsilon \in (0,\varepsilon_0)$, this establishes the result.

Appendix D. The Label Complexity of Meta-Algorithm 3

As in Appendix B, we will assume $\mathbb{C}$ is a fixed VC class, $\mathcal{P}$ is some arbitrary distribution, and $f \in \mathrm{cl}(\mathbb{C})$ is an arbitrary fixed function. We continue using the notation introduced above: in particular, $\mathcal{S}^k(\mathcal{H}) = \{S \in \mathcal{X}^k : \mathcal{H} \text{ shatters } S\}$, $\bar{\mathcal{S}}^k(\mathcal{H}) = \mathcal{X}^k \setminus \mathcal{S}^k(\mathcal{H})$, $\bar\partial^k_{\mathcal{H}} f = \mathcal{X}^k \setminus \partial^k_{\mathcal{H}} f$, and $\tilde\delta_f = \mathcal{P}^{\tilde d_f - 1}\big(\partial^{\tilde d_f - 1}_{\mathbb{C}} f\big)$. Also, as above, we will prove a more general result replacing the "1/2" in Steps 5, 9, and 12 of Meta-Algorithm 3 with an arbitrary value $\gamma \in (0,1)$; thus, the specific result for the stated algorithm will be obtained by taking $\gamma = 1/2$.

For the estimators $\hat P_m$ in Meta-Algorithm 3, we take precisely the same definitions as given in Appendix B.1 for the estimators in Meta-Algorithm 1. In particular, the quantities $\hat\Delta^{(k)}_m(x,W_2,\mathcal{H})$, $\hat\Delta^{(k)}_m(W_1,W_2,\mathcal{H})$, $\hat\Gamma^{(k)}_m(x,y,W_2,\mathcal{H})$, and $M^{(k)}_m(\mathcal{H})$ are all defined as in Appendix B.1, and the $\hat P_m$ estimators are again defined as in (11), (12) and (13).

Also, we sometimes refer to quantities defined above, such as $\bar p_\zeta(k,\ell,m)$ (defined in (34)), as well as the various events from the lemmas of the previous appendix, such as $H_\tau(\delta)$, $H'$, $H^{(i)}_\tau$, $H^{(ii)}_\tau$, $H^{(iii)}_\tau(\zeta)$, $H^{(iv)}_\tau$, and $G^{(i)}_\tau$.

D.1 Proof of Theorem 16

Throughout the proof, we will make reference to the sets $V_m$ defined in Meta-Algorithm 3. Also let $V^{(k)}$ denote the final value of $V$ obtained for the specified value of $k$ in Meta-Algorithm 3. Both $V_m$ and $V^{(k)}$ are implicitly functions of the budget, $n$, given to Meta-Algorithm 3. As above, we continue to denote by $V^\star_m = \{h \in \mathbb{C} : \forall i \le m,\ h(X_i) = f(X_i)\}$.
One important fact we will use repeatedly below is that if $V_m = V^\star_m$ for some $m$, then since Lemma 35 implies that $V^\star_m \neq \emptyset$ on $H'$, we must have that all of the previous $\hat y$ values were consistent with $f$, which means that $\forall \ell \le m$, $V_\ell = V^\star_\ell$. In particular, if $V^{(k')} = V^\star_m$ for the largest $m$ value obtained while $k = k'$ in Meta-Algorithm 3, then $V_\ell = V^\star_\ell$ for all $\ell$ obtained while $k \le k'$ in Meta-Algorithm 3.

Additionally, define $\tilde m_n = \lfloor n/24 \rfloor$, and note that the value $m = \lceil n/6 \rceil$ is obtained while $k = 1$ in Meta-Algorithm 3. We also define the following quantities, which we will show are typically equal to related quantities in Meta-Algorithm 3. Define $\hat m_0 = 0$, $T^\star_0 = \lceil 2n/3 \rceil$, and $\hat t_0 = 0$, and for each $k \in \{1, \ldots, d+1\}$, inductively define

$$\begin{aligned}
T^\star_k &= T^\star_{k-1} - \hat t_{k-1}, \\
I^\star_{mk} &= \mathbb{1}_{[\gamma,\infty)}\Big(\hat\Delta^{(k)}_m\big(X_m, W_2, V^\star_{m-1}\big)\Big), \quad \forall m \in \mathbb{N}, \\
\check m_k &= \min\Bigg(\Bigg\{ m \ge \hat m_{k-1} : \sum_{\ell = \hat m_{k-1}+1}^{m} I^\star_{\ell k} = \lceil T^\star_k/4 \rceil \Bigg\} \cup \Big\{\max\big\{k \cdot 2^n + 1,\ \hat m_{k-1}\big\}\Big\}\Bigg), \\
\hat m_k &= \check m_k + \Big\lfloor T^\star_k \big/ \Big(3\hat\Delta^{(k)}_{\check m_k}\big(W_1, W_2, V^\star_{\check m_k}\big)\Big) \Big\rfloor, \\
\check U_k &= (\hat m_{k-1}, \check m_k] \cap \mathbb{N}, \qquad \hat U_k = (\check m_k, \hat m_k] \cap \mathbb{N}, \\
C^\star_{mk} &= \mathbb{1}_{[0, \lfloor 3T^\star_k/4 \rfloor)}\Bigg(\sum_{\ell = \hat m_{k-1}+1}^{m-1} I^\star_{\ell k}\Bigg), \\
Q^\star_k &= \sum_{m \in \hat U_k} I^\star_{mk} \cdot C^\star_{mk}, \quad\text{and}\quad \hat t_k = Q^\star_k + \sum_{m \in \check U_k} I^\star_{mk}.
\end{aligned}$$

The meaning of these values can be understood in the context of Meta-Algorithm 3, under the condition that $V_m = V^\star_m$ for values of $m$ obtained for the respective value of $k$. Specifically, under this condition, $T^\star_k$ corresponds to $T_k$, $\hat t_k$ represents the final value $t$ for round $k$, $\check m_k$ represents the value of $m$ upon reaching Step 9 in round $k$, while $\hat m_k$ represents the value of $m$ at the end of round $k$, $\check U_k$ corresponds to the set of indices arrived at in Step 4 during round $k$, while $\hat U_k$ corresponds to the set of indices arrived at in Step 11 during round $k$, for $m \in \check U_k$, $I^\star_{mk}$ indicates whether the label of $X_m$ is requested, while for $m \in \hat U_k$, $I^\star_{mk} \cdot C^\star_{mk}$ indicates whether the label of $X_m$ is requested. Finally, $Q^\star_k$ corresponds to the number of label requests in Step 13 during round $k$.
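The budget recursion $T^\star_k = T^\star_{k-1} - \hat t_{k-1}$ can be explored with a short worst-case sketch. It is an illustration under a stated assumption, not part of the paper: it assumes every round spends the largest amount the cap in $C^\star_{mk}$ is meant to allow, namely $\hat t_k = \lfloor 3T^\star_k/4 \rfloor$ (Lemma 50 later in this appendix shows $\hat t_k$ never exceeds this when $n \ge 3 \cdot 4^{k-1}$). Under that assumption each round keeps at least a quarter of the remaining budget, which is where the bound $T^\star_k \ge 4^{1-k}(2n/3)$ of Lemma 50 comes from. The function name is hypothetical.

```python
def worst_case_budgets(n, rounds):
    """Budgets T*_1, ..., T*_rounds when each round spends floor(3 T / 4).

    T*_1 = ceil(2n / 3); spending floor(3 T / 4) is the assumed worst case.
    """
    budgets = []
    remaining = (2 * n + 2) // 3          # integer form of ceil(2n / 3)
    for _ in range(rounds):
        budgets.append(remaining)
        remaining -= (3 * remaining) // 4  # keep at least a quarter of the budget
    return budgets


if __name__ == "__main__":
    n = 3 * 4 ** 4
    for k, budget in enumerate(worst_case_budgets(n, 5), start=1):
        # Compare with the Lemma 50 lower bound 4^(1-k) * (2n/3).
        print(k, budget, 2 * n / (3 * 4 ** (k - 1)))
```

Integer arithmetic is used throughout so that the floors and ceilings match the definitions exactly rather than relying on floating-point rounding.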
In particular, note m1 ≥ mn . ˇ ˜ (i) Lemma 49 For any τ ∈ N, on the event H ′ ∩ Gτ , ∀k, ℓ, m ∈ N with k ≤ d˜f , ∀x ∈ X , for any sets H and H′ with Vℓ⋆ ⊆ H ⊆ H′ ⊆ B( f , r1/6 ), if either k = 1 or m ≥ τ , then ˆ (k) ˆ (k) ∆m (x,W2 , H) ≤ (3/2)∆m x,W2 , H′ . (i) In particular, for any δ ∈ (0, 1) and τ ≥ τ (1/6; δ ), on H ′ ∩ Hτ (δ ) ∩ Gτ , ∀k, ℓ, ℓ′ , m ∈ N with m ≥ τ , ˆ (k) ˆ (k) ℓ ≥ ℓ′ ≥ τ , and k ≤ d˜f , ∀x ∈ X , ∆m (x,W2 ,Vℓ⋆ ) ≤ (3/2)∆m x,W2 ,Vℓ⋆ . ′ Proof First note that ∀m ∈ N, ∀x ∈ X , ˆ (1) ˆ (1) ∆m (x,W2 , H) = ½DIS(H) (x) ≤ ½DIS(H′ ) (x) = ∆m x,W2 , H′ , (k) so the result holds for k = 1. Lemma 35, Lemma 40, and monotonicity of Mm (·) imply that on (i) H ′ ∩ Gτ , for any m ≥ τ and k ∈ 2, . . . , d˜f , m3 (k) Mm (H) ≥ i=1 (k) (k) ½∂C f Si(k) ≥ (2/3)Mm B( f , r1/6 ) ≥ (2/3)Mm H′ , k−1 1561 H ANNEKE so that ∀x ∈ X , m3 (k) ˆ (k) ∆m (x,W2 , H) = Mm (H)−1 i=1 m3 ≤ (k) Mm (H)−1 ≤ (k) (3/2)Mm i=1 H ½S k (H) Si(k) ∪ {x} ½S k (H′ ) Si(k) ∪ {x} ′ −1 m3 i=1 ˆm ½S k (H′ ) Si(k) ∪ {x} = (3/2)∆(k) x,W2 , H′ . The final claim follows from Lemma 29. ˆ Lemma 50 For any k ∈ {1, . . . , d + 1}, if n ≥ 3·4k−1 , then Tk⋆ ≥ 41−k (2n/3) and tk ≤ 3Tk⋆ /4 . Proof Recall T1⋆ = ⌈2n/3⌉ ≥ 2n/3. If n ≥ 2, we also have ⌊3T1⋆ /4⌋ ≥ ⌈T1⋆ /4⌉, so that (due to the ⋆ ˆ Cm1 factors) t1 ≤ ⌊3T1⋆ /4⌋. For the purpose of induction, suppose some k ∈ {2, . . . , d + 1} has n ≥ ⋆ ⋆ ⋆ ⋆ ˆ ˆ 3 · 4k−1 , Tk−1 ≥ 42−k (2n/3), and tk−1 ≤ ⌊3Tk−1 /4⌋. Then Tk⋆ = Tk−1 − tk−1 ≥ Tk−1 /4 ≥ 41−k (2n/3), ⋆ ˆ and since n ≥ 3 · 4k−1 , we also have ⌊3Tk⋆ /4⌋ ≥ ⌈Tk⋆ /4⌉, so that tk ≤ ⌊3Tk⋆ /4⌋ (again, due to the Cmk k−1 . factors). Thus, by induction, this holds for all k ∈ {1, . . . , d + 1} with n ≥ 3 · 4 The next lemma indicates that the “t < ⌊3Tk /4⌋” constraint in Step 12 is redundant for k ≤ d˜f . 
It ˆ is similar to (50) in Lemma 45, but is made only slightly more complicated by the fact that the ∆(k) estimate is calculated in Step 9 based on a set Vm different from the ones used to decide whether or not to request a label in Step 12. (i) (i) ˜ ˜ Lemma 51 There exist (C, P, f , γ )-dependent constants c1 , c2 ∈ [1, ∞) such that, for any δ ∈ (i) (i) (0, 1), and any integer n ≥ c1 ln c2 /δ , on an event ˜ ˜ (i) (i) (iv) (iii) ˜ (i) Hn (δ ) ⊆ Gmn ∩ Hmn (δ ) ∩ Hmn ∩ Hmn (γ /16) ∩ Hmn ˜ ˜ ˜ ˜ ˜ ˜ (i) ˆ with P Hn (δ ) ≥ 1 − 2δ , ∀k ∈ 1, . . . , d˜f , tk = mk ˆ m=mk−1 +1 ˆ ⋆ Imk ≤ 3Tk⋆ /4. Proof Define the constants (i) c1 = max ˜ (i) d˜ +6 192d 3·4 f ˜ r(3/32) , δ f γ 2 8e (i) , c2 = max ˜ r(3/32) ˜ , c(i) + c(iii) (γ /16) + 125d˜f δ f−1 (i) and let n(i) (δ ) = c1 ln c2 /δ . Fix any integer n ≥ n(i) (δ ) and consider the event ˜ ˜ (i) (i) (iii) (iv) ˜ (1) Hn (δ ) = Gmn ∩ Hmn (δ ) ∩ Hmn ∩ Hmn (γ /16) ∩ Hmn . ˜ ˜ ˜ ˜ ˜ 1562 , ACTIVIZED L EARNING (1) ˜ By Lemma 49 and the fact that mk ≥ mn for all k ≥ 1, since n ≥ n(i) (δ ) ≥ 24τ (1/6; δ ), on Hn (δ ), ˇ ˜ ˜f , ∀m ∈ Uk , ˆ ∀k ∈ 1, . . . , d ⋆ ⋆ ˆ (k) ˆ (k) ∆m Xm ,W2 ,Vm−1 ≤ (3/2)∆m Xm ,W2 ,Vmk . ˇ (67) Now fix any k ∈ 1, . . . , d˜f . Since n ≥ n(i) (δ ) ≥ 27·4k−1 , Lemma 50 implies Tk⋆ ≥ 18, which means ⋆ ≤ T ⋆ /4 . Let N = (4/3)∆(k) W ,W ,V ⋆ ˆ ˆ Uk , 3T ⋆ /4 − ⌈T ⋆ /4⌉ ≥ 4T ⋆ /9. Also note ˇ I 1 2 k k k m∈Uk mk k ˆ and note that Uk = Tk⋆ /   ˆ (k) 3∆mk ˇ k ⋆ W1 ,W2 ,Vmk ˇ mk ˇ mk ˇ , so that Nk ≤ (4/9)Tk⋆ . Thus, we have   ⋆ ˜ (1) P Hn (δ ) ∩ Imk > 3Tk⋆ /4    m=mk−1 +1 ˆ           ⋆ ⋆ ˜ (1) ˜ (1) Imk > 4Tk⋆ /9  ≤ P Hn (δ ) ∩ Imk > Nk  ≤ P Hn (δ ) ∩     ˆ ˆ m∈Uk m∈Uk      ⋆ ˆm ˜ (1) ≤ P Hn (δ ) ∩ ½[2γ /3,∞) ∆(k) Xm ,W2 ,Vmk > Nk  , ˇ    mk ˆ (68) ˆ m∈Uk ⋆ ˜ ˇ where this last inequality is by (67). To simplify notation, define Zk = Tk⋆ , mk ,W1 ,W2 ,Vmk . 
By ˇ Lemmas 43 and 44 (with β = 3/32, ζ = 2γ /3, α = 3/4, and ξ = γ /16), since n ≥ n(i) (δ ) ≥ ˆ ˜ (1) 24 · max τ (iv) (γ /16; δ ), τ (3/32; δ ) , on Hn (δ ), ∀m ∈ Uk , ˜ ˇ ˇ p2γ /3 (k, mk , m) ≤ P (x : px (k, mk ) ≥ γ /2) + exp −γ 2 M(m)/256 ¯ ˜ ˇ ≤ P (x : px (k, mk ) ≥ γ /2) + exp −γ 2 M(mk )/256 ˇ (k) ⋆ ˆ ≤ ∆mk W1 ,W2 ,Vmk . ˇ ˇ (k) (1) ⋆ ˆ ˜n ˜n ˜ ˇ Letting G′ (k) denote the event p2γ /3 (k, mk , m) ≤ ∆mk W1 ,W2 ,Vmk , we see that G′ (k) ⊇ Hn (δ ). ¯ ˇ ˇ ⋆ ˆ (k) ˜ variables are conditionally independent given Zk for Thus, since the ½[2γ /3,∞) ∆m Xm ,W2 ,Vmk ˇ ˆ m ∈ Uk , each with respective conditional distribution Bernoulli p2γ /3 (k, mk , m) , the law of total ¯ ˇ probability and a Chernoff bound imply that (68) is at most  ˜n P G′ (k) ∩   = E P    ⋆ ˆm ½[2γ /3,∞) ∆(k) Xm ,W2 ,Vmk ˇ  ˆ m∈Uk ⋆ ˆm ½[2γ /3,∞) ∆(k) Xm ,W2 ,Vmk ˇ ˆ m∈Uk (k) ⋆ ˆ ≤ E exp −∆mk W1 ,W2 ,Vmk ˇ ˇ ˆ Uk /27   > Nk     ˜ > Nk Zk  · ½G′n (k)  ˜ ≤ E [exp{−Tk⋆ /162}] ≤ exp −n/ 243 · 4k−1 1563 , H ANNEKE ˜ ˜ ˜ (1) where the last inequality is by Lemma 50. Thus, there exists Gn (k) with P Hn (δ ) \ Gn (k) ≤ exp −n/ 243 · 4k−1 ˜ (i) ˜ (1) Hn (δ ) = Hn (δ ) ∩ (1) ˜ ˜ such that, on Hn (δ ) ∩ Gn (k), we have d˜f ˜ k=1 Gn (k), mk ˆ ⋆ m=mk−1 +1 Imk ˆ ≤ 3Tk⋆ /4. Defining a union bound implies ˜ ˜ (1) ˜ (i) P Hn (δ ) \ Hn (δ ) ≤ d˜f · exp −n/ 243 · 4d f −1 , (69) (i) ˆ ⋆ ⋆ ˜ and on Hn (δ ), every k ∈ 1, . . . , d˜f has mk mk−1 +1 Imk ≤ 3Tk⋆ /4. In particular, this means the Cmk m= ˆ ˆ ⋆ ˆ factors are redundant in Q⋆ , so that tk = mk mk−1 +1 Imk . k m= ˆ To get the stated probability bound, a union bound implies that (i) ˜ (1) 1 − P Hn (δ ) ≤ (1 − P (Hmn (δ ))) + 1 − P Hmn ˜ ˜ (iv) + 1 − P Hmn ˜ (i) (i) (iii) + P Hmn \ Hmn (γ /16) ˜ ˜ (i) + P Hmn \ Gmn ˜ ˜ ˜ ˜ ≤ δ + c(i) · exp −M (mn ) /4 ˜ ˜ ˜ + c(iii) (γ /16) · exp −M (mn ) γ 2 /256 + 3d˜f · exp {−2mn } −1 ˜ ˜ ˜ + 121d˜f δ f · exp −M (mn ) /60 ˜ ≤ δ + c(i) + c(iii) (γ /16) + 124d˜f δ f−1 · exp −mn δ f γ 2 /512 . 
˜ ˜ (70) ˜ Since n ≥ n(i) (δ ) ≥ 24, we have mn ≥ n/48, so that summing (69) and (70) gives us ˜ ˜ ˜ ˜ (i) 1 − P Hn (δ ) ≤ δ + c(i) + c(iii) (γ /16) + 125d˜f δ f−1 · exp −nδ f γ 2 / 512 · 48 · 4d f −1 . (71) Finally, note that we have chosen n(i) (δ ) sufficiently large so that (71) is at most 2δ . The next lemma indicates that the redundancy of the “t < ⌊3Tk /4⌋” constraint, just established in Lemma 51, implies that all y labels obtained while k ≤ d˜f are consistent with the target function. ˆ Lemma 52 Consider running Meta-Algorithm 3 with a budget n ∈ N, while f is the target func˜ (ii) tion and P is the data distribution. There is an event Hn and (C, P, f , γ )-dependent constants (ii) (ii) (ii) (ii) ˜ (ii) ≤ δ , ˜ (i) c1 , c2 ∈ [1, ∞) such that, for any δ ∈ (0, 1), if n ≥ c1 ln c2 /δ , then P Hn (δ ) \ Hn ˜ ˜ ˜ ˜ ˜ ⋆ ˜ (i) ˜ (ii) and on Hn (δ ) ∩ Hn , we have V (d f ) = Vmd˜ = Vm ˜ . ˆ ˆ df f (ii) (i) Proof Define c1 = max c1 , r 192d , ˜ ˜ (1−γ )/6 211 ˜ 1/3 δf (ii) (i) , c2 = max c2 , r ˜ ˜ 8e (1−γ )/6 , c(ii) , exp {τ ∗ } , let n(ii) (δ ) = (ii) (ii) (ii) ˜ (ii) c1 ln c2 /δ , suppose n ≥ n(ii) (δ ), and define the event Hn = Hmn . ˜ ˜ ˜ ˜ (i) ˜ (ii) By Lemma 41, since n ≥ n(ii) (δ ) ≥ 24 · max {τ ((1 − γ )/6; δ ), τ ∗ }, on Hn (δ ) ∩ Hn , ∀m ∈ N ˜ and k ∈ 1, . . . , d˜f with either k = 1 or m > mn , ⋆ ⋆ ⋆ ˆ (k) ˆ (k) ˆ (k) ∆m Xm ,W2 ,Vm−1 < γ ⇒ Γm Xm , − f (Xm ),W2 ,Vm−1 < Γm Xm , f (Xm ),W2 ,Vm−1 . 1564 (72) ACTIVIZED L EARNING ˜ Recall that mn ≤ min {⌈T1 /4⌉ , 2n } = ⌈⌈2n/3⌉ /4⌉. Therefore, Vmn is obtained purely by mn exe˜ ˜ cutions of Step 8 while k = 1. Thus, for every m obtained in Meta-Algorithm 3, either k = 1 or m > mn . We now proceed by induction on m. We already know V0 = C = V0⋆ , so this serves as ˜ our base case. Now consider some value m ∈ N obtained in Meta-Algorithm 3 while k ≤ d˜f , and ⋆ suppose every m′ < m has Vm′ = Vm′ . 
But this means that Tk = Tk⋆ and the value of t upon obtaining m−1 ⋆ ⋆ ˆ (k) this particular m has t ≤ ℓ=mk−1 +1 Iℓk . In particular, if ∆m (Xm ,W2 ,Vm−1 ) ≥ γ , then Imk = 1, so ˆ ˆ ⋆ ⋆ ⋆ ˜ (ii) ˜ (i) that t < m mk−1 +1 Imk ; by Lemma 51, on Hn (δ ) ∩ Hn , m mk−1 +1 Imk ≤ mk mk−1 +1 Imk ≤ 3Tk⋆ /4, ℓ= ˆ ℓ= ˆ ℓ= ˆ ⋆ /4, and therefore y = Y = f (X ); this implies V = V ⋆ . On the other hand, on so that t < 3Tk ˆ m m m m (ii) (k) (i) ˆ ˜ ˜ Hn (δ ) ∩ Hn , if ∆m (Xm ,W2 ,Vm−1 ) < γ , then (72) implies ˆ (k) y = argmax Γm (Xm , y,W2 ,Vm−1 ) = f (Xm ), ˆ y∈{−1,+1} ⋆ ˜ (i) ˜ (ii) so that again Vm = Vm . Thus, by the principle of induction, on Hn (δ ) ∩ Hn , for every m ∈ N ˜f ) ⋆ ⋆ obtained while k ≤ d˜f , we have Vm = Vm ; in particular, this implies V (d = Vmd˜ = Vm ˜ . The bound ˆ ˆ df f ˜ (i) ˜ (ii) Hn (δ ) \ Hn on P then follows from Lemma 41, as we have chosen that (27) (with τ = mn ) is at most δ . ˜ n(ii) (δ ) sufficiently large so Lemma 53 Consider running Meta-Algorithm 3 with a budget n ∈ N, while f is the target func(iii) (iii) ˜ ˜ tion and P is the data distribution. There exist (C, P, f , γ )-dependent constants c1 , c2 ∈ [1, ∞) −3 ), λ ∈ [1, ∞), and n ∈ N, there exists an event H (iii) (δ , λ ) having ˜n such that, for any δ ∈ (0, e ˜ (ii) ˜ (iii) ˜ (i) P Hn (δ ) ∩ Hn \ Hn (δ , λ ) ≤ δ with the property that, if (iii) n ≥ c1 θ f (d/λ ) ln2 ˜ ˜ (i) (ii) (iii) c2 λ ˜ δ , (iii) ˜ ˜ ˜ then on Hn (δ ) ∩ Hn ∩ Hn (δ , λ ), at the conclusion of Meta-Algorithm 3, Ld˜f ≥ λ . (iii) Proof Let c1 ˜ (i) 10+2d˜f (ii) d·d˜f ·4 ˜ γ 3 δ f3 = max c1 , c1 , ˜ ˜ , r192d (3/32) (iii) , c2 ˜ (i) (ii) = max c2 , c2 , r 8e ˜ ˜ (3/32) , fix any δ ∈ (iii) (iii) ˜ ˜ ˜ (0, e−3 ), λ ∈ [1, ∞), let n(iii) (δ , λ ) = c1 θ f (d/λ ) ln2 (c2 λ /δ ), and suppose n ≥ n(iii) (δ , λ ). ˜ ˜ ˆ Define a sequence ℓi = 2i for integers i ≥ 0, and let ι = log2 42+d f λ /γ δ f . Also define ˜ ˆ φ (m, δ , λ ) = max {φ (m; δ /2ι ) , d/λ }, where φ is defined in Lemma 29. 
Then define the events ˜ H (3) (δ , λ ) = ˆ ι i=1 ˜ (iii) ˜ ˆ Hℓi (δ /2ι ) , Hn (δ , λ ) = H (3) (δ , λ ) ∩ md˜f ≥ ℓι . ˇ ˆ ˆ ˇ Note that ι ≤ n, so that ℓι ≤ 2n , and therefore the truncation in the definition of md˜f , which enforces ˆ ˜f · 2n + 1, mk−1 , will never be a factor in whether or not m ˜ ≥ ℓι is satisfied. ˇ df md˜f ≤ max d ˇ ˆ ˆ (ii) (ii) ⋆ ˜ (ii) ˆ ˜ (i) Since n ≥ n(iii) (λ , δ ) ≥ c1 ln c2 /δ , Lemma 52 implies that on Hn (δ ) ∩ Hn , Vmd˜ = Vm ˜ . ˜ ˜ ˆ f df Recall that this implies that all y values obtained while m ≤ md˜f are consistent with their respective ˆ ˆ 1565 H ANNEKE ⋆ ⋆ f (Xm ) values, so that every such m has Vm = Vm as well. In particular, Vmd˜ = Vm ˜ . Also note that ˇ ˇ df f n(iii) (δ , λ ) Thus, on 24 · τ (iv) (γ /16; δ ), ≥ so that ˜ (ii) ˜ (iii) ˜ (i) Hn (δ ) ∩ Hn ∩ Hn (δ , λ ), τ (iv) (γ /16; δ ) (taking ˆ ∆(k) ≤ mn , and recall we always have mn ≤ md˜f . ˜ ˜ ˇ as in Meta-Algorithm 3) ˜ ⋆ ˆ ˜ ˆ (d ) ∆(d f ) = ∆m ˜f W1 ,W2 ,Vmd˜ ˇ ˇ df (Lemma 52) f ˇ ˇ d˜ ≤ P x : px d˜f , md˜f ≥ γ /8 + 4m−1 (Lemma 44) f ˜ 8P d f ≤ ˜ γ P d f −1 ˜ ⋆ S d f Vm ˜ ˇ df ˜ S d f −1 ˜ ˜ ≤ 8/γ δ f P d f ⋆ Vm ˜ ˇd + 4md˜ ˇ −1 f ˜ ⋆ S d f Vmd˜ ˇ ˜ ˜ ˜ (Markov’s ineq.) f + 4md˜ ˇ −1 ˜ ˜ ≤ 8/γ δ f P d f S d f Vℓ⋆ ˆ ι (Lemma 35) f f ˜ (iii) (defn of Hn (δ , λ )) −1 + 4ℓι ˆ ˜ ˜ ˆ ≤ 8/γ δ f P d f S d f B f , φ (ℓι , δ , λ ) + 4ℓ−1 ˆ ι (Lemma 29) ˜ ˜ ˜ ˆ ≤ 8/γ δ f θ f (d/λ )φ (ℓι , δ , λ ) + 4ℓ−1 ˆ ι ˜ (defn of θ f (d/λ )) ˜ ˜ ˜ ˆ ≤ 12/γ δ f θ f (d/λ )φ (ℓι , δ , λ ) = ˜ ˆ (φ (ℓι , δ , λ ) ≥ ℓ−1 ) ˆ ι ˜ ˆ 12θ f (d/λ ) d ln (2e max {ℓι , d} /d) + ln (4ι /δ ) ˆ max 2 , d/λ . ˜ ℓι γδ f ˆ (73) ˆ Plugging in the definition of ι and ℓι , ˆ ˆ d ln (2e max {ℓι , d} /d) + ln (4ι /δ ) ˜ ˜ ˆ ˜ ˜ ≤ (d/λ )γ δ f 4−1−d f ln 41+d f λ /δ γ δ f ≤ (d/λ ) ln (λ /δ ) . ℓι ˆ ˜ ˜ Therefore, (73) is at most 24θ f (d/λ )(d/λ ) ln (λ /δ ) /γ δ f . 
Thus, since (i) (i) (ii) (ii) ˜ ˜ ˜ n(iii) (δ , λ ) ≥ max c1 ln c2 /δ , c1 ln c2 /δ ˜ , ˜ (i) ˜ (ii) ˜ (iii) Lemmas 51 and 52 imply that on Hn (δ ) ∩ Hn ∩ Hn (δ , λ ), ˜ ˆ Ld˜f = Td⋆f / 3∆(d f ) ˜ ˜ ˜ ˆ ≥ 41−d f 2n/ 9∆(d f ) ˜ ˜ 41−d f γ δ f n ≥ λ ln(λ /δ ) ≥ λ . ≥ ˜ 9 · 24 · θ f (d/λ )(d/λ ) ln (λ /δ ) ˜ (i) ˜ (ii) ˜ (iii) Now we turn to bounding P Hn (δ ) ∩ Hn \ Hn (δ , λ ) . By a union bound, we have ˆ ι ˜ 1 − P H (3) (δ , λ ) ≤ i=1 ˆ (1 − P (Hℓi (δ /2ι ))) ≤ δ /2. 1566 (74) ACTIVIZED L EARNING ˜ (i) ˜ (ii) ˜ Thus, it remains only to bound P Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) ∩ md˜f < ℓι ˇ ˆ . ˇ ˆ For each i ∈ {0, 1, . . . , ι − 1}, let Qi = ⋆ ˇ m ∈ (ℓi , ℓi+1 ] ∩ Ud˜f : Imd˜ = 1 . Now consider the set f ˇ ˆ I of all i ∈ {0, 1, . . . , ι − 1} with ℓi ≥ mn and (ℓi , ℓi+1 ] ∩ Ud˜f = ∅. Note that n(iii) (δ , λ ) ≥ 48, so that ˜ ℓ0 < mn . Fix any i ∈ I. Since n(iii) (λ , δ ) ≥ 24· τ (1/6; δ ), we have mn ≥ τ (1/6; δ ), so that Lemma 49 ˜ ˜ (i) (ii) (3) (δ , λ ), letting Q = 2 · 46+d˜f d/γ 2 δ 2 θ (d/λ ) ln(λ /δ ), ˜ ˜f ¯ ˜ ˜ ˜ implies that on Hn (δ ) ∩ Hn ∩ H f ˇ ¯ ˜ (ii) ˜ ˜ (i) P Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) ∩ Qi > Q W2 ,Vℓ⋆ i ˜ ˆ (d ) m ∈ (ℓi , ℓi+1 ] ∩ N : ∆m f Xm ,W2 ,Vℓ⋆ ≥ 2γ /3 i ≤P ¯ > Q W2 ,Vℓ⋆ . (75) i ˜ ˆm ½[2γ /3,∞) ∆(d f ) Xm ,W2 ,Vℓ⋆i are conditionally (given W2 ,Vℓ⋆ ) indepeni dent, each with respective conditional distribution Bernoulli with mean p2γ /3 d˜f , ℓi , m . Since ¯ n(iii) (δ , λ ) ≥ 24 · τ (3/32; δ ), we have mn ≥ τ (3/32; δ ), so that Lemma 43 (with ζ = 2γ /3, α = 3/4, ˜ (i) ˜ n (δ ) ∩ Hn ∩ H (3) (δ , λ ), each of these m values has ˜ (ii) ˜ and β = 3/32) implies that on H For m > ℓi , the variables ˜ p2γ /3 d˜f , ℓi , m ≤ P x : px d˜f , ℓi ≥ γ /2 + exp −M(m)γ 2 /256 ¯ ˜ ≤ ˜ 2P d f S d f Vℓ⋆ i ˜ γ P d f −1 ˜ S d f −1 ˜ Vℓ⋆ i ˜ ˜ ≤ 2/γ δ f P d f S d f Vℓ⋆ i ˜ + exp −M(ℓi )γ 2 /256 (Markov’s ineq.) 
˜ + exp −M(ℓi )γ 2 /256 ˜ ˜ ˜ ˜ ≤ 2/γ δ f P d f S d f B f , φ (ℓi , δ , λ ) (Lemma 35) ˜ + exp −M(ℓi )γ 2 /256 ˜ ˜ ˜ ˜ ≤ 2/γ δ f θ f (d/λ )φ (ℓi , δ , λ ) + exp −M(ℓi )γ 2 /256 (Lemma 29) ˜ (defn of θ f (d/λ )). Denote the expression in this last line by pi , and let B(ℓi , pi ) be a Binomial(ℓi , pi ) random vari˜ (ii) ˜ ˜ (i) able. Noting that ℓi+1 − ℓi = ℓi , we have that on Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ), (75) is at most ¯ P B(ℓi , pi ) > Q . Next, note that ˜ ˜ ˜ ˜ ℓi pi = (2/γ δ f )θ f (d/λ )ℓi φ (ℓi , δ , λ ) + ℓi · exp −ℓ3 δ f γ 2 /512 . i ˜ Since u · exp −u3 ≤ (3e)−1/3 for any u, letting u = ℓi δ f γ /8 we have ˜ ˜ ˜ ˜ ℓi · exp −ℓ3 δ f γ 2 /512 ≤ 8/γ δ f u · exp −u3 ≤ 8/ γ δ f (3e)1/3 ≤ 4/γ δ f . i 1567 H ANNEKE ˜ Therefore, since φ (ℓi , δ , λ ) ≥ ℓ−1 , we have that ℓi pi is at most i ˆ 6 ˜ 4ι 6 ˜ ˜ θ (d/λ )ℓi φ (ℓi , δ , λ ) ≤ θ (d/λ ) max 2d ln (2eℓι ) + 2 ln , ℓι d/λ ˆ ˆ ˜f f ˜f f δ γδ γδ 6 ˜ ≤ θ (d/λ ) max 2d ln ˜ f γδ f ≤ 6 ˜ θ (d/λ ) max 4d ln ˜ f γδ f ˜ ˜ 43+d f eλ ˜ γδ f + 2 ln ˜ d44+d f λ 6 ˜ ln θ f (d/λ ) · ≤ ˜f ˜f δ γδ γδ ˜ d43+d f , ˜ γδ f ˜ ˜ 43+d f λ ˜ γδ f δ 43+d f 2λ ˜ γδ f δ , d43+d f ˜ γδ f ˜ 46+d f d ˜ λ ≤ θ f (d/λ ) ln ˜ 2δ 2 δ γ f ¯ = Q/2. (i) ¯ ¯ ˜ ˆ Therefore, a Chernoff bound implies P B(ℓi , pi ) > Q ≤ exp −Q/6 ≤ δ /2ι , so that on Hn (δ ) ∩ (ii) (3) (δ , λ ), (75) is at most δ /2ι . The law of total probability implies there exists an event ˜ ˜ ˆ Hn ∩ H (4) (i) (ii) ˜ ˜ ˜ ˜ ˜ (4) ˜ (i) ˆ Hn (i, δ , λ ) with P Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) \ Hn (i, δ , λ ) ≤ δ /2ι such that, on Hn (δ ) ∩ ˇ ¯ ˜ (4) ˜ (ii) ˜ Hn ∩ H (3) (δ , λ ) ∩ Hn (i, δ , λ ), Qi ≤ Q. Note that ˜ ˜ ˜ ˜ ˜ ˆ¯ ι Q ≤ log2 42+d f λ /γ δ f · 47+d f d/γ 2 δ f2 θ f (d/λ ) ln(λ /δ ) ˜ ˜ ˜ ˜ ≤ d˜f 49+d f /γ 3 δ f3 d θ f (d/λ ) ln2 (λ /δ ) ≤ 41−d f n/12. ⋆ m≤2mn Imd˜f ˜ ˜ (4) i∈I Hn (i, δ , λ ), Since (76) ˜ (i) ˜ (ii) ˜ ≤ n/12, if d˜f = 1 then (76) implies that on the event Hn (δ )∩ Hn ∩ H (3) (δ , λ )∩ ˇ ˆ¯ ≤ n/12 + i∈I Qi ≤ n/12 + ι Q ≤ n/6 ≤ ⌈T1⋆ /4⌉, so that m1 ≥ ℓι . 
ˇ ˆ (i) ˇ ˇ ˇ ˜ ˜ Otherwise, if d˜f > 1, then every m ∈ Ud˜f has m > 2mn , so that i≤ˆ Qi = i∈I Qi ; thus, on Hn (δ )∩ ι ˜f ˜f (4) (ii) ˇ ˜ ˜ ˜ ˆ¯ Hn ∩ H (3) (δ , λ ) ∩ i∈I Hn (i, δ , λ ), i∈I Qi ≤ ι Q ≤ 41−d n/12; Lemma 50 implies 41−d n/12 ≤ ⋆ m≤ℓι Im1 ˆ ˇ Td⋆ /4 , so that again we have md˜f ≥ ℓι . Combined with a union bound, this implies ˆ ˜ f ˜ (ii) ˜ ˜ (i) ˇ P Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) ∩ md˜f < ℓι ˆ ˜ (ii) ˜ ˜ (i) ≤ P Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) \ ≤ P i∈I ˜ (4) Hn (i, δ , λ ) i∈I ˜ (i) ˜ (ii) ˜ ˜ (4) Hn (δ ) ∩ Hn ∩ H (3) (δ , λ ) \ Hn (i, δ , λ ) ≤ δ /2. (77) ˜ (ii) ˜ (iii) ˜ (i) Therefore, P Hn (δ ) ∩ Hn \ Hn (δ , λ ) ≤ δ , obtained by summing (77) and (74). Proof [Theorem 16] If Λ p (ε /4, f , P) = ∞ then the result trivially holds. Otherwise, suppose (i) (ii) (iii) ε ∈ (0, 10e−3 ), let δ = ε /10, λ = Λ p (ε /4, f , P), c2 = max 10c2 , 10c2 , 10c2 , 10e(d + 1) , ˜ ˜ ˜ ˜ (i) (ii) (iii) and c1 = max c1 , c1 , c1 , 2 · 63 (d + 1)d˜ln(e(d + 1)) , and consider running Meta-Algorithm ˜ ˜ ˜ ˜ 1568 ACTIVIZED L EARNING 3 with passive algorithm A p and budget n ≥ c1 θ f (d/λ ) ln2 (c2 λ /ε ), while f is the target func˜ ˜ ˜ (i) ˜ (ii) ˜ (iii) ˜ n (δ ) ∩ Hn ∩ Hn (δ , λ ), Lemma 53 imtion and P is the data distribution. On the event H ˜ ⋆ plies Ld˜f ≥ λ , while Lemma 52 implies V (d f ) = Vm ˜ ; recalling that Lemma 35 implies that ˆ df ⋆ Vm ˜ ˆd f ˆ = ∅ on this event, we must have erLd˜ ( f ) = 0. Furthermore, if h is the classifier returned f ˆ by Meta-Algorithm 3, then Lemma 34 implies that er(h) is at most 2 er(A p (Ld˜f )), on a high ˆ ˜ (i) ˜ (ii) ˜ (iii) ˆ ˆ probability event (call it E2 in this context). Letting E3 (δ ) = E2 ∩ Hn (δ ) ∩ Hn ∩ Hn (δ , λ ), ˆ a union bound implies the total failure probability 1 − P(E3 (δ )) from all of these events is at most 4δ + e(d + 1) · exp −⌊n/3⌋/ 72d˜f (d + 1) ln(e(d + 1)) ≤ 5δ = ε /2. 
Since, for $\ell \in \mathbb{N}$ with $P(|\mathcal{L}_{\tilde{d}_f}| = \ell) > 0$, the sequence of $X_m$ values appearing in $\mathcal{L}_{\tilde{d}_f}$ are conditionally distributed as $P^{\ell}$ given $|\mathcal{L}_{\tilde{d}_f}| = \ell$, and this is the same as the (unconditional) distribution of $\{X_1, X_2, \ldots, X_\ell\}$, we have that

$$E\big[\mathrm{er}(\hat{h})\big] \le E\Big[2\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde{d}_f})\big)\mathbb{1}_{\hat{E}_3(\delta)}\Big] + \varepsilon/2 = E\Big[E\Big[2\,\mathrm{er}\big(\mathcal{A}_p(\mathcal{L}_{\tilde{d}_f})\big)\mathbb{1}_{\hat{E}_3(\delta)} \,\Big|\, |\mathcal{L}_{\tilde{d}_f}|\Big]\Big] + \varepsilon/2 \le 2 \sup_{\ell \ge \Lambda_p(\varepsilon/4, f, P)} E\big[\mathrm{er}(\mathcal{A}_p(\mathcal{Z}_\ell))\big] + \varepsilon/2 \le \varepsilon.$$

To specialize to the specific variant of Meta-Algorithm 3 stated in Section 5.2, take $\gamma = 1/2$.

Appendix E. Proofs Related to Section 6: Agnostic Learning

This appendix contains the proofs of our results on learning with noise. Specifically, Appendix E.1 provides the proof of the counterexample from Theorem 22, demonstrating that there is no activizer for the $\check{\mathcal{A}}_p$ passive learning algorithm described in Section 6.2 in the agnostic case. Appendix E.2 presents the proof of Lemma 26 from Section 6.7, bounding the label complexity of Algorithm 5 under Condition 1. Finally, Appendix E.3 presents a proof of Theorem 28, demonstrating that any active learning algorithm can be modified to trivialize the misspecified model case. The notation used throughout Appendix E is taken from Section 6.

E.1 Proof of Theorem 22: Negative Result for Agnostic Activized Learning

It suffices to show that $\check{\mathcal{A}}_p$ achieves a label complexity $\Lambda_p$ such that, for any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a distribution $P_{XY}$ on $\mathcal{X} \times \{-1, +1\}$ such that $P_{XY} \in \mathrm{Nontrivial}(\Lambda_p; \mathbb{C})$ and yet $\Lambda_a(\nu + c\varepsilon, P_{XY}) \ne o(\Lambda_p(\nu + \varepsilon, P_{XY}))$ for every constant $c \in (0, \infty)$. Specifically, we will show that there is a distribution $P_{XY}$ for which $\Lambda_p(\nu + \varepsilon, P_{XY}) = \Theta(1/\varepsilon)$ and $\Lambda_a(\nu + \varepsilon, P_{XY}) \ne o(1/\varepsilon)$.

Let $P(\{0\}) = 1/2$, and for any measurable $A \subseteq (0, 1]$, $P(A) = \lambda(A)/2$, where $\lambda$ is Lebesgue measure.
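The marginal distribution just defined is easy to simulate, which can help when experimenting with the construction. The following is a minimal Python sketch, not part of the original proof; the function names `sample_marginal` and `prob_interval` are hypothetical and chosen here for illustration only.

```python
import random

def sample_marginal(rng: random.Random) -> float:
    """Draw one X from P: mass 1/2 at the point 0, and the remaining
    mass 1/2 spread uniformly over the interval (0, 1]."""
    if rng.random() < 0.5:
        return 0.0
    # random() returns a value in [0, 1), so 1 - U lies in (0, 1].
    return 1.0 - rng.random()

def prob_interval(lo: float, hi: float) -> float:
    """P((lo, hi]) for 0 <= lo <= hi <= 1: Lebesgue measure divided by 2."""
    return (hi - lo) / 2.0
```

With a fixed seed, the empirical frequency of `X == 0` should be close to 1/2, and the frequency of any subinterval of (0, 1] close to half its length.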
Let $D$ be the family of distributions $P_{XY}$ on $\mathcal{X} \times \{-1, +1\}$ characterized by the properties that the marginal distribution on $\mathcal{X}$ is $P$, $\eta(0; P_{XY}) \in (1/8, 3/8)$, and $\forall x \in (0, 1]$,

$$\eta(x; P_{XY}) = \eta(0; P_{XY}) + (x/2) \cdot (1 - \eta(0; P_{XY})).$$

Thus, $\eta(x; P_{XY})$ is a linear function. For any $P_{XY} \in D$, since the point $z^* = \frac{1 - 2\eta(0; P_{XY})}{1 - \eta(0; P_{XY})}$ has $\eta(z^*; P_{XY}) = 1/2$, we see that $f = h_{z^*}$ is a Bayes optimal classifier. Furthermore, for any $\eta_0 \in [1/8, 3/8]$,

$$\left| \frac{1 - 2\eta_0}{1 - \eta_0} - \frac{1 - 2\eta(0; P_{XY})}{1 - \eta(0; P_{XY})} \right| = \frac{|\eta(0; P_{XY}) - \eta_0|}{(1 - \eta_0)(1 - \eta(0; P_{XY}))},$$

and since $(1 - \eta_0)(1 - \eta(0; P_{XY})) \in (25/64, 49/64) \subset (1/3, 1)$, the value $z = \frac{1 - 2\eta_0}{1 - \eta_0}$ satisfies

$$|\eta_0 - \eta(0; P_{XY})| \le |z - z^*| \le 3|\eta_0 - \eta(0; P_{XY})|. \qquad (78)$$

Also note that under $P_{XY}$, since $(1 - 2\eta(0; P_{XY})) = (1 - \eta(0; P_{XY})) z^*$, any $z \in (0, 1)$ has

$$\mathrm{er}(h_z) - \mathrm{er}(h_{z^*}) = \int_z^{z^*} \big(1 - 2\eta(x; P_{XY})\big)\, dx = \int_z^{z^*} \big(1 - 2\eta(0; P_{XY}) - x(1 - \eta(0; P_{XY}))\big)\, dx = (1 - \eta(0; P_{XY})) \int_z^{z^*} (z^* - x)\, dx = \frac{1 - \eta(0; P_{XY})}{2} (z^* - z)^2,$$

so that

$$\frac{5}{16}(z - z^*)^2 \le \mathrm{er}(h_z) - \mathrm{er}(h_{z^*}) \le \frac{7}{16}(z - z^*)^2. \qquad (79)$$

Finally, note that any $x, x' \in (0, 1]$ with $|x - z^*| < |x' - z^*|$ has

$$|1 - 2\eta(x; P_{XY})| = |x - z^*|(1 - \eta(0; P_{XY})) < |x' - z^*|(1 - \eta(0; P_{XY})) = |1 - 2\eta(x'; P_{XY})|.$$

Thus, for any $q \in (0, 1/2]$, there exists $z'_q \in [0, 1]$ such that $z^* \in [z'_q, z'_q + 2q] \subseteq [0, 1]$, and the classifier $h'_q(x) = h_{z^*}(x) \cdot \big(1 - 2 \cdot \mathbb{1}_{(z'_q, z'_q + 2q]}(x)\big)$ has $\mathrm{er}(h) \ge \mathrm{er}(h'_q)$ for every classifier $h$ with $h(0) = -1$ and $P(x : h(x) \ne h_{z^*}(x)) = q$. Noting that $\mathrm{er}(h'_q) - \mathrm{er}(h_{z^*}) = \big(\lim_{z \downarrow z'_q} \mathrm{er}(h_z) - \mathrm{er}(h_{z^*})\big) + \big(\mathrm{er}(h_{z'_q + 2q}) - \mathrm{er}(h_{z^*})\big)$, (79) implies that $\mathrm{er}(h'_q) - \mathrm{er}(h_{z^*}) \ge \frac{5}{16}\big((z'_q - z^*)^2 + (z'_q + 2q - z^*)^2\big)$, and since $\max\{z^* - z'_q,\; z'_q + 2q - z^*\} \ge q$, this is at least $\frac{5}{16} q^2$. In general, any $h$ with $h(0) = +1$ has $\mathrm{er}(h) - \mathrm{er}(h_{z^*}) \ge 1/2 - \eta(0; P_{XY}) > 1/8 \ge (1/8) P(x : h(x) \ne h_{z^*}(x))^2$. Combining these facts, we see that any classifier $h$ has

$$\mathrm{er}(h) - \mathrm{er}(h_{z^*}) \ge (1/8)\, P(x : h(x) \ne h_{z^*}(x))^2. \qquad (80)$$

Lemma 54 The passive learning algorithm $\check{\mathcal{A}}_p$ achieves a label complexity $\Lambda_p$ such that, for every $P_{XY} \in D$, $\Lambda_p(\nu + \varepsilon, P_{XY}) = \Theta(1/\varepsilon)$.

Proof Consider the values $\hat{\eta}_0$ and $\hat{z}$ from $\check{\mathcal{A}}_p(\mathcal{Z}_n)$ for some $n \in \mathbb{N}$. Combining (78) and (79), we have $\mathrm{er}(h_{\hat{z}}) - \mathrm{er}(h_{z^*}) \le \frac{7}{16}(\hat{z} - z^*)^2 \le \frac{63}{16}(\hat{\eta}_0 - \eta(0; P_{XY}))^2 \le 4(\hat{\eta}_0 - \eta(0; P_{XY}))^2$. Let $N_n = |\{i \in \{1, \ldots, n\} : X_i = 0\}|$, and $\bar{\eta}_0 = N_n^{-1} |\{i \in \{1, \ldots, n\} : X_i = 0, Y_i = +1\}|$ if $N_n > 0$, or $\bar{\eta}_0 = 0$ if $N_n = 0$. Note that $\hat{\eta}_0 = \big(\bar{\eta}_0 \vee \frac{1}{8}\big) \wedge \frac{3}{8}$, and since $\eta(0; P_{XY}) \in (1/8, 3/8)$, we have $|\hat{\eta}_0 - \eta(0; P_{XY})| \le |\bar{\eta}_0 - \eta(0; P_{XY})|$. Therefore, for any $P_{XY} \in D$,

$$E\big[\mathrm{er}(h_{\hat{z}}) - \mathrm{er}(h_{z^*})\big] \le 4E\big[(\hat{\eta}_0 - \eta(0; P_{XY}))^2\big] \le 4E\big[(\bar{\eta}_0 - \eta(0; P_{XY}))^2\big] \le 4E\Big[E\big[(\bar{\eta}_0 - \eta(0; P_{XY}))^2 \,\big|\, N_n\big]\, \mathbb{1}_{[n/4, n]}(N_n)\Big] + 4P(N_n < n/4). \qquad (81)$$

By a Chernoff bound, $P(N_n < n/4) \le \exp\{-n/16\}$, and since the conditional distribution of $N_n \bar{\eta}_0$ given $N_n$ is $\mathrm{Binomial}(N_n, \eta(0; P_{XY}))$, (81) is at most

$$4E\left[\frac{1}{N_n \vee n/4}\, \eta(0; P_{XY})(1 - \eta(0; P_{XY}))\right] + 4 \cdot \exp\{-n/16\} \le 4 \cdot \frac{4}{n} \cdot \frac{15}{64} + 4 \cdot \frac{16}{n} < \frac{68}{n}.$$

For any $n \ge \lceil 68/\varepsilon \rceil$, this is at most $\varepsilon$. Therefore, $\check{\mathcal{A}}_p$ achieves a label complexity $\Lambda_p$ such that, for any $P_{XY} \in D$, $\Lambda_p(\nu + \varepsilon, P_{XY}) = \lceil 68/\varepsilon \rceil = \Theta(1/\varepsilon)$.

Next we establish a corresponding lower bound for any active learning algorithm. Note that this requires more than a simple minimax lower bound, since we must have an asymptotic lower bound for a fixed $P_{XY}$, rather than selecting a different $P_{XY}$ for each $\varepsilon$ value; this is akin to the strong minimax lower bounds proven by Antos and Lugosi (1998) for passive learning in the realizable case. For this, we proceed by reduction from the task of estimating a binomial mean; toward this end, the following lemma will be useful.

Lemma 55 For any nonempty $(a, b) \subset [0, 1]$, and any sequence of estimators $\hat{p}_n : \{0, 1\}^n \to [0, 1]$, there exists $p \in (a, b)$ such that, if $B_1, B_2, \ldots$ are independent $\mathrm{Bernoulli}(p)$ random variables, also independent from every $\hat{p}_n$, then $E\big[(\hat{p}_n(B_1, \ldots, B_n) - p)^2\big] \ne o(1/n)$.

Proof We first establish the claim when $a = 0$ and $b = 1$. For any $p \in [0, 1]$, let $B_1(p), B_2(p), \ldots$ be i.i.d. $\mathrm{Bernoulli}(p)$ random variables, independent from any internal randomness of the $\hat{p}_n$ estimators. We proceed by reduction from hypothesis testing, for which there are known lower bounds. Specifically, it is known (e.g., Wald, 1945; Bar-Yossef, 2003) that for any $p, q \in (0, 1)$, $\delta \in (0, e^{-1})$, any (possibly randomized) $\hat{q} : \{0, 1\}^n \to \{p, q\}$, and any $n \in \mathbb{N}$,

$$n < \frac{(1 - 8\delta) \ln(1/8\delta)}{8\, \mathrm{KL}(p \| q)} \implies \max_{p^* \in \{p, q\}} P\big(\hat{q}(B_1(p^*), \ldots, B_n(p^*)) \ne p^*\big) > \delta,$$

where $\mathrm{KL}(p \| q) = p \ln(p/q) + (1 - p) \ln((1 - p)/(1 - q))$. It is also known (e.g., Poland and Hutter, 2006) that for $p, q \in [1/4, 3/4]$, $\mathrm{KL}(p \| q) \le (8/3)(p - q)^2$. Combining this with the above fact, we have that for $p, q \in [1/4, 3/4]$,

$$\max_{p^* \in \{p, q\}} P\big(\hat{q}(B_1(p^*), \ldots, B_n(p^*)) \ne p^*\big) \ge (1/16) \cdot \exp\big\{-128(p - q)^2 n/3\big\}. \qquad (82)$$

Given the estimator $\hat{p}_n$ from the lemma statement, we construct a sequence of hypothesis tests as follows. For $i \in \mathbb{N}$, let $\alpha_i = \exp\{-2^i\}$ and $n_i = \lfloor 1/\alpha_i^2 \rfloor$. Define $p_0^* = 1/4$, and for $i \in \mathbb{N}$, inductively define $\hat{q}_i(b_1, \ldots, b_{n_i}) = \mathrm{argmin}_{p \in \{p_{i-1}^*,\, p_{i-1}^* + \alpha_i\}} |\hat{p}_{n_i}(b_1, \ldots, b_{n_i}) - p|$ for $b_1, \ldots, b_{n_i} \in \{0, 1\}$, and $p_i^* = \mathrm{argmax}_{p \in \{p_{i-1}^*,\, p_{i-1}^* + \alpha_i\}} P\big(\hat{q}_i(B_1(p), \ldots, B_{n_i}(p)) \ne p\big)$. Finally, define $p^* = \lim_{i \to \infty} p_i^*$. Note that $\forall i \in \mathbb{N}$, $p_i^* < 1/2$, $p_{i-1}^*, p_{i-1}^* + \alpha_i \in [1/4, 3/4]$, and $0 \le p^* - p_i^* \le \sum_{j=i+1}^{\infty} \alpha_j < 2\alpha_{i+1} = 2\alpha_i^2$. We generally have

$$E\big[(\hat{p}_{n_i}(B_1(p^*), \ldots, B_{n_i}(p^*)) - p^*)^2\big] \ge \frac{1}{3} E\big[(\hat{p}_{n_i}(B_1(p^*), \ldots, B_{n_i}(p^*)) - p_i^*)^2\big] - (p^* - p_i^*)^2 \ge \frac{1}{3} E\big[(\hat{p}_{n_i}(B_1(p^*), \ldots, B_{n_i}(p^*)) - p_i^*)^2\big] - 4\alpha_i^4.$$

Furthermore, note that for any $m \in \{0, \ldots, n_i\}$,

$$(p^*)^m (1 - p^*)^{n_i - m} \ge (p_i^*)^m (1 - p_i^*)^{n_i - m} \left(\frac{1 - p^*}{1 - p_i^*}\right)^{n_i}, \qquad \left(\frac{1 - p^*}{1 - p_i^*}\right)^{n_i} \ge \left(\frac{1 - p_i^* - 2\alpha_i^2}{1 - p_i^*}\right)^{n_i} \ge \big(1 - 4\alpha_i^2\big)^{n_i} \ge \exp\big\{-8\alpha_i^2 n_i\big\} \ge e^{-8},$$

so that the probability mass function of $(B_1(p^*), \ldots, B_{n_i}(p^*))$ is never smaller than $e^{-8}$ times that of $(B_1(p_i^*), \ldots, B_{n_i}(p_i^*))$, which implies (by the law of the unconscious statistician)

$$E\big[(\hat{p}_{n_i}(B_1(p^*), \ldots, B_{n_i}(p^*)) - p_i^*)^2\big] \ge e^{-8} E\big[(\hat{p}_{n_i}(B_1(p_i^*), \ldots, B_{n_i}(p_i^*)) - p_i^*)^2\big].$$

By a triangle inequality, we have

$$E\big[(\hat{p}_{n_i}(B_1(p_i^*), \ldots, B_{n_i}(p_i^*)) - p_i^*)^2\big] \ge \frac{\alpha_i^2}{4} P\big(\hat{q}_i(B_1(p_i^*), \ldots, B_{n_i}(p_i^*)) \ne p_i^*\big).$$

By (82), this is at least

$$\frac{\alpha_i^2}{4} (1/16) \cdot \exp\big\{-128 \alpha_i^2 n_i/3\big\} \ge 2^{-6} e^{-43} \alpha_i^2.$$

Combining the above, we have

$$E\big[(\hat{p}_{n_i}(B_1(p^*), \ldots, B_{n_i}(p^*)) - p^*)^2\big] \ge 3^{-1} 2^{-6} e^{-51} \alpha_i^2 - 4\alpha_i^4 \ge 2^{-9} e^{-51} n_i^{-1} - 4 n_i^{-2}.$$

For $i \ge 5$, this is larger than $2^{-11} e^{-51} n_i^{-1}$. Since $n_i$ diverges as $i \to \infty$, we have that

$$E\big[(\hat{p}_n(B_1(p^*), \ldots, B_n(p^*)) - p^*)^2\big] \ne o(1/n),$$

which establishes the result for $a = 0$ and $b = 1$.

To extend this result to general nonempty ranges $(a, b)$, we proceed by reduction from the above problem. Specifically, suppose $p' \in (0, 1)$, and consider the following independent random variables (also independent from the $B_i(p')$ variables and $\hat{p}_n$ estimators). For each $i \in \mathbb{N}$, $C_{i1} \sim \mathrm{Bernoulli}(a)$, $C_{i2} \sim \mathrm{Bernoulli}((b - a)/(1 - a))$. Then for $b_i \in \{0, 1\}$, define $B'_i(b_i) = \max\{C_{i1}, C_{i2} \cdot b_i\}$. For any given $p' \in (0, 1)$, the random variables $B'_i(B_i(p'))$ are i.i.d. $\mathrm{Bernoulli}(p)$, with $p = a + (b - a)p' \in (a, b)$ (which forms a bijection between $(0, 1)$ and $(a, b)$). Defining $\hat{p}'_n(b_1, \ldots, b_n) = \big(\hat{p}_n(B'_1(b_1), \ldots, B'_n(b_n)) - a\big)/(b - a)$, we have

$$E\big[(\hat{p}_n(B_1(p), \ldots, B_n(p)) - p)^2\big] = (b - a)^2 \cdot E\Big[\big(\hat{p}'_n(B_1(p'), \ldots, B_n(p')) - p'\big)^2\Big]. \qquad (83)$$

We have already shown there exists a value of $p' \in (0, 1)$ such that the right side of (83) is not $o(1/n)$. Therefore, the corresponding value of $p = a + (b - a)p' \in (a, b)$ has the left side of (83) not $o(1/n)$, which establishes the result.

We are now ready for the lower bound result for our setting.
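Two elementary ingredients of the proof of Lemma 55 can be sanity-checked numerically: the bound KL(p‖q) ≤ (8/3)(p − q)² on [1/4, 3/4], and the fact that max{C1, C2·B} with C1 ~ Bernoulli(a), C2 ~ Bernoulli((b − a)/(1 − a)), B ~ Bernoulli(p′) is Bernoulli(a + (b − a)p′). The Python sketch below is illustrative only; the function names and the grid resolution are assumptions introduced here, and a grid check is of course not a proof.

```python
import math
from fractions import Fraction

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def max_kl_ratio(grid: int = 200) -> float:
    """Largest value of KL(p||q) / (p - q)^2 over a uniform grid of
    distinct pairs (p, q) in [1/4, 3/4] x [1/4, 3/4]."""
    pts = [0.25 + 0.5 * k / grid for k in range(grid + 1)]
    return max(
        kl_bernoulli(p, q) / (p - q) ** 2
        for p in pts for q in pts if p != q
    )

def rescaled_success_prob(a: Fraction, b: Fraction, p_prime: Fraction) -> Fraction:
    """Exact P(max{C1, C2 * B} = 1) for independent C1 ~ Bernoulli(a),
    C2 ~ Bernoulli((b - a) / (1 - a)), B ~ Bernoulli(p_prime)."""
    c2 = (b - a) / (1 - a)
    # The maximum equals 0 only when C1 = 0 and C2 * B = 0.
    return 1 - (1 - a) * (1 - c2 * p_prime)
```

Using exact rational arithmetic for the second check avoids any floating-point tolerance: the returned value should coincide with `a + (b - a) * p_prime` for every admissible input.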
Lemma 56 For any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a $P_{XY} \in D$ such that $\Lambda_a(\nu + \varepsilon, P_{XY}) \ne o(1/\varepsilon)$.

Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $\mathcal{A}_a$; we use $\mathcal{A}_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1, B_2, \ldots, B_n$ i.i.d. $\mathrm{Bernoulli}(p)$, for some $p \in (1/8, 3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1, X_2, \ldots$ random variables i.i.d. with distribution $P$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution $\mathrm{Bernoulli}(X_i/2)$ given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well. We run $\mathcal{A}_a$ with this sequence of $X_i$ values. For the $t$-th label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t, C_i\} - 1$ for $Y_i$. Note that in the latter case, the conditional distribution of $\max\{B_t, C_i\}$ is $\mathrm{Bernoulli}(p + (1 - p)X_i/2)$, given the $X_i$ that $\mathcal{A}_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $P_{XY} \in D$ with $\eta(0; P_{XY}) = p$ (i.e., $\eta(X_i; P_{XY}) = p + (1 - p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and $X_j$ sequence, this is distributionally equivalent to running $\mathcal{A}_a$ under the $P_{XY} \in D$ with $\eta(0; P_{XY}) = p$.

Let $\hat{h}_n$ be the classifier returned by $\mathcal{A}_a(n)$ in the above context, and let $\hat{z}_n$ denote the value of $z \in [2/5, 6/7]$ with minimum $P(x : h_z(x) \ne \hat{h}_n(x))$. Then define $\hat{p}_n = \frac{1 - \hat{z}_n}{2 - \hat{z}_n} \in [1/8, 3/8]$ and $z^* = \frac{1 - 2p}{1 - p} \in (2/5, 6/7)$.
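The label-simulation step in this reduction, and the change of variables between the noise level at 0 and the Bayes threshold, can be made concrete with a short Python sketch. This is an illustration under stated assumptions rather than code from the paper: the helper names (`eta`, `simulated_label`, `z_star`, `p_from_z`) are hypothetical, and only a single fresh label request is simulated (the bookkeeping for repeated requests is omitted).

```python
import random
from fractions import Fraction

def eta(x, p):
    """Probability of label +1 at x for the member of the family with eta(0) = p."""
    return p + (1 - p) * x / 2

def simulated_label(b_t: int, x: float, rng: random.Random) -> int:
    """Answer a fresh label request at x using the Bernoulli(p) draw b_t.
    Draws C ~ Bernoulli(x / 2) and returns 2 * max{b_t, C} - 1 in {-1, +1}."""
    c = 1 if rng.random() < x / 2 else 0
    return 2 * max(b_t, c) - 1

def z_star(p):
    """Bayes-optimal threshold for noise level p at the origin."""
    return (1 - 2 * p) / (1 - p)

def p_from_z(z):
    """Inverse of z_star: recovers p from a threshold z."""
    return (1 - z) / (2 - z)
```

The identity 1 − (1 − p)(1 − x/2) = p + (1 − p)x/2 is what makes the simulated labels match the target conditional distribution, and `p_from_z` is the map used to turn the fitted threshold into an estimate of p.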
By a triangle inequality, we have $|\hat{z}_n - z^*| = 2P(x : h_{\hat{z}_n}(x) \ne h_{z^*}(x)) \le 4P(x : \hat{h}_n(x) \ne h_{z^*}(x))$. Combining this with (80) and (78) implies that

$$\mathrm{er}(\hat{h}_n) - \mathrm{er}(h_{z^*}) \ge \frac{1}{8} \Big(P\big(x : \hat{h}_n(x) \ne h_{z^*}(x)\big)\Big)^2 \ge \frac{1}{128} (\hat{z}_n - z^*)^2 \ge \frac{1}{128} (\hat{p}_n - p)^2. \qquad (84)$$

In particular, by Lemma 55, we can choose $p \in (1/8, 3/8)$ so that $E\big[(\hat{p}_n - p)^2\big] \ne o(1/n)$, which, by (84), implies $E\big[\mathrm{er}(\hat{h}_n)\big] - \nu \ne o(1/n)$. This means there is an increasing infinite sequence of values $n_k \in \mathbb{N}$, and a constant $c \in (0, \infty)$ such that $\forall k \in \mathbb{N}$, $E\big[\mathrm{er}(\hat{h}_{n_k})\big] - \nu \ge c/n_k$. Supposing $\mathcal{A}_a$ achieves label complexity $\Lambda_a$, and taking the values $\varepsilon_k = c/(2n_k)$, we have $\Lambda_a(\nu + \varepsilon_k, P_{XY}) > n_k = c/(2\varepsilon_k)$. Since $\varepsilon_k > 0$ and approaches 0 as $k \to \infty$, we have $\Lambda_a(\nu + \varepsilon, P_{XY}) \ne o(1/\varepsilon)$.

Proof [of Theorem 22] The result follows from Lemmas 54 and 56.

E.2 Proof of Lemma 26: Label Complexity of Algorithm 5

The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5. As before, in this section we will fix a particular joint distribution $P_{XY}$ on $\mathcal{X} \times \{-1, +1\}$ with marginal $P$ on $\mathcal{X}$, and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose $P_{XY}$ satisfies Condition 1 for some finite parameters $\mu$ and $\kappa$. We also fix any $f \in \bigcap_{\varepsilon > 0} \mathrm{cl}(\mathbb{C}(\varepsilon))$. Furthermore, we will continue using the notation of Appendix B, such as $S^k(H)$, etc., and in particular we continue to denote $V_m^\star = \{h \in \mathbb{C} : \forall \ell \le m,\ h(X_\ell) = f(X_\ell)\}$ (though note that in this case, we may sometimes have $f(X_\ell) \ne Y_\ell$, so that $V_m^\star \ne \mathbb{C}[\mathcal{Z}_m]$). As in the above proofs, we will prove a slightly more general result in which the "1/2" threshold in Step 5 can be replaced by an arbitrary constant $\gamma \in (0, 1)$.

For the estimators $\hat{P}_{4m}$ used in the algorithm, we take the same definitions as in Appendix B.1.
To be clear, we assume the sequences $W_1$ and $W_2$ mentioned there are independent from the entire $(X_1, Y_1), (X_2, Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $W_1$ and $W_2$ sequences can be constructed in a preprocessing step.

We will consider running Algorithm 5 with label budget $n \in \mathbb{N}$ and confidence parameter $\delta \in (0, e^{-3})$, and analyze properties of the internal sets $V_i$. We will denote by $\hat{V}_i$, $\hat{\mathcal{L}}_i$, and $\hat{i}_k$, the final values of $V_i$, $\mathcal{L}_i$, and $i_k$, respectively, for each $i$ and $k$ in Algorithm 5. We also denote by $\hat{m}^{(k)}$ and $\hat{V}^{(k)}$ the final values of $m$ and $V_{i_k + 1}$, respectively, obtained while $k$ has the specified value in Algorithm 5; $\hat{V}^{(k)}$ may be smaller than $\hat{V}_{\hat{i}_k}$ when $\hat{m}^{(k)}$ is not a power of 2. Additionally, define $\mathcal{L}_i^\star = \{(X_m, Y_m)\}_{m = 2^{i-1} + 1}^{2^i}$. After establishing a few results concerning these, we will show that for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds. First, we have a few auxiliary definitions. For $H \subseteq \mathbb{C}$, and any $i \in \mathbb{N}$, define

$$\phi_i(H) = E \sup_{h_1, h_2 \in H} \Big( \big(\mathrm{er}(h_1) - \mathrm{er}_{\mathcal{L}_i^\star}(h_1)\big) - \big(\mathrm{er}(h_2) - \mathrm{er}_{\mathcal{L}_i^\star}(h_2)\big) \Big)$$

and

$$\tilde{U}_i(H, \delta) = \min\left\{ \tilde{K} \left( \phi_i(H) + \sqrt{\mathrm{diam}(H) \frac{\ln(32 i^2/\delta)}{2^{i-1}}} + \frac{\ln(32 i^2/\delta)}{2^{i-1}} \right),\ 1 \right\},$$

where for our purposes we can take $\tilde{K} = 8272$. It is known (see, e.g., Massart and Nédélec, 2006; Giné and Koltchinskii, 2006) that for some universal constant $c' \in [2, \infty)$,

$$\phi_{i+1}(H) \le c' \max\left\{ \sqrt{\mathrm{diam}(H)\, 2^{-i}\, d \log_2 \frac{2}{\mathrm{diam}(H)}},\ 2^{-i} d i \right\}. \qquad (85)$$

We also generally have $\phi_i(H) \le 2$ for every $i \in \mathbb{N}$. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.

Lemma 57 For any $\delta \in (0, e^{-3})$, any $H \subseteq \mathbb{C}$ with $f \in \mathrm{cl}(H)$, and any $i \in \mathbb{N}$, on an event $K_i$ with $P(K_i) \ge 1 - \delta/4i^2$, $\forall h \in H$,

$$\mathrm{er}_{\mathcal{L}_i^\star}(h) - \min_{h' \in H} \mathrm{er}_{\mathcal{L}_i^\star}(h') \le \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_i(H, \delta)$$
$$\mathrm{er}(h) - \mathrm{er}(f) \le \mathrm{er}_{\mathcal{L}_i^\star}(h) - \mathrm{er}_{\mathcal{L}_i^\star}(f) + \hat{U}_i(H, \delta)$$
$$\min\big\{\hat{U}_i(H, \delta),\ 1\big\} \le \tilde{U}_i(H, \delta).$$

Lemma 57 essentially follows from a version of Talagrand's inequality.
The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011) and Koltchinskii (2010). The only minor twist here is that $f$ need only be in $\mathrm{cl}(H)$, rather than in $H$ itself, which easily follows from Koltchinskii's original results, since the Borel-Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in H(\varepsilon)$ (very close to $f$) with $\mathrm{er}_{\mathcal{L}_i^\star}(g) = \mathrm{er}_{\mathcal{L}_i^\star}(f)$.

For our purposes, the important implications of Lemma 57 are summarized by the following lemma.

Lemma 58 For any $\delta \in (0, e^{-3})$ and any $n \in \mathbb{N}$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on an event $J_n(\delta)$ with $P(J_n(\delta)) \ge 1 - \delta/2$, $\forall i \in \{0, 1, \ldots, \hat{i}_{d+1}\}$, if $V_{2^i}^\star \subseteq \hat{V}_i$ then $\forall h \in \hat{V}_i$,

$$\mathrm{er}_{\mathcal{L}_{i+1}^\star}(h) - \min_{h' \in \hat{V}_i} \mathrm{er}_{\mathcal{L}_{i+1}^\star}(h') \le \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_{i+1}(\hat{V}_i, \delta)$$
$$\mathrm{er}(h) - \mathrm{er}(f) \le \mathrm{er}_{\mathcal{L}_{i+1}^\star}(h) - \mathrm{er}_{\mathcal{L}_{i+1}^\star}(f) + \hat{U}_{i+1}(\hat{V}_i, \delta)$$
$$\min\big\{\hat{U}_{i+1}(\hat{V}_i, \delta),\ 1\big\} \le \tilde{U}_{i+1}(\hat{V}_i, \delta).$$

Proof For each $i$, consider applying Lemma 57 under the conditional distribution given $\hat{V}_i$. The set $\mathcal{L}_{i+1}^\star$ is independent from $\hat{V}_i$, as are the Rademacher variables in the definition of $\hat{R}_{i+1}(\hat{V}_i)$. Furthermore, by Lemma 35, on $H'$, $f \in \mathrm{cl}(V_{2^i}^\star)$, so that the conditions of Lemma 57 hold. The law of total probability then implies the existence of an event $J_i$ of probability $P(J_i) \ge 1 - \delta/4(i + 1)^2$, on which the claimed inequalities hold for that value of $i$ if $i \le \hat{i}_{d+1}$. A union bound over values of $i$ then implies the existence of an event $J_n(\delta) = \bigcap_i J_i$ with probability $P(J_n(\delta)) \ge 1 - \sum_i \delta/4(i + 1)^2 \ge 1 - \delta/2$ on which the claimed inequalities hold for all $i \le \hat{i}_{d+1}$.

Lemma 59 For some $(\mathbb{C}, P_{XY}, \gamma)$-dependent constants $c, c^* \in [1, \infty)$, for any $\delta \in (0, e^{-3})$ and integer $n \ge c^* \ln(1/\delta)$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on event $J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)}$, every i ∈ {0, 1, . . .
, id˜f } satisfies V2⋆i di + ln(1/δ ) ˆ ⊆ Vi ⊆ C c 2i κ 2κ −1 , ˜ ˆ and furthermore V ⋆(d˜f ) ⊆ V (d f ) . m ˆ ˜ √ Proof Define c = 24Kc′ µ 2κ 2κ −1 , c∗ = max τ ∗ , 8d µ c1/κ r(1−γ )/6 1 2κ −1 log2 4µ c1/κ r(1−γ )/6 , and suppose n ≥ c∗ ln(1/δ ). We now proceed by induction. As the right side equals C for i = 0, the claimed ˆ inclusions are certainly true for V0 = C, which serves as our base case. Now suppose some i ∈ ˆd˜ } satisfies {0, 1, . . . , i f V2⋆i di + ln(1/δ ) ˆ ⊆ Vi ⊆ C c 2i κ 2κ −1 . (86) In particular, Condition 1 implies di + ln(1/δ ) ˆ diam(Vi ) ≤ diam C c 2i κ 2κ −1 ≤ µc 1 κ di + ln(1/δ ) 2i 1 2κ −1 . (87) ˆ ˆ ˆ If i < id˜f , then let k be the integer for which ik−1 ≤ i < ik , and otherwise let k = d˜f . Note that we ˆ certainly have i1 ≥ ⌊log2 (n/2)⌋, since m = ⌊n/2⌋ ≥ 2⌊log2 (n/2)⌋ is obtained while k = 1. Therefore, if k > 1, di + ln(1/δ ) 4d log2 (n) + 4 ln(1/δ ) ≤ , 2i n 1575 H ANNEKE so that (87) implies 1 2κ −1 4d log2 (n) + 4 ln(1/δ ) n 1 ˆ diam Vi ≤ µ c κ . By our choice of c∗ , the right side is at most r(1−γ )/6 . Therefore, since Lemma 35 implies f ∈ cl V2⋆i (i) ˆ ˆ on Hn , we have Vi ⊆ B f , r(1−γ )/6 when k > 1. Combined with (86), we have that V2⋆i ⊆ Vi , and ˆ either k = 1, or Vi ⊆ B( f , r(1−γ )/6 ) and 4m > 4⌊n/2⌋ ≥ n. Now consider any m with 2i + 1 ≤ m ≤ ˜ ⋆ min 2i+1 , m(d f ) , and for the purpose of induction suppose Vm−1 ⊆ Vi+1 upon reaching Step 5 for ˆ ˆ that value of m in Algorithm 5. Since Vi+1 ⊆ Vi and n ≥ τ ∗ , Lemma 41 (with ℓ = m − 1) implies that (i) (ii) on Hn ∩ Hn , ˆ (k) ˆ (k) ˆ (k) ∆4m (Xm ,W2 ,Vi+1 ) < γ =⇒ Γ4m (Xm , − f (Xm ),W2 ,Vi+1 ) < Γ4m (Xm , f (Xm ),W2 ,Vi+1 ) , ⋆ ⋆ so that after Step 8 we have Vm ⊆ Vi+1 . Since (86) implies that the Vm−1 ⊆ Vi+1 condition holds if i + 1 (at which time V ˆ Algorithm 5 reaches Step 5 with m = 2 i+1 = Vi ), we have by induction that (i) (ii) i+1 , m(d˜f ) . 
This establishes the ⋆ ⊆V ˆ on Hn ∩ Hn , Vm i+1 upon reaching Step 9 with m = min 2 final claim of the lemma, given that the first claim holds. For the remainder of this inductive proof, ˆ suppose i < id˜f . Since Step 8 enforces that, upon reaching Step 9 with m = 2i+1 , every h1 , h2 ∈ Vi+1 (i) (ii) have erLi+1 (h1 ) − erLi+1 (h2 ) = erL⋆ (h1 ) − erL⋆ (h2 ), on Jn (δ ) ∩ Hn ∩ Hn we have ˆ ˆ i+1 i+1 ˆ Vi+1 ⊆ ˆ ˆ ˆ h ∈ Vi : erL⋆ (h) − ′min erL⋆ (h′ ) ≤ Ui+1 Vi , δ i+1 i+1 ⋆ h ∈V 2i+1 ˆ ˆ ˆ ⊆ h ∈ Vi : erL⋆ (h) − erL⋆ ( f ) ≤ Ui+1 Vi , δ i+1 i+1 ˆ ˆ ˆ ⊆ Vi ∩ C 2Ui+1 Vi , δ ˆ ˜ ⊆ C 2Ui+1 Vi , δ , (88) where the second line follows from Lemma 35 and the last two inclusions follow from Lemma 58. ˆ Focusing on (88), combining (87) with (85) (and the fact that φi+1 (Vi ) ≤ 2), we can bound the value ˆ ˜ i+1 Vi , δ as follows. of U 2 1 ˆ ln(32(i + 1) /δ ) ≤ √µ c 2κ diam(Vi ) i 2 ≤ √ µc di + ln(1/δ ) 2i 2di + 2 ln(1/δ ) 2i+1 1 2κ √ 1 ≤ 4 µ c 2κ ′√ ˆ φi+1 (Vi ) ≤ c µc ′√ ≤ 4c 1 2κ µc 1 2κ 1 4κ −2 ln(32(i + 1)2 /δ ) 2i 1 4κ −2 d(i + 1) + ln(1/δ ) 2i+1 di + ln(1/δ ) 2i 1 4κ −2 d(i + 1) + ln(1/δ ) 2i+1 1576 1 2 8(i + 1) + 2 ln(1/δ ) 2i+1 κ 2κ −1 , d(i + 2) 2i κ 2κ −1 , 1 2 1 2 ACTIVIZED L EARNING and thus d(i + 1) + ln(1/δ ) 2i+1 ˜ ˆ ˜ √ Ui+1 (Vi , δ ) ≤ min 8Kc′ µ c 2κ 1 d(i + 1) + ln(1/δ ) 2i+1 ˜ √ ≤ 12Kc′ µ c 2κ 1 κ 2κ −1 κ 2κ −1 2 ˜ ln(32(i + 1) /δ ) , 1 +K 2i κ 2κ −1 d(i + 1) + ln(1/δ ) = (c/2) 2i+1 . Combining this with (88) now implies κ 2κ −1 d(i + 1) + ln(1/δ ) ˆ Vi+1 ⊆ C c 2i+1 . ˆ To complete the inductive proof, it remains only to show V2⋆i+1 ⊆ Vi+1 . Toward this end, recall (i) (ii) we have shown above that on Hn ∩ Hn , V2⋆i+1 ⊆ Vi+1 upon reaching Step 9 with m = 2i+1 , and that every h1 , h2 ∈ Vi+1 at this point have erLi+1 (h1 ) − erLi+1 (h2 ) = erL⋆ (h1 ) − erL⋆ (h2 ). Consider any ˆ ˆ i+1 i+1 (i) (ii) h ∈ V2⋆i+1 , and note that any other g ∈ V2⋆i+1 has erL⋆ (g) = erL⋆ (h). 
Thus, on Hn ∩ Hn , i+1 i+1 erLi+1 (h) − ′min erLi+1 (h′ ) = erL⋆ (h) − ′min erL⋆ (h′ ) ˆ ˆ i+1 i+1 h ∈Vi+1 h ∈Vi+1 ≤ erL⋆ (h) − min erL⋆ (h′ ) = inf erL⋆ (g) − min erL⋆ (h′ ). (89) i+1 i+1 i+1 i+1 ⋆ ˆ h′ ∈Vi g∈V 2i+1 (i) ˆ h′ ∈Vi (ii) Lemma 58 and (86) imply that on Jn (δ ) ∩ Hn ∩ Hn , the last expression in (89) is not larger (i) ˆ ˆ than infg∈V ⋆i+1 er(g) − er( f ) + Ui+1 (Vi , δ ), and Lemma 35 implies f ∈ cl V2⋆i+1 on Hn , so that 2 infg∈V ⋆i+1 er(g) = er( f ). We therefore have 2 ˆ ˆ erLi+1 (h) − ′min erLi+1 (h′ ) ≤ Ui+1 (Vi , δ ), ˆ ˆ h ∈Vi+1 ˆ ˆ so that h ∈ Vi+1 as well. Since this holds for any h ∈ V2⋆i+1 , we have V2⋆i+1 ⊆ Vi+1 . The lemma now follows by the principle of induction. Lemma 60 There exist (C, PXY , γ )-dependent constants c∗ , c∗ ∈ [1, ∞) such that, for any ε , δ ∈ 1 2 (0, e−3 ) and integer 1 2 1 ˜ n ≥ c∗ + c∗ θ f ε κ ε κ −2 log2 , 1 2 2 εδ ∗ when running Algorithm 5 with label budget n and confidence parameter δ , on an event Jn (ε , δ ) ∗ (ε , δ )) ≥ 1 − δ , we have V ˆiˆ ⊆ C(ε ). with P(Jn ˜ df Proof Define   ˜ c∗ = max 2d f +5 1  µ c1/κ r(1−γ )/6 2κ −1 d log2 d µ c1/κ   2 120 , ln 8c(i) , 1/3 ln 8c(ii)  ˜ ˜ r(1−γ )/6 δ 1/3 δf f 1577 H ANNEKE and   ˜ c∗ = max c∗ , 2d f +5 · 2  2κ −1 µ c1/κ r(1−γ )/6 ˜ , 2d f +15 · 1 µ c2 d 2 ˜ Fix any ε , δ ∈ (0, e−3 ) and integer n ≥ c∗ + c∗ θ f ε κ ε κ −2 log2 2 1 2 1 For each i ∈ {0, 1, . . .}, let ri = µ c κ ˜ ˜ i= 2− 1 2κ −1 di+ln(1/δ ) 2i 1 κ log2 ˜ γδ f 1 εδ   log2 (4dc) . 2  . . Also define c 2dc + log2 8d log2 ε εδ . ˇ ˇ ˆ and let i = min i ∈ N : sup j≥i r j < r(1−γ )/6 . For any i ∈ i, . . . , id˜f , let ˜ (d˜ ) ˆ Qi+1 = m ∈ 2i + 1, . . . , 2i+1 : ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 . ˜ Also define 1 2 2dc 96 ˜ ˜ θ ε κ · 2µ c2 · 8d log2 Q= · ε κ −2 . ˜f f εδ γδ (i) (ii) ˆ By Lemma 59 and Condition 1, on Jn (δ ) ∩ Hn ∩ Hn , if i ≤ id˜f , di + ln(1/δ ) ˆ Vi ⊆ C c 2i κ 2κ −1 ⊆ B ( f , ri ) . 
˜ (90) (i) (ii) ˆ ˆ Lemma 59 also implies that, on Jn (δ ) ∩ Hn ∩ Hn , for i with id˜f −1 ≤ i ≤ id˜f , all of the sets Vi+1 ˆ obtained in Algorithm 5 while k = d˜f and m ∈ 2i + 1, . . . , 2i+1 satisfy V2⋆i+1 ⊆ Vi+1 ⊆ Vi . Recall that ˜f = 1 or else every m ∈ 2i + 1, . . . , 2i+1 has 4m > n. Also ˆ i1 ≥ ⌊log2 (n/2)⌋, so that we have either d (i) ˇ recall that Lemma 49 implies that when the above conditions are satisfied, and i ≥ i, on H ′ ∩ Gn , ˜f ) ˜f ) ˆ (d ˆ (d ∆4m (Xm ,W2 ,Vi+1 ) ≤ (3/2)∆4m (Xm ,W2 , B ( f , ri )), so that |Qi+1 | upper bounds the number of m ∈ ˜ i + 1, . . . , 2i+1 for which Algorithm 5 requests the label Y in Step 6 of the k = d round. Thus, ˜f 2 m (i) (ii) on Jn (δ ) ∩ Hn ∩ Hn , 2i + ˇ ˆ id˜ f ˇˆ i=max i,id˜ f −1 |Qi+1 | upper bounds the total number of label requests by Algorithm 5 while k = d˜f ; therefore, by the constraint in Step 3, we know that either this quantity ˆ i ˜ +1 ˜ is at least as big as 2−d f n , or else we have 2 d f > d˜f · 2n . In particular, on this event, if we can show that ˆ ˜ min id˜ ,i f ˜ ˜ |Qi+1 | < 2−d f n and 2i+1 ≤ d˜f · 2n , ˇ i 2+ ˇˆ i=max i,id˜ (91) f −1 ˜ ˆ then it must be true that i < id˜f . Next, we will focus on establishing this fact. ˇˆ ˆ ˜ and any m ∈ 2i + 1, . . . , 2i+1 . If d˜f = 1, Consider any i ∈ max i, id˜f −1 , . . . , min id˜f , i then ˜ ˜ ˜ ˆ (d ) ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 = P d f S d f (B ( f , ri )) . 1578 ACTIVIZED L EARNING ˜ ˆ (d ) Otherwise, if d˜f > 1, then by Markov’s inequality and the definition of ∆4mf (·, ·, ·) from (15), ˜ ˜ 3 ˆ (d ) ˆ (d ) ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 ≤ E ∆4mf (Xm ,W2 , B ( f , ri )) W2 2γ 1 3 = (d˜f ) 2γ M (B ( f , r )) ˜i 4m (4m)3 (d˜f ) P Ss s=1 (d˜f ) ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss ˜ (i) . 
(ii) By Lemma 39, Lemma 59, and (90), on Jn (δ ) ∩ Hn ∩ Hn , this is at most 3 1 ˜ f γ (4m)3 δ (4m)3 (d˜f ) P Ss s=1 24 1 ≤ ˜ f γ 43 23i+3 δ (d˜f ) ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss ˜ 43 23i+3 (d˜f ) P Ss s=1 (d˜f ) ˜ ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss . Note that this value is invariant to the choice of m ∈ 2i + 1, . . . , 2i+1 . By Hoeffding’s inequality, ∗ ∗ on an event Jn (i) of probability P (Jn (i)) ≥ 1 − δ /(16i2 ), this is at most ln(4i/δ ) ˜ ˜ + P d f S d f (B ( f , ri )) ˜ 43 23i+3 24 ˜ δf γ . (92) ˆ Since i ≥ i1 > log2 (n/4) and n ≥ ln(1/δ ), we have ln(4i/δ ) ≤ 2−i 43 23i+3 ln(4 log2 (n/4)/δ ) ≤ 2−i 128n ln(n/δ ) ≤ 2−i . 128n Thus, (92) is at most 24 ˜ ˜ 2−i + P d f S d f (B ( f , ri )) ˜ ˜f γ δ . 1 (i) (ii) ∗ ˜ In either case (d˜f = 1 or d˜f > 1), by definition of θ f ε κ , on Jn (δ ) ∩ Hn ∩ Hn ∩ Jn (i), ∀m ∈ 2i + 1, . . . , 2i+1 we have ˜ 1 1 24 ˜ ˆ (d ) 2−i + θ f ε κ · max ri , ε κ ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 ≤ ˜ δf γ . (93) ˜ ˆ (d ) Furthermore, the ½[2γ /3,∞) ∆4mf (Xm ,W2 , B ( f , ri )) indicators are conditionally independent given ˜ ˜ W2 , so that we may bound P |Qi+1 | > Q W2 via a Chernoff bound. Toward this end, note that on (i) (ii) ∗ Jn (δ ) ∩ Hn ∩ Hn ∩ Jn (i), (93) implies 2i+1 E |Qi+1 | W2 = ≤ 2i · m=2i +1 ˜ ˆ (d ) ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 1 1 24 ˜ 2−i + θ f ε κ · max ri , ε κ ˜ ˜ δf γ ≤ 1 24 ˜ 1 ˜ 1 + θ f ε κ · max 2i ri , 2i ε κ ˜ ˜ δf γ 1579 . (94) H ANNEKE Note that 1 2i ri = µ c κ (di + ln(1/δ )) 2κ −1 · 2i(1− 2κ −1 ) ˜ 1 ≤ µc ˜ 1 1 κ Then since 2−i 2κ −1 ≤ most ˜ d i + ln(1/δ ) ε c 1 κ 1 2κ −1 1 ·2 · 8d log2 2dc εδ 1 24 ˜ 1 ˜ 1 + θ f ε κ · µ · 2i ε κ ˜f γδ ≤ ˜ i(1− 2κ1 ) −1 − 2κ1 −1 ≤ µc 2dc 8d log2 εδ 1 κ 1 2κ −1 · 2i(1− 2κ −1 ) . ˜ , we have that the rightmost expression in (94) is at 1 2 2dc ˜ · ε κ −2 1 + θ f ε κ · 2µ c2 · 8d log2 εδ 24 ˜ γδ f 1 (i) ˜ ≤ Q/2. 
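Therefore, a Chernoff bound implies that on $J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)} \cap J_n^*(i)$, we have

$$P\big(|Q_{i+1}| > \tilde{Q} \,\big|\, W_2\big) \le \exp\{-\tilde{Q}/6\} \le \exp\Big\{-8\log_2\frac{2dc}{\varepsilon\delta}\Big\} \le \exp\Big\{-\log_2\frac{48\log_2(2dc/\varepsilon\delta)}{\delta}\Big\} \le \delta/(8\tilde{i}).$$

Combined with the law of total probability and a union bound over $i$ values, this implies there exists an event $J_n^*(\varepsilon,\delta) \subseteq J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)}$ with

$$P\big(J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)} \setminus J_n^*(\varepsilon,\delta)\big) \le \sum_{i=\check{i}}^{\tilde{i}} \Big(\delta/(16 i^2) + \delta/(8\tilde{i})\Big) \le \delta/4,$$

on which every $i \in \{\max\{\check{i}, \hat{i}_{\tilde{d}_f-1}\}, \ldots, \min\{\hat{i}_{\tilde{d}_f}, \tilde{i}\}\}$ has $|Q_{i+1}| \le \tilde{Q}$. We have chosen $c_1^*$ and $c_2^*$ large enough that $2^{\tilde{i}+1} < \tilde{d}_f \cdot 2^n$ and $2^{\check{i}} < 2^{-\tilde{d}_f-2} n$. In particular, this means that on $J_n^*(\varepsilon,\delta)$,

$$2^{\check{i}} + \sum_{i=\max\{\check{i},\, \hat{i}_{\tilde{d}_f-1}\}}^{\min\{\tilde{i},\, \hat{i}_{\tilde{d}_f}\}} |Q_{i+1}| < 2^{-\tilde{d}_f-2} n + \tilde{i}\tilde{Q}.$$

Furthermore, since $\tilde{i} \le 3\log_2\frac{4dc}{\varepsilon\delta}$, we have

$$\tilde{i}\tilde{Q} \le \frac{2^{13}\mu c^2 d}{\gamma\tilde{\delta}_f}\,\tilde{\theta}_f\big(\varepsilon^{1/\kappa}\big)\cdot\varepsilon^{2/\kappa-2}\cdot\log_2^2\frac{4dc}{\varepsilon\delta} \le \frac{2^{13}\mu c^2 d\,\log_2^2(4dc)}{\gamma\tilde{\delta}_f}\,\tilde{\theta}_f\big(\varepsilon^{1/\kappa}\big)\cdot\varepsilon^{2/\kappa-2}\cdot\log_2^2\frac{1}{\varepsilon\delta} \le 2^{-\tilde{d}_f-2} n.$$

Combining the above, we have that (91) is satisfied on $J_n^*(\varepsilon,\delta)$, so that $\hat{i}_{\tilde{d}_f} > \tilde{i}$. Combined with Lemma 59, this implies that on $J_n^*(\varepsilon,\delta)$,

$$\hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq \hat{V}_{\tilde{i}} \subseteq C\left(c\left(\frac{d\tilde{i}+\ln(1/\delta)}{2^{\tilde{i}}}\right)^{\frac{\kappa}{2\kappa-1}}\right),$$

and by definition of $\tilde{i}$ we have

$$c\left(\frac{d\tilde{i}+\ln(1/\delta)}{2^{\tilde{i}}}\right)^{\frac{\kappa}{2\kappa-1}} \le c\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{\frac{\kappa}{2\kappa-1}} \cdot 2^{-\tilde{i}\frac{\kappa}{2\kappa-1}} \le c\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{\frac{\kappa}{2\kappa-1}} \cdot (\varepsilon/c) \cdot \left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{-\frac{\kappa}{2\kappa-1}} = \varepsilon,$$

so that $\hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq C(\varepsilon)$.

Finally, to prove the stated bound on $P(J_n^*(\varepsilon,\delta))$, by a union bound we have

$$1 - P(J_n^*(\varepsilon,\delta)) \le (1 - P(J_n(\delta))) + \big(1 - P\big(H_n^{(i)}\big)\big) + P\big(H_n^{(i)} \setminus H_n^{(ii)}\big) + P\big(J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)} \setminus J_n^*(\varepsilon,\delta)\big) \le 3\delta/4 + c^{(i)}\cdot\exp\{-n^3\tilde{\delta}_f/8\} + c^{(ii)}\cdot\exp\{-n\tilde{\delta}_f^{1/3}/120\} \le \delta.$$

We are now ready for the proof of Lemma 26.

Proof [Lemma 26] First, note that because we break ties in the argmax of Step 7 in favor of a $y$ value with $V_{i_k+1}[(X_m, y)] \neq \emptyset$, if $V_{i_k+1} \neq \emptyset$ before Step 8, then this remains true after Step 8.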
Furthermore, the $\hat{U}_{i_k+1}$ estimator is nonnegative, and thus the update in Step 10 never removes from $V_{i_k+1}$ the minimizer of $\operatorname{er}_{\hat{L}_{i_k+1}}(h)$ among $h \in V_{i_k+1}$. Therefore, by induction we have $V_{i_k} \neq \emptyset$ at all times in Algorithm 5. In particular, $\hat{V}_{\hat{i}_{d+1}+1} \neq \emptyset$ so that the return classifier $\hat{h}$ exists. Also, by Lemma 60, for $n$ as in Lemma 60, on $J_n^*(\varepsilon,\delta)$, running Algorithm 5 with label budget $n$ and confidence parameter $\delta$ results in $\hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq C(\varepsilon)$. Combining these two facts implies that for such a value of $n$, on $J_n^*(\varepsilon,\delta)$, $\hat{h} \in \hat{V}_{\hat{i}_{d+1}+1} \subseteq \hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq C(\varepsilon)$, so that $\operatorname{er}(\hat{h}) \le \nu + \varepsilon$.

E.3 The Misspecified Model Case

Here we present a proof of Theorem 28, including a specification of the method $\mathcal{A}'_a$ from the theorem statement.

Proof [Theorem 28] Consider a weakly universally consistent passive learning algorithm $\mathcal{A}_u$ (Devroye, Györfi, and Lugosi, 1996). Such a method must exist in our setting; for instance, Hoeffding's inequality and a union bound imply that it suffices to take

$$\mathcal{A}_u(\mathcal{L}) = \operatorname*{argmin}_{\mathbb{1}^{\pm}_{B_i}} \; \operatorname{er}_{\mathcal{L}}\big(\mathbb{1}^{\pm}_{B_i}\big) + \sqrt{\frac{\ln(4i|\mathcal{L}|)}{2|\mathcal{L}|}},$$

where $\{B_1, B_2, \ldots\}$ is a countable algebra that generates $\mathcal{F}_X$. Then $\mathcal{A}_u$ achieves a label complexity $\Lambda_u$ such that for any distribution $P_{XY}$ on $\mathcal{X} \times \{-1,+1\}$, $\forall \varepsilon \in (0,1)$, $\Lambda_u(\varepsilon + \nu^*(P_{XY}), P_{XY}) < \infty$. In particular, if $\nu^*(P_{XY}) < \nu(C; P_{XY})$, then we have $\Lambda_u((\nu^*(P_{XY}) + \nu(C; P_{XY}))/2, P_{XY}) < \infty$.

Fix any $n \in \mathbb{N}$ and describe the execution of $\mathcal{A}'_a(n)$ as follows. In a preprocessing step, withhold the first $m_{un} = n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor \ge n/6$ examples $\{X_1, \ldots, X_{m_{un}}\}$ and request their labels $\{Y_1, \ldots, Y_{m_{un}}\}$. Run $\mathcal{A}_a(\lfloor n/2 \rfloor)$ on the remainder of the sequence $\{X_{m_{un}+1}, X_{m_{un}+2}, \ldots\}$ (i.e., shift any index references in the algorithm by $m_{un}$), and let $h_a$ denote the classifier it returns. Also request the labels $Y_{m_{un}+1}, \ldots, Y_{m_{un}+\lfloor n/3 \rfloor}$, and let $h_u = \mathcal{A}_u\big(\{(X_{m_{un}+1}, Y_{m_{un}+1}), \ldots, (X_{m_{un}+\lfloor n/3 \rfloor}, Y_{m_{un}+\lfloor n/3 \rfloor})\}\big)$. If $\operatorname{er}_{m_{un}}(h_a) - \operatorname{er}_{m_{un}}(h_u) > n^{-1/3}$, return $\hat{h} = h_u$; otherwise, return $\hat{h} = h_a$.
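The budget split and the final comparison step described above can be sketched as follows. This is only an illustrative sketch: the function names `budget_split`, `empirical_error`, and `select_classifier` are hypothetical, classifiers are modeled as plain Python callables, and the active and passive learners themselves are assumed to be supplied from elsewhere.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[float], int]   # maps a point to a label in {-1, +1}
Example = Tuple[float, int]           # (x, y) pair


def budget_split(n: int) -> Tuple[int, int, int]:
    """Split a label budget n into (holdout, active, passive) parts.

    The holdout size is m_un = n - floor(n/2) - floor(n/3), which is >= n/6.
    """
    n_active = n // 2
    n_passive = n // 3
    m_un = n - n_active - n_passive
    return m_un, n_active, n_passive


def empirical_error(h: Classifier, sample: Sequence[Example]) -> float:
    """Fraction of labeled examples in `sample` that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def select_classifier(h_a: Classifier, h_u: Classifier,
                      holdout: Sequence[Example], n: int) -> Classifier:
    """Return h_u if h_a is worse than h_u on the holdout by more than n^(-1/3)."""
    gap = empirical_error(h_a, holdout) - empirical_error(h_u, holdout)
    return h_u if gap > n ** (-1.0 / 3.0) else h_a
```

The comparison defaults to the active learner's classifier and switches to the passive one only when the holdout set shows a gap larger than the tolerance, which mirrors the return rule in the text.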
This method achieves the stated result, for the following reasons. First, let us examine the final step of this algorithm. By Hoeffding's inequality, with probability at least $1 - 2\cdot\exp\{-n^{1/3}/12\}$,

$$\big|(\operatorname{er}_{m_{un}}(h_a) - \operatorname{er}_{m_{un}}(h_u)) - (\operatorname{er}(h_a) - \operatorname{er}(h_u))\big| \le n^{-1/3}.$$

When this is the case, a triangle inequality implies $\operatorname{er}(\hat{h}) \le \min\{\operatorname{er}(h_a), \operatorname{er}(h_u) + 2n^{-1/3}\}$.

If $P_{XY}$ satisfies the benign noise case, then for any $n \ge 2\Lambda_a(\varepsilon/2 + \nu(C; P_{XY}), P_{XY})$, we have $E[\operatorname{er}(h_a)] \le \nu(C; P_{XY}) + \varepsilon/2$, so $E[\operatorname{er}(\hat{h})] \le \nu(C; P_{XY}) + \varepsilon/2 + 2\cdot\exp\{-n^{1/3}/12\}$, which is at most $\nu(C; P_{XY}) + \varepsilon$ if $n \ge 12^3\ln^3(4/\varepsilon)$. So in this case, we can take $\lambda(\varepsilon) = 12^3\ln^3(4/\varepsilon)$.

On the other hand, if $P_{XY}$ is not in the benign noise case (i.e., the misspecified model case), then for any $n \ge 3\Lambda_u((\nu^*(P_{XY}) + \nu(C; P_{XY}))/2, P_{XY})$, $E[\operatorname{er}(h_u)] \le (\nu^*(P_{XY}) + \nu(C; P_{XY}))/2$, so that

$$E[\operatorname{er}(\hat{h})] \le E[\operatorname{er}(h_u)] + 2n^{-1/3} + 2\cdot\exp\{-n^{1/3}/12\} \le (\nu^*(P_{XY}) + \nu(C; P_{XY}))/2 + 2n^{-1/3} + 2\cdot\exp\{-n^{1/3}/12\}.$$

Again, this is at most $\nu(C; P_{XY}) + \varepsilon$ if $n \ge \max\big\{12^3\ln^3\frac{2}{\varepsilon},\; 64(\nu(C; P_{XY}) - \nu^*(P_{XY}))^{-3}\big\}$. So in this case, we can take

$$\lambda(\varepsilon) = \max\left\{12^3\ln^3\frac{2}{\varepsilon},\; 3\Lambda_u\Big(\frac{\nu^*(P_{XY}) + \nu(C; P_{XY})}{2}, P_{XY}\Big),\; \frac{64}{(\nu(C; P_{XY}) - \nu^*(P_{XY}))^3}\right\}.$$

In either case, we have $\lambda(\varepsilon) \in \operatorname{Polylog}(1/\varepsilon)$.

References

N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning, 1998.
M. Alekhnovich, M. Braverman, V. Feldman, A. Klivans, and T. Pitassi. Learnability and automatizability. In Proceedings of the 45th Foundations of Computer Science, 2004.
K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, 4:1041–1067, 1984.
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.
R. B. Ash and C. A.
Doléans-Dade. Probability & Measure Theory. Academic Press, 2000.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006a.
M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning Journal, 65(1):79–94, 2006b.
M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.
J. Baldridge and A. Palmer. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.
Z. Bar-Yossef. Sampling lower bounds via information theory. In Proceedings of the 35th Annual ACM Symposium on the Theory of Computing, 2003.
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.
A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, 2010.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, 1989.
F. Bunea, A. B. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2009.
C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, 2000.
R. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Conference on Learning Theory, 2005.
S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.
S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
O. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple teachers. In Proceedings of the 23rd Conference on Learning Theory, 2010.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, 2009.
R. Gangadharaiah, R. D. Brown, and J. Carbonell. Active learning in example-based machine translation. In Proceedings of the 17th Nordic Conference on Computational Linguistics, 2009.
E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.
S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20–31, 1995.
S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Conference on Learning Theory, 2007a.
S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007b.
S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of the 22nd Conference on Learning Theory, 2009a.
S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009b.
S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
S. Har-Peled, D. Roth, and D. Zimak. Maximum margin coresets for active and noise tolerant learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115:248–292, 1994.
T. Hegedüs. Generalized teaching dimension and the query complexity of learning. In Proceedings of the 8th Conference on Computational Learning Theory, 1995.
L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? Journal of the Association for Computing Machinery, 43(5):840–862, 1996.
D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed concept classes. Machine Learning, 5:165–196, 1990.
S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
M. Kääriäinen. Active learning in the non-realizable case. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, 2006.
N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4:373–395, 1984.
M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.
M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.
L. G. Khachiyan. A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20:191–194, 1979.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.
V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. In École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics, 2033, Springer, 2011.
S. Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54:125–152, 2004.
T. Luo, K. Kramer, D. B. Goldgof, L. O. Hall, S. Samson, A. Remsen, and T. Hopkins. Active learning to recognize multiple types of plankton. Journal of Machine Learning Research, 6:589–613, 2005.
S. Mahalanabis. A note on active learning for smooth problems. arXiv:1103.3095, 2011.
E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.
P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, 1998.
P. Mitra, C. A. Murthy, and S. K. Pal. A probabilistic active support vector learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413–418, 2004.
J. R. Munkres. Topology. Prentice Hall, Inc., 2nd edition, 2000.
I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.
R. D. Nowak. Generalized binary search. In Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.
L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the Association for Computing Machinery, 35(4):965–984, 1988.
J. Poland and M. Hutter. MDL convergence speed for Bernoulli sequences. Statistics and Computing, 16:161–175, 2006.
G. V. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l1-penalized parametric M-estimators with applications to linear SVM and logistic regression. arXiv:0908.1940v1, 2009.
D. Roth and K. Small. Margin-based active learning for structured output spaces. In European Conference on Machine Learning, 2006.
N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, 2001.
A. I. Schein and L. H. Ungar. Active learning for logistic regression: An evaluation. Machine Learning, 68(3):235–265, 2007.
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, 2000.
B. Settles. Active learning literature survey. http://active-learning.net, 2010.
S. M. Srivastava. A Course on Borel Sets. Springer-Verlag, 1998.
S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001.
A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.
L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12:2269–2292, 2011.
L. Wang and X. Shen. On L1-norm multiclass support vector machines. Journal of the American Statistical Association, 102(478):583–594, 2007.
L. Yang, S. Hanneke, and J. Carbonell. The sample complexity of self-verifying Bayesian active learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

3 0.050528824 100 jmlr-2012-Robust Kernel Density Estimation

Author: JooSeuk Kim, Clayton D. Scott

Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation
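The reweighting idea in this abstract can be illustrated with a small sketch. It assumes a one-dimensional Gaussian kernel and a Huber-type loss with threshold `a`; the names `rkde_weights` and `rkde`, the parameter values, and the stopping rule are illustrative choices, not details taken from the paper. Each iteration measures how far every sample's feature map lies from the current weighted mean in the RKHS and downweights the samples that are far away.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: float, y: float, h: float) -> float:
    """Gaussian kernel with bandwidth h for scalar inputs."""
    return math.exp(-0.5 * ((x - y) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def rkde_weights(xs: Sequence[float], h: float, a: float,
                 iters: int = 100, tol: float = 1e-10) -> List[float]:
    """Iteratively re-weighted least squares in the RKHS with a Huber loss."""
    n = len(xs)
    K = [[gaussian_kernel(xi, xj, h) for xj in xs] for xi in xs]
    w = [1.0 / n] * n  # uniform weights give the ordinary KDE
    for _ in range(iters):
        Kw = [sum(K[i][j] * w[j] for j in range(n)) for i in range(n)]
        wKw = sum(w[i] * Kw[i] for i in range(n))
        # RKHS distance between the feature map of x_i and the current estimate
        r = [math.sqrt(max(K[i][i] - 2.0 * Kw[i] + wKw, 0.0)) for i in range(n)]
        # Huber weight psi(r)/r: 1 inside the threshold, a/r outside
        phi = [1.0 if ri <= a else a / ri for ri in r]
        total = sum(phi)
        new_w = [p / total for p in phi]
        converged = max(abs(u - v) for u, v in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w


def rkde(x: float, xs: Sequence[float], w: Sequence[float], h: float) -> float:
    """Evaluate the weighted kernel density estimate at x."""
    return sum(wi * gaussian_kernel(x, xi, h) for wi, xi in zip(w, xs))
```

With a sample such as `[-0.2, -0.1, 0.0, 0.1, 0.2, 8.0]`, the isolated point at 8.0 ends up with a smaller weight than the clustered points, so the estimated density near it is lower than under the uniform-weight KDE.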

4 0.04687991 12 jmlr-2012-Active Clustering of Biological Sequences

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences
∗ A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641.
† Most of this work was completed at Boston University.
© 2012 Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng and Yu Xia.
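To make the query model concrete, here is a minimal sketch of clustering with k one-versus-all queries. It is not the algorithm from the paper, whose landmark selection rule is not given in the abstract: this sketch substitutes farthest-first traversal for landmark selection and assigns each point to its nearest landmark. The names `landmark_clustering` and `one_vs_all` are hypothetical.

```python
from typing import Callable, List, Tuple


def landmark_clustering(n: int,
                        one_vs_all: Callable[[int], List[float]],
                        k: int) -> Tuple[List[int], List[int]]:
    """Cluster points 0..n-1 using exactly k one-versus-all distance queries.

    one_vs_all(s) must return the list of distances from point s to every point.
    Landmarks are picked by farthest-first traversal (a stand-in choice).
    """
    landmarks = [0]                       # arbitrary starting landmark
    rows = [one_vs_all(0)]                # one query per landmark
    min_dist = list(rows[0])              # distance to the nearest landmark so far
    while len(landmarks) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        landmarks.append(nxt)
        row = one_vs_all(nxt)
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]
    # Assign every point to the index of its nearest landmark.
    labels = [min(range(len(landmarks)), key=lambda j: rows[j][i]) for i in range(n)]
    return landmarks, labels
```

Only the k rows of distances returned by the queries are ever inspected, so the full pairwise distance matrix is never needed.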

5 0.043746199 68 jmlr-2012-Minimax Manifold Estimation

Author: Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, Larry Wasserman

Abstract: We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in R^D given a noisy sample from the manifold. Under certain conditions, we show that the optimal rate of convergence is n^(−2/(2+d)). Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded. Keywords: manifold learning, minimax estimation

6 0.042951282 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

7 0.035113044 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

8 0.034615047 59 jmlr-2012-Linear Regression With Random Projections

9 0.033550944 80 jmlr-2012-On Ranking and Generalization Bounds

10 0.032533091 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

11 0.032261003 4 jmlr-2012-A Kernel Two-Sample Test

12 0.031162158 91 jmlr-2012-Plug-in Approach to Active Learning

13 0.030058281 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

14 0.02950329 20 jmlr-2012-Analysis of a Random Forests Model

15 0.02197928 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

16 0.021890199 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R

17 0.021212818 81 jmlr-2012-On the Convergence Rate of lp-Norm Multiple Kernel Learning

18 0.021191135 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

19 0.020836439 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

20 0.020185743 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.109), (1, 0.076), (2, -0.043), (3, -0.002), (4, 0.029), (5, -0.028), (6, -0.017), (7, 0.021), (8, 0.014), (9, 0.015), (10, 0.046), (11, -0.055), (12, -0.086), (13, -0.072), (14, 0.016), (15, 0.118), (16, -0.08), (17, -0.046), (18, -0.04), (19, -0.06), (20, 0.051), (21, -0.068), (22, 0.007), (23, -0.041), (24, -0.127), (25, -0.024), (26, -0.002), (27, 0.191), (28, -0.1), (29, 0.07), (30, 0.165), (31, 0.051), (32, -0.069), (33, -0.206), (34, 0.064), (35, -0.029), (36, -0.141), (37, 0.134), (38, 0.047), (39, -0.014), (40, 0.171), (41, 0.195), (42, 0.281), (43, 0.048), (44, 0.192), (45, -0.171), (46, -0.095), (47, -0.26), (48, 0.051), (49, -0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93962598 109 jmlr-2012-Stability of Density-Based Clustering

Author: Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman

Abstract: High density clusters can be characterized by the connected components of a level set L(λ) = {x : p(x) > λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree T = ∪_λ L(λ). In this paper, we study the behavior of a density level set estimate L̂(λ) and cluster tree estimate T̂ based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L̂(λ) and T̂ as a function of h, and investigate the theoretical properties of these instability measures. Keywords: clustering, density estimation, level sets, stability, model selection
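As a rough illustration of the level set construction, the sketch below estimates a one-dimensional density with a Gaussian kernel, thresholds it at level λ on a grid, and reports the connected components of the estimated level set as intervals. The function names, the grid resolution, and the sample values are illustrative assumptions, and the grid only approximates the true level set.

```python
import math
from typing import List, Sequence, Tuple


def kde(x: float, data: Sequence[float], h: float) -> float:
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm


def level_set_clusters(data: Sequence[float], h: float, lam: float,
                       grid_size: int = 400) -> List[Tuple[float, float]]:
    """Connected components of {x : kde(x) > lam}, approximated on a grid."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    step = (hi - lo) / (grid_size - 1)
    clusters: List[Tuple[float, float]] = []
    start = None
    for g in range(grid_size):
        x = lo + g * step
        inside = kde(x, data, h) > lam
        if inside and start is None:
            start = x                          # a component begins here
        elif not inside and start is not None:
            clusters.append((start, x - step))  # the component ended one step back
            start = None
    if start is not None:
        clusters.append((start, hi))
    return clusters
```

For two well-separated groups of points, a small bandwidth yields two components at a moderate level λ while a large bandwidth merges them into one, which is the kind of dependence on h that the instability measures in the abstract are meant to quantify.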

2 0.4151721 12 jmlr-2012-Active Clustering of Biological Sequences

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences
∗ A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641.
† Most of this work was completed at Boston University.
© 2012 Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng and Yu Xia.

3 0.36498526 100 jmlr-2012-Robust Kernel Density Estimation

Author: JooSeuk Kim, Clayton D. Scott

Abstract: We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE). An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized IRWLS to converge to the global minimizer of the M-estimator objective function. The robustness of the RKDE is demonstrated with a representer theorem, the influence function, and experimental results for density estimation and anomaly detection. Keywords: outlier, reproducing kernel Hilbert space, kernel trick, influence function, M-estimation

4 0.26530001 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

Author: Steve Hanneke

Abstract: We study the theoretical advantages of active learning over passive learning. Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions. We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient. We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over the known results for passive learning. Keywords: active learning, selective sampling, sequential design, statistical learning theory, PAC learning, sample complexity

1. Introduction and Background

The recent rapid growth in data sources has spawned an equally rapid expansion in the number of potential applications of machine learning methodologies to extract useful concepts from these data. However, in many cases, the bottleneck in the application process is the need to obtain accurate annotation of the raw data according to the target concept to be learned. For instance, in webpage classification, it is straightforward to rapidly collect a large number of webpages, but training an accurate classifier typically requires a human expert to examine and label a number of these webpages, which may require significant time and effort. For this reason, it is natural to look for ways to reduce the total number of labeled examples required to train an accurate classifier. In the traditional machine learning protocol, here referred to as passive learning, the examples labeled by the expert are sampled independently at random, and the emphasis is on designing learning algorithms that make the most effective use of the number of these labeled examples available.
However, it is possible to go beyond such methods by altering the protocol itself, allowing the learning algorithm to sequentially select the examples to be labeled, based on its observations of the labels of previously-selected examples; this interactive protocol is referred to as active learning. The objective in designing this selection mechanism is to focus the expert’s efforts toward labeling only the most informative data for the learning process, thus eliminating some degree of redundancy in the information content of the labeled examples.

∗ Some of these (and related) results previously appeared in the author’s doctoral dissertation (Hanneke, 2009b). © 2012 Steve Hanneke.

It is now well-established that active learning can sometimes provide significant practical and theoretical advantages over passive learning, in terms of the number of labels required to obtain a given accuracy. However, our current understanding of active learning in general is still quite limited in several respects. First, since we are lacking a complete understanding of the potential capabilities of active learning, we are not yet sure to what standards we should aspire for active learning algorithms to meet, and in particular this challenges our ability to characterize how a “good” active learning algorithm should behave. Second, since we have yet to identify a complete set of general principles for the design of effective active learning algorithms, in many cases the most effective known active learning algorithms have problem-specific designs (e.g., designed specifically for linear separators, or decision trees, etc., under specific assumptions on the data distribution), and it is not clear what components of their design can be abstracted and transferred to the design of active learning algorithms for different learning problems (e.g., with different types of classifiers, or different data distributions).
Finally, we have yet to fully understand the scope of the relative benefits of active learning over passive learning, and in particular the conditions under which such improvements are achievable, as well as a general characterization of the potential magnitudes of these improvements. In the present work, we take steps toward closing this gap in our understanding of the capabilities, general principles, and advantages of active learning. Additionally, this work has a second theme, motivated by practical concerns. To date, the machine learning community has invested decades of research into constructing solid, reliable, and well-behaved passive learning algorithms, and into understanding their theoretical properties. We might hope that an equivalent amount of effort is not required in order to discover and understand effective active learning algorithms. In particular, rather than starting from scratch in the design and analysis of active learning algorithms, it seems desirable to leverage this vast knowledge of passive learning, to whatever extent possible. For instance, it may be possible to design active learning algorithms that inherit certain desirable behaviors or properties of a given passive learning algorithm. In this way, we can use a given passive learning algorithm as a reference point, and the objective is to design an active learning algorithm with performance guarantees strictly superior to those of the passive algorithm. Thus, if the passive learning algorithm has proven effective in a variety of common learning problems, then the active learning algorithm should be even better for those same learning problems. This approach also has the advantage of immediately supplying us with a collection of theoretical guarantees on the performance of the active learning algorithm: namely, improved forms of all known guarantees on the performance of the given passive learning algorithm. 
Due to its obvious practical advantages, this general line of informal thinking dominates the existing literature on empirically-tested heuristic approaches to active learning, as most of the published heuristic active learning algorithms make use of a passive learning algorithm as a subroutine (e.g., SVM, logistic regression, k-NN, etc.), constructing sets of labeled examples and feeding them into the passive learning algorithm at various times during the execution of the active learning algorithm (see the references in Section 7). Below, we take a more rigorous look at this general strategy. We develop a reduction-style framework for studying this approach to the design of active learning algorithms relative to a given passive learning algorithm. We then proceed to develop and analyze a variety of such methods, to realize this approach in a very general sense. Specifically, we explore the following fundamental questions.

• Is there a general procedure that, given any passive learning algorithm, transforms it into an active learning algorithm requiring significantly fewer labels to achieve a given accuracy?
• If so, how large is the reduction in the number of labels required by the resulting active learning algorithm, compared to the number of labels required by the original passive algorithm?
• What are sufficient conditions for an exponential reduction in the number of labels required?
• To what extent can these methods be made robust to imperfect or noisy labels?

In the process of exploring these questions, we find that for many interesting learning problems, the techniques in the existing literature are not capable of realizing the full potential of active learning. Thus, exploring this topic in generality requires us to develop novel insights and entirely new techniques for the design of active learning algorithms. We also develop corresponding natural complexity quantities to characterize the performance of such algorithms.
Several of the results we establish here are more general than any related results in the existing literature, and in many cases the algorithms we develop use significantly fewer labels than any previously published methods.

1.1 Background

The term active learning refers to a family of supervised learning protocols, characterized by the ability of the learning algorithm to pose queries to a teacher, who has access to the target concept to be learned. In practice, the teacher and queries may take a variety of forms: a human expert, in which case the queries may be questions or annotation tasks; nature, in which case the queries may be scientific experiments; a computer simulation, in which case the queries may be particular parameter values or initial conditions for the simulator; or a host of other possibilities. In our present context, we will specifically discuss a protocol known as pool-based active learning, a type of sequential design based on a collection of unlabeled examples; this seems to be the most common form of active learning in practical use today (e.g., Settles, 2010; Baldridge and Palmer, 2009; Gangadharaiah, Brown, and Carbonell, 2009; Hoi, Jin, Zhu, and Lyu, 2006; Luo, Kramer, Goldgof, Hall, Samson, Remsen, and Hopkins, 2005; Roy and McCallum, 2001; Tong and Koller, 2001; McCallum and Nigam, 1998). We will not discuss alternative models of active learning, such as online (Dekel, Gentile, and Sridharan, 2010) or exact (Hegedüs, 1995). In the pool-based active learning setting, the learning algorithm is supplied with a large collection of unlabeled examples (the pool), and is allowed to select any example from the pool to request that it be labeled. After observing the label of this example, the algorithm can then select another unlabeled example from the pool to request that it be labeled.
This continues sequentially for a number of rounds until some halting condition is satisfied, at which time the algorithm returns a function intended to approximately mimic and generalize the observed labeling behavior. This setting contrasts with passive learning, where the learning algorithm is supplied with a collection of labeled examples without any interaction. Supposing the labels received agree with some true target concept, the objective is to use this returned function to approximate the true target concept on future (previously unobserved) data points. The hope is that, by carefully selecting which examples should be labeled, the algorithm can achieve improved accuracy while using fewer labels compared to passive learning. The motivation for this setting is simple. For many modern machine learning problems, unlabeled examples are inexpensive and available in abundance, while annotation is time-consuming or expensive. For instance, this is the case in the aforementioned webpage classification problem, where the pool would be the set of all webpages, and labeling a webpage requires a human expert to examine the website content. Settles (2010) surveys a variety of other applications for which active learning is presently being used. To simplify the discussion, in this work we focus specifically on binary classification, in which there are only two possible labels. The results generalize naturally to multiclass classification as well. As the above description indicates, when studying the advantages of active learning, we are primarily interested in the number of label requests sufficient to achieve a given accuracy, a quantity referred to as the label complexity (Definition 1 below). Although active learning has been an active topic in the machine learning literature for many years now, our theoretical understanding of this topic was largely lacking until very recently.
However, within the past few years, there has been an explosion of progress. These advances can be grouped into two categories: namely, the realizable case and the agnostic case.

1.1.1 The Realizable Case

In the realizable case, we are interested in a particularly strict scenario, where the true label of any example is determined by a function of the features (covariates), and where that function has a specific known form (e.g., linear separator, decision tree, union of intervals, etc.); the set of classifiers having this known form is referred to as the concept space. The natural formalization of the realizable case is very much analogous to the well-known PAC model for passive learning (Valiant, 1984). In the realizable case, there are obvious examples of learning problems where active learning can provide a significant advantage compared to passive learning; for instance, in the problem of learning threshold classifiers on the real line (Example 1 below), a kind of binary search strategy for selecting which examples to request labels for naturally leads to exponential improvements in label complexity compared to learning from random labeled examples (passive learning). As such, there is a natural attraction to determine how general this phenomenon is. This leads us to think about general-purpose learning strategies (i.e., which can be instantiated for more than merely threshold classifiers on the real line), which exhibit this binary search behavior in various special cases. The first such general-purpose strategy to emerge in the literature was a particularly elegant strategy proposed by Cohn, Atlas, and Ladner (1994), typically referred to as CAL after its discoverers (Meta-Algorithm 2 below). The strategy behind CAL is the following.
The algorithm examines each example in the unlabeled pool in sequence, and if there are two classifiers in the concept space consistent with all previously-observed labels, but which disagree on the label of this next example, then the algorithm requests that label, and otherwise it does not. For this reason, below we refer to the general family of algorithms inspired by CAL as disagreement-based methods. Disagreement-based methods are sometimes referred to as “mellow” active learning, since in some sense this is the least we can expect from a reasonable active learning algorithm; it never requests the label of an example whose label it can infer from information already available, but otherwise makes no attempt to seek out particularly informative examples to request the labels of. That is, the notion of informativeness implicit in disagreement-based methods is a binary one, so that an example is either informative or not informative, but there is no further ranking of the informativeness of examples. The disagreement-based strategy is quite general, and obviously leads to algorithms that are at least reasonable, but Cohn, Atlas, and Ladner (1994) did not study the label complexity achieved by their strategy in any generality.

In a Bayesian variant of the realizable setting, Freund, Seung, Shamir, and Tishby (1997) studied an algorithm known as query by committee (QBC), which in some sense represents a Bayesian variant of CAL. However, QBC does distinguish between different levels of informativeness beyond simple disagreement, based on the amount of disagreement on a random unlabeled example. They were able to analyze the label complexity achieved by QBC in terms of a type of information gain, and found that when the information gain is lower bounded by a positive constant, the algorithm achieves a label complexity exponentially smaller than the known results for passive learning.
In particular, this is the case for the threshold learning problem, and also for the problem of learning higher-dimensional (nearly balanced) linear separators when the data satisfy a certain (uniform) distribution. Below, we will not discuss this analysis further, since it is for a slightly different (Bayesian) setting. However, the results below in our present setting do have interesting implications for the Bayesian setting as well, as discussed in the recent work of Yang, Hanneke, and Carbonell (2011). The first general analysis of the label complexity of active learning in the (non-Bayesian) realizable case came in the breakthrough work of Dasgupta (2005). In that work, Dasgupta proposed a quantity, called the splitting index, to characterize the label complexities achievable by active learning. The splitting index analysis is noteworthy for several reasons. First, one can show it provides nearly tight bounds on the minimax label complexity for a given concept space and data distribution. In particular, the analysis matches the exponential improvements known to be possible for threshold classifiers, as well as generalizations to higher-dimensional homogeneous linear separators under near-uniform distributions (as first established by Dasgupta, Kalai, and Monteleoni, 2005, 2009). Second, it provides a novel notion of informativeness of an example, beyond the simple binary notion of informativeness employed in disagreement-based methods. Specifically, it describes the informativeness of an example in terms of the number of pairs of well-separated classifiers for which at least one out of each pair will be contradicted, supposing the least-favorable label. Finally, unlike any other existing work on active learning (present work included), it provides an elegant description of the trade-off between the number of label requests and the number of unlabeled examples needed by the learning algorithm. 
Another interesting byproduct of Dasgupta’s work is a better understanding of the nature of the improvements achievable by active learning in the general case. In particular, his work clearly illustrates the need to study the label complexity as a quantity that varies depending on the particular target concept and data distribution. We will see this issue arise in many of the examples below. Coming from a slightly different perspective, Hanneke (2007a) later analyzed the label complexity of active learning in terms of an extension of the teaching dimension (Goldman and Kearns, 1995). Related quantities were previously used by Hegedüs (1995) and Hellerstein, Pillaipakkamnatt, Raghavan, and Wilkins (1996) to tightly characterize the number of membership queries sufficient for Exact learning; Hanneke (2007a) provided a natural generalization to the PAC learning setting. At this time, it is not clear how this quantity relates to the splitting index. From a practical perspective, in some instances it may be easier to calculate (see the work of Nowak, 2008 for a discussion related to this), though in other cases the opposite seems true. The next progress toward understanding the label complexity of active learning came in the work of Hanneke (2007b), who introduced a quantity called the disagreement coefficient (Definition 9 below), accompanied by a technique for analyzing disagreement-based active learning algorithms. In particular, implicit in that work, and made explicit in the later work of Hanneke (2011), was the first general characterization of the label complexities achieved by the original CAL strategy for active learning in the realizable case, stated in terms of the disagreement coefficient. The results of the present work are direct descendants of that 2007 paper, and we will discuss the disagreement coefficient, and results based on it, in substantial detail below.
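As an illustration of the disagreement-based (CAL) strategy described above, the following minimal Python sketch specializes it to threshold classifiers on the real line. It is not the paper's Meta-Algorithm 2: the function name cal_threshold, the oracle callback, the uniform pool, the seed, and the target value 0.37 are all assumptions made for this example. The version space is tracked as an interval (lo, hi] of consistent thresholds, and a label is requested only when the point falls strictly inside that interval, which is the region where two consistent classifiers disagree.

```python
import random


def cal_threshold(pool, oracle):
    """CAL-style learner for thresholds h_t(x) = +1 if x >= t, else -1.

    `pool` is a sequence of unlabeled points; `oracle(x)` returns the true
    label in {+1, -1}. Both names are hypothetical, chosen for this sketch.
    """
    # Version space: all thresholds t with lo < t <= hi.
    lo, hi = float("-inf"), float("inf")
    queries = 0
    for x in pool:
        # Two consistent thresholds disagree on x exactly when lo < x < hi.
        if lo < x < hi:
            queries += 1
            if oracle(x) == 1:
                hi = x  # a positive label implies t <= x
            else:
                lo = x  # a negative label implies t > x
        # Otherwise the label of x is implied by earlier answers: no request.
    return lo, hi, queries


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    target = 0.37           # hypothetical target threshold
    pool = [rng.random() for _ in range(2000)]
    lo, hi, queries = cal_threshold(pool, lambda x: 1 if x >= target else -1)
    print(f"version space: ({lo:.4f}, {hi:.4f}], "
          f"labels requested: {queries} of {len(pool)}")
```

On a pool of n uniformly distributed points, one would expect the number of label requests in this sketch to grow roughly logarithmically in n, in line with the binary-search behavior discussed for threshold classifiers, whereas a passive learner would use the labels of all n points.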
Disagreement-based active learners such as CAL are known to be sometimes suboptimal relative to the splitting index analysis, and therefore the disagreement coefficient analysis sometimes results in larger label complexity bounds than the splitting index analysis. However, in many cases the label complexity bounds based on the disagreement coefficient are surprisingly good considering the simplicity of the methods. Furthermore, as we will see below, the disagreement coefficient has the practical benefit of often being fairly straightforward to calculate for a variety of learning problems, particularly when there is a natural geometric interpretation of the classifiers and the data distribution is relatively smooth. As we discuss below, it can also be used to bound the label complexity of active learning in noisy settings. For these reasons (simplicity of algorithms, ease of calculation, and applicability beyond the realizable case), subsequent work on the label complexity of active learning has tended to favor the disagreement-based approach, making use of the disagreement coefficient to bound the label complexity (Dasgupta, Hsu, and Monteleoni, 2007; Friedman, 2009; Beygelzimer, Dasgupta, and Langford, 2009; Wang, 2009; Balcan, Hanneke, and Vaughan, 2010; Hanneke, 2011; Koltchinskii, 2010; Beygelzimer, Hsu, Langford, and Zhang, 2010; Mahalanabis, 2011; Wang, 2011). A significant part of the present paper focuses on extending and generalizing the disagreement coefficient analysis, while still maintaining the relative ease of calculation that makes the disagreement coefficient so useful. In addition to many positive results, Dasgupta (2005) also pointed out several negative results, even for very simple and natural learning problems. In particular, for many problems, the minimax label complexity of active learning will be no better than that of passive learning. 
In fact, Balcan, Hanneke, and Vaughan (2010) later showed that, for a certain type of active learning algorithm— namely, self-verifying algorithms, which themselves adaptively determine how many label requests they need to achieve a given accuracy—there are even particular target concepts and data distributions for which no active learning algorithm of that type can outperform passive learning. Since all of the above label complexity analyses (splitting index, teaching dimension, disagreement coefficient) apply to certain respective self-verifying learning algorithms, these negative results are also reflected in all of the existing general label complexity analyses. While at first these negative results may seem discouraging, Balcan, Hanneke, and Vaughan (2010) noted that if we do not require the algorithm to be self-verifying, instead simply measuring the number of label requests the algorithm needs to find a good classifier, rather than the number needed to both find a good classifier and verify that it is indeed good, then these negative results vanish. In fact, (shockingly) they were able to show that for any concept space with finite VC dimension, and any fixed data distribution, for any given passive learning algorithm there is an active learning algorithm with asymptotically superior label complexity for every nontrivial target concept! A positive result of this generality and strength is certainly an exciting advance in our understanding of the advantages of active learning. But perhaps equally exciting are the unresolved questions raised by that work, as there are potential opportunities to strengthen, generalize, simplify, and elaborate on this result. 
First, note that the above statement allows the active learning algorithm to be specialized to the particular distribution according to which the (unlabeled) data are sampled, and indeed the active learning method used by Balcan, Hanneke, and Vaughan (2010) in their proof has a rather strong direct dependence on the data distribution (which cannot be removed by simply replacing some calculations with data-dependent estimators). One interesting question is whether an alternative approach might avoid this direct distribution-dependence in the algorithm, so that the claim can be strengthened to say that the active algorithm is superior to the passive algorithm for all nontrivial target concepts and data distributions. This question is interesting both theoretically, in order to obtain the strongest possible theorem on the advantages of active learning, as well as practically, since direct access to the distribution from which the data are sampled is typically not available in practical learning scenarios. A second question left open by Balcan, Hanneke, and Vaughan (2010) regards the magnitude of the gap between the active and passive label complexities. Specifically, although they did find particularly nasty learning problems where the label complexity of active learning will be close to that of passive learning (though always better), they hypothesized that for most natural learning problems, the improvements over passive learning should typically be exponentially large (as is the case for threshold classifiers); they gave many examples to illustrate this point, but left open the problem of characterizing general sufficient conditions for these exponential improvements to be achievable, even when they are not achievable by self-verifying algorithms.
Another question left unresolved by Balcan, Hanneke, and Vaughan (2010) is whether this type of general improvement guarantee might be realized by a computationally efficient active learning algorithm. Finally, they left open the question of whether such general results might be further generalized to settings that involve noisy labels. The present work picks up where Balcan, Hanneke, and Vaughan (2010) left off in several respects, making progress on each of the above questions, in some cases completely resolving the question.

1.1.2 The Agnostic Case

In addition to the above advances in our understanding of active learning in the realizable case, there has also been wonderful progress in making these methods robust to imperfect teachers, feature space underspecification, and model misspecification. This general topic goes by the name agnostic active learning, from its roots in the agnostic PAC model (Kearns, Schapire, and Sellie, 1994). In contrast to the realizable case, in the agnostic case, there is not necessarily a perfect classifier of a known form, and indeed there may even be label noise so that there is no perfect classifier of any form. Rather, we have a given set of classifiers (e.g., linear separators, or depth-limited decision trees, etc.), and the objective is to identify a classifier whose accuracy is not much worse than the best classifier of that type. Agnostic learning is strictly more general, and often more difficult, than realizable learning; this is true for both passive learning and active learning. However, for a given agnostic learning problem, we might still hope that active learning can achieve a given accuracy using fewer labels than required for passive learning. The general topic of agnostic active learning got its first taste of real progress from Balcan, Beygelzimer, and Langford (2006a, 2009) with the publication of the A² (agnostic active) algorithm.
This method is a noise-robust disagreement-based algorithm, which can be applied with essentially arbitrary types of classifiers under arbitrary noise distributions. It is interesting both for its effectiveness and (as with CAL) its elegance. The original work of Balcan, Beygelzimer, and Langford (2006a, 2009) showed that, in some special cases (thresholds, and homogeneous linear separators under a uniform distribution), the A² algorithm does achieve improved label complexities compared to the known results for passive learning. Using a different type of general active learning strategy, Hanneke (2007a) found that the teaching dimension analysis (discussed above for the realizable case) can be extended beyond the realizable case, arriving at general bounds on the label complexity under arbitrary noise distributions. These bounds improve over the known results for passive learning in many cases. However, the algorithm requires direct access to a certain quantity that depends on the noise distribution (namely, the noise rate, defined in Section 6 below), which would not be available in many real-world learning problems. Later, Hanneke (2007b) established a general characterization of the label complexities achieved by A², expressed in terms of the disagreement coefficient. The result holds for arbitrary types of classifiers (of finite VC dimension) and arbitrary noise distributions, and represents the natural generalization of the aforementioned realizable-case analysis of CAL. In many cases, this result shows improvements over the known results for passive learning. Furthermore, because of the simplicity of the disagreement coefficient, the bound can be calculated for a variety of natural learning problems. Soon after this, Dasgupta, Hsu, and Monteleoni (2007) proposed a new active learning strategy, which is also effective in the agnostic setting. Like A², the new algorithm is a noise-robust disagreement-based method.
The work of Dasgupta, Hsu, and Monteleoni (2007) is significant for at least two reasons. First, they were able to establish a general label complexity bound for this method based on the disagreement coefficient. The bound is similar in form to the previous label complexity bound for A² by Hanneke (2007b), but improves the dependence of the bound on the disagreement coefficient. Second, the proposed method of Dasgupta, Hsu, and Monteleoni (2007) set a new standard for computational and aesthetic simplicity in agnostic active learning algorithms. This work has since been followed by related methods of Beygelzimer, Dasgupta, and Langford (2009) and Beygelzimer, Hsu, Langford, and Zhang (2010). In particular, Beygelzimer, Dasgupta, and Langford (2009) develop a method capable of learning under an essentially arbitrary loss function; they also show label complexity bounds similar to those of Dasgupta, Hsu, and Monteleoni (2007), but applicable to a larger class of loss functions, and stated in terms of a generalization of the disagreement coefficient for arbitrary loss functions. While the above results are encouraging, the guarantees reflected in these label complexity bounds essentially take the form of (at best) constant factor improvements; specifically, in some cases the bounds improve the dependence on the noise rate factor (defined in Section 6 below), compared to the known results for passive learning. In fact, Kääriäinen (2006) showed that any label complexity bound depending on the noise distribution only via the noise rate cannot do better than this type of constant-factor improvement. This raised the question of whether, with a more detailed description of the noise distribution, one can show improvements in the asymptotic form of the label complexity compared to passive learning.
Toward this end, Castro and Nowak (2008) studied a certain refined description of the noise conditions, related to the margin conditions of Mammen and Tsybakov (1999), which are well-studied in the passive learning literature. Specifically, they found that in some special cases, under certain restrictions on the noise distribution, the asymptotic form of the label complexity can be improved compared to passive learning, and in some cases the improvements can even be exponential in magnitude; to achieve this, they developed algorithms specifically tailored to the types of classifiers they studied (threshold classifiers and boundary fragment classes). Balcan, Broder, and Zhang (2007) later extended this result to general homogeneous linear separators under a uniform distribution. Following this, Hanneke (2009a, 2011) generalized these results, showing that both of the published general agnostic active learning algorithms (Balcan, Beygelzimer, and Langford, 2009; Dasgupta, Hsu, and Monteleoni, 2007) can also achieve these types of improvements in the asymptotic form of the label complexity; he further proved general bounds on the label complexities of these methods, again based on the disagreement coefficient, which apply to arbitrary types of classifiers, and which reflect these types of improvements (under conditions on the disagreement coefficient). Wang (2009) later bounded the label complexity of A² under somewhat different noise conditions, in particular identifying weaker noise conditions sufficient for these improvements to be exponential in magnitude (again, under conditions on the disagreement coefficient). Koltchinskii (2010) has recently improved on some of Hanneke’s results, refining certain logarithmic factors and simplifying the proofs, using a slightly different algorithm based on similar principles.
Though the present work discusses only classes of finite VC dimension, most of the above references also contain results for various types of nonparametric classes with infinite VC dimension. At present, all of the published bounds on the label complexity of agnostic active learning also apply to self-verifying algorithms. As mentioned, in the realizable case, it is typically possible to achieve significantly better label complexities if we do not require the active learning algorithm to be self-verifying, since the verification of learning may be more difficult than the learning itself (Balcan, Hanneke, and Vaughan, 2010). We might wonder whether this is also true in the agnostic case, and whether agnostic active learning algorithms that are not self-verifying might possibly achieve significantly better label complexities than the existing label complexity bounds described above. We investigate this in depth below. 1.2 Summary of Contributions In the present work, we build on and extend the above results in a variety of ways, resolving a number of open problems. The main contributions of this work can be summarized as follows. • We formally define a notion of a universal activizer, a meta-algorithm that transforms any passive learning algorithm into an active learning algorithm with asymptotically strictly superior label complexities for all nontrivial distributions and target concepts in the concept space. • We analyze the existing strategy of disagreement-based active learning from this perspective, precisely characterizing the conditions under which this strategy can lead to a universal activizer for VC classes in the realizable case. • We propose a new type of active learning algorithm, based on shatterable sets, and construct universal activizers for all VC classes in the realizable case based on this idea; in particular, this overcomes the issue of distribution-dependence in the existing results mentioned above. 
• We present a novel generalization of the disagreement coefficient, along with a new asymptotic bound on the label complexities achievable by active learning in the realizable case; this new bound is often significantly smaller than the existing results in the published literature.
• We state new concise sufficient conditions for exponential improvements over passive learning to be achievable in the realizable case, including a significant weakening of known conditions in the published literature.
• We present a new general-purpose active learning algorithm for the agnostic case, based on the aforementioned idea involving shatterable sets.
• We prove a new asymptotic bound on the label complexities achievable by active learning in the presence of label noise (the agnostic case), often significantly smaller than any previously published results.
• We formulate a general conjecture on the theoretical advantages of active learning over passive learning in the presence of arbitrary types of label noise.

1.3 Outline of the Paper

The paper is organized as follows. In Section 2, we introduce the basic notation used throughout, formally define the learning protocol, and formally define the label complexity. We also define the notion of an activizer, which is a procedure that transforms a passive learning algorithm into an active learning algorithm with asymptotically superior label complexity. In Section 3, we review the established technique of disagreement-based active learning, and prove a new result precisely characterizing the scenarios in which disagreement-based active learning can be used to construct an activizer. In particular, we find that in many scenarios, disagreement-based active learning is not powerful enough to provide the desired improvements. In Section 4, we move beyond disagreement-based active learning, developing a new type of active learning algorithm based on shatterable sets of points.
We apply this technique to construct a simple 3-stage procedure, which we then prove is a universal activizer for any concept space of finite VC dimension. In Section 5, we begin by reviewing the known results for bounding the label complexity of disagreement-based active learning in terms of the disagreement coefficient; we then develop a somewhat more involved procedure, again based on shatterable sets, which takes full advantage of the sequential nature of active learning. In addition to being an activizer, we show that this procedure often achieves label complexities dramatically superior to those achievable by passive learning. In particular, we define a novel generalization of the disagreement coefficient, and use it to bound the label complexity of this procedure. This also provides us with concise sufficient conditions for obtaining exponential improvements over passive learning. Continuing in Section 6, we extend our framework to allow for label noise (the agnostic case), and discuss the possibility of extending the results from previous sections to these noisy learning problems. We first review the known results for noise-robust disagreement-based active learning, and characterizations of its label complexity in terms of the disagreement coefficient and Mammen-Tsybakov noise parameters. We then proceed to develop a new type of noise-robust active learning algorithm, again based on shatterable sets, and prove bounds on its label complexity in terms of our aforementioned generalization of the disagreement coefficient. Additionally, we present a general conjecture concerning the existence of activizers for certain passive learning algorithms in the agnostic case. We conclude in Section 7 with a host of enticing open problems for future investigation.

2. Definitions and Notation

For most of the paper, we consider the following formal setting.
There is a measurable space (X, F_X), where X is called the instance space; for simplicity, we suppose this is a standard Borel space (Srivastava, 1998) (e.g., R^m under the usual Borel σ-algebra), though most of the results generalize. A classifier is any measurable function h : X → {−1, +1}. There is a set C of classifiers called the concept space. In the realizable case, the learning problem is characterized as follows. There is a probability measure P on X, and a sequence Z_X = {X_1, X_2, . . .} of independent X-valued random variables, each with distribution P. We refer to these random variables as the sequence of unlabeled examples; although in practice this sequence would typically be large but finite, to simplify the discussion and focus strictly on counting labels, we will suppose this sequence is inexhaustible. There is additionally a special element f ∈ C, called the target function, and we denote Y_i = f(X_i); we further denote by Z = {(X_1, Y_1), (X_2, Y_2), . . .} the sequence of labeled examples, and for m ∈ N we denote by Z_m = {(X_1, Y_1), (X_2, Y_2), . . . , (X_m, Y_m)} the finite subsequence consisting of the first m elements of Z. For any classifier h, we define the error rate er(h) = P(x : h(x) ≠ f(x)). Informally, the learning objective in the realizable case is to identify some h with small er(h) using elements from Z, without direct access to f. An active learning algorithm A is permitted direct access to the Z_X sequence (the unlabeled examples), but to gain access to the Y_i values it must request them one at a time, in a sequential manner. Specifically, given access to the Z_X values, the algorithm selects any index i ∈ N, requests to observe the Y_i value, then having observed the value of Y_i, selects another index i′, observes the value of Y_i′, etc.
The algorithm is given as input an integer n, called the label budget, and is permitted to observe at most n labels total before eventually halting and returning a classifier ĥ_n = A(n); that is, by definition, an active learning algorithm never attempts to access more than the given budget n number of labels. We will then study the values of n sufficient to guarantee E[er(ĥ_n)] ≤ ε, for any given value ε ∈ (0, 1). We refer to this as the label complexity. We will be particularly interested in the asymptotic dependence on ε in the label complexity, as ε → 0. Formally, we have the following definition.

Definition 1 An active learning algorithm A achieves label complexity Λ(·, ·, ·) if, for every target function f, distribution P, ε ∈ (0, 1), and integer n ≥ Λ(ε, f, P), we have E[er(A(n))] ≤ ε.

This definition of label complexity is similar to one originally studied by Balcan, Hanneke, and Vaughan (2010). It has a few features worth noting. First, the label complexity has an explicit dependence on the target function f and distribution P. As noted by Dasgupta (2005), we need this dependence if we are to fully understand the range of label complexities achievable by active learning; we further illustrate this issue in the examples below. The second feature to note is that the label complexity, as defined here, is simply a sufficient budget size to achieve the specified accuracy. That is, here we are asking only how many label requests are required for the algorithm to achieve a given accuracy (in expectation). However, as noted by Balcan, Hanneke, and Vaughan (2010), this number might not be sufficiently large to detect that the algorithm has indeed achieved the required accuracy based only on the observed data.
That is, because the number of labeled examples used in active learning can be quite small, we come across the problem that the number of labels needed to learn a concept might be significantly smaller than the number of labels needed to verify that we have successfully learned the concept. As such, this notion of label complexity is most useful in the design of effective learning algorithms, rather than for predicting the number of labels an algorithm should request in any particular application. Specifically, to design effective active learning algorithms, we should generally desire small label complexity values, so that (in the extreme case) if some algorithm A has smaller label complexity values than some other algorithm A′ for all target functions and distributions, then (all other factors being equal) we should clearly prefer algorithm A over algorithm A′; this is true regardless of whether we have a means to detect (verify) how large the improvements offered by algorithm A over algorithm A′ are for any particular application. Thus, in our present context, performance guarantees in terms of this notion of label complexity play a role analogous to concepts such as universal consistency or admissibility, which are also generally useful in guiding the design of effective algorithms, but are not intended to be informative in the context of any particular application. See the work of Balcan, Hanneke, and Vaughan (2010) for a discussion of this issue, as it relates to a definition of label complexity similar to that above, as well as other notions of label complexity from the active learning literature (some of which include a verification requirement). We will be interested in the performance of active learning algorithms, relative to the performance of a given passive learning algorithm. In this context, a passive learning algorithm A takes as input a finite sequence of labeled examples $L \in \bigcup_n (X \times \{-1,+1\})^n$, and returns a classifier ĥ = A(L).
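The active learning protocol described above (free access to unlabeled points, sequential label requests, and a hard cap of n requests) can be illustrated with a small sketch. The class name `LabelOracle`, the helper `error_rate`, and the use of a finite list in place of the inexhaustible sequence are assumptions made for illustration only; they do not appear in the paper.

```python
from typing import Callable, List, Sequence


class LabelOracle:
    """Hypothetical helper: reveals Y_i = f(X_i) on request, at most `budget` times."""

    def __init__(self, xs: List[float], target: Callable[[float], int], budget: int):
        self._xs = xs          # finite prefix of the unlabeled sequence X_1, X_2, ...
        self._target = target  # target function f, hidden from the learner
        self.budget = budget   # label budget n
        self.used = 0          # number of labels requested so far

    def request(self, i: int) -> int:
        """Return the label of xs[i] (0-based index), charging one unit of budget."""
        if self.used >= self.budget:
            raise RuntimeError("label budget exhausted")
        self.used += 1
        return self._target(self._xs[i])


def error_rate(h: Callable[[float], int],
               f: Callable[[float], int],
               sample: Sequence[float]) -> float:
    """Empirical stand-in for er(h) = P(x : h(x) != f(x)), computed on `sample`."""
    return sum(1 for x in sample if h(x) != f(x)) / len(sample)
```

A learner built against such an oracle can inspect every unlabeled point, but any attempt to exceed the budget fails, which mirrors the requirement that an active learning algorithm never accesses more than n labels.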
We allow both active and passive learning algorithms to be randomized: that is, to have independent internal randomness, in addition to the given random data. We define the label complexity for a passive learning algorithm as follows. Definition 2 A passive learning algorithm A achieves label complexity Λ(·, ·, ·) if, for every target function f , distribution P, ε ∈ (0, 1), and integer n ≥ Λ(ε , f , P), we have E [er (A (Zn ))] ≤ ε . Although technically some algorithms may be able to achieve a desired accuracy without any observations, to make the general results easier to state (namely, those in Section 5), unless otherwise stated we suppose label complexities (both passive and active) take strictly positive values, among N ∪ {∞}; note that label complexities (both passive and active) can be infinite, indicating that the corresponding algorithm might not achieve expected error rate ε for any n ∈ N. Both the passive and active label complexities are defined as a number of labels sufficient to guarantee the expected error rate is at most ε . It is also common in the literature to discuss the number of label requests sufficient to guarantee the error rate is at most ε with high probability 1 − δ (e.g., Balcan, Hanneke, and Vaughan, 2010). In the present work, we formulate our results in terms of the expected error rate because it simplifies the discussion of asymptotics, in that we need only study the behavior of the label complexity as the single argument ε approaches 0, rather than the more complicated behavior of a function of ε and δ as both ε and δ approach 0 at various relative rates. However, we note that analogous results for these high-probability guarantees on the error rate can be extracted from the proofs below without much difficulty, and in several places we explicitly state results of this form. Below we employ the standard notation from asymptotic analysis, including O(·), o(·), Ω(·), ω (·), Θ(·), ≪, and ≫. 
In all contexts below not otherwise specified, the asymptotics are always considered as ε → 0 when considering a function of ε, and as n → ∞ when considering a function of n; also, in any expression of the form “x → 0,” we always mean the limit from above (i.e., x ↓ 0). For instance, when considering nonnegative functions of ε, λ_a(ε) and λ_p(ε), the above notations are defined as follows. We say λ_a(ε) = o(λ_p(ε)) when $\lim_{\varepsilon \to 0} \lambda_a(\varepsilon)/\lambda_p(\varepsilon) = 0$, and this is equivalent to writing λ_p(ε) = ω(λ_a(ε)), λ_a(ε) ≪ λ_p(ε), or λ_p(ε) ≫ λ_a(ε). We say λ_a(ε) = O(λ_p(ε)) when $\limsup_{\varepsilon \to 0} \lambda_a(\varepsilon)/\lambda_p(\varepsilon) < \infty$, which can equivalently be expressed as λ_p(ε) = Ω(λ_a(ε)). Finally, we write λ_a(ε) = Θ(λ_p(ε)) to mean that both λ_a(ε) = O(λ_p(ε)) and λ_a(ε) = Ω(λ_p(ε)) are satisfied. We also use the standard notation for the limit of a sequence of sets, such as $\lim_{r \to 0} A_r$, defined by the property $\mathbb{1}_{\lim_{r \to 0} A_r} = \lim_{r \to 0} \mathbb{1}_{A_r}$ (if the latter exists), where $\mathbb{1}_A$ is the indicator function for the set A. Define the class of functions Polylog(1/ε) as those g : (0, 1) → [0, ∞) such that, for some k ∈ [0, ∞), g(ε) = O(log^k(1/ε)). For a label complexity Λ, also define the set Nontrivial(Λ) as the collection of all pairs (f, P) of a classifier and a distribution such that, ∀ε > 0, Λ(ε, f, P) < ∞, and ∀g ∈ Polylog(1/ε), Λ(ε, f, P) = ω(g(ε)). In this context, an active meta-algorithm is a procedure A_a taking as input a passive algorithm A_p and a label budget n, such that for any passive algorithm A_p, A_a(A_p, ·) is an active learning algorithm. We define an activizer for a given passive algorithm as follows.

Definition 3 We say an active meta-algorithm A_a activizes a passive algorithm A_p for a concept space C if the following holds.
For any label complexity Λ p achieved by A p , the active learning algorithm Aa (A p , ·) achieves a label complexity Λa such that, for every f ∈ C and every distribution P on X with ( f , P) ∈ Nontrivial(Λ p ), there exists a constant c ∈ [1, ∞) such that Λa (cε , f , P) = o (Λ p (ε , f , P)) . In this case, Aa is called an activizer for A p with respect to C, and the active learning algorithm Aa (A p , ·) is called the Aa -activized A p . We also refer to any active meta-algorithm Aa that activizes every passive algorithm A p for C as a universal activizer for C. One of the main contributions of this work is establishing that such universal activizers do exist for any VC class C. A bit of explanation is in order regarding Definition 3. We might interpret it as follows: an activizer for A p strongly improves (in a little-o sense) the label complexity for all nontrivial target functions and distributions. Here, we seek a meta-algorithm that, when given A p as input, results in an active learning algorithm with strictly superior label complexities. However, there is a sense in which some distributions P or target functions f are trivial relative to A p . For instance, perhaps A p has a default classifier that it is naturally biased toward (e.g., with minimal P(x : h(x) = +1), as in the Closure algorithm of Helmbold, Sloan, and Warmuth, 1990), so that when this default classifier is the target function, A p achieves a constant label complexity. In these trivial scenarios, we cannot hope to improve over the behavior of the passive algorithm, but instead can only hope to compete with it. The sense in which we wish to compete may be a subject of some controversy, but the implication of Definition 3 is that the label complexity of the activized algorithm should be strictly better than every nontrivial upper bound on the label complexity of the passive algorithm. 
For instance, if Λ_p(ε, f, P) ∈ Polylog(1/ε), then we are guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε) as well, but if Λ_p(ε, f, P) = O(1), we are still only guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε). This serves the purpose of defining a framework that can be studied without requiring too much obsession over small additive terms in trivial scenarios, thus focusing the analyst's efforts toward nontrivial scenarios where A_p has relatively large label complexity, which are precisely the scenarios for which active learning is truly needed. In our proofs, we find that in fact Polylog(1/ε) can be replaced with O(log(1/ε)), giving a slightly broader definition of “nontrivial,” for which all of the results below still hold. Section 7 discusses open problems regarding this issue of trivial problems. The definition of Nontrivial(·) also only requires the activized algorithm to be effective in scenarios where the passive learning algorithm has reasonable behavior (i.e., finite label complexities); this is only intended to keep with the reduction-based style of the framework, and in fact this restriction can easily be lifted using a trick from Balcan, Hanneke, and Vaughan (2010) (aggregating the activized algorithm with another algorithm that is always reasonable). Finally, we also allow a constant factor c loss in the ε argument to Λ_a. We allow this to be an arbitrary constant, again in the interest of allowing the analyst to focus only on the most significant aspects of the problem; for most reasonable passive learning algorithms, we typically expect Λ_p(ε, f, P) = Poly(1/ε), in which case c can be set to 1 by adjusting the leading constant factors of Λ_a. A careful inspection of our proofs reveals that c can always be set arbitrarily close to 1 without affecting the theorems below (and in fact, we can even get c = (1 + o(1)), a function of ε).
Throughout this work, we will adopt the usual notation for probabilities, such as P(er(ĥ) > ε), and as usual we interpret this as measuring the corresponding event in the (implicit) underlying probability space. In particular, we make the usual implicit assumption that all sets involved in the analysis are measurable; where this assumption does not hold, we may turn to outer probabilities, though we will not make further mention of these technical details. We will also use the notation P^k(·) to represent k-dimensional product measures; for instance, for a measurable set A ⊆ X^k, P^k(A) = P((X′_1, . . . , X′_k) ∈ A), for independent P-distributed random variables X′_1, . . . , X′_k. Additionally, to simplify notation, we will adopt the convention that X^0 = {∅} and P^0(X^0) = 1. Throughout, we will denote by $\mathbb{1}_A(z)$ the indicator function for a set A, which has the value 1 when z ∈ A and 0 otherwise; additionally, at times it will be more convenient to use the bipolar indicator function, defined as $\mathbb{1}^{\pm}_A(z) = 2\,\mathbb{1}_A(z) - 1$. We will require a few additional definitions for the discussion below. For any classifier h : X → {−1, +1} and finite sequence of labeled examples $L \in \bigcup_m (X \times \{-1,+1\})^m$, define the empirical error rate $\mathrm{er}_L(h) = |L|^{-1} \sum_{(x,y) \in L} \mathbb{1}_{\{-y\}}(h(x))$; for completeness, define er_∅(h) = 0. Also, for L = Z_m, the first m labeled examples in the data sequence, abbreviate this as er_m(h) = er_{Z_m}(h). For any probability measure P′ on X, set of classifiers H, classifier h, and r > 0, define B_{H,P′}(h, r) = {g ∈ H : P′(x : h(x) ≠ g(x)) ≤ r}; when P′ = P, the distribution of the unlabeled examples, and P is clear from the context, we abbreviate this as B_H(h, r) = B_{H,P}(h, r); furthermore, when P′ = P and H = C, the concept space, and both P and C are clear from the context, we abbreviate this as B(h, r) = B_{C,P}(h, r).
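The empirical error rate and the ball B_{H,P}(h, r) defined above can be written out directly for real-valued instances. The sketch below is illustrative only: the function names are hypothetical, and the ball is approximated by measuring disagreement on a finite sample rather than under the true distribution P.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], int]


def bipolar_indicator(member: bool) -> int:
    """Bipolar indicator: +1 when the point lies in the set, -1 otherwise."""
    return 2 * int(member) - 1


def empirical_error(h: Classifier, labeled: Sequence[Tuple[float, int]]) -> float:
    """er_L(h): fraction of pairs (x, y) in L with h(x) = -y; defined as 0 for empty L."""
    if not labeled:
        return 0.0
    return sum(1 for x, y in labeled if h(x) == -y) / len(labeled)


def ball(H: Sequence[Classifier], h: Classifier, r: float,
         sample: Sequence[float]) -> List[Classifier]:
    """Finite-sample stand-in for B_{H,P}(h, r).

    Assumption: the mass P(x : h(x) != g(x)) is replaced by the fraction of
    points in `sample` on which g and h disagree.
    """
    def disagreement(g: Classifier) -> float:
        return sum(1 for x in sample if g(x) != h(x)) / len(sample)

    return [g for g in H if disagreement(g) <= r]
```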
Also, for any set of classifiers H, and any sequence of labeled examples $L \in \bigcup_m (X \times \{-1,+1\})^m$, define H[L] = {h ∈ H : er_L(h) = 0}; for any (x, y) ∈ X × {−1, +1}, abbreviate H[(x, y)] = H[{(x, y)}] = {h ∈ H : h(x) = y}. We also adopt the usual definition of “shattering” used in learning theory (e.g., Vapnik, 1998). Specifically, for any set of classifiers H, k ∈ N, and S = (x_1, . . . , x_k) ∈ X^k, we say H shatters S if, ∀(y_1, . . . , y_k) ∈ {−1, +1}^k, ∃h ∈ H such that ∀i ∈ {1, . . . , k}, h(x_i) = y_i; equivalently, H shatters S if ∃{h_1, . . . , h_{2^k}} ⊆ H such that for each i, j ∈ {1, . . . , 2^k} with i ≠ j, ∃ℓ ∈ {1, . . . , k} with h_i(x_ℓ) ≠ h_j(x_ℓ). To simplify notation, we will also say that H shatters ∅ if and only if H ≠ {}. As usual, we define the VC dimension of C, denoted d, as the largest integer k such that ∃S ∈ X^k shattered by C (Vapnik and Chervonenkis, 1971; Vapnik, 1998). To focus on nontrivial problems, we will only consider concept spaces C with d > 0 in the results below. Generally, any such concept space C with d < ∞ is called a VC class.

2.1 Motivating Examples

Throughout this paper, we will repeatedly refer to a few canonical examples. Although themselves quite toy-like, they represent the boiled-down essence of some important distinctions between various types of learning problems. In some sense, the process of grappling with the fundamental distinctions raised by these types of examples has been a driving force behind much of the recent progress in understanding the label complexity of active learning. The first example is perhaps the most classic, and is clearly the first that comes to mind when considering the potential for active learning to provide strong improvements over passive learning.

Example 1 In the problem of learning threshold classifiers, we consider X = [0, 1] and $C = \{h_z(x) = \mathbb{1}^{\pm}_{[z,1]}(x) : z \in (0,1)\}$.

There is a simple universal activizer for threshold classifiers, based on a kind of binary search. Specifically, suppose n ∈ N and that A_p is any given passive learning algorithm. Consider the points in {X_1, X_2, . . . , X_m}, for m = 2^(n−1), and sort them in increasing order: X_(1), X_(2), . . . , X_(m). Also initialize ℓ = 0 and u = m + 1, and define X_(0) = 0 and X_(m+1) = 1. Now request the label of X_(i) for i = ⌊(ℓ + u)/2⌋ (i.e., the median point between ℓ and u); if the label is −1, let ℓ = i, and otherwise let u = i; repeat this (requesting this median point, then updating ℓ or u accordingly) until we have u = ℓ + 1. Finally, let ẑ = X_(u), construct the labeled sequence L̂ = {(X_1, h_ẑ(X_1)), . . . , (X_m, h_ẑ(X_m))}, and return the classifier ĥ = A_p(L̂). Since each label request at least halves the set of integers between ℓ and u, the total number of label requests is at most log_2(m) + 1 = n. Supposing f ∈ C is the target function, this procedure maintains the invariant that f(X_(ℓ)) = −1 and f(X_(u)) = +1. Thus, once we reach u = ℓ + 1, since f is a threshold, it must be some h_z with z ∈ (X_(ℓ), X_(u)]; therefore every X_(j) with j ≤ ℓ has f(X_(j)) = −1, and likewise every X_(j) with j ≥ u has f(X_(j)) = +1; in particular, this means L̂ equals Z_m, the true labeled sequence. But this means ĥ = A_p(Z_m). Since n = log_2(m) + 1, this active learning algorithm will achieve an equivalent error rate to what A_p achieves with m labeled examples, but using only log_2(m) + 1 label requests. In particular, this implies that if A_p achieves label complexity Λ_p, then this active learning algorithm achieves label complexity Λ_a such that Λ_a(ε, f, P) ≤ log_2 Λ_p(ε, f, P) + 2; as long as 1 ≪ Λ_p(ε, f, P) < ∞, this is o(Λ_p(ε, f, P)), so that this procedure activizes A_p for C.
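The binary-search activizer for thresholds translates almost line by line into code. The following is a minimal sketch under stated assumptions: `passive` stands for an arbitrary passive learner that accepts a list of (x, y) pairs, `request_label` is a hypothetical callback that returns f(x) for a queried point, and `xs` is a finite prefix of the unlabeled sequence containing at least 2^(n−1) points with no ties.

```python
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, int]]


def threshold(z: float) -> Callable[[float], int]:
    """Threshold classifier h_z(x) = +1 if x >= z, else -1."""
    return lambda x: 1 if x >= z else -1


def activized_thresholds(passive: Callable[[Labeled], object],
                         xs: List[float],
                         request_label: Callable[[float], int],
                         n: int):
    """Binary-search activizer for thresholds, using at most n label requests."""
    m = 2 ** (n - 1)
    if len(xs) < m:
        raise ValueError("need at least 2^(n-1) unlabeled points")
    # order[j] holds X_(j) for j = 1..m, with sentinels X_(0) = 0 and X_(m+1) = 1.
    order = [0.0] + sorted(xs[:m]) + [1.0]
    lo, hi = 0, m + 1
    while hi > lo + 1:
        i = (lo + hi) // 2                 # median index between lo and hi
        if request_label(order[i]) == -1:
            lo = i
        else:
            hi = i
    h_hat = threshold(order[hi])           # z_hat = X_(u)
    # Label all m points with h_hat; in the realizable case these match f.
    inferred = [(x, h_hat(x)) for x in xs[:m]]
    return passive(inferred)
```

Passing the identity function as `passive` returns the inferred labeled sample itself, which makes it easy to check that all 2^(n−1) labels agree with the target while only n of them were actually requested.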
The second example we consider is almost equally simple (only increasing the VC dimension from 1 to 2), but is far more subtle in terms of how we must approach its analysis in active learning.

Example 2 In the problem of learning interval classifiers, we consider X = [0, 1] and $C = \{h_{[a,b]}(x) = \mathbb{1}^{\pm}_{[a,b]}(x) : 0 < a \le b < 1\}$.

For the intervals problem, we can also construct a universal activizer, though slightly more complicated. Specifically, suppose again that n ∈ N and that A_p is any given passive learning algorithm. We first request the labels {Y_1, Y_2, . . . , Y_⌈n/2⌉} of the first ⌈n/2⌉ examples in the sequence. If every one of these labels is −1, then we immediately return the all-negative constant classifier ĥ(x) = −1. Otherwise, consider the points {X_1, X_2, . . . , X_m}, for m = max{2^(⌊n/4⌋−1), n}, and sort them in increasing order X_(1), X_(2), . . . , X_(m). For some value i ∈ {1, . . . , ⌈n/2⌉} with Y_i = +1, let j_+ denote the corresponding index j such that X_(j) = X_i. Also initialize ℓ_1 = 0, u_1 = ℓ_2 = j_+, and u_2 = m + 1, and define X_(0) = 0 and X_(m+1) = 1. Now if ℓ_1 + 1 < u_1, request the label of X_(i) for i = ⌊(ℓ_1 + u_1)/2⌋ (the median point between ℓ_1 and u_1); if the label is −1, let ℓ_1 = i, and otherwise let u_1 = i; repeat this (requesting this median point, then updating ℓ_1 or u_1 accordingly) until we have u_1 = ℓ_1 + 1. Now if ℓ_2 + 1 < u_2, request the label of X_(i) for i = ⌊(ℓ_2 + u_2)/2⌋ (the median point between ℓ_2 and u_2); if the label is −1, let u_2 = i, and otherwise let ℓ_2 = i; repeat this (requesting this median point, then updating u_2 or ℓ_2 accordingly) until we have u_2 = ℓ_2 + 1. Finally, let â = X_(u_1) and b̂ = X_(ℓ_2), construct the labeled sequence L̂ = {(X_1, h_[â,b̂](X_1)), . . . , (X_m, h_[â,b̂](X_m))}, and return the classifier ĥ = A_p(L̂).
Since each label request in the second phase halves the set of values between either ℓ_1 and u_1 or ℓ_2 and u_2, the total number of label requests is at most min{m, ⌈n/2⌉ + 2 log_2(m) + 2} ≤ n. Suppose f ∈ C is the target function, and let w(f) = P(x : f(x) = +1). If w(f) = 0, then with probability 1 the algorithm will return the constant classifier ĥ(x) = −1, which has er(ĥ) = 0 in this case. Otherwise, if w(f) > 0, then for any $n \ge \frac{2}{w(f)} \ln \frac{1}{\varepsilon}$, with probability at least 1 − ε, there exists i ∈ {1, . . . , ⌈n/2⌉} with Y_i = +1. Let H_+ denote the event that such an i exists. Supposing this is the case, the algorithm will make it into the second phase. In this case, the procedure maintains the invariant that f(X_(ℓ_1)) = −1, f(X_(u_1)) = f(X_(ℓ_2)) = +1, and f(X_(u_2)) = −1, where ℓ_1 < u_1 ≤ ℓ_2 < u_2. Thus, once we have u_1 = ℓ_1 + 1 and u_2 = ℓ_2 + 1, since f is an interval, it must be some h_[a,b] with a ∈ (X_(ℓ_1), X_(u_1)] and b ∈ [X_(ℓ_2), X_(u_2)); therefore, every X_(j) with j ≤ ℓ_1 or j ≥ u_2 has f(X_(j)) = −1, and likewise every X_(j) with u_1 ≤ j ≤ ℓ_2 has f(X_(j)) = +1; in particular, this means L̂ equals Z_m, the true labeled sequence. But this means ĥ = A_p(Z_m). Supposing A_p achieves label complexity Λ_p, and that $n \ge \max\left\{8 + 4\log_2 \Lambda_p(\varepsilon, f, P),\ \frac{2}{w(f)} \ln \frac{1}{\varepsilon}\right\}$, then m ≥ 2^(⌊n/4⌋−1) ≥ Λ_p(ε, f, P) and $E[\mathrm{er}(\hat{h})] \le E[\mathrm{er}(\hat{h})\,\mathbb{1}_{H_+}] + (1 - P(H_+)) \le E[\mathrm{er}(A_p(Z_m))] + \varepsilon \le 2\varepsilon$. In particular, this means this active learning algorithm achieves label complexity Λ_a such that, for any f ∈ C with w(f) = 0, Λ_a(2ε, f, P) = 0, and for any f ∈ C with w(f) > 0, $\Lambda_a(2\varepsilon, f, P) \le \max\left\{8 + 4\log_2 \Lambda_p(\varepsilon, f, P),\ \frac{2}{w(f)} \ln \frac{1}{\varepsilon}\right\}$. If (f, P) ∈ Nontrivial(Λ_p), then $\frac{2}{w(f)} \ln \frac{1}{\varepsilon} = o(\Lambda_p(\varepsilon, f, P))$ and 8 + 4 log_2 Λ_p(ε, f, P) = o(Λ_p(ε, f, P)), so that Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). Therefore, this procedure activizes A_p for C.
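The two-phase procedure for intervals (wait for a positive example, then run one binary search on each side of it) can be sketched in the same style as the thresholds case. As before, `passive` and `request_label` are hypothetical placeholders for the passive learner and the label oracle, and `xs` is assumed to be a tie-free finite prefix of the unlabeled sequence with at least max{2^(⌊n/4⌋−1), n} points. Labels are cached by index so that no point is charged twice, which is one way to respect the min{m, ·} bound on the number of requests.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[float, int]]


def interval(a: float, b: float) -> Callable[[float], int]:
    """Interval classifier h_[a,b](x) = +1 if a <= x <= b, else -1."""
    return lambda x: 1 if a <= x <= b else -1


def activized_intervals(passive: Callable[[Labeled], object],
                        xs: List[float],
                        request_label: Callable[[float], int],
                        n: int):
    """Two-phase activizer for intervals, using at most n label requests."""
    cache: Dict[int, int] = {}

    def label(idx: int) -> int:
        # Request the label of xs[idx] once; repeated lookups are free.
        if idx not in cache:
            cache[idx] = request_label(xs[idx])
        return cache[idx]

    first = -(-n // 2)                                 # ceil(n / 2)
    positives = [i for i in range(first) if label(i) == 1]
    if not positives:
        return lambda x: -1                            # default: all-negative classifier

    m = max(int(2 ** (n // 4 - 1)), n)
    if len(xs) < m:
        raise ValueError("not enough unlabeled points")
    order = sorted(range(m), key=lambda i: xs[i])      # order[j - 1] indexes X_(j)
    rank = {idx: j + 1 for j, idx in enumerate(order)}
    j_plus = rank[positives[0]]

    lo1, hi1, lo2, hi2 = 0, j_plus, j_plus, m + 1
    while hi1 > lo1 + 1:                               # search for the left end-point
        i = (lo1 + hi1) // 2
        if label(order[i - 1]) == -1:
            lo1 = i
        else:
            hi1 = i
    while hi2 > lo2 + 1:                               # search for the right end-point
        i = (lo2 + hi2) // 2
        if label(order[i - 1]) == -1:
            hi2 = i
        else:
            lo2 = i

    h_hat = interval(xs[order[hi1 - 1]], xs[order[lo2 - 1]])
    return passive([(x, h_hat(x)) for x in xs[:m]])
```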
This example also brings to light some interesting phenomena in the analysis of the label complexity of active learning. Note that unlike the thresholds example, we have a much stronger dependence on the target function in these label complexity bounds, via the w( f ) quantity. This issue is fundamental to the problem, and cannot be avoided. In particular, when P([0, x]) is continuous, this is the very issue that makes the minimax label complexity for this problem (i.e., minΛa max f ∈C Λa (ε , f , P)) no better than passive learning (Dasgupta, 2005). Thus, this problem emphasizes the need for any informative label complexity analysis of active learning to explicitly describe the dependence of the label complexity on the target function, as advocated by Dasgupta (2005). This example also highlights the unverifiability phenomenon explored by Balcan, Hanneke, and Vaughan (2010), since in the case of w( f ) = 0, the error rate of the returned classifier is zero, but (for nondegenerate P) there is no way for the algorithm to verify this fact based only on the finite number of labels it observes. In fact, Balcan, Hanneke, and Vaughan (2010) have shown that under continuous P, for any f ∈ C with w( f ) = 0, the number of labels required to both find a classifier of small error rate and verify that the error rate is small based only on observable quantities is essentially no better than for passive learning. These issues are present to a small degree in the intervals example, but were easily handled in a very natural way. The target-dependence shows up only in an initial phase of waiting for a positive example, and the always-negative classifiers were handled by setting a default return value. However, we can amplify these issues so that they show up in more subtle and involved ways. Specifically, consider the following example, studied by Balcan, Hanneke, and Vaughan (2010). 
Example 3 In the problem of learning unions of i intervals, we consider X = [0, 1] and $C = \left\{h_z(x) = \mathbb{1}^{\pm}_{\bigcup_{j=1}^{i} [z_{2j-1}, z_{2j}]}(x) : 0 < z_1 \le z_2 \le \ldots \le z_{2i} < 1\right\}$.

The challenge of this problem is that, because sometimes z_j = z_{j+1} for some j values, we do not know how many intervals are required to minimally represent the target function: only that it is at most i. This issue will be made clearer below. We can essentially think of any effective strategy here as having two components: one component that searches (perhaps randomly) with the purpose of identifying at least one example from each decision region, and another component that refines our estimates of the end-points of the regions the first component identifies. Later, we will go through the behavior of a universal activizer for this problem in detail.

3. Disagreement-Based Active Learning

At present, perhaps the best-understood active learning algorithms are those choosing their label requests based on disagreement among a set of remaining candidate classifiers. The canonical algorithm of this type, a version of which we discuss below in Section 5.1, was proposed by Cohn, Atlas, and Ladner (1994). Specifically, for any set H of classifiers, define the region of disagreement: DIS(H) = {x ∈ X : ∃h_1, h_2 ∈ H s.t. h_1(x) ≠ h_2(x)}. The basic idea of disagreement-based algorithms is that, at any given time in the algorithm, there is a subset V ⊆ C of remaining candidates, called the version space, which is guaranteed to contain the target f. When deciding whether to request a particular label Y_i, the algorithm simply checks whether X_i ∈ DIS(V): if so, the algorithm requests Y_i, and otherwise it does not. This general strategy is reasonable, since for any X_i ∉ DIS(V), the label agreed upon by V must be f(X_i), so that we would get no information by requesting Y_i; that is, for X_i ∉ DIS(V), we can accurately infer Y_i based on information already available.
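For a finite set of candidate classifiers and a finite pool of points, the version space and the region of disagreement can be computed by direct enumeration. The sketch below is only an illustration of the definitions: the function names and the restriction to finite H and a finite pool are assumptions, not part of the paper's construction.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], int]


def version_space(H: Sequence[Classifier],
                  labeled: Sequence[Tuple[float, int]]) -> List[Classifier]:
    """Classifiers in H that make no mistakes on the labeled examples."""
    return [h for h in H if all(h(x) == y for x, y in labeled)]


def region_of_disagreement(V: Sequence[Classifier],
                           pool: Sequence[float]) -> List[float]:
    """Points of `pool` on which at least two classifiers in V disagree."""
    return [x for x in pool if len({h(x) for h in V}) > 1]


# Illustrative use with a grid of threshold classifiers h_z(x) = +1 iff x >= z.
H = [lambda x, z=z / 10: 1 if x >= z else -1 for z in range(1, 10)]
V = version_space(H, [(0.25, -1), (0.65, 1)])
DIS = region_of_disagreement(V, [0.2, 0.35, 0.55, 0.7])
```

In this toy instance, only the pool points lying between the largest observed negative example and the smallest observed positive example fall in the region of disagreement; the labels of the remaining points are already determined by V.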
This type of algorithm has recently received substantial attention, not only for its obvious elegance and simplicity, but also because (as we discuss in Section 6) there are natural ways to extend the technique to the general problem of learning with label noise and model misspecification (the agnostic setting). The details of disagreement-based algorithms can vary in how they update the set V and how frequently they do so, but it turns out almost all disagreement-based algorithms share many of the same fundamental properties, which we describe below.

3.1 A Basic Disagreement-Based Active Learning Algorithm

In Section 5.1, we discuss several known results on the label complexities achievable by these types of active learning algorithms. However, for now let us examine a very basic algorithm of this type. The following is intended to be a simple representative of the family of disagreement-based active learning algorithms. It has been stripped down to the bare essentials of what makes such algorithms work. As a result, although the gap between its label complexity and that achieved by passive learning is not necessarily as large as those achieved by the more sophisticated disagreement-based active learning algorithms of Section 5.1, it has the property that whenever those more sophisticated methods have label complexities asymptotically superior to those achieved by passive learning, that guarantee will also be true for this simpler method, and vice versa. The algorithm operates in only 2 phases. In the first, it uses one batch of label requests to reduce the version space V to a subset of C; in the second, it uses another batch of label requests, this time only requesting labels for points in DIS(V). Thus, we have isolated precisely that aspect of disagreement-based active learning that involves improvements due to only requesting the labels of examples in the region of disagreement.
The procedure is formally defined as follows, in terms of an estimator P̂n(DIS(V)) specified below.

Meta-Algorithm 0
Input: passive algorithm Ap, label budget n
Output: classifier ĥ
0. Request the first ⌊n/2⌋ labels {Y1, . . . , Y⌊n/2⌋}, and let t ← ⌊n/2⌋
1. Let V = {h ∈ C : er⌊n/2⌋(h) = 0}
2. Let Δ̂ ← P̂n(DIS(V))
3. Let L ← {}
4. For m = ⌊n/2⌋ + 1, . . . , ⌊n/2⌋ + ⌊n/(4Δ̂)⌋
5.   If Xm ∈ DIS(V) and t < n, request the label Ym of Xm, and let ŷ ← Ym and t ← t + 1
6.   Else let ŷ ← h(Xm) for an arbitrary h ∈ V
7.   Let L ← L ∪ {(Xm, ŷ)}
8. Return Ap(L)

Meta-Algorithm 0 depends on a data-dependent estimator P̂n(DIS(V)) of P(DIS(V)), which we can define in a variety of ways using only unlabeled examples. In particular, for the theorems below, we will take the following definition for P̂n(DIS(V)), designed to be a confidence upper bound on P(DIS(V)). Let Un = {X_{n²+1}, . . . , X_{2n²}}. Then define

$\hat{P}_n(\mathrm{DIS}(V)) = \max\left\{ \frac{2}{n^2} \sum_{x \in U_n} \mathbb{1}_{\mathrm{DIS}(V)}(x), \ \frac{4}{n} \right\}$.   (1)

Meta-Algorithm 0 is divided into two stages: one stage where we focus on reducing V, and a second stage where we construct the sample L for the passive algorithm. This might intuitively seem somewhat wasteful, as one might wish to use the requested labels from the first stage to augment those in the second stage when constructing L, thus feeding all of the observed labels into the passive algorithm Ap. Indeed, this can improve the label complexity in some cases (albeit only by a constant factor); however, in order to get the general property of being an activizer for all passive algorithms Ap, we construct the sample L so that the conditional distribution of the X components in L given |L| is P^{|L|}, so that it is (conditionally) an i.i.d. sample, which is essential to our analysis.
The choice of the number of (unlabeled) examples to process in the second stage guarantees (by a Chernoff bound) that the “t < n” constraint in Step 5 is redundant; this is a trick we will employ in several of the methods below. As explained above, because f ∈ V, this implies that every (x, y) ∈ L has y = f(x).

To give some basic intuition for how this algorithm behaves, consider the example of learning threshold classifiers (Example 1); to simplify the explanation, for now we ignore the fact that P̂n is only an estimate, as well as the “t < n” constraint in Step 5 (both of which will be addressed in the general analysis below). In this case, suppose the target function is f = hz. Let a = max{Xi : Xi < z, 1 ≤ i ≤ ⌊n/2⌋} and b = min{Xi : Xi ≥ z, 1 ≤ i ≤ ⌊n/2⌋}. Then V = {hz′ : a < z′ ≤ b} and DIS(V) = (a, b), so that the second phase of the algorithm only requests labels for a number of points in the region (a, b). With probability 1 − ε, the probability mass in this region is at most O(log(1/ε)/n), so that |L| ≥ ℓ_{n,ε} = Ω(n²/log(1/ε)); also, since the labels in L are all correct, and the Xm values in L are conditionally i.i.d. (with distribution P) given |L|, we see that the conditional distribution of L given |L| = ℓ is the same as the (unconditional) distribution of Zℓ. In particular, if Ap achieves label complexity Λp, and ĥn is the classifier returned by Meta-Algorithm 0 applied to Ap, then for any $n = \Omega\left(\sqrt{\Lambda_p(\varepsilon, f, P)\log(1/\varepsilon)}\right)$ chosen so that ℓ_{n,ε} ≥ Λp(ε, f, P), we have

$E\left[\mathrm{er}(\hat{h}_n)\right] \le \varepsilon + \sup_{\ell \ge \ell_{n,\varepsilon}} E\left[\mathrm{er}(A_p(Z_\ell))\right] \le \varepsilon + \sup_{\ell \ge \Lambda_p(\varepsilon, f, P)} E\left[\mathrm{er}(A_p(Z_\ell))\right] \le 2\varepsilon$.

This indicates the active learning algorithm achieves label complexity Λa with

$\Lambda_a(2\varepsilon, f, P) = O\left(\sqrt{\Lambda_p(\varepsilon, f, P)\log(1/\varepsilon)}\right)$.

In particular, if ∞ > Λp(ε, f, P) = ω(log(1/ε)), then Λa(2ε, f, P) = o(Λp(ε, f, P)). Therefore, Meta-Algorithm 0 is a universal activizer for the space of threshold classifiers.
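The threshold walk-through above can be simulated directly. The following Python sketch specializes Meta-Algorithm 0 to threshold classifiers on [0, 1] with P uniform and builds the labeled sample L that would be handed to the passive algorithm in Step 8. The function name, the use of a simulated target in place of a labeling oracle, and the choice of h_b as the "arbitrary h ∈ V" in Step 6 are assumptions of this illustration, not part of the paper.

```python
import math
import random


def meta_algorithm_0_thresholds(z_true: float, n: int, rng: random.Random):
    """Sketch of Meta-Algorithm 0 for thresholds h_z(x) = +1 iff x >= z,
    with P uniform on [0, 1]. Returns (L, t): the constructed sample and
    the total number of label requests used."""
    f = lambda x: 1 if x >= z_true else -1  # simulated target / label oracle

    # Steps 0-1: request floor(n/2) labels; V = {h_z' : a < z' <= b}.
    half = n // 2
    first = [rng.random() for _ in range(half)]
    a = max((x for x in first if f(x) == -1), default=0.0)
    b = min((x for x in first if f(x) == +1), default=1.0)
    t = half
    in_dis = lambda x: a < x < b  # DIS(V) = (a, b)

    # Step 2: estimator (1), computed from n^2 fresh unlabeled points.
    U = [rng.random() for _ in range(n * n)]
    delta_hat = max(2.0 / (n * n) * sum(in_dis(x) for x in U), 4.0 / n)

    # Steps 3-7: build L, requesting labels only inside DIS(V).
    L = []
    for _ in range(math.floor(n / (4 * delta_hat))):
        x = rng.random()
        if in_dis(x) and t < n:
            y = f(x)  # label request
            t += 1
        else:
            y = 1 if x >= b else -1  # label of h_b, one member of V
        L.append((x, y))
    return L, t


L, t = meta_algorithm_0_thresholds(z_true=0.5, n=200, rng=random.Random(0))
print(len(L), t)
```

In runs of this kind one typically sees |L| far exceed the budget n while t stays well below n, which is the effect the analysis above quantifies; the exact numbers depend on the random draws.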
In contrast, consider the problem of learning interval classifiers (Example 2). In this case, suppose the target function f has P(x : f(x) = +1) = 0, and that P is uniform in [0, 1]. Since (with probability one) every Yi = −1, we have V = {h[a,b] : {X1, . . . , X⌊n/2⌋} ∩ [a, b] = ∅}. But this contains classifiers h[a,a] for every a ∈ (0, 1) \ {X1, . . . , X⌊n/2⌋}, so that DIS(V) = (0, 1) \ {X1, . . . , X⌊n/2⌋}. Thus, P(DIS(V)) = 1, and |L| = O(n); that is, Ap gets run with no more labeled examples than simple passive learning would use. This indicates we should not expect Meta-Algorithm 0 to be a universal activizer for interval classifiers. Below, we formalize this by constructing a passive learning algorithm Ap that Meta-Algorithm 0 does not activize in scenarios of this type.

3.2 The Limiting Region of Disagreement

In this subsection, we generalize the examples from the previous subsection. Specifically, we prove that the performance of Meta-Algorithm 0 is intimately tied to a particular limiting set, referred to as the disagreement core. A similar definition was given by Balcan, Hanneke, and Vaughan (2010) (there referred to as the boundary, for reasons that will become clear below); it is also related to certain quantities in the work of Hanneke (2007b, 2011) described below in Section 5.1.

Definition 4 Define the disagreement core of a classifier f with respect to a set of classifiers H and probability measure P as

$\partial_{H,P} f = \lim_{r \to 0} \mathrm{DIS}\left(B_{H,P}(f, r)\right)$.

When the measure is the data distribution P on X, and P is clear from the context, we abbreviate this as ∂H f = ∂H,P f; if additionally H = C, the full concept space, which is clear from the context, we further abbreviate this as ∂f = ∂C f = ∂C,P f.

As we will see, disagreement-based algorithms often tend to focus their label requests around the disagreement core of the target function. As such, the concept of the disagreement core will be essential in much of our discussion below.
We therefore go through a few examples to build intuition about this concept and its properties. Perhaps the simplest example to start with is C as the class of threshold classifiers (Example 1), under P uniform on [0, 1]. For any hz ∈ C and sufficiently small r > 0, B(hz, r) = {hz′ : |z′ − z| ≤ r}, and DIS(B(hz, r)) = [z − r, z + r). Therefore,

$\partial h_z = \lim_{r \to 0} \mathrm{DIS}(B(h_z, r)) = \lim_{r \to 0} [z - r, z + r) = \{z\}$.

Thus, in this case, the disagreement core of hz with respect to C and P is precisely the decision boundary of the classifier. As a slightly more involved example, consider again the example of interval classifiers (Example 2), again under P uniform on [0, 1]. Now for any h[a,b] ∈ C with b − a > 0, for any sufficiently small r > 0, B(h[a,b], r) = {h[a′,b′] : |a − a′| + |b − b′| ≤ r}, and DIS(B(h[a,b], r)) = [a − r, a + r) ∪ (b − r, b + r]. Therefore,

$\partial h_{[a,b]} = \lim_{r \to 0} \mathrm{DIS}(B(h_{[a,b]}, r)) = \lim_{r \to 0} [a - r, a + r) \cup (b - r, b + r] = \{a, b\}$.

Thus, in this case as well, the disagreement core of h[a,b] with respect to C and P is again the decision boundary of the classifier.

As the above two examples illustrate, ∂f often corresponds to the decision boundary of f in some geometric interpretation of X and f. Indeed, under fairly general conditions on C and P, the disagreement core of f does correspond to (a subset of) the set of points dividing the two label regions of f; for instance, Friedman (2009) derives sufficient conditions under which this is the case. In these cases, the behavior of disagreement-based active learning algorithms can often be interpreted in the intuitive terms of seeking label requests near the decision boundary of the target function, to refine an estimate of that boundary. However, in some more subtle scenarios this is no longer the case, for interesting reasons. To illustrate this, let us continue the example of interval classifiers from above, but now consider h[a,a] (i.e., h[a,b] with a = b).
This time, for any r ∈ (0, 1) we have B(h[a,a], r) = {h[a′,b′] ∈ C : b′ − a′ ≤ r}, and DIS(B(h[a,a], r)) = (0, 1). Therefore,

$\partial h_{[a,a]} = \lim_{r \to 0} \mathrm{DIS}(B(h_{[a,a]}, r)) = \lim_{r \to 0} (0, 1) = (0, 1)$.

This example shows that in some cases, the disagreement core does not correspond to the decision boundary of the classifier, and indeed has P(∂f) > 0. Intuitively, as in the above example, this typically happens when the decision surface of the classifier is in some sense simpler than it could be. For instance, consider the space C of unions of two intervals (Example 3 with i = 2) under uniform P. The classifiers f ∈ C with P(∂f) > 0 are precisely those representable (up to probability zero differences) as a single interval. The others (with 0 < z1 < z2 < z3 < z4 < 1) have ∂hz = {z1, z2, z3, z4}. In these examples, the f ∈ C with P(∂f) > 0 are not only simpler than other nearby classifiers in C, but they are also in some sense degenerate relative to the rest of C; however, it turns out this is not always the case, as there exist scenarios (C, P), even with d = 2, and even with countable C, for which every f ∈ C has P(∂f) > 0; in these cases, every classifier is in some important sense simpler than some other subset of nearby classifiers in C.

In Section 3.3, we show that the label complexity of disagreement-based active learning is intimately tied to the disagreement core. In particular, scenarios where P(∂f) > 0, such as those mentioned above, lead to the conclusion that disagreement-based methods are sometimes insufficient for activized learning. This motivates the design of more sophisticated methods in Section 4, which overcome this deficiency, along with a corresponding refinement of the definition of “disagreement core” in Section 5.2 that eliminates the above issue with “simple” classifiers.
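The contrast between the two interval examples can be checked numerically on a grid. The Python sketch below discretizes [0, 1] into N cells, restricts interval endpoints to multiples of 1/N, and measures the fraction of cells lying in DIS(B(h[a,b], r)) under uniform P, where the distance between two intervals is the length of their symmetric difference. The grid, the integer units, and the function names are assumptions made for this illustration; it only approximates the continuous sets described in the text.

```python
N = 100  # grid resolution: endpoints are i/N, cell i is [i/N, (i+1)/N]


def symdiff_len(a: int, b: int, c: int, d: int) -> int:
    """Length (in units of 1/N) of the symmetric difference of the
    intervals [a, b] and [c, d]; under uniform P this is N * P(h != h')."""
    overlap = max(0, min(b, d) - max(a, c))
    return (b - a) + (d - c) - 2 * overlap


def dis_fraction(a: int, b: int, r: int) -> float:
    """Fraction of grid cells on which some interval classifier within
    distance r of h_[a,b] disagrees with h_[a,b]."""
    disagree = [False] * N
    for c in range(N + 1):
        for d in range(c, N + 1):
            if symdiff_len(a, b, c, d) > r:
                continue  # h_[c,d] is outside the ball B(h_[a,b], r)
            for i in range(N):
                # cell i lies inside [a, b] iff a <= i < b
                if (a <= i < b) != (c <= i < d):
                    disagree[i] = True
    return sum(disagree) / N


# Proper interval [0.3, 0.7]: the disagreement region shrinks with r.
print(dis_fraction(30, 70, 4), dis_fraction(30, 70, 1))
# Zero-width interval [0.5, 0.5]: the region does not shrink.
print(dis_fraction(50, 50, 1))
```

For the proper interval the covered fraction scales with r, consistent with a disagreement core {a, b} of probability zero; for the zero-width interval every cell is covered even at the smallest radius, consistent with ∂h[a,a] = (0, 1).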
3.3 Necessary and Sufficient Conditions for Disagreement-Based Activized Learning

In the specific case of Meta-Algorithm 0, for large n we may intuitively expect it to focus its second batch of label requests in and around the disagreement core of the target function. Thus, whenever P(∂f) = 0, we should expect the label requests to be quite focused, and therefore the algorithm should achieve smaller label complexity compared to passive learning. On the other hand, if P(∂f) > 0, then the label requests will not become focused beyond a constant fraction of the space, so that the improvements achieved by Meta-Algorithm 0 over passive learning should be, at best, a constant factor. This intuition is formalized in the following general theorem, the proof of which is included in Appendix A.

Theorem 5 For any VC class C, Meta-Algorithm 0 is a universal activizer for C if and only if every f ∈ C and distribution P has P(∂C,P f) = 0.

While the formal proof is given in Appendix A, the general idea is simple. As we always have f ∈ V, any ŷ inferred in Step 6 must equal f(x), so that all of the labels in L are correct. Also, as n grows large, classic results on passive learning imply the diameter of the set V will become small, shrinking to zero as n → ∞ (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989). Therefore, as n → ∞, DIS(V) should converge to a subset of ∂f, so that in the case P(∂f) = 0, we have Δ̂ → 0; thus |L| ≫ n, which implies an asymptotic strict improvement in label complexity over the passive algorithm Ap that L is fed into in Step 8. On the other hand, since ∂f is defined by classifiers arbitrarily close to f, it is unlikely that any finite sample of correctly labeled examples can contradict enough classifiers to make DIS(V) significantly smaller than ∂f, so that we always have P(DIS(V)) ≥ P(∂f).
Therefore, if P(∂f) > 0, then Δ̂ converges to some nonzero constant, so that |L| = O(n), representing only a constant factor improvement in label complexity. In fact, as is implied from this sketch (and is proven in Appendix A), the targets f and distributions P for which Meta-Algorithm 0 achieves asymptotic strict improvements for all passive learning algorithms (for which f and P are nontrivial) are precisely those (and only those) for which P(∂C,P f) = 0.

There are some general conditions under which the zero-probability disagreement cores condition of Theorem 5 will hold. For instance, it is not difficult to show this will always hold when X is countable. Furthermore, with some effort one can show it will hold for most classes having VC dimension one (e.g., any countable C with d = 1). However, as we have seen, not all spaces C satisfy this zero-probability disagreement cores property. In particular, for the interval classifiers studied in Section 3.2, we have P(∂h[a,a]) = P((0, 1)) = 1. Indeed, the aforementioned special cases aside, for most nontrivial spaces C, one can construct distributions P that in some sense make C mimic the intervals problem, so that we should typically expect disagreement-based methods will not be activizers. For detailed discussions of various scenarios where the P(∂C,P f) = 0 condition is (or is not) satisfied for various C, P, and f, see the works of Hanneke (2009b), Hanneke (2007b), Hanneke (2011), Balcan, Hanneke, and Vaughan (2010), Friedman (2009), Wang (2009) and Wang (2011).

4. Beyond Disagreement: A Basic Activizer

Since the zero-probability disagreement cores condition of Theorem 5 is not always satisfied, we are left with the question of whether there could be other techniques for active learning, beyond simple disagreement-based methods, which could activize every passive learning algorithm for every VC class.
In this section, we present an entirely new type of active learning algorithm, unlike anything in the existing literature, and we show that indeed it is a universal activizer for any class C of finite VC dimension.

4.1 A Basic Activizer

As mentioned, the case P(∂f) = 0 is already handled nicely by disagreement-based methods, since the label requests made in the second stage of Meta-Algorithm 0 will become focused into a small region, and L therefore grows faster than n. Thus, the primary question we are faced with is what to do when P(∂f) > 0. Since (loosely speaking) we have DIS(V) → ∂f in Meta-Algorithm 0, P(∂f) > 0 corresponds to scenarios where the label requests of Meta-Algorithm 0 will not become focused beyond a certain extent; specifically, as we show in Appendix B (Lemmas 35 and 36), P(DIS(V) ⊕ ∂f) → 0 almost surely, where ⊕ is the symmetric difference, so that we expect Meta-Algorithm 0 will request labels for at least some constant fraction of the examples in L.

On the one hand, this is definitely a major problem for disagreement-based methods, since it prevents them from improving over passive learning in those cases. On the other hand, if we do not restrict ourselves to disagreement-based methods, we may actually be able to exploit properties of this scenario, so that it works to our advantage. In particular, in addition to the fact that P(DIS(V) ⊕ ∂C f) → 0, we show in Appendix B (Lemma 35) that P(∂V f ⊕ ∂C f) = 0 (almost surely) in Meta-Algorithm 0; this implies that for sufficiently large n, a random point x1 in DIS(V) is likely to be in ∂V f. We can exploit this fact by using x1 to split V into two subsets: V[(x1, +1)] and V[(x1, −1)]. Now, if x1 ∈ ∂V f, then (by definition of the disagreement core)

$\inf_{h \in V[(x_1,+1)]} \mathrm{er}(h) = \inf_{h \in V[(x_1,-1)]} \mathrm{er}(h) = 0$.

Therefore, for almost every point x ∉ DIS(V[(x1, +1)]), the label agreed upon for x by classifiers in V[(x1, +1)] should be f(x).
Likewise, for almost every point x ∉ DIS(V[(x1, −1)]), the label agreed upon for x by classifiers in V[(x1, −1)] should be f(x). Thus, we can accurately infer the label of any point x ∉ DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]) (except perhaps a zero-probability subset). With these sets V[(x1, +1)] and V[(x1, −1)] in hand, there is no longer a need to request the labels of points for which either of them has agreement about the label, and we can focus our label requests to the region DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]), which may be much smaller than DIS(V). Now if P(DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)])) → 0, then the label requests will become focused to a shrinking region, and by the same reasoning as for Theorem 5 we can asymptotically achieve strict improvements over passive learning by a method analogous to Meta-Algorithm 0 (with the above changes).

Already this provides a significant improvement over disagreement-based methods in many cases; indeed, in some cases (such as intervals) this fully addresses the nonzero-probability disagreement core issue in Theorem 5. In other cases (such as unions of two intervals), it does not completely address the issue, since for some targets we do not have P(DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)])) → 0. However, by repeatedly applying this same reasoning, we can address the issue in full generality. Specifically, if P(DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)])) ↛ 0, then DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]) essentially converges to a region ∂C[(x1,+1)] f ∩ ∂C[(x1,−1)] f, which has nonzero probability, and is nearly equivalent to ∂V[(x1,+1)] f ∩ ∂V[(x1,−1)] f. Thus, for sufficiently large n, a random x2 in DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]) will likely be in ∂V[(x1,+1)] f ∩ ∂V[(x1,−1)] f.
In this case, we can repeat the above argument, this time splitting V into four sets (V[(x1, +1)][(x2, +1)], V[(x1, +1)][(x2, −1)], V[(x1, −1)][(x2, +1)], and V[(x1, −1)][(x2, −1)]), each with infimum error rate equal to zero, so that for a point x in the region of agreement of any of these four sets, the agreed-upon label will (almost surely) be f(x), so that we can infer that label. Thus, we need only request the labels of those points in the intersection of all four regions of disagreement. We can further repeat this process as many times as needed, until we get a partition of V with shrinking probability mass in the intersection of the regions of disagreement, which (as above) can then be used to obtain asymptotic improvements over passive learning.

Note that the above argument can be written more concisely in terms of shattering. That is, any x ∈ DIS(V) is simply an x such that V shatters {x}; a point x ∈ DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]) is simply one for which V shatters {x1, x}, and for any x ∉ DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]), the label y we infer about x has the property that the set V[(x, −y)] does not shatter {x1}. This continues for each repetition of the above idea, with x in the intersection of the four regions of disagreement simply being one for which V shatters {x1, x2, x}, and so on. In particular, this perspective makes it clear that we need only repeat this idea at most d times to get a shrinking intersection region, since no set of d + 1 points is shatterable. Note that there may be unobservable factors (e.g., the target function) determining the appropriate number of iterations of this idea sufficient to have a shrinking probability of requesting a label, while maintaining the accuracy of inferred labels. To address this, we can simply try all d + 1 possibilities, and then select one of the resulting d + 1 classifiers via a kind of tournament of pairwise comparisons.
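The shattering reformulation above can be made concrete for a finite set of classifiers. The Python sketch below (with a hypothetical finite grid of interval classifiers standing in for V, and helper names chosen for this example) checks that x ∈ DIS(V[(x1, +1)]) ∩ DIS(V[(x1, −1)]) holds exactly when V shatters {x1, x}, and that no three points are shattered by intervals.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[float], int]


def interval(a: float, b: float) -> Classifier:
    """Interval classifier h_[a,b]: +1 on [a, b], -1 elsewhere."""
    return lambda x: 1 if a <= x <= b else -1


def shatters(V: Sequence[Classifier], points: Sequence[float]) -> bool:
    """True iff V realizes all 2^|points| labelings of the given points."""
    realized = {tuple(h(x) for x in points) for h in V}
    return len(realized) == 2 ** len(points)


def restrict(V: Sequence[Classifier], x: float, y: int) -> List[Classifier]:
    """V[(x, y)]: the classifiers in V that label x as y."""
    return [h for h in V if h(x) == y]


def in_dis(V: Sequence[Classifier], x: float) -> bool:
    """x in DIS(V), i.e. V shatters {x}."""
    return shatters(V, [x])


# Hypothetical finite version space: intervals with endpoints on a 0.1 grid.
V = [interval(a / 10, b / 10) for a in range(11) for b in range(a, 11)]

x1 = 0.25
for x in (0.55, 0.75, 0.25):
    both = in_dis(restrict(V, x1, +1), x) and in_dis(restrict(V, x1, -1), x)
    print(x, both, shatters(V, [x1, x]))

# Intervals have VC dimension 2: the labeling (+1, -1, +1) is unrealizable.
print(shatters(V, [0.25, 0.55, 0.75]))
```

Counting realized labelings this way is only practical for small finite classes; it is meant to illustrate the equivalence, not the estimators used by Meta-Algorithm 1.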
Also, in order to reduce the probability of a mistaken inference due to x1 ∉ ∂V f (or similarly for later xi), we can replace each single xi with multiple samples, and then take a majority vote over whether to infer the label, and which label to infer if we do so; generally, we can think of this as estimating certain probabilities, and below we write these estimators as P̂m, and discuss the details of their implementation later. Combining Meta-Algorithm 0 with the above reasoning motivates a new type of active learning algorithm, referred to as Meta-Algorithm 1 below, and stated as follows.

Meta-Algorithm 1
Input: passive algorithm Ap, label budget n
Output: classifier ĥ
0. Request the first mn = ⌊n/3⌋ labels, {Y1, . . . , Ymn}, and let t ← mn
1. Let V = {h ∈ C : ermn(h) = 0}
2. For k = 1, 2, . . . , d + 1
3.   $\hat{\Delta}^{(k)} \leftarrow \hat{P}_{m_n}\left( x : \hat{P}\left( S \in X^{k-1} : V \text{ shatters } S \cup \{x\} \mid V \text{ shatters } S \right) \ge 1/2 \right)$
4.   Let Lk ← {}
5.   For $m = m_n + 1, \dots, m_n + \lfloor n / (6 \cdot 2^k \hat{\Delta}^{(k)}) \rfloor$
6.     If $\hat{P}_m\left( S \in X^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S \right) \ge 1/2$ and t < ⌊2n/3⌋
7.       Request the label Ym of Xm, and let ŷ ← Ym and t ← t + 1
8.     Else, let $\hat{y} \leftarrow \operatorname{argmax}_{y \in \{-1,+1\}} \hat{P}_m\left( S \in X^{k-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S \right)$
9.     Let Lk ← Lk ∪ {(Xm, ŷ)}
10. Return ActiveSelect({Ap(L1), Ap(L2), . . . , Ap(Ld+1)}, ⌊n/3⌋, {X_{mn + max_k |Lk| + 1}, . . .})

Subroutine: ActiveSelect
Input: set of classifiers {h1, h2, . . . , hN}, label budget m, sequence of unlabeled examples U
Output: classifier ĥ
0. For each j, k ∈ {1, 2, . . . , N} s.t. j < k,
1.   Let Rjk be the first $\lfloor m / (j(N - j)\ln(eN)) \rfloor$ points in U ∩ {x : hj(x) ≠ hk(x)} (if such values exist)
2.   Request the labels for Rjk and let Qjk be the resulting set of labeled examples
3.   Let mkj = erQjk(hk)
4. Return hk̂ , where k̂ = max k ∈ {1, . . . , N} : max j 2 er(h ∗∗ ).
In particular, this implies er(h |{x : Now suppose j ∈ {k j j k hk∗∗ (x) = h j (x)}) > 2/3 and P(x : h j (x) = hk∗∗ (x)) > 0, which again means (with probability one) |{XM , XM+1 , . . .} ∩ {x : h j (x) = hk∗∗ (x)}| ≥ Mk∗∗ . By Hoeffding’s inequality, we have that P m jk∗∗ ≤ 7/12 ≤ exp {−Mk∗∗ /72} ≤ exp {1 − m/ (72k∗ N ln(eN))} . By a union bound, we have that P ∃ j > k∗∗ : er(h j ) > 2 er(hk∗∗ ) and m jk∗∗ ≤ 7/12 ≤ (N − k∗∗ ) · exp {1 − m/ (72k∗ N ln(eN))} . ˆ In particular, when k ≥ k∗∗ , and m jk∗∗ > 7/12 for all j > k∗∗ with er(h j ) > 2 er(hk∗∗ ), it must be true that er(hk ) ≤ 2 er(hk∗∗ ) ≤ 2 er(hk∗ ). ˆ ˆ So, by a union bound, with probability ≥ 1 − eN · exp {−m/ (72k∗ N ln(eN))}, the k chosen by ActiveSelect has er(hk ) ≤ 2 er(hk∗ ). ˆ ⋆ The next two lemmas describe the limiting behavior of S k (Vm ). In particular, we see that its k limiting value is precisely ∂C f (up to zero-probability differences). Lemma 35 establishes that k (V ⋆ ) does not decrease below ∂ k f (except for a zero-probability set), and Lemma 36 establishes S m C k that its limit is not larger than ∂C f (again, except for a zero-probability set). Lemma 35 There is an event H ′ with P(H ′ ) = 1 such that on H ′ , ∀m ∈ N, ∀k ∈ {0, . . . , d˜f − 1}, for ⋆ any H with Vm ⊆ H ⊆ C, k k k P k S k (H) ∂C f = P k ∂H f ∂C f = 1, and ∀i ∈ N, ½∂ k Hf (k+1) Si = ½∂ k f Si (k+1) C . (k) k k Also, on H ′ , every such H has P k ∂H f = P k ∂C f , and Mℓ (H) → ∞ as ℓ → ∞. ⋆ Proof We will show the first claim for the set Vm , and the result will then hold for H by monotonicity. In particular, we will show this for any fixed k ∈ {0, . . . , d˜f − 1} and m ∈ N, and the k ⋆ existence of H ′ then holds by a union bound. Fix any set S ∈ ∂C f . Suppose BVm ( f , r) does not (i) (i) (i) shatter S for some r > 0. There is an infinite sequence of sets {{h1 , h2 , . . . , h2k }}i with ∀ j ≤ 2k , (i) (i) (i) ⋆ P(x : h j (x) = f (x)) ↓ 0, such that each {h1 , . . . , h2k } ⊆ B( f , r) and shatters S. 
Since BVm ( f , r) does not shatter S, / ⋆ 1 = inf ½ ∃ j : h j ∈ BVm ( f , r) = inf ½ ∃ j ≤ 2k , ℓ ≤ m : h j (Xℓ ) = f (Xℓ ) . (i) i (i) i 1536 ACTIVIZED L EARNING But P inf ½ ∃ j ≤ 2k , ℓ ≤ m : h j (Xℓ ) = f (Xℓ ) = 1 ≤ inf P ∃ j ≤ 2k , ℓ ≤ m : h j (Xℓ ) = f (Xℓ ) (i) (i) i i ≤ lim i→∞ (i) (i) mP x : h j (x) = f (x) = j≤2k j≤2k m lim P x : h j (x) = f (x) = 0, i→∞ ⋆ where the second inequality follows by a union bound. Therefore, ∀r > 0, P S ∈ S k BVm ( f , r) = / ¯ ⋆ 0. Furthermore, since S k BVm ( f , r) is monotonic in r, the dominated convergence theorem gives us that ⋆ P S ∈ ∂Vm f = E lim ½S k (BV ⋆ ( f ,r)) (S) = lim P S ∈ S k BVm ( f , r) / k⋆ / ¯ r→0 r→0 m = 0. ⋆ This implies that (letting S ∼ P k be independent from Vm ) k k ¯k⋆ ¯k⋆ P P k ∂Vm f ∂C f > 0 = P P k ∂Vm f ∩ ∂C f > 0 k ¯k⋆ = lim P P k ∂Vm f ∩ ∂C f > ξ ξ →0 ≤ lim 1 ξ →0 ξ k ¯k⋆ E P k ∂Vm f ∩ ∂C f 1 E ξ →0 ξ = lim ½∂C f (S)P S ∈ ∂Vm f S / k⋆ k (Markov) (Fubini) = lim 0 = 0. ξ →0 ⋆ This establishes the first claim for Vm , on an event of probability 1, and monotonicity extends the ⋆ claim to any H ⊇ Vm . Also note that, on this event, k k k k k k k P k ∂H f ≥ P k ∂H f ∩ ∂C f = P k ∂H f ∂C f P k ∂C f = P k ∂C f , k k where the last equality follows from the first claim. Noting that for H ⊆ C, ∂H f ⊆ ∂C f , we must have k k P k ∂H f = P k ∂C f . This establishes the third claim. From the first claim, for any given value of i ∈ N the second claim (k+1) ⋆ holds for Si (with H = Vm ) on an additional event of probability 1; taking a union bound over (k) all i ∈ N extends this claim to every Si on an event of probability 1. Monotonicity then implies ½∂C f Si(k+1) = ½∂V ⋆ f Si(k+1) ≤ ½∂H f Si(k+1) ≤ ½∂C f Si(k+1) , k k k k m k extending the result to general H. 
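Also, as $k < \tilde{d}_f$, we know $P^k(\partial_C^k f) > 0$, and since we also know $V_m^\star$ is independent from $W_2$, the strong law of large numbers implies the final claim (for $V_m^\star$) on an additional event of probability 1; again, monotonicity extends this claim to any $H \supseteq V_m^\star$. Intersecting the above events over values m ∈ N and $k < \tilde{d}_f$ gives the event H′, and as each of the above events has probability 1 and there are countably many such events, a union bound implies P(H′) = 1.

Note that one specific implication of Lemma 35, obtained by taking k = 0, is that on H′, $V_m^\star \neq \emptyset$ (even if f ∈ cl(C) \ C). This is because, for f ∈ cl(C), we have $\partial_C^0 f = X^0$ so that $P^0(\partial_C^0 f) = 1$, which means $P^0(\partial_{V_m^\star}^0 f) = 1$ (on H′), so that we must have $\partial_{V_m^\star}^0 f = X^0$, which implies $V_m^\star \neq \emptyset$. In particular, this also means $f \in \mathrm{cl}(V_m^\star)$.

Lemma 36 There is a monotonic function q(r) = o(1) (as r → 0) such that, on event H′, for any $k \in \{0, \dots, \tilde{d}_f - 1\}$, m ∈ N, r > 0, and set H such that $V_m^\star \subseteq H \subseteq B(f, r)$,

$P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(H) \right) \le q(r)$.

In particular, for τ ∈ N and δ > 0, on $H_\tau(\delta) \cap H'$ (where $H_\tau(\delta)$ is from Lemma 29), every m ≥ τ and $k \in \{0, \dots, \tilde{d}_f - 1\}$ has $P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(V_m^\star) \right) \le q(\phi(\tau; \delta))$.

Proof Fix any $k \in \{0, \dots, \tilde{d}_f - 1\}$. By Lemma 35, we know that on event H′,

$P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(H) \right) = \frac{P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(H) \right)}{P^k\left( \mathcal{S}^k(H) \right)} \le \frac{P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(H) \right)}{P^k\left( \partial_H^k f \right)} = \frac{P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(H) \right)}{P^k\left( \partial_C^k f \right)} \le \frac{P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r)) \right)}{P^k\left( \partial_C^k f \right)}$.

Define $q_k(r)$ as this latter quantity. Since $P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r)) \right)$ is monotonic in r,

$\lim_{r \to 0} \frac{P^k\left( \bar{\partial}_C^k f \cap \mathcal{S}^k(B(f, r)) \right)}{P^k\left( \partial_C^k f \right)} = \frac{P^k\left( \bar{\partial}_C^k f \cap \lim_{r \to 0} \mathcal{S}^k(B(f, r)) \right)}{P^k\left( \partial_C^k f \right)} = \frac{P^k\left( \bar{\partial}_C^k f \cap \partial_C^k f \right)}{P^k\left( \partial_C^k f \right)} = 0$.

This proves $q_k(r) = o(1)$. Defining $q(r) = \max\{ q_k(r) : k \in \{0, 1, \dots, \tilde{d}_f - 1\} \} = o(1)$ completes the proof of the first claim.

For the final claim, simply recall that by Lemma 29, on $H_\tau(\delta)$, every m ≥ τ has $V_m^\star \subseteq V_\tau^\star \subseteq B(f, \phi(\tau; \delta))$.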
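Lemma 37 For ζ ∈ (0, 1), define $r_\zeta = \sup\{ r \in (0, 1) : q(r) < \zeta \} / 2$. On H′, $\forall k \in \{0, \dots, \tilde{d}_f - 1\}$, ∀ζ ∈ (0, 1), ∀m ∈ N, for any set H such that $V_m^\star \subseteq H \subseteq B(f, r_\zeta)$,

$P\left( x : P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \right) > \zeta \right) = P\left( x : P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f \right) > \zeta \right) = 0$.   (16)

In particular, for δ ∈ (0, 1), defining $\tau(\zeta; \delta) = \min\{ \tau \in \mathbb{N} : \sup_{m \ge \tau} \phi(m; \delta) \le r_\zeta \}$, ∀τ ≥ τ(ζ; δ), and ∀m ≥ τ, on $H_\tau(\delta) \cap H'$, (16) holds for $H = V_m^\star$.

Proof Fix k, m, H as described above, and suppose $q = P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(H) \right) < \zeta$; by Lemma 36, this happens on H′. Since $\partial_H^k f \subseteq \mathcal{S}^k(H)$, we have that ∀x ∈ X,

$P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \right) = P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f \right) P^k\left( \partial_H^k f \mid \mathcal{S}^k(H) \right) + P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \cap \bar{\partial}_H^k f \right) P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \right)$.

Since all probability values are bounded by 1, we have

$P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \right) \le P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f \right) + P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \right)$.   (17)

Isolating the right-most term in (17), by basic properties of probabilities we have

$P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \right) = P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \bar{\partial}_C^k f \right) P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(H) \right) + P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f \right) P^k\left( \partial_C^k f \mid \mathcal{S}^k(H) \right) \le P^k\left( \bar{\partial}_C^k f \mid \mathcal{S}^k(H) \right) + P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f \right)$.   (18)

By assumption, the left term in (18) equals q. Examining the right term in (18), we see that

$P^k\left( \bar{\partial}_H^k f \mid \mathcal{S}^k(H) \cap \partial_C^k f \right) = P^k\left( \mathcal{S}^k(H) \cap \bar{\partial}_H^k f \mid \partial_C^k f \right) / P^k\left( \mathcal{S}^k(H) \mid \partial_C^k f \right) \le P^k\left( \bar{\partial}_H^k f \mid \partial_C^k f \right) / P^k\left( \partial_H^k f \mid \partial_C^k f \right)$.   (19)

By Lemma 35, on H′ the denominator in (19) is 1 and the numerator is 0. Thus, combining this fact with (17) and (18), we have that on H′,

$P\left( x : P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \mathcal{S}^k(H) \right) > \zeta \right) \le P\left( x : P^k\left( \bar{\mathcal{S}}^k(H[(x, f(x))]) \mid \partial_H^k f \right) > \zeta - q \right)$.   (20)

Note that proving the right side of (20) equals zero will suffice to establish the result, since it upper bounds both the first expression of (16) (as just established) and the second expression of (16) (by monotonicity of measures).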
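Letting X ∼ P be independent from the other random variables $(Z, W_1, W_2)$, by Markov's inequality, the right side of (20) is at most

$\frac{1}{\zeta - q} E\left[ P^k\left( \bar{\mathcal{S}}^k(H[(X, f(X))]) \mid \partial_H^k f \right) \,\middle|\, H \right] = \frac{E\left[ P^k\left( \bar{\mathcal{S}}^k(H[(X, f(X))]) \cap \partial_H^k f \right) \,\middle|\, H \right]}{(\zeta - q) P^k\left( \partial_H^k f \right)}$,

and by Fubini's theorem, this is (letting $S \sim P^k$ be independent from the other random variables)

$\frac{E\left[ \mathbb{1}_{\partial_H^k f}(S)\, P\left( x : S \notin \mathcal{S}^k(H[(x, f(x))]) \right) \,\middle|\, H \right]}{(\zeta - q) P^k\left( \partial_H^k f \right)}$.

Lemma 35 implies this equals

$\frac{E\left[ \mathbb{1}_{\partial_H^k f}(S)\, P\left( x : S \notin \mathcal{S}^k(H[(x, f(x))]) \right) \,\middle|\, H \right]}{(\zeta - q) P^k\left( \partial_C^k f \right)}$.   (21)

For any fixed $S \in \partial_H^k f$, there is an infinite sequence of sets $\{\{h_1^{(i)}, h_2^{(i)}, \dots, h_{2^k}^{(i)}\}\}_{i \in \mathbb{N}}$ with $\forall j \le 2^k$, $P\left( x : h_j^{(i)}(x) \neq f(x) \right) \downarrow 0$, such that each $\{h_1^{(i)}, \dots, h_{2^k}^{(i)}\} \subseteq H$ and shatters S. If H[(x, f(x))] does not shatter S, then

$1 = \inf_i \mathbb{1}\left[ \exists j : h_j^{(i)} \notin H[(x, f(x))] \right] = \inf_i \mathbb{1}\left[ \exists j : h_j^{(i)}(x) \neq f(x) \right]$.

In particular,

$P\left( x : S \notin \mathcal{S}^k(H[(x, f(x))]) \right) \le P\left( x : \inf_i \mathbb{1}\left[ \exists j : h_j^{(i)}(x) \neq f(x) \right] = 1 \right) = P\left( \bigcap_i \left\{ x : \exists j : h_j^{(i)}(x) \neq f(x) \right\} \right) \le \inf_i P\left( x : \exists j \text{ s.t. } h_j^{(i)}(x) \neq f(x) \right) \le \lim_{i \to \infty} \sum_{j \le 2^k} P\left( x : h_j^{(i)}(x) \neq f(x) \right) = \sum_{j \le 2^k} \lim_{i \to \infty} P\left( x : h_j^{(i)}(x) \neq f(x) \right) = 0$.

Thus (21) is zero, which establishes the result.

The final claim is then implied by Lemma 29 and monotonicity of $V_m^\star$ in m: that is, on $H_\tau(\delta)$, $V_m^\star \subseteq V_\tau^\star \subseteq B(f, \phi(\tau; \delta)) \subseteq B(f, r_\zeta)$.

Lemma 38 For any ζ ∈ (0, 1), there are values $\{ \Delta_n^{(\zeta)}(\varepsilon) : n \in \mathbb{N}, \varepsilon \in (0, 1) \}$ such that, for any n ∈ N and ε > 0, on event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, letting $V = V_{\lfloor n/3 \rfloor}^\star$,

$P\left( x : P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right) \ge \zeta \right) \le \Delta_n^{(\zeta)}(\varepsilon)$,

and for any N-valued N(ε) = ω(log(1/ε)), $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon) = o(1)$.

Proof Throughout, we suppose the event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, and fix some ζ ∈ (0, 1).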
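We have ∀x,

$P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right)$
$= P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f \right) P^{\tilde{d}_f - 1}\left( \partial_C^{\tilde{d}_f - 1} f \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right)$
$\quad + P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \bar{\partial}_C^{\tilde{d}_f - 1} f \right) P^{\tilde{d}_f - 1}\left( \bar{\partial}_C^{\tilde{d}_f - 1} f \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right)$
$\le P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f \right) + P^{\tilde{d}_f - 1}\left( \bar{\partial}_C^{\tilde{d}_f - 1} f \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right)$.   (22)

By Lemma 35, the left term in (22) equals

$P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial_C^{\tilde{d}_f - 1} f \right) P^{\tilde{d}_f - 1}\left( \mathcal{S}^{\tilde{d}_f - 1}(V) \mid \partial_C^{\tilde{d}_f - 1} f \right) = P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \partial_C^{\tilde{d}_f - 1} f \right)$,

and by Lemma 36, the right term in (22) is at most q(φ(⌊n/3⌋; ε/2)). Thus, we have

$P\left( x : P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \mathcal{S}^{\tilde{d}_f - 1}(V) \right) \ge \zeta \right) \le P\left( x : P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \partial_C^{\tilde{d}_f - 1} f \right) \ge \zeta - q(\phi(\lfloor n/3 \rfloor; \varepsilon/2)) \right)$.   (23)

For n < 3τ(ζ/2; ε/2) (for τ(·; ·) defined in Lemma 37), we define $\Delta_n^{(\zeta)}(\varepsilon) = 1$. Otherwise, suppose n ≥ 3τ(ζ/2; ε/2), so that q(φ(⌊n/3⌋; ε/2)) < ζ/2, and thus (23) is at most

$P\left( x : P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \mid \partial_C^{\tilde{d}_f - 1} f \right) \ge \zeta/2 \right)$.

By Lemma 29, this is at most

$P\left( x : P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \mid \partial_C^{\tilde{d}_f - 1} f \right) \ge \zeta/2 \right)$.

Letting X ∼ P, by Markov's inequality this is at most

$\frac{2}{\zeta} E\left[ P^{\tilde{d}_f - 1}\left( S \in X^{\tilde{d}_f - 1} : S \cup \{X\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \mid \partial_C^{\tilde{d}_f - 1} f \right) \right]$
$= \frac{2}{\zeta \tilde{\delta}_f} P^{\tilde{d}_f}\left( S \cup \{x\} \in X^{\tilde{d}_f} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \text{ and } S \in \partial_C^{\tilde{d}_f - 1} f \right)$
$\le \frac{2}{\zeta \tilde{\delta}_f} P^{\tilde{d}_f}\left( \mathcal{S}^{\tilde{d}_f}(B(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \right)$.   (24)

Thus, defining $\Delta_n^{(\zeta)}(\varepsilon)$ as (24) for n ≥ 3τ(ζ/2; ε/2) establishes the first claim.

It remains only to prove the second claim. Let N(ε) = ω(log(1/ε)).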
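Since $\tau(\zeta/2;\varepsilon/2) \le \frac{4}{r_{\zeta/2}}\left(d\ln\frac{4e}{r_{\zeta/2}} + \ln\frac{4}{\varepsilon}\right) = O(\log(1/\varepsilon))$, we have that for all sufficiently small $\varepsilon > 0$, $N(\varepsilon) \ge 3\tau(\zeta/2;\varepsilon/2)$, so that $\Delta^{(\zeta)}_{N(\varepsilon)}(\varepsilon)$ equals (24) (with $n = N(\varepsilon)$). Furthermore, since $\tilde\delta_f > 0$, while $P^{\tilde d_f}\big(\partial_C^{\tilde d_f} f\big) = 0$, and $\phi(\lfloor N(\varepsilon)/3\rfloor;\varepsilon/2) = o(1)$, by continuity of probability measures we know (24) is $o(1)$ when $n = N(\varepsilon)$, so that we generally have $\Delta^{(\zeta)}_{N(\varepsilon)}(\varepsilon) = o(1)$.

For any $m \in \mathbb{N}$, define $\tilde M(m) = m^3\tilde\delta_f/2$.

Lemma 39 There is a $(C, P, f)$-dependent constant $c^{(i)} \in (0,\infty)$ such that, for any $\tau \in \mathbb{N}$ there is an event $H_\tau^{(i)} \subseteq H'$ with
$$P\big(H_\tau^{(i)}\big) \ge 1 - c^{(i)}\cdot\exp\big\{-\tilde M(\tau)/4\big\}$$
such that on $H_\tau^{(i)}$, if $\tilde d_f \ge 2$, then $\forall k \in \{2,\ldots,\tilde d_f\}$, $\forall m \ge \tau$, $\forall \ell \in \mathbb{N}$, for any set $H$ such that $V_\ell^\star \subseteq H \subseteq C$,
$$M_m^{(k)}(H) \ge \tilde M(m).$$

Proof On $H'$, Lemma 35 implies every $\mathbb{1}_{\mathcal{S}^{k-1}(H)}\big(S_i^{(k)}\big) \ge \mathbb{1}_{\partial_H^{k-1} f}\big(S_i^{(k)}\big) = \mathbb{1}_{\partial_C^{k-1} f}\big(S_i^{(k)}\big)$, so we focus on showing $\big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| \ge \tilde M(m)$ on an appropriate event. We know
$$P\Big(\forall k \in \{2,\ldots,\tilde d_f\}, \forall m \ge \tau, \big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| \ge \tilde M(m)\Big)$$
$$= 1 - P\Big(\exists k \in \{2,\ldots,\tilde d_f\}, m \ge \tau : \big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| < \tilde M(m)\Big)$$
$$\ge 1 - \sum_{m \ge \tau}\sum_{k=2}^{\tilde d_f} P\Big(\big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| < \tilde M(m)\Big),$$
where the last line follows by a union bound. Thus, we will focus on bounding
$$\sum_{m \ge \tau}\sum_{k=2}^{\tilde d_f} P\Big(\big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| < \tilde M(m)\Big). \qquad (25)$$
Fix any $k \in \{2,\ldots,\tilde d_f\}$, and integer $m \ge \tau$. Since
$$E\Big[\big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big|\Big] = P^{k-1}\big(\partial_C^{k-1} f\big)\, m^3 \ge \tilde\delta_f m^3,$$
a Chernoff bound implies that
$$P\Big(\big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| < \tilde M(m)\Big) \le \exp\big\{-m^3 P^{k-1}\big(\partial_C^{k-1} f\big)/8\big\} \le \exp\big\{-m^3\tilde\delta_f/8\big\}.$$
Thus, we have that (25) is at most
$$\sum_{m \ge \tau}\sum_{k=2}^{\tilde d_f} \exp\big\{-m^3\tilde\delta_f/8\big\} \le \sum_{m \ge \tau} \tilde d_f\cdot\exp\big\{-m^3\tilde\delta_f/8\big\} \le \sum_{m \ge \tau^3} \tilde d_f\cdot\exp\big\{-m\tilde\delta_f/8\big\}$$
$$\le \tilde d_f\cdot\exp\big\{-\tilde M(\tau)/4\big\} + \tilde d_f\cdot\int_{\tau^3}^{\infty}\exp\big\{-x\tilde\delta_f/8\big\}\,dx = \tilde d_f\cdot\big(1 + 8/\tilde\delta_f\big)\cdot\exp\big\{-\tilde M(\tau)/4\big\} \le \big(9\tilde d_f/\tilde\delta_f\big)\cdot\exp\big\{-\tilde M(\tau)/4\big\}.$$
Note that since $P(H') = 1$, defining
$$H_\tau^{(i)} = \Big\{\forall k \in \{2,\ldots,\tilde d_f\}, \forall m \ge \tau, \big|\{S_i^{(k)} : i \le m^3\} \cap \partial_C^{k-1} f\big| \ge \tilde M(m)\Big\} \cap H'$$
has the required properties.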
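The Chernoff step in the proof of Lemma 39 uses the multiplicative lower-tail bound $P(B < \mu/2) \le \exp\{-\mu/8\}$ for a binomial count $B$ with mean $\mu$. The following is a minimal numerical sanity check of that inequality, not part of the original argument. The function names and the parameter values are assumptions chosen for illustration, with `n` playing the role of $m^3$ and `p` the role of $P^{k-1}(\partial_C^{k-1} f)$.

```python
from math import comb, exp


def binom_lower_tail(n, p, threshold):
    """Exact P(Bin(n, p) < threshold), summed directly from the pmf."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k < threshold
    )


def chernoff_half_mean_bound(n, p):
    """Chernoff bound exp(-n*p/8) on P(Bin(n, p) < n*p/2)."""
    return exp(-n * p / 8)


if __name__ == "__main__":
    # Illustrative values only: n = m^3 for m = 3, 4, 5, 6.
    for n, p in [(27, 0.3), (64, 0.2), (125, 0.5), (216, 0.1)]:
        tail = binom_lower_tail(n, p, n * p / 2)
        bound = chernoff_half_mean_bound(n, p)
        print(f"n={n:4d} p={p:.1f} exact tail={tail:.3e} bound={bound:.3e}")
```

In the lemma the threshold is $\tilde M(m) = m^3\tilde\delta_f/2$, which is at most half the mean because $P^{k-1}(\partial_C^{k-1} f) \ge \tilde\delta_f$, so the half-mean bound applies.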
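Lemma 40 For any $\tau \in \mathbb{N}$, there is an event $G_\tau^{(i)}$ with
$$P\big(H_\tau^{(i)} \setminus G_\tau^{(i)}\big) \le \big(121\tilde d_f/\tilde\delta_f\big)\cdot\exp\big\{-\tilde M(\tau)/60\big\}$$
such that, on $G_\tau^{(i)}$, if $\tilde d_f \ge 2$, then for every integer $s \ge \tau$ and $k \in \{2,\ldots,\tilde d_f\}$, $\forall r \in (0, r_{1/6}]$,
$$M_s^{(k)}(B(f,r)) \le (3/2)\,\big|\{S_i^{(k)} : i \le s^3\} \cap \partial_C^{k-1} f\big|.$$

Proof Fix integers $s \ge \tau$ and $k \in \{2,\ldots,\tilde d_f\}$, and let $r = r_{1/6}$. Define the set $\hat{\mathcal{S}}^{k-1} = \{S_i^{(k)} : i \le s^3\} \cap \mathcal{S}^{k-1}(B(f,r))$. Note $\big|\hat{\mathcal{S}}^{k-1}\big| = M_s^{(k)}(B(f,r))$ and the elements of $\hat{\mathcal{S}}^{k-1}$ are conditionally i.i.d. given $M_s^{(k)}(B(f,r))$, each with conditional distribution equivalent to the conditional distribution of $S_1^{(k)}$ given $\{S_1^{(k)} \in \mathcal{S}^{k-1}(B(f,r))\}$. In particular,
$$E\Big[\big|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\big| \,\Big|\, M_s^{(k)}(B(f,r))\Big] = P^{k-1}\big(\partial_C^{k-1} f \,\big|\, \mathcal{S}^{k-1}(B(f,r))\big)\, M_s^{(k)}(B(f,r)).$$
Define the event
$$G_\tau^{(i)}(k,s) = \Big\{\big|\hat{\mathcal{S}}^{k-1}\big| \le (3/2)\,\big|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\big|\Big\}.$$
By Lemma 36 (indeed by definition of $q(r)$ and $r_{1/6}$) we have
$$1 - P\big(G_\tau^{(i)}(k,s) \,\big|\, M_s^{(k)}(B(f,r))\big) = P\Big(\big|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\big| < (2/3)\, M_s^{(k)}(B(f,r)) \,\Big|\, M_s^{(k)}(B(f,r))\Big)$$
$$\le P\Big(\big|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\big| < (4/5)(1 - q(r))\, M_s^{(k)}(B(f,r)) \,\Big|\, M_s^{(k)}(B(f,r))\Big)$$
$$\le P\Big(\big|\hat{\mathcal{S}}^{k-1} \cap \partial_C^{k-1} f\big| < (4/5)\, P^{k-1}\big(\partial_C^{k-1} f \,\big|\, \mathcal{S}^{k-1}(B(f,r))\big)\, M_s^{(k)}(B(f,r)) \,\Big|\, M_s^{(k)}(B(f,r))\Big). \qquad (26)$$
By a Chernoff bound, (26) is at most
$$\exp\Big\{-M_s^{(k)}(B(f,r))\, P^{k-1}\big(\partial_C^{k-1} f \,\big|\, \mathcal{S}^{k-1}(B(f,r))\big)/50\Big\} \le \exp\Big\{-M_s^{(k)}(B(f,r))\,(1 - q(r))/50\Big\} \le \exp\Big\{-M_s^{(k)}(B(f,r))/60\Big\}.$$
Thus, by Lemma 39,
$$P\big(H_\tau^{(i)} \setminus G_\tau^{(i)}(k,s)\big) \le P\Big(\big\{M_s^{(k)}(B(f,r)) \ge \tilde M(s)\big\} \setminus G_\tau^{(i)}(k,s)\Big)$$
$$= E\Big[\Big(1 - P\big(G_\tau^{(i)}(k,s) \,\big|\, M_s^{(k)}(B(f,r))\big)\Big)\, \mathbb{1}_{[\tilde M(s),\infty)}\big(M_s^{(k)}(B(f,r))\big)\Big]$$
$$\le E\Big[\exp\Big\{-M_s^{(k)}(B(f,r))/60\Big\}\, \mathbb{1}_{[\tilde M(s),\infty)}\big(M_s^{(k)}(B(f,r))\big)\Big] \le \exp\big\{-\tilde M(s)/60\big\}.$$
Now defining $G_\tau^{(i)} = \bigcap_{s \ge \tau}\bigcap_{k=2}^{\tilde d_f} G_\tau^{(i)}(k,s)$, a union bound implies
$$P\big(H_\tau^{(i)} \setminus G_\tau^{(i)}\big) \le \sum_{s \ge \tau} \tilde d_f\cdot\exp\big\{-\tilde M(s)/60\big\} \le \tilde d_f\Big(\exp\big\{-\tilde M(\tau)/60\big\} + \int_{\tau^3}^{\infty}\exp\big\{-x\tilde\delta_f/120\big\}\,dx\Big)$$
$$= \tilde d_f\big(1 + 120/\tilde\delta_f\big)\cdot\exp\big\{-\tilde M(\tau)/60\big\} \le \big(121\tilde d_f/\tilde\delta_f\big)\cdot\exp\big\{-\tilde M(\tau)/60\big\}.$$
This completes the proof for $r = r_{1/6}$. Monotonicity extends the result to any $r \in (0, r_{1/6}]$.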
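The union bounds in Lemmas 39 and 40 both end by comparing a sum over cubes with an integral: $\sum_{s \ge \tau} e^{-c s^3} \le e^{-c\tau^3} + \int_{\tau^3}^{\infty} e^{-cx}\,dx = (1 + 1/c)\,e^{-c\tau^3}$, with $c = \tilde\delta_f/8$ or $c = \tilde\delta_f/120$. The sketch below checks this comparison numerically; the function names, the truncation length `terms`, and the sample value of $\tilde\delta_f$ are assumptions made only for illustration.

```python
from math import exp


def cubic_tail_sum(tau, c, terms=2000):
    """Truncated sum of exp(-c * s^3) over s = tau, tau + 1, ..."""
    return sum(exp(-c * s**3) for s in range(tau, tau + terms))


def integral_comparison_bound(tau, c):
    """Closed form (1 + 1/c) * exp(-c * tau^3) from the integral comparison."""
    return (1 + 1 / c) * exp(-c * tau**3)


if __name__ == "__main__":
    delta = 0.5  # hypothetical value standing in for the constant delta_f
    for tau in (1, 2, 5):
        c = delta / 120
        print(tau, cubic_tail_sum(tau, c), integral_comparison_bound(tau, c))
```

The bound is loose for small $\tau$, since the sum over cubes skips most of the integers that the integral accounts for, but only the exponential rate matters for the lemmas.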
Lemma 41 There exist (C, P, f , γ )-dependent constants τ ∗ ∈ N and c(ii) ∈ (0, ∞) such that, for any (ii) (i) integer τ ≥ τ ∗ , there is an event Hτ ⊆ Gτ with (i) (ii) P Hτ \ Hτ ˜ ≤ c(ii) · exp −M(τ )1/3 /60 (27) (i) (ii) such that, on Hτ ∩ Hτ , ∀s, m, ℓ, k ∈ N with ℓ < m and k ≤ d˜f , for any set of classifiers H with ⋆ ⊆ H, if either k = 1, or s ≥ τ and H ⊆ B( f , r Vℓ (1−γ )/6 ), then ˆ (k) ˆ (k) ˆ (k) ∆s (Xm ,W2 , H) < γ =⇒ Γs (Xm , − f (Xm ),W2 , H) < Γs (Xm , f (Xm ),W2 , H) . (i) (ii) In particular, for δ ∈ (0, 1) and τ ≥ max{τ ((1 − γ )/6; δ ), τ ∗ }, on Hτ (δ ) ∩ Hτ ∩ Hτ , this is true for H = Vℓ⋆ for every k, ℓ, m, s ∈ N satisfying τ ≤ ℓ < m, τ ≤ s, and k ≤ d˜f . ˜ Proof Let τ ∗ = (6/(1− γ ))· 2/δ f 1/3 , and consider any τ , k, ℓ, m, s, H as described above. If k = 1, (i) ⋆ the result clearly holds. In particular, Lemma 35 implies that on Hτ , H[(Xm , f (Xm ))] ⊇ Vm = ∅, so that some h ∈ H has h(Xm ) = f (Xm ), and therefore ˆ (1) Γs (Xm , − f (Xm ),W2 , H) = ½ {h(Xm )} (− f (Xm )) = 0, h∈H ˆ (1) ˆ (1) / and since ∆s (Xm ,W2 , H) = ½DIS(H) (Xm ), if ∆s (Xm ,W2 , H) < γ , then since γ < 1 we have Xm ∈ DIS(H), so that ˆ (1) Γs (Xm , f (Xm ),W2 , H) = ½ 1544 {h(Xm )} ( f (Xm )) h∈H = 1. ACTIVIZED L EARNING (i) (i) Otherwise, suppose 2 ≤ k ≤ d˜f . Note that on Hτ ∩ Gτ , ∀m ∈ N, and any H with Vℓ⋆ ⊆ H ⊆ B( f , r(1−γ )/6 ) for some ℓ ∈ N, ˆ (k) Γs (Xm , − f (Xm ),W2 , H) = (k) Ms (H) i=1 ½S¯k−1 (H[(Xm , f (Xm ))]) Si(k) ½S k−1 (H) Si(k) s3 1 ≤ (k) Si :i≤ s3 k−1 ∩ ∂H f (k) Si :i≤ s3 k−1 ∩ ∂H f = (k) Si :i≤ s3 i=1 s3 1 k−1 ∩ ∂C f s3 3 (k) 2Ms (B( f , r(1−γ )/6 )) i=1 ½S¯k−1 (Vm ) Si(k) ½S k−1 (B( f ,r(1−γ )/6 )) Si(k) ⋆ (monotonicity) ½∂ k−1 f Si(k) ½S k−1 (B( f ,r(1−γ )/6 )) Si(k) ¯ ⋆ (monotonicity) ½∂C f Si(k) ½S k−1 (B( f ,r(1−γ )/6 )) Si(k) ¯ k−1 (Lemma 35) i=1 s3 1 ≤ ≤ s3 1 Vm i=1 ½∂C f Si(k) ½S k−1 (B( f ,r(1−γ )/6 )) Si(k) . ¯ k−1 (Lemma 40) (k) ˆ For brevity, let Γ denote this last quantity, and let Mks = Ms B f , r(1−γ )/6 . 
By Hoeffding’s inequality, we have ¯ k−1 ˆ P (2/3)Γ > P k−1 ∂C f S k−1 B f , r(1−γ )/6 −1/3 + Mks 1/3 ≤ exp −2Mks Mks . Thus, by Lemmas 36, 39, and 40, P (i) (i) ˆ (k) ˜ (2/3)Γs (Xm , − f (Xm ),W2 , H) > q r(1−γ )/6 + M(s)−1/3 ∩ Hτ ∩ Gτ (i) ≤P ¯ k−1 ˆ (2/3)Γ > P k−1 ∂C f S k−1 B f , r(1−γ )/6 ˜ + M(s)−1/3 ∩ Hτ ≤P ¯ k−1 ˆ (2/3)Γ > P k−1 ∂C f S k−1 B f , r(1−γ )/6 + Mks ¯ k−1 ˆ = E P (2/3)Γ > P k−1 ∂C f S k−1 B f , r(1−γ )/6 1/3 ≤ E exp −2Mks −1/3 ˜ ∩ {Mks ≥ M(s)} −1/3 + Mks Mks ½[M(s),∞) (Mks ) ˜ ˜ ½[M(s),∞) (Mks ) ≤ exp −2M(s)1/3 . ˜ (i) (ii) (ii) (ii) ˜ Thus, there is an event Hτ (k, s) with P Hτ ∩ Gτ \ Hτ (k, s) ≤ exp −2M(s)1/3 such that ˆ (k) ˜ Γs (Xm , − f (Xm ),W2 , H) ≤ (3/2) q r(1−γ )/6 + M(s)−1/3 holds for these particular values of k and s. 1545 H ANNEKE (ii) (i) To extend to the full range of values, we simply take Hτ = Gτ ∩ ˜ ˜ τ ≥ (2/δ f )1/3 , we have M(τ ) ≥ 1, so a union bound implies (i) (i) (ii) P Hτ ∩ Gτ \ Hτ ≤ s≥τ s≥τ k≤d˜f (ii) Hτ (k, s). Since ˜ d˜f · exp −2M(s)1/3 ∞ ˜ ≤ d˜f · exp −2M(τ )1/3 + τ ˜ exp −2M(x)1/3 dx ˜ −1/3 · exp −2M(τ )1/3 ≤ 2d˜f δ −1/3 · exp −2M(τ )1/3 . ˜ ˜ ˜ = d˜f 1 + 2−2/3 δ f f Then Lemma 40 and a union bound imply (i) (ii) P Hτ \ Hτ ˜ ˜ −1/3 · exp −2M(τ )1/3 + 121d˜f δ −1 · exp −M(τ )/60 ˜ ˜ ≤ 2d˜f δ f f ˜ ˜ ≤ 123d˜f δ f−1 · exp −M(τ )1/3 /60 . (i) (ii) On Hτ ∩ Hτ , every such s, m, ℓ, k and H satisfy ˆ (k) ˜ Γs (Xm , − f (Xm ),W2 , H) ≤ (3/2) q(r(1−γ )/6 ) + M(s)−1/3 < (3/2) ((1 − γ )/6 + (1 − γ )/6) = (1 − γ )/2, (28) where the second inequality follows by definition of r(1−γ )/6 and s ≥ τ ≥ τ ∗ . ˆ (k) If ∆s (Xm ,W2 , H) < γ , then 1 ˆ (k) 1 − γ < 1 − ∆s (Xm ,W2 , H) = s3 (k) Ms (H) i=1 ½S k−1 (H) Si(k) ½S¯k (H) Si(k) ∪ {Xm } . 
Finally, noting that we always have ½S¯k (H) Si(k) ∪ {Xm } ≤ ½S¯k−1 (H[(Xm , f (Xm ))]) Si(k) + ½S¯k−1 (H[(Xm ,− f (Xm ))]) Si(k) , (i) (ii) (k) ˆ we have that, on the event Hτ ∩ Hτ , if ∆s (Xm ,W2 , H) < γ , then ˆ (k) Γs (Xm , − f (Xm ),W2 , H) < (1 − γ )/2 = −(1 − γ )/2 + (1 − γ ) < −(1 − γ )/2 + ≤ −(1 − γ )/2 + s3 (k) Ms (H) i=1 1 s3 (k) Ms (H) i=1 1 s3 ½S k−1 (H) Si(k) ½S¯k (H) Si(k) ∪ {Xm } by (29) ½S k−1 (H) Si(k) ½S¯k−1 (H[(Xm , f (Xm ))]) Si(k) ½ ½ (k) (k) ¯ S k−1 (H) Si S k−1 (H[(Xm ,− f (Xm ))]) Si (k) Ms (H) i=1 ˆ (k) ˆ (k) −(1 − γ )/2 + Γs (Xm , − f (Xm ),W2 , H) + Γs (Xm , f (Xm ),W2 , H) + = 1 by (28) (k) ˆ < Γs (Xm , f (Xm ),W2 , H) . by (28) 1546 (29) ACTIVIZED L EARNING The final claim in the lemma statement is then implied by Lemma 29, since we have Vℓ⋆ ⊆ Vτ⋆ ⊆ B ( f , φ (τ ; δ )) ⊆ B f , r(1−γ )/6 on Hτ (δ ). For any k, ℓ, m ∈ N, and any x ∈ X , define (k) ˆ px (k, ℓ, m) = ∆m (x,W2 ,Vℓ⋆ ) ˆ px (k, ℓ) = P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) S k−1 (Vℓ⋆ ) . Lemma 42 For any ζ ∈ (0, 1), there is a (C, P, f , ζ )-dependent constant c(iii) (ζ ) ∈ (0, ∞) such (iii) that, for any τ ∈ N, there is an event Hτ (ζ ) with (i) (iii) ˜ P Hτ \ Hτ (ζ ) ≤ c(iii) (ζ ) · exp −ζ 2 M(τ ) (iii) (i) such that on Hτ ∩ Hτ (ζ ), ∀k, ℓ, m ∈ N with τ ≤ ℓ ≤ m and k ≤ d˜f , for any x ∈ X , ˜ P (x : |px (k, ℓ) − px (k, ℓ, m)| > ζ ) ≤ exp −ζ 2 M(m) . ˆ Proof Fix any k, ℓ, m ∈ N with τ ≤ ℓ ≤ m and k ≤ d˜f . Recall our convention that X 0 = {∅} and P 0 X 0 = 1; thus, if k = 1, px (k, ℓ, m) = ½DIS(V ⋆ ) (x) = ½S 1 (V ⋆ ) (x) = px (k, ℓ), so the result clearly ˆ ℓ ℓ holds for k = 1. (k) For the remaining case, suppose 2 ≤ k ≤ d˜f . To simplify notation, let m = Mm (Vℓ⋆ ), X = Xℓ+1 , ˜ px = px (k, ℓ) and px = px (k, ℓ, m). Consider the event ˆ ˆ ˜ ˆ H (iii) (k, ℓ, m, ζ ) = P (x : |px − px | > ζ ) ≤ exp −ζ 2 M(m) . 
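We have
$$P\big(H_\tau^{(i)} \setminus H^{(iii)}(k,\ell,m,\zeta) \,\big|\, V_\ell^\star\big) \qquad (30)$$
$$\le P\big(\{\tilde m \ge \tilde M(m)\} \setminus H^{(iii)}(k,\ell,m,\zeta) \,\big|\, V_\ell^\star\big) \qquad \text{(by Lemma 39)}$$
$$= P\Big(\{\tilde m \ge \tilde M(m)\} \cap \Big\{P\big(e^{s\tilde m|p_X - \hat p_X|} > e^{s\tilde m\zeta} \,\big|\, W_2, V_\ell^\star\big) > e^{-\zeta^2\tilde M(m)}\Big\} \,\Big|\, V_\ell^\star\Big), \qquad (31)$$
for any value $s > 0$. Proceeding as in Chernoff's bounding technique, by Markov's inequality (31) is at most
$$P\Big(\{\tilde m \ge \tilde M(m)\} \cap \Big\{e^{-s\tilde m\zeta}\, E\big[e^{s\tilde m|p_X - \hat p_X|} \,\big|\, W_2, V_\ell^\star\big] > e^{-\zeta^2\tilde M(m)}\Big\} \,\Big|\, V_\ell^\star\Big)$$
$$\le P\Big(\{\tilde m \ge \tilde M(m)\} \cap \Big\{e^{-s\tilde m\zeta}\, E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, W_2, V_\ell^\star\big] > e^{-\zeta^2\tilde M(m)}\Big\} \,\Big|\, V_\ell^\star\Big)$$
$$= E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, P\Big(e^{-s\tilde m\zeta}\, E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, W_2, V_\ell^\star\big] > e^{-\zeta^2\tilde M(m)} \,\Big|\, \tilde m, V_\ell^\star\Big) \,\Big|\, V_\ell^\star\Big].$$
By Markov's inequality, this is at most
$$E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2\tilde M(m)}\, E\Big[e^{-s\tilde m\zeta}\, E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, W_2, V_\ell^\star\big] \,\Big|\, \tilde m, V_\ell^\star\Big] \,\Big|\, V_\ell^\star\Big]$$
$$= E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2\tilde M(m)}\, e^{-s\tilde m\zeta}\, E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, \tilde m, V_\ell^\star\big] \,\Big|\, V_\ell^\star\Big]$$
$$= E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2\tilde M(m)}\, e^{-s\tilde m\zeta}\, E\Big[E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, X, \tilde m, V_\ell^\star\big] \,\Big|\, \tilde m, V_\ell^\star\Big] \,\Big|\, V_\ell^\star\Big]. \qquad (32)$$
The conditional distribution of $\tilde m\hat p_X$ given $(X, \tilde m, V_\ell^\star)$ is $\mathrm{Binomial}(\tilde m, p_X)$, so letting $\{B_j(p_X)\}_{j=1}^{\infty}$ denote a sequence of random variables, conditionally independent given $(X, \tilde m, V_\ell^\star)$, with the conditional distribution of each $B_j(p_X)$ being $\mathrm{Bernoulli}(p_X)$ given $(X, \tilde m, V_\ell^\star)$, we have
$$E\big[e^{s\tilde m(p_X - \hat p_X)} + e^{s\tilde m(\hat p_X - p_X)} \,\big|\, X, \tilde m, V_\ell^\star\big] = E\big[e^{s\tilde m(p_X - \hat p_X)} \,\big|\, X, \tilde m, V_\ell^\star\big] + E\big[e^{s\tilde m(\hat p_X - p_X)} \,\big|\, X, \tilde m, V_\ell^\star\big]$$
$$= E\Big[\prod_{i=1}^{\tilde m} e^{s(p_X - B_i(p_X))} \,\Big|\, X, \tilde m, V_\ell^\star\Big] + E\Big[\prod_{i=1}^{\tilde m} e^{s(B_i(p_X) - p_X)} \,\Big|\, X, \tilde m, V_\ell^\star\Big]$$
$$= E\big[e^{s(p_X - B_1(p_X))} \,\big|\, X, \tilde m, V_\ell^\star\big]^{\tilde m} + E\big[e^{s(B_1(p_X) - p_X)} \,\big|\, X, \tilde m, V_\ell^\star\big]^{\tilde m}. \qquad (33)$$
It is known that for $B \sim \mathrm{Bernoulli}(p)$, $E\big[e^{s(B-p)}\big]$ and $E\big[e^{s(p-B)}\big]$ are at most $e^{s^2/8}$ (see, e.g., Lemma 8.1 of Devroye, Györfi, and Lugosi, 1996). Thus, taking $s = 4\zeta$, (33) is at most $2e^{2\tilde m\zeta^2}$, and (32) is at most
$$E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, 2e^{\zeta^2\tilde M(m)}\, e^{-4\tilde m\zeta^2}\, e^{2\tilde m\zeta^2} \,\Big|\, V_\ell^\star\Big] = E\Big[\mathbb{1}_{[\tilde M(m),\infty)}(\tilde m)\, 2e^{\zeta^2\tilde M(m)}\, e^{-2\tilde m\zeta^2} \,\Big|\, V_\ell^\star\Big] \le 2\exp\big\{-\zeta^2\tilde M(m)\big\}.$$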
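The moment generating function bound quoted from Devroye, Györfi, and Lugosi can be checked directly, since for $B \sim \mathrm{Bernoulli}(p)$ the quantity $E[e^{s(B-p)}]$ has the closed form $p\,e^{s(1-p)} + (1-p)\,e^{-sp}$. The sketch below evaluates it on a grid and compares it with $e^{s^2/8}$. The function names and grid values are assumptions for illustration; negative values of `s` cover the $E[e^{s(p-B)}]$ case.

```python
from math import exp


def bernoulli_centered_mgf(p, s):
    """Closed form of E[exp(s * (B - p))] for B ~ Bernoulli(p)."""
    return p * exp(s * (1 - p)) + (1 - p) * exp(-s * p)


def hoeffding_mgf_bound(s):
    """The bound exp(s^2 / 8) used in the proof of Lemma 42."""
    return exp(s * s / 8)


if __name__ == "__main__":
    worst_ratio = 0.0
    for i in range(101):
        p = i / 100
        for j in range(-50, 51):
            s = j / 10
            ratio = bernoulli_centered_mgf(p, s) / hoeffding_mgf_bound(s)
            worst_ratio = max(worst_ratio, ratio)
    # The ratio should never exceed 1 (up to floating-point rounding).
    print("largest mgf/bound ratio on the grid:", worst_ratio)
```

With $s = 4\zeta$ the bound reads $e^{2\zeta^2}$ per Bernoulli factor, which is where the $2e^{2\tilde m\zeta^2}$ bound on (33) comes from.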
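Since this bound holds for (30), the law of total probability implies
$$P\big(H_\tau^{(i)} \setminus H^{(iii)}(k,\ell,m,\zeta)\big) = E\Big[P\big(H_\tau^{(i)} \setminus H^{(iii)}(k,\ell,m,\zeta) \,\big|\, V_\ell^\star\big)\Big] \le 2\cdot\exp\big\{-\zeta^2\tilde M(m)\big\}.$$
Defining $H_\tau^{(iii)}(\zeta) = \bigcap_{\ell \ge \tau}\bigcap_{m \ge \ell}\bigcap_{k=2}^{\tilde d_f} H^{(iii)}(k,\ell,m,\zeta)$, we have the required property for the claimed ranges of $k$, $\ell$ and $m$, and a union bound implies
$$P\big(H_\tau^{(i)} \setminus H_\tau^{(iii)}(\zeta)\big) \le \sum_{\ell \ge \tau}\sum_{m \ge \ell} 2\tilde d_f\cdot\exp\big\{-\zeta^2\tilde M(m)\big\} \le 2\tilde d_f\cdot\sum_{\ell \ge \tau}\Big(\exp\big\{-\zeta^2\tilde M(\ell)\big\} + \int_{\ell^3}^{\infty}\exp\big\{-x\zeta^2\tilde\delta_f/2\big\}\,dx\Big)$$
$$= 2\tilde d_f\cdot\sum_{\ell \ge \tau}\big(1 + 2\zeta^{-2}\tilde\delta_f^{-1}\big)\cdot\exp\big\{-\zeta^2\tilde M(\ell)\big\} \le 2\tilde d_f\cdot\big(1 + 2\zeta^{-2}\tilde\delta_f^{-1}\big)\cdot\Big(\exp\big\{-\zeta^2\tilde M(\tau)\big\} + \int_{\tau^3}^{\infty}\exp\big\{-x\zeta^2\tilde\delta_f/2\big\}\,dx\Big)$$
$$= 2\tilde d_f\cdot\big(1 + 2\zeta^{-2}\tilde\delta_f^{-1}\big)^2\cdot\exp\big\{-\zeta^2\tilde M(\tau)\big\} \le 18\tilde d_f\zeta^{-4}\tilde\delta_f^{-2}\cdot\exp\big\{-\zeta^2\tilde M(\tau)\big\}.$$

For $k, \ell, m \in \mathbb{N}$ and $\zeta \in (0,1)$, define
$$\bar p_\zeta(k,\ell,m) = P\big(x : \hat p_x(k,\ell,m) \ge \zeta\big). \qquad (34)$$

Lemma 43 For any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \big(0, 1 - \sqrt{\alpha}\big]$, and integer $\tau \ge \tau(\beta;\delta)$, on $H_\tau(\delta) \cap H_\tau^{(i)} \cap H_\tau^{(iii)}(\beta\zeta)$, for any $k, \ell, \ell', m \in \mathbb{N}$ with $\tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$,
$$\bar p_\zeta(k,\ell',m) \le P\big(x : p_x(k,\ell) \ge \alpha\zeta\big) + \exp\big\{-\beta^2\zeta^2\tilde M(m)\big\}. \qquad (35)$$

Proof Fix any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \big(0, 1 - \sqrt{\alpha}\big]$, $\tau, k, \ell, \ell', m \in \mathbb{N}$ with $\tau(\beta;\delta) \le \tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$. If $k = 1$, the result clearly holds. In particular, we have
$$\bar p_\zeta(1,\ell',m) = P\big(\mathrm{DIS}(V_{\ell'}^\star)\big) \le P\big(\mathrm{DIS}(V_\ell^\star)\big) = P\big(x : p_x(1,\ell) \ge \alpha\zeta\big).$$
Otherwise, suppose $2 \le k \le \tilde d_f$. By a union bound,
$$\bar p_\zeta(k,\ell',m) = P\big(x : \hat p_x(k,\ell',m) \ge \zeta\big) \le P\big(x : p_x(k,\ell') \ge \sqrt{\alpha}\zeta\big) + P\big(x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > (1 - \sqrt{\alpha})\zeta\big). \qquad (36)$$
Since
$$P\big(x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > (1 - \sqrt{\alpha})\zeta\big) \le P\big(x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > \beta\zeta\big),$$
Lemma 42 implies that, on $H_\tau^{(i)} \cap H_\tau^{(iii)}(\beta\zeta)$,
$$P\big(x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > (1 - \sqrt{\alpha})\zeta\big) \le \exp\big\{-\beta^2\zeta^2\tilde M(m)\big\}. \qquad (37)$$
It remains only to examine the first term on the right side of (36).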
For this, if P k−1 S k−1 Vℓ⋆ = 0, ′ then the first term is 0 by our aforementioned convention, and thus (35) holds; otherwise, since ∀x ∈ X , S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ⊆ S k−1 (Vℓ⋆ ) , ′ ′ we have = P x : P k−1 √ αζ = P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) S k−1 (Vℓ⋆ ) ≥ ′ ′ √ S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 S k−1 (Vℓ⋆ ) . ′ ′ P x : px (k, ℓ′ ) ≥ √ αζ (38) (i) By Lemma 35 and monotonicity, on Hτ ⊆ H ′ , (38) is at most √ k−1 P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 ∂C f ′ , and monotonicity implies this is at most P x : P k−1 S ∈ X k−1 : S ∪ {x} ∈ S k (Vℓ⋆ ) ≥ √ k−1 αζ P k−1 ∂C f . (39) (i) By Lemma 36, for τ ≥ τ (β ; δ ), on Hτ (δ ) ∩ Hτ , √ ¯ k−1 P k−1 ∂C f S k−1 (Vℓ⋆ ) ≤ q(φ (τ ; δ )) < β ≤ 1 − α , which implies k−1 k−1 P k−1 ∂C f ≥ P k−1 ∂C f ∩ S k−1 (Vℓ⋆ ) ¯ k−1 = 1 − P k−1 ∂C f S k−1 (Vℓ⋆ ) P k−1 S k−1 (Vℓ⋆ ) ≥ √ k−1 k−1 ⋆ αP S (Vℓ ) . (i) Altogether, for τ ≥ τ (β ; δ ), on Hτ (δ ) ∩ Hτ , (39) is at most P x : P k−1 S ∈ X k−1 : S∪{x} ∈ S k (Vℓ⋆ ) ≥ αζ P k−1 S k−1 (Vℓ⋆ ) = P (x : px (k, ℓ) ≥ αζ ), which, combined with (36) and (37), establishes (35). (iv) Lemma 44 There are events Hτ : τ ∈ N with (iv) P Hτ ≥ 1 − 3d˜f · exp {−2τ } s.t. for any ξ ∈ (0, γ /16], δ ∈ (0, 1), letting τ (iv) (ξ ; δ ) = max τ (4ξ /γ ; δ ), 4 ˜ δf ξ2 ln 4 ˜ δf ξ2 1/3 , (i) (iii) (iv) for any integer τ ≥ τ (iv) (ξ ; δ ), on Hτ (δ ) ∩ Hτ ∩ Hτ (ξ ) ∩ Hτ , ∀k ∈ 1, . . . , d˜f , ∀ℓ ∈ N with ℓ ≥ τ, ˆ (k) ˜ P x : px (k, ℓ) ≥ γ /2 + exp −γ 2 M(ℓ)/256 ≤ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ P (x : px (k, ℓ) ≥ γ /8) + 4ℓ−1 . 1550 (40) (41) ACTIVIZED L EARNING Proof For any k, ℓ ∈ N, by Hoeffding’s inequality and the law of total probability, on an event G(iv) (k, ℓ) with P G(iv) (k, ℓ) ≥ 1 − 2 exp {−2ℓ}, we have ℓ3 pγ /4 (k, ℓ, ℓ) − ℓ ¯ (iv) Define the event Hτ = (iv) 1 − P Hτ −3 i=1 ˆ ½[γ /4,∞) ∆(k) (wi ,W2 ,Vℓ⋆ ) ≤ ℓ−1 . ℓ d˜f (iv) (k, ℓ). k=1 G ℓ≥τ ≤ 2d˜f · ℓ≥τ (42) By a union bound, we have exp {−2ℓ} ≤ 2d˜f · exp {−2τ } + ∞ τ exp {−2x} dx = 3d˜f · exp {−2τ } . 
Now fix any ℓ ≥ τ and k ∈ 1, . . . , d˜f . By a union bound, P (x : px (k, ℓ) ≥ γ /2) ≤ P (x : px (k, ℓ, ℓ) ≥ γ /4) + P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > γ /4) . ˆ ˆ (i) (43) (iii) By Lemma 42, on Hτ ∩ Hτ (ξ ), ˜ P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > γ /4) ≤ P (x : |px (k, ℓ) − px (k, ℓ, ℓ)| > ξ ) ≤ exp −ξ 2 M(ℓ) . (44) ˆ ˆ (iv) Also, on Hτ , (42) implies P (x : px (k, ℓ, ℓ) ≥ γ /4) = pγ /4 (k, ℓ, ℓ) ˆ ¯ ℓ3 ≤ℓ −1 +ℓ −3 ˆ ½[γ /4,∞) ∆(k) (wi ,W2 ,Vℓ⋆ ) ℓ i=1 ˆ (k) = ∆ℓ (W1 ,W2 ,Vℓ⋆ ) − ℓ−1 . (45) Combining (43) with (44) and (45) yields (k) ˆ ˜ P (x : px (k, ℓ) ≥ γ /2) ≤ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) − ℓ−1 + exp −ξ 2 M(ℓ) . (46) ˜ ˜ For τ ≥ τ (iv) (ξ ; δ ), exp −ξ 2 M(ℓ) − ℓ−1 ≤ − exp −γ 2 M(ℓ)/256 , so that (46) implies the first inequality of the lemma: namely (40). (iv) For the second inequality (i.e., (41)), on Hτ , (42) implies we have ˆ (k) ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ pγ /4 (k, ℓ, ℓ) + 3ℓ−1 . ¯ (47) √ Also, by Lemma 43 (with α = 1/2, ζ = γ /4, β = ξ /ζ < 1 − α ), for τ ≥ τ (iv) (ξ ; δ ), on Hτ (δ ) ∩ (iii) (i) Hτ ∩ Hτ (ξ ), ˜ pγ /4 (k, ℓ, ℓ) ≤ P (x : px (k, ℓ) ≥ γ /8) + exp −ξ 2 M(ℓ) . ¯ (48) Thus, combining (47) with (48) yields ˆ (k) ˜ ∆ℓ (W1 ,W2 ,Vℓ⋆ ) ≤ P (x : px (k, ℓ) ≥ γ /8) + 3ℓ−1 + exp −ξ 2 M(ℓ) . 1551 H ANNEKE ˜ For τ ≥ τ (iv) (ξ ; δ ), we have exp −ξ 2 M(ℓ) ≤ ℓ−1 , which establishes (41). For n ∈ N and k ∈ {1, . . . , d + 1}, define the set (k) (k) ˆ Un = mn + 1, . . . , mn + n/ 6 · 2k ∆mn (W1 ,W2 ,V ) , (k) where mn = ⌊n/3⌋; Un represents the set of indices processed in the inner loop of Meta-Algorithm 1 for the specified value of k. ˆ ˆ Lemma 45 There are ( f , C, P, γ )-dependent constants c1 , c2 ∈ (0, ∞) such that, for any ε ∈ (0, 1) ˆ and integer n ≥ c1 ln(c2 /ε ), on an event Hn (ε ) with ˆ ˆ ˆ P(Hn (ε )) ≥ 1 − (3/4)ε , (49) ⋆ we have, for V = Vmn , (k) (k) ˆ m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ ∀k ∈ 1, . . . 
, d˜f , ≤ n/ 3 · 2k , (50) (γ /8) ˆ (d ) ∆mnf (W1 ,W2 ,V ) ≤ ∆n (ε ) + 4m−1 , n ˜ (d˜f ) and ∀m ∈ Un (51) , ˜ ˜ ˜ ˆ (d ) ˆ (d ) ˆ (d ) ∆m f (Xm ,W2 ,V ) < γ ⇒ Γm f (Xm , − f (Xm ),W2 ,V ) < Γm f (Xm , f (Xm ),W2 ,V ). (52) Proof Suppose n ≥ c1 ln(c2 /ε ), where ˆ ˆ ˜ c1 = max ˆ 2d f +12 24 24 , , 3τ ∗ , ˜ f γ 2 r(1/16) r(1−γ )/6 δ and c2 = max 4 c(i) + c(ii) + c(iii) (γ /16) + 6d˜f , 4 ˆ 4e r(1/16) d ,4 4e d r(1−γ )/6 . In particular, we have chosen c1 and c2 large enough so that ˆ ˆ mn ≥ max τ (1/16; ε /2), τ (iv) (γ /16; ε /2), τ ((1 − γ )/6; ε /2), τ ∗ . We begin with (50). By Lemmas 43 and 44, on the event (iii) (i) (iv) ˆ (1) Hn (ε ) = Hmn (ε /2) ∩ Hmn ∩ Hmn (γ /16) ∩ Hmn , (k) ∀m ∈ Un , ∀k ∈ 1, . . . , d˜f , ˜ pγ (k, mn , m) ≤ P (x : px (k, mn ) ≥ γ /2) + exp −γ 2 M(m)/256 ¯ (k) ˆ ˜ ≤ P (x : px (k, mn ) ≥ γ /2) + exp −γ 2 M(mn )/256 ≤ ∆mn (W1 ,W2 ,V ) . 1552 (53) ACTIVIZED L EARNING Recall that (k) (k) ˆ is a sample of size n/(6 · 2k ∆mn (W1 ,W2 ,V )) , conditionally i.i.d. Xm : m ∈ Un (1) ˆ (given (W1 ,W2 ,V )) with conditional distributions P. Thus, ∀k ∈ 1, . . . , d˜f , on Hn (ε ), P ≤P (k) (k) ˆ m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ > n/ 3 · 2k (k) (k) ˆ (k) m ∈ Un : ∆m (Xm ,W2 ,V ) ≥ γ (k) > 2 Un (k) (k) ˆ ≤ P B |Un |, ∆mn (W1 ,W2 ,V ) > 2 Un W1 ,W2 ,V ˆ (k) ∆mn (W1 ,W2 ,V ) W1 ,W2 ,V ˆ (k) ∆mn (W1 ,W2 ,V ) W1 ,W2 ,V , (54) where this last inequality follows from (53), and B(u, p) ∼ Binomial(u, p) is independent from W1 ,W2 ,V (for any fixed u and p). By a Chernoff bound, (54) is at most (k) ˆ exp − n/ 6 · 2k ∆mn (W1 ,W2 ,V ) ˆ (k) ∆mn (W1 ,W2 ,V )/3 ≤ exp 1 − n/ 18 · 2k . ˆ (2) By the law of total probability and a union bound, there exists an event Hn with ˆ (2) ≤ d˜f · exp 1 − n/ 18 · 2d˜f ˆ (1) P Hn (ε ) \ Hn ˆ (1) ˆ (2) such that, on Hn (ε ) ∩ Hn , (50) holds. ˆ (1) Next, by Lemma 44, on Hn (ε ), ˜ ˆ (d ) ∆mnf (W1 ,W2 ,V ) ≤ P x : px d˜f , mn ≥ γ /8 + 4m−1 , n (γ /8) (1) ˆ and by Lemma 38, on Hn (ε ), this is at most ∆n (ε ) + 4m−1 , which establishes (51). 
n (˜ (1) (ii) ˆ n (ε ) ∩ Hmn , ∀m ∈ Und f ) , (52) holds. Finally, Lemma 41 implies that on H Thus, defining (ii) ˆ (1) ˆ (2) ˆ Hn (ε ) = Hn (ε ) ∩ Hn ∩ Hmn , it remains only to establish (49). By a union bound, we have (i) (i) ˆ 1 − P Hn ≤ (1 − P (Hmn (ε /2))) + 1 − P Hmn (i) (iii) (ii) + P Hmn \ Hmn (iv) + P Hmn \ Hmn (γ /16) + 1 − P Hmn ˆ (1) ˆ (2) + P Hn (ε ) \ Hn . ˜ ˜ ≤ ε /2 + c(i) · exp −M(mn )/4 + c(ii) · exp −M(mn )1/3 /60 ˜ + c(iii) (γ /16) · exp −M(mn )γ 2 /256 + 3d˜f · exp {−2mn } ˜ + d˜f · exp 1 − n/ 18 · 2d f ˜ ˜ ≤ ε /2 + c(i) + c(ii) + c(iii) (γ /16) + 6d˜f · exp −nδ f γ 2 2−d f −12 . We have chosen n large enough so that (55) is at most (3/4)ε , which establishes (49). The following result is a slightly stronger version of Theorem 6. 1553 (55) H ANNEKE Lemma 46 For any passive learning algorithm A p , if A p achieves a label complexity Λ p with ∞ > Λ p (ε , f , P) = ω (log(1/ε )), then Meta-Algorithm 1, with A p as its argument, achieves a label complexity Λa such that Λa (3ε , f , P) = o(Λ p (ε , f , P)). Proof Suppose A p achieves label complexity Λ p with ∞ > Λ p (ε , f , P) = ω (log(1/ε )). Let ε ∈ (γ /8) ˜ (0, 1), define L(n; ε ) = n/ 6 · 2d f ∆n (ε ) + 4m−1 n max {n ∈ N : L(n; ε ) < m} (for any m ∈ (0, ∞)). Define c1 = max c1 , 2 · 63 (d + 1)d˜f ln(e(d + 1)) ˆ (for any n ∈ N), and let L−1 (m; ε ) = and c2 = max {c2 , 4e(d + 1)} , ˆ and suppose n ≥ max c1 ln(c2 /ε ), 1 + L−1 (Λ p (ε , f , P); ε ) . Consider running Meta-Algorithm 1 with A p and n as inputs, while f is the target function and P is the data distribution. ˆ Letting hn denote the classifier returned from Meta-Algorithm 1, Lemma 34 implies that on an ˆ ˆ event En with P(En ) ≥ 1 − e(d + 1) · exp −⌊n/3⌋/(72d˜f (d + 1) ln(e(d + 1))) ≥ 1 − ε /4, we have ˆ er(hn ) ≤ 2 er A p Ld˜f . ˆ ˆ ˆ ˆ By a union bound, the event Gn (ε ) = En ∩ Hn (ε ) has P Gn (ε ) ≥ 1 − ε . 
Thus, ˆ E er hn ≤E ˆ ½Gn (ε ) ½ |Ld˜f | ≥ Λ p (ε , f , P) er hn ˆ ˆ + P Gn (ε ) ∩ |Ld˜f | < Λ p (ε , f , P) ≤E ˆ + P Gn (ε )c ½Gn (ε ) ½ |Ld˜f | ≥ Λ p (ε , f , P) 2 er A p Ld˜f ˆ ˆ + P Gn (ε ) ∩ |Ld˜f | < Λ p (ε , f , P) + ε. (56) ˆ On Gn (ε ), (51) of Lemma 45 implies |Ld˜f | ≥ L(n; ε ), and we chose n large enough so that L(n; ε ) ≥ Λ p (ε , f , P). Thus, the second term in (56) is zero, and we have ˆ E er hn ≤ 2·E ½Gn (ε ) ½ |Ld˜f | ≥ Λ p (ε , f , P) er A p Ld˜f ˆ = 2·E E ½Gn (ε ) er A p Ld˜f ˆ |Ld˜f | +ε ½ |Ld˜f | ≥ Λ p (ε , f , P) + ε . (d˜f ) Note that for any ℓ with P(|Ld˜f | = ℓ) > 0, the conditional distribution of Xm : m ∈ Un (57) given |Ld˜f | = ℓ is simply the product P ℓ (i.e., conditionally i.i.d.), which is the same as the distribution ˆ of {X1 , X2 , . . . , Xℓ }. Furthermore, on Gn (ε ), (50) implies that the t < ⌊2n/3⌋ condition is always satisfied in Step 6 of Meta-Algorithm 1 while k ≤ d˜f , and (52) implies that the inferred labels from Step 8 for k = d˜f are all correct. Therefore, for any such ℓ with ℓ ≥ Λ p (ε , f , P), we have E ½Gn (ε ) er A p Ld˜f ˆ |Ld˜f | = ℓ 1554 ≤ E [er (A p (Zℓ ))] ≤ ε . ACTIVIZED L EARNING In particular, this means (57) is at most 3ε . This implies that Meta-Algorithm 1, with A p as its argument, achieves a label complexity Λa such that Λa (3ε , f , P) ≤ max c1 ln(c2 /ε ), 1 + L−1 (Λ p (ε , f , P); ε ) . Since Λ p (ε , f , P) = ω (log(1/ε )) ⇒ c1 ln(c2 /ε ) = o (Λ p (ε , f , P)), it remains only to show that L−1 (Λ p (ε , f , P); ε ) = o (Λ p (ε , f , P)). Note that ∀ε ∈ (0, 1), L(1; ε ) = 0 and L(n; ε ) is diverging in n. Furthermore, by Lemma 38, we know that for any N-valued N(ε ) = ω (log(1/ε )), we have (γ /8) ∆N(ε ) (ε ) = o(1), which implies L(N(ε ); ε ) = ω (N(ε )). Thus, since Λ p (ε , f , P) = ω (log(1/ε )), Lemma 31 implies L−1 (Λ p (ε , f , P); ε ) = o (Λ p (ε , f , P)), as desired. This establishes the result for an arbitrary γ ∈ (0, 1). 
To specialize to the specific procedure stated as Meta-Algorithm 1, we simply take $\gamma = 1/2$.

Proof [Theorem 6] Theorem 6 now follows immediately from Lemma 46. Specifically, we have proven Lemma 46 for an arbitrary distribution $P$ on $\mathcal{X}$, an arbitrary $f \in \mathrm{cl}(C)$, and an arbitrary passive algorithm $A_p$. Therefore, it will certainly hold for every $P$ and $f \in C$, and since every $(f, P) \in \mathrm{Nontrivial}(\Lambda_p)$ has $\infty > \Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$, the implication that Meta-Algorithm 1 activizes every passive algorithm $A_p$ for $C$ follows.

Careful examination of the proofs above reveals that the "3" in Lemma 46 can be set to any arbitrary constant strictly larger than 1, by an appropriate modification of the "7/12" threshold in ActiveSelect. In fact, if we were to replace Step 4 of ActiveSelect by instead selecting $\hat k = \mathrm{argmin}_k \max_{j \ne k} m_{kj}$ (where $m_{kj} = \mathrm{er}_{Q_{kj}}(h_k)$ when $k < j$), then we could even make this a certain $(1 + o(1))$ function of $\varepsilon$, at the expense of larger constant factors in $\Lambda_a$.

Appendix C. The Label Complexity of Meta-Algorithm 2

As mentioned, Theorem 10 is essentially implied by the details of the proof of Theorem 16 in Appendix D below. Here we present a proof of Theorem 13, along with two useful related lemmas. The first, Lemma 47, lower bounds the expected number of label requests Meta-Algorithm 2 would make while processing a given number of random unlabeled examples. The second, Lemma 48, bounds the amount by which each label request is expected to reduce the probability mass in the region of disagreement. Although we will only use Lemma 48 in our proof of Theorem 13, Lemma 47 may be of independent interest, as it provides additional insights into the behavior of disagreement-based methods, as related to the disagreement coefficient, and is included for this reason.
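To make the two quantities tracked by these lemmas concrete, here is a toy simulation under assumptions that are not in the paper: the class is taken to be threshold classifiers $h_t(x) = +1$ iff $x \ge t$ on $[0,1]$, the distribution is uniform, and the target is $h_{t^\star}$ with a hypothetical $t^\star = 0.5$. For this class the region of disagreement of the version space is an open interval around $t^\star$, so a disagreement-based learner requests a label exactly when the next point falls in that interval. This is a simplified sketch of that behavior, not an implementation of Meta-Algorithm 2.

```python
import random


def simulate_label_requests(num_points, target=0.5, seed=0):
    """Run a disagreement-based learner for thresholds on uniform [0, 1) data.

    Returns the number of label requests and, after each point, the
    probability mass (interval length) of the region of disagreement.
    """
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0  # the region of disagreement is the open interval (lo, hi)
    requests = 0
    masses = []
    for _ in range(num_points):
        x = rng.random()
        if lo < x < hi:
            # x lies in the region of disagreement, so its label is requested
            requests += 1
            if x >= target:
                hi = x  # label +1: consistent thresholds satisfy t <= x
            else:
                lo = x  # label -1: consistent thresholds satisfy t > x
        masses.append(hi - lo)
    return requests, masses


if __name__ == "__main__":
    for m in (10, 100, 1000, 10000):
        requests, masses = simulate_label_requests(m, seed=1)
        print(f"points={m:6d} label requests={requests:3d} final mass={masses[-1]:.2e}")
```

For thresholds the number of requests grows slowly with the number of points processed, which is the favorable regime; the lemmas quantify how this count and the mass of the disagreement region behave for general classes.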
Throughout, we fix an arbitrary class $C$, a target function $f \in C$, and a distribution $P$, and we continue using the notational conventions of the proofs above, such as $V_m^\star = \{h \in C : \forall i \le m, h(X_i) = f(X_i)\}$ (with $V_0^\star = C$). Additionally, for $t \in \mathbb{N}$, define the random variable
$$M(t) = \min\Big\{m \in \mathbb{N} : \sum_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}(V_{\ell-1}^\star)}(X_\ell) = t\Big\},$$
which represents the index of the $t$-th unlabeled example Meta-Algorithm 2 would request the label of (assuming it has not yet halted). The two aforementioned lemmas are formally stated as follows.

Lemma 47 For any $r \in (0,1)$ and $\ell \in \mathbb{N}$,
$$E\big[P\big(\mathrm{DIS}(V_\ell^\star \cap B(f,r))\big)\big] \ge (1-r)^{\ell}\, P\big(\mathrm{DIS}(B(f,r))\big),$$
and
$$E\Big[\sum_{m=1}^{\lceil 1/r\rceil} \mathbb{1}_{\mathrm{DIS}(V_{m-1}^\star \cap B(f,r))}(X_m)\Big] \ge \frac{P\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$

Lemma 48 For any $r \in (0,1)$ and $n \in \mathbb{N}$,
$$E\big[P\big(\mathrm{DIS}\big(V_{M(n)}^\star \cap B(f,r)\big)\big)\big] \ge P\big(\mathrm{DIS}(B(f,r))\big) - nr.$$

Note these results immediately imply that
$$E\Big[\sum_{m=1}^{\lceil 1/r\rceil} \mathbb{1}_{\mathrm{DIS}(V_{m-1}^\star)}(X_m)\Big] \ge \frac{P\big(\mathrm{DIS}(B(f,r))\big)}{2r} \quad\text{and}\quad E\big[P\big(\mathrm{DIS}\big(V_{M(n)}^\star\big)\big)\big] \ge P\big(\mathrm{DIS}(B(f,r))\big) - nr,$$
which are then directly relevant to the expected number of label requests made by Meta-Algorithm 2 among the first $m$ data points, and the probability Meta-Algorithm 2 requests the label of the next point, after already making $n$ label requests, respectively.

Before proving these lemmas, let us first mention their relevance to the disagreement coefficient analysis. Specifically, for any $\varepsilon \in (0, r]$, we have
$$E\Big[\sum_{m=1}^{\lceil 1/\varepsilon\rceil} \mathbb{1}_{\mathrm{DIS}(V_{m-1}^\star)}(X_m)\Big] \ge E\Big[\sum_{m=1}^{\lceil 1/r\rceil} \mathbb{1}_{\mathrm{DIS}(V_{m-1}^\star)}(X_m)\Big] \ge \frac{P\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$
In particular, maximizing over $r > \varepsilon$, we have
$$E\Big[\sum_{m=1}^{\lceil 1/\varepsilon\rceil} \mathbb{1}_{\mathrm{DIS}(V_{m-1}^\star)}(X_m)\Big] \ge \theta_f(\varepsilon)/2.$$
Thus, the expected number of label requests among the first $\lceil 1/\varepsilon\rceil$ unlabeled examples processed by Meta-Algorithm 2 is at least $\theta_f(\varepsilon)/2$ (assuming it does not halt first). Similarly, for any $\varepsilon \in (0, r]$, for any $n \le P(\mathrm{DIS}(B(f,r)))/(2r)$, Lemma 48 implies
$$E\big[P\big(\mathrm{DIS}\big(V_{M(n)}^\star\big)\big)\big] \ge P\big(\mathrm{DIS}(B(f,r))\big)/2 \ge P\big(\mathrm{DIS}(B(f,\varepsilon))\big)/2.$$
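Maximizing over $r > \varepsilon$, we see that
$$n \le \theta_f(\varepsilon)/2 \implies E\big[P\big(\mathrm{DIS}\big(V_{M(n)}^\star\big)\big)\big] \ge P\big(\mathrm{DIS}(B(f,\varepsilon))\big)/2.$$
In other words, for Meta-Algorithm 2 to arrive at a region of disagreement with expected probability mass less than $P(\mathrm{DIS}(B(f,\varepsilon)))/2$ requires a budget $n$ of at least $\theta_f(\varepsilon)/2$.

We now present proofs of Lemmas 47 and 48.

Proof [Lemma 47] Let $D_m = \mathrm{DIS}(V_m^\star \cap B(f,r))$. Since
$$E\Big[\sum_{m=1}^{\lceil 1/r\rceil} \mathbb{1}_{D_{m-1}}(X_m)\Big] = \sum_{m=1}^{\lceil 1/r\rceil} E\big[P\big(X_m \in D_{m-1} \,\big|\, V_{m-1}^\star\big)\big] = \sum_{m=1}^{\lceil 1/r\rceil} E\big[P(D_{m-1})\big], \qquad (58)$$
we focus on lower bounding $E[P(D_m)]$ for $m \in \mathbb{N} \cup \{0\}$. Note that for any $x \in \mathrm{DIS}(B(f,r))$, there exists some $h_x \in B(f,r)$ with $h_x(x) \ne f(x)$, and if this $h_x \in V_m^\star$, then $x \in D_m$ as well. This means $\forall x$, $\mathbb{1}_{D_m}(x) \ge \mathbb{1}_{\mathrm{DIS}(B(f,r))}(x)\cdot\mathbb{1}_{V_m^\star}(h_x) = \mathbb{1}_{\mathrm{DIS}(B(f,r))}(x)\cdot\prod_{\ell=1}^{m}\mathbb{1}_{\mathrm{DIS}(\{h_x,f\})^c}(X_\ell)$. Therefore,
$$E[P(D_m)] = P(X_{m+1} \in D_m) = E\big[E\big[\mathbb{1}_{D_m}(X_{m+1}) \,\big|\, X_{m+1}\big]\big]$$
$$\ge E\Big[E\Big[\mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1})\cdot\prod_{\ell=1}^{m}\mathbb{1}_{\mathrm{DIS}(\{h_{X_{m+1}},f\})^c}(X_\ell) \,\Big|\, X_{m+1}\Big]\Big]$$
$$= E\Big[\prod_{\ell=1}^{m} P\big(h_{X_{m+1}}(X_\ell) = f(X_\ell) \,\big|\, X_{m+1}\big)\, \mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1})\Big] \qquad (59)$$
$$\ge E\big[(1-r)^m\, \mathbb{1}_{\mathrm{DIS}(B(f,r))}(X_{m+1})\big] = (1-r)^m\, P\big(\mathrm{DIS}(B(f,r))\big), \qquad (60)$$
where the equality in (59) is by conditional independence of the $\mathbb{1}_{\mathrm{DIS}(\{h_{X_{m+1}},f\})^c}(X_\ell)$ indicators, given $X_{m+1}$, and the inequality in (60) is due to $h_{X_{m+1}} \in B(f,r)$. This indicates (58) is at least
$$\sum_{m=1}^{\lceil 1/r\rceil} (1-r)^{m-1}\, P\big(\mathrm{DIS}(B(f,r))\big) = \Big(1 - (1-r)^{\lceil 1/r\rceil}\Big)\frac{P\big(\mathrm{DIS}(B(f,r))\big)}{r} \ge \Big(1 - \frac{1}{e}\Big)\frac{P\big(\mathrm{DIS}(B(f,r))\big)}{r} \ge \frac{P\big(\mathrm{DIS}(B(f,r))\big)}{2r}.$$

Proof [Lemma 48] For each $m \in \mathbb{N} \cup \{0\}$, let $D_m = \mathrm{DIS}(B(f,r) \cap V_m^\star)$. For convenience, let $M(0) = 0$. We prove the result by induction. We clearly have $E\big[P\big(D_{M(0)}\big)\big] = E[P(D_0)] = P(\mathrm{DIS}(B(f,r)))$, which serves as our base case. Now fix any $n \in \mathbb{N}$ and take as the inductive hypothesis that
$$E\big[P\big(D_{M(n-1)}\big)\big] \ge P\big(\mathrm{DIS}(B(f,r))\big) - (n-1)r.$$
As in the proof of Lemma 47, for any $x \in D_{M(n-1)}$, there exists $h_x \in B(f,r) \cap V_{M(n-1)}^\star$ with $h_x(x) \ne f(x)$; unlike the proof of Lemma 47, here $h_x$ is a random variable, determined by $V_{M(n-1)}^\star$. If $h_x$ is also in $V_{M(n)}^\star$, then $x \in D_{M(n)}$ as well.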
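Thus, $\forall x$, $\mathbb{1}_{D_{M(n)}}(x) \ge \mathbb{1}_{D_{M(n-1)}}(x)\cdot\mathbb{1}_{V_{M(n)}^\star}(h_x) = \mathbb{1}_{D_{M(n-1)}}(x)\cdot\mathbb{1}_{\mathrm{DIS}(\{h_x,f\})^c}(X_{M(n)})$, where this last equality is due to the fact that every $m \in \{M(n-1)+1, \ldots, M(n)-1\}$ has $X_m \notin \mathrm{DIS}\big(V_{m-1}^\star\big)$, so that in particular $h_x(X_m) = f(X_m)$. Therefore, letting $X \sim P$ be independent of the data $Z$,
$$E\big[P\big(D_{M(n)}\big)\big] = E\big[\mathbb{1}_{D_{M(n)}}(X)\big] \ge E\big[\mathbb{1}_{D_{M(n-1)}}(X)\cdot\mathbb{1}_{\mathrm{DIS}(\{h_X,f\})^c}(X_{M(n)})\big]$$
$$= E\Big[\mathbb{1}_{D_{M(n-1)}}(X)\cdot P\big(h_X(X_{M(n)}) = f(X_{M(n)}) \,\big|\, X, V_{M(n-1)}^\star\big)\Big]. \qquad (61)$$
The conditional distribution of $X_{M(n)}$ given $V_{M(n-1)}^\star$ is merely $P$ but with support restricted to $\mathrm{DIS}\big(V_{M(n-1)}^\star\big)$ and renormalized to a probability measure: that is, $P\big(\cdot \,\big|\, \mathrm{DIS}\big(V_{M(n-1)}^\star\big)\big)$. Thus, since any $x \in D_{M(n-1)}$ has $\mathrm{DIS}(\{h_x,f\}) \subseteq \mathrm{DIS}\big(V_{M(n-1)}^\star\big)$, we have
$$P\big(h_x(X_{M(n)}) \ne f(X_{M(n)}) \,\big|\, V_{M(n-1)}^\star\big) = \frac{P\big(\mathrm{DIS}(\{h_x,f\})\big)}{P\big(\mathrm{DIS}\big(V_{M(n-1)}^\star\big)\big)} \le \frac{r}{P\big(D_{M(n-1)}\big)},$$
where the inequality follows from $h_x \in B(f,r)$ and $D_{M(n-1)} \subseteq \mathrm{DIS}\big(V_{M(n-1)}^\star\big)$. Therefore, (61) is at least
$$E\Big[\mathbb{1}_{D_{M(n-1)}}(X)\cdot\Big(1 - \frac{r}{P(D_{M(n-1)})}\Big)\Big] = E\Big[P\big(X \in D_{M(n-1)} \,\big|\, D_{M(n-1)}\big)\cdot\Big(1 - \frac{r}{P(D_{M(n-1)})}\Big)\Big]$$
$$= E\Big[P\big(D_{M(n-1)}\big)\cdot\Big(1 - \frac{r}{P(D_{M(n-1)})}\Big)\Big] = E\big[P\big(D_{M(n-1)}\big)\big] - r.$$
By the inductive hypothesis, this is at least $P(\mathrm{DIS}(B(f,r))) - nr$.

With Lemma 48 in hand, we are ready for the proof of Theorem 13.

Proof [Theorem 13] Let $C$, $f$, $P$, and $\lambda$ be as in the theorem statement. For $m \in \mathbb{N}$, let $\lambda^{-1}(m) = \inf\{\varepsilon > 0 : \lambda(\varepsilon) \le m\}$, or 1 if this is not defined. We define $A_p$ as a randomized algorithm such that, for $m \in \mathbb{N}$ and $L \in (\mathcal{X} \times \{-1,+1\})^m$, $A_p(L)$ returns $f$ with probability $1 - \lambda^{-1}(|L|)$ and returns $-f$ with probability $\lambda^{-1}(|L|)$ (independent of the contents of $L$). Note that, for any integer $m \ge \lambda(\varepsilon)$, $E[\mathrm{er}(A_p(Z_m))] = \lambda^{-1}(m) \le \lambda^{-1}(\lambda(\varepsilon)) \le \varepsilon$. Therefore, $A_p$ achieves some label complexity $\Lambda_p$ with $\Lambda_p(\varepsilon, f, P) = \lambda(\varepsilon)$ for all $\varepsilon > 0$.

If $\theta_f\big(\lambda(\varepsilon)^{-1}\big) \ne \omega(1)$, then monotonicity implies $\theta_f\big(\lambda(\varepsilon)^{-1}\big) = O(1)$, and since every label complexity $\Lambda_a$ is $\Omega(1)$, the result clearly holds. Otherwise, suppose $\theta_f\big(\lambda(\varepsilon)^{-1}\big) = \omega(1)$; in particular, this means $\exists \varepsilon_0 \in (0, 1/2)$ such that $\theta_f\big(\lambda(2\varepsilon_0)^{-1}\big) \ge 12$.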
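Fix any $\varepsilon \in (0, \varepsilon_0)$, let $r > \lambda(2\varepsilon)^{-1}$ be such that $\frac{P(\mathrm{DIS}(B(f,r)))}{r} \ge \theta_f\big(\lambda(2\varepsilon)^{-1}\big)/2$, and let $n \in \mathbb{N}$ satisfy $n \le \theta_f\big(\lambda(2\varepsilon)^{-1}\big)/4$.

Consider running Meta-Algorithm 2 with arguments $A_p$ and $n$, and let $\hat L$ denote the final value of the set $L$, and let $\check m$ denote the value of $m$ upon reaching Step 6. Note that any $m < \lambda(2\varepsilon)$ and $L \in (\mathcal{X} \times \{-1,+1\})^m$ has $\mathrm{er}(A_p(L)) = \lambda^{-1}(m) \ge \inf\{\varepsilon' > 0 : \lambda(\varepsilon') < \lambda(2\varepsilon)\} \ge 2\varepsilon$. Therefore, we have
$$E\big[\mathrm{er}\big(A_p\big(\hat L\big)\big)\big] \ge 2\varepsilon\, P\big(|\hat L| < \lambda(2\varepsilon)\big) = 2\varepsilon\, P\Big(\big\lfloor n/\big(6\hat\Delta\big)\big\rfloor < \lambda(2\varepsilon)\Big) = 2\varepsilon\, P\Big(\hat\Delta > \frac{n}{6\lambda(2\varepsilon)}\Big) = 2\varepsilon\Big(1 - P\Big(\hat\Delta \le \frac{n}{6\lambda(2\varepsilon)}\Big)\Big). \qquad (62)$$
Since $n \le \theta_f\big(\lambda(2\varepsilon)^{-1}\big)/4 \le P(\mathrm{DIS}(B(f,r)))/(2r) < \lambda(2\varepsilon)\,P(\mathrm{DIS}(B(f,r)))/2$, we have
$$P\Big(\hat\Delta \le \frac{n}{6\lambda(2\varepsilon)}\Big) \le P\Big(\hat\Delta < P(\mathrm{DIS}(B(f,r)))/12\Big) \le P\Big(\big\{P\big(\mathrm{DIS}(V_{\check m}^\star)\big) < P(\mathrm{DIS}(B(f,r)))/12\big\} \cup \big\{\hat\Delta < P\big(\mathrm{DIS}(V_{\check m}^\star)\big)\big\}\Big). \qquad (63)$$
Since $\check m \le M(\lceil n/2\rceil)$, monotonicity and a union bound imply this is at most
$$P\Big(P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star\big)\big) < P(\mathrm{DIS}(B(f,r)))/12\Big) + P\Big(\hat\Delta < P\big(\mathrm{DIS}(V_{\check m}^\star)\big)\Big). \qquad (64)$$
Markov's inequality implies
$$P\Big(P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star\big)\big) < P(\mathrm{DIS}(B(f,r)))/12\Big) = P\Big(P(\mathrm{DIS}(B(f,r))) - P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star\big)\big) > \tfrac{11}{12}P(\mathrm{DIS}(B(f,r)))\Big)$$
$$\le P\Big(P(\mathrm{DIS}(B(f,r))) - P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star \cap B(f,r)\big)\big) > \tfrac{11}{12}P(\mathrm{DIS}(B(f,r)))\Big)$$
$$\le \frac{E\Big[P(\mathrm{DIS}(B(f,r))) - P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star \cap B(f,r)\big)\big)\Big]}{\tfrac{11}{12}P(\mathrm{DIS}(B(f,r)))} = \frac{12}{11}\left(1 - \frac{E\Big[P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star \cap B(f,r)\big)\big)\Big]}{P(\mathrm{DIS}(B(f,r)))}\right).$$
Lemma 48 implies this is at most $\frac{12}{11}\cdot\frac{\lceil n/2\rceil r}{P(\mathrm{DIS}(B(f,r)))} \le \frac{12}{11}\left\lceil\frac{P(\mathrm{DIS}(B(f,r)))}{4r}\right\rceil\frac{r}{P(\mathrm{DIS}(B(f,r)))}$. Since any $a \ge 3/2$ has $\lceil a\rceil \le (3/2)a$, and $\theta_f\big(\lambda(2\varepsilon)^{-1}\big) \ge 12$ implies $\frac{P(\mathrm{DIS}(B(f,r)))}{4r} \ge 3/2$, we have $\left\lceil\frac{P(\mathrm{DIS}(B(f,r)))}{4r}\right\rceil \le \frac{3}{8}\cdot\frac{P(\mathrm{DIS}(B(f,r)))}{r}$, so that $\frac{12}{11}\left\lceil\frac{P(\mathrm{DIS}(B(f,r)))}{4r}\right\rceil\frac{r}{P(\mathrm{DIS}(B(f,r)))} \le \frac{9}{22}$. Combining the above, we have
$$P\Big(P\big(\mathrm{DIS}\big(V_{M(\lceil n/2\rceil)}^\star\big)\big) < P(\mathrm{DIS}(B(f,r)))/12\Big) \le \frac{9}{22}. \qquad (65)$$
Examining the second term in (64), Hoeffding's inequality and the definition of $\hat\Delta$ from (13) imply
$$P\Big(\hat\Delta < P\big(\mathrm{DIS}(V_{\check m}^\star)\big)\Big) = E\Big[P\Big(\hat\Delta < P\big(\mathrm{DIS}(V_{\check m}^\star)\big) \,\Big|\, V_{\check m}^\star, \check m\Big)\Big] \le E\big[e^{-8\check m}\big] \le e^{-8} < 1/11. \qquad (66)$$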
(66)

Combining (62), (63), (64), (65), and (66) implies
\[ E\big[\mathrm{er}(A_p(\hat{\mathcal{L}}))\big] > 2\varepsilon\Big(1 - \frac{9}{22} - \frac{1}{11}\Big) = \varepsilon. \]
Thus, for any label complexity $\Lambda_a$ achieved by running Meta-Algorithm 2 with $A_p$ as its argument, we must have $\Lambda_a(\varepsilon,f,\mathcal{P}) > \theta_f(\lambda(2\varepsilon)^{-1})/4$. Since this is true for all $\varepsilon \in (0,\varepsilon_0)$, this establishes the result.

Appendix D. The Label Complexity of Meta-Algorithm 3

As in Appendix B, we will assume $\mathbb{C}$ is a fixed VC class, $\mathcal{P}$ is some arbitrary distribution, and $f \in \mathrm{cl}(\mathbb{C})$ is an arbitrary fixed function. We continue using the notation introduced above: in particular, $\mathcal{S}^k(\mathcal{H}) = \{S \in \mathcal{X}^k : \mathcal{H} \text{ shatters } S\}$, $\bar{\mathcal{S}}^k(\mathcal{H}) = \mathcal{X}^k \setminus \mathcal{S}^k(\mathcal{H})$, $\bar\partial^k_{\mathcal{H}} f = \mathcal{X}^k \setminus \partial^k_{\mathcal{H}} f$, and $\tilde\delta_f = \mathcal{P}^{\tilde d_f - 1}\big(\partial^{\tilde d_f - 1}_{\mathbb{C}} f\big)$. Also, as above, we will prove a more general result replacing the "1/2" in Steps 5, 9, and 12 of Meta-Algorithm 3 with an arbitrary value $\gamma \in (0,1)$; thus, the specific result for the stated algorithm will be obtained by taking $\gamma = 1/2$.

For the estimators $\hat P_m$ in Meta-Algorithm 3, we take precisely the same definitions as given in Appendix B.1 for the estimators in Meta-Algorithm 1. In particular, the quantities $\hat\Delta^{(k)}_m(x,W_2,\mathcal{H})$, $\hat\Delta^{(k)}_m(W_1,W_2,\mathcal{H})$, $\hat\Gamma^{(k)}_m(x,y,W_2,\mathcal{H})$, and $M^{(k)}_m(\mathcal{H})$ are all defined as in Appendix B.1, and the $\hat P_m$ estimators are again defined as in (11), (12) and (13).

Also, we sometimes refer to quantities defined above, such as $\bar p_\zeta(k,\ell,m)$ (defined in (34)), as well as the various events from the lemmas of the previous appendix, such as $H_\tau(\delta)$, $H'$, $H^{(i)}_\tau$, $H^{(ii)}_\tau$, $H^{(iii)}_\tau(\zeta)$, $H^{(iv)}_\tau$, and $G^{(i)}_\tau$.

D.1 Proof of Theorem 16

Throughout the proof, we will make reference to the sets $V_m$ defined in Meta-Algorithm 3. Also let $V^{(k)}$ denote the final value of $V$ obtained for the specified value of $k$ in Meta-Algorithm 3. Both $V_m$ and $V^{(k)}$ are implicitly functions of the budget, $n$, given to Meta-Algorithm 3. As above, we continue to denote by $V^\star_m = \{h \in \mathbb{C} : \forall i \le m,\ h(X_i) = f(X_i)\}$.
One important fact we will use repeatedly below is that if $V_m = V^\star_m$ for some $m$, then since Lemma 35 implies that $V^\star_m \ne \emptyset$ on $H'$, we must have that all of the previous $\hat y$ values were consistent with $f$, which means that $\forall \ell \le m$, $V_\ell = V^\star_\ell$. In particular, if $V^{(k')} = V^\star_m$ for the largest $m$ value obtained while $k = k'$ in Meta-Algorithm 3, then $V_\ell = V^\star_\ell$ for all $\ell$ obtained while $k \le k'$ in Meta-Algorithm 3.

Additionally, define $\tilde m_n = \lfloor n/24 \rfloor$, and note that the value $m = \lceil n/6 \rceil$ is obtained while $k = 1$ in Meta-Algorithm 3. We also define the following quantities, which we will show are typically equal to related quantities in Meta-Algorithm 3. Define $\hat m_0 = 0$, $T^\star_0 = \lceil 2n/3 \rceil$, and $\hat t_0 = 0$, and for each $k \in \{1,\ldots,d+1\}$, inductively define
\begin{align*}
T^\star_k &= T^\star_{k-1} - \hat t_{k-1}, \\
I^\star_{mk} &= \mathbb{1}_{[\gamma,\infty)}\big(\hat\Delta^{(k)}_m(X_m,W_2,V^\star_{m-1})\big), \quad \forall m \in \mathbb{N}, \\
\check m_k &= \min\Big(\Big\{ m \ge \hat m_{k-1} : \sum_{\ell=\hat m_{k-1}+1}^{m} I^\star_{\ell k} = \lceil T^\star_k/4 \rceil \Big\} \cup \big\{ \max\{k\cdot 2^n + 1, \hat m_{k-1}\} \big\}\Big), \\
\hat m_k &= \check m_k + \Big\lfloor T^\star_k \Big/ \Big(3\hat\Delta^{(k)}_{\check m_k}(W_1,W_2,V^\star_{\check m_k})\Big) \Big\rfloor, \\
\check U_k &= (\hat m_{k-1}, \check m_k] \cap \mathbb{N}, \qquad \hat U_k = (\check m_k, \hat m_k] \cap \mathbb{N}, \\
C^\star_{mk} &= \mathbb{1}_{[0,\lfloor 3T^\star_k/4 \rfloor)}\Big(\sum_{\ell=\hat m_{k-1}+1}^{m-1} I^\star_{\ell k}\Big), \\
Q^\star_k &= \sum_{m \in \hat U_k} I^\star_{mk}\cdot C^\star_{mk}, \qquad \text{and} \qquad \hat t_k = Q^\star_k + \sum_{m \in \check U_k} I^\star_{mk}.
\end{align*}
The meaning of these values can be understood in the context of Meta-Algorithm 3, under the condition that $V_m = V^\star_m$ for values of $m$ obtained for the respective value of $k$. Specifically, under this condition, $T^\star_k$ corresponds to $T_k$, $\hat t_k$ represents the final value $t$ for round $k$, $\check m_k$ represents the value of $m$ upon reaching Step 9 in round $k$, while $\hat m_k$ represents the value of $m$ at the end of round $k$, $\check U_k$ corresponds to the set of indices arrived at in Step 4 during round $k$, while $\hat U_k$ corresponds to the set of indices arrived at in Step 11 during round $k$, for $m \in \check U_k$, $I^\star_{mk}$ indicates whether the label of $X_m$ is requested, while for $m \in \hat U_k$, $I^\star_{mk}\cdot C^\star_{mk}$ indicates whether the label of $X_m$ is requested. Finally $Q^\star_k$ corresponds to the number of label requests in Step 13 during round $k$.
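In particular, note $\check m_1 \ge \tilde m_n$.

Lemma 49 For any $\tau \in \mathbb{N}$, on the event $H' \cap G^{(i)}_\tau$, $\forall k,\ell,m \in \mathbb{N}$ with $k \le \tilde d_f$, $\forall x \in \mathcal{X}$, for any sets $\mathcal{H}$ and $\mathcal{H}'$ with $V^\star_\ell \subseteq \mathcal{H} \subseteq \mathcal{H}' \subseteq B(f, r_{1/6})$, if either $k = 1$ or $m \ge \tau$, then
\[ \hat\Delta^{(k)}_m(x,W_2,\mathcal{H}) \le (3/2)\hat\Delta^{(k)}_m(x,W_2,\mathcal{H}'). \]
In particular, for any $\delta \in (0,1)$ and $\tau \ge \tau(1/6;\delta)$, on $H' \cap H_\tau(\delta) \cap G^{(i)}_\tau$, $\forall k,\ell,\ell',m \in \mathbb{N}$ with $m \ge \tau$, $\ell \ge \ell' \ge \tau$, and $k \le \tilde d_f$, $\forall x \in \mathcal{X}$, $\hat\Delta^{(k)}_m(x,W_2,V^\star_\ell) \le (3/2)\hat\Delta^{(k)}_m(x,W_2,V^\star_{\ell'})$.

Proof First note that $\forall m \in \mathbb{N}$, $\forall x \in \mathcal{X}$,
\[ \hat\Delta^{(1)}_m(x,W_2,\mathcal{H}) = \mathbb{1}_{\mathrm{DIS}(\mathcal{H})}(x) \le \mathbb{1}_{\mathrm{DIS}(\mathcal{H}')}(x) = \hat\Delta^{(1)}_m(x,W_2,\mathcal{H}'), \]
so the result holds for $k = 1$. Lemma 35, Lemma 40, and monotonicity of $M^{(k)}_m(\cdot)$ imply that on $H' \cap G^{(i)}_\tau$, for any $m \ge \tau$ and $k \in \{2,\ldots,\tilde d_f\}$,
\[ M^{(k)}_m(\mathcal{H}) \ge \sum_{i=1}^{m^3} \mathbb{1}_{\partial^{k-1}_{\mathbb{C}} f}\big(S^{(k)}_i\big) \ge (2/3)M^{(k)}_m\big(B(f,r_{1/6})\big) \ge (2/3)M^{(k)}_m(\mathcal{H}'), \]
so that $\forall x \in \mathcal{X}$,
\begin{align*}
\hat\Delta^{(k)}_m(x,W_2,\mathcal{H}) &= M^{(k)}_m(\mathcal{H})^{-1}\sum_{i=1}^{m^3} \mathbb{1}_{\mathcal{S}^k(\mathcal{H})}\big(S^{(k)}_i \cup \{x\}\big) \\
&\le M^{(k)}_m(\mathcal{H})^{-1}\sum_{i=1}^{m^3} \mathbb{1}_{\mathcal{S}^k(\mathcal{H}')}\big(S^{(k)}_i \cup \{x\}\big) \\
&\le (3/2)M^{(k)}_m(\mathcal{H}')^{-1}\sum_{i=1}^{m^3} \mathbb{1}_{\mathcal{S}^k(\mathcal{H}')}\big(S^{(k)}_i \cup \{x\}\big) = (3/2)\hat\Delta^{(k)}_m(x,W_2,\mathcal{H}').
\end{align*}
The final claim follows from Lemma 29.

Lemma 50 For any $k \in \{1,\ldots,d+1\}$, if $n \ge 3\cdot 4^{k-1}$, then $T^\star_k \ge 4^{1-k}(2n/3)$ and $\hat t_k \le \lfloor 3T^\star_k/4 \rfloor$.

Proof Recall $T^\star_1 = \lceil 2n/3 \rceil \ge 2n/3$. If $n \ge 2$, we also have $\lfloor 3T^\star_1/4 \rfloor \ge \lceil T^\star_1/4 \rceil$, so that (due to the $C^\star_{m1}$ factors) $\hat t_1 \le \lfloor 3T^\star_1/4 \rfloor$. For the purpose of induction, suppose some $k \in \{2,\ldots,d+1\}$ has $n \ge 3\cdot 4^{k-1}$, $T^\star_{k-1} \ge 4^{2-k}(2n/3)$, and $\hat t_{k-1} \le \lfloor 3T^\star_{k-1}/4 \rfloor$. Then $T^\star_k = T^\star_{k-1} - \hat t_{k-1} \ge T^\star_{k-1}/4 \ge 4^{1-k}(2n/3)$, and since $n \ge 3\cdot 4^{k-1}$, we also have $\lfloor 3T^\star_k/4 \rfloor \ge \lceil T^\star_k/4 \rceil$, so that $\hat t_k \le \lfloor 3T^\star_k/4 \rfloor$ (again, due to the $C^\star_{mk}$ factors). Thus, by induction, this holds for all $k \in \{1,\ldots,d+1\}$ with $n \ge 3\cdot 4^{k-1}$.

The next lemma indicates that the "$t < \lfloor 3T_k/4 \rfloor$" constraint in Step 12 is redundant for $k \le \tilde d_f$.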
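It is similar to (50) in Lemma 45, but is made only slightly more complicated by the fact that the $\hat\Delta^{(k)}$ estimate is calculated in Step 9 based on a set $V_m$ different from the ones used to decide whether or not to request a label in Step 12.

Lemma 51 There exist $(\mathbb{C},\mathcal{P},f,\gamma)$-dependent constants $\tilde c^{(i)}_1, \tilde c^{(i)}_2 \in [1,\infty)$ such that, for any $\delta \in (0,1)$, and any integer $n \ge \tilde c^{(i)}_1 \ln\big(\tilde c^{(i)}_2/\delta\big)$, on an event
\[ \tilde H^{(i)}_n(\delta) \subseteq G^{(i)}_{\tilde m_n} \cap H_{\tilde m_n}(\delta) \cap H^{(i)}_{\tilde m_n} \cap H^{(iii)}_{\tilde m_n}(\gamma/16) \cap H^{(iv)}_{\tilde m_n} \]
with $P\big(\tilde H^{(i)}_n(\delta)\big) \ge 1 - 2\delta$, $\forall k \in \{1,\ldots,\tilde d_f\}$, $\hat t_k = \sum_{m=\hat m_{k-1}+1}^{\hat m_k} I^\star_{mk} \le 3T^\star_k/4$.

Proof Define the constants
\[ \tilde c^{(i)}_1 = \max\Big\{ \frac{192d}{r_{(3/32)}}, \frac{3\cdot 4^{\tilde d_f + 6}}{\tilde\delta_f \gamma^2} \Big\}, \qquad \tilde c^{(i)}_2 = \max\Big\{ \frac{8e}{r_{(3/32)}}, \big(c^{(i)} + c^{(iii)}(\gamma/16) + 125\tilde d_f \tilde\delta_f^{-1}\big) \Big\}, \]
and let $n^{(i)}(\delta) = \tilde c^{(i)}_1 \ln\big(\tilde c^{(i)}_2/\delta\big)$. Fix any integer $n \ge n^{(i)}(\delta)$ and consider the event
\[ \tilde H^{(1)}_n(\delta) = G^{(i)}_{\tilde m_n} \cap H_{\tilde m_n}(\delta) \cap H^{(i)}_{\tilde m_n} \cap H^{(iii)}_{\tilde m_n}(\gamma/16) \cap H^{(iv)}_{\tilde m_n}. \]
By Lemma 49 and the fact that $\check m_k \ge \tilde m_n$ for all $k \ge 1$, since $n \ge n^{(i)}(\delta) \ge 24\tau(1/6;\delta)$, on $\tilde H^{(1)}_n(\delta)$, $\forall k \in \{1,\ldots,\tilde d_f\}$, $\forall m \in \hat U_k$,
\[ \hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{m-1}\big) \le (3/2)\hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{\check m_k}\big). \tag{67} \]
Now fix any $k \in \{1,\ldots,\tilde d_f\}$. Since $n \ge n^{(i)}(\delta) \ge 27\cdot 4^{k-1}$, Lemma 50 implies $T^\star_k \ge 18$, which means $3T^\star_k/4 - \lceil T^\star_k/4 \rceil \ge 4T^\star_k/9$. Also note $\sum_{m \in \check U_k} I^\star_{mk} \le \lceil T^\star_k/4 \rceil$. Let $N_k = (4/3)\hat\Delta^{(k)}_{\check m_k}\big(W_1,W_2,V^\star_{\check m_k}\big)\big|\hat U_k\big|$, and note that $\big|\hat U_k\big| = \Big\lfloor T^\star_k \Big/ \Big(3\hat\Delta^{(k)}_{\check m_k}\big(W_1,W_2,V^\star_{\check m_k}\big)\Big) \Big\rfloor$, so that $N_k \le (4/9)T^\star_k$. Thus, we have
\begin{align*}
P\Big(\tilde H^{(1)}_n(\delta) \cap \Big\{ \sum_{m=\hat m_{k-1}+1}^{\hat m_k} I^\star_{mk} > 3T^\star_k/4 \Big\}\Big)
&\le P\Big(\tilde H^{(1)}_n(\delta) \cap \Big\{ \sum_{m \in \hat U_k} I^\star_{mk} > 4T^\star_k/9 \Big\}\Big)
\le P\Big(\tilde H^{(1)}_n(\delta) \cap \Big\{ \sum_{m \in \hat U_k} I^\star_{mk} > N_k \Big\}\Big) \\
&\le P\Big(\tilde H^{(1)}_n(\delta) \cap \Big\{ \sum_{m \in \hat U_k} \mathbb{1}_{[2\gamma/3,\infty)}\Big(\hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{\check m_k}\big)\Big) > N_k \Big\}\Big), \tag{68}
\end{align*}
where this last inequality is by (67). To simplify notation, define $\tilde Z_k = \big(T^\star_k, \check m_k, W_1, W_2, V^\star_{\check m_k}\big)$.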
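By Lemmas 43 and 44 (with $\beta = 3/32$, $\zeta = 2\gamma/3$, $\alpha = 3/4$, and $\xi = \gamma/16$), since $n \ge n^{(i)}(\delta) \ge 24\cdot\max\big\{\tau^{(iv)}(\gamma/16;\delta), \tau(3/32;\delta)\big\}$, on $\tilde H^{(1)}_n(\delta)$, $\forall m \in \hat U_k$,
\begin{align*}
\bar p_{2\gamma/3}(k,\check m_k,m) &\le \mathcal{P}\big(x : p_x(k,\check m_k) \ge \gamma/2\big) + \exp\big\{-\gamma^2\tilde M(m)/256\big\} \\
&\le \mathcal{P}\big(x : p_x(k,\check m_k) \ge \gamma/2\big) + \exp\big\{-\gamma^2\tilde M(\check m_k)/256\big\} \\
&\le \hat\Delta^{(k)}_{\check m_k}\big(W_1,W_2,V^\star_{\check m_k}\big).
\end{align*}
Letting $\tilde G'_n(k)$ denote the event $\bar p_{2\gamma/3}(k,\check m_k,m) \le \hat\Delta^{(k)}_{\check m_k}\big(W_1,W_2,V^\star_{\check m_k}\big)$, we see that $\tilde G'_n(k) \supseteq \tilde H^{(1)}_n(\delta)$. Thus, since the $\mathbb{1}_{[2\gamma/3,\infty)}\big(\hat\Delta^{(k)}_m(X_m,W_2,V^\star_{\check m_k})\big)$ variables are conditionally independent given $\tilde Z_k$ for $m \in \hat U_k$, each with respective conditional distribution Bernoulli$\big(\bar p_{2\gamma/3}(k,\check m_k,m)\big)$, the law of total probability and a Chernoff bound imply that (68) is at most
\begin{align*}
P\Big(\tilde G'_n(k) \cap \Big\{ \sum_{m \in \hat U_k} \mathbb{1}_{[2\gamma/3,\infty)}\Big(\hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{\check m_k}\big)\Big) > N_k \Big\}\Big)
&= E\Big[ P\Big( \sum_{m \in \hat U_k} \mathbb{1}_{[2\gamma/3,\infty)}\Big(\hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{\check m_k}\big)\Big) > N_k \,\Big|\, \tilde Z_k \Big)\cdot\mathbb{1}_{\tilde G'_n(k)} \Big] \\
&\le E\Big[ \exp\Big\{ -\hat\Delta^{(k)}_{\check m_k}\big(W_1,W_2,V^\star_{\check m_k}\big)\big|\hat U_k\big|/27 \Big\} \Big]
\le E\big[\exp\{-T^\star_k/162\}\big] \le \exp\big\{-n/\big(243\cdot 4^{k-1}\big)\big\},
\end{align*}
where the last inequality is by Lemma 50. Thus, there exists $\tilde G_n(k)$ with $P\big(\tilde H^{(1)}_n(\delta) \setminus \tilde G_n(k)\big) \le \exp\big\{-n/\big(243\cdot 4^{k-1}\big)\big\}$ such that, on $\tilde H^{(1)}_n(\delta) \cap \tilde G_n(k)$, we have $\sum_{m=\hat m_{k-1}+1}^{\hat m_k} I^\star_{mk} \le 3T^\star_k/4$. Defining $\tilde H^{(i)}_n(\delta) = \tilde H^{(1)}_n(\delta) \cap \bigcap_{k=1}^{\tilde d_f} \tilde G_n(k)$, a union bound implies
\[ P\big(\tilde H^{(1)}_n(\delta) \setminus \tilde H^{(i)}_n(\delta)\big) \le \tilde d_f\cdot\exp\big\{-n/\big(243\cdot 4^{\tilde d_f - 1}\big)\big\}, \tag{69} \]
and on $\tilde H^{(i)}_n(\delta)$, every $k \in \{1,\ldots,\tilde d_f\}$ has $\sum_{m=\hat m_{k-1}+1}^{\hat m_k} I^\star_{mk} \le 3T^\star_k/4$. In particular, this means the $C^\star_{mk}$ factors are redundant in $Q^\star_k$, so that $\hat t_k = \sum_{m=\hat m_{k-1}+1}^{\hat m_k} I^\star_{mk}$.

To get the stated probability bound, a union bound implies that
\begin{align*}
1 - P\big(\tilde H^{(1)}_n(\delta)\big) &\le \big(1 - P(H_{\tilde m_n}(\delta))\big) + \big(1 - P\big(H^{(i)}_{\tilde m_n}\big)\big) + P\big(H^{(i)}_{\tilde m_n} \setminus H^{(iii)}_{\tilde m_n}(\gamma/16)\big) + \big(1 - P\big(H^{(iv)}_{\tilde m_n}\big)\big) + P\big(H^{(i)}_{\tilde m_n} \setminus G^{(i)}_{\tilde m_n}\big) \\
&\le \delta + c^{(i)}\cdot\exp\big\{-\tilde M(\tilde m_n)/4\big\} + c^{(iii)}(\gamma/16)\cdot\exp\big\{-\tilde M(\tilde m_n)\gamma^2/256\big\} + 3\tilde d_f\cdot\exp\{-2\tilde m_n\} + 121\tilde d_f\tilde\delta_f^{-1}\cdot\exp\big\{-\tilde M(\tilde m_n)/60\big\} \\
&\le \delta + \big(c^{(i)} + c^{(iii)}(\gamma/16) + 124\tilde d_f\tilde\delta_f^{-1}\big)\cdot\exp\big\{-\tilde m_n\tilde\delta_f\gamma^2/512\big\}.
\end{align*}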
(70)

Since $n \ge n^{(i)}(\delta) \ge 24$, we have $\tilde m_n \ge n/48$, so that summing (69) and (70) gives us
\[ 1 - P\big(\tilde H^{(i)}_n(\delta)\big) \le \delta + \big(c^{(i)} + c^{(iii)}(\gamma/16) + 125\tilde d_f\tilde\delta_f^{-1}\big)\cdot\exp\big\{-n\tilde\delta_f\gamma^2/\big(512\cdot 48\cdot 4^{\tilde d_f - 1}\big)\big\}. \tag{71} \]
Finally, note that we have chosen $n^{(i)}(\delta)$ sufficiently large so that (71) is at most $2\delta$.

The next lemma indicates that the redundancy of the "$t < \lfloor 3T_k/4 \rfloor$" constraint, just established in Lemma 51, implies that all $\hat y$ labels obtained while $k \le \tilde d_f$ are consistent with the target function.

Lemma 52 Consider running Meta-Algorithm 3 with a budget $n \in \mathbb{N}$, while $f$ is the target function and $\mathcal{P}$ is the data distribution. There is an event $\tilde H^{(ii)}_n$ and $(\mathbb{C},\mathcal{P},f,\gamma)$-dependent constants $\tilde c^{(ii)}_1, \tilde c^{(ii)}_2 \in [1,\infty)$ such that, for any $\delta \in (0,1)$, if $n \ge \tilde c^{(ii)}_1 \ln\big(\tilde c^{(ii)}_2/\delta\big)$, then $P\big(\tilde H^{(i)}_n(\delta) \setminus \tilde H^{(ii)}_n\big) \le \delta$, and on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, we have $V^{(\tilde d_f)} = V_{\hat m_{\tilde d_f}} = V^\star_{\hat m_{\tilde d_f}}$.

Proof Define $\tilde c^{(ii)}_1 = \max\Big\{ \tilde c^{(i)}_1, \frac{192d}{r_{(1-\gamma)/6}}, \frac{2^{11}}{\tilde\delta_f^{1/3}} \Big\}$, $\tilde c^{(ii)}_2 = \max\Big\{ \tilde c^{(i)}_2, \frac{8e}{r_{(1-\gamma)/6}}, c^{(ii)}, \exp\{\tau^*\} \Big\}$, let $n^{(ii)}(\delta) = \tilde c^{(ii)}_1 \ln\big(\tilde c^{(ii)}_2/\delta\big)$, suppose $n \ge n^{(ii)}(\delta)$, and define the event $\tilde H^{(ii)}_n = H^{(ii)}_{\tilde m_n}$.

By Lemma 41, since $n \ge n^{(ii)}(\delta) \ge 24\cdot\max\{\tau((1-\gamma)/6;\delta), \tau^*\}$, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, $\forall m \in \mathbb{N}$ and $k \in \{1,\ldots,\tilde d_f\}$ with either $k = 1$ or $m > \tilde m_n$,
\[ \hat\Delta^{(k)}_m\big(X_m,W_2,V^\star_{m-1}\big) < \gamma \implies \hat\Gamma^{(k)}_m\big(X_m,-f(X_m),W_2,V^\star_{m-1}\big) < \hat\Gamma^{(k)}_m\big(X_m,f(X_m),W_2,V^\star_{m-1}\big). \tag{72} \]
Recall that $\tilde m_n \le \min\{\lceil T_1/4 \rceil, 2^n\} = \lceil \lceil 2n/3 \rceil/4 \rceil$. Therefore, $V_{\tilde m_n}$ is obtained purely by $\tilde m_n$ executions of Step 8 while $k = 1$. Thus, for every $m$ obtained in Meta-Algorithm 3, either $k = 1$ or $m > \tilde m_n$. We now proceed by induction on $m$. We already know $V_0 = \mathbb{C} = V^\star_0$, so this serves as our base case. Now consider some value $m \in \mathbb{N}$ obtained in Meta-Algorithm 3 while $k \le \tilde d_f$, and suppose every $m' < m$ has $V_{m'} = V^\star_{m'}$.
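But this means that $T_k = T^\star_k$ and the value of $t$ upon obtaining this particular $m$ has $t \le \sum_{\ell=\hat m_{k-1}+1}^{m-1} I^\star_{\ell k}$. In particular, if $\hat\Delta^{(k)}_m(X_m,W_2,V_{m-1}) \ge \gamma$, then $I^\star_{mk} = 1$, so that $t < \sum_{\ell=\hat m_{k-1}+1}^{m} I^\star_{\ell k}$; by Lemma 51, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, $\sum_{\ell=\hat m_{k-1}+1}^{m} I^\star_{\ell k} \le \sum_{\ell=\hat m_{k-1}+1}^{\hat m_k} I^\star_{\ell k} \le 3T^\star_k/4$, so that $t < 3T^\star_k/4$, and therefore $\hat y = Y_m = f(X_m)$; this implies $V_m = V^\star_m$. On the other hand, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, if $\hat\Delta^{(k)}_m(X_m,W_2,V_{m-1}) < \gamma$, then (72) implies
\[ \hat y = \operatorname*{argmax}_{y \in \{-1,+1\}} \hat\Gamma^{(k)}_m(X_m,y,W_2,V_{m-1}) = f(X_m), \]
so that again $V_m = V^\star_m$. Thus, by the principle of induction, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, for every $m \in \mathbb{N}$ obtained while $k \le \tilde d_f$, we have $V_m = V^\star_m$; in particular, this implies $V^{(\tilde d_f)} = V_{\hat m_{\tilde d_f}} = V^\star_{\hat m_{\tilde d_f}}$. The bound on $P\big(\tilde H^{(i)}_n(\delta) \setminus \tilde H^{(ii)}_n\big)$ then follows from Lemma 41, as we have chosen $n^{(ii)}(\delta)$ sufficiently large so that (27) (with $\tau = \tilde m_n$) is at most $\delta$.

Lemma 53 Consider running Meta-Algorithm 3 with a budget $n \in \mathbb{N}$, while $f$ is the target function and $\mathcal{P}$ is the data distribution. There exist $(\mathbb{C},\mathcal{P},f,\gamma)$-dependent constants $\tilde c^{(iii)}_1, \tilde c^{(iii)}_2 \in [1,\infty)$ such that, for any $\delta \in (0,e^{-3})$, $\lambda \in [1,\infty)$, and $n \in \mathbb{N}$, there exists an event $\tilde H^{(iii)}_n(\delta,\lambda)$ having $P\big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \setminus \tilde H^{(iii)}_n(\delta,\lambda)\big) \le \delta$ with the property that, if
\[ n \ge \tilde c^{(iii)}_1\,\tilde\theta_f(d/\lambda)\ln^2\Big(\frac{\tilde c^{(iii)}_2\lambda}{\delta}\Big), \]
then on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(iii)}_n(\delta,\lambda)$, at the conclusion of Meta-Algorithm 3, $\big|\mathcal{L}_{\tilde d_f}\big| \ge \lambda$.

Proof Let $\tilde c^{(iii)}_1 = \max\Big\{ \tilde c^{(i)}_1, \tilde c^{(ii)}_1, \frac{d\cdot\tilde d_f\cdot 4^{10+2\tilde d_f}}{\gamma^3\tilde\delta_f^3}, \frac{192d}{r_{(3/32)}} \Big\}$, $\tilde c^{(iii)}_2 = \max\Big\{ \tilde c^{(i)}_2, \tilde c^{(ii)}_2, \frac{8e}{r_{(3/32)}} \Big\}$, fix any $\delta \in (0,e^{-3})$, $\lambda \in [1,\infty)$, let $n^{(iii)}(\delta,\lambda) = \tilde c^{(iii)}_1\tilde\theta_f(d/\lambda)\ln^2\big(\tilde c^{(iii)}_2\lambda/\delta\big)$, and suppose $n \ge n^{(iii)}(\delta,\lambda)$.

Define a sequence $\ell_i = 2^i$ for integers $i \ge 0$, and let $\hat\iota = \big\lceil \log_2\big(4^{2+\tilde d_f}\lambda/\gamma\tilde\delta_f\big) \big\rceil$. Also define $\tilde\phi(m,\delta,\lambda) = \max\{\phi(m;\delta/2\hat\iota), d/\lambda\}$, where $\phi$ is defined in Lemma 29.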
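Then define the events
\[ \tilde H^{(3)}(\delta,\lambda) = \bigcap_{i=1}^{\hat\iota} H_{\ell_i}(\delta/2\hat\iota), \qquad \tilde H^{(iii)}_n(\delta,\lambda) = \tilde H^{(3)}(\delta,\lambda) \cap \big\{\check m_{\tilde d_f} \ge \ell_{\hat\iota}\big\}. \]
Note that $\hat\iota \le n$, so that $\ell_{\hat\iota} \le 2^n$, and therefore the truncation in the definition of $\check m_{\tilde d_f}$, which enforces $\check m_{\tilde d_f} \le \max\big\{\tilde d_f\cdot 2^n + 1, \hat m_{k-1}\big\}$, will never be a factor in whether or not $\check m_{\tilde d_f} \ge \ell_{\hat\iota}$ is satisfied.

Since $n \ge n^{(iii)}(\lambda,\delta) \ge \tilde c^{(ii)}_1\ln\big(\tilde c^{(ii)}_2/\delta\big)$, Lemma 52 implies that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n$, $V_{\hat m_{\tilde d_f}} = V^\star_{\hat m_{\tilde d_f}}$. Recall that this implies that all $\hat y$ values obtained while $m \le \hat m_{\tilde d_f}$ are consistent with their respective $f(X_m)$ values, so that every such $m$ has $V_m = V^\star_m$ as well. In particular, $V_{\check m_{\tilde d_f}} = V^\star_{\check m_{\tilde d_f}}$. Also note that $n^{(iii)}(\delta,\lambda) \ge 24\cdot\tau^{(iv)}(\gamma/16;\delta)$, so that $\tau^{(iv)}(\gamma/16;\delta) \le \tilde m_n$, and recall we always have $\tilde m_n \le \check m_{\tilde d_f}$. Thus, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(iii)}_n(\delta,\lambda)$, (taking $\hat\Delta^{(k)}$ as in Meta-Algorithm 3)
\begin{align*}
\hat\Delta^{(\tilde d_f)} &= \hat\Delta^{(\tilde d_f)}_{\check m_{\tilde d_f}}\big(W_1,W_2,V^\star_{\check m_{\tilde d_f}}\big) && \text{(Lemma 52)} \\
&\le \mathcal{P}\big(x : p_x\big(\tilde d_f,\check m_{\tilde d_f}\big) \ge \gamma/8\big) + 4\check m_{\tilde d_f}^{-1} && \text{(Lemma 44)} \\
&\le \frac{8\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(V^\star_{\check m_{\tilde d_f}}\big)\big)}{\gamma\mathcal{P}^{\tilde d_f - 1}\big(\mathcal{S}^{\tilde d_f - 1}\big(V^\star_{\check m_{\tilde d_f}}\big)\big)} + 4\check m_{\tilde d_f}^{-1} && \text{(Markov's ineq.)} \\
&\le \big(8/\gamma\tilde\delta_f\big)\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(V^\star_{\check m_{\tilde d_f}}\big)\big) + 4\check m_{\tilde d_f}^{-1} && \text{(Lemma 35)} \\
&\le \big(8/\gamma\tilde\delta_f\big)\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(V^\star_{\ell_{\hat\iota}}\big)\big) + 4\ell_{\hat\iota}^{-1} && \text{(defn of } \tilde H^{(iii)}_n(\delta,\lambda)\text{)} \\
&\le \big(8/\gamma\tilde\delta_f\big)\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(B\big(f,\tilde\phi(\ell_{\hat\iota},\delta,\lambda)\big)\big)\big) + 4\ell_{\hat\iota}^{-1} && \text{(Lemma 29)} \\
&\le \big(8/\gamma\tilde\delta_f\big)\tilde\theta_f(d/\lambda)\tilde\phi(\ell_{\hat\iota},\delta,\lambda) + 4\ell_{\hat\iota}^{-1} && \text{(defn of } \tilde\theta_f(d/\lambda)\text{)} \\
&\le \big(12/\gamma\tilde\delta_f\big)\tilde\theta_f(d/\lambda)\tilde\phi(\ell_{\hat\iota},\delta,\lambda) && \big(\tilde\phi(\ell_{\hat\iota},\delta,\lambda) \ge \ell_{\hat\iota}^{-1}\big) \\
&= \frac{12\tilde\theta_f(d/\lambda)}{\gamma\tilde\delta_f}\max\Big\{ 2\,\frac{d\ln\big(2e\max\{\ell_{\hat\iota},d\}/d\big) + \ln(4\hat\iota/\delta)}{\ell_{\hat\iota}}, d/\lambda \Big\}. \tag{73}
\end{align*}
Plugging in the definition of $\hat\iota$ and $\ell_{\hat\iota}$,
\[ \frac{d\ln\big(2e\max\{\ell_{\hat\iota},d\}/d\big) + \ln(4\hat\iota/\delta)}{\ell_{\hat\iota}} \le (d/\lambda)\gamma\tilde\delta_f 4^{-1-\tilde d_f}\ln\big(4^{1+\tilde d_f}\lambda/\delta\gamma\tilde\delta_f\big) \le (d/\lambda)\ln(\lambda/\delta). \]
Therefore, (73) is at most $24\tilde\theta_f(d/\lambda)(d/\lambda)\ln(\lambda/\delta)/\gamma\tilde\delta_f$.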
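Thus, since
\[ n^{(iii)}(\delta,\lambda) \ge \max\Big\{ \tilde c^{(i)}_1\ln\big(\tilde c^{(i)}_2/\delta\big), \tilde c^{(ii)}_1\ln\big(\tilde c^{(ii)}_2/\delta\big) \Big\}, \]
Lemmas 51 and 52 imply that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(iii)}_n(\delta,\lambda)$,
\[ \big|\mathcal{L}_{\tilde d_f}\big| = \Big\lfloor T^\star_{\tilde d_f}\big/\big(3\hat\Delta^{(\tilde d_f)}\big) \Big\rfloor \ge \Big\lfloor 4^{1-\tilde d_f}2n\big/\big(9\hat\Delta^{(\tilde d_f)}\big) \Big\rfloor \ge \frac{4^{1-\tilde d_f}\gamma\tilde\delta_f n}{9\cdot 24\cdot\tilde\theta_f(d/\lambda)(d/\lambda)\ln(\lambda/\delta)} \ge \lambda\ln(\lambda/\delta) \ge \lambda. \]
Now we turn to bounding $P\big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \setminus \tilde H^{(iii)}_n(\delta,\lambda)\big)$. By a union bound, we have
\[ 1 - P\big(\tilde H^{(3)}(\delta,\lambda)\big) \le \sum_{i=1}^{\hat\iota}\big(1 - P(H_{\ell_i}(\delta/2\hat\iota))\big) \le \delta/2. \tag{74} \]
Thus, it remains only to bound $P\big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \big\{\check m_{\tilde d_f} < \ell_{\hat\iota}\big\}\big)$.

For each $i \in \{0,1,\ldots,\hat\iota - 1\}$, let $\check Q_i = \big|\big\{ m \in (\ell_i,\ell_{i+1}] \cap \check U_{\tilde d_f} : I^\star_{m\tilde d_f} = 1 \big\}\big|$. Now consider the set $\mathcal{I}$ of all $i \in \{0,1,\ldots,\hat\iota - 1\}$ with $\ell_i \ge \tilde m_n$ and $(\ell_i,\ell_{i+1}] \cap \check U_{\tilde d_f} \ne \emptyset$. Note that $n^{(iii)}(\delta,\lambda) \ge 48$, so that $\ell_0 < \tilde m_n$. Fix any $i \in \mathcal{I}$. Since $n^{(iii)}(\lambda,\delta) \ge 24\cdot\tau(1/6;\delta)$, we have $\tilde m_n \ge \tau(1/6;\delta)$, so that Lemma 49 implies that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda)$, letting $\bar Q = 2\cdot 4^{6+\tilde d_f}\big(d/\gamma^2\tilde\delta_f^2\big)\tilde\theta_f(d/\lambda)\ln(\lambda/\delta)$,
\[ P\Big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \big\{\check Q_i > \bar Q\big\} \,\Big|\, W_2,V^\star_{\ell_i}\Big) \le P\Big(\Big|\Big\{ m \in (\ell_i,\ell_{i+1}] \cap \mathbb{N} : \hat\Delta^{(\tilde d_f)}_m\big(X_m,W_2,V^\star_{\ell_i}\big) \ge 2\gamma/3 \Big\}\Big| > \bar Q \,\Big|\, W_2,V^\star_{\ell_i}\Big). \tag{75} \]
For $m > \ell_i$, the variables $\mathbb{1}_{[2\gamma/3,\infty)}\big(\hat\Delta^{(\tilde d_f)}_m\big(X_m,W_2,V^\star_{\ell_i}\big)\big)$ are conditionally (given $W_2,V^\star_{\ell_i}$) independent, each with respective conditional distribution Bernoulli with mean $\bar p_{2\gamma/3}\big(\tilde d_f,\ell_i,m\big)$. Since $n^{(iii)}(\delta,\lambda) \ge 24\cdot\tau(3/32;\delta)$, we have $\tilde m_n \ge \tau(3/32;\delta)$, so that Lemma 43 (with $\zeta = 2\gamma/3$, $\alpha = 3/4$, and $\beta = 3/32$) implies that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda)$, each of these $m$ values has
\begin{align*}
\bar p_{2\gamma/3}\big(\tilde d_f,\ell_i,m\big) &\le \mathcal{P}\big(x : p_x\big(\tilde d_f,\ell_i\big) \ge \gamma/2\big) + \exp\big\{-\tilde M(m)\gamma^2/256\big\} \\
&\le \frac{2\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(V^\star_{\ell_i}\big)\big)}{\gamma\mathcal{P}^{\tilde d_f - 1}\big(\mathcal{S}^{\tilde d_f - 1}\big(V^\star_{\ell_i}\big)\big)} + \exp\big\{-\tilde M(\ell_i)\gamma^2/256\big\} && \text{(Markov's ineq.)} \\
&\le \big(2/\gamma\tilde\delta_f\big)\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(V^\star_{\ell_i}\big)\big) + \exp\big\{-\tilde M(\ell_i)\gamma^2/256\big\} && \text{(Lemma 35)} \\
&\le \big(2/\gamma\tilde\delta_f\big)\mathcal{P}^{\tilde d_f}\big(\mathcal{S}^{\tilde d_f}\big(B\big(f,\tilde\phi(\ell_i,\delta,\lambda)\big)\big)\big) + \exp\big\{-\tilde M(\ell_i)\gamma^2/256\big\} && \text{(Lemma 29)} \\
&\le \big(2/\gamma\tilde\delta_f\big)\tilde\theta_f(d/\lambda)\tilde\phi(\ell_i,\delta,\lambda) + \exp\big\{-\tilde M(\ell_i)\gamma^2/256\big\} && \text{(defn of } \tilde\theta_f(d/\lambda)\text{)}.
\end{align*}
Denote the expression in this last line by $p_i$, and let $B(\ell_i,p_i)$ be a Binomial$(\ell_i,p_i)$ random variable. Noting that $\ell_{i+1} - \ell_i = \ell_i$, we have that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda)$, (75) is at most $P\big(B(\ell_i,p_i) > \bar Q\big)$. Next, note that
\[ \ell_i p_i = \big(2/\gamma\tilde\delta_f\big)\tilde\theta_f(d/\lambda)\ell_i\tilde\phi(\ell_i,\delta,\lambda) + \ell_i\cdot\exp\big\{-\ell_i^3\tilde\delta_f\gamma^2/512\big\}. \]
Since $u\cdot\exp\{-u^3\} \le (3e)^{-1/3}$ for any $u$, letting $u = \ell_i\tilde\delta_f\gamma/8$ we have
\[ \ell_i\cdot\exp\big\{-\ell_i^3\tilde\delta_f\gamma^2/512\big\} \le \big(8/\gamma\tilde\delta_f\big)u\cdot\exp\{-u^3\} \le 8\big/\big(\gamma\tilde\delta_f(3e)^{1/3}\big) \le 4/\gamma\tilde\delta_f. \]
Therefore, since $\tilde\phi(\ell_i,\delta,\lambda) \ge \ell_i^{-1}$, we have that $\ell_i p_i$ is at most
\begin{align*}
\frac{6}{\gamma\tilde\delta_f}\tilde\theta_f(d/\lambda)\ell_i\tilde\phi(\ell_i,\delta,\lambda)
&\le \frac{6}{\gamma\tilde\delta_f}\tilde\theta_f(d/\lambda)\max\Big\{ 2d\ln(2e\ell_{\hat\iota}) + 2\ln\Big(\frac{4\hat\iota}{\delta}\Big), \ell_{\hat\iota}d/\lambda \Big\} \\
&\le \frac{6}{\gamma\tilde\delta_f}\tilde\theta_f(d/\lambda)\max\Big\{ 2d\ln\Big(\frac{4^{3+\tilde d_f}e\lambda}{\gamma\tilde\delta_f}\Big) + 2\ln\Big(\frac{4^{3+\tilde d_f}2\lambda}{\gamma\tilde\delta_f\delta}\Big), \frac{d4^{3+\tilde d_f}}{\gamma\tilde\delta_f} \Big\} \\
&\le \frac{6}{\gamma\tilde\delta_f}\tilde\theta_f(d/\lambda)\max\Big\{ 4d\ln\Big(\frac{4^{3+\tilde d_f}\lambda}{\gamma\tilde\delta_f\delta}\Big), \frac{d4^{3+\tilde d_f}}{\gamma\tilde\delta_f} \Big\} \\
&\le \frac{6}{\gamma\tilde\delta_f}\tilde\theta_f(d/\lambda)\cdot\frac{d4^{4+\tilde d_f}}{\gamma\tilde\delta_f}\ln\Big(\frac{\lambda}{\delta}\Big)
\le \frac{4^{6+\tilde d_f}d}{\gamma^2\tilde\delta_f^2}\tilde\theta_f(d/\lambda)\ln\Big(\frac{\lambda}{\delta}\Big) = \bar Q/2.
\end{align*}
Therefore, a Chernoff bound implies $P\big(B(\ell_i,p_i) > \bar Q\big) \le \exp\{-\bar Q/6\} \le \delta/2\hat\iota$, so that on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda)$, (75) is at most $\delta/2\hat\iota$. The law of total probability implies there exists an event $\tilde H^{(4)}_n(i,\delta,\lambda)$ with $P\big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \setminus \tilde H^{(4)}_n(i,\delta,\lambda)\big) \le \delta/2\hat\iota$ such that, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \tilde H^{(4)}_n(i,\delta,\lambda)$, $\check Q_i \le \bar Q$. Note that
\begin{align*}
\hat\iota\bar Q &\le \log_2\big(4^{2+\tilde d_f}\lambda/\gamma\tilde\delta_f\big)\cdot 4^{7+\tilde d_f}\big(d/\gamma^2\tilde\delta_f^2\big)\tilde\theta_f(d/\lambda)\ln(\lambda/\delta) \\
&\le \big(\tilde d_f 4^{9+\tilde d_f}/\gamma^3\tilde\delta_f^3\big)d\,\tilde\theta_f(d/\lambda)\ln^2(\lambda/\delta) \le 4^{1-\tilde d_f}n/12. \tag{76}
\end{align*}
Since $\sum_{m \le 2\tilde m_n} I^\star_{m\tilde d_f} \le n/12$, if $\tilde d_f = 1$ then (76) implies that on the event $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \bigcap_{i \in \mathcal{I}}\tilde H^{(4)}_n(i,\delta,\lambda)$, $\sum_{m \le \ell_{\hat\iota}} I^\star_{m1} \le n/12 + \sum_{i \in \mathcal{I}}\check Q_i \le n/12 + \hat\iota\bar Q \le n/6 \le \lceil T^\star_1/4 \rceil$, so that $\check m_1 \ge \ell_{\hat\iota}$.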
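Otherwise, if $\tilde d_f > 1$, then every $m \in \check U_{\tilde d_f}$ has $m > 2\tilde m_n$, so that $\sum_{i \le \hat\iota}\check Q_i = \sum_{i \in \mathcal{I}}\check Q_i$; thus, on $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \bigcap_{i \in \mathcal{I}}\tilde H^{(4)}_n(i,\delta,\lambda)$, $\sum_{i \in \mathcal{I}}\check Q_i \le \hat\iota\bar Q \le 4^{1-\tilde d_f}n/12$; Lemma 50 implies $4^{1-\tilde d_f}n/12 \le \big\lceil T^\star_{\tilde d_f}/4 \big\rceil$, so that again we have $\check m_{\tilde d_f} \ge \ell_{\hat\iota}$. Combined with a union bound, this implies
\begin{align*}
P\Big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \cap \big\{\check m_{\tilde d_f} < \ell_{\hat\iota}\big\}\Big)
&\le P\Big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \setminus \bigcap_{i \in \mathcal{I}}\tilde H^{(4)}_n(i,\delta,\lambda)\Big) \\
&\le \sum_{i \in \mathcal{I}} P\Big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(3)}(\delta,\lambda) \setminus \tilde H^{(4)}_n(i,\delta,\lambda)\Big) \le \delta/2. \tag{77}
\end{align*}
Therefore, $P\big(\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \setminus \tilde H^{(iii)}_n(\delta,\lambda)\big) \le \delta$, obtained by summing (77) and (74).

Proof [Theorem 16] If $\Lambda_p(\varepsilon/4,f,\mathcal{P}) = \infty$ then the result trivially holds. Otherwise, suppose $\varepsilon \in (0,10e^{-3})$, let $\delta = \varepsilon/10$, $\lambda = \Lambda_p(\varepsilon/4,f,\mathcal{P})$, $\tilde c_2 = \max\big\{10\tilde c^{(i)}_2, 10\tilde c^{(ii)}_2, 10\tilde c^{(iii)}_2, 10e(d+1)\big\}$, and $\tilde c_1 = \max\big\{\tilde c^{(i)}_1, \tilde c^{(ii)}_1, \tilde c^{(iii)}_1, 2\cdot 6^3(d+1)\tilde d_f\ln(e(d+1))\big\}$, and consider running Meta-Algorithm 3 with passive algorithm $A_p$ and budget $n \ge \tilde c_1\tilde\theta_f(d/\lambda)\ln^2(\tilde c_2\lambda/\varepsilon)$, while $f$ is the target function and $\mathcal{P}$ is the data distribution. On the event $\tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(iii)}_n(\delta,\lambda)$, Lemma 53 implies $\big|\mathcal{L}_{\tilde d_f}\big| \ge \lambda$, while Lemma 52 implies $V^{(\tilde d_f)} = V^\star_{\hat m_{\tilde d_f}}$; recalling that Lemma 35 implies that $V^\star_{\hat m_{\tilde d_f}} \ne \emptyset$ on this event, we must have $\mathrm{er}_{\mathcal{L}_{\tilde d_f}}(f) = 0$. Furthermore, if $\hat h$ is the classifier returned by Meta-Algorithm 3, then Lemma 34 implies that $\mathrm{er}(\hat h)$ is at most $2\,\mathrm{er}\big(A_p\big(\mathcal{L}_{\tilde d_f}\big)\big)$, on a high probability event (call it $\hat E_2$ in this context). Letting $\hat E_3(\delta) = \hat E_2 \cap \tilde H^{(i)}_n(\delta) \cap \tilde H^{(ii)}_n \cap \tilde H^{(iii)}_n(\delta,\lambda)$, a union bound implies the total failure probability $1 - P\big(\hat E_3(\delta)\big)$ from all of these events is at most $4\delta + e(d+1)\cdot\exp\big\{-\lfloor n/3 \rfloor/\big(72\tilde d_f(d+1)\ln(e(d+1))\big)\big\} \le 5\delta = \varepsilon/2$.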
Since, for $\ell \in \mathbb{N}$ with $P\big(\big|\mathcal{L}_{\tilde d_f}\big| = \ell\big) > 0$, the sequence of $X_m$ values appearing in $\mathcal{L}_{\tilde d_f}$ are conditionally distributed as $\mathcal{P}^\ell$ given $\big|\mathcal{L}_{\tilde d_f}\big| = \ell$, and this is the same as the (unconditional) distribution of $\{X_1,X_2,\ldots,X_\ell\}$, we have that
\begin{align*}
E\big[\mathrm{er}\big(\hat h\big)\big] &\le E\Big[2\,\mathrm{er}\big(A_p\big(\mathcal{L}_{\tilde d_f}\big)\big)\mathbb{1}_{\hat E_3(\delta)}\Big] + \varepsilon/2 = E\Big[E\Big[2\,\mathrm{er}\big(A_p\big(\mathcal{L}_{\tilde d_f}\big)\big)\mathbb{1}_{\hat E_3(\delta)} \,\Big|\, \big|\mathcal{L}_{\tilde d_f}\big|\Big]\Big] + \varepsilon/2 \\
&\le 2\sup_{\ell \ge \Lambda_p(\varepsilon/4,f,\mathcal{P})} E\big[\mathrm{er}(A_p(Z_\ell))\big] + \varepsilon/2 \le \varepsilon.
\end{align*}
To specialize to the specific variant of Meta-Algorithm 3 stated in Section 5.2, take $\gamma = 1/2$.

Appendix E. Proofs Related to Section 6: Agnostic Learning

This appendix contains the proofs of our results on learning with noise. Specifically, Appendix E.1 provides the proof of the counterexample from Theorem 22, demonstrating that there is no activizer for the $\check A_p$ passive learning algorithm described in Section 6.2 in the agnostic case. Appendix E.2 presents the proof of Lemma 26 from Section 6.7, bounding the label complexity of Algorithm 5 under Condition 1. Finally, Appendix E.3 presents a proof of Theorem 28, demonstrating that any active learning algorithm can be modified to trivialize the misspecified model case. The notation used throughout Appendix E is taken from Section 6.

E.1 Proof of Theorem 22: Negative Result for Agnostic Activized Learning

It suffices to show that $\check A_p$ achieves a label complexity $\Lambda_p$ such that, for any label complexity $\Lambda_a$ achieved by any active learning algorithm $A_a$, there exists a distribution $P_{XY}$ on $\mathcal{X}\times\{-1,+1\}$ such that $P_{XY} \in \mathrm{Nontrivial}(\Lambda_p;\mathbb{C})$ and yet $\Lambda_a(\nu + c\varepsilon,P_{XY}) \ne o(\Lambda_p(\nu + \varepsilon,P_{XY}))$ for every constant $c \in (0,\infty)$. Specifically, we will show that there is a distribution $P_{XY}$ for which $\Lambda_p(\nu + \varepsilon,P_{XY}) = \Theta(1/\varepsilon)$ and $\Lambda_a(\nu + \varepsilon,P_{XY}) \ne o(1/\varepsilon)$.

Let $\mathcal{P}(\{0\}) = 1/2$, and for any measurable $A \subseteq (0,1]$, $\mathcal{P}(A) = \lambda(A)/2$, where $\lambda$ is Lebesgue measure.
Let $\mathbb{D}$ be the family of distributions $P_{XY}$ on $\mathcal{X}\times\{-1,+1\}$ characterized by the properties that the marginal distribution on $\mathcal{X}$ is $\mathcal{P}$, $\eta(0;P_{XY}) \in (1/8,3/8)$, and $\forall x \in (0,1]$,
\[ \eta(x;P_{XY}) = \eta(0;P_{XY}) + (x/2)\cdot(1 - \eta(0;P_{XY})). \]
Thus, $\eta(x;P_{XY})$ is a linear function. For any $P_{XY} \in \mathbb{D}$, since the point $z^* = \frac{1 - 2\eta(0;P_{XY})}{1 - \eta(0;P_{XY})}$ has $\eta(z^*;P_{XY}) = 1/2$, we see that $f = h_{z^*}$ is a Bayes optimal classifier. Furthermore, for any $\eta_0 \in [1/8,3/8]$,
\[ \Big| \frac{1 - 2\eta_0}{1 - \eta_0} - \frac{1 - 2\eta(0;P_{XY})}{1 - \eta(0;P_{XY})} \Big| = \frac{|\eta(0;P_{XY}) - \eta_0|}{(1 - \eta_0)(1 - \eta(0;P_{XY}))}, \]
and since $(1 - \eta_0)(1 - \eta(0;P_{XY})) \in (25/64,49/64) \subset (1/3,1)$, the value $z = \frac{1 - 2\eta_0}{1 - \eta_0}$ satisfies
\[ |\eta_0 - \eta(0;P_{XY})| \le |z - z^*| \le 3|\eta_0 - \eta(0;P_{XY})|. \tag{78} \]
Also note that under $P_{XY}$, since $(1 - 2\eta(0;P_{XY})) = (1 - \eta(0;P_{XY}))z^*$, any $z \in (0,1)$ has
\begin{align*}
\mathrm{er}(h_z) - \mathrm{er}(h_{z^*}) &= \int_z^{z^*} \big(1 - 2\eta(x;P_{XY})\big)\,dx = \int_z^{z^*} \big(1 - 2\eta(0;P_{XY}) - x(1 - \eta(0;P_{XY}))\big)\,dx \\
&= (1 - \eta(0;P_{XY}))\int_z^{z^*} (z^* - x)\,dx = \frac{(1 - \eta(0;P_{XY}))}{2}(z^* - z)^2,
\end{align*}
so that
\[ \frac{5}{16}(z - z^*)^2 \le \mathrm{er}(h_z) - \mathrm{er}(h_{z^*}) \le \frac{7}{16}(z - z^*)^2. \tag{79} \]
Finally, note that any $x,x' \in (0,1]$ with $|x - z^*| < |x' - z^*|$ has
\[ |1 - 2\eta(x;P_{XY})| = |x - z^*|(1 - \eta(0;P_{XY})) < |x' - z^*|(1 - \eta(0;P_{XY})) = |1 - 2\eta(x';P_{XY})|. \]
Thus, for any $q \in (0,1/2]$, there exists $z'_q \in [0,1]$ such that $z^* \in [z'_q, z'_q + 2q] \subseteq [0,1]$, and the classifier $h'_q(x) = h_{z^*}(x)\cdot\big(1 - 2\mathbb{1}_{(z'_q,z'_q+2q]}(x)\big)$ has $\mathrm{er}(h) \ge \mathrm{er}(h'_q)$ for every classifier $h$ with $h(0) = -1$ and $\mathcal{P}(x : h(x) \ne h_{z^*}(x)) = q$. Noting that $\mathrm{er}(h'_q) - \mathrm{er}(h_{z^*}) = \big(\lim_{z \downarrow z'_q}\mathrm{er}(h_z) - \mathrm{er}(h_{z^*})\big) + \big(\mathrm{er}(h_{z'_q+2q}) - \mathrm{er}(h_{z^*})\big)$, (79) implies that $\mathrm{er}(h'_q) - \mathrm{er}(h_{z^*}) \ge \frac{5}{16}\big((z'_q - z^*)^2 + (z'_q + 2q - z^*)^2\big)$, and since $\max\{z^* - z'_q, z'_q + 2q - z^*\} \ge q$, this is at least $\frac{5}{16}q^2$. In general, any $h$ with $h(0) = +1$ has $\mathrm{er}(h) - \mathrm{er}(h_{z^*}) \ge 1/2 - \eta(0;P_{XY}) > 1/8 \ge (1/8)\mathcal{P}(x : h(x) \ne h_{z^*}(x))^2$. Combining these facts, we see that any classifier $h$ has
\[ \mathrm{er}(h) - \mathrm{er}(h_{z^*}) \ge (1/8)\mathcal{P}(x : h(x) \ne h_{z^*}(x))^2. \]
(80)

Lemma 54 The passive learning algorithm $\check A_p$ achieves a label complexity $\Lambda_p$ such that, for every $P_{XY} \in \mathbb{D}$, $\Lambda_p(\nu + \varepsilon,P_{XY}) = \Theta(1/\varepsilon)$.

Proof Consider the values $\hat\eta_0$ and $\hat z$ from $\check A_p(Z_n)$ for some $n \in \mathbb{N}$. Combining (78) and (79), we have $\mathrm{er}(h_{\hat z}) - \mathrm{er}(h_{z^*}) \le \frac{7}{16}(\hat z - z^*)^2 \le \frac{63}{16}(\hat\eta_0 - \eta(0;P_{XY}))^2 \le 4(\hat\eta_0 - \eta(0;P_{XY}))^2$. Let $N_n = |\{i \in \{1,\ldots,n\} : X_i = 0\}|$, and $\bar\eta_0 = N_n^{-1}|\{i \in \{1,\ldots,n\} : X_i = 0, Y_i = +1\}|$ if $N_n > 0$, or $\bar\eta_0 = 0$ if $N_n = 0$. Note that $\hat\eta_0 = \big(\bar\eta_0 \vee \frac{1}{8}\big) \wedge \frac{3}{8}$, and since $\eta(0;P_{XY}) \in (1/8,3/8)$, we have $|\hat\eta_0 - \eta(0;P_{XY})| \le |\bar\eta_0 - \eta(0;P_{XY})|$. Therefore, for any $P_{XY} \in \mathbb{D}$,
\begin{align*}
E\big[\mathrm{er}(h_{\hat z}) - \mathrm{er}(h_{z^*})\big] &\le 4E\big[(\hat\eta_0 - \eta(0;P_{XY}))^2\big] \le 4E\big[(\bar\eta_0 - \eta(0;P_{XY}))^2\big] \\
&\le 4E\Big[E\big[(\bar\eta_0 - \eta(0;P_{XY}))^2 \,\big|\, N_n\big]\mathbb{1}_{[n/4,n]}(N_n)\Big] + 4P(N_n < n/4). \tag{81}
\end{align*}
By a Chernoff bound, $P(N_n < n/4) \le \exp\{-n/16\}$, and since the conditional distribution of $N_n\bar\eta_0$ given $N_n$ is Binomial$(N_n,\eta(0;P_{XY}))$, (81) is at most
\[ 4E\Big[\frac{1}{N_n \vee n/4}\,\eta(0;P_{XY})(1 - \eta(0;P_{XY}))\Big] + 4\cdot\exp\{-n/16\} \le 4\cdot\frac{4}{n}\cdot\frac{15}{64} + 4\cdot\frac{16}{n} < \frac{68}{n}. \]
For any $n \ge \lceil 68/\varepsilon \rceil$, this is at most $\varepsilon$. Therefore, $\check A_p$ achieves a label complexity $\Lambda_p$ such that, for any $P_{XY} \in \mathbb{D}$, $\Lambda_p(\nu + \varepsilon,P_{XY}) = \lceil 68/\varepsilon \rceil = \Theta(1/\varepsilon)$.

Next we establish a corresponding lower bound for any active learning algorithm. Note that this requires more than a simple minimax lower bound, since we must have an asymptotic lower bound for a fixed $P_{XY}$, rather than selecting a different $P_{XY}$ for each $\varepsilon$ value; this is akin to the strong minimax lower bounds proven by Antos and Lugosi (1998) for passive learning in the realizable case. For this, we proceed by reduction from the task of estimating a binomial mean; toward this end, the following lemma will be useful.

Lemma 55 For any nonempty $(a,b) \subset [0,1]$, and any sequence of estimators $\hat p_n : \{0,1\}^n \to [0,1]$, there exists $p \in (a,b)$ such that, if $B_1,B_2,\ldots$ are independent Bernoulli$(p)$ random variables, also independent from every $\hat p_n$, then $E\big[(\hat p_n(B_1,\ldots,B_n) - p)^2\big] \ne o(1/n)$.

Proof We first establish the claim when $a = 0$ and $b = 1$. For any $p \in [0,1]$, let $B_1(p),B_2(p),\ldots$ be i.i.d. Bernoulli$(p)$ random variables, independent from any internal randomness of the $\hat p_n$ estimators. We proceed by reduction from hypothesis testing, for which there are known lower bounds. Specifically, it is known (e.g., Wald, 1945; Bar-Yossef, 2003) that for any $p,q \in (0,1)$, $\delta \in (0,e^{-1})$, any (possibly randomized) $\hat q : \{0,1\}^n \to \{p,q\}$, and any $n \in \mathbb{N}$,
\[ n < \frac{(1 - 8\delta)\ln(1/8\delta)}{8\mathrm{KL}(p\|q)} \implies \max_{p^* \in \{p,q\}} P\big(\hat q(B_1(p^*),\ldots,B_n(p^*)) \ne p^*\big) > \delta, \]
where $\mathrm{KL}(p\|q) = p\ln(p/q) + (1-p)\ln((1-p)/(1-q))$. It is also known (e.g., Poland and Hutter, 2006) that for $p,q \in [1/4,3/4]$, $\mathrm{KL}(p\|q) \le (8/3)(p-q)^2$. Combining this with the above fact, we have that for $p,q \in [1/4,3/4]$,
\[ \max_{p^* \in \{p,q\}} P\big(\hat q(B_1(p^*),\ldots,B_n(p^*)) \ne p^*\big) \ge (1/16)\cdot\exp\big\{-128(p-q)^2 n/3\big\}. \tag{82} \]
Given the estimator $\hat p_n$ from the lemma statement, we construct a sequence of hypothesis tests as follows. For $i \in \mathbb{N}$, let $\alpha_i = \exp\{-2^i\}$ and $n_i = \lfloor 1/\alpha_i^2 \rfloor$. Define $p^*_0 = 1/4$, and for $i \in \mathbb{N}$, inductively define $\hat q_i(b_1,\ldots,b_{n_i}) = \operatorname{argmin}_{p \in \{p^*_{i-1},\,p^*_{i-1}+\alpha_i\}} |\hat p_{n_i}(b_1,\ldots,b_{n_i}) - p|$ for $b_1,\ldots,b_{n_i} \in \{0,1\}$, and $p^*_i = \operatorname{argmax}_{p \in \{p^*_{i-1},\,p^*_{i-1}+\alpha_i\}} P\big(\hat q_i(B_1(p),\ldots,B_{n_i}(p)) \ne p\big)$. Finally, define $p^* = \lim_{i\to\infty} p^*_i$. Note that $\forall i \in \mathbb{N}$, $p^*_i < 1/2$, $p^*_{i-1}, p^*_{i-1}+\alpha_i \in [1/4,3/4]$, and $0 \le p^* - p^*_i \le \sum_{j=i+1}^{\infty}\alpha_j < 2\alpha_{i+1} = 2\alpha_i^2$. We generally have
\begin{align*}
E\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*)^2\big] &\ge \frac{1}{3}E\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] - (p^* - p^*_i)^2 \\
&\ge \frac{1}{3}E\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] - 4\alpha_i^4.
\end{align*}
Furthermore, note that for any $m \in \{0,\ldots,n_i\}$,
\[ \frac{(p^*)^m(1-p^*)^{n_i-m}}{(p^*_i)^m(1-p^*_i)^{n_i-m}} \ge \Big(\frac{1-p^*}{1-p^*_i}\Big)^{n_i} \ge \Big(\frac{1-p^*_i-2\alpha_i^2}{1-p^*_i}\Big)^{n_i} \ge \big(1-4\alpha_i^2\big)^{n_i} \ge \exp\big\{-8\alpha_i^2 n_i\big\} \ge e^{-8}, \]
so that the probability mass function of $(B_1(p^*),\ldots,B_{n_i}(p^*))$ is never smaller than $e^{-8}$ times that of $(B_1(p^*_i),\ldots,B_{n_i}(p^*_i))$, which implies (by the law of the unconscious statistician)
\[ E\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] \ge e^{-8}E\big[(\hat p_{n_i}(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) - p^*_i)^2\big]. \]
By a triangle inequality, we have
\[ E\big[(\hat p_{n_i}(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) - p^*_i)^2\big] \ge \frac{\alpha_i^2}{4}P\big(\hat q_i(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) \ne p^*_i\big). \]
By (82), this is at least
\[ \frac{\alpha_i^2}{4}(1/16)\cdot\exp\big\{-128\alpha_i^2 n_i/3\big\} \ge 2^{-6}e^{-43}\alpha_i^2. \]
Combining the above, we have
\[ E\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*)^2\big] \ge 3^{-1}2^{-6}e^{-51}\alpha_i^2 - 4\alpha_i^4 \ge 2^{-9}e^{-51}n_i^{-1} - 4n_i^{-2}. \]
For $i \ge 5$, this is larger than $2^{-11}e^{-51}n_i^{-1}$. Since $n_i$ diverges as $i \to \infty$, we have that
\[ E\big[(\hat p_n(B_1(p^*),\ldots,B_n(p^*)) - p^*)^2\big] \ne o(1/n), \]
which establishes the result for $a = 0$ and $b = 1$.

To extend this result to general nonempty ranges $(a,b)$, we proceed by reduction from the above problem. Specifically, suppose $p' \in (0,1)$, and consider the following independent random variables (also independent from the $B_i(p')$ variables and $\hat p_n$ estimators). For each $i \in \mathbb{N}$, $C_{i1} \sim$ Bernoulli$(a)$, $C_{i2} \sim$ Bernoulli$((b-a)/(1-a))$. Then for $b_i \in \{0,1\}$, define $B'_i(b_i) = \max\{C_{i1}, C_{i2}\cdot b_i\}$. For any given $p' \in (0,1)$, the random variables $B'_i(B_i(p'))$ are i.i.d. Bernoulli$(p)$, with $p = a + (b-a)p' \in (a,b)$ (which forms a bijection between $(0,1)$ and $(a,b)$). Defining $\hat p'_n(b_1,\ldots,b_n) = (\hat p_n(B'_1(b_1),\ldots,B'_n(b_n)) - a)/(b-a)$, we have
\[ E\big[(\hat p_n(B_1(p),\ldots,B_n(p)) - p)^2\big] = (b-a)^2\cdot E\Big[\big(\hat p'_n(B_1(p'),\ldots,B_n(p')) - p'\big)^2\Big]. \tag{83} \]
We have already shown there exists a value of $p' \in (0,1)$ such that the right side of (83) is not $o(1/n)$. Therefore, the corresponding value of $p = a + (b-a)p' \in (a,b)$ has the left side of (83) not $o(1/n)$, which establishes the result.

We are now ready for the lower bound result for our setting.
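The range-extension step in the proof of Lemma 55 rests on one small computation: $\max\{C_{i1}, C_{i2}\cdot b_i\}$ maps a Bernoulli$(p')$ input to a Bernoulli$(a + (b-a)p')$ output. The following sketch, offered only as a sanity check and not as part of the proof, verifies this by exact enumeration over the eight outcomes, using rational arithmetic. The function name `transformed_mean` is illustrative, and the sketch assumes $a < 1$ so that $(b-a)/(1-a)$ is defined.

```python
from fractions import Fraction
from itertools import product


def bernoulli_pmf(q):
    """Probability mass function of a Bernoulli(q) variable as a dict."""
    return {1: q, 0: 1 - q}


def transformed_mean(a, b, p_prime):
    """Exact P(max{C1, C2 * B} = 1) for independent
    C1 ~ Bernoulli(a), C2 ~ Bernoulli((b - a) / (1 - a)), B ~ Bernoulli(p').
    Assumes 0 <= a < b <= 1 and a < 1.
    """
    c1 = bernoulli_pmf(a)
    c2 = bernoulli_pmf((b - a) / (1 - a))
    bit = bernoulli_pmf(p_prime)
    total = Fraction(0)
    for (v1, w1), (v2, w2), (v3, w3) in product(c1.items(), c2.items(), bit.items()):
        if max(v1, v2 * v3) == 1:
            total += w1 * w2 * w3
    return total


if __name__ == "__main__":
    a, b, p_prime = Fraction(1, 8), Fraction(3, 8), Fraction(1, 3)
    print(transformed_mean(a, b, p_prime), a + (b - a) * p_prime)
```

With $a = 1/8$ and $b = 3/8$, which is the range used for $\eta(0;P_{XY})$ in this appendix, the map $p' \mapsto a + (b-a)p'$ sends $(0,1)$ onto $(1/8,3/8)$, matching the bijection noted in the proof.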
Lemma 56 For any label complexity $\Lambda_a$ achieved by any active learning algorithm $A_a$, there exists a $P_{XY} \in \mathbb{D}$ such that $\Lambda_a(\nu + \varepsilon,P_{XY}) \ne o(1/\varepsilon)$.

Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $A_a$; we use $A_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1,B_2,\ldots,B_n$ i.i.d. Bernoulli$(p)$, for some $p \in (1/8,3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1,X_2,\ldots$ random variables i.i.d. with distribution $\mathcal{P}$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution Bernoulli$(X_i/2)$ given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well. We run $A_a$ with this sequence of $X_i$ values. For the $t$-th label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t,C_i\} - 1$ for $Y_i$. Note that in the latter case, the conditional distribution of $\max\{B_t,C_i\}$ is Bernoulli$(p + (1-p)X_i/2)$, given the $X_i$ that $A_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $P_{XY} \in \mathbb{D}$ with $\eta(0;P_{XY}) = p$ (i.e., $\eta(X_i;P_{XY}) = p + (1-p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and $X_j$ sequence, this is distributionally equivalent to running $A_a$ under the $P_{XY} \in \mathbb{D}$ with $\eta(0;P_{XY}) = p$.

Let $\hat h_n$ be the classifier returned by $A_a(n)$ in the above context, and let $\hat z_n$ denote the value of $z \in [2/5,6/7]$ with minimum $\mathcal{P}(x : h_z(x) \ne \hat h_n(x))$. Then define $\hat p_n = \frac{1 - \hat z_n}{2 - \hat z_n} \in [1/8,3/8]$ and $z^* = \frac{1 - 2p}{1 - p} \in (2/5,6/7)$.
By a triangle inequality, we have |ˆ n − z ∗ | = 2P(x : hzn (x) = hz ∗ (x)) ≤ ˆ 1−p ˆ n (x) = hz ∗ (x)). Combining this with (80) and (78) implies that 4P(x : h 1 ˆ ˆ er(hn ) − er(hz ∗ ) ≥ P x : hn (x) = hz ∗ (x) 8 2 ≥ 1 1 (ˆ n − z ∗ )2 ≥ z ( pn − p)2 . ˆ 128 128 (84) In particular, by Lemma 55, we can choose p ∈ (1/8, 3/8) so that E ( pn − p)2 = o(1/n), which, by ˆ ˆ (84), implies E er(hn ) − ν = o(1/n). This means there is an increasing infinite sequence of values ˆ nk ∈ N, and a constant c ∈ (0, ∞) such that ∀k ∈ N, E er(hnk ) − ν ≥ c/nk . Supposing Aa achieves label complexity Λa , and taking the values εk = c/(2nk ), we have Λa (ν + εk , PXY ) > nk = c/(2εk ). Since εk > 0 and approaches 0 as k → ∞, we have Λa (ν + ε , PXY ) = o(1/ε ). Proof [of Theorem 22] The result follows from Lemmas 54 and 56. E.2 Proof of Lemma 26: Label Complexity of Algorithm 5 The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5. As before, in this section we will fix a particular joint distribution PXY on X × {−1, +1} with marginal P on X , and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose PXY satisfies Condition 1 for some finite parameters µ and κ . We also fix any f ∈ cl(C(ε )). Furthermore, we will continue using the notation of ε >0 ⋆ Appendix B, such as S k (H), etc., and in particular we continue to denote Vm = {h ∈ C : ∀ℓ ≤ ⋆ m, h(Xℓ ) = f (Xℓ )} (though note that in this case, we may sometimes have f (Xℓ ) = Yℓ , so that Vm = C[Zm ]). As in the above proofs, we will prove a slightly more general result in which the “1/2” threshold in Step 5 can be replaced by an arbitrary constant γ ∈ (0, 1). ˆ For the estimators P4m used in the algorithm, we take the same definitions as in Appendix B.1. 
To be clear, we assume the sequences $W_1$ and $W_2$ mentioned there are independent from the entire $(X_1, Y_1), (X_2, Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $W_1$ and $W_2$ sequences can be constructed in a preprocessing step.

We will consider running Algorithm 5 with label budget $n \in \mathbb{N}$ and confidence parameter $\delta \in (0, e^{-3})$, and analyze properties of the internal sets $V_i$. We will denote by $\hat{V}_i$, $\hat{L}_i$, and $\hat{i}_k$, the final values of $V_i$, $L_i$, and $i_k$, respectively, for each $i$ and $k$ in Algorithm 5. We also denote by $\hat{m}^{(k)}$ and $\hat{V}^{(k)}$ the final values of $m$ and $V_{i_k + 1}$, respectively, obtained while $k$ has the specified value in Algorithm 5; $\hat{V}^{(k)}$ may be smaller than $\hat{V}_{\hat{i}_k}$ when $\hat{m}^{(k)}$ is not a power of 2. Additionally, define $L_i^\star = \{(X_m, Y_m)\}_{m = 2^{i-1} + 1}^{2^i}$. After establishing a few results concerning these, we will show that for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds. First, we have a few auxiliary definitions. For $H \subseteq C$, and any $i \in \mathbb{N}$, define
\[ \phi_i(H) = E \sup_{h_1, h_2 \in H} \big(\mathrm{er}(h_1) - \mathrm{er}_{L_i^\star}(h_1)\big) - \big(\mathrm{er}(h_2) - \mathrm{er}_{L_i^\star}(h_2)\big) \]
and
\[ \tilde{U}_i(H, \delta) = \min\left\{ \tilde{K} \left( \phi_i(H) + \sqrt{\mathrm{diam}(H) \frac{\ln(32 i^2/\delta)}{2^{i-1}}} + \frac{\ln(32 i^2/\delta)}{2^{i-1}} \right), 1 \right\}, \]
where for our purposes we can take $\tilde{K} = 8272$. It is known (see, e.g., Massart and Nédélec, 2006; Giné and Koltchinskii, 2006) that for some universal constant $c' \in [2, \infty)$,
\[ \phi_{i+1}(H) \le c' \max\left\{ \sqrt{\mathrm{diam}(H)\, 2^{-i} d \log_2 \frac{2}{\mathrm{diam}(H)}},\; 2^{-i} d i \right\}. \tag{85} \]
We also generally have $\phi_i(H) \le 2$ for every $i \in \mathbb{N}$. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.

Lemma 57 For any $\delta \in (0, e^{-3})$, any $H \subseteq C$ with $f \in \mathrm{cl}(H)$, and any $i \in \mathbb{N}$, on an event $K_i$ with $P(K_i) \ge 1 - \delta/4i^2$, $\forall h \in H$,
\[ \mathrm{er}_{L_i^\star}(h) - \min_{h' \in H} \mathrm{er}_{L_i^\star}(h') \le \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_i(H, \delta), \]
\[ \mathrm{er}(h) - \mathrm{er}(f) \le \mathrm{er}_{L_i^\star}(h) - \mathrm{er}_{L_i^\star}(f) + \hat{U}_i(H, \delta), \]
\[ \min\{\hat{U}_i(H, \delta), 1\} \le \tilde{U}_i(H, \delta). \]

Lemma 57 essentially follows from a version of Talagrand’s inequality.
The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011) and Koltchinskii (2010). The only minor twist here is that $f$ need only be in $\mathrm{cl}(H)$, rather than in $H$ itself, which easily follows from Koltchinskii’s original results, since the Borel-Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in H(\varepsilon)$ (very close to $f$) with $\mathrm{er}_{L_i^\star}(g) = \mathrm{er}_{L_i^\star}(f)$.

For our purposes, the important implications of Lemma 57 are summarized by the following lemma.

Lemma 58 For any $\delta \in (0, e^{-3})$ and any $n \in \mathbb{N}$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on an event $J_n(\delta)$ with $P(J_n(\delta)) \ge 1 - \delta/2$, $\forall i \in \{0, 1, \ldots, \hat{i}_{d+1}\}$, if $V_{2^i}^\star \subseteq \hat{V}_i$ then $\forall h \in \hat{V}_i$,
\[ \mathrm{er}_{L_{i+1}^\star}(h) - \min_{h' \in \hat{V}_i} \mathrm{er}_{L_{i+1}^\star}(h') \le \mathrm{er}(h) - \mathrm{er}(f) + \hat{U}_{i+1}(\hat{V}_i, \delta), \]
\[ \mathrm{er}(h) - \mathrm{er}(f) \le \mathrm{er}_{L_{i+1}^\star}(h) - \mathrm{er}_{L_{i+1}^\star}(f) + \hat{U}_{i+1}(\hat{V}_i, \delta), \]
\[ \min\{\hat{U}_{i+1}(\hat{V}_i, \delta), 1\} \le \tilde{U}_{i+1}(\hat{V}_i, \delta). \]

Proof For each $i$, consider applying Lemma 57 under the conditional distribution given $\hat{V}_i$. The set $L_{i+1}^\star$ is independent from $\hat{V}_i$, as are the Rademacher variables in the definition of $\hat{R}_{i+1}(\hat{V}_i)$. Furthermore, by Lemma 35, on $H'$, $f \in \mathrm{cl}(V_{2^i}^\star)$, so that the conditions of Lemma 57 hold. The law of total probability then implies the existence of an event $J_i$ of probability $P(J_i) \ge 1 - \delta/4(i + 1)^2$, on which the claimed inequalities hold for that value of $i$ if $i \le \hat{i}_{d+1}$. A union bound over values of $i$ then implies the existence of an event $J_n(\delta) = \bigcap_i J_i$ with probability $P(J_n(\delta)) \ge 1 - \sum_i \delta/4(i + 1)^2 \ge 1 - \delta/2$ on which the claimed inequalities hold for all $i \le \hat{i}_{d+1}$.

Lemma 59 For some $(C, P_{XY}, \gamma)$-dependent constants $c, c^* \in [1, \infty)$, for any $\delta \in (0, e^{-3})$ and integer $n \ge c^* \ln(1/\delta)$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on event $J_n(\delta) \cap H_n^{(i)} \cap H_n^{(ii)}$, every i ∈ {0, 1, . . .
, id˜f } satisfies V2⋆i di + ln(1/δ ) ˆ ⊆ Vi ⊆ C c 2i κ 2κ −1 , ˜ ˆ and furthermore V ⋆(d˜f ) ⊆ V (d f ) . m ˆ ˜ √ Proof Define c = 24Kc′ µ 2κ 2κ −1 , c∗ = max τ ∗ , 8d µ c1/κ r(1−γ )/6 1 2κ −1 log2 4µ c1/κ r(1−γ )/6 , and suppose n ≥ c∗ ln(1/δ ). We now proceed by induction. As the right side equals C for i = 0, the claimed ˆ inclusions are certainly true for V0 = C, which serves as our base case. Now suppose some i ∈ ˆd˜ } satisfies {0, 1, . . . , i f V2⋆i di + ln(1/δ ) ˆ ⊆ Vi ⊆ C c 2i κ 2κ −1 . (86) In particular, Condition 1 implies di + ln(1/δ ) ˆ diam(Vi ) ≤ diam C c 2i κ 2κ −1 ≤ µc 1 κ di + ln(1/δ ) 2i 1 2κ −1 . (87) ˆ ˆ ˆ If i < id˜f , then let k be the integer for which ik−1 ≤ i < ik , and otherwise let k = d˜f . Note that we ˆ certainly have i1 ≥ ⌊log2 (n/2)⌋, since m = ⌊n/2⌋ ≥ 2⌊log2 (n/2)⌋ is obtained while k = 1. Therefore, if k > 1, di + ln(1/δ ) 4d log2 (n) + 4 ln(1/δ ) ≤ , 2i n 1575 H ANNEKE so that (87) implies 1 2κ −1 4d log2 (n) + 4 ln(1/δ ) n 1 ˆ diam Vi ≤ µ c κ . By our choice of c∗ , the right side is at most r(1−γ )/6 . Therefore, since Lemma 35 implies f ∈ cl V2⋆i (i) ˆ ˆ on Hn , we have Vi ⊆ B f , r(1−γ )/6 when k > 1. Combined with (86), we have that V2⋆i ⊆ Vi , and ˆ either k = 1, or Vi ⊆ B( f , r(1−γ )/6 ) and 4m > 4⌊n/2⌋ ≥ n. Now consider any m with 2i + 1 ≤ m ≤ ˜ ⋆ min 2i+1 , m(d f ) , and for the purpose of induction suppose Vm−1 ⊆ Vi+1 upon reaching Step 5 for ˆ ˆ that value of m in Algorithm 5. Since Vi+1 ⊆ Vi and n ≥ τ ∗ , Lemma 41 (with ℓ = m − 1) implies that (i) (ii) on Hn ∩ Hn , ˆ (k) ˆ (k) ˆ (k) ∆4m (Xm ,W2 ,Vi+1 ) < γ =⇒ Γ4m (Xm , − f (Xm ),W2 ,Vi+1 ) < Γ4m (Xm , f (Xm ),W2 ,Vi+1 ) , ⋆ ⋆ so that after Step 8 we have Vm ⊆ Vi+1 . Since (86) implies that the Vm−1 ⊆ Vi+1 condition holds if i + 1 (at which time V ˆ Algorithm 5 reaches Step 5 with m = 2 i+1 = Vi ), we have by induction that (i) (ii) i+1 , m(d˜f ) . 
This establishes the ⋆ ⊆V ˆ on Hn ∩ Hn , Vm i+1 upon reaching Step 9 with m = min 2 final claim of the lemma, given that the first claim holds. For the remainder of this inductive proof, ˆ suppose i < id˜f . Since Step 8 enforces that, upon reaching Step 9 with m = 2i+1 , every h1 , h2 ∈ Vi+1 (i) (ii) have erLi+1 (h1 ) − erLi+1 (h2 ) = erL⋆ (h1 ) − erL⋆ (h2 ), on Jn (δ ) ∩ Hn ∩ Hn we have ˆ ˆ i+1 i+1 ˆ Vi+1 ⊆ ˆ ˆ ˆ h ∈ Vi : erL⋆ (h) − ′min erL⋆ (h′ ) ≤ Ui+1 Vi , δ i+1 i+1 ⋆ h ∈V 2i+1 ˆ ˆ ˆ ⊆ h ∈ Vi : erL⋆ (h) − erL⋆ ( f ) ≤ Ui+1 Vi , δ i+1 i+1 ˆ ˆ ˆ ⊆ Vi ∩ C 2Ui+1 Vi , δ ˆ ˜ ⊆ C 2Ui+1 Vi , δ , (88) where the second line follows from Lemma 35 and the last two inclusions follow from Lemma 58. ˆ Focusing on (88), combining (87) with (85) (and the fact that φi+1 (Vi ) ≤ 2), we can bound the value ˆ ˜ i+1 Vi , δ as follows. of U 2 1 ˆ ln(32(i + 1) /δ ) ≤ √µ c 2κ diam(Vi ) i 2 ≤ √ µc di + ln(1/δ ) 2i 2di + 2 ln(1/δ ) 2i+1 1 2κ √ 1 ≤ 4 µ c 2κ ′√ ˆ φi+1 (Vi ) ≤ c µc ′√ ≤ 4c 1 2κ µc 1 2κ 1 4κ −2 ln(32(i + 1)2 /δ ) 2i 1 4κ −2 d(i + 1) + ln(1/δ ) 2i+1 di + ln(1/δ ) 2i 1 4κ −2 d(i + 1) + ln(1/δ ) 2i+1 1576 1 2 8(i + 1) + 2 ln(1/δ ) 2i+1 κ 2κ −1 , d(i + 2) 2i κ 2κ −1 , 1 2 1 2 ACTIVIZED L EARNING and thus d(i + 1) + ln(1/δ ) 2i+1 ˜ ˆ ˜ √ Ui+1 (Vi , δ ) ≤ min 8Kc′ µ c 2κ 1 d(i + 1) + ln(1/δ ) 2i+1 ˜ √ ≤ 12Kc′ µ c 2κ 1 κ 2κ −1 κ 2κ −1 2 ˜ ln(32(i + 1) /δ ) , 1 +K 2i κ 2κ −1 d(i + 1) + ln(1/δ ) = (c/2) 2i+1 . Combining this with (88) now implies κ 2κ −1 d(i + 1) + ln(1/δ ) ˆ Vi+1 ⊆ C c 2i+1 . ˆ To complete the inductive proof, it remains only to show V2⋆i+1 ⊆ Vi+1 . Toward this end, recall (i) (ii) we have shown above that on Hn ∩ Hn , V2⋆i+1 ⊆ Vi+1 upon reaching Step 9 with m = 2i+1 , and that every h1 , h2 ∈ Vi+1 at this point have erLi+1 (h1 ) − erLi+1 (h2 ) = erL⋆ (h1 ) − erL⋆ (h2 ). Consider any ˆ ˆ i+1 i+1 (i) (ii) h ∈ V2⋆i+1 , and note that any other g ∈ V2⋆i+1 has erL⋆ (g) = erL⋆ (h). 
Thus, on Hn ∩ Hn , i+1 i+1 erLi+1 (h) − ′min erLi+1 (h′ ) = erL⋆ (h) − ′min erL⋆ (h′ ) ˆ ˆ i+1 i+1 h ∈Vi+1 h ∈Vi+1 ≤ erL⋆ (h) − min erL⋆ (h′ ) = inf erL⋆ (g) − min erL⋆ (h′ ). (89) i+1 i+1 i+1 i+1 ⋆ ˆ h′ ∈Vi g∈V 2i+1 (i) ˆ h′ ∈Vi (ii) Lemma 58 and (86) imply that on Jn (δ ) ∩ Hn ∩ Hn , the last expression in (89) is not larger (i) ˆ ˆ than infg∈V ⋆i+1 er(g) − er( f ) + Ui+1 (Vi , δ ), and Lemma 35 implies f ∈ cl V2⋆i+1 on Hn , so that 2 infg∈V ⋆i+1 er(g) = er( f ). We therefore have 2 ˆ ˆ erLi+1 (h) − ′min erLi+1 (h′ ) ≤ Ui+1 (Vi , δ ), ˆ ˆ h ∈Vi+1 ˆ ˆ so that h ∈ Vi+1 as well. Since this holds for any h ∈ V2⋆i+1 , we have V2⋆i+1 ⊆ Vi+1 . The lemma now follows by the principle of induction. Lemma 60 There exist (C, PXY , γ )-dependent constants c∗ , c∗ ∈ [1, ∞) such that, for any ε , δ ∈ 1 2 (0, e−3 ) and integer 1 2 1 ˜ n ≥ c∗ + c∗ θ f ε κ ε κ −2 log2 , 1 2 2 εδ ∗ when running Algorithm 5 with label budget n and confidence parameter δ , on an event Jn (ε , δ ) ∗ (ε , δ )) ≥ 1 − δ , we have V ˆiˆ ⊆ C(ε ). with P(Jn ˜ df Proof Define   ˜ c∗ = max 2d f +5 1  µ c1/κ r(1−γ )/6 2κ −1 d log2 d µ c1/κ   2 120 , ln 8c(i) , 1/3 ln 8c(ii)  ˜ ˜ r(1−γ )/6 δ 1/3 δf f 1577 H ANNEKE and   ˜ c∗ = max c∗ , 2d f +5 · 2  2κ −1 µ c1/κ r(1−γ )/6 ˜ , 2d f +15 · 1 µ c2 d 2 ˜ Fix any ε , δ ∈ (0, e−3 ) and integer n ≥ c∗ + c∗ θ f ε κ ε κ −2 log2 2 1 2 1 For each i ∈ {0, 1, . . .}, let ri = µ c κ ˜ ˜ i= 2− 1 2κ −1 di+ln(1/δ ) 2i 1 κ log2 ˜ γδ f 1 εδ   log2 (4dc) . 2  . . Also define c 2dc + log2 8d log2 ε εδ . ˇ ˇ ˆ and let i = min i ∈ N : sup j≥i r j < r(1−γ )/6 . For any i ∈ i, . . . , id˜f , let ˜ (d˜ ) ˆ Qi+1 = m ∈ 2i + 1, . . . , 2i+1 : ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 . ˜ Also define 1 2 2dc 96 ˜ ˜ θ ε κ · 2µ c2 · 8d log2 Q= · ε κ −2 . ˜f f εδ γδ (i) (ii) ˆ By Lemma 59 and Condition 1, on Jn (δ ) ∩ Hn ∩ Hn , if i ≤ id˜f , di + ln(1/δ ) ˆ Vi ⊆ C c 2i κ 2κ −1 ⊆ B ( f , ri ) . 
˜ (90) (i) (ii) ˆ ˆ Lemma 59 also implies that, on Jn (δ ) ∩ Hn ∩ Hn , for i with id˜f −1 ≤ i ≤ id˜f , all of the sets Vi+1 ˆ obtained in Algorithm 5 while k = d˜f and m ∈ 2i + 1, . . . , 2i+1 satisfy V2⋆i+1 ⊆ Vi+1 ⊆ Vi . Recall that ˜f = 1 or else every m ∈ 2i + 1, . . . , 2i+1 has 4m > n. Also ˆ i1 ≥ ⌊log2 (n/2)⌋, so that we have either d (i) ˇ recall that Lemma 49 implies that when the above conditions are satisfied, and i ≥ i, on H ′ ∩ Gn , ˜f ) ˜f ) ˆ (d ˆ (d ∆4m (Xm ,W2 ,Vi+1 ) ≤ (3/2)∆4m (Xm ,W2 , B ( f , ri )), so that |Qi+1 | upper bounds the number of m ∈ ˜ i + 1, . . . , 2i+1 for which Algorithm 5 requests the label Y in Step 6 of the k = d round. Thus, ˜f 2 m (i) (ii) on Jn (δ ) ∩ Hn ∩ Hn , 2i + ˇ ˆ id˜ f ˇˆ i=max i,id˜ f −1 |Qi+1 | upper bounds the total number of label requests by Algorithm 5 while k = d˜f ; therefore, by the constraint in Step 3, we know that either this quantity ˆ i ˜ +1 ˜ is at least as big as 2−d f n , or else we have 2 d f > d˜f · 2n . In particular, on this event, if we can show that ˆ ˜ min id˜ ,i f ˜ ˜ |Qi+1 | < 2−d f n and 2i+1 ≤ d˜f · 2n , ˇ i 2+ ˇˆ i=max i,id˜ (91) f −1 ˜ ˆ then it must be true that i < id˜f . Next, we will focus on establishing this fact. ˇˆ ˆ ˜ and any m ∈ 2i + 1, . . . , 2i+1 . If d˜f = 1, Consider any i ∈ max i, id˜f −1 , . . . , min id˜f , i then ˜ ˜ ˜ ˆ (d ) ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 = P d f S d f (B ( f , ri )) . 1578 ACTIVIZED L EARNING ˜ ˆ (d ) Otherwise, if d˜f > 1, then by Markov’s inequality and the definition of ∆4mf (·, ·, ·) from (15), ˜ ˜ 3 ˆ (d ) ˆ (d ) ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 ≤ E ∆4mf (Xm ,W2 , B ( f , ri )) W2 2γ 1 3 = (d˜f ) 2γ M (B ( f , r )) ˜i 4m (4m)3 (d˜f ) P Ss s=1 (d˜f ) ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss ˜ (i) . 
(ii) By Lemma 39, Lemma 59, and (90), on Jn (δ ) ∩ Hn ∩ Hn , this is at most 3 1 ˜ f γ (4m)3 δ (4m)3 (d˜f ) P Ss s=1 24 1 ≤ ˜ f γ 43 23i+3 δ (d˜f ) ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss ˜ 43 23i+3 (d˜f ) P Ss s=1 (d˜f ) ˜ ˜ ∪ {Xm } ∈ S d f (B ( f , ri )) Ss . Note that this value is invariant to the choice of m ∈ 2i + 1, . . . , 2i+1 . By Hoeffding’s inequality, ∗ ∗ on an event Jn (i) of probability P (Jn (i)) ≥ 1 − δ /(16i2 ), this is at most ln(4i/δ ) ˜ ˜ + P d f S d f (B ( f , ri )) ˜ 43 23i+3 24 ˜ δf γ . (92) ˆ Since i ≥ i1 > log2 (n/4) and n ≥ ln(1/δ ), we have ln(4i/δ ) ≤ 2−i 43 23i+3 ln(4 log2 (n/4)/δ ) ≤ 2−i 128n ln(n/δ ) ≤ 2−i . 128n Thus, (92) is at most 24 ˜ ˜ 2−i + P d f S d f (B ( f , ri )) ˜ ˜f γ δ . 1 (i) (ii) ∗ ˜ In either case (d˜f = 1 or d˜f > 1), by definition of θ f ε κ , on Jn (δ ) ∩ Hn ∩ Hn ∩ Jn (i), ∀m ∈ 2i + 1, . . . , 2i+1 we have ˜ 1 1 24 ˜ ˆ (d ) 2−i + θ f ε κ · max ri , ε κ ˜ ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 ≤ ˜ δf γ . (93) ˜ ˆ (d ) Furthermore, the ½[2γ /3,∞) ∆4mf (Xm ,W2 , B ( f , ri )) indicators are conditionally independent given ˜ ˜ W2 , so that we may bound P |Qi+1 | > Q W2 via a Chernoff bound. Toward this end, note that on (i) (ii) ∗ Jn (δ ) ∩ Hn ∩ Hn ∩ Jn (i), (93) implies 2i+1 E |Qi+1 | W2 = ≤ 2i · m=2i +1 ˜ ˆ (d ) ˜ P ∆4mf (Xm ,W2 , B ( f , ri )) ≥ 2γ /3 W2 1 1 24 ˜ 2−i + θ f ε κ · max ri , ε κ ˜ ˜ δf γ ≤ 1 24 ˜ 1 ˜ 1 + θ f ε κ · max 2i ri , 2i ε κ ˜ ˜ δf γ 1579 . (94) H ANNEKE Note that 1 2i ri = µ c κ (di + ln(1/δ )) 2κ −1 · 2i(1− 2κ −1 ) ˜ 1 ≤ µc ˜ 1 1 κ Then since 2−i 2κ −1 ≤ most ˜ d i + ln(1/δ ) ε c 1 κ 1 2κ −1 1 ·2 · 8d log2 2dc εδ 1 24 ˜ 1 ˜ 1 + θ f ε κ · µ · 2i ε κ ˜f γδ ≤ ˜ i(1− 2κ1 ) −1 − 2κ1 −1 ≤ µc 2dc 8d log2 εδ 1 κ 1 2κ −1 · 2i(1− 2κ −1 ) . ˜ , we have that the rightmost expression in (94) is at 1 2 2dc ˜ · ε κ −2 1 + θ f ε κ · 2µ c2 · 8d log2 εδ 24 ˜ γδ f 1 (i) ˜ ≤ Q/2. 
(ii) ∗ Therefore, a Chernoff bound implies that on Jn (δ ) ∩ Hn ∩ Hn ∩ Jn (i), we have 2dc εδ ˜ ˜ P |Qi+1 | > Q W2 ≤ exp −Q/6 ≤ exp −8 log2 ≤ exp − log2 48 log2 (2dc/εδ ) δ ˜ ≤ δ /(8i). Combined with the law of total probability and a union bound over i values, this implies there exists (i) (ii) ∗ an event Jn (ε , δ ) ⊆ Jn (δ ) ∩ Hn ∩ Hn with ˜ i (i) (ii) ∗ P Jn (δ ) ∩ Hn ∩ Hn \ Jn (ε , δ ) ≤ ˇ i=i ˜ δ /(16i2 ) + δ /(8i) ≤ δ /4, ˜ has |Qi+1 | ≤ Q. ˜ ˜ ˇ We have chosen c∗ and c∗ large enough that 2i+1 < d˜f · 2n and 2i < 2−d f −2 n. In particular, this 2 1 ∗ (ε , δ ), means that on Jn ˇˆ ˆ ˜ on which every i ∈ max i, id˜f −1 , . . . , min id˜f , i ˜ˆ min i,id˜ f ˜ ˇ i ˜˜ |Qi+1 | < 2−d f −2 n + iQ. 2+ ˇˆ i=max i,id˜ f −1 ˜ Furthermore, since i ≤ 3 log2 4dc , we have εδ 13 2 1 2 ˜ ˜ ˜ 2 µ c d θ f ε κ · ε κ −2 · log2 4dc iQ ≤ 2 ˜f εδ γδ ≤ 1 2 1 213 µ c2 d log2 (4dc) ˜ ˜ 2 θ f ε κ · ε κ −2 · log2 ≤ 2−d f −2 n. 2 ˜f εδ γδ ∗ ˜ ˆ Combining the above, we have that (91) is satisfied on Jn (ε , δ ), so that id˜f > i. Combined with ∗ (ε , δ ), Lemma 59, this implies that on Jn ˆ Viˆ ˜ df ˜ d i + ln(1/δ ) ˆ ⊆ Vi˜ ⊆ C c 2i˜ 1580 κ 2κ −1 , ACTIVIZED L EARNING ˜ and by definition of i we have ˜ d i + ln(1/δ ) c 2i˜ κ 2κ −1 2dc ≤ c 8d log2 εδ 2dc ≤ c 8d log2 εδ κ 2κ −1 κ 2κ −1 ˜ κ · 2−i 2κ −1 2dc · (ε /c) · 8d log2 εδ − 2κκ −1 = ε, ˆ so that Viˆ ˜ ⊆ C(ε ). df ∗ Finally, to prove the stated bound on P(Jn (ε , δ )), by a union bound we have (i) ∗ 1 − P (Jn (ε , δ )) ≤ (1 − P(Jn (δ ))) + 1 − P Hn (i) (i) (ii) + P Hn \ Hn (ii) ∗ + P Jn (δ ) ∩ Hn ∩ Hn \ Jn (ε , δ ) 1/3 ˜ ˜ ≤ 3δ /4 + c(i) · exp −n3 δ f /8 + c(ii) · exp −nδ f /120 ≤ δ . We are now ready for the proof of Lemma 26. Proof [Lemma 26] First, note that because we break ties in the argmax of Step 7 in favor of a y value ˆ with Vik +1 [(Xm , y)] = ∅, if Vik +1 = ∅ before Step 8, then this remains true after Step 8. 
Furthermore, the $\hat{U}_{i_k + 1}$ estimator is nonnegative, and thus the update in Step 10 never removes from $V_{i_k + 1}$ the minimizer of $\mathrm{er}_{\hat{L}_{i_k + 1}}(h)$ among $h \in V_{i_k + 1}$. Therefore, by induction we have $V_{i_k} \neq \emptyset$ at all times in Algorithm 5. In particular, $\hat{V}_{\hat{i}_{d+1} + 1} \neq \emptyset$ so that the return classifier $\hat{h}$ exists. Also, by Lemma 60, for $n$ as in Lemma 60, on $J_n^*(\varepsilon, \delta)$, running Algorithm 5 with label budget $n$ and confidence parameter $\delta$ results in $\hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq C(\varepsilon)$. Combining these two facts implies that for such a value of $n$, on $J_n^*(\varepsilon, \delta)$, $\hat{h} \in \hat{V}_{\hat{i}_{d+1} + 1} \subseteq \hat{V}_{\hat{i}_{\tilde{d}_f}} \subseteq C(\varepsilon)$, so that $\mathrm{er}(\hat{h}) \le \nu + \varepsilon$.

E.3 The Misspecified Model Case

Here we present a proof of Theorem 28, including a specification of the method $A'_a$ from the theorem statement.

Proof [Theorem 28] Consider a weakly universally consistent passive learning algorithm $A_u$ (Devroye, Györfi, and Lugosi, 1996). Such a method must exist in our setting; for instance, Hoeffding’s inequality and a union bound imply that it suffices to take
\[ A_u(L) = \mathop{\mathrm{argmin}}_{\mathbb{1}^{\pm}_{B_i}} \; \mathrm{er}_L(\mathbb{1}^{\pm}_{B_i}) + \sqrt{\frac{\ln(4i|L|)}{2|L|}}, \]
where $\{B_1, B_2, \ldots\}$ is a countable algebra that generates $F_X$. Then $A_u$ achieves a label complexity $\Lambda_u$ such that for any distribution $P_{XY}$ on $\mathcal{X} \times \{-1, +1\}$, $\forall \varepsilon \in (0, 1)$, $\Lambda_u(\varepsilon + \nu^*(P_{XY}), P_{XY}) < \infty$. In particular, if $\nu^*(P_{XY}) < \nu(C; P_{XY})$, then we have $\Lambda_u((\nu^*(P_{XY}) + \nu(C; P_{XY}))/2, P_{XY}) < \infty$.

Fix any $n \in \mathbb{N}$ and describe the execution of $A'_a(n)$ as follows. In a preprocessing step, withhold the first $m_{un} = n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor \ge n/6$ examples $\{X_1, \ldots, X_{m_{un}}\}$ and request their labels $\{Y_1, \ldots, Y_{m_{un}}\}$. Run $A_a(\lfloor n/2 \rfloor)$ on the remainder of the sequence $\{X_{m_{un}+1}, X_{m_{un}+2}, \ldots\}$ (i.e., shift any index references in the algorithm by $m_{un}$), and let $h_a$ denote the classifier it returns. Also request the labels $Y_{m_{un}+1}, \ldots, Y_{m_{un}+\lfloor n/3 \rfloor}$, and let $h_u = A_u\big((X_{m_{un}+1}, Y_{m_{un}+1}), \ldots, (X_{m_{un}+\lfloor n/3 \rfloor}, Y_{m_{un}+\lfloor n/3 \rfloor})\big)$. If $\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u) > n^{-1/3}$, return $\hat{h} = h_u$; otherwise, return $\hat{h} = h_a$.
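The budget split and the final selection rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: classifiers are modeled as plain Python callables, the held-out sample is a list of (x, y) pairs with labels in {−1, +1}, and the names split_budget, empirical_error, and select_classifier are hypothetical; the learners A_a and A_u themselves are not implemented here.

```python
def split_budget(n):
    """Split a label budget n into (held_out, active, passive) parts:
    floor(n/2) labels for the active learner, floor(n/3) for the passive
    learner, and the remaining n - floor(n/2) - floor(n/3) held out."""
    active = n // 2
    passive = n // 3
    held_out = n - active - passive
    return held_out, active, passive


def empirical_error(classifier, sample):
    """Fraction of labeled examples (x, y) that the classifier gets wrong."""
    return sum(1 for x, y in sample if classifier(x) != y) / len(sample)


def select_classifier(h_a, h_u, held_out_sample, n):
    """Return h_u if it beats h_a on the held-out sample by more than
    n^(-1/3); otherwise return h_a."""
    gap = (empirical_error(h_a, held_out_sample)
           - empirical_error(h_u, held_out_sample))
    return h_u if gap > n ** (-1.0 / 3.0) else h_a


# Toy usage with assumed data: h_a always predicts +1, h_u is a threshold rule.
sample = [(0, 1), (1, 1), (2, -1), (3, -1)]
h_a = lambda x: 1
h_u = lambda x: 1 if x < 2 else -1
chosen = select_classifier(h_a, h_u, sample, n=27)  # gap 0.5 exceeds 27^(-1/3)
print(split_budget(12), chosen is h_u)
```

The held-out part is never smaller than n/6 because the two floors remove at most n/2 + n/3 labels, which matches the bound stated in the text.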
This method achieves the stated result, for the following reasons. First, let us examine the final step of this algorithm. By Hoeffding’s inequality, with probability at least $1 - 2 \cdot \exp\{-n^{1/3}/12\}$,
\[ \big|(\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u)) - (\mathrm{er}(h_a) - \mathrm{er}(h_u))\big| \le n^{-1/3}. \]
When this is the case, a triangle inequality implies $\mathrm{er}(\hat{h}) \le \min\{\mathrm{er}(h_a), \mathrm{er}(h_u) + 2n^{-1/3}\}$.

If $P_{XY}$ satisfies the benign noise case, then for any $n \ge 2\Lambda_a(\varepsilon/2 + \nu(C; P_{XY}), P_{XY})$, we have $E[\mathrm{er}(h_a)] \le \nu(C; P_{XY}) + \varepsilon/2$, so $E[\mathrm{er}(\hat{h})] \le \nu(C; P_{XY}) + \varepsilon/2 + 2 \cdot \exp\{-n^{1/3}/12\}$, which is at most $\nu(C; P_{XY}) + \varepsilon$ if $n \ge 12^3 \ln^3(4/\varepsilon)$. So in this case, we can take $\lambda(\varepsilon) = 12^3 \ln^3(4/\varepsilon)$.

On the other hand, if $P_{XY}$ is not in the benign noise case (i.e., the misspecified model case), then for any $n \ge 3\Lambda_u((\nu^*(P_{XY}) + \nu(C; P_{XY}))/2, P_{XY})$, $E[\mathrm{er}(h_u)] \le (\nu^*(P_{XY}) + \nu(C; P_{XY}))/2$, so that
\[ E[\mathrm{er}(\hat{h})] \le E[\mathrm{er}(h_u)] + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\} \le (\nu^*(P_{XY}) + \nu(C; P_{XY}))/2 + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\}. \]
Again, this is at most $\nu(C; P_{XY}) + \varepsilon$ if $n \ge \max\{12^3 \ln^3(2/\varepsilon),\; 64(\nu(C; P_{XY}) - \nu^*(P_{XY}))^{-3}\}$. So in this case, we can take
\[ \lambda(\varepsilon) = \max\left\{ 12^3 \ln^3 \frac{2}{\varepsilon},\; 3\Lambda_u\left(\frac{\nu^*(P_{XY}) + \nu(C; P_{XY})}{2}, P_{XY}\right),\; \frac{64}{(\nu(C; P_{XY}) - \nu^*(P_{XY}))^3} \right\}. \]
In either case, we have $\lambda(\varepsilon) \in \mathrm{Polylog}(1/\varepsilon)$.

References

N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning, 1998.
M. Alekhnovich, M. Braverman, V. Feldman, A. Klivans, and T. Pitassi. Learnability and automatizability. In Proceedings of the 45th Foundations of Computer Science, 2004.
K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, 4:1041–1067, 1984.
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.
R. B. Ash and C. A. Doléans-Dade. Probability & Measure Theory. Academic Press, 2000.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006a.
M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning Journal, 65(1):79–94, 2006b.
M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.
J. Baldridge and A. Palmer. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.
Z. Bar-Yossef. Sampling lower bounds via information theory. In Proceedings of the 35th Annual ACM Symposium on the Theory of Computing, 2003.
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.
A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, 2010.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, 1989.
F. Bunea, A. B. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2009.
C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, 2000.
R. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Conference on Learning Theory, 2005.
S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.
S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
O. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple teachers. In Proceedings of the 23rd Conference on Learning Theory, 2010.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, 2009.
R. Gangadharaiah, R. D. Brown, and J. Carbonell. Active learning in example-based machine translation. In Proceedings of the 17th Nordic Conference on Computational Linguistics, 2009.
E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.
S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20–31, 1995.
S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Conference on Learning Theory, 2007a.
S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007b.
S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of the 22nd Conference on Learning Theory, 2009a.
S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009b.
S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
S. Har-Peled, D. Roth, and D. Zimak. Maximum margin coresets for active and noise tolerant learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115:248–292, 1994.
T. Hegedüs. Generalized teaching dimension and the query complexity of learning. In Proceedings of the 8th Conference on Computational Learning Theory, 1995.
L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? Journal of the Association for Computing Machinery, 43(5):840–862, 1996.
D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed concept classes. Machine Learning, 5:165–196, 1990.
S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
M. Kääriäinen. Active learning in the non-realizable case. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, 2006.
N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4:373–395, 1984.
M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.
M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.
L. G. Khachiyan. A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20:191–194, 1979.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.
V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. In École d’Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics, 2033, Springer, 2011.
S. Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54:125–152, 2004.
T. Luo, K. Kramer, D. B. Goldgof, L. O. Hall, S. Samson, A. Remsen, and T. Hopkins. Active learning to recognize multiple types of plankton. Journal of Machine Learning Research, 6:589–613, 2005.
S. Mahalanabis. A note on active learning for smooth problems. arXiv:1103.3095, 2011.
E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.
P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, 1998.
P. Mitra, C. A. Murthy, and S. K. Pal. A probabilistic active support vector learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413–418, 2004.
J. R. Munkres. Topology. Prentice Hall, Inc., 2nd edition, 2000.
I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.
R. D. Nowak. Generalized binary search. In Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.
L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the Association for Computing Machinery, 35(4):965–984, 1988.
J. Poland and M. Hutter. MDL convergence speed for Bernoulli sequences. Statistics and Computing, 16:161–175, 2006.
G. V. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l1-penalized parametric M-estimators with applications to linear SVM and logistic regression. arXiv:0908.1940v1, 2009.
D. Roth and K. Small. Margin-based active learning for structured output spaces. In European Conference on Machine Learning, 2006.
N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, 2001.
A. I. Schein and L. H. Ungar. Active learning for logistic regression: An evaluation. Machine Learning, 68(3):235–265, 2007.
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, 2000.
B. Settles. Active learning literature survey. http://active-learning.net, 2010.
S. M. Srivastava. A Course on Borel Sets. Springer-Verlag, 1998.
S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001.
A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.
L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12:2269–2292, 2011.
L. Wang and X. Shen. On L1-norm multiclass support vector machines. Journal of the American Statistical Association, 102(478):583–594, 2007.
L. Yang, S. Hanneke, and J. Carbonell. The sample complexity of self-verifying Bayesian active learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

5 0.24324977 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann

Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models

6 0.20767917 59 jmlr-2012-Linear Regression With Random Projections

7 0.2026328 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

8 0.19243439 68 jmlr-2012-Minimax Manifold Estimation

9 0.19171448 4 jmlr-2012-A Kernel Two-Sample Test

10 0.18745135 91 jmlr-2012-Plug-in Approach to Active Learning

11 0.16851127 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

12 0.16106822 20 jmlr-2012-Analysis of a Random Forests Model

13 0.15325372 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

14 0.14775662 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems

15 0.14682032 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

16 0.14567058 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

17 0.14091673 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

18 0.13994081 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

19 0.13872495 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

20 0.13715491 80 jmlr-2012-On Ranking and Generalization Bounds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.04), (26, 0.033), (29, 0.056), (35, 0.035), (38, 0.414), (49, 0.017), (57, 0.012), (69, 0.01), (75, 0.066), (77, 0.014), (79, 0.015), (92, 0.092), (96, 0.08)]
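
The page does not state how a similarity value is derived from these sparse topic weights. As a rough sketch only, one common choice is cosine similarity between the sparse topic vectors. The function name `sparse_cosine` and the second paper's weights are hypothetical, introduced purely for illustration. The same-paper row in the list below reports 0.675 rather than 1, so the page's actual measure is evidently something other than plain cosine similarity.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors given as {topicId: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Topic weights listed above for this paper.
this_paper = dict([(21, 0.04), (26, 0.033), (29, 0.056), (35, 0.035), (38, 0.414),
                   (49, 0.017), (57, 0.012), (69, 0.01), (75, 0.066), (77, 0.014),
                   (79, 0.015), (92, 0.092), (96, 0.08)])

# Hypothetical topic weights for another paper (made up, not taken from the page).
other_paper = {38: 0.20, 92: 0.30, 12: 0.25}

print(sparse_cosine(this_paper, this_paper))
print(sparse_cosine(this_paper, other_paper))
```

Storing only the nonzero topics keeps the dot product proportional to the number of listed topics rather than to the full topic count.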

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67508721 109 jmlr-2012-Stability of Density-Based Clustering

Author: Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman

Abstract: High density clusters can be characterized by the connected components of a level set L(λ) = {x : p(x) > λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree T = ⋃_λ L(λ). In this paper, we study the behavior of a density level set estimate L̂(λ) and cluster tree estimate T̂ based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L̂(λ) and T̂ as a function of h, and investigate the theoretical properties of these instability measures. Keywords: clustering, density estimation, level sets, stability, model selection
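
To make the objects in this abstract concrete, the following is a minimal one-dimensional sketch, not the authors' procedure. It evaluates a Gaussian kernel density estimate with bandwidth h on a grid, thresholds it at level λ, and counts the connected components of the resulting level set estimate. The function names, the simulated two-component sample, the grid, and the values of h and λ are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, h):
    """Evaluate a Gaussian kernel density estimate with bandwidth h on grid."""
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def count_level_set_components(density, lam):
    """Count maximal runs of grid points with density > lam (1-D components)."""
    above = density > lam
    # A component starts wherever `above` switches from False to True.
    starts = above & ~np.concatenate(([False], above[:-1]))
    return int(starts.sum())

# Simulated sample from two well-separated Gaussians (illustrative data only).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])
grid = np.linspace(-5.0, 11.0, 801)

# The number of components of the estimated level set can change with h.
for h in (0.3, 1.0, 5.0):
    p_hat = gaussian_kde_1d(sample, grid, h)
    print(h, count_level_set_components(p_hat, lam=0.05))
```

Running the loop over several bandwidths gives a crude view of how the estimated clustering varies with h, which is the kind of variability the instability measures in the abstract are meant to quantify.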

2 0.33979011 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Author: Sangkyun Lee, Stephen J. Wright

Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a low-dimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification

3 0.33892313 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao

Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization

4 0.33869934 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing:
• Weak learnability aids the whole loss family: for any ε > 0, O(ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum;
• The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O(ln(1/ε));
• Arbitrary instances may be decomposed into the above two, granting rate O(1/ε), with a matching lower bound provided for the logistic loss.
Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

5 0.33733869 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan

Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs cannot directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C-learning,” and we present efficient coordinate descent algorithms for the training of regularized C-learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C-learning

6 0.33622625 82 jmlr-2012-On the Necessity of Irrelevant Variables

7 0.3361755 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

8 0.33572274 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

9 0.3355341 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

10 0.33346888 80 jmlr-2012-On Ranking and Generalization Bounds

11 0.33276173 13 jmlr-2012-Active Learning via Perfect Selective Classification

12 0.33241403 4 jmlr-2012-A Kernel Two-Sample Test

13 0.33211821 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

14 0.33163929 73 jmlr-2012-Multi-task Regression using Minimal Penalties

15 0.33123812 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

16 0.3303327 111 jmlr-2012-Structured Sparsity and Generalization

17 0.32986522 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

18 0.32855183 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

19 0.32817069 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

20 0.32727441 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment