
37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

Source: pdf

Author: Vikas C. Raykar, Shipeng Yu

Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, deﬁned as annotators who assign labels randomly without actually looking at the instance. Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the ﬁnal consensus labels. In this paper we propose an empirical Bayesian algorithm called SpEM that iteratively eliminates the spammers and estimates the consensus labels based only on the good annotators. The algorithm is motivated by deﬁning a spammer score that can be used to rank the annotators. Experiments on simulated and real data show that the proposed approach is better than (or as good as) the earlier approaches in terms of the accuracy and uses a signiﬁcantly smaller number of annotators. Keywords: crowdsourcing, multiple annotators, ranking annotators, spammers

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Siemens Healthcare, 51 Valley Stream Parkway, E51, Malvern, PA 19355, USA. Editor: Ben Taskar. Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time. [sent-6, score-0.487]

2 Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. [sent-7, score-0.554]

3 Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, deﬁned as annotators who assign labels randomly without actually looking at the instance. [sent-8, score-0.561]

4 In this paper we propose an empirical Bayesian algorithm called SpEM that iteratively eliminates the spammers and estimates the consensus labels based only on the good annotators. [sent-10, score-0.539]

5 With the advent of crowdsourcing services (Amazon’s Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see Sheng et al. [sent-16, score-0.542]

6 The annotators usually come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. [sent-22, score-0.496]

7 Hence, in order to get good-quality labels, requestors typically get each instance labeled by multiple annotators, and these multiple annotations are then consolidated using either simple majority voting or more sophisticated methods [sent-23, score-0.657]

8 that model and correct for the annotator biases (Dawid and Skene, 1979; Smyth et al. [sent-29, score-0.558]

9 In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator). [sent-36, score-1.595]

10 A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing market place. [sent-38, score-0.505]

11 Spammer score to rank annotators: The first contribution of this paper is to formalize the notion of a spammer for binary and categorical labels. [sent-41, score-0.889]

12 More speciﬁcally we deﬁne a scalar metric which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a score close to one. [sent-42, score-0.998]

13 We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer-like the annotator is. [sent-43, score-1.497]

14 While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly deﬁned in terms of the annotator parameters. [sent-50, score-1.178]

15 Having the score deﬁned only in terms of the annotator parameters makes it easy to specify a prior for Bayesian approaches to eliminate spammers and consolidate annotations. [sent-51, score-1.092]

16 Algorithm to eliminate spammers: The second contribution is that we propose an algorithm to consolidate annotations that eliminates spammers automatically. [sent-53, score-0.93]

17 One commonly used strategy is to inject some items with known labels (gold standard) into the annotations and use them to evaluate the annotators and thus eliminate the spammers. [sent-54, score-0.607]

18 Typically we would like to detect the spammers with as few instances as possible and eliminate them from further annotations. [sent-55, score-0.508]

19 In this work we propose an algorithm called SpEM that eliminates the spammers without using any gold standard and estimates the consensus ground truth based only on the good annotators. [sent-56, score-0.578]

20 (2009, 2010) who proposed algorithms that correct for the annotator biases by estimating the annotator accuracy and the actual true label jointly. [sent-60, score-1.204]

21 A simple strategy would be to use these algorithms to estimate the annotator parameters, detect and eliminate the spammers (as deﬁned by our proposed spammer score) and reﬁt the model with only the good annotators. [sent-61, score-1.382]

22 this, it iteratively eliminates the spammers and re-estimates the labels based only on the good annotators. [sent-69, score-0.503]

23 A crucial element of our proposed algorithm is that we eliminate spammers by thresholding on a hyperparameter of the prior (automatically estimated from the data) rather than directly thresholding on the estimated spammer score. [sent-70, score-0.895]

24 In Section 2 we model the annotators in terms of the sensitivity and speciﬁcity for binary labels and extend it to categorical labels. [sent-72, score-0.627]

25 2) derived from the proposed spammer score designed to favor spammer detection. [sent-75, score-0.715]

26 The hyperparameters of this prior are estimated via an empirical Bayesian method in Section 5 leading to the proposed SpEM algorithm (Algorithm 1) that iteratively eliminates the spammers and re-estimates the ground truth based only on the good annotators. [sent-79, score-0.616]

27 Annotator Model: An annotator provides a noisy version of the true label. [sent-84, score-0.558]

28 (2009, 2010) we model the accuracy of the annotator separately on the positive and the negative examples. [sent-87, score-0.581]

29 If the true label is one, the sensitivity (true positive rate) for the jth annotator is deﬁned as the probability that the annotator labels it as one. [sent-88, score-1.259]

30 On the other hand, if the true label is zero, the specificity (1 − false positive rate) is defined as the probability that the annotator labels it as zero. [sent-90, score-0.637]

31 , 2009) and also to explicitly model the annotator performance based on the instance feature vector (Yan et al. [sent-94, score-0.558]

32 The term $\alpha^j_{ck}$ denotes the probability that annotator j assigns class k to an instance given that the true class is c. [sent-104, score-0.558]

33 Score to Rank Annotators: Intuitively, a spammer assigns labels randomly, maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator. [sent-108, score-1.054]

34 More precisely, an annotator is a spammer if the probability of the observed label $y_i^j$ being one given the true label $y_i$ is independent of the true label, that is, $\Pr[y_i^j = 1 \mid y_i] = \Pr[y_i^j = 1]$. (1) [sent-109, score-1.045]

35 This means that the annotator is assigning labels randomly by flipping a coin with bias $\Pr[y_i^j = 1]$ without actually looking at the data. [sent-110, score-0.613]

36 (2) Hence in the context of the annotator model defined in Section 2, a spammer is an annotator for whom $\alpha^j + \beta^j - 1 = 0$. [sent-112, score-1.435]

37 If $\alpha^j + \beta^j - 1 < 0$ then the annotator lies below the diagonal line and is a malicious annotator who flips the labels. [sent-114, score-1.186]

38 Note that a malicious annotator has discriminatory power if we can detect them and ﬂip their labels. [sent-115, score-0.638]

39 Hence we define the spammer score for an annotator as $S^j = (\alpha^j + \beta^j - 1)^2$. (3) [sent-118, score-0.939]

40 An annotator is a spammer if $S^j$ is close to zero. [sent-119, score-0.877]

41 Good annotators have $S^j > 0$ while a perfect annotator has $S^j = 1$. [sent-120, score-1.006]
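
As a small illustration of the score defined above, the following sketch evaluates $S^j = (\alpha^j + \beta^j - 1)^2$ for a few annotators. The function name and the example sensitivity/specificity values are illustrative assumptions, not taken from the paper.

```python
def spammer_score(alpha: float, beta: float) -> float:
    """Return S = (alpha + beta - 1)^2 for sensitivity alpha and specificity beta."""
    return (alpha + beta - 1.0) ** 2


# Hypothetical annotators given as (sensitivity, specificity) pairs.
annotators = {
    "perfect": (1.0, 1.0),          # always correct
    "good": (0.9, 0.8),             # informative but noisy
    "coin_flip": (0.5, 0.5),        # fair coin, ignores the instance
    "always_positive": (1.0, 0.0),  # biased coin, ignores the instance
    "malicious": (0.1, 0.1),        # systematically flips the labels
}

scores = {name: spammer_score(a, b) for name, (a, b) in annotators.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Both coin-flipping annotators lie on the line $\alpha^j + \beta^j = 1$ and score zero, while the malicious annotator scores well above zero, which matches the remark that flipped labels still carry information.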

42 If an annotator is a spammer (that is (2) holds) then $\log \frac{\Pr[y_i = 1 \mid y_i^j]}{\Pr[y_i = 0 \mid y_i^j]} = \log \frac{p}{1 - p}$. [sent-123, score-0.919]

43 Essentially the annotator provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label. [sent-124, score-0.601]
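
The claim that a spammer leaves the posterior log-odds unchanged can be checked numerically with Bayes' rule. This is a minimal sketch; the function name and the numeric values for sensitivity, specificity and prevalence are illustrative assumptions.

```python
import math


def posterior_log_odds(label: int, alpha: float, beta: float, p: float) -> float:
    """Log-odds of the true label being 1 after seeing one annotator label.

    alpha is the sensitivity, beta the specificity, p the prior Pr[y = 1].
    """
    # Likelihood of the observed label under each value of the true label.
    like_pos = alpha if label == 1 else 1.0 - alpha  # Pr[label | y = 1]
    like_neg = 1.0 - beta if label == 1 else beta    # Pr[label | y = 0]
    return math.log(like_pos * p) - math.log(like_neg * (1.0 - p))


p = 0.2
prior_log_odds = math.log(p / (1.0 - p))

# A spammer satisfies alpha + beta = 1, so both likelihoods coincide.
spammer_update = posterior_log_odds(1, alpha=0.7, beta=0.3, p=p)
# An informative annotator shifts the log-odds away from the prior.
good_update = posterior_log_odds(1, alpha=0.9, beta=0.8, p=p)

print(prior_log_odds, spammer_update, good_update)
```

For the spammer the likelihood terms cancel, so the posterior log-odds equal the prior log-odds whichever label is observed.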

44 Another way to think about this is that instead of using sensitivity and specificity we can re-parameterize an annotator in terms of an accuracy parameter $(\alpha^j + \beta^j)/2$ and a bias parameter $(\alpha^j - \beta^j)/2$. [sent-130, score-0.645]

45 A spammer is an annotator with accuracy equal to 0. [sent-131, score-0.9]

46 Figure 1: For binary labels each annotator is modeled by his/her sensitivity and specificity. [sent-147, score-0.677]

47 An annotator with high accuracy is a good annotator but one with low accuracy is not necessarily a spammer. [sent-151, score-1.186]

48 The accuracy of the jth annotator is computed as $\text{Accuracy}^j = \Pr[y_i^j = y_i] = \sum_{k=0}^{1} \Pr[y_i^j = k \mid y_i = k] \Pr[y_i = k] = \alpha^j p + \beta^j (1 - p)$, (4) where $p := \Pr[y_i = 1]$ is the prevalence of the positive class. [sent-152, score-0.695]
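
To make the contrast between accuracy and the spammer score concrete, the sketch below evaluates both for two hypothetical annotators. The helper names and the numbers are assumptions chosen for illustration only.

```python
def accuracy(alpha: float, beta: float, p: float) -> float:
    """Accuracy = alpha * p + beta * (1 - p), with p the prevalence of class 1."""
    return alpha * p + beta * (1.0 - p)


def spammer_score(alpha: float, beta: float) -> float:
    """Spammer score S = (alpha + beta - 1)^2."""
    return (alpha + beta - 1.0) ** 2


prevalence = 0.3  # assumed prevalence of the positive class

# A malicious annotator: very low accuracy, yet far from the spammer line.
malicious_acc = accuracy(0.1, 0.1, prevalence)
malicious_score = spammer_score(0.1, 0.1)

# A coin-flipping annotator: higher accuracy than the malicious one, score zero.
coin_acc = accuracy(0.5, 0.5, prevalence)
coin_score = spammer_score(0.5, 0.5)

print(malicious_acc, malicious_score)
print(coin_acc, coin_score)
```

The malicious annotator is less accurate than the coin flipper but has the larger score, illustrating why low accuracy alone does not identify a spammer.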

49 The malicious annotators ﬂip their labels and as such are not spammers if we can detect them and then correct for the ﬂipping. [sent-157, score-0.971]

50 , 2010) can correctly ﬂip the labels for the malicious annotators and hence they should not be treated as spammers. [sent-159, score-0.551]

51 Figure 2(b) also shows the contours of equal score for our proposed score and it can be seen that the malicious annotators have a high score and only annotators along the diagonal have a low score (spammers). [sent-160, score-1.253]

52 This indicates that the annotator j is a spammer if $\alpha^j_{ck} = \alpha^j_{c'k}, \ \forall c, c', k = 1, \ldots$ (5) [sent-244, score-0.877]

53 Let $A^j$ be the $C \times C$ confusion rate matrix with entries $[A^j]_{ck} = \alpha^j_{ck}$; a spammer would have all the rows of $A^j$ equal to one another, for example, an annotator with a confusion matrix $A^j$ = 0. [sent-248, score-0.915]

54 In the binary case we had this natural notion of spammer as an annotator for whom $\alpha^j + \beta^j - 1$ was close to zero. [sent-259, score-0.877]

55 We also assume that the ASD priors for each annotator are independent. [sent-279, score-0.558]

56 Algorithm to Eliminate Spammers: For each annotator we imposed the Automatic Spammer Detection prior of the form $\Pr[\alpha^j, \beta^j \mid \lambda^j] \propto \exp\left(-\lambda^j (\alpha^j + \beta^j - 1)^2 / 2\right)$, parameterized by precision hyperparameter $\lambda^j$. [sent-318, score-0.603]

57 However it is crucial that we use the right λ j for each annotator for two reasons: (1) For the good annotators we want the precision term to be small so that we do not over penalize the good annotators. [sent-323, score-1.054]

58 Hence, regardless of the evidence of the training data, the posterior will also be sharply concentrated around α j + β j = 1, thus that annotator will not affect the ground truth and hence, it can be effectively removed. [sent-326, score-0.645]
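
The effect of the precision hyperparameter can be seen by evaluating the unnormalized log of the ASD prior, $-\lambda^j (\alpha^j + \beta^j - 1)^2 / 2$. This is a sketch only; the function name and the two precision values are assumptions for illustration.

```python
def asd_log_prior(alpha: float, beta: float, lam: float) -> float:
    """Unnormalized log of the ASD prior: -lam * (alpha + beta - 1)^2 / 2."""
    return -lam * (alpha + beta - 1.0) ** 2 / 2.0


good = (0.9, 0.8)     # hypothetical informative annotator
spammer = (0.5, 0.5)  # hypothetical annotator on the line alpha + beta = 1

# Small precision: the prior is nearly flat, so the good annotator is
# barely penalized for lying away from the spammer line.
small_precision_penalty = asd_log_prior(*good, lam=0.1)

# Large precision: any parameters away from alpha + beta = 1 become very
# unlikely a priori, so the posterior is pulled toward the spammer line.
large_precision_penalty = asd_log_prior(*good, lam=1000.0)

# Parameters on the spammer line are never penalized, whatever the precision.
on_line_penalty = asd_log_prior(*spammer, lam=1000.0)

print(small_precision_penalty, large_precision_penalty, on_line_penalty)
```

With a large precision the log prior of the informative parameter values drops sharply, which is the mechanism by which such an annotator is effectively removed.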

59 Therefore, the discrete optimization problem corresponding to spammer detection (should each annotator be included or not? [sent-327, score-0.877]

60 $(\alpha^j + \beta^j - 1)^2 + \sigma^j$ (17) One way to think of this is that the penalization is inversely proportional to $(\alpha^j + \beta^j - 1)^2$, that is, good annotators get penalized less while the spammers suffer a large penalization. [sent-347, score-0.891]

61 Figure 4(b) plots the estimated hyperparameter $\lambda^j$ for each annotator as a function of the iteration number for a simulation setup shown in Figure 4(a). [sent-348, score-0.635]

62 It can be seen that, as expected, for the good annotators $\lambda^j$ starts decreasing while for the spammers $\lambda^j$ starts increasing with iterations. [sent-350, score-0.874]

63 At each iteration we eliminate all the annotators for whom the estimated $\lambda^j$ is greater than a certain pruning threshold $T$. [sent-353, score-0.616]
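
The pruning step described above can be sketched as a simple filter on the estimated hyperparameters. The function name, annotator identifiers, hyperparameter values and threshold below are all hypothetical; how the hyperparameters are estimated is not part of this sketch.

```python
def prune_annotators(lambdas: dict, threshold: float):
    """Split annotators into (kept, eliminated) by comparing lambda to the threshold.

    An annotator is eliminated when its estimated precision hyperparameter
    is strictly greater than the pruning threshold.
    """
    kept = [j for j, lam in lambdas.items() if lam <= threshold]
    eliminated = [j for j, lam in lambdas.items() if lam > threshold]
    return kept, eliminated


# Hypothetical estimates after some iteration: small for good annotators,
# large for annotators whose parameters sit near alpha + beta = 1.
estimated_lambda = {"a1": 0.8, "a2": 2.5, "a3": 140.0, "a4": 95.0}

kept, eliminated = prune_annotators(estimated_lambda, threshold=50.0)
print(kept, eliminated)
```

In an iterative scheme this filter would be applied once per iteration, with the labels re-estimated from the kept annotators only.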

64 For all our experiments for each annotator we set the pruning threshold to 0. [sent-363, score-0.646]

65 [Figure residue: panel title "5 good annotators, 20 spammers, 500 instances"] [sent-365, score-0.916]

66 The simulation has 5 good annotators and 20 spammers and 500 instances. [sent-379, score-0.874]

67 (b) The estimated hyperparameter $\lambda^j$ for each annotator as a function of the iteration number. [sent-382, score-0.613]

68 However we could define an annotator as a spammer if the estimated $|\alpha^j + \beta^j - 1| \le \varepsilon$. [sent-425, score-0.911]

69 What is the advantage of different shrinkage for each annotator? [sent-431, score-0.58]

70 While this is a valid approach, the advantage of our ASD prior is that the amount of shrinkage for each annotator is different and depends on how good the annotator is, that is, good annotators suffer less shrinkage while spammers suffer severe shrinkage. [sent-433, score-2.116]

71 Let $M_i$ be the number of annotators labeling the ith instance, and let $N_j$ be the number of instances labeled by the jth annotator. [sent-436, score-0.522]

72 Therefore, the gold standard instances and unlabeled instances will be used together to estimate the sensitivity and speciﬁcity of each annotator (and also to estimate the labels). [sent-446, score-0.73]

73 One might then identify an annotator j as a spammer if all of the $\lambda^j$ in the C runs indicate that this is a spammer. [sent-450, score-0.877]

74 8.3 Effect of Missing Labels: In a realistic scenario an annotator does not label all the instances. [sent-454, score-0.582]

75 Figure 7 plots the behavior of the different algorithms as a function of the fraction of annotators labeling each instance. [sent-455, score-0.502]

76 When each annotator labels only a few instances all three algorithms achieve very similar performance in terms of the AUC. [sent-456, score-0.655]

77 [Figure residue: panel titles "5 good annotators 100 spammers"; legend: Majority Voting, EM Algorithm, SpEM (Proposed Algorithm)] [sent-462, score-1.748]

78 [Figure residue: x-axis "Fraction of annotators per instance"; panels (a) and (b)] Figure 7: Effect of missing labels (Section 8.3) [sent-485, score-0.503]

79 (a) The AUC of the estimated labels as a function of the fraction of annotators labeling each instance. [sent-486, score-0.569]

80 [Figure residue: panel titles "5 good annotators 100 spammers"] [sent-490, score-1.748]

81 For all our experiments for each annotator we set the pruning threshold to 0.1 times the number of instances labeled by the [sent-521, score-0.646]

82 [Figure residue: panel titles "5 good annotators 50 spammers"; legend: Majority Voting, EM Algorithm, SpEM (Proposed Algorithm); axis labels "Sensitivity for spammer detection" and "AUC of the estimated ground truth"] [sent-522, score-2.213]

83 The advantage of our ASD prior is that the amount of shrinkage for each annotator is different and depends on how accurate the annotator is: more accurate annotators suffer less shrinkage while spammers suffer severe shrinkage. [sent-547, score-2.068]

84 The table also shows the number of annotators eliminated as spammers by the proposed algorithm. [sent-558, score-0.903]

85 Figure 10 plots the actual and the estimated annotator performance for the SpEM algorithm for binary data sets with known ground truth. [sent-559, score-0.677]

86 , 2010) The annotator had to identify whether there was an Indigo Bunting or Blue Grosbeak in the image. [sent-561, score-0.558]

87 , 2008) The annotator is presented with two sentences and given a binary choice of whether the second hypothesis sentence can be inferred from the ﬁrst. [sent-563, score-0.558]

88 , 2008) Each annotator is presented with a short headline and asked to rate the overall positive or negative valence of the emotional content of the headline. [sent-568, score-0.601]

89 S is the number of annotators eliminated as spammers by the proposed algorithm. [sent-605, score-0.903]

90 [Figure residue: panel titles "39 annotators 108 instances" and "164 annotators 800 instances"] [sent-607, score-0.98]

91 [Figure residue: panel label "(b) rte"; panel titles "76 annotators 462 instances" and "38 annotators 100 instances"] [sent-637, score-1.002]

92 The rankings are based on the lower limit of the 95% CI which factors the number of instances labeled by the annotator into the ranking. [sent-681, score-0.625]

93 An annotator who labels only a few instances will have very wide CI. [sent-682, score-0.655]

94 Some annotators who label only a few instances may have a high mean spammer score but the CI will be wide and hence ranked lower. [sent-683, score-0.895]

95 Ideally we would like to have annotators with a high score and at the same time label a lot of instances so that we can reliably identify them. [sent-684, score-0.576]
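
Ranking by the lower limit of the confidence interval, as described above, can be sketched as a sort. The data structure, field names and numbers are hypothetical, and the sketch assumes the intervals have already been computed by some other means; the summary does not say how the paper obtains them.

```python
from typing import NamedTuple


class ScoreEstimate(NamedTuple):
    annotator: str
    mean: float      # estimated spammer score
    ci_lower: float  # lower limit of the 95% CI
    ci_upper: float  # upper limit of the 95% CI


def rank_by_ci_lower(estimates):
    """Order annotators from best to worst by the lower CI limit of the score."""
    return sorted(estimates, key=lambda e: e.ci_lower, reverse=True)


# Hypothetical estimates: the first annotator labeled few instances, so the
# interval is wide even though the mean score is the highest.
estimates = [
    ScoreEstimate("few_labels", mean=0.90, ci_lower=0.35, ci_upper=1.00),
    ScoreEstimate("many_labels", mean=0.80, ci_lower=0.72, ci_upper=0.88),
    ScoreEstimate("likely_spammer", mean=0.02, ci_lower=0.00, ci_upper=0.06),
]

ranking = [e.annotator for e in rank_by_ci_lower(estimates)]
print(ranking)
```

Sorting on the lower limit rather than the mean places the annotator with many labels ahead of the one with a higher but less certain score.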

96 For example the authors made the following comments about Annotator 7: “Quirky annotator - had a lot of debate about what was the meaning of the annotation question. [sent-687, score-0.58]

97 [Figure residue: panel title "sentiment | 1660 instances | 33 annotators"; axes: Annotator, Spammer Score] [sent-706, score-0.523]

98 [Figure residue: panel titles "temp | 462 instances | 76 annotators" and "bluebird | 108 instances | 39 annotators"; x-axis: Annotator] Figure 11: Annotator Rankings. The rankings obtained for the data sets in Table 1. [sent-713, score-1.059]

99 Note that the CIs are wider when the annotator labels only a few instances. [sent-718, score-0.613]

100 Modeling annotator expertise: Learning when everybody knows a bit of something. [sent-956, score-0.558]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('annotator', 0.558), ('annotators', 0.448), ('spammers', 0.402), ('spammer', 0.319), ('spem', 0.194), ('pr', 0.129), ('raykar', 0.11), ('em', 0.087), ('aykar', 0.086), ('liminating', 0.072), ('nnotators', 0.072), ('pammers', 0.072), ('rowdsourcing', 0.072), ('skene', 0.065), ('sensitivity', 0.064), ('auc', 0.062), ('pruning', 0.062), ('score', 0.062), ('yi', 0.06), ('categorical', 0.06), ('annotations', 0.058), ('specificity', 0.057), ('anking', 0.055), ('ck', 0.055), ('labels', 0.055), ('prevalence', 0.054), ('dawid', 0.051), ('asd', 0.05), ('erf', 0.049), ('malicious', 0.048), ('voting', 0.047), ('eliminate', 0.046), ('map', 0.044), ('valence', 0.043), ('instances', 0.042), ('city', 0.041), ('crowdsourcing', 0.039), ('roc', 0.039), ('eliminated', 0.038), ('ground', 0.037), ('pai', 0.036), ('consensus', 0.036), ('estimated', 0.034), ('sentiment', 0.033), ('amazon', 0.033), ('truth', 0.033), ('mechanical', 0.032), ('labeling', 0.032), ('bluebird', 0.029), ('mv', 0.028), ('snow', 0.028), ('actual', 0.026), ('threshold', 0.026), ('smyth', 0.025), ('rankings', 0.025), ('majority', 0.025), ('hyperparameters', 0.025), ('temp', 0.025), ('prior', 0.024), ('gold', 0.024), ('good', 0.024), ('label', 0.024), ('contours', 0.024), ('accuracy', 0.023), ('annotation', 0.022), ('diagonal', 0.022), ('eliminates', 0.022), ('plots', 0.022), ('shrinkage', 0.022), ('brew', 0.022), ('carpenter', 0.022), ('crowdsourced', 0.022), ('rte', 0.022), ('senstivity', 0.022), ('shipeng', 0.022), ('vikas', 0.022), ('welinder', 0.022), ('whitehill', 0.022), ('hyperparameter', 0.021), ('log', 0.021), ('ee', 0.021), ('confusion', 0.019), ('siemens', 0.018), ('bogoni', 0.018), ('detect', 0.018), ('suffer', 0.017), ('maybe', 0.017), ('posterior', 0.017), ('maximization', 0.017), ('equating', 0.017), ('workers', 0.017), ('likelihood', 0.016), ('beta', 0.016), ('proposed', 0.015), ('derivative', 0.014), ('amt', 0.014), ('bot', 0.014), ('discriminatory', 0.014), ('ipeirotis', 0.014), ('irish', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

Author: Vikas C. Raykar, Shipeng Yu

2 0.044120453 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss

Author: José Hernández-Orallo, Peter Flach, Cèsar Ferri

Abstract: Many performance metrics have been introduced in the literature for the evaluation of classiﬁcation performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into reﬁnement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassiﬁcation costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis has been the distribution for this range of operating conditions, leading to some important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far received much less attention in the analysis of performance metrics. This dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: ﬁxed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform range of operating conditions we give clear interpretations of the 0-1 loss, the absolute error, the Brier score, the AUC and the reﬁnement loss respectively. Our analysis provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation which can be summarised as follows: given a model, apply the threshold choice methods that correspond with the available information about the operating condition, and compare their expected losses. 
In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibra

3 0.033009935 13 jmlr-2012-Active Learning via Perfect Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We discover a strong relation between two known learning models: stream-based active learning and perfect selective classiﬁcation (an extreme case of ‘classiﬁcation with a reject option’). For these models, restricted to the realizable case, we show a reduction of active learning to selective classiﬁcation that preserves fast rates. Applying this reduction to recent results for selective classiﬁcation, we derive exponential target-independent label complexity speedup for actively learning general (non-homogeneous) linear classiﬁers when the data distribution is an arbitrary high dimensional mixture of Gaussians. Finally, we study the relation between the proposed technique and existing label complexity measures, including teaching dimension and disagreement coefﬁcient. Keywords: classiﬁcation with a reject option, perfect classiﬁcation, selective classiﬁcation, active learning, selective sampling, disagreement coefﬁcient, teaching dimension, exploration vs. exploitation 1. Introduction and Related Work Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, potentially leading to signiﬁcantly faster learning. In recent years it has been gaining considerable recognition as a vital technique for efﬁciently implementing inductive learning in many industrial applications where abundance of unlabeled data exists, and/or in cases where labeling costs are high. In this paper we expose a strong relation between active learning and selective classiﬁcation, another known alternative learning model (Chow, 1970; El-Yaniv and Wiener, 2010). Focusing on binary classiﬁcation in realizable settings we consider standard stream-based active learning, which is also referred to as online selective sampling (Atlas et al., 1990; Cohn et al., 1994). In this model the learner is given an error objective ε and then sequentially receives unlabeled examples. 
At each step, after observing an unlabeled example x, the learner decides whether or not to request the label of x. The learner should terminate the learning process and output a binary classifier whose true error is guaranteed to be at most ε with high probability. The penalty incurred by the learner is the number of label requests made and this number is called the label complexity. A label complexity bound of O(d log(d/ε)) for actively learning an ε-good classifier from a concept class with VC-dimension d provides an exponential speedup in terms of 1/ε relative to standard (passive) supervised learning where the sample complexity is typically O(d/ε). The study of (stream-based, realizable) active learning is paved with very interesting theoretical results. Initially, only a few cases were known where active learning provides significant advantage over passive learning. (2012 Ran El-Yaniv and Yair Wiener.) Perhaps the most favorable result was an exponential label complexity speedup for learning homogeneous linear classifiers where the (linearly separable) data is uniformly distributed over the unit sphere. This result was manifested by various authors using various analysis techniques, for a number of strategies that can all be viewed in hindsight as approximations or variations of the “CAL algorithm” of Cohn et al. (1994). Among these studies, the earlier theoretical results (Seung et al., 1992; Freund et al., 1993, 1997; Fine et al., 2002; Gilad-Bachrach, 2007) considered Bayesian settings and studied the speedup obtained by the Query by Committee (QBC) algorithm. The more recent results provided PAC style analyses (Dasgupta et al., 2009; Hanneke, 2007a, 2009). Lack of positive results for other non-toy problems, as well as various additional negative results that were discovered, led some researchers to believe that active learning is not necessarily advantageous in general. 
Among the striking negative results is Dasgupta’s negative example for actively learning general (non-homogeneous) linear classifiers (even in two dimensions) under the uniform distribution over the sphere (Dasgupta, 2005). A number of recent innovative papers proposed alternative models for active learning. Balcan et al. (2008) introduced a subtle modification of the traditional label complexity definition, which opened up avenues for new positive results. According to their new definition of “non-verifiable” label complexity, the active learner is not required to know when to stop the learning process with a guaranteed ε-good classifier. Their main result, under this definition, is that active learning is asymptotically better than passive learning in the sense that only o(1/ε) labels are required for actively learning an ε-good classifier from a concept class that has a finite VC-dimension. Another result they accomplished is an exponential label complexity speedup for (non-verifiable) active learning of non-homogeneous linear classifiers under the uniform distribution over the unit sphere. Based on Hanneke’s characterization of active learning in terms of the “disagreement coefficient” (Hanneke, 2007a), Friedman (2009) recently extended the Balcan et al. results and proved that a target-dependent exponential speedup can be asymptotically achieved for a wide range of “smooth” learning problems (in particular, the hypothesis class, the instance space and the distribution should all be expressible by smooth functions). He proved that under such smoothness conditions, for any target hypothesis $h^*$, Hanneke’s disagreement coefficient is bounded above in terms of a constant $c(h^*)$ that depends on the unknown target hypothesis $h^*$ (and is independent of δ and ε). The resulting label complexity is $O(c(h^*)\, d\, \mathrm{polylog}(d/\varepsilon))$ (Hanneke, 2011b). This is a very general result but the target-dependent constant involved in this bound is only guaranteed to be finite. 
With this impressive progress in the case of target-dependent bounds for active learning, the current state of affairs in the target-independent bounds for active learning arena leaves much to be desired. To date the most advanced result in this model, which was already essentially established by Seung et al. and Freund et al. more than fifteen years ago (Seung et al., 1992; Freund et al., 1993, 1997), is still a target-independent exponential speed up bound for homogeneous linear classifiers under the uniform distribution over the sphere. The other learning model we contemplate that will be shown to have strong ties to active learning, is selective classification, which is mainly known in the literature as ‘classification with a reject option.’ This old-timer model, that was already introduced more than fifty years ago (Chow, 1957, 1970), extends standard supervised learning by allowing the classifier to opt out from predictions in cases where it is not confident. The incentive is to increase classification reliability over instances that are not rejected by the classifier. Thus, using selective classification one can potentially achieve a lower error rate using the same labeling “budget.” The main quantities that characterize a selective classifier are its (true) error and coverage rate (or its complement, the rejection rate). There is already substantial volume of research publications on selective classification, that kept emerging through the years. The main theme in many of these publications is the implementation of certain reject mechanisms for specific learning algorithms like support vector machines and neural networks. Among the few theoretical studies on selective classification, there are various excess risk bounds for ERM learning (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp, 2007), and certain coverage/risk guarantees for selective ensemble methods (Freund et al., 2004). 
In a recent work (El-Yaniv and Wiener, 2010) the trade-off between error and coverage was examined and in particular, a new extreme case of selective learning was introduced. In this extreme case, termed here “perfect selective classification,” the classifier is given m labeled examples and is required to instantly output a classifier whose true error is perfectly zero with certainty. This is of course potentially doable only if the classifier rejects a sufficient portion of the instance space. A non-trivial result for perfect selective classification is a high probability lower bound on the classifier coverage (or equivalently, an upper bound on its rejection rate). Such bounds have recently been presented in El-Yaniv and Wiener (2010). In Section 3 we present a reduction of active learning to perfect selective classification that preserves “fast rates.” This reduction enables the luxury of analyzing dynamic active learning problems as static problems. Relying on a recent result on perfect selective classification from El-Yaniv and Wiener (2010), in Section 4 we then apply our reduction and conclude that general (non-homogeneous) linear classifiers are actively learnable at exponential (in 1/ε) label complexity rate when the data distribution is an arbitrary unknown finite mixture of high dimensional Gaussians. While we obtain exponential label complexity speedup in 1/ε, we incur exponential slowdown in $d^2$, where d is the problem dimension. Nevertheless, in Section 5 we prove a lower bound of $\Omega\left((\log m)^{(d-1)/2} (1 + o(1))\right)$ on the label complexity, when considering the class of unrestricted linear classifiers under a Gaussian distribution. Thus, an exponential slowdown in d is unavoidable in such settings. Finally, in Section 6 we relate the proposed technique to other complexity measures for active learning. 
Proving and using a relation to the teaching dimension (Goldman and Kearns, 1995) we show, by relying on a known bound for the teaching dimension, that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke's disagreement coefficient and show that the coverage of perfect selective classification can be bounded below using the disagreement coefficient. Conversely, we show that the disagreement coefficient can be bounded above using any coverage bound for perfect selective classification. Consequently, the results here imply that the disagreement coefficient can be sufficiently bounded to ensure fast active learning for the case of linear classifiers under a mixture of Gaussians.

2. Active Learning and Perfect Selective Classification

In binary classification the goal is to learn an accurate binary classifier, $h : \mathcal{X} \to \{\pm 1\}$, from a finite labeled training sample. Here $\mathcal{X}$ is some instance space and the standard assumption is that the training sample, $S_m = \{(x_i, y_i)\}_{i=1}^m$, containing $m$ labeled examples, is drawn i.i.d. from some unknown distribution $P(X, Y)$ defined over $\mathcal{X} \times \{\pm 1\}$. The classifier $h$ is chosen from some hypothesis class $\mathcal{H}$. In this paper we focus on the realizable setting whereby labels are defined by some unknown target hypothesis $h^* \in \mathcal{H}$. Thus, the underlying distribution reduces to $P(X)$. The performance of a classifier $h$ is quantified by its true zero-one error, $R(h) \triangleq \Pr\{h(X) \neq h^*(X)\}$. A positive result for a classification problem $(\mathcal{H}, P)$ is a learning algorithm that given an error target $\varepsilon$ and a confidence parameter $\delta$ can output, based on $S_m$, a hypothesis $h$ whose error $R(h) \le \varepsilon$, with probability of at least $1 - \delta$. A bound $B(\varepsilon, \delta)$ on the size $m$ of labeled training sample sufficient for achieving this is called the sample complexity of the learning algorithm.
A classical result is that any consistent learning algorithm has sample complexity of $O\big(\frac{1}{\varepsilon}\big(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\big)\big)$, where $d$ is the VC-dimension of $\mathcal{H}$ (see, e.g., Anthony and Bartlett, 1999).

2.1 Active Learning

We consider the following standard active learning model. In this model the learner sequentially observes unlabeled instances, $x_1, x_2, \ldots$, that are sampled i.i.d. from $P(X)$. After receiving each $x_i$, the learning algorithm decides whether or not to request its label $h^*(x_i)$, where $h^* \in \mathcal{H}$ is an unknown target hypothesis. Before the start of the game the algorithm is provided with some desired error rate $\varepsilon$ and confidence level $\delta$. We say that the learning algorithm actively learned the problem instance $(\mathcal{H}, P)$ if at some point it can terminate this process, after observing $m$ instances and requesting $k$ labels, and output a hypothesis $h \in \mathcal{H}$ whose error $R(h) \le \varepsilon$, with probability of at least $1 - \delta$. The quality of the algorithm is quantified by the number $k$ of requested labels, which is called the label complexity. A positive result for a learning problem $(\mathcal{H}, P)$ is a learning algorithm that can actively learn this problem for any given $\varepsilon$ and $\delta$, and for every $h^*$, with label complexity bounded above by $L(\varepsilon, \delta, h^*)$. If there is a label complexity bound that is $O(\mathrm{polylog}(1/\varepsilon))$ we say that the problem is actively learnable at exponential rate.

2.2 Selective Classification

Following the formulation in El-Yaniv and Wiener (2010) the goal in selective classification is to learn a pair of functions $(h, g)$ from a labeled training sample $S_m$ (as defined above for passive learning). The pair $(h, g)$, which is called a selective classifier, consists of a binary classifier $h \in \mathcal{H}$, and a selection function, $g : \mathcal{X} \to \{0, 1\}$, which qualifies the classifier $h$ as follows. For any sample $x \in \mathcal{X}$, the output of the selective classifier is $(h, g)(x) \triangleq h(x)$ iff $g(x) = 1$, and $(h, g)(x) \triangleq \text{abstain}$ iff $g(x) = 0$.
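To make the pair $(h, g)$ concrete, the following is a minimal Python sketch, not taken from the paper. The function names `selective_predict` and `empirical_coverage_and_error` are hypothetical, abstention is represented by `None`, and the two returned quantities are finite-sample stand-ins (the fraction of accepted points and the error rate among them) rather than the distribution-level quantities used in the analysis.

```python
from typing import Callable, Optional, Sequence, Tuple

Classifier = Callable[[float], int]   # maps an instance to a label in {-1, +1}
Selector = Callable[[float], int]     # maps an instance to 0 (reject) or 1 (accept)


def selective_predict(h: Classifier, g: Selector, x: float) -> Optional[int]:
    """Output of the selective classifier (h, g): h(x) if g(x) == 1, else abstain (None)."""
    return h(x) if g(x) == 1 else None


def empirical_coverage_and_error(
    h: Classifier, g: Selector, xs: Sequence[float], target: Classifier
) -> Tuple[float, float]:
    """Fraction of accepted points in xs, and the error rate of h on the accepted points."""
    accepted = [x for x in xs if g(x) == 1]
    coverage = len(accepted) / len(xs)
    if not accepted:
        return coverage, 0.0  # nothing accepted, so no errors were made
    mistakes = sum(1 for x in accepted if h(x) != target(x))
    return coverage, mistakes / len(accepted)


if __name__ == "__main__":
    # Illustrative (assumed) threshold classifier that abstains near its boundary.
    h = lambda x: 1 if x > 0.5 else -1
    g = lambda x: 1 if abs(x - 0.5) > 0.1 else 0
    target = lambda x: 1 if x > 0.55 else -1
    sample = [0.1, 0.3, 0.45, 0.52, 0.58, 0.7, 0.9]
    print(empirical_coverage_and_error(h, g, sample, target))
```

In this toy setup the classifier is wrong only in a narrow band around its threshold, and the selection function rejects exactly that band, so the accepted points are all classified correctly at the price of reduced coverage.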
Thus, the function $g$ is a filter that determines a sub-domain of $\mathcal{X}$ over which the selective classifier will abstain from classifications. A selective classifier is thus characterized by its coverage, $\Phi(h, g) \triangleq E_P\{g(x)\}$, which is the $P$-weighted volume of the sub-domain of $\mathcal{X}$ that is not filtered out, and its error, $R(h, g) = E\{I(h(X) \neq h^*(X)) \cdot g(X)\}/\Phi(h, g)$, which is the zero-one loss restricted to the covered sub-domain. Note that this is a "smooth" generalization of passive learning and, in particular, $R(h, g)$ reduces to $R(h)$ (standard classification) if $g(x) \equiv 1$. We expect to see a trade-off between $R(h, g)$ and $\Phi(h, g)$ in the sense that smaller error should be obtained by compromising the coverage. A major issue in selective classification is how to optimally control this trade-off. In this paper we are concerned with an extreme case of this trade-off whereby $(h, g)$ is required to achieve a perfect score of zero error with certainty. This extreme learning objective is termed perfect learning in El-Yaniv and Wiener (2010). Thus, for a perfect selective classifier $(h, g)$ we always have $R(h, g) = 0$, and its quality is determined by its guaranteed coverage.

A positive result for a (perfect) selective classification problem $(\mathcal{H}, P)$ is a learning algorithm that uses a labeled training sample $S_m$ (as in passive learning) to output a perfect selective classifier $(h, g)$ for which $\Phi(h, g) \ge B_\Phi(\mathcal{H}, \delta, m)$ with probability of at least $1 - \delta$, for any given $\delta$. The bound $B_\Phi = B_\Phi(\mathcal{H}, \delta, m)$ is called a coverage bound (or coverage rate) and its complement, $1 - B_\Phi$, is called a rejection bound (or rate). A coverage rate $B_\Phi = 1 - O\big(\frac{\mathrm{polylog}(m)}{m}\big)$ (and the corresponding $1 - B_\Phi$ rejection rate) are qualified as fast.

2.3 The CAL Algorithm and the Consistent Selective Strategy (CSS)

The major players in active learning and in perfect selective classification are the CAL algorithm and the consistent selective strategy (CSS), respectively.
To define them we need the following definitions.

Definition 1 (Version space, Mitchell, 1977) Given a hypothesis class $\mathcal{H}$ and a training sample $S_m$, the version space $VS_{\mathcal{H},S_m}$ is the set of all hypotheses in $\mathcal{H}$ that classify $S_m$ correctly.

Definition 2 (Disagreement set, Hanneke, 2007a; El-Yaniv and Wiener, 2010) Let $\mathcal{G} \subset \mathcal{H}$. The disagreement set w.r.t. $\mathcal{G}$ is defined as

$$\mathrm{DIS}(\mathcal{G}) \triangleq \{x \in \mathcal{X} : \exists h_1, h_2 \in \mathcal{G} \ \text{s.t.}\ h_1(x) \neq h_2(x)\}.$$

The agreement set w.r.t. $\mathcal{G}$ is $\mathrm{AGR}(\mathcal{G}) \triangleq \mathcal{X} \setminus \mathrm{DIS}(\mathcal{G})$.

The main strategy for active learning in the realizable setting (Cohn et al., 1994) is to request labels only for instances belonging to the disagreement set and output any (consistent) hypothesis belonging to the version space. This strategy is often called the CAL algorithm. A related strategy for perfect selective classification was proposed in El-Yaniv and Wiener (2010) and termed consistent selective strategy (CSS). Given a training set $S_m$, CSS takes the classifier $h$ to be any hypothesis in $VS_{\mathcal{H},S_m}$ (i.e., a consistent learner), and takes a selection function $g$ that equals one for all points in the agreement set with respect to $VS_{\mathcal{H},S_m}$, and zero otherwise.

3. From Coverage Bound to Label Complexity Bound

In this section we present a reduction from stream-based active learning to perfect selective classification. Particularly, we show that if there exists for $\mathcal{H}$ a perfect selective classifier with a fast rejection rate of $O(\mathrm{polylog}(m)/m)$, then the CAL algorithm will actively learn $\mathcal{H}$ with exponential label complexity rate of $O(\mathrm{polylog}(1/\varepsilon))$.

Lemma 3 Let $S_m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a sequence of $m$ labeled samples drawn i.i.d. from an unknown distribution $P(X)$ and let $S_i = \{(x_1, y_1), \ldots, (x_i, y_i)\}$ be the $i$-prefix of $S_m$. Then, with probability of at least $1 - \delta$ over random choices of $S_m$, the following bound holds simultaneously for all $i = 1, \ldots, m - 1$,

$$\Pr\{x_{i+1} \in \mathrm{DIS}(VS_{\mathcal{H},S_i}) \mid S_i\} \le 1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{\log_2(m)}, 2^{\lfloor \log_2(i) \rfloor}\Big),$$

where $B_\Phi(\mathcal{H}, \delta, m)$ is a coverage bound for perfect selective classification with respect to hypothesis class $\mathcal{H}$, confidence $\delta$ and sample size $m$.

Proof For $j = 1, \ldots, m$, abbreviate $\mathrm{DIS}_j \triangleq \mathrm{DIS}(VS_{\mathcal{H},S_j})$ and $\mathrm{AGR}_j \triangleq \mathrm{AGR}(VS_{\mathcal{H},S_j})$. By definition, $\mathrm{DIS}_j = \mathcal{X} \setminus \mathrm{AGR}_j$. By the definitions of a coverage bound and agreement/disagreement sets, with probability of at least $1 - \delta$ over random choices of $S_j$

$$B_\Phi(\mathcal{H}, \delta, j) \le \Pr\{x \in \mathrm{AGR}_j \mid S_j\} = \Pr\{x \notin \mathrm{DIS}_j \mid S_j\} = 1 - \Pr\{x \in \mathrm{DIS}_j \mid S_j\}.$$

Applying the union bound we conclude that the following inequality holds simultaneously with high probability for $t = 0, \ldots, \lfloor \log_2(m) \rfloor - 1$,

$$\Pr\{x_{2^t+1} \in \mathrm{DIS}_{2^t} \mid S_{2^t}\} \le 1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{\log_2(m)}, 2^t\Big). \qquad (1)$$

For all $j \le i$, $S_j \subseteq S_i$, so $\mathrm{DIS}_i \subseteq \mathrm{DIS}_j$. Therefore, since the samples in $S_m$ are all drawn i.i.d., for any $j \le i$,

$$\Pr\{x_{i+1} \in \mathrm{DIS}_i \mid S_i\} \le \Pr\{x_{i+1} \in \mathrm{DIS}_j \mid S_j\} = \Pr\{x_{j+1} \in \mathrm{DIS}_j \mid S_j\}.$$

The proof is complete by setting $j = 2^{\lfloor \log_2(i) \rfloor} \le i$, and applying inequality (1).

Lemma 4 (Bernstein's inequality, Hoeffding, 1963) Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ almost surely, for all $i$. Then, for all positive $t$,

$$\Pr\Big\{\sum_{i=1}^{n} X_i > t\Big\} \le \exp\Big(-\frac{t^2/2}{\sum_j E\{X_j^2\} + Mt/3}\Big).$$

Lemma 5 Let $Z_i$, $i = 1, \ldots, m$, be independent Bernoulli random variables with success probabilities $p_i$. Then, for any $0 < \delta < 1$, with probability of at least $1 - \delta$,

$$\sum_{i=1}^{m} (Z_i - E\{Z_i\}) \le \sqrt{2 \ln\frac{1}{\delta} \sum_i p_i} + \frac{2}{3} \ln\frac{1}{\delta}.$$

Proof Define $W_i \triangleq Z_i - E\{Z_i\} = Z_i - p_i$. Clearly, $E\{W_i\} = 0$, $|W_i| \le 1$, $E\{W_i^2\} = p_i(1 - p_i)$. Applying Bernstein's inequality (Lemma 4) on the $W_i$,

$$\Pr\Big\{\sum_{i=1}^{m} W_i > t\Big\} \le \exp\Big(-\frac{t^2/2}{\sum_j E\{W_j^2\} + t/3}\Big) = \exp\Big(-\frac{t^2/2}{\sum_i p_i(1 - p_i) + t/3}\Big) \le \exp\Big(-\frac{t^2/2}{\sum_i p_i + t/3}\Big).$$

Equating the right-hand side to $\delta$ and solving for $t$, we have

$$\frac{t^2/2}{\sum_i p_i + t/3} = \ln\frac{1}{\delta} \iff t^2 - t \cdot \frac{2}{3} \ln\frac{1}{\delta} - 2 \ln\frac{1}{\delta} \sum_i p_i = 0,$$

and the positive solution of this quadratic equation is

$$t = \frac{1}{3} \ln\frac{1}{\delta} + \sqrt{\frac{1}{9} \ln^2\frac{1}{\delta} + 2 \ln\frac{1}{\delta} \sum_i p_i} < \frac{2}{3} \ln\frac{1}{\delta} + \sqrt{2 \ln\frac{1}{\delta} \sum_i p_i}.$$

Lemma 6 Let $Z_1, Z_2, \ldots, Z_m$ be a high order Markov sequence of dependent binary random variables defined in the same probability space. Let $X_1, X_2, \ldots, X_m$ be a sequence of independent random variables such that,

$$\Pr\{Z_i = 1 \mid Z_{i-1}, \ldots, Z_1, X_{i-1}, \ldots, X_1\} = \Pr\{Z_i = 1 \mid X_{i-1}, \ldots, X_1\}.$$

Define $P_1 \triangleq \Pr\{Z_1 = 1\}$, and for $i = 2, \ldots, m$, $P_i \triangleq \Pr\{Z_i = 1 \mid X_{i-1}, \ldots, X_1\}$. Let $b_1, b_2, \ldots, b_m$ be given constants independent of $X_1, X_2, \ldots, X_m$ (see footnote 1). Assume that $P_i \le b_i$ simultaneously for all $i$ with probability of at least $1 - \delta/2$, $\delta \in (0, 1)$. Then, with probability of at least $1 - \delta$,

$$\sum_{i=1}^{m} Z_i \le \sum_{i=1}^{m} b_i + \sqrt{2 \ln\frac{2}{\delta} \sum_i b_i} + \frac{2}{3} \ln\frac{2}{\delta}.$$

We proceed with a direct proof of Lemma 6. An alternative proof of this lemma, using supermartingales, appears in Appendix B.

Proof For $i = 1, \ldots, m$, let $W_i$ be binary random variables satisfying

$$\Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\} \triangleq \frac{b_i + I(P_i \le b_i) \cdot (P_i - b_i)}{P_i},$$

$$\Pr\{W_i = 1 \mid Z_i = 0, X_{i-1}, \ldots, X_1\} \triangleq \max\Big\{\frac{b_i - P_i}{1 - P_i}, 0\Big\},$$

$$\Pr\{W_i = 1 \mid W_{i-1}, \ldots, W_1, X_{i-1}, \ldots, X_1\} = \Pr\{W_i = 1 \mid X_{i-1}, \ldots, X_1\}.$$

We notice that

$$\Pr\{W_i = 1 \mid X_{i-1}, \ldots, X_1\} = \Pr\{W_i = 1, Z_i = 1 \mid X_{i-1}, \ldots, X_1\} + \Pr\{W_i = 1, Z_i = 0 \mid X_{i-1}, \ldots, X_1\}$$

$$= \Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\} \Pr\{Z_i = 1 \mid X_{i-1}, \ldots, X_1\} + \Pr\{W_i = 1 \mid Z_i = 0, X_{i-1}, \ldots, X_1\} \Pr\{Z_i = 0 \mid X_{i-1}, \ldots, X_1\}$$

$$= \begin{cases} P_i + \dfrac{b_i - P_i}{1 - P_i}(1 - P_i) = b_i, & P_i \le b_i; \\[2mm] \dfrac{b_i}{P_i} \cdot P_i + 0 = b_i, & \text{else.} \end{cases}$$

Hence the distribution of each $W_i$ is independent of $X_{i-1}, \ldots, X_1$, and the $W_i$ are independent Bernoulli random variables with success probabilities $b_i$. By construction if $P_i \le b_i$ then

$$\Pr\{W_i = 1 \mid Z_i = 1\} = \int_{\mathcal{X}} \Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\} = 1.$$

By assumption $P_i \le b_i$ for all $i$ simultaneously with probability of at least $1 - \delta/2$. Therefore, $Z_i \le W_i$ simultaneously with probability of at least $1 - \delta/2$. We now apply Lemma 5 on the $W_i$. The proof is then completed using the union bound.

Footnote 1. Precisely, we require that each of the $b_i$ was selected before the $X_i$ are chosen.

Theorem 7 Let $S_m$ be a sequence of $m$ unlabeled samples drawn i.i.d. from an unknown distribution $P$. Then with probability of at least $1 - \delta$ over choices of $S_m$, the number of label requests $k$ by the CAL algorithm is bounded by

$$k \le \Psi(\mathcal{H}, \delta, m) + \sqrt{2 \ln\frac{2}{\delta}\, \Psi(\mathcal{H}, \delta, m)} + \frac{2}{3} \ln\frac{2}{\delta},$$

where

$$\Psi(\mathcal{H}, \delta, m) \triangleq \sum_{i=1}^{m} \Big(1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{2 \log_2(m)}, 2^{\lfloor \log_2(i) \rfloor}\Big)\Big)$$

and $B_\Phi(\mathcal{H}, \delta, m)$ is a coverage bound for perfect selective classification with respect to hypothesis class $\mathcal{H}$, confidence $\delta$ and sample size $m$.

Proof According to CAL, the label of sample $x_i$ will be requested iff $x_i \in \mathrm{DIS}(VS_{\mathcal{H},S_{i-1}})$. For $i = 1, \ldots, m$, let $Z_i$ be binary random variables such that $Z_i \triangleq 1$ iff CAL requests a label for sample $x_i$. Applying Lemma 3 we get that for all $i = 2, \ldots, m$, with probability of at least $1 - \delta/2$

$$\Pr\{Z_i = 1 \mid S_{i-1}\} = \Pr\{x_i \in \mathrm{DIS}(VS_{\mathcal{H},S_{i-1}}) \mid S_{i-1}\} \le 1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{2 \log_2(m)}, 2^{\lfloor \log_2(i-1) \rfloor}\Big).$$

For $i = 1$, $B_\Phi(\mathcal{H}, \delta, 1) = 0$ and the above inequality trivially holds. An application of Lemma 6 on the variables $Z_i$ completes the proof.

Theorem 7 states an upper bound on the label complexity expressed in terms of $m$, the size of the sample provided to CAL. This upper bound is very convenient for directly analyzing the active learning speedup relative to supervised learning. A standard label complexity upper bound, which depends on $1/\varepsilon$, can be extracted using the following simple observation.

Lemma 8 (Hanneke, 2009; Anthony and Bartlett, 1999) Let $S_m$ be a sequence of $m$ unlabeled samples drawn i.i.d. from an unknown distribution $P$. Let $\mathcal{H}$ be a hypothesis class whose finite VC dimension is $d$, and let $\varepsilon$ and $\delta$ be given.
If

$$m \ge \frac{4}{\varepsilon}\Big(d \ln\frac{12}{\varepsilon} + \ln\frac{2}{\delta}\Big),$$

then, with probability of at least $1 - \delta$, CAL will output a classifier whose true error is at most $\varepsilon$.

Proof Hanneke (2009) observed that since CAL requests a label whenever there is a disagreement in the version space, it is guaranteed that after processing $m$ examples, CAL will output a classifier that is consistent with all the $m$ examples introduced to it. Therefore, CAL is a consistent learner. A classical result (Anthony and Bartlett, 1999, Theorem 4.8) is that any consistent learner will achieve, with probability of at least $1 - \delta$, a true error not exceeding $\varepsilon$ after observing at most $\frac{4}{\varepsilon}\big(d \ln\frac{12}{\varepsilon} + \ln\frac{2}{\delta}\big)$ labeled examples.

Theorem 9 Let $\mathcal{H}$ be a hypothesis class whose finite VC dimension is $d$. If the rejection rate of CSS (see definition in Section 2.3) is $O\big(\frac{\mathrm{polylog}(m/\delta)}{m}\big)$, then $(\mathcal{H}, P)$ is actively learnable with exponential label complexity speedup.

Proof Plugging this rejection rate into $\Psi$ (defined in Theorem 7) we have,

$$\Psi(\mathcal{H}, \delta, m) \triangleq \sum_{i=1}^{m} \Big(1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{\log_2(m)}, 2^{\lfloor \log_2(i) \rfloor}\Big)\Big) = \sum_{i=1}^{m} O\Big(\frac{\mathrm{polylog}\big(\frac{i \log(m)}{\delta}\big)}{i}\Big).$$

Applying Lemma 41 we get

$$\Psi(\mathcal{H}, \delta, m) = O\Big(\mathrm{polylog}\Big(\frac{m \log(m)}{\delta}\Big)\Big).$$

By Theorem 7, $k = O\big(\mathrm{polylog}\big(\frac{m}{\delta}\big)\big)$, and an application of Lemma 8 concludes the proof.

4. Label Complexity Bounding Technique and Its Applications

In this section we present a novel technique for deriving target-independent label complexity bounds for active learning. The technique combines the reduction of Theorem 7 and a general data-dependent coverage bound for selective classification from El-Yaniv and Wiener (2010). For some learning problems it is a straightforward technical exercise, involving VC-dimension calculations, to arrive at exponential label complexity bounds. We show a few applications of this technique resulting in both reproductions of known label complexity exponential rates as well as a new one.
The following definitions (El-Yaniv and Wiener, 2010) are required for introducing the technique.

Definition 10 (Version space compression set) For any hypothesis class $\mathcal{H}$, let $S_m$ be a labeled sample of $m$ points inducing a version space $VS_{\mathcal{H},S_m}$. The version space compression set, $S' \subseteq S_m$, is a smallest subset of $S_m$ satisfying $VS_{\mathcal{H},S_m} = VS_{\mathcal{H},S'}$. The (unique) number $\hat{n} = \hat{n}(\mathcal{H}, S_m) = |S'|$ is called the version space compression set size.

Remark 11 Our "version space compression set" is precisely Hanneke's "minimum specifying set" (Hanneke, 2007b) for $f$ on $U$ with respect to $V$, where $f = h^*$, $U = S_m$, $V = \mathcal{H}[S_m]$ (see Definition 23).

Definition 12 (Characterizing hypothesis) For any subset of hypotheses $\mathcal{G} \subseteq \mathcal{H}$, the characterizing hypothesis of $\mathcal{G}$, denoted $f_{\mathcal{G}}(x)$, is a binary hypothesis over $\mathcal{X}$ (not restricted to $\mathcal{H}$) obtaining positive values over the agreement set $\mathrm{AGR}(\mathcal{G})$ (Definition 2), and zero otherwise.

Definition 13 (Order-n characterizing set) For each $n$, let $\Sigma_n$ be the set of all possible labeled samples of size $n$ (all $n$-subsets, each with all $2^n$ possible labelings). The order-$n$ characterizing set of $\mathcal{H}$, denoted $\mathcal{F}_n$, is the set of all characterizing hypotheses $f_{\mathcal{G}}(x)$, where $\mathcal{G} \subseteq \mathcal{H}$ is a version space induced by some member of $\Sigma_n$.

Definition 14 (Characterizing set complexity) Let $\mathcal{F}_n$ be the order-$n$ characterizing set of $\mathcal{H}$. The order-$n$ characterizing set complexity of $\mathcal{H}$, denoted $\gamma(\mathcal{H}, n)$, is the VC-dimension of $\mathcal{F}_n$.

The following theorem, credited to (El-Yaniv and Wiener, 2010, Theorem 21), is a powerful data-dependent coverage bound for perfect selective learning, expressed in terms of the version space compression set size and the characterizing set complexity.

Theorem 15 (Data-dependent coverage guarantee) For any $m$, let $a_1, a_2, \ldots, a_m \in \mathbb{R}$ be given, such that $a_i \ge 0$ and $\sum_{i=1}^{m} a_i \le 1$. Let $(h, g)$ be a perfect selective classifier (CSS, see Section 2.3).
Then, $R(h, g) = 0$, and for any $0 \le \delta \le 1$, with probability of at least $1 - \delta$,

$$\Phi(h, g) \ge 1 - \frac{2}{m}\Big(\gamma(\mathcal{H}, \hat{n}) \ln_+\Big(\frac{2em}{\gamma(\mathcal{H}, \hat{n})}\Big) + \ln\frac{2}{a_{\hat{n}}\, \delta}\Big),$$

where $\hat{n}$ is the size of the version space compression set and $\gamma(\mathcal{H}, \hat{n})$ is the order-$\hat{n}$ characterizing set complexity of $\mathcal{H}$.

Given a hypothesis class $\mathcal{H}$, our recipe for deriving active learning label complexity bounds for $\mathcal{H}$ is: (i) calculate both $\hat{n}$ and $\gamma(\mathcal{H}, \hat{n})$; (ii) apply Theorem 15, obtaining a bound $B_\Phi$ for the coverage; (iii) plug $B_\Phi$ in Theorem 7 to get a label complexity bound expressed as a summation; (iv) apply Lemma 41 to obtain a label complexity bound in a closed form.

4.1 Examples

In the following example we derive a label complexity bound for the concept class of thresholds (linear separators in $\mathbb{R}$). Although this is a toy example (for which an exponential rate is well known) it does exemplify the technique, and in many other cases the application of the technique is not much harder. Let $\mathcal{H}$ be the class of thresholds. We first show that the corresponding version space compression set size $\hat{n} \le 2$. Assume w.l.o.g. that $h^*(x) \triangleq I(x > w)$ for some $w \in (0, 1)$. Let $x_- \triangleq \max\{x_i \in S_m \mid y_i = -1\}$ and $x_+ \triangleq \min\{x_i \in S_m \mid y_i = +1\}$. At least one of $x_-$ or $x_+$ exists. Let $S'_m = \{(x_-, -1), (x_+, +1)\}$. Then $VS_{\mathcal{H},S_m} = VS_{\mathcal{H},S'_m}$, and $\hat{n} = |S'_m| \le 2$. Now, $\gamma(\mathcal{H}, 2) = 2$, because the order-2 characterizing set of $\mathcal{H}$ is the class of intervals in $\mathbb{R}$ whose VC-dimension is 2. Plugging these numbers in Theorem 15, and using the assignment $a_1 = a_2 = 1/2$,

$$B_\Phi(\mathcal{H}, \delta, m) = 1 - \frac{2}{m}\Big(2 \ln(em) + \ln\frac{4}{\delta}\Big) = 1 - O\Big(\frac{\ln(m/\delta)}{m}\Big).$$

Next we plug $B_\Phi$ in Theorem 7 obtaining a raw label complexity

$$\Psi(\mathcal{H}, \delta, m) = \sum_{i=1}^{m} \Big(1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{2 \log_2(m)}, 2^{\lfloor \log_2(i) \rfloor}\Big)\Big) = \sum_{i=1}^{m} O\Big(\frac{\ln(\log_2(m) \cdot i/\delta)}{i}\Big).$$

Finally, by applying Lemma 41, with $a = 1$ and $b = \log_2 m/\delta$, we conclude that

$$\Psi(\mathcal{H}, \delta, m) = O\Big(\ln^2\frac{m}{\delta}\Big).$$

Thus, $\mathcal{H}$ is actively learnable with exponential speedup, and this result applies to any distribution.
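The thresholds example can also be checked numerically. The Python sketch below is an illustration under stated assumptions, not code from the paper: all function names are hypothetical, the rejection bound $1 - B_\Phi$ is clipped to $[0, 1]$ because the closed-form expression is vacuous for small sample sizes, and the stream is drawn from the uniform distribution on $[0, 1]$ purely for convenience. It evaluates the sample size of Lemma 8 for $d = 1$, sums the Theorem 7 bound term by term, and simulates CAL for thresholds, where the disagreement region is the open interval between the largest negative and smallest positive point seen so far.

```python
import math
import random


def passive_sample_size(eps: float, delta: float, d: int) -> int:
    """Lemma 8: stream length after which CAL's output has error <= eps w.p. >= 1 - delta."""
    return math.ceil((4.0 / eps) * (d * math.log(12.0 / eps) + math.log(2.0 / delta)))


def threshold_rejection_bound(delta: float, m: int) -> float:
    """1 - B_Phi(H, delta, m) for thresholds, clipped to [0, 1] (clipping is our assumption)."""
    raw = (2.0 / m) * (2.0 * math.log(math.e * m) + math.log(4.0 / delta))
    return min(1.0, max(0.0, raw))


def theorem7_label_bound(delta: float, m: int) -> float:
    """Psi + sqrt(2 ln(2/delta) Psi) + (2/3) ln(2/delta), for m >= 2."""
    delta_inner = delta / (2.0 * math.log2(m))
    psi = 0.0
    for i in range(1, m + 1):
        prefix = 1 << (i.bit_length() - 1)  # 2 ** floor(log2(i)), computed exactly
        psi += threshold_rejection_bound(delta_inner, prefix)
    log_term = math.log(2.0 / delta)
    return psi + math.sqrt(2.0 * log_term * psi) + (2.0 / 3.0) * log_term


def cal_label_requests(stream, w: float) -> int:
    """Run CAL for thresholds with target h*(x) = +1 iff x > w; return the number of requests."""
    largest_negative, smallest_positive = -math.inf, math.inf
    requests = 0
    for x in stream:
        # x lies in the disagreement region iff it falls strictly between the two
        # labeled points that currently delimit the version space.
        if largest_negative < x < smallest_positive:
            requests += 1
            if x > w:
                smallest_positive = x
            else:
                largest_negative = x
    return requests


if __name__ == "__main__":
    eps, delta = 0.01, 0.05
    m = passive_sample_size(eps, delta, d=1)
    rng = random.Random(0)
    stream = [rng.random() for _ in range(m)]
    print("stream length m:", m)
    print("Theorem 7 bound on label requests:", round(theorem7_label_bound(delta, m), 1))
    print("simulated CAL label requests:", cal_label_requests(stream, w=0.37))
```

The simulation only needs two numbers to represent the version space, which mirrors the observation that the version space compression set for thresholds has size at most 2.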
In Table 1 we summarize the $\hat{n}$ and $\gamma(\mathcal{H}, \hat{n})$ values we calculated for four other hypothesis classes.

Hypothesis class | Distribution | $\hat{n}$ | $\gamma(\mathcal{H}, \hat{n})$
Linear separators in $\mathbb{R}$ | any | 2 | 2
Intervals in $\mathbb{R}$ | any (target-dependent, see footnote 2) | 4 | 4
Linear separators in $\mathbb{R}^2$ | any distribution on the unit circle (target-dependent, see footnote 2) | 4 | 4
Linear separators in $\mathbb{R}^d$ | mixture of Gaussians | $O\big((\log m)^{d-1}/\delta\big)$ | $O\big(\hat{n}^{d/2+1}\big)$
Balanced axis-aligned rectangles in $\mathbb{R}^d$ | product distribution | $O(\log(dm/\delta))$ | $O(d \hat{n} \log \hat{n})$

Table 1: The $\hat{n}$ and $\gamma$ of various hypothesis spaces achieving exponential rates.

The last two cases are fully analyzed in Sections 4.2 and 6.1, respectively. For the other classes, where $\gamma$ and $\hat{n}$ are constants, it is clear (Theorem 15) that exponential rates are obtained. We emphasize that the bounds for these two classes are target-dependent as they require that $S_m$ include at least one sample from each class.

4.2 Linear Separators in $\mathbb{R}^d$ Under Mixture of Gaussians

In this section we state and prove our main example, an exponential label complexity bound for linear classifiers in $\mathbb{R}^d$.

Theorem 16 Let $\mathcal{H}$ be the class of all linear binary classifiers in $\mathbb{R}^d$, and let the underlying distribution be any mixture of a fixed number of Gaussians in $\mathbb{R}^d$. Then, with probability of at least $1 - \delta$ over choices of $S_m$, the number of label requests $k$ by CAL is bounded by

$$k = O\Big(\frac{(\log m)^{d^2+1}}{\delta^{(d+3)/2}}\Big).$$

Therefore by Lemma 8 we get $k = O(\mathrm{poly}(1/\delta) \cdot \mathrm{polylog}(1/\varepsilon))$.

Proof The following is a coverage bound for linear classifiers in $d$ dimensions that holds in our setting with probability of at least $1 - \delta$ (El-Yaniv and Wiener, 2010, Corollary 33; see footnote 3),

$$\Phi(h, g) \ge 1 - O\Big(\frac{(\log m)^{d^2}}{m} \cdot \frac{1}{\delta^{(d+3)/2}}\Big). \qquad (2)$$

Footnote 2. Target-dependent with at least one sample in each class.

Footnote 3. This bound uses the fact that for linear classifiers in $d$ dimensions $\hat{n} = O\big((\log m)^{d-1}/\delta\big)$ (El-Yaniv and Wiener, 2010, Lemma 32), and that $\gamma(\mathcal{H}, \hat{n}) = O\big(\hat{n}^{d/2+1}\big)$ (El-Yaniv and Wiener, 2010, Lemma 27).
Plugging this bound in Theorem 7 we obtain,

$$\Psi(\mathcal{H}, \delta, m) = \sum_{i=1}^{m} \Big(1 - B_\Phi\Big(\mathcal{H}, \frac{\delta}{2 \log_2(m)}, 2^{\lfloor \log_2(i) \rfloor}\Big)\Big) = \sum_{i=1}^{m} O\Big(\frac{(\log i)^{d^2}}{i} \cdot \Big(\frac{2 \log_2(m)}{\delta}\Big)^{\frac{d+3}{2}}\Big) = O\Big(\Big(\frac{\log_2(m)}{\delta}\Big)^{\frac{d+3}{2}} \cdot \sum_{i=1}^{m} \frac{(\log(i))^{d^2}}{i}\Big).$$

Finally, an application of Lemma 41 with $a = d^2$ and $b = 1$ completes the proof.

5. Lower Bound on Label Complexity

In the previous section we have derived an upper bound on the label complexity of CAL for various classifiers and distributions. In the case of linear classifiers in $\mathbb{R}^d$ we have shown an exponential speedup in terms of $1/\varepsilon$ but also an exponential slowdown in terms of the dimension $d$. In passive learning there is a linear dependency in the dimension while in our case (active learning using CAL) there is an exponential one. Is it an artifact of our bounding technique or a fundamental phenomenon? To answer this question we derive an asymptotic lower bound on the label complexity. We show that the exponential dependency in $d$ is unavoidable (at least asymptotically) for every bounding technique when considering linear classifiers even under a single Gaussian (isotropic) distribution. The argument is obtained by the observation that CAL has to request a label for any point on the convex hull of a sample $S_m$. The bound is obtained using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution.

Definition 17 (Gaussian polytope) Let $X_1, \ldots, X_m$ be i.i.d. random points in $\mathbb{R}^d$ with common standard normal distribution (with zero mean and covariance matrix $\frac{1}{2} I_d$). A Gaussian polytope $P_m$ is the convex hull of these random points.

Denote by $f_k(P_m)$ the number of $k$-faces in the Gaussian polytope $P_m$. Note that $f_0(P_m)$ is the number of vertices in $P_m$. The following two theorems asymptotically bound the average and variance of $f_k(P_m)$.

Theorem 18 (Hug et al., 2004, Theorem 1.1) Let $X_1, \ldots, X_m$ be i.i.d.
random points in $\mathbb{R}^d$ with common standard normal distribution. Then

$$E f_k(P_m) = c_{(k,d)} (\log m)^{\frac{d-1}{2}} \cdot (1 + o(1))$$

as $m \to \infty$, where $c_{(k,d)}$ is a constant depending only on $k$ and $d$.

Theorem 19 (Hug and Reitzner, 2005, Theorem 1.1) Let $X_1, \ldots, X_m$ be i.i.d. random points in $\mathbb{R}^d$ with common standard normal distribution. Then there exists a positive constant $c_d$, depending only on the dimension, such that

$$\mathrm{Var}(f_k(P_m)) \le c_d (\log m)^{\frac{d-1}{2}}$$

for all $k \in \{0, \ldots, d - 1\}$.

We can now use Chebyshev's inequality to lower bound the number of vertices in $P_m$ ($f_0(P_m)$) with high probability.

Theorem 20 Let $X_1, \ldots, X_m$ be i.i.d. random points in $\mathbb{R}^d$ with common standard normal distribution and $\delta > 0$ be given. Then with probability of at least $1 - \delta$,

$$f_0(P_m) \ge \Big(c_d (\log m)^{\frac{d-1}{2}} - \frac{\tilde{c}_d}{\sqrt{\delta}} (\log m)^{\frac{d-1}{4}}\Big) \cdot (1 + o(1))$$

as $m \to \infty$, where $c_d$ and $\tilde{c}_d$ are constants depending only on $d$.

Proof Using Chebyshev's inequality (in the second inequality), as well as Theorem 19 we get

$$\Pr(f_0(P_m) > E f_0(P_m) - t) = 1 - \Pr(f_0(P_m) \le E f_0(P_m) - t) \ge 1 - \Pr(|f_0(P_m) - E f_0(P_m)| \ge t) \ge 1 - \frac{\mathrm{Var}(f_0(P_m))}{t^2} \ge 1 - \frac{c_d}{t^2} (\log m)^{\frac{d-1}{2}}.$$

Equating the RHS to $1 - \delta$ and solving for $t$ we get

$$t = \sqrt{\frac{c_d (\log m)^{\frac{d-1}{2}}}{\delta}}.$$

Applying Theorem 18 completes the proof.

Theorem 21 (Lower bound) Let $\mathcal{H}$ be the class of linear binary classifiers in $\mathbb{R}^d$, and let the underlying distribution be standard normal distribution in $\mathbb{R}^d$. Then there exists a target hypothesis such that, with probability of at least $1 - \delta$ over choices of $S_m$, the number of label requests $k$ by CAL is bounded by

$$k \ge \frac{c_d}{2} (\log m)^{\frac{d-1}{2}} \cdot (1 + o(1))$$

as $m \to \infty$, where $c_d$ is a constant depending only on $d$.

Proof Let us look at the Gaussian polytope $P_m$ induced by the random sample $S_m$.
As long as all labels requested by CAL have the same value (the case of minuscule minority class) we note that every vertex of $P_m$ falls in the region of disagreement with respect to any subset of $S_m$ that does not include that specific vertex. Therefore, CAL will request a label at least for each vertex of $P_m$. For sufficiently large $m$, in particular,

$$\log m \ge \Big(\frac{2 \tilde{c}_d}{c_d \sqrt{\delta}}\Big)^{\frac{4}{d-1}},$$

we conclude the proof by applying Theorem 20.

6. Relation to Existing Label Complexity Measures

A number of complexity measures to quantify the speedup in active learning have been proposed. In this section we show interesting relations between our techniques and two well known measures, namely the teaching dimension (Goldman and Kearns, 1995) and the disagreement coefficient (Hanneke, 2009).

Considering first the teaching dimension, we prove in Lemma 26 that the version space compression set size is bounded above, with high probability, by the extended teaching dimension growth function (introduced by Hanneke, 2007b). Consequently, it follows that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution.

We then focus on Hanneke's disagreement coefficient and show in Theorem 34 that the coverage of CSS can be bounded below using the disagreement coefficient. Conversely, in Corollary 39 we show that the disagreement coefficient can be bounded above using any coverage bound for CSS. Consequently, the results here imply that the disagreement coefficient, $\theta(\varepsilon)$, grows slowly with $1/\varepsilon$ for the case of linear classifiers under a mixture of Gaussians.

6.1 Teaching Dimension

The teaching dimension is a label complexity measure proposed by Goldman and Kearns (1995). The dimension of the hypothesis class $\mathcal{H}$ is the minimum number of examples required to present to any consistent learner in order to uniquely identify any hypothesis in the class.
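For a finite hypothesis class over a finite domain, the teaching dimension can be computed by brute force. The Python sketch below is illustrative only and is not part of the paper; the function names are hypothetical, each hypothesis is assumed to be given as a tuple of $\pm 1$ labels over a common indexed domain, and the hypotheses are assumed to be pairwise distinct. For each target it searches for the smallest set of points on which no other hypothesis agrees with the target, and then takes the worst case over targets.

```python
from itertools import combinations
from typing import Sequence, Tuple

Hypothesis = Tuple[int, ...]  # labels in {-1, +1}, one per domain point


def teaching_set_size(target: Hypothesis, hypotheses: Sequence[Hypothesis]) -> int:
    """Size of a smallest set of labeled points consistent with `target` and no other hypothesis."""
    n_points = len(target)
    others = [h for h in hypotheses if h != target]
    for size in range(n_points + 1):
        for points in combinations(range(n_points), size):
            # The labeled points teach `target` if every other hypothesis
            # disagrees with it on at least one of them.
            if all(any(h[p] != target[p] for p in points) for h in others):
                return size
    raise ValueError("hypotheses must be pairwise distinct on the domain")


def teaching_dimension(hypotheses: Sequence[Hypothesis]) -> int:
    """Worst case, over targets in the class, of the smallest teaching set size."""
    return max(teaching_set_size(h, hypotheses) for h in hypotheses)


if __name__ == "__main__":
    # Thresholds on 4 ordered points: the hypothesis with index t labels points j >= t as +1.
    thresholds = [tuple(1 if j >= t else -1 for j in range(4)) for t in range(5)]
    print("thresholds:", teaching_dimension(thresholds))

    # The empty concept plus all singletons on 3 points.
    singletons = [(-1, -1, -1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    print("empty set plus singletons:", teaching_dimension(singletons))
```

The search is exponential in the domain size, so it is only meant for small illustrative classes. The second class shows why the measure can be large: teaching the empty concept requires ruling out every singleton separately.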
We now define the following variation of the extended teaching dimension (Hegedüs, 1995) due to Hanneke. Throughout we use the notation $h_1(S) = h_2(S)$ to denote the fact that the two hypotheses agree on the classification of all instances in $S$.

Definition 22 (Extended Teaching Dimension, Hegedüs, 1995; Hanneke, 2007b) Let $V \subseteq \mathcal{H}$, $m \ge 0$, $U \in \mathcal{X}^m$. $\forall f \in \mathcal{H}$,

$$XTD(f, V, U) = \inf\{t \mid \exists R \subseteq U : |\{h \in V : h(R) = f(R)\}| \le 1 \wedge |R| \le t\}.$$

Definition 23 (Hanneke, 2007b) For $V \subseteq \mathcal{H}$, $V[S_m]$ denotes any subset of $V$ such that

$$\forall h \in V, \quad |\{h' \in V[S_m] : h'(S_m) = h(S_m)\}| = 1.$$

Claim 24 Let $S_m$ be a sample of size $m$, $\mathcal{H}$ a hypothesis class, and $\hat{n} = \hat{n}(\mathcal{H}, S_m)$, the version space compression set size. Then, $XTD(h^*, \mathcal{H}[S_m], S_m) = \hat{n}$.

Proof Let $S_{\hat{n}} \subseteq S_m$ be a version space compression set. Assume, by contradiction, that there exist two hypotheses $h_1, h_2 \in \mathcal{H}[S_m]$, each of which agrees on the given classifications of all examples in $S_{\hat{n}}$. Therefore, $h_1, h_2 \in VS_{\mathcal{H},S_{\hat{n}}}$, and by the definition of version space compression set, we get $h_1, h_2 \in VS_{\mathcal{H},S_m}$. Hence,

$$|\{h \in \mathcal{H}[S_m] : h(S_m) = h^*(S_m)\}| \ge 2,$$

which contradicts Definition 23. Therefore,

$$|\{h \in \mathcal{H}[S_m] : h(S_{\hat{n}}) = h^*(S_{\hat{n}})\}| \le 1,$$

and $XTD(h^*, \mathcal{H}[S_m], S_m) \le |S_{\hat{n}}| = \hat{n}$.

Let $R \subset S_m$ be any subset of size $|R| < \hat{n}$. Consequently, $VS_{\mathcal{H},S_m} \subset VS_{\mathcal{H},R}$, and there exists a hypothesis, $h' \in VS_{\mathcal{H},R}$, that agrees with all labeled examples in $R$, but disagrees with at least one example in $S_m$. Thus, $h'(S_m) \neq h^*(S_m)$, and according to Definition 23, there exist hypotheses $h_1, h_2 \in \mathcal{H}[S_m]$ such that $h_1(S_m) = h'(S_m) \neq h^*(S_m) = h_2(S_m)$. But $h_1(R) = h_2(R) = h^*(R)$, so

$$|\{h \in V[S_m] : h(R) = h^*(R)\}| \ge 2.$$

It follows that $XTD(h^*, \mathcal{H}[S_m], S_m) \ge \hat{n}$.

Definition 25 (XTD Growth Function, Hanneke, 2007b) For $m \ge 0$, $V \subseteq \mathcal{H}$, $\delta \in [0, 1]$,

$$XTD(V, P, m, \delta) = \inf\{t \mid \forall h \in \mathcal{H}, \Pr\{XTD(h, V[S_m], S_m) > t\} \le \delta\}.$$
Lemma 26 Let $\mathcal{H}$ be a hypothesis class, $P$ an unknown distribution, and $\delta > 0$. Then, with probability of at least $1 - \delta$,

$$\hat{n} \le XTD(\mathcal{H}, P, m, \delta).$$

Proof According to Definition 25, with probability of at least $1 - \delta$, $XTD(h^*, \mathcal{H}[S_m], S_m) \le XTD(\mathcal{H}, P, m, \delta)$. Applying Claim 24 completes the proof.

Lemma 27 (Balanced Axis-Aligned Rectangles, Hanneke, 2007b, Lemma 4) If $P$ is a product distribution on $\mathbb{R}^d$ with continuous CDF, and $\mathcal{H}$ is the set of axis-aligned rectangles such that $\forall h \in \mathcal{H}$, $\Pr_{X \sim P}\{h(X) = +1\} \ge \lambda$, then,

$$XTD(\mathcal{H}, P, m, \delta) \le O\Big(\frac{d^2}{\lambda} \log\frac{dm}{\delta}\Big).$$

Lemma 28 (Blumer et al., 1989, Lemma 3.2.3) Let $\mathcal{F}$ be a binary hypothesis class of finite VC dimension $d \ge 1$. For all $k \ge 1$, define the $k$-fold union,

$$\mathcal{F}_{k\cup} \triangleq \Big\{\bigcup_{i=1}^{k} f_i : f_i \in \mathcal{F}, 1 \le i \le k\Big\}.$$

Then, for all $k \ge 1$,

$$VC(\mathcal{F}_{k\cup}) \le 2dk \log_2(3k).$$

Lemma 29 (order-n characterizing set complexity) Let $\mathcal{H}$ be the class of axis-aligned rectangles in $\mathbb{R}^d$. Then,

$$\gamma(\mathcal{H}, n) \le O(dn \log n).$$

Proof Let $S_n = S_k^- \cup S_{n-k}^+$ be a sample of size $n$ composed of $k$ negative examples, $\{x_1, x_2, \ldots, x_k\}$, and $n - k$ positive ones. Let $\mathcal{H}$ be the class of axis-aligned rectangles. We define,

$$\forall 1 \le i \le k, \quad R_i \triangleq S_{n-k}^+ \cup \{(x_i, -1)\}.$$

Notice that $VS_{\mathcal{H},R_i}$ includes all axis-aligned rectangles that classify all samples in $S^+$ as positive, and $x_i$ as negative. Therefore, the agreement region of $VS_{\mathcal{H},R_i}$ is composed of two components as depicted in Figure 1. The first component is the smallest rectangle that bounds the positive samples, and the second is an unbounded convex polytope defined by up to $d$ hyperplanes intersecting at $x_i$. Let $\mathrm{AGR}_i$ be the agreement region of $VS_{\mathcal{H},R_i}$ and $\mathrm{AGR}$ the agreement region of $VS_{\mathcal{H},S_n}$. Clearly, $R_i \subseteq S_n$, so $VS_{\mathcal{H},S_n} \subseteq VS_{\mathcal{H},R_i}$, and $\mathrm{AGR}_i \subseteq \mathrm{AGR}$, and it follows that

$$\bigcup_{i=1}^{k} \mathrm{AGR}_i \subseteq \mathrm{AGR}.$$

Assume, by contradiction, that $x \in \mathrm{AGR}$ but $x \notin \bigcup_{i=1}^{k} \mathrm{AGR}_i$. Therefore, for any $1 \le i \le k$, there exist two hypotheses $h_1^{(i)}, h_2^{(i)} \in VS_{\mathcal{H},R_i}$, such that $h_1^{(i)}(x) \neq h_2^{(i)}(x)$. Assume, without loss of generality, that $h_1^{(i)}(x) = 1$.
We define

$$h_1 \triangleq \bigwedge_{i=1}^{k} h_1^{(i)} \quad \text{and} \quad h_2 \triangleq \bigwedge_{i=1}^{k} h_2^{(i)},$$

meaning that $h_1$ classifies a sample as positive if and only if all hypotheses $h_1^{(i)}$ classify it as positive. Noting that the intersection of axis-aligned rectangles is itself an axis-aligned rectangle, we know that $h_1, h_2 \in \mathcal{H}$. Moreover, for any $x_i$ we have $h_1^{(i)}(x_i) = h_2^{(i)}(x_i) = -1$, so also $h_1(x_i) = h_2(x_i) = -1$, and $h_1, h_2 \in VS_{\mathcal{H},S_n}$. But $h_1(x) \neq h_2(x)$. Contradiction. Therefore,

$$\mathrm{AGR} = \bigcup_{i=1}^{k} \mathrm{AGR}_i.$$

It is well known that the VC dimension of a hyper-rectangle in $\mathbb{R}^d$ is $2d$. The VC dimension of $\mathrm{AGR}_i$ is bounded by the VC dimension of the union of two hyper-rectangles in $\mathbb{R}^d$. Furthermore, the VC dimension of $\mathrm{AGR}$ is bounded by the VC dimension of the union of all $\mathrm{AGR}_i$. Applying Lemma 28 twice we get,

$$\mathrm{VCdim}\{\mathrm{AGR}\} \le 42\, dk \log_2(3k) \le 42\, dn \log_2(3n).$$

If $k = 0$ then the entire sample is positive and the region of agreement is a hyper-rectangle. Therefore, $\mathrm{VCdim}\{\mathrm{AGR}\} = 2d$. If $k = n$ then the entire sample is negative and the region of agreement is the points of the samples themselves. Hence, $\mathrm{VCdim}\{\mathrm{AGR}\} = n$. Overall we get that in all cases,

$$\mathrm{VCdim}\{\mathrm{AGR}\} \le 42\, dn \log_2(3n) = O(dn \log n).$$

Figure 1: Agreement region of $VS_{\mathcal{H},R_i}$.

Corollary 30 (Balanced Axis-Aligned Rectangles) Under the same conditions of Lemma 27, the class of balanced axis-aligned rectangles in $\mathbb{R}^d$ can be perfectly selectively learned with fast coverage rate.

Proof Applying Lemmas 26 and 27 we get that with probability of at least $1 - \delta$,

$$\hat{n} \le O\Big(\frac{d^2}{\lambda} \log\frac{dm}{\delta}\Big).$$

Any balanced axis-aligned rectangle belongs to the class of all axis-aligned rectangles. Therefore, the coverage of CSS for the class of balanced axis-aligned rectangles is bounded below by the coverage of the class of axis-aligned rectangles. Applying Lemma 29, and assuming $m \ge d$, we obtain,

$$\gamma(\mathcal{H}, \hat{n}) \le O\Big(d\, \frac{d^2}{\lambda} \log\frac{dm}{\delta}\, \log\Big(\frac{d^2}{\lambda} \log\frac{dm}{\delta}\Big)\Big) \le O\Big(\frac{d^3}{\lambda} \log^2\frac{dm}{\lambda\delta}\Big).$$

Applying Theorem 15 completes the proof.
6.2 Disagreement Coefficient

In this section we show interesting relations between the disagreement coefficient and coverage bounds in perfect selective classification. We begin by defining, for a hypothesis $h \in \mathcal{H}$, the set of all hypotheses that are $r$-close to $h$.

Definition 31 (Hanneke, 2011b, p.337) For any hypothesis $h \in \mathcal{H}$, distribution $P$ over $\mathcal{X}$, and $r > 0$, define the set $B(h, r)$ of all hypotheses that reside in a ball of radius $r$ around $h$,

$$B(h, r) \triangleq \left\{ h' \in \mathcal{H} : \Pr_{X \sim P}\{h'(X) \ne h(X)\} \le r \right\}.$$

Theorem 32 (Vapnik and Chervonenkis, 1971; Anthony and Bartlett, 1999, p.53) Let $\mathcal{H}$ be a hypothesis class with VC-dimension $d$. For any probability distribution $P$ on $\mathcal{X} \times \{\pm 1\}$, with probability of at least $1 - \delta$ over the choice of $S_m$, any hypothesis $h \in \mathcal{H}$ consistent with $S_m$ satisfies

$$R(h) \le \eta(d, m, \delta) \triangleq \frac{2}{m}\left(d \ln \frac{2em}{d} + \ln \frac{2}{\delta}\right).$$

For any $G \subseteq \mathcal{H}$ and distribution $P$ we denote by $\Delta G$ the volume of the disagreement region of $G$,

$$\Delta G \triangleq \Pr\{DIS(G)\}.$$

Definition 33 (Disagreement coefficient, Hanneke, 2009) Let $\varepsilon \ge 0$. The disagreement coefficient of the hypothesis class $\mathcal{H}$ with respect to the target distribution $P$ is

$$\theta(\varepsilon) \triangleq \theta_{h^*}(\varepsilon) = \sup_{r > \varepsilon} \frac{\Delta B(h^*, r)}{r}.$$

The following theorem formulates an intimate relation between active learning (disagreement coefficient) and selective classification.

Theorem 34 Let $\mathcal{H}$ be a hypothesis class with VC-dimension $d$, $P$ an unknown distribution, $\varepsilon \ge 0$, and $\theta(\varepsilon)$ the corresponding disagreement coefficient. Let $(h, g)$ be a perfect selective classifier (CSS, see Section 2.3). Then, $R(h, g) = 0$, and for any $0 \le \delta \le 1$, with probability of at least $1 - \delta$,

$$\Phi(h, g) \ge 1 - \theta(\varepsilon) \cdot \max\{\eta(d, m, \delta), \varepsilon\}.$$

Proof Clearly, $R(h, g) = 0$, and it remains to prove the coverage bound. By Theorem 32, with probability of at least $1 - \delta$,

$$\forall h \in VS_{\mathcal{H},S_m} \quad R(h) \le \eta(d, m, \delta) \le \max\{\eta(d, m, \delta), \varepsilon\}.$$

Therefore,

$$VS_{\mathcal{H},S_m} \subseteq B(h^*, \max\{\eta(d, m, \delta), \varepsilon\}), \qquad \Delta VS_{\mathcal{H},S_m} \le \Delta B(h^*, \max\{\eta(d, m, \delta), \varepsilon\}).$$

By Definition 33, for any $r' > \varepsilon$, $\Delta B(h^*, r') \le \theta(\varepsilon) r'$.
Thus, the proof is complete by recalling that $\Phi(h, g) = 1 - \Delta VS_{\mathcal{H},S_m}$.

Theorem 34 tells us that whenever our learning problem (specified by the pair $(\mathcal{H}, P)$) has a disagreement coefficient that grows slowly with respect to $1/\varepsilon$, it can be (perfectly) selectively learned with a "fast" coverage bound. Consequently, through Theorem 9 we also know that in each case where there exists a disagreement coefficient that grows slowly with respect to $1/\varepsilon$, active learning with a fast rate can also be deduced directly through a reduction from perfect selective classification. It follows that as far as fast rates in active learning are concerned, whatever can be accomplished by bounding the disagreement coefficient can also be accomplished using perfect selective classification. This result is summarized in the following corollary.

Corollary 35 Let $\mathcal{H}$ be a hypothesis class with VC-dimension $d$, $P$ an unknown distribution, and $\theta(\varepsilon)$ the corresponding disagreement coefficient. If $\theta(\varepsilon) = O(\mathrm{polylog}(1/\varepsilon))$, there exists a coverage bound such that an application of Theorem 7 ensures that $(\mathcal{H}, P)$ is actively learnable with exponential label complexity speedup.

Proof The proof is established by straightforward applications of Theorem 34 with $\varepsilon = 1/m$ and Theorem 9.

The following result, due to Hanneke (2011a), implies a coverage upper bound for CSS.

Lemma 36 (Hanneke, 2011a, Proof of Lemma 47) Let $\mathcal{H}$ be a hypothesis class, $P$ an unknown distribution, and $r \in (0, 1)$. Then,

$$E_P \Delta D_m \ge (1 - r)^m \Delta B(h^*, r),$$

where

$$D_m \triangleq VS_{\mathcal{H},S_m} \cap B(h^*, r). \qquad (3)$$

Theorem 37 (Coverage upper bound) Let $\mathcal{H}$ be a hypothesis class, $P$ an unknown distribution, and $\delta \in (0, 1)$. Then, for any $r \in (0, 1)$, $1 > \alpha > \delta$,

$$B_\Phi(\mathcal{H}, \delta, m) \le 1 - \frac{(1 - r)^m - \alpha}{1 - \alpha} \Delta B(h^*, r),$$

where $B_\Phi(\mathcal{H}, \delta, m)$ is any coverage bound.

Proof Recalling the definition of $D_m$ (3), clearly $D_m \subseteq VS_{\mathcal{H},S_m}$ and $D_m \subseteq B(h^*, r)$.
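These inclusions imply (respectively), by the definition of disagreement set,

$$\Delta D_m \le \Delta VS_{\mathcal{H},S_m}, \quad \text{and} \quad \Delta D_m \le \Delta B(h^*, r). \qquad (4)$$

Using Markov's inequality (in inequality (5) of the following derivation) and applying (4) (in equality (6)), we thus have

$$\begin{aligned}
\Pr\left\{\Delta VS_{\mathcal{H},S_m} \le \frac{(1 - r)^m - \alpha}{1 - \alpha} \Delta B(h^*, r)\right\}
&\le \Pr\left\{\Delta D_m \le \frac{(1 - r)^m - \alpha}{1 - \alpha} \Delta B(h^*, r)\right\} \\
&= \Pr\left\{\Delta B(h^*, r) - \Delta D_m \ge \frac{1 - (1 - r)^m}{1 - \alpha} \Delta B(h^*, r)\right\} \\
&\le \Pr\left\{\left|\Delta B(h^*, r) - \Delta D_m\right| \ge \frac{1 - (1 - r)^m}{1 - \alpha} \Delta B(h^*, r)\right\} \\
&\le (1 - \alpha) \cdot \frac{E\{|\Delta B(h^*, r) - \Delta D_m|\}}{(1 - (1 - r)^m)\, \Delta B(h^*, r)} \qquad (5) \\
&= (1 - \alpha) \cdot \frac{\Delta B(h^*, r) - E \Delta D_m}{(1 - (1 - r)^m)\, \Delta B(h^*, r)}. \qquad (6)
\end{aligned}$$

Applying Lemma 36 we therefore obtain

$$\le (1 - \alpha) \cdot \frac{\Delta B(h^*, r) - (1 - r)^m \Delta B(h^*, r)}{(1 - (1 - r)^m)\, \Delta B(h^*, r)} = 1 - \alpha < 1 - \delta.$$

Observing that for any coverage bound,

$$\Pr\left\{\Delta VS_{\mathcal{H},S_m} \le 1 - B_\Phi(\mathcal{H}, \delta, m)\right\} \ge 1 - \delta,$$

completes the proof.

Corollary 38 Let $\mathcal{H}$ be a hypothesis class, $P$ an unknown distribution, and $\delta \in (0, 1/8)$. Then for any $m \ge 2$,

$$B_\Phi(\mathcal{H}, \delta, m) \le 1 - \frac{1}{7} \Delta B\left(h^*, \frac{1}{m}\right),$$

where $B_\Phi(\mathcal{H}, \delta, m)$ is any coverage bound.

Proof The proof is established by a straightforward application of Theorem 37 with $\alpha = 1/8$ and $r = 1/m$.

With Corollary 38 we can bound the disagreement coefficient for settings whose coverage bound is known.

Corollary 39 Let $\mathcal{H}$ be a hypothesis class, $P$ an unknown distribution, and $B_\Phi(\mathcal{H}, \delta, m)$ a coverage bound. Then the disagreement coefficient is bounded by

$$\theta(\varepsilon) \le \max\left\{ \sup_{r \in (\varepsilon, 1/2)} 7 \cdot \frac{1 - B_\Phi(\mathcal{H}, 1/9, \lfloor 1/r \rfloor)}{r},\ 2 \right\}.$$

Proof Applying Corollary 38 we get that for any $r \in (0, 1/2)$,

$$\frac{\Delta B(h^*, r)}{r} \le \frac{\Delta B(h^*, 1/\lfloor 1/r \rfloor)}{r} \le 7 \cdot \frac{1 - B_\Phi(\mathcal{H}, 1/9, \lfloor 1/r \rfloor)}{r}.$$

Therefore,

$$\theta(\varepsilon) = \sup_{r > \varepsilon} \frac{\Delta B(h^*, r)}{r} \le \max\left\{ \sup_{r \in (\varepsilon, 1/2)} 7 \cdot \frac{1 - B_\Phi(\mathcal{H}, 1/9, \lfloor 1/r \rfloor)}{r},\ 2 \right\}.$$

Corollary 40 Let $\mathcal{H}$ be the class of all linear binary classifiers in $\mathbb{R}^d$, and let the underlying distribution be any mixture of a fixed number of Gaussians in $\mathbb{R}^d$. Then

$$\theta(\varepsilon) \le O\left(\mathrm{polylog}\left(\frac{1}{\varepsilon}\right)\right).$$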
Proof Applying Corollary 39 together with inequality 2 we get that

$$\theta(\varepsilon) \le \max\left\{ \sup_{r \in (\varepsilon, 1/2)} 7 \cdot \frac{1 - B_\Phi(\mathcal{H}, 1/9, \lfloor 1/r \rfloor)}{r},\ 2 \right\} \le \max\left\{ \sup_{r \in (\varepsilon, 1/2)} \frac{7}{r} \cdot O\left(\frac{(\log \lfloor 1/r \rfloor)^{d^2}}{\lfloor 1/r \rfloor} \cdot 9^{\frac{d+3}{2}}\right),\ 2 \right\} \le O\left(\left(\log \frac{1}{\varepsilon}\right)^{d^2}\right).$$

7. Concluding Remarks

For quite a few years since its inception, the theory of target-independent bounds for noise-free active learning managed to handle only relatively simple settings, mostly revolving around homogeneous linear classifiers under the uniform distribution over the sphere. It is likely that this distributional uniformity assumption was often adopted to simplify analyses. However, it was shown by Dasgupta (2005) that under this distribution, exponential speedup cannot be achieved when considering general (non-homogeneous) linear classifiers. The reason for this behavior is related to the two tasks that a good active learner should successfully accomplish: exploration and exploitation. Intuitively (and oversimplifying things), exploration is the task of obtaining at least one sample in each class, and exploitation is the process of refining the decision boundary by requesting labels of points around the boundary. Dasgupta showed that exploration cannot be achieved fast enough under the uniform distribution on the sphere. The source of this difficulty is the fact that under this distribution all training points reside on their convex hull. In general, the speed of exploration (using linear classifiers) depends on the size (number of vertices) of the convex hull of the training set. When using homogeneous linear classifiers, exploration is trivially achieved (under the uniform distribution) and exploitation can achieve exponential speedup. So why is it possible, in the non-verifiable model (Balcan et al., 2008), to achieve exponential speedup even when using non-homogeneous linear classifiers under the uniform distribution?
The answer is that in the non-verifiable model, label complexity attributed to exploration is encapsulated in a target-dependent "constant." Specifically, in Balcan et al. (2008) this constant is explicitly defined to be the probability mass of the minority class. Indeed, in certain noise-free settings using linear classifiers, where the minority class is large enough, exploration is a non-issue. In general, however, exploration is a major bottleneck in practical active learning (Baram et al., 2004; Begleiter et al., 2008). The present results show how exponential speedup can be achieved, including exploration, when using different (and perhaps more natural) distributions.

Alongside this good news, a somewhat pessimistic picture arises from the lower bound we obtained for the exponential dependency on the dimension $d$. This negative result is not restricted to stream-based active learning and readily applies also to the pool-based model. While the bound is only asymptotic, we conjecture that it also holds for finite samples. Moreover, we believe that within the stream- or pool-based settings a similar statement should hold true for any active learning method (and not necessarily CAL-based querying strategies). This result indicates that when performing noise-free active learning of linear classifiers, aggressive feature selection is beneficial for exploration speedup. We note, however, that it remains open whether a slowdown exponent of $d$ (rather than $d^2$) is achievable.

We have exposed interesting relations of the present technique to well-known complexity measures for active learning, namely, the teaching dimension and the disagreement coefficient. These developments were facilitated by observations made by Hanneke on the teaching dimension and the disagreement coefficient. These relations gave rise to further observations on active learning, which are discussed in Section 6 and include exponential speedup for balanced axis-aligned rectangles.
Finally, we note that the intimate relation between selective classification and the disagreement coefficient was recently exposed in another result for selective classification, where the disagreement coefficient emerged as a dominating factor in a coverage bound for agnostic selective classification (El-Yaniv and Wiener, 2011).

Acknowledgments

We thank the anonymous reviewers for their good comments. This paper particularly benefited from insightful observations made by one of the reviewers, which are summarized in Section 6, including the proof of Theorem 37 and the link between our $\hat{n}$ and the extended teaching dimension (Lemmas 26 and 27).

Appendix A.

Lemma 41 For any $m \ge 3$, $a \ge 1$, $b \ge 1$ we get

$$\sum_{i=1}^{m} \frac{\ln^a(bi)}{i} < \frac{4}{a} \ln^{a+1}(b(m + 1)).$$

Proof Setting $f(x) \triangleq \frac{\ln^a(bx)}{x}$, we have

$$\frac{df}{dx} = (a - \ln bx) \cdot \frac{\ln^{a-1}(bx)}{x^2}.$$

Therefore, $f$ is monotonically increasing when $x < e^a/b$, monotonically decreasing when $x \ge e^a/b$, and it attains its maximum at $x = e^a/b$. Consequently, for $i < e^a/b - 1$, or $i \ge e^a/b + 1$,

$$f(i) \le \int_{x=i-1}^{i+1} f(x)\,dx.$$

For $e^a/b - 1 \le i < e^a/b + 1$,

$$f(i) \le f(e^a/b) = b\left(\frac{a}{e}\right)^a \le a^a. \qquad (7)$$

Therefore, if $m < e^a - 1$ we have

$$\sum_{i=1}^{m} f(i) = \ln^a(b) + \sum_{i=2}^{m} f(i) < 2 \cdot \int_{x=1}^{m+1} f(x)\,dx \le \frac{2}{a+1} \ln^{a+1}(b(m + 1)).$$

Otherwise, $m \ge e^a/b$, in which case we overcome the change of slope by adding twice the (upper bound on the) maximal value (7),

$$\begin{aligned}
\sum_{i=1}^{m} f(i) &< \frac{2}{a+1} \ln^{a+1}(b(m + 1)) + 2a^a = \frac{2}{a+1} \ln^{a+1}(b(m + 1)) + \frac{2}{a} a^{a+1} \\
&\le \frac{2}{a+1} \ln^{a+1}(b(m + 1)) + \frac{2}{a} \ln^{a+1} bm \le \frac{4}{a} \ln^{a+1}(b(m + 1)).
\end{aligned}$$

Appendix B. Alternative Proof of Lemma 6 Using Super Martingales

Define $W_k \triangleq \sum_{i=1}^{k} (Z_i - b_i)$. We assume that with probability of at least $1 - \delta/2$,

$$\Pr\{Z_i \mid Z_1, \ldots, Z_{i-1}\} \le b_i,$$

simultaneously for all $i$. Since $Z_i$ is a binary random variable it is easy to see that (w.h.p.),

$$E_{Z_i}\{W_i \mid Z_1, \ldots, Z_{i-1}\} = \Pr\{Z_i \mid Z_1, \ldots, Z_{i-1}\} - b_i + W_{i-1} \le W_{i-1},$$

and the sequence $W_1^m \triangleq W_1, \ldots$
, $W_m$ is a super-martingale with high probability. We apply the following theorem by McDiarmid that refers to martingales (but can be shown to apply to super-martingales, by following its original proof).

Theorem 42 (McDiarmid, 1998, Theorem 3.12) Let $Y_1, \ldots, Y_n$ be a martingale difference sequence with $-a_k \le Y_k \le 1 - a_k$ for each $k$; let $A = \frac{1}{n} \sum a_k$. Then, for any $\varepsilon > 0$,

$$\Pr\left\{\sum Y_k \ge An\varepsilon\right\} \le \exp\left(-[(1 + \varepsilon)\ln(1 + \varepsilon) - \varepsilon]\,An\right) \le \exp\left(-\frac{An\varepsilon^2}{2(1 + \varepsilon/3)}\right).$$

In our case, $Y_k = W_k - W_{k-1} = Z_k - b_k \le 1 - b_k$ and we apply the (revised) theorem with $a_k \triangleq b_k$ and $An \triangleq \sum b_k \triangleq B$. We thus obtain, for any $0 < \varepsilon < 1$,

$$\Pr\left\{\sum Z_k \ge B + B\varepsilon\right\} \le \exp\left(-\frac{B\varepsilon^2}{2(1 + \varepsilon/3)}\right).$$

Equating the right-hand side to $\delta/2$, we obtain

$$\varepsilon = \left(\frac{2}{3}\ln\frac{2}{\delta} \pm \sqrt{\frac{4}{9}\ln^2\frac{2}{\delta} + 8B\ln\frac{2}{\delta}}\right)\Big/\,2B \;\le\; \left(\frac{1}{3}\ln\frac{2}{\delta} + \sqrt{\frac{1}{9}\ln^2\frac{2}{\delta} + 2B\ln\frac{2}{\delta}}\right)\Big/\,B \;\le\; \left(\frac{2}{3}\ln\frac{2}{\delta} + \sqrt{2B\ln\frac{2}{\delta}}\right)\Big/\,B.$$

Applying the union bound completes the proof.

References

M. Anthony and P.L. Bartlett. Neural Network Learning; Theoretical Foundations. Cambridge University Press, 1999.
L. Atlas, D. Cohn, R. Ladner, A.M. El-Sharkawi, and R.J. Marks. Training connectionist networks with queries and selective sampling. In Neural Information Processing Systems (NIPS), pages 566–573, 1990.
M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In 21st Annual Conference on Learning Theory (COLT), pages 45–56, 2008.
Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255–291, 2004.
P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.
R. Begleiter, R. El-Yaniv, and D. Pechyony. Repairing self-confident active-transductive learners using systematic exploration. Pattern Recognition Letters, 29(9):1245–1251, 2008.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 1989.
C.K. Chow.
An optimum character recognition system using decision function. IEEE Transactions on Computers, 6(4):247–254, 1957.
C.K. Chow. On optimum recognition error and reject trade-off. IEEE Transactions on Information Theory, 16:41–46, 1970.
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, pages 235–242, 2005.
S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010.
R. El-Yaniv and Y. Wiener. Agnostic selective classification. In Neural Information Processing Systems (NIPS), 2011.
S. Fine, R. Gilad-Bachrach, and E. Shamir. Query by committee, linear separation and random walks. Theoretical Computer Science, 284(1):25–51, 2002.
Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and Query by Committee. In Advances in Neural Information Processing Systems (NIPS) 5, pages 483–490, 1993.
Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004.
E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, the Hebrew University of Jerusalem, 2007.
S. Goldman and M. Kearns. On the complexity of teaching. JCSS: Journal of Computer and System Sciences, 50, 1995.
S. Hanneke. A bound on the label complexity of agnostic active learning.
In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 353–360, 2007a.
S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), volume 4539 of Lecture Notes in Artificial Intelligence, pages 66–81, 2007b.
S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. CoRR, abs/1108.1766, 2011a. URL http://arxiv.org/abs/1108.1766. Informal publication.
S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 37(1):333–361, 2011b.
T. Hegedűs. Generalized teaching dimensions and the query complexity of learning. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1995.
R. Herbei and M.H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
D. Hug and M. Reitzner. Gaussian polytopes: variances and limit theorems, June 2005.
D. Hug, G. O. Munsonious, and M. Reitzner. Asymptotic mean values of Gaussian polytopes. Beiträge Algebra Geom., 45:531–548, 2004.
C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195–248. Springer-Verlag, 1998.
T. Mitchell. Version spaces: a candidate elimination approach to rule learning. In IJCAI'77: Proceedings of the 5th international joint conference on Artificial Intelligence, pages 305–310, 1977.
H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning theory (COLT), pages 287–294, 1992.
V. Vapnik and A. Chervonenkis.
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155–168, 2007.

4 0.030819992 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we deﬁne such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods

5 0.028920628 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan

Abstract: Classiﬁcation algorithms are frequently used on data with a natural hierarchical structure. For instance, classiﬁers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classiﬁcation outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (ﬁxed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classiﬁcation accuracy is complemented by inference on the balanced accuracy, which avoids inﬂated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classiﬁcation studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies

6 0.027319307 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

7 0.024647418 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

8 0.02425928 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

9 0.023549328 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

10 0.021520423 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

11 0.021272123 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

12 0.019734977 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

13 0.019651569 80 jmlr-2012-On Ranking and Generalization Bounds

14 0.018810324 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

15 0.018753866 111 jmlr-2012-Structured Sparsity and Generalization

16 0.018692004 82 jmlr-2012-On the Necessity of Irrelevant Variables

17 0.018512068 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

18 0.017897915 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

19 0.017395457 12 jmlr-2012-Active Clustering of Biological Sequences

20 0.017394558 44 jmlr-2012-Feature Selection via Dependence Maximization

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.082), (1, 0.027), (2, 0.026), (3, -0.015), (4, 0.014), (5, -0.006), (6, 0.067), (7, 0.051), (8, -0.012), (9, 0.023), (10, -0.001), (11, -0.019), (12, 0.014), (13, -0.029), (14, 0.05), (15, 0.038), (16, 0.071), (17, 0.049), (18, 0.017), (19, -0.002), (20, -0.018), (21, 0.021), (22, -0.008), (23, -0.005), (24, -0.021), (25, 0.074), (26, -0.211), (27, 0.006), (28, 0.11), (29, -0.027), (30, -0.039), (31, -0.164), (32, 0.201), (33, -0.024), (34, 0.081), (35, 0.058), (36, 0.049), (37, 0.128), (38, 0.118), (39, 0.175), (40, 0.126), (41, -0.389), (42, -0.012), (43, 0.219), (44, 0.133), (45, -0.026), (46, 0.167), (47, 0.142), (48, 0.292), (49, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96425349 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

Author: Vikas C. Raykar, Shipeng Yu

2 0.42120168 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss

Author: José Hernández-Orallo, Peter Flach, Cèsar Ferri

3 0.23710269 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller

Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system's translation. Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision. Note: This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. © 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller.

4 0.22724161 90 jmlr-2012-Pattern for Python

Author: Tom De Smedt, Walter Daelemans

Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classiﬁers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks

5 0.22509289 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

6 0.18597235 13 jmlr-2012-Active Learning via Perfect Selective Classification

7 0.18320511 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

8 0.17065609 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

9 0.16563654 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

10 0.16441558 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

11 0.13462648 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

12 0.13132578 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

13 0.12891835 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

14 0.12868567 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

15 0.12775607 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

16 0.1243317 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

17 0.11970555 91 jmlr-2012-Plug-in Approach to Active Learning

18 0.11455769 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

19 0.11386674 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

20 0.11307883 12 jmlr-2012-Active Clustering of Biological Sequences

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (21, 0.022), (26, 0.025), (29, 0.039), (35, 0.024), (49, 0.036), (55, 0.012), (56, 0.023), (57, 0.012), (69, 0.023), (75, 0.044), (81, 0.014), (92, 0.048), (95, 0.447), (96, 0.103)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70408535 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

Author: Vikas C. Raykar, Shipeng Yu

2 0.48091567 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan

Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efﬁcient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efﬁcient, crowdsourcing

3 0.27501163 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel

Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. BoostMetric is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that the weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

4 0.27345383 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir

Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓp norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓp MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓp MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓp MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel Fisher discriminant analysis, regularised least squares, support vector machines

5 0.27222484 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao

Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
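
The idea of turning a serial gradient-based update into a distributed one can be sketched as follows. This is a simplified, single-process simulation and not the authors' algorithm: it ignores communication latency, uses a scalar model with squared loss, and all names, data and the step size are assumptions chosen for illustration. Each simulated worker averages gradients over its own shard of the mini-batch, and the averages are combined into one update.

```python
def grad_sq_loss(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to the scalar w."""
    return (w * x - y) * x

def worker_average(w, shard):
    """Average gradient computed by one (simulated) worker on its shard."""
    return sum(grad_sq_loss(w, x, y) for x, y in shard) / len(shard)

def distributed_minibatch_step(w, shards, step_size):
    """One update: combine per-worker averages, weighted by shard size,
    so the result equals the average gradient over the whole mini-batch."""
    total = sum(len(s) for s in shards)
    g = sum(len(s) * worker_average(w, s) for s in shards) / total
    return w - step_size * g

# Illustrative data generated by y = 2 * x, split across two workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w_one_step = distributed_minibatch_step(0.0, shards, step_size=0.1)

w = 0.0
for _ in range(50):
    w = distributed_minibatch_step(w, shards, step_size=0.1)
```

Weighting each worker's average by its shard size keeps the combined gradient identical to the one a single processor would compute on the full mini-batch, so in this toy setting the distributed update matches the serial one exactly.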

6 0.27139565 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

7 0.27054065 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

8 0.26967403 106 jmlr-2012-Sign Language Recognition using Sub-Units

9 0.26952899 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

10 0.26950783 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

11 0.26935834 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

12 0.26909614 91 jmlr-2012-Plug-in Approach to Active Learning

13 0.2682665 103 jmlr-2012-Sampling Methods for the Nyström Method

14 0.26820481 81 jmlr-2012-On the Convergence Rate of lp-Norm Multiple Kernel Learning

15 0.26799339 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

16 0.2678659 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

17 0.26728648 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

18 0.26654416 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

19 0.26644072 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

20 0.26537055 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors