jmlr jmlr2010 jmlr2010-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy
Abstract: For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline. Keywords: multiple annotators, multiple experts, multiple teachers, crowdsourcing 1. Supervised Learning From Multiple Annotators/Experts A typical supervised learning scenario consists of a training set D = {(xi , yi )}N containing N i=1 instances, where xi ∈ X is an instance (typically a d-dimensional feature vector) and yi ∈ Y is the corresponding known label. The task is to learn a function f : X → Y which generalizes well on unseen data. Specifically for binary classification the supervision is from the set Y = {0, 1}, for multi-class classification Y = {1, . . . , K}, for ordinal regression Y = {1, . . . , K} (with an ordering 1 < . . . < K), and Y = R for regression. c 2010 Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo H. Valadez, Charles Florin, Luca Bogoni and Linda Moy. R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY However, for many real life tasks, it may not be possible, or may be too expensive (or tedious) to acquire the actual label yi for training—which we refer to as the gold standard or the objective ground truth. Instead, we may have multiple (possibly noisy) labels y1 , . . . , yR provided by R i i different experts or annotators. In practice, there is a substantial amount of disagreement among the experts, and hence it is of great practical interest to address conventional supervised learning algorithms in this scenario. Our motivation for this work comes from the area of computer-aided diagnosis1 (CAD), where the task is to build a classifier to predict whether a suspicious region on a medical image (like a X-ray, CT scan, or MRI) is malignant (cancerous) or benign. In order to train such a classifier, a set of images is collected from hospitals. The actual gold standard (whether it is cancer or not) can only be obtained from a biopsy of the tissue. Since it is an expensive, invasive, and potentially dangerous process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard.2 The radiologist also annotates various descriptors of the potentially malignant lesion, like the size (a regression problem), shape (a multiclass classification problem), and also degree of malignancy (an ordinal regression problem). The radiologists come from a diverse pool including luminaries, experts, residents, and novices. Very often there is lot of disagreement among the annotations. For a lot of tasks the labels provided by the annotators are inherently subjective and there will be substantial variation among different annotators. The domain of text classification offers such a scenario. In this context the task is to predict the category for a token of text. The labels for training are assigned by human annotators who read the text and attribute their subjective category. With the advent of crowdsourcing (Howe, 2008) services like Amazon’s Mechanical Turk,3 Games with a Purpose,4 and reCAPTCHA5 it is quite inexpensive to acquire labels from a large number of annotators (possibly thousands) in a short time (Sheng et al., 2008; Snow et al., 2008; Sorokin and Forsyth, 2008). Websites such as Galaxy Zoo6 allow the public to label astronomical images over the internet. In situations like these, the performance of different annotators can vary widely (some may even be malicious), and without the actual gold standard, it may not be possible to evaluate the annotators. In this work, we provide principled probabilistic solutions to the following questions: 1. How to adapt conventional supervised learning algorithms when we have multiple annotators providing subjective labels but no objective gold standard? 2. How to evaluate systems when we do not have absolute gold-standard? 3. A closely related problem—particularly relevant when there are a large number of annotators— is to estimate how reliable/trustworthy is each annotator. 1. See Fung et al. (2009) for an overview of the data mining issues in this area. 2. Sometimes even a biopsy cannot confirm whether it is cancer or not and hence all we can hope to get is subjective ground truth. 3. Mechanical Turk found at https://www.mturk.com. 4. Games with a Purpose found at http://www.gwap.com. 5. reCAPTCHA found at http://recaptcha.net/. 6. Galaxy Zoo found at http://galaxyzoo.org. 1298 L EARNING F ROM C ROWDS 1.1 The Problem With Majority Voting When we have multiple labels a commonly used strategy is to use the labels on which the majority of them agree (or average for regression problem) as an estimate of the actual gold standard. For binary classification problems this amounts to using the majority label,7 that is, j yi = ˆ 1 if (1/R) ∑R yi > 0.5 j=1 , j 0 if (1/R) ∑R yi < 0.5 j=1 as an estimate of the hidden true label and use this estimate to learn and evaluate classifiers/annotators. Another strategy is that of considering every pair (instance, label) provided by each expert as a separate example. Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, that is, R Pr[yi = 1|y1 , . . . , yR ] = (1/R) ∑ yi . i i j j=1 Majority voting assumes all experts are equally good. However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. One could address this problem by introducing a weight capturing how good each expert is. But how would one measure the performance of an expert when there is no gold standard available? 1.2 Proposed Approach and Organization To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator that jointly learns the classifier/regressor, the annotator accuracy, and the actual true label. For ease of exposition we start with binary classification problem in § 2. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard (§ 2.1). The proposed algorithm automatically discovers the best experts and assigns a higher weight to them. In order to incorporate prior knowledge about each annotator, we impose a beta prior on the sensitivity and specificity and derive the maximum-a-posteriori estimate (§ 2.6). The final estimation is performed by an Expectation Maximization (EM) algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. While the proposed approach is described using logistic regression as the base classifier (§ 2.2), it is quite general, and can be used with any black-box classifier (§ 2.7), and can also handle missing labels (that is, each expert is not required to label all the instances). Furthermore, we extend the proposed algorithm to handle categorical (§ 3), ordinal (§ 4), and regression problems (§ 5). In § 6 section we extensively validate our approach using both simulated data and real data from different domains. 1.3 Related Work and Novel Contributions We first summarize the novel contributions of this work in context of other related work in this emerging new area. There has been a long line of work in the biostatistics and epidemiology literature on latent variable models where the task is to get an estimate of the observer error rates based 7. When there is no clear majority among the multiple experts (that is, yi = 0.5) in CAD domain the final decision is ˆ often made by an adjudicator or a super-expert. When there is no adjudicator a fair coin toss is used. 1299 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY on the results from multiple diagnostic tests without a gold standard (see Dawid and Skene, 1979, Hui and Walter, 1980, Hui and Zhou, 1998, Albert and Dodd, 2004 and references therein). In the machine learning community Smyth et al. (1995) first addressed the same problem in the context of labeling volcanoes in satellite images of Venus. We differ from this previous body of work in the following aspects: 1. Unlike Dawid and Skene (1979) and Smyth et al. (1995) which just focused on estimating the ground truth from multiple noisy labels, we specifically address the issue of learning a classifier. Estimating the ground truth and the annotator/classifer performance is a byproduct of our proposed algorithm. 2. In order to learn a classifier Smyth (1995) proposed to first estimate the ground truth (without using the features) and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly. Our experiments (§ 6.1.1) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. 3. Our solution is more general and can be easily extended to categorical(§ 3), ordinal(§ 4), and continuous data(§ 5). It can also be used in conjunction with any supervised learning algorithm. A preliminary version of this paper (Raykar et al., 2009) mainly discussed the binary classification problem. 4. Our proposed algorithm is also Bayesian—we impose a prior on the experts. The priors can potential capture the skill of different annotators. In this paper we refrain from doing a full Bayesian inference and use the mode of the posterior as a point estimate. A recent complete Bayesian generalization of these kind of models has been developed by Carpenter (2008). 5. The EM approach used in this paper is similar to that proposed by Jin and Ghahramani (2003). However their motivation is somewhat different. In their setting, each training example is annotated with a set of possible labels, only one of which is correct. There has been recent interest in the natural language processing (Sheng et al., 2008; Snow et al., 2008) and computer vision (Sorokin and Forsyth, 2008) communities where they use Amazon’s Mechanical Turk to collect annotations from many people. They show that it can be potentially as good as that provided by an expert. Sheng et al. (2008) analyzed when it is worthwhile to acquire new labels for some of the training examples. There is also some theoretical work (see Lugosi, 1992 and Dekel and Shamir, 2009a) dealing with multiple experts. Recently Dekel and Shamir (2009b) presented an algorithm which does not resort to repeated labeling, that is, each example does not have to be labeled by multiple teachers. Donmez et al. (2009) address the issue of active learning in this scenario—How to jointly learn the accuracy of labeling sources and obtain the most informative labels for the active learning task? There has also been some work in the medical imaging community (Warfield et al., 2004; Cholleti et al., 2008). 2. Binary Classification We first describe our proposed noise model for the annotators. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard. 1300 L EARNING F ROM C ROWDS 2.1 A Two-coin Model for Annotators Let y j ∈ {0, 1} be the label assigned to the instance x by the jth annotator/expert. Let y be the actual (unobserved) label for this instance. Each annotator provides a version of this hidden true label based on two biased coins. If the true label is one, she flips a coin with bias α j (sensitivity). If the true label is zero, she flips a coin with bias β j (specificity). In each case, if she gets heads she keeps the original label, otherwise she flips the label. If the true label is one, the sensitivity (true positive rate) for the jth annotator is defined as the probability that she labels it as one. α j := Pr[y j = 1|y = 1]. (1) On the other hand, if the true label is zero, the specificity (1−false positive rate) is defined as the probability that she labels it as zero. β j := Pr[y j = 0|y = 0]. (2) The assumption introduced is that α j and β j do not depend on the instance x. For example, in the CAD domain, this means that the radiologist’s performance is consistent across different sub-groups of data.8 2.2 Classification Model While the proposed method can be used for any classifier, for ease of exposition, we consider the family of linear discriminating functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. The final classifier can be written in the following form: y = 1 if w⊤ x ≥ γ and 0 otherwise. The ˆ threshold γ determines the operating point of the classifier. The Receiver Operating Characteristic (ROC) curve is obtained as γ is swept from −∞ to ∞. The probability for the positive class is modeled as a logistic sigmoid acting on fw, that is, Pr[y = 1|x, w] = σ(w⊤ x), where the logistic sigmoid function is defined as σ(z) = 1/(1 + e−z ). This classification model is known as logistic regression. 2.3 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R annotators, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and also the sensitivity α = i i i=1 [α1 , . . . , αR ] and the specificity β = [β1 , . . . , βR ] of the R annotators. It is also of interest to get an estimate of the unknown gold standard y1 , . . . , yN . 2.4 Maximum Likelihood Estimator Assuming the training instances are independently sampled, the likelihood function of the parameters θ = {w, α, β} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 8. While this is a reasonable assumption, it is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. 1301 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY j Conditioning on the true label yi , and also using the assumption yi is conditionally independent (of everything else) given α j , β j and yi , the likelihood can be decomposed as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |yi = 1, α]Pr[yi = 1|xi, w] i i i=1 + Pr[y1 , . . . , yR |yi = 0, β]Pr[yi = 0|xi, w] . i i Given the true label yi , we assume that y1 , . . . , yR are independent, that is, the annotators make their i i decisions independently.9 Hence, R R Pr[y1 , . . . , yR |yi = 1, α] = ∏ Pr[yi |yi = 1, α j ] = ∏ [α j ]yi [1 − α j ]1−yi . i i j j=1 j j j=1 Similarly, we have R Pr[y1 , . . . , yR |yi = 0, β] = ∏ [β j ]1−yi [1 − β j ]yi . i i j j j=1 Hence the likelihood can be written as N Pr[D |θ] = ∏ ai pi + bi (1 − pi ) , i=1 where we have defined pi := σ(w⊤ xi ). R ai := ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi := ∏[β j ]1−y [1 − β j ]y . j i j i j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood, that is, ˆ θML = {α, β, w} = arg max{ln Pr[D |θ]}. ˆ ˆ ˆ θ 2.5 The EM Algorithm This maximization problem can be simplified a lot if we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an efficient iterative procedure to compute the maximum-likelihood solution in presence of missing/hidden data. We will use the unknown hidden true label yi as the missing data. If we know the missing data y = [y1 , . . . , yN ] then the complete likelihood can be written as N ln Pr[D , y|θ] = ∑ yi ln pi ai + (1 − yi ) ln(1 − pi )bi . i=1 9. This assumption is not true in general and there is some correlations among the labels assigned by multiple annotators. For example in the CAD domain if the cancer is in advanced stage (which is very easy to detect) almost all the radiologists assign the same label. 1302 L EARNING F ROM C ROWDS Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)step. The M-step involves maximization of a lower bound on the log-likelihood that is refined in each iteration by the E-step. 1. E-step. Given the observation D and the current estimate of the model parameters θ, the conditional expectation (which is a lower bound on the true likelihood) is computed as N E {ln Pr[D , y|θ]} = ∑ µi ln pi ai + (1 − µi ) ln(1 − pi )bi , (3) i=1 where the expectation is with respect to Pr[y|D , θ], and µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. Using i i Bayes’ theorem we can compute µi ∝ Pr[y1 , . . . , yR |yi = 1, θ] · Pr[yi = 1|xi, θ] i i ai pi = . ai pi + bi (1 − pi ) 2. M-step. Based on the current estimate µi and the observations D , the model parameters θ are then estimated by maximizing the conditional expectation. By equating the gradient of (3) to zero we obtain the following estimates for the sensitivity and specificity: j j αj = ∑N µi yi i=1 , ∑N µi i=1 βj = ∑N (1 − µi )(1 − yi ) i=1 . ∑N (1 − µi ) i=1 Due to the non-linearity of the sigmoid, we do not have a closed form solution for w and we have to use gradient ascent based optimization methods. We use the Newton-Raphson update given by wt+1 = wt − ηH −1 g, where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by N g(w) = ∑ µi − σ(w⊤ xi ) xi . i=1 The Hessian matrix is given by N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ . i i=1 Essentially, we are estimating a logistic regression model with probabilistic labels µi . These two steps (the E- and the M-step) can be iterated till convergence. The log-likelihood increases monotonically after every iteration, which in practice implies convergence to a local maximum. The EM algorithm is only guaranteed to converge to a local maximum. In practice multiple restarts with different initializations can potentially mitigate the local maximum problem. In this j paper we use majority voting µi = 1/R ∑R yi as the initialization for µi to start the EM-algorithm. j=1 1303 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2.6 A Bayesian Approach In some applications we may want to trust a particular expert more than the others. One way to achieve this is by imposing priors on the sensitivity and specificity of the experts. Since α j and β j represent the probability of a binary event, a natural choice of prior is the beta prior. The beta prior is also conjugate to the binomial distribution. For any a > 0, b > 0, and δ ∈ [0, 1] the beta distribution is given by δa−1 (1 − δ)b−1 , Beta(δ|a, b) = B(a, b) where B(a, b) = 01 δa−1 (1 − δ)b−1 dδ is the beta function.We assume a beta prior10 for both the sensitivity and the specificity as R j j j j j j j j Pr[α j |a1 , a2 ] = Beta(α j |a1 , a2 ). Pr[β j |b1 , b2 ] = Beta(β j |b1 , b2 ). For sake of completeness we also assume a zero mean Gaussian prior on the weights w with inverse covariance matrix Γ, that is, Pr[w] = N (w|0, Γ−1 ). Assuming that {α j }, {β j }, and w have independent priors, the maximum-a-posteriori (MAP) estimator is found by maximizing the logposterior, that is, ˆ θMAP = arg max{ln Pr[D |θ] + ln Pr[θ]}. θ An EM algorithm can be derived in a similar fashion for MAP estimation by relying on the interpretation of Neal and Hinton (1998). The final algorithm is summarized below: j 1. Initialize µi = (1/R) ∑R yi based on majority voting. j=1 2. Given µi , estimate the sensitivity and specificity of each annotator/expert as follows. j αj = j a1 − 1 + ∑N µi yi i=1 j j a1 + a2 − 2 + ∑N µi i=1 j β j = . j b1 − 1 + ∑N (1 − µi )(1 − yi ) i=1 j j b1 + b2 − 2 + ∑N (1 − µi ) i=1 . (4) The Newton-Raphson update for optimizing w is given by wt+1 = wt − ηH −1 g, with step length η, gradient vector N g(w) = ∑ µi − σ(w⊤ xi ) xi − Γw, i=1 and Hessian matrix N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ − Γ. i i=1 10. It may be convenient to specify a prior in terms of the mean µ and variance σ2 . The mean and the variance for a beta prior are given by µ = a/(a + b) and σ2 = ab/((a + b)2 (a + b + 1)). Solving for a and b we get a = (−µ3 + µ2 − µσ2 )/σ2 and b = a(1 − µ)/µ. 1304 L EARNING F ROM C ROWDS 3. Given the sensitivity and specificity of each annotator and the model parameters, update µi as µi = ai pi , ai pi + bi (1 − pi ) (5) where pi = σ(w⊤ xi ). R ai = ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi = ∏[β j ]1−y [1 − β j ]y . j i j i (6) j=1 Iterate (2) and (3) till convergence. 2.7 Discussions 1. Estimate of the gold standard The value of the posterior probability µi is a soft probabilistic estimate of the actual ground truth yi , that is, µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. The actual i i hidden label yi can be estimated by applying a threshold on µi , that is, yi = 1 if µi ≥ γ and zero otherwise. We can use γ = 0.5 as the threshold. By varying γ we can change the misclassification costs and obtain a ground truth with large sensitivity or large specificity. Because of this in our experimental validation we can actually draw an ROC curve for the estimated ground truth. 2. Log-odds of µ A particularly revealing insight can be obtained in terms of the log-odds or the logit of the posterior probability µi . From (5) the logit of µi can be written as logit(µi ) = ln Pr[yi = 1|y1 , . . . , yR , xi , θ] µi i i = ln 1 − µi Pr[yi = 0|y1 , . . . , yR , xi , θ] i i R = w⊤ xi + c + ∑ yi [logit(α j ) + logit(β j )]. j j=1 j where c = ∑R log 1−α is a constant term which does not depend on i. This indicates that j=1 βj the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. The weight of each expert is the sum of the logit of the sensitivity and specificity. 3. Using any other classifier For ease of exposition we used logistic regression. However, the proposed algorithm can be used with any generalized linear model or in fact with any classifier that can be trained with soft probabilistic labels. In each step of the EM-algorithm, the classifier is trained with instances sampled from µi . This modification is easy for most probabilistic classifiers. For general black-box classifiers where we cannot tweak the training algorithm an alternate approach is to replicate the training examples according to the soft label. For example a probabilistic label µi = 0.8 can be effectively simulated by adding 8 training examples with deterministic label 1 and 2 examples with label 0. 1305 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 4. Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. Here instead of learning a classifier we estimate p which is the prevalence of the positive class, that is, p = Pr[yi = 1]. We further assume a beta prior for the prevalence, that is, Beta(p|p1 , p2 ). The algorithm simplifies as follows. j (a) Initialize µi = (1/R) ∑R yi based on majority voting. j=1 (b) Given µi , estimate the sensitivity and specificity of each annotator using (4). The prevalence of the positive class is estimated as follows. p = p1 − 1 + ∑N µi i=1 . p1 + p2 − 2 + N (c) Given the sensitivity and specificity of each annotator and prevalence, refine µi as follows. ai p µi = . ai p + bi (1 − p) Iterate (2) and (3) till convergence. This algorithm is similar to the one proposed by Dawid and Skene (1979) and Smyth et al. (1995). 5. Handling missing labels The proposed approach can easily handle missing labels, that is, when the labels from some experts are missing for some instances. Let Ri be the number of radiologists labeling the ith instance, and let N j be the number of instances labeled by the jth radiologist. Then in the EM algorithm, we just need to replace N by N j for estimating the sensitivity and specificity in (4), and replace R by Ri for updating µi in (6). 6. Evaluating a classifier We can use the probability scores µi directly to evaluate classifiers. If zi are the labels obtained from any other classifier, then sensitivity and specificity can be estimated as ∑N (1 − µi )(1 − zi ) ∑N µi zi , β = i=1 N . α = i=1 ∑N µi ∑i=1 (1 − µi ) i=1 7. Posterior approximation At the end of each EM iteration a crude approximation to the posterior is obtained as N N α j ∼ Beta α j |a1 + ∑ µi yi , a2 + ∑ µi (1 − yi ) , j j i=1 N j j i=1 N β j ∼ Beta β j |b1 + ∑ (1 − µi )(1 − yi ), b2 + ∑ (1 − µi )yi j j i=1 j j . i=1 3. Multi-class Classification In this section we describe how the proposed approach for binary classification can be extended to categorical data. Suppose there are K ≥ 2 categories. An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity—which are three 1306 L EARNING F ROM C ROWDS different kinds on nodules. We can extend the previous model and introduce a vector of multinomial j j j parameters αc = (αc1 , . . . , αcK ) for each annotator, where αck := Pr[y j = k|y = c] j and ∑K αck = 1. Here αck denotes the probability that the annotator j assigns class k to an instance k=1 j j given the true class is c. When K = 2, α11 and α00 are sensitivity and specificity, respectively. A similar EM algorithm can be derived. In the E-step, we estimate j j R K Pr[yi = c|Y , α] ∝ Pr[yi = c|xi ] ∏ ∏ (αck )δ(yi ,k) , j j j=1 k=1 where δ(u, v) = 1 if u = v and 0 otherwise and in the M-step we learn a multi-class classifier and update the multinomial parameter as j αck = j ∑N Pr[yi = c|Y , α]δ(yi , k) i=1 . ∑N Pr[yi = c|Y , α] i=1 One can also assign a Dirichlet prior for the multinomial parameters, and this results in a smoothing term in the above updates in the MAP estimate. 4. Ordinal Regression We now consider the situation where the outputs are categorical and have an ordering among the labels. In the CAD domain the radiologist often gives a score (for example, 1 to 5 from lowest to highest) to indicate how likely she thinks it is malignant. This is different from a multi-class setting in which we do not have any preference among the multiple class labels. j Let yi ∈ {1, . . . , K} be the label assigned to the ith instance by the jth expert. Note that there is an ordering in the labels 1 < . . . < K. A simple approach is to convert the ordinal data into a series of binary data (Frank and Hall, 2001). Specifically the K class ordinal labels are transformed into K − 1 binary class labels as follows: jc yi = j 1 if yi > c 0 otherwise c = 1, . . . , K − 1. Applying the same procedure used for binary labels we can estimate Pr[yi > c] for c = 1, . . . , K − 1. The probability of the actual class values can then be obtained as Pr[yi = c] = Pr[yi > c − 1 and yi ≤ c] = Pr[yi > c − 1] − Pr[yi > c]. The class with the maximum probability is assigned to the instance. 5. Regression In this section we develop a similar algorithm to learn a regression function using annotations from multiple experts. In the CAD domain as a part of the annotation process a common task for a radiologist is to measure the diameter of a suspicious lesion. 1307 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 5.1 Model for Annotators j Let yi ∈ R be the continuous target value assigned to the ith instance by the jth annotator. Our model is that the annotator provides a noisy version of the actual true value yi . For the jth annotator we will assume a Gaussian noise model with mean yi (the true unknown value) and inverse-variance (precision) τ j , that is, j j Pr[yi |yi , τ j ] = N (yi |yi , 1/τ j ), (7) where the Gaussian distribution is defined as N (z|m, σ2 ) = (2πσ2 )−1/2 exp(−(z − m)2 /2σ2 ). The unknown inverse-variance τ j measures the accuracy of each annotator—the larger the value of τ j the more accurate the annotator. We have assumed that τ j does not depend on the instance xi . For example, in the CAD domain, this means that the radiologist’s accuracy does not depend on the nodule she is measuring. While this a practical assumption, it is not entirely true. It is known that some nodules are harder to measure than others. 5.2 Linear Regression Model for Features As before we consider the family of linear regression functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. We assume that the actual target response yi is given by the deterministic regression function fw with additive Gaussian noise, that is, yi = w⊤ xi + ε, where ε is a zero-mean Gaussian random variable with inverse-variance (precision) γ. Hence Pr[yi |xi , w, γ] = N (yi |w⊤ xi , 1/γ). (8) 5.3 Combined Model Combining both the annotator (7) and the regressor (8) model we have Pr[yi |xi , w, τ j , γ] = j Z Pr[yi |yi , τ j ]Pr[yi |xi , w, γ]dyi = N (yi |w⊤ xi , 1/γ + 1/τ j ). j j Since the two precision terms (γ and τ j ) are grouped together they are not uniquely identifiable. Hence we will define a new precision term λ j as 1 1 1 = + j. j λ γ τ So we have the following model Pr[yi |xi , w, λ j ] = N (yi |w⊤ xi , 1/λ j ). j j (9) 5.4 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R experts, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and the precision λ = [λ1 , . . . , λR ] of i i=1 i all the annotators. 1308 L EARNING F ROM C ROWDS 5.5 Maximum-likelihood Estimator Assuming the instances are independent the likelihood of the parameters θ = {w, λ} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 Conditional on the instance xi we assume that y1 , . . . , yR are independent, that is, the annotators i i provide their responses independently. Hence from (9) the likelihood can be written as N R Pr[D |θ] = ∏ ∏ N (yi |w⊤ xi , 1/λ j ). j i=1 j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood θML = {λ, w} = arg max{ln Pr[D |θ]}. θ By equating the gradient of the log-likelihood to zero we obtain the following update equations for the precision and the weight vector. 1 λj = 1 N ∑ yij − w⊤ xi N i=1 N w = ∑ −1 2 . (10) N ∑R λ j yi j=1 i=1 ∑R λ j j=1 ∑ xi xi x⊤ i i=1 j . (11) As the parameters w and λ are coupled together we iterate these two steps till convergence. 5.6 Discussions 1. Is this standard least-squares? Define the design matrix X = [x1 , . . . , xN ]⊤ and the rej j sponse vector for each annotator as y j = [y1 , . . . , yN ]⊤ . Using matrix notation Equation 11 can be written as w = (X ⊤ X)−1 X ⊤ y where y = ∑R λ j y j j=1 ∑R λ j j=1 . (12) Equation 12 is essentially the solution to a standard linear regression model, except that we are training a linear regression model with y as the ground truth, which is a precision weighted mean of the response vectors from all the annotators. The variance of each annotator is estimated using (10). The final algorithm iteratively establishes a particular gold standard (y), measures the performance of the annotators and learns a regressor given that gold standard, and refines the gold standard based on the performance measures. 2. Are we better than the best annotator? If we assume λ is fixed (i.e., we ignore the variability and assume that it is well estimated) then w is an unbiased estimator of w and the ˆ covariance matrix is given by Cov(w) = Cov(y) X ⊤ X ˆ 1309 −1 = 1 ∑R λ j j=1 X ⊤X −1 . R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Since ∑R λ j > max j (λ j ) the proposed method has a lower variance than the regressor learnt j=1 with the best annotator (i.e., the one with the minimum variance). 3. Are we better than the average? For a fixed X the error in w depends only on the variance of y j . If we know the true λ j then yi is the best linear unbiased estimator for yi which minij mizes the variance. To see this consider any linear estimator of the form yi = ∑ j a j (yi − b j ). The variance is given by Var[yi ] = ∑ j (a j )2 /λ j . Since E[yi ] = yi ∑ j a j , for the bias of this estimator to be zero we require that ∑ j a j = 1. Solving the constrained minimization problem we see that a j = λ j / ∑ j λ j minimizes the variance. 4. Obtaining a consensus without features When no features are available the same algorithm can be simplified to get a consensus estimate of the actual ground truth and also evaluate the annotators. Essentially we have to iterate the following two updates till convergence ∑R λ j yi j=1 1 ∑R λ j j=1 λj j yi = = 1 N ∑ y j − yi N i=1 i 2 . 6. Experimental Validation We now experimentally validate the proposed algorithms on both simulated and real data. 6.1 Classification Experiments We use two CAD and one text data set in our experiments. The CAD data sets include a digital mammography data set and a breast MRI data set, both of which are biopsy proven, that is, the gold standard is available. For the digital mammography data set we simulate the radiologists in order to validate our methods. The breast MRI data has annotations from four radiologists. We also report results on a Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk which has annotations from 164 annotators. 6.1.1 D IGITAL M AMMOGRAPHY WITH S IMULATED R ADIOLOGISTS Mammograms are used as a screening tool to detect early breast cancer. CAD systems search for abnormal areas (lesions) in a digitized mammographic image. These lesions generally indicate the presence of malignant cancer. The CAD system then highlights these areas on the images, alerting the radiologist to the need for a further diagnostic mammogram or a biopsy. In classification terms, given a set of descriptive morphological features for a region on a image, the task is to predict whether it is potentially malignant (1) or not (0). In order to train such a classifier, a set of mammograms is collected from hospitals. The ground truth (whether it is cancer or not) is obtained from biopsy. Since biopsy is an expensive, tedious, and an invasive process, very often CAD systems are built from labels collected from multiple expert radiologists who visually examine the mammograms and mark the lesion locations—this constitutes our ground truth (multiple labels) for learning. In this experiment we use a proprietary biopsy-proven data set (Krishnapuram et al., 2008) containing 497 positive and 1618 negative examples. Each instance is described by a set of 27 morphological features. In order to validate our proposed algorithm, we simulate multiple radiologists according to the two-coin model described in § 2.1. Based on the labels from multiple radiologists, 1310 L EARNING F ROM C ROWDS we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. For the first set of experiments we use 5 radiologists with sensitivity α = [0.90 0.80 0.57 0.60 0.55] and specificity β = [0.95 0.85 0.62 0.65 0.58]. This corresponds to a scenario where the first two radiologists are experts and the last three are novices. Figure 1 summarizes the results. We compare on three different aspects: (1) How good is the learnt classifier? (2) How well can we estimate the sensitivity and specificity of each radiologist? (3) How good is the estimated ground truth? The following observations can be made. 1. Classifier performance Figure 1(a) plots the ROC curve of the learnt classifier on the training set. The dotted (black) line is the ROC curve for the classifier learnt using the actual ground truth. The solid (red) line is the ROC curve for the proposed algorithm and the dashed (blue) line is for the classifier learnt using the majority-voting scheme. The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. The area under the ROC curve (AUC) for the proposed algorithm is around 3.5% greater than that learnt using the majority-voting scheme. 2. Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(b). The end of the solid red line shows the estimates of the sensitivity and specificity from the proposed method. We used a uniform prior on all the parameters. The ellipse plots the contour of one standard deviation as obtained from the beta posterior estimates. The end of the dashed blue line shows the estimate obtained from the majorityvoting algorithm. We see that the proposed method is much closer to the actual values of sensitivity and specificity. 3. Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. From Figure 1(b) we can see that the ROC curve for the proposed method dominates the majority voting ROC curve. Furthermore, the area under the ROC curve (AUC) is around 3% higher. The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). It does not have an idea of who is an expert and who is a novice. The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity. The improvement obtained is quite large in Figure 2 which corresponds a situation where we have only one expert and 7 novices. 4. Joint Estimation To learn a classifier, Smyth et al. (1995) proposed to first estimate the golden ground truth and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. Figure 3 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. 6.1.2 B REAST MRI In this example, each radiologist reviews the breast MRI data and assesses the malignancy of each lesion on a BIRADS scale of 1 to 5. The BIRADS scale is defined as follows: 1 Negative, 2 Benign, 1311 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Majority Voting Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 EM algorithm Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0217 0.5869 0.2391 0.1521 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1785 0.1071 0.2500 0.4642 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0000 0.6957 0.1304 0.1739 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1428 0.0000 0.3214 0.5357 Table 1: The confusion matrix for the estimate obtained using majority voting and the proposed EM algorithm. The x indicates that there was no such category in the true labels (the gold standard). The gold-standard is obtained by the biopsy which can confirm whether it is benign (BIRADS=2) or malignant (BIRADS=5). 3 Probably Benign, 4 Suspicious abnormality, and 5 Highly suggestive of malignancy. Our data set comprises of 75 lesions with annotations from four radiologists, and the true labels from biopsy. Based on eight morphological features, we have to predict whether a lesion is malignant or not. For the first experiment we reduce the BIRADS scale to a binary one: any lesion with a BIRADS > 3 is considered malignant and benign otherwise. The set included 28 malignant and 47 benign lesions. Figure 4 summarizes the results. We show the leave-one-out cross validated ROC for the classifier. The cross-validated AUC of the proposed method is approximately 6% better than the majority voting baseline. We also consider the BIRADS labels as a set of ordinal measurements since there is an ordering among the BIRADS label. The confusion matrix in Table 1 shows that the EM algorithm is significantly superior than the majority voting in estimating the true BIRADS. 6.1.3 R ECOGNIZING T EXTUAL E NTAILMENT Finally we report results on Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk. In this task, the annotator is presented with two sentences and given a choice of whether the second sentence can be inferred from the first. The data has 800 tasks and 164 distinct readers, with 10 annotations per task along with the golden ground truth. The majority of the entries (94 %) in the 800x164 matrix are missing. There is one annotator who has labeled all the tasks. We use this data set to obtain an estimate of the actual ground truth. Figure 5 plots the accuracy of the estimated ground truth as a function of the number of annotators. The proposed EM algorithm achieves a higher accuracy than majority voting. In other words to achieve a desired accuracy the proposed algorithm needs fewer annotators than the majority voting scheme. 1312 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (b) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologists is marked as a ×. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation. 1313 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.906 Majority voting baseline AUC=0.884 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm AUC=0.967 Majority voting baseline AUC=0.872 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 2: Same as Figure 1 except with 8 different radiologist annotations. 1314 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 3: ROC curves comparing the proposed algorithm (solid red line) with the Decoupled Estimation procedure (dotted blue line), which refers to the algorithm where the ground truth is first estimated using just the labels from the five radiologists and then a logistic regression classifier is trained using the soft probabilistic labels. In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm. 1315 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Leave−One−Out ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 0.9 True Positive Rate (sensitivity) 0.8 0.7 0.6 0.5 0.4 Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 4: Breast MRI results. (a) The leave-one-out cross validated ROC. (b) ROC for the estimated ground truth. 1316 L EARNING F ROM C ROWDS 0.95 0.9 0.85 Accuracy 0.8 0.75 0.7 0.65 Majority Voting EM Algorithm 0.6 0.55 0.5 20 40 60 80 100 120 Number of Annotators 140 160 Figure 5: The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. The plot was generated by randomly sampling the annotators 100 times. 6.2 Regression Experiments We first illustrate the algorithm on a toy dataset and then present a case study for automated polyp measurements. 6.2.1 I LLUSTRATION Figure 6 illustrates the the proposed algorithm for regression on a one-dimensional toy data set with three annotators. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 20 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 6.2.2 AUTOMATED P OLYP M EASUREMENTS Colorectal polyps are small colonic findings that may develop into cancer at a later stage. The diameter of the polyp is one of the key factors which decides the malignancy of a suspicious polyp. Hence accurate size estimation is crucial to decide the action to be taken on a polyp. We have developed various algorithms to segment a polyp. Multiple segmentation algorithms give rise to a set of features which are correlated with the diameter of the polyp. We want to learn a regression function which can predict the diameter of a polyp as a function of these features. In order to learn a regression function we collect our ground truth by asking many radiologists to manually measure the the diameter of the polyps from the three-dimensional images. In practice there is a lot of disagreement among the radiologists as to the actual size of the polyp. 1317 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2 1 N=50 examples R=3 annotators 2 2 5 Actual regression model Model with annotator 1 (Precision=0.01) 2 1 Model with annotator 2 (Precision=0.10) 2 Model with annotator 3 (Precision=1.00) 3 2 3 2 Model with average response 2 Proposed algorithm 2 3 1 1 4 3 2 2 y (response) 2 2 1 3 2 3 3 2 3 1 −1 3 3 2 2 3 3 1 2 2 2 3 33 3 1 1 3 3 3 3 2 3 3 2 33 33 2 2 2 3 3 3 32 1 2 3 2 1 3 3 2 1 2 1 2 2 1 1 2 3 2 3 2 3 1 3 3 −3 2 3 3 2 3 3 3 2 3 3 2 1 3 3 2 0 −2 2 2 3 2 3 2 3 1 2 2 1 2 3 −4 1 2 2 1 −5 0 2 0.1 0.2 1 0.3 0.4 0.5 0.6 10.8 2 0.7 x (feature) 0.9 1 1 Figure 6: Illustration of the proposed algorithm on a one-dimensional toy data set. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 50 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 5 5 10 Actual Polyp Diameter (mm) (a) Gold standard model 15 10 5 0 0 Pearson Correlation Coefficient=0.554966 RMSE=2.887740 15 Estimated Polyp Diameter (mm) 10 0 0 Pearson Correlation Coefficient=0.706558 RMSE=1.991576 15 Estimated Polyp Diameter (mm) Estimated Polyp Diameter (mm) Pearson Correlation Coefficient=0.714720 RMSE=1.970815 15 5 10 Actual Polyp Diameter (mm) (b) Proposed model 15 10 5 0 0 5 10 Actual Polyp Diameter (mm) 15 (c) Average model Figure 7: Scatter plot of the actual polyp diameter vs the diameter predicted by the models learnt using (a) the actual gold standard, (b) the proposed algorithm with annotations from five radiologists, and (c) the average of the radiologist’s annotations. (See § 6.2.2 for a description of the experimental setup.) 1318 L EARNING F ROM C ROWDS We use a proprietary data set containing 393 examples (which point to 285 distinct polyps— the segmentation algorithms generally return multiple marks on the same polyp.) along with the measured diameter (ranging from 2mm to 15mm) as our training set. Each example is described by a set of 60 morphological features which are correlated to the diameter of the polyp. In order to validate the feasibility of our proposed algorithm, we simulate five radiologists according to the noisy model described in § 5.1 with τ = [0.001 0.01 0.1 1 10]. This corresponds to a situation where the first three radiologists are extremely noisy and the last two are quite accurate. Based on the measurements from multiple radiologists, we can simultaneously (1) learn a linear regressor and (2) estimate the precision of each radiologist. We compare the results with the classifier trained using the actual golden ground truth as well as the regressor learnt using the average of the radiologists measurements. The results are validated on an independent test set containing 397 examples (which point to 298 distinct polyps). Figure 7 shows the scatter plot of the actual polyp diameter vs the diameter predicted by the three different models. We compare the performance based on the root mean squared error (RMSE) and also the Pearson’s correlation coefficient. The regressor learnt using the proposed iterative algorithm (Figure 7(b)) is almost as good as the one learnt using the golden ground truth (Figure 7(a)). The correlation coefficient for the proposed algorithm is significantly larger than that learnt using the average of the radiologists response. The estimate obtained by averaging is closer to the novices since they form a majority (3/5). The proposed algorithm appropriately weights each radiologist based on their estimated precisions. 7. Conclusions and Future Work In this paper we proposed a probabilistic framework for supervised learning with multiple annotators providing labels but no absolute gold standard. The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. We specifically discussed binary/categorical/ordinal classification and regression problems. We made two key assumptions: (1) the performance of each annotator does not depend on the feature vector for a given instance and (2) conditional on the truth the experts are independent, that is, they make their errors independently. As we pointed out earlier these assumptions are not true in practice. The annotator performance depends on the instance he is labeling and there is some degree of correlation among the annotators. We briefly discuss some strategies to relax these two assumptions. 7.1 Instance Difficulty One drawback of the current model is that it doesn’t estimate difficulty of items. It is often observed that for the easy instances all the annotators agree on the labels—thus violating our conditional independence assumption. The difficulty of annotating an item can be captured by another latent variable γi for each instance—which modulates the annotators performance. Models for this have been developed in the area of item-response theory (Baker and Kim, 2004) and also in epidemiology (Uebersax and Grove, 1993)—see also Whitehill et al. (2009) for a recent paper in the machine learning community. While these models do not take into account the available features our pro1319 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY posed model for sensitivity and specificity can be extended as follows (in place of (1) and (2)): α j (γi ) := Pr[yi = 1|yi = 1, γi ] = σ(a j1 + b j1 γi ). j β j (γi ) := Pr[yi = 0|yi = 0, γi ] = σ(a j0 + b j0 γi ). j Here the parameters a j1 and a j0 are related to the sensitivity and specificity of the jth annotator, while the latent term γi captures the difficulty of the instance. The key assumption here is that the annotators are independent conditional on both yi and γi . Various assumptions can be made on two parameters b j1 and b j0 to simplify these models further—for example we could set b j1 = b1 and b j0 = b0 for all the annotators. 7.2 Annotators Actually Look at the Data In our model we made the assumption that the sensitivity α j and the specificity β j of the jth annotator does not depend on the feature vector xi . For example, in the CAD domain, this meant that the radiologist’s performance is consistent across different sub-groups of data—which is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. We can extend the previous model such that the sensitivity and the specificity depends on the feature vector xi explicitly as follows j⊤ α j (γi , xi ) := Pr[yi = 1|yi = 1, γi , xi ] = σ(a j1 + b j1 γi + wα xi ). j j⊤ α j (γi , xi ) := Pr[yi = 0|yi = 0, γi , xi ] = σ(a j0 + b j0 γi + wβ xi ). j However this change increases the number of parameters to be learned. References P. S. Albert and L. E. Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60:427–435, 2004. F. B. Baker and S. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, 2 edition, 2004. B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. S. R. Cholleti, S. A. Goldman, A. Blum, D. G. Politte, and S. Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 2008 20th IEEE international Conference on Tools with Artificial intelligence, 2008. A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. O. Dekel and O. Shamir. Vox Populi: Collecting high-quality labels from a crowd. In COLT 2009: Proceedings of the 22nd Annual Conference on Learning Theory, 2009a. O. Dekel and O. Shamir. Good learners for evil teachers. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 233–240, 2009b. 1320 L EARNING F ROM C ROWDS A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977. P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 259–268, 2009. E. Frank and M. Hall. A simple approach to ordinal classification. Lecture Notes in Computer Science, pages 145–156, 2001. G. Fung, B. Krishnapuram, J. Bi, M. Dundar, V. C. Raykar, S. Yu, R. Rosales, S. Krishnan, and R. B. Rao. Mining medical images. In Fifteenth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining: Third Workshop on Data Mining Case Studies and Practice Prize, 2009. J. Howe. Crowd sourcing: Why the Power of the Crowd Is Driving the Future of Business. 2008. S. L. Hui and S. D. Walter. Estimating the error rates of diagnostic tests. Biometrics, 36:167–171, 1980. S. L. Hui and X. H. Zhou. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research, 7:354–370, 1998. R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems 15, pages 897–904. 2003. B. Krishnapuram, J. Stoeckel, V. C. Raykar, R. B. Rao, P. Bamberger, E. Ratner, N. Merlet, I. Stainvas, M. Abramov, and A. Manevitch. Multiple-instance learning improves CAD detection of masses in digital mammography. In IWDM 2008: Proceedings of the 9th international workshop on Digital Mammography, pages 350–357. 2008. G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79–87, 1992. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 889–896, 2009. V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. P. Smyth. Learning with probabilistic supervision. In Computational Learning Theory and Natural Learning Systems 3, pages 163–182. MIT Press, 1995. 1321 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085– 1092. 1995. R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast - But is it good? Evaluating nonexpert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008. A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. J. S. Uebersax and W. M. Grove. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 49:823–835, 1993. S. K. Warfield, K. H. Zou, and W. M. Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004. J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. 1322
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. [sent-20, score-0.572]
2 Supervised Learning From Multiple Annotators/Experts A typical supervised learning scenario consists of a training set D = {(xi , yi )}N containing N i=1 instances, where xi ∈ X is an instance (typically a d-dimensional feature vector) and yi ∈ Y is the corresponding known label. [sent-24, score-0.365]
3 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY However, for many real life tasks, it may not be possible, or may be too expensive (or tedious) to acquire the actual label yi for training—which we refer to as the gold standard or the objective ground truth. [sent-40, score-0.667]
4 Since it is an expensive, invasive, and potentially dangerous process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. [sent-49, score-0.466]
5 Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard. [sent-50, score-0.431]
6 2 The radiologist also annotates various descriptors of the potentially malignant lesion, like the size (a regression problem), shape (a multiclass classification problem), and also degree of malignancy (an ordinal regression problem). [sent-51, score-0.442]
7 For a lot of tasks the labels provided by the annotators are inherently subjective and there will be substantial variation among different annotators. [sent-54, score-0.426]
8 The labels for training are assigned by human annotators who read the text and attribute their subjective category. [sent-57, score-0.426]
9 With the advent of crowdsourcing (Howe, 2008) services like Amazon’s Mechanical Turk,3 Games with a Purpose,4 and reCAPTCHA5 it is quite inexpensive to acquire labels from a large number of annotators (possibly thousands) in a short time (Sheng et al. [sent-58, score-0.376]
10 In situations like these, the performance of different annotators can vary widely (some may even be malicious), and without the actual gold standard, it may not be possible to evaluate the annotators. [sent-62, score-0.507]
11 How to adapt conventional supervised learning algorithms when we have multiple annotators providing subjective labels but no objective gold standard? [sent-64, score-0.622]
12 Sometimes even a biopsy cannot confirm whether it is cancer or not and hence all we can hope to get is subjective ground truth. [sent-73, score-0.345]
13 1 The Problem With Majority Voting When we have multiple labels a commonly used strategy is to use the labels on which the majority of them agree (or average for regression problem) as an estimate of the actual gold standard. [sent-89, score-0.566]
14 For binary classification problems this amounts to using the majority label,7 that is, j yi = ˆ 1 if (1/R) ∑R yi > 0. [sent-90, score-0.423]
15 Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, that is, R Pr[yi = 1|y1 , . [sent-94, score-0.486]
16 However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. [sent-99, score-0.491]
17 2 Proposed Approach and Organization To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator that jointly learns the classifier/regressor, the annotator accuracy, and the actual true label. [sent-103, score-0.311]
18 The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard (§ 2. [sent-105, score-0.539]
19 The final estimation is performed by an Expectation Maximization (EM) algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. [sent-110, score-0.547]
20 When there is no clear majority among the multiple experts (that is, yi = 0. [sent-119, score-0.36]
21 (1995) which just focused on estimating the ground truth from multiple noisy labels, we specifically address the issue of learning a classifier. [sent-127, score-0.458]
22 Estimating the ground truth and the annotator/classifer performance is a byproduct of our proposed algorithm. [sent-128, score-0.45]
23 In order to learn a classifier Smyth (1995) proposed to first estimate the ground truth (without using the features) and then use the probabilistic ground truth to learn a classifier. [sent-130, score-0.872]
24 In contrast, our proposed algorithm learns the classifier and the ground truth jointly. [sent-131, score-0.45]
25 1) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. [sent-134, score-0.978]
26 The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard. [sent-164, score-0.539]
27 If the true label is one, the sensitivity (true positive rate) for the jth annotator is defined as the probability that she labels it as one. [sent-172, score-0.548]
28 It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. [sent-207, score-0.39]
29 1301 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY j Conditioning on the true label yi , and also using the assumption yi is conditionally independent (of everything else) given α j , β j and yi , the likelihood can be decomposed as N Pr[D |θ] = ∏ Pr[y1 , . [sent-208, score-0.544]
30 , yN ] then the complete likelihood can be written as N ln Pr[D , y|θ] = ∑ yi ln pi ai + (1 − yi ) ln(1 − pi )bi . [sent-239, score-0.48]
31 By equating the gradient of (3) to zero we obtain the following estimates for the sensitivity and specificity: j j αj = ∑N µi yi i=1 , ∑N µi i=1 βj = ∑N (1 − µi )(1 − yi ) i=1 . [sent-258, score-0.464]
32 In this j paper we use majority voting µi = 1/R ∑R yi as the initialization for µi to start the EM-algorithm. [sent-268, score-0.373]
33 Given the sensitivity and specificity of each annotator and the model parameters, update µi as µi = ai pi , ai pi + bi (1 − pi ) (5) where pi = σ(w⊤ xi ). [sent-292, score-0.614]
34 Estimate of the gold standard The value of the posterior probability µi is a soft probabilistic estimate of the actual ground truth yi , that is, µi = Pr[yi = 1|y1 , . [sent-298, score-0.812]
35 The actual i i hidden label yi can be estimated by applying a threshold on µi , that is, yi = 1 if µi ≥ γ and zero otherwise. [sent-302, score-0.487]
36 By varying γ we can change the misclassification costs and obtain a ground truth with large sensitivity or large specificity. [sent-305, score-0.554]
37 Because of this in our experimental validation we can actually draw an ROC curve for the estimated ground truth. [sent-306, score-0.32]
38 This indicates that j=1 βj the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. [sent-317, score-0.602]
39 Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. [sent-328, score-1.07]
40 j=1 (b) Given µi , estimate the sensitivity and specificity of each annotator using (4). [sent-333, score-0.379]
41 p1 + p2 − 2 + N (c) Given the sensitivity and specificity of each annotator and prevalence, refine µi as follows. [sent-336, score-0.379]
42 Posterior approximation At the end of each EM iteration a crude approximation to the posterior is obtained as N N α j ∼ Beta α j |a1 + ∑ µi yi , a2 + ∑ µi (1 − yi ) , j j i=1 N j j i=1 N β j ∼ Beta β j |b1 + ∑ (1 − µi )(1 − yi ), b2 + ∑ (1 − µi )yi j j i=1 j j . [sent-349, score-0.498]
43 An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity—which are three 1306 L EARNING F ROM C ROWDS different kinds on nodules. [sent-353, score-0.505]
44 Specifically the K class ordinal labels are transformed into K − 1 binary class labels as follows: jc yi = j 1 if yi > c 0 otherwise c = 1, . [sent-376, score-0.586]
45 In the CAD domain as a part of the annotation process a common task for a radiologist is to measure the diameter of a suspicious lesion. [sent-388, score-0.333]
46 Our model is that the annotator provides a noisy version of the actual true value yi . [sent-391, score-0.477]
47 For the jth annotator we will assume a Gaussian noise model with mean yi (the true unknown value) and inverse-variance (precision) τ j , that is, j j Pr[yi |yi , τ j ] = N (yi |yi , 1/τ j ), (7) where the Gaussian distribution is defined as N (z|m, σ2 ) = (2πσ2 )−1/2 exp(−(z − m)2 /2σ2 ). [sent-392, score-0.443]
48 We assume that the actual target response yi is given by the deterministic regression function fw with additive Gaussian noise, that is, yi = w⊤ xi + ε, where ε is a zero-mean Gaussian random variable with inverse-variance (precision) γ. [sent-400, score-0.535]
49 3 Combined Model Combining both the annotator (7) and the regressor (8) model we have Pr[yi |xi , w, τ j , γ] = j Z Pr[yi |yi , τ j ]Pr[yi |xi , w, γ]dyi = N (yi |w⊤ xi , 1/γ + 1/τ j ). [sent-403, score-0.329]
50 (12) Equation 12 is essentially the solution to a standard linear regression model, except that we are training a linear regression model with y as the ground truth, which is a precision weighted mean of the response vectors from all the annotators. [sent-441, score-0.364]
51 The final algorithm iteratively establishes a particular gold standard (y), measures the performance of the annotators and learns a regressor given that gold standard, and refines the gold standard based on the performance measures. [sent-443, score-0.812]
52 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Since ∑R λ j > max j (λ j ) the proposed method has a lower variance than the regressor learnt j=1 with the best annotator (i. [sent-449, score-0.43]
53 If we know the true λ j then yi is the best linear unbiased estimator for yi which minij mizes the variance. [sent-455, score-0.332]
54 Obtaining a consensus without features When no features are available the same algorithm can be simplified to get a consensus estimate of the actual ground truth and also evaluate the annotators. [sent-461, score-0.486]
55 Essentially we have to iterate the following two updates till convergence ∑R λ j yi j=1 1 ∑R λ j j=1 λj j yi = = 1 N ∑ y j − yi N i=1 i 2 . [sent-462, score-0.531]
56 For the digital mammography data set we simulate the radiologists in order to validate our methods. [sent-468, score-0.351]
57 The ground truth (whether it is cancer or not) is obtained from biopsy. [sent-480, score-0.422]
58 Since biopsy is an expensive, tedious, and an invasive process, very often CAD systems are built from labels collected from multiple expert radiologists who visually examine the mammograms and mark the lesion locations—this constitutes our ground truth (multiple labels) for learning. [sent-481, score-0.987]
59 In order to validate our proposed algorithm, we simulate multiple radiologists according to the two-coin model described in § 2. [sent-485, score-0.339]
60 Based on the labels from multiple radiologists, 1310 L EARNING F ROM C ROWDS we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. [sent-487, score-0.583]
61 We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. [sent-488, score-0.486]
62 For the first set of experiments we use 5 radiologists with sensitivity α = [0. [sent-489, score-0.378]
63 This corresponds to a scenario where the first two radiologists are experts and the last three are novices. [sent-500, score-0.313]
64 The dotted (black) line is the ROC curve for the classifier learnt using the actual ground truth. [sent-508, score-0.445]
65 The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. [sent-510, score-0.562]
66 Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(b). [sent-514, score-0.389]
67 Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. [sent-521, score-1.184]
68 The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). [sent-524, score-0.351]
69 The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity. [sent-526, score-0.398]
70 (1995) proposed to first estimate the golden ground truth and then use the probabilistic ground truth to learn a classifier. [sent-530, score-0.963]
71 In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. [sent-531, score-0.45]
72 Figure 3 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. [sent-532, score-0.978]
73 The data has 800 tasks and 164 distinct readers, with 10 annotations per task along with the golden ground truth. [sent-574, score-0.395]
74 Figure 5 plots the accuracy of the estimated ground truth as a function of the number of annotators. [sent-578, score-0.467]
75 In other words to achieve a desired accuracy the proposed algorithm needs fewer annotators than the majority voting scheme. [sent-580, score-0.518]
76 (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). [sent-613, score-0.898]
77 (b) The ROC curve for the estimated ground truth. [sent-614, score-0.32]
78 The actual sensitivity and specificity of each of the radiologists is marked as a ×. [sent-615, score-0.442]
79 The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. [sent-616, score-0.339]
80 In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm. [sent-682, score-0.45]
81 5 20 40 60 80 100 120 Number of Annotators 140 160 Figure 5: The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. [sent-727, score-0.467]
82 We simulate 20 samples from three annotators with precisions 0. [sent-735, score-0.312]
83 While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. [sent-740, score-0.305]
84 In order to learn a regression function we collect our ground truth by asking many radiologists to manually measure the the diameter of the polyps from the three-dimensional images. [sent-751, score-0.843]
85 In practice there is a lot of disagreement among the radiologists as to the actual size of the polyp. [sent-752, score-0.31]
86 1317 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2 1 N=50 examples R=3 annotators 2 2 5 Actual regression model Model with annotator 1 (Precision=0. [sent-753, score-0.559]
87 We simulate 50 samples from three annotators with precisions 0. [sent-767, score-0.312]
88 While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. [sent-772, score-0.305]
89 In order to validate the feasibility of our proposed algorithm, we simulate five radiologists according to the noisy model described in § 5. [sent-788, score-0.303]
90 We compare the results with the classifier trained using the actual golden ground truth as well as the regressor learnt using the average of the radiologists measurements. [sent-795, score-0.978]
91 Figure 7 shows the scatter plot of the actual polyp diameter vs the diameter predicted by the three different models. [sent-797, score-0.388]
92 The regressor learnt using the proposed iterative algorithm (Figure 7(b)) is almost as good as the one learnt using the golden ground truth (Figure 7(a)). [sent-799, score-0.802]
93 The correlation coefficient for the proposed algorithm is significantly larger than that learnt using the average of the radiologists response. [sent-800, score-0.38]
94 Conclusions and Future Work In this paper we proposed a probabilistic framework for supervised learning with multiple annotators providing labels but no absolute gold standard. [sent-804, score-0.6]
95 The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. [sent-805, score-0.791]
96 We made two key assumptions: (1) the performance of each annotator does not depend on the feature vector for a given instance and (2) conditional on the truth the experts are independent, that is, they make their errors independently. [sent-807, score-0.505]
97 The key assumption here is that the annotators are independent conditional on both yi and γi . [sent-820, score-0.449]
98 2 Annotators Actually Look at the Data In our model we made the assumption that the sensitivity α j and the specificity β j of the jth annotator does not depend on the feature vector xi . [sent-823, score-0.442]
99 It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. [sent-825, score-0.39]
100 Inferring ground truth from subjective labelling of venus images. [sent-996, score-0.472]
wordName wordTfidf (topN-words)
[('annotators', 0.283), ('annotator', 0.247), ('radiologists', 0.246), ('pr', 0.232), ('ground', 0.231), ('cad', 0.193), ('radiologist', 0.193), ('truth', 0.191), ('valadez', 0.171), ('yi', 0.166), ('city', 0.163), ('gold', 0.16), ('yr', 0.14), ('aykar', 0.139), ('lorin', 0.139), ('ogoni', 0.139), ('sensitivity', 0.132), ('rowds', 0.128), ('hao', 0.119), ('polyp', 0.118), ('voting', 0.116), ('rom', 0.11), ('auc', 0.109), ('oy', 0.107), ('learnt', 0.106), ('diameter', 0.103), ('roc', 0.1), ('em', 0.098), ('labels', 0.093), ('golden', 0.091), ('majority', 0.091), ('malignant', 0.091), ('beta', 0.091), ('birads', 0.086), ('specifcity', 0.086), ('raykar', 0.075), ('siemens', 0.075), ('annotations', 0.073), ('ordinal', 0.068), ('experts', 0.067), ('biopsy', 0.064), ('actual', 0.064), ('er', 0.055), ('mri', 0.055), ('lesion', 0.053), ('lesions', 0.053), ('novices', 0.053), ('subjective', 0.05), ('regressor', 0.049), ('smyth', 0.049), ('label', 0.046), ('linda', 0.046), ('amazon', 0.046), ('snow', 0.046), ('mechanical', 0.046), ('fw', 0.046), ('estimated', 0.045), ('classi', 0.045), ('precision', 0.044), ('earning', 0.044), ('curve', 0.044), ('pi', 0.043), ('bogoni', 0.043), ('florin', 0.043), ('mammography', 0.043), ('polyps', 0.043), ('logit', 0.042), ('expert', 0.041), ('mm', 0.04), ('benign', 0.038), ('suspicious', 0.037), ('multiple', 0.036), ('categorical', 0.035), ('digital', 0.033), ('till', 0.033), ('sheng', 0.033), ('morphological', 0.033), ('classifier', 0.033), ('dekel', 0.033), ('xi', 0.033), ('gerardo', 0.032), ('malignancy', 0.032), ('mammograms', 0.032), ('shipeng', 0.032), ('skene', 0.032), ('sorokin', 0.032), ('vikas', 0.032), ('diagnostic', 0.032), ('ln', 0.031), ('response', 0.031), ('rmse', 0.03), ('prevalence', 0.03), ('jth', 0.03), ('bi', 0.03), ('regression', 0.029), ('false', 0.029), ('simulate', 0.029), ('ck', 0.029), ('proposed', 0.028), ('medical', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 61 jmlr-2010-Learning From Crowds
Author: Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy
Abstract: For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline. Keywords: multiple annotators, multiple experts, multiple teachers, crowdsourcing 1. Supervised Learning From Multiple Annotators/Experts A typical supervised learning scenario consists of a training set D = {(xi , yi )}N containing N i=1 instances, where xi ∈ X is an instance (typically a d-dimensional feature vector) and yi ∈ Y is the corresponding known label. The task is to learn a function f : X → Y which generalizes well on unseen data. Specifically for binary classification the supervision is from the set Y = {0, 1}, for multi-class classification Y = {1, . . . , K}, for ordinal regression Y = {1, . . . , K} (with an ordering 1 < . . . < K), and Y = R for regression. c 2010 Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo H. Valadez, Charles Florin, Luca Bogoni and Linda Moy. R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY However, for many real life tasks, it may not be possible, or may be too expensive (or tedious) to acquire the actual label yi for training—which we refer to as the gold standard or the objective ground truth. Instead, we may have multiple (possibly noisy) labels y1 , . . . , yR provided by R i i different experts or annotators. In practice, there is a substantial amount of disagreement among the experts, and hence it is of great practical interest to address conventional supervised learning algorithms in this scenario. Our motivation for this work comes from the area of computer-aided diagnosis1 (CAD), where the task is to build a classifier to predict whether a suspicious region on a medical image (like a X-ray, CT scan, or MRI) is malignant (cancerous) or benign. In order to train such a classifier, a set of images is collected from hospitals. The actual gold standard (whether it is cancer or not) can only be obtained from a biopsy of the tissue. Since it is an expensive, invasive, and potentially dangerous process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard.2 The radiologist also annotates various descriptors of the potentially malignant lesion, like the size (a regression problem), shape (a multiclass classification problem), and also degree of malignancy (an ordinal regression problem). The radiologists come from a diverse pool including luminaries, experts, residents, and novices. Very often there is lot of disagreement among the annotations. For a lot of tasks the labels provided by the annotators are inherently subjective and there will be substantial variation among different annotators. The domain of text classification offers such a scenario. In this context the task is to predict the category for a token of text. The labels for training are assigned by human annotators who read the text and attribute their subjective category. With the advent of crowdsourcing (Howe, 2008) services like Amazon’s Mechanical Turk,3 Games with a Purpose,4 and reCAPTCHA5 it is quite inexpensive to acquire labels from a large number of annotators (possibly thousands) in a short time (Sheng et al., 2008; Snow et al., 2008; Sorokin and Forsyth, 2008). Websites such as Galaxy Zoo6 allow the public to label astronomical images over the internet. In situations like these, the performance of different annotators can vary widely (some may even be malicious), and without the actual gold standard, it may not be possible to evaluate the annotators. In this work, we provide principled probabilistic solutions to the following questions: 1. How to adapt conventional supervised learning algorithms when we have multiple annotators providing subjective labels but no objective gold standard? 2. How to evaluate systems when we do not have absolute gold-standard? 3. A closely related problem—particularly relevant when there are a large number of annotators— is to estimate how reliable/trustworthy is each annotator. 1. See Fung et al. (2009) for an overview of the data mining issues in this area. 2. Sometimes even a biopsy cannot confirm whether it is cancer or not and hence all we can hope to get is subjective ground truth. 3. Mechanical Turk found at https://www.mturk.com. 4. Games with a Purpose found at http://www.gwap.com. 5. reCAPTCHA found at http://recaptcha.net/. 6. Galaxy Zoo found at http://galaxyzoo.org. 1298 L EARNING F ROM C ROWDS 1.1 The Problem With Majority Voting When we have multiple labels a commonly used strategy is to use the labels on which the majority of them agree (or average for regression problem) as an estimate of the actual gold standard. For binary classification problems this amounts to using the majority label,7 that is, j yi = ˆ 1 if (1/R) ∑R yi > 0.5 j=1 , j 0 if (1/R) ∑R yi < 0.5 j=1 as an estimate of the hidden true label and use this estimate to learn and evaluate classifiers/annotators. Another strategy is that of considering every pair (instance, label) provided by each expert as a separate example. Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, that is, R Pr[yi = 1|y1 , . . . , yR ] = (1/R) ∑ yi . i i j j=1 Majority voting assumes all experts are equally good. However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. One could address this problem by introducing a weight capturing how good each expert is. But how would one measure the performance of an expert when there is no gold standard available? 1.2 Proposed Approach and Organization To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator that jointly learns the classifier/regressor, the annotator accuracy, and the actual true label. For ease of exposition we start with binary classification problem in § 2. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard (§ 2.1). The proposed algorithm automatically discovers the best experts and assigns a higher weight to them. In order to incorporate prior knowledge about each annotator, we impose a beta prior on the sensitivity and specificity and derive the maximum-a-posteriori estimate (§ 2.6). The final estimation is performed by an Expectation Maximization (EM) algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. While the proposed approach is described using logistic regression as the base classifier (§ 2.2), it is quite general, and can be used with any black-box classifier (§ 2.7), and can also handle missing labels (that is, each expert is not required to label all the instances). Furthermore, we extend the proposed algorithm to handle categorical (§ 3), ordinal (§ 4), and regression problems (§ 5). In § 6 section we extensively validate our approach using both simulated data and real data from different domains. 1.3 Related Work and Novel Contributions We first summarize the novel contributions of this work in context of other related work in this emerging new area. There has been a long line of work in the biostatistics and epidemiology literature on latent variable models where the task is to get an estimate of the observer error rates based 7. When there is no clear majority among the multiple experts (that is, yi = 0.5) in CAD domain the final decision is ˆ often made by an adjudicator or a super-expert. When there is no adjudicator a fair coin toss is used. 1299 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY on the results from multiple diagnostic tests without a gold standard (see Dawid and Skene, 1979, Hui and Walter, 1980, Hui and Zhou, 1998, Albert and Dodd, 2004 and references therein). In the machine learning community Smyth et al. (1995) first addressed the same problem in the context of labeling volcanoes in satellite images of Venus. We differ from this previous body of work in the following aspects: 1. Unlike Dawid and Skene (1979) and Smyth et al. (1995) which just focused on estimating the ground truth from multiple noisy labels, we specifically address the issue of learning a classifier. Estimating the ground truth and the annotator/classifer performance is a byproduct of our proposed algorithm. 2. In order to learn a classifier Smyth (1995) proposed to first estimate the ground truth (without using the features) and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly. Our experiments (§ 6.1.1) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. 3. Our solution is more general and can be easily extended to categorical(§ 3), ordinal(§ 4), and continuous data(§ 5). It can also be used in conjunction with any supervised learning algorithm. A preliminary version of this paper (Raykar et al., 2009) mainly discussed the binary classification problem. 4. Our proposed algorithm is also Bayesian—we impose a prior on the experts. The priors can potential capture the skill of different annotators. In this paper we refrain from doing a full Bayesian inference and use the mode of the posterior as a point estimate. A recent complete Bayesian generalization of these kind of models has been developed by Carpenter (2008). 5. The EM approach used in this paper is similar to that proposed by Jin and Ghahramani (2003). However their motivation is somewhat different. In their setting, each training example is annotated with a set of possible labels, only one of which is correct. There has been recent interest in the natural language processing (Sheng et al., 2008; Snow et al., 2008) and computer vision (Sorokin and Forsyth, 2008) communities where they use Amazon’s Mechanical Turk to collect annotations from many people. They show that it can be potentially as good as that provided by an expert. Sheng et al. (2008) analyzed when it is worthwhile to acquire new labels for some of the training examples. There is also some theoretical work (see Lugosi, 1992 and Dekel and Shamir, 2009a) dealing with multiple experts. Recently Dekel and Shamir (2009b) presented an algorithm which does not resort to repeated labeling, that is, each example does not have to be labeled by multiple teachers. Donmez et al. (2009) address the issue of active learning in this scenario—How to jointly learn the accuracy of labeling sources and obtain the most informative labels for the active learning task? There has also been some work in the medical imaging community (Warfield et al., 2004; Cholleti et al., 2008). 2. Binary Classification We first describe our proposed noise model for the annotators. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard. 1300 L EARNING F ROM C ROWDS 2.1 A Two-coin Model for Annotators Let y j ∈ {0, 1} be the label assigned to the instance x by the jth annotator/expert. Let y be the actual (unobserved) label for this instance. Each annotator provides a version of this hidden true label based on two biased coins. If the true label is one, she flips a coin with bias α j (sensitivity). If the true label is zero, she flips a coin with bias β j (specificity). In each case, if she gets heads she keeps the original label, otherwise she flips the label. If the true label is one, the sensitivity (true positive rate) for the jth annotator is defined as the probability that she labels it as one. α j := Pr[y j = 1|y = 1]. (1) On the other hand, if the true label is zero, the specificity (1−false positive rate) is defined as the probability that she labels it as zero. β j := Pr[y j = 0|y = 0]. (2) The assumption introduced is that α j and β j do not depend on the instance x. For example, in the CAD domain, this means that the radiologist’s performance is consistent across different sub-groups of data.8 2.2 Classification Model While the proposed method can be used for any classifier, for ease of exposition, we consider the family of linear discriminating functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. The final classifier can be written in the following form: y = 1 if w⊤ x ≥ γ and 0 otherwise. The ˆ threshold γ determines the operating point of the classifier. The Receiver Operating Characteristic (ROC) curve is obtained as γ is swept from −∞ to ∞. The probability for the positive class is modeled as a logistic sigmoid acting on fw, that is, Pr[y = 1|x, w] = σ(w⊤ x), where the logistic sigmoid function is defined as σ(z) = 1/(1 + e−z ). This classification model is known as logistic regression. 2.3 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R annotators, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and also the sensitivity α = i i i=1 [α1 , . . . , αR ] and the specificity β = [β1 , . . . , βR ] of the R annotators. It is also of interest to get an estimate of the unknown gold standard y1 , . . . , yN . 2.4 Maximum Likelihood Estimator Assuming the training instances are independently sampled, the likelihood function of the parameters θ = {w, α, β} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 8. While this is a reasonable assumption, it is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. 1301 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY j Conditioning on the true label yi , and also using the assumption yi is conditionally independent (of everything else) given α j , β j and yi , the likelihood can be decomposed as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |yi = 1, α]Pr[yi = 1|xi, w] i i i=1 + Pr[y1 , . . . , yR |yi = 0, β]Pr[yi = 0|xi, w] . i i Given the true label yi , we assume that y1 , . . . , yR are independent, that is, the annotators make their i i decisions independently.9 Hence, R R Pr[y1 , . . . , yR |yi = 1, α] = ∏ Pr[yi |yi = 1, α j ] = ∏ [α j ]yi [1 − α j ]1−yi . i i j j=1 j j j=1 Similarly, we have R Pr[y1 , . . . , yR |yi = 0, β] = ∏ [β j ]1−yi [1 − β j ]yi . i i j j j=1 Hence the likelihood can be written as N Pr[D |θ] = ∏ ai pi + bi (1 − pi ) , i=1 where we have defined pi := σ(w⊤ xi ). R ai := ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi := ∏[β j ]1−y [1 − β j ]y . j i j i j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood, that is, ˆ θML = {α, β, w} = arg max{ln Pr[D |θ]}. ˆ ˆ ˆ θ 2.5 The EM Algorithm This maximization problem can be simplified a lot if we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an efficient iterative procedure to compute the maximum-likelihood solution in presence of missing/hidden data. We will use the unknown hidden true label yi as the missing data. If we know the missing data y = [y1 , . . . , yN ] then the complete likelihood can be written as N ln Pr[D , y|θ] = ∑ yi ln pi ai + (1 − yi ) ln(1 − pi )bi . i=1 9. This assumption is not true in general and there is some correlations among the labels assigned by multiple annotators. For example in the CAD domain if the cancer is in advanced stage (which is very easy to detect) almost all the radiologists assign the same label. 1302 L EARNING F ROM C ROWDS Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)step. The M-step involves maximization of a lower bound on the log-likelihood that is refined in each iteration by the E-step. 1. E-step. Given the observation D and the current estimate of the model parameters θ, the conditional expectation (which is a lower bound on the true likelihood) is computed as N E {ln Pr[D , y|θ]} = ∑ µi ln pi ai + (1 − µi ) ln(1 − pi )bi , (3) i=1 where the expectation is with respect to Pr[y|D , θ], and µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. Using i i Bayes’ theorem we can compute µi ∝ Pr[y1 , . . . , yR |yi = 1, θ] · Pr[yi = 1|xi, θ] i i ai pi = . ai pi + bi (1 − pi ) 2. M-step. Based on the current estimate µi and the observations D , the model parameters θ are then estimated by maximizing the conditional expectation. By equating the gradient of (3) to zero we obtain the following estimates for the sensitivity and specificity: j j αj = ∑N µi yi i=1 , ∑N µi i=1 βj = ∑N (1 − µi )(1 − yi ) i=1 . ∑N (1 − µi ) i=1 Due to the non-linearity of the sigmoid, we do not have a closed form solution for w and we have to use gradient ascent based optimization methods. We use the Newton-Raphson update given by wt+1 = wt − ηH −1 g, where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by N g(w) = ∑ µi − σ(w⊤ xi ) xi . i=1 The Hessian matrix is given by N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ . i i=1 Essentially, we are estimating a logistic regression model with probabilistic labels µi . These two steps (the E- and the M-step) can be iterated till convergence. The log-likelihood increases monotonically after every iteration, which in practice implies convergence to a local maximum. The EM algorithm is only guaranteed to converge to a local maximum. In practice multiple restarts with different initializations can potentially mitigate the local maximum problem. In this j paper we use majority voting µi = 1/R ∑R yi as the initialization for µi to start the EM-algorithm. j=1 1303 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2.6 A Bayesian Approach In some applications we may want to trust a particular expert more than the others. One way to achieve this is by imposing priors on the sensitivity and specificity of the experts. Since α j and β j represent the probability of a binary event, a natural choice of prior is the beta prior. The beta prior is also conjugate to the binomial distribution. For any a > 0, b > 0, and δ ∈ [0, 1] the beta distribution is given by δa−1 (1 − δ)b−1 , Beta(δ|a, b) = B(a, b) where B(a, b) = 01 δa−1 (1 − δ)b−1 dδ is the beta function.We assume a beta prior10 for both the sensitivity and the specificity as R j j j j j j j j Pr[α j |a1 , a2 ] = Beta(α j |a1 , a2 ). Pr[β j |b1 , b2 ] = Beta(β j |b1 , b2 ). For sake of completeness we also assume a zero mean Gaussian prior on the weights w with inverse covariance matrix Γ, that is, Pr[w] = N (w|0, Γ−1 ). Assuming that {α j }, {β j }, and w have independent priors, the maximum-a-posteriori (MAP) estimator is found by maximizing the logposterior, that is, ˆ θMAP = arg max{ln Pr[D |θ] + ln Pr[θ]}. θ An EM algorithm can be derived in a similar fashion for MAP estimation by relying on the interpretation of Neal and Hinton (1998). The final algorithm is summarized below: j 1. Initialize µi = (1/R) ∑R yi based on majority voting. j=1 2. Given µi , estimate the sensitivity and specificity of each annotator/expert as follows. j αj = j a1 − 1 + ∑N µi yi i=1 j j a1 + a2 − 2 + ∑N µi i=1 j β j = . j b1 − 1 + ∑N (1 − µi )(1 − yi ) i=1 j j b1 + b2 − 2 + ∑N (1 − µi ) i=1 . (4) The Newton-Raphson update for optimizing w is given by wt+1 = wt − ηH −1 g, with step length η, gradient vector N g(w) = ∑ µi − σ(w⊤ xi ) xi − Γw, i=1 and Hessian matrix N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ − Γ. i i=1 10. It may be convenient to specify a prior in terms of the mean µ and variance σ2 . The mean and the variance for a beta prior are given by µ = a/(a + b) and σ2 = ab/((a + b)2 (a + b + 1)). Solving for a and b we get a = (−µ3 + µ2 − µσ2 )/σ2 and b = a(1 − µ)/µ. 1304 L EARNING F ROM C ROWDS 3. Given the sensitivity and specificity of each annotator and the model parameters, update µi as µi = ai pi , ai pi + bi (1 − pi ) (5) where pi = σ(w⊤ xi ). R ai = ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi = ∏[β j ]1−y [1 − β j ]y . j i j i (6) j=1 Iterate (2) and (3) till convergence. 2.7 Discussions 1. Estimate of the gold standard The value of the posterior probability µi is a soft probabilistic estimate of the actual ground truth yi , that is, µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. The actual i i hidden label yi can be estimated by applying a threshold on µi , that is, yi = 1 if µi ≥ γ and zero otherwise. We can use γ = 0.5 as the threshold. By varying γ we can change the misclassification costs and obtain a ground truth with large sensitivity or large specificity. Because of this in our experimental validation we can actually draw an ROC curve for the estimated ground truth. 2. Log-odds of µ A particularly revealing insight can be obtained in terms of the log-odds or the logit of the posterior probability µi . From (5) the logit of µi can be written as logit(µi ) = ln Pr[yi = 1|y1 , . . . , yR , xi , θ] µi i i = ln 1 − µi Pr[yi = 0|y1 , . . . , yR , xi , θ] i i R = w⊤ xi + c + ∑ yi [logit(α j ) + logit(β j )]. j j=1 j where c = ∑R log 1−α is a constant term which does not depend on i. This indicates that j=1 βj the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. The weight of each expert is the sum of the logit of the sensitivity and specificity. 3. Using any other classifier For ease of exposition we used logistic regression. However, the proposed algorithm can be used with any generalized linear model or in fact with any classifier that can be trained with soft probabilistic labels. In each step of the EM-algorithm, the classifier is trained with instances sampled from µi . This modification is easy for most probabilistic classifiers. For general black-box classifiers where we cannot tweak the training algorithm an alternate approach is to replicate the training examples according to the soft label. For example a probabilistic label µi = 0.8 can be effectively simulated by adding 8 training examples with deterministic label 1 and 2 examples with label 0. 1305 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 4. Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. Here instead of learning a classifier we estimate p which is the prevalence of the positive class, that is, p = Pr[yi = 1]. We further assume a beta prior for the prevalence, that is, Beta(p|p1 , p2 ). The algorithm simplifies as follows. j (a) Initialize µi = (1/R) ∑R yi based on majority voting. j=1 (b) Given µi , estimate the sensitivity and specificity of each annotator using (4). The prevalence of the positive class is estimated as follows. p = p1 − 1 + ∑N µi i=1 . p1 + p2 − 2 + N (c) Given the sensitivity and specificity of each annotator and prevalence, refine µi as follows. ai p µi = . ai p + bi (1 − p) Iterate (2) and (3) till convergence. This algorithm is similar to the one proposed by Dawid and Skene (1979) and Smyth et al. (1995). 5. Handling missing labels The proposed approach can easily handle missing labels, that is, when the labels from some experts are missing for some instances. Let Ri be the number of radiologists labeling the ith instance, and let N j be the number of instances labeled by the jth radiologist. Then in the EM algorithm, we just need to replace N by N j for estimating the sensitivity and specificity in (4), and replace R by Ri for updating µi in (6). 6. Evaluating a classifier We can use the probability scores µi directly to evaluate classifiers. If zi are the labels obtained from any other classifier, then sensitivity and specificity can be estimated as ∑N (1 − µi )(1 − zi ) ∑N µi zi , β = i=1 N . α = i=1 ∑N µi ∑i=1 (1 − µi ) i=1 7. Posterior approximation At the end of each EM iteration a crude approximation to the posterior is obtained as N N α j ∼ Beta α j |a1 + ∑ µi yi , a2 + ∑ µi (1 − yi ) , j j i=1 N j j i=1 N β j ∼ Beta β j |b1 + ∑ (1 − µi )(1 − yi ), b2 + ∑ (1 − µi )yi j j i=1 j j . i=1 3. Multi-class Classification In this section we describe how the proposed approach for binary classification can be extended to categorical data. Suppose there are K ≥ 2 categories. An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity—which are three 1306 L EARNING F ROM C ROWDS different kinds on nodules. We can extend the previous model and introduce a vector of multinomial j j j parameters αc = (αc1 , . . . , αcK ) for each annotator, where αck := Pr[y j = k|y = c] j and ∑K αck = 1. Here αck denotes the probability that the annotator j assigns class k to an instance k=1 j j given the true class is c. When K = 2, α11 and α00 are sensitivity and specificity, respectively. A similar EM algorithm can be derived. In the E-step, we estimate j j R K Pr[yi = c|Y , α] ∝ Pr[yi = c|xi ] ∏ ∏ (αck )δ(yi ,k) , j j j=1 k=1 where δ(u, v) = 1 if u = v and 0 otherwise and in the M-step we learn a multi-class classifier and update the multinomial parameter as j αck = j ∑N Pr[yi = c|Y , α]δ(yi , k) i=1 . ∑N Pr[yi = c|Y , α] i=1 One can also assign a Dirichlet prior for the multinomial parameters, and this results in a smoothing term in the above updates in the MAP estimate. 4. Ordinal Regression We now consider the situation where the outputs are categorical and have an ordering among the labels. In the CAD domain the radiologist often gives a score (for example, 1 to 5 from lowest to highest) to indicate how likely she thinks it is malignant. This is different from a multi-class setting in which we do not have any preference among the multiple class labels. j Let yi ∈ {1, . . . , K} be the label assigned to the ith instance by the jth expert. Note that there is an ordering in the labels 1 < . . . < K. A simple approach is to convert the ordinal data into a series of binary data (Frank and Hall, 2001). Specifically the K class ordinal labels are transformed into K − 1 binary class labels as follows: jc yi = j 1 if yi > c 0 otherwise c = 1, . . . , K − 1. Applying the same procedure used for binary labels we can estimate Pr[yi > c] for c = 1, . . . , K − 1. The probability of the actual class values can then be obtained as Pr[yi = c] = Pr[yi > c − 1 and yi ≤ c] = Pr[yi > c − 1] − Pr[yi > c]. The class with the maximum probability is assigned to the instance. 5. Regression In this section we develop a similar algorithm to learn a regression function using annotations from multiple experts. In the CAD domain as a part of the annotation process a common task for a radiologist is to measure the diameter of a suspicious lesion. 1307 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 5.1 Model for Annotators j Let yi ∈ R be the continuous target value assigned to the ith instance by the jth annotator. Our model is that the annotator provides a noisy version of the actual true value yi . For the jth annotator we will assume a Gaussian noise model with mean yi (the true unknown value) and inverse-variance (precision) τ j , that is, j j Pr[yi |yi , τ j ] = N (yi |yi , 1/τ j ), (7) where the Gaussian distribution is defined as N (z|m, σ2 ) = (2πσ2 )−1/2 exp(−(z − m)2 /2σ2 ). The unknown inverse-variance τ j measures the accuracy of each annotator—the larger the value of τ j the more accurate the annotator. We have assumed that τ j does not depend on the instance xi . For example, in the CAD domain, this means that the radiologist’s accuracy does not depend on the nodule she is measuring. While this a practical assumption, it is not entirely true. It is known that some nodules are harder to measure than others. 5.2 Linear Regression Model for Features As before we consider the family of linear regression functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. We assume that the actual target response yi is given by the deterministic regression function fw with additive Gaussian noise, that is, yi = w⊤ xi + ε, where ε is a zero-mean Gaussian random variable with inverse-variance (precision) γ. Hence Pr[yi |xi , w, γ] = N (yi |w⊤ xi , 1/γ). (8) 5.3 Combined Model Combining both the annotator (7) and the regressor (8) model we have Pr[yi |xi , w, τ j , γ] = j Z Pr[yi |yi , τ j ]Pr[yi |xi , w, γ]dyi = N (yi |w⊤ xi , 1/γ + 1/τ j ). j j Since the two precision terms (γ and τ j ) are grouped together they are not uniquely identifiable. Hence we will define a new precision term λ j as 1 1 1 = + j. j λ γ τ So we have the following model Pr[yi |xi , w, λ j ] = N (yi |w⊤ xi , 1/λ j ). j j (9) 5.4 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R experts, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and the precision λ = [λ1 , . . . , λR ] of i i=1 i all the annotators. 1308 L EARNING F ROM C ROWDS 5.5 Maximum-likelihood Estimator Assuming the instances are independent the likelihood of the parameters θ = {w, λ} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 Conditional on the instance xi we assume that y1 , . . . , yR are independent, that is, the annotators i i provide their responses independently. Hence from (9) the likelihood can be written as N R Pr[D |θ] = ∏ ∏ N (yi |w⊤ xi , 1/λ j ). j i=1 j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood θML = {λ, w} = arg max{ln Pr[D |θ]}. θ By equating the gradient of the log-likelihood to zero we obtain the following update equations for the precision and the weight vector. 1 λj = 1 N ∑ yij − w⊤ xi N i=1 N w = ∑ −1 2 . (10) N ∑R λ j yi j=1 i=1 ∑R λ j j=1 ∑ xi xi x⊤ i i=1 j . (11) As the parameters w and λ are coupled together we iterate these two steps till convergence. 5.6 Discussions 1. Is this standard least-squares? Define the design matrix X = [x1 , . . . , xN ]⊤ and the rej j sponse vector for each annotator as y j = [y1 , . . . , yN ]⊤ . Using matrix notation Equation 11 can be written as w = (X ⊤ X)−1 X ⊤ y where y = ∑R λ j y j j=1 ∑R λ j j=1 . (12) Equation 12 is essentially the solution to a standard linear regression model, except that we are training a linear regression model with y as the ground truth, which is a precision weighted mean of the response vectors from all the annotators. The variance of each annotator is estimated using (10). The final algorithm iteratively establishes a particular gold standard (y), measures the performance of the annotators and learns a regressor given that gold standard, and refines the gold standard based on the performance measures. 2. Are we better than the best annotator? If we assume λ is fixed (i.e., we ignore the variability and assume that it is well estimated) then w is an unbiased estimator of w and the ˆ covariance matrix is given by Cov(w) = Cov(y) X ⊤ X ˆ 1309 −1 = 1 ∑R λ j j=1 X ⊤X −1 . R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Since ∑R λ j > max j (λ j ) the proposed method has a lower variance than the regressor learnt j=1 with the best annotator (i.e., the one with the minimum variance). 3. Are we better than the average? For a fixed X the error in w depends only on the variance of y j . If we know the true λ j then yi is the best linear unbiased estimator for yi which minij mizes the variance. To see this consider any linear estimator of the form yi = ∑ j a j (yi − b j ). The variance is given by Var[yi ] = ∑ j (a j )2 /λ j . Since E[yi ] = yi ∑ j a j , for the bias of this estimator to be zero we require that ∑ j a j = 1. Solving the constrained minimization problem we see that a j = λ j / ∑ j λ j minimizes the variance. 4. Obtaining a consensus without features When no features are available the same algorithm can be simplified to get a consensus estimate of the actual ground truth and also evaluate the annotators. Essentially we have to iterate the following two updates till convergence ∑R λ j yi j=1 1 ∑R λ j j=1 λj j yi = = 1 N ∑ y j − yi N i=1 i 2 . 6. Experimental Validation We now experimentally validate the proposed algorithms on both simulated and real data. 6.1 Classification Experiments We use two CAD and one text data set in our experiments. The CAD data sets include a digital mammography data set and a breast MRI data set, both of which are biopsy proven, that is, the gold standard is available. For the digital mammography data set we simulate the radiologists in order to validate our methods. The breast MRI data has annotations from four radiologists. We also report results on a Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk which has annotations from 164 annotators. 6.1.1 D IGITAL M AMMOGRAPHY WITH S IMULATED R ADIOLOGISTS Mammograms are used as a screening tool to detect early breast cancer. CAD systems search for abnormal areas (lesions) in a digitized mammographic image. These lesions generally indicate the presence of malignant cancer. The CAD system then highlights these areas on the images, alerting the radiologist to the need for a further diagnostic mammogram or a biopsy. In classification terms, given a set of descriptive morphological features for a region on a image, the task is to predict whether it is potentially malignant (1) or not (0). In order to train such a classifier, a set of mammograms is collected from hospitals. The ground truth (whether it is cancer or not) is obtained from biopsy. Since biopsy is an expensive, tedious, and an invasive process, very often CAD systems are built from labels collected from multiple expert radiologists who visually examine the mammograms and mark the lesion locations—this constitutes our ground truth (multiple labels) for learning. In this experiment we use a proprietary biopsy-proven data set (Krishnapuram et al., 2008) containing 497 positive and 1618 negative examples. Each instance is described by a set of 27 morphological features. In order to validate our proposed algorithm, we simulate multiple radiologists according to the two-coin model described in § 2.1. Based on the labels from multiple radiologists, 1310 L EARNING F ROM C ROWDS we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. For the first set of experiments we use 5 radiologists with sensitivity α = [0.90 0.80 0.57 0.60 0.55] and specificity β = [0.95 0.85 0.62 0.65 0.58]. This corresponds to a scenario where the first two radiologists are experts and the last three are novices. Figure 1 summarizes the results. We compare on three different aspects: (1) How good is the learnt classifier? (2) How well can we estimate the sensitivity and specificity of each radiologist? (3) How good is the estimated ground truth? The following observations can be made. 1. Classifier performance Figure 1(a) plots the ROC curve of the learnt classifier on the training set. The dotted (black) line is the ROC curve for the classifier learnt using the actual ground truth. The solid (red) line is the ROC curve for the proposed algorithm and the dashed (blue) line is for the classifier learnt using the majority-voting scheme. The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. The area under the ROC curve (AUC) for the proposed algorithm is around 3.5% greater than that learnt using the majority-voting scheme. 2. Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(b). The end of the solid red line shows the estimates of the sensitivity and specificity from the proposed method. We used a uniform prior on all the parameters. The ellipse plots the contour of one standard deviation as obtained from the beta posterior estimates. The end of the dashed blue line shows the estimate obtained from the majorityvoting algorithm. We see that the proposed method is much closer to the actual values of sensitivity and specificity. 3. Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. From Figure 1(b) we can see that the ROC curve for the proposed method dominates the majority voting ROC curve. Furthermore, the area under the ROC curve (AUC) is around 3% higher. The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). It does not have an idea of who is an expert and who is a novice. The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity. The improvement obtained is quite large in Figure 2 which corresponds a situation where we have only one expert and 7 novices. 4. Joint Estimation To learn a classifier, Smyth et al. (1995) proposed to first estimate the golden ground truth and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. Figure 3 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. 6.1.2 B REAST MRI In this example, each radiologist reviews the breast MRI data and assesses the malignancy of each lesion on a BIRADS scale of 1 to 5. The BIRADS scale is defined as follows: 1 Negative, 2 Benign, 1311 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Majority Voting Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 EM algorithm Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0217 0.5869 0.2391 0.1521 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1785 0.1071 0.2500 0.4642 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0000 0.6957 0.1304 0.1739 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1428 0.0000 0.3214 0.5357 Table 1: The confusion matrix for the estimate obtained using majority voting and the proposed EM algorithm. The x indicates that there was no such category in the true labels (the gold standard). The gold-standard is obtained by the biopsy which can confirm whether it is benign (BIRADS=2) or malignant (BIRADS=5). 3 Probably Benign, 4 Suspicious abnormality, and 5 Highly suggestive of malignancy. Our data set comprises of 75 lesions with annotations from four radiologists, and the true labels from biopsy. Based on eight morphological features, we have to predict whether a lesion is malignant or not. For the first experiment we reduce the BIRADS scale to a binary one: any lesion with a BIRADS > 3 is considered malignant and benign otherwise. The set included 28 malignant and 47 benign lesions. Figure 4 summarizes the results. We show the leave-one-out cross validated ROC for the classifier. The cross-validated AUC of the proposed method is approximately 6% better than the majority voting baseline. We also consider the BIRADS labels as a set of ordinal measurements since there is an ordering among the BIRADS label. The confusion matrix in Table 1 shows that the EM algorithm is significantly superior than the majority voting in estimating the true BIRADS. 6.1.3 R ECOGNIZING T EXTUAL E NTAILMENT Finally we report results on Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk. In this task, the annotator is presented with two sentences and given a choice of whether the second sentence can be inferred from the first. The data has 800 tasks and 164 distinct readers, with 10 annotations per task along with the golden ground truth. The majority of the entries (94 %) in the 800x164 matrix are missing. There is one annotator who has labeled all the tasks. We use this data set to obtain an estimate of the actual ground truth. Figure 5 plots the accuracy of the estimated ground truth as a function of the number of annotators. The proposed EM algorithm achieves a higher accuracy than majority voting. In other words to achieve a desired accuracy the proposed algorithm needs fewer annotators than the majority voting scheme. 1312 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (b) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologists is marked as a ×. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation. 1313 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.906 Majority voting baseline AUC=0.884 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm AUC=0.967 Majority voting baseline AUC=0.872 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 2: Same as Figure 1 except with 8 different radiologist annotations. 1314 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 3: ROC curves comparing the proposed algorithm (solid red line) with the Decoupled Estimation procedure (dotted blue line), which refers to the algorithm where the ground truth is first estimated using just the labels from the five radiologists and then a logistic regression classifier is trained using the soft probabilistic labels. In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm. 1315 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Leave−One−Out ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 0.9 True Positive Rate (sensitivity) 0.8 0.7 0.6 0.5 0.4 Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 4: Breast MRI results. (a) The leave-one-out cross validated ROC. (b) ROC for the estimated ground truth. 1316 L EARNING F ROM C ROWDS 0.95 0.9 0.85 Accuracy 0.8 0.75 0.7 0.65 Majority Voting EM Algorithm 0.6 0.55 0.5 20 40 60 80 100 120 Number of Annotators 140 160 Figure 5: The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. The plot was generated by randomly sampling the annotators 100 times. 6.2 Regression Experiments We first illustrate the algorithm on a toy dataset and then present a case study for automated polyp measurements. 6.2.1 I LLUSTRATION Figure 6 illustrates the the proposed algorithm for regression on a one-dimensional toy data set with three annotators. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 20 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 6.2.2 AUTOMATED P OLYP M EASUREMENTS Colorectal polyps are small colonic findings that may develop into cancer at a later stage. The diameter of the polyp is one of the key factors which decides the malignancy of a suspicious polyp. Hence accurate size estimation is crucial to decide the action to be taken on a polyp. We have developed various algorithms to segment a polyp. Multiple segmentation algorithms give rise to a set of features which are correlated with the diameter of the polyp. We want to learn a regression function which can predict the diameter of a polyp as a function of these features. In order to learn a regression function we collect our ground truth by asking many radiologists to manually measure the the diameter of the polyps from the three-dimensional images. In practice there is a lot of disagreement among the radiologists as to the actual size of the polyp. 1317 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2 1 N=50 examples R=3 annotators 2 2 5 Actual regression model Model with annotator 1 (Precision=0.01) 2 1 Model with annotator 2 (Precision=0.10) 2 Model with annotator 3 (Precision=1.00) 3 2 3 2 Model with average response 2 Proposed algorithm 2 3 1 1 4 3 2 2 y (response) 2 2 1 3 2 3 3 2 3 1 −1 3 3 2 2 3 3 1 2 2 2 3 33 3 1 1 3 3 3 3 2 3 3 2 33 33 2 2 2 3 3 3 32 1 2 3 2 1 3 3 2 1 2 1 2 2 1 1 2 3 2 3 2 3 1 3 3 −3 2 3 3 2 3 3 3 2 3 3 2 1 3 3 2 0 −2 2 2 3 2 3 2 3 1 2 2 1 2 3 −4 1 2 2 1 −5 0 2 0.1 0.2 1 0.3 0.4 0.5 0.6 10.8 2 0.7 x (feature) 0.9 1 1 Figure 6: Illustration of the proposed algorithm on a one-dimensional toy data set. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 50 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 5 5 10 Actual Polyp Diameter (mm) (a) Gold standard model 15 10 5 0 0 Pearson Correlation Coefficient=0.554966 RMSE=2.887740 15 Estimated Polyp Diameter (mm) 10 0 0 Pearson Correlation Coefficient=0.706558 RMSE=1.991576 15 Estimated Polyp Diameter (mm) Estimated Polyp Diameter (mm) Pearson Correlation Coefficient=0.714720 RMSE=1.970815 15 5 10 Actual Polyp Diameter (mm) (b) Proposed model 15 10 5 0 0 5 10 Actual Polyp Diameter (mm) 15 (c) Average model Figure 7: Scatter plot of the actual polyp diameter vs the diameter predicted by the models learnt using (a) the actual gold standard, (b) the proposed algorithm with annotations from five radiologists, and (c) the average of the radiologist’s annotations. (See § 6.2.2 for a description of the experimental setup.) 1318 L EARNING F ROM C ROWDS We use a proprietary data set containing 393 examples (which point to 285 distinct polyps— the segmentation algorithms generally return multiple marks on the same polyp.) along with the measured diameter (ranging from 2mm to 15mm) as our training set. Each example is described by a set of 60 morphological features which are correlated to the diameter of the polyp. In order to validate the feasibility of our proposed algorithm, we simulate five radiologists according to the noisy model described in § 5.1 with τ = [0.001 0.01 0.1 1 10]. This corresponds to a situation where the first three radiologists are extremely noisy and the last two are quite accurate. Based on the measurements from multiple radiologists, we can simultaneously (1) learn a linear regressor and (2) estimate the precision of each radiologist. We compare the results with the classifier trained using the actual golden ground truth as well as the regressor learnt using the average of the radiologists measurements. The results are validated on an independent test set containing 397 examples (which point to 298 distinct polyps). Figure 7 shows the scatter plot of the actual polyp diameter vs the diameter predicted by the three different models. We compare the performance based on the root mean squared error (RMSE) and also the Pearson’s correlation coefficient. The regressor learnt using the proposed iterative algorithm (Figure 7(b)) is almost as good as the one learnt using the golden ground truth (Figure 7(a)). The correlation coefficient for the proposed algorithm is significantly larger than that learnt using the average of the radiologists response. The estimate obtained by averaging is closer to the novices since they form a majority (3/5). The proposed algorithm appropriately weights each radiologist based on their estimated precisions. 7. Conclusions and Future Work In this paper we proposed a probabilistic framework for supervised learning with multiple annotators providing labels but no absolute gold standard. The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. We specifically discussed binary/categorical/ordinal classification and regression problems. We made two key assumptions: (1) the performance of each annotator does not depend on the feature vector for a given instance and (2) conditional on the truth the experts are independent, that is, they make their errors independently. As we pointed out earlier these assumptions are not true in practice. The annotator performance depends on the instance he is labeling and there is some degree of correlation among the annotators. We briefly discuss some strategies to relax these two assumptions. 7.1 Instance Difficulty One drawback of the current model is that it doesn’t estimate difficulty of items. It is often observed that for the easy instances all the annotators agree on the labels—thus violating our conditional independence assumption. The difficulty of annotating an item can be captured by another latent variable γi for each instance—which modulates the annotators performance. Models for this have been developed in the area of item-response theory (Baker and Kim, 2004) and also in epidemiology (Uebersax and Grove, 1993)—see also Whitehill et al. (2009) for a recent paper in the machine learning community. While these models do not take into account the available features our pro1319 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY posed model for sensitivity and specificity can be extended as follows (in place of (1) and (2)): α j (γi ) := Pr[yi = 1|yi = 1, γi ] = σ(a j1 + b j1 γi ). j β j (γi ) := Pr[yi = 0|yi = 0, γi ] = σ(a j0 + b j0 γi ). j Here the parameters a j1 and a j0 are related to the sensitivity and specificity of the jth annotator, while the latent term γi captures the difficulty of the instance. The key assumption here is that the annotators are independent conditional on both yi and γi . Various assumptions can be made on two parameters b j1 and b j0 to simplify these models further—for example we could set b j1 = b1 and b j0 = b0 for all the annotators. 7.2 Annotators Actually Look at the Data In our model we made the assumption that the sensitivity α j and the specificity β j of the jth annotator does not depend on the feature vector xi . For example, in the CAD domain, this meant that the radiologist’s performance is consistent across different sub-groups of data—which is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. We can extend the previous model such that the sensitivity and the specificity depends on the feature vector xi explicitly as follows j⊤ α j (γi , xi ) := Pr[yi = 1|yi = 1, γi , xi ] = σ(a j1 + b j1 γi + wα xi ). j j⊤ α j (γi , xi ) := Pr[yi = 0|yi = 0, γi , xi ] = σ(a j0 + b j0 γi + wβ xi ). j However this change increases the number of parameters to be learned. References P. S. Albert and L. E. Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60:427–435, 2004. F. B. Baker and S. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, 2 edition, 2004. B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. S. R. Cholleti, S. A. Goldman, A. Blum, D. G. Politte, and S. Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 2008 20th IEEE international Conference on Tools with Artificial intelligence, 2008. A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. O. Dekel and O. Shamir. Vox Populi: Collecting high-quality labels from a crowd. In COLT 2009: Proceedings of the 22nd Annual Conference on Learning Theory, 2009a. O. Dekel and O. Shamir. Good learners for evil teachers. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 233–240, 2009b. 1320 L EARNING F ROM C ROWDS A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977. P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 259–268, 2009. E. Frank and M. Hall. A simple approach to ordinal classification. Lecture Notes in Computer Science, pages 145–156, 2001. G. Fung, B. Krishnapuram, J. Bi, M. Dundar, V. C. Raykar, S. Yu, R. Rosales, S. Krishnan, and R. B. Rao. Mining medical images. In Fifteenth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining: Third Workshop on Data Mining Case Studies and Practice Prize, 2009. J. Howe. Crowd sourcing: Why the Power of the Crowd Is Driving the Future of Business. 2008. S. L. Hui and S. D. Walter. Estimating the error rates of diagnostic tests. Biometrics, 36:167–171, 1980. S. L. Hui and X. H. Zhou. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research, 7:354–370, 1998. R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems 15, pages 897–904. 2003. B. Krishnapuram, J. Stoeckel, V. C. Raykar, R. B. Rao, P. Bamberger, E. Ratner, N. Merlet, I. Stainvas, M. Abramov, and A. Manevitch. Multiple-instance learning improves CAD detection of masses in digital mammography. In IWDM 2008: Proceedings of the 9th international workshop on Digital Mammography, pages 350–357. 2008. G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79–87, 1992. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 889–896, 2009. V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. P. Smyth. Learning with probabilistic supervision. In Computational Learning Theory and Natural Learning Systems 3, pages 163–182. MIT Press, 1995. 1321 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085– 1092. 1995. R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast - But is it good? Evaluating nonexpert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008. A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. J. S. Uebersax and W. M. Grove. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 49:823–835, 1993. S. K. Warfield, K. H. Zou, and W. M. Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004. J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. 1322
2 0.10118409 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
3 0.090884462 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.072286889 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other stateof-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
5 0.068509482 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
6 0.06827306 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
7 0.061628066 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
8 0.054114006 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
9 0.053774919 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
10 0.044182945 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
11 0.043657396 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
12 0.042619441 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.040554132 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
14 0.040328033 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
15 0.037149005 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.036295213 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
17 0.035856206 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
18 0.035470542 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
19 0.034667537 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
20 0.033358864 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
topicId topicWeight
[(0, -0.18), (1, 0.018), (2, -0.066), (3, 0.071), (4, 0.012), (5, 0.053), (6, 0.093), (7, 0.116), (8, 0.016), (9, -0.048), (10, -0.056), (11, 0.063), (12, 0.089), (13, 0.11), (14, 0.039), (15, 0.068), (16, -0.008), (17, 0.098), (18, 0.001), (19, 0.012), (20, 0.006), (21, 0.033), (22, 0.093), (23, 0.097), (24, -0.096), (25, 0.15), (26, 0.009), (27, 0.057), (28, 0.08), (29, 0.198), (30, 0.181), (31, -0.017), (32, -0.286), (33, -0.046), (34, -0.207), (35, -0.165), (36, -0.049), (37, 0.006), (38, 0.088), (39, 0.162), (40, -0.069), (41, -0.033), (42, 0.05), (43, 0.024), (44, 0.06), (45, -0.007), (46, 0.228), (47, 0.032), (48, 0.071), (49, -0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.93379301 61 jmlr-2010-Learning From Crowds
Author: Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy
Abstract: For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline. Keywords: multiple annotators, multiple experts, multiple teachers, crowdsourcing 1. Supervised Learning From Multiple Annotators/Experts A typical supervised learning scenario consists of a training set D = {(xi , yi )}N containing N i=1 instances, where xi ∈ X is an instance (typically a d-dimensional feature vector) and yi ∈ Y is the corresponding known label. The task is to learn a function f : X → Y which generalizes well on unseen data. Specifically for binary classification the supervision is from the set Y = {0, 1}, for multi-class classification Y = {1, . . . , K}, for ordinal regression Y = {1, . . . , K} (with an ordering 1 < . . . < K), and Y = R for regression. c 2010 Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo H. Valadez, Charles Florin, Luca Bogoni and Linda Moy. R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY However, for many real life tasks, it may not be possible, or may be too expensive (or tedious) to acquire the actual label yi for training—which we refer to as the gold standard or the objective ground truth. Instead, we may have multiple (possibly noisy) labels y1 , . . . , yR provided by R i i different experts or annotators. In practice, there is a substantial amount of disagreement among the experts, and hence it is of great practical interest to address conventional supervised learning algorithms in this scenario. Our motivation for this work comes from the area of computer-aided diagnosis1 (CAD), where the task is to build a classifier to predict whether a suspicious region on a medical image (like a X-ray, CT scan, or MRI) is malignant (cancerous) or benign. In order to train such a classifier, a set of images is collected from hospitals. The actual gold standard (whether it is cancer or not) can only be obtained from a biopsy of the tissue. Since it is an expensive, invasive, and potentially dangerous process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard.2 The radiologist also annotates various descriptors of the potentially malignant lesion, like the size (a regression problem), shape (a multiclass classification problem), and also degree of malignancy (an ordinal regression problem). The radiologists come from a diverse pool including luminaries, experts, residents, and novices. Very often there is lot of disagreement among the annotations. For a lot of tasks the labels provided by the annotators are inherently subjective and there will be substantial variation among different annotators. The domain of text classification offers such a scenario. In this context the task is to predict the category for a token of text. The labels for training are assigned by human annotators who read the text and attribute their subjective category. With the advent of crowdsourcing (Howe, 2008) services like Amazon’s Mechanical Turk,3 Games with a Purpose,4 and reCAPTCHA5 it is quite inexpensive to acquire labels from a large number of annotators (possibly thousands) in a short time (Sheng et al., 2008; Snow et al., 2008; Sorokin and Forsyth, 2008). Websites such as Galaxy Zoo6 allow the public to label astronomical images over the internet. In situations like these, the performance of different annotators can vary widely (some may even be malicious), and without the actual gold standard, it may not be possible to evaluate the annotators. In this work, we provide principled probabilistic solutions to the following questions: 1. How to adapt conventional supervised learning algorithms when we have multiple annotators providing subjective labels but no objective gold standard? 2. How to evaluate systems when we do not have absolute gold-standard? 3. A closely related problem—particularly relevant when there are a large number of annotators— is to estimate how reliable/trustworthy is each annotator. 1. See Fung et al. (2009) for an overview of the data mining issues in this area. 2. Sometimes even a biopsy cannot confirm whether it is cancer or not and hence all we can hope to get is subjective ground truth. 3. Mechanical Turk found at https://www.mturk.com. 4. Games with a Purpose found at http://www.gwap.com. 5. reCAPTCHA found at http://recaptcha.net/. 6. Galaxy Zoo found at http://galaxyzoo.org. 1298 L EARNING F ROM C ROWDS 1.1 The Problem With Majority Voting When we have multiple labels a commonly used strategy is to use the labels on which the majority of them agree (or average for regression problem) as an estimate of the actual gold standard. For binary classification problems this amounts to using the majority label,7 that is, j yi = ˆ 1 if (1/R) ∑R yi > 0.5 j=1 , j 0 if (1/R) ∑R yi < 0.5 j=1 as an estimate of the hidden true label and use this estimate to learn and evaluate classifiers/annotators. Another strategy is that of considering every pair (instance, label) provided by each expert as a separate example. Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, that is, R Pr[yi = 1|y1 , . . . , yR ] = (1/R) ∑ yi . i i j j=1 Majority voting assumes all experts are equally good. However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. One could address this problem by introducing a weight capturing how good each expert is. But how would one measure the performance of an expert when there is no gold standard available? 1.2 Proposed Approach and Organization To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator that jointly learns the classifier/regressor, the annotator accuracy, and the actual true label. For ease of exposition we start with binary classification problem in § 2. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard (§ 2.1). The proposed algorithm automatically discovers the best experts and assigns a higher weight to them. In order to incorporate prior knowledge about each annotator, we impose a beta prior on the sensitivity and specificity and derive the maximum-a-posteriori estimate (§ 2.6). The final estimation is performed by an Expectation Maximization (EM) algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. While the proposed approach is described using logistic regression as the base classifier (§ 2.2), it is quite general, and can be used with any black-box classifier (§ 2.7), and can also handle missing labels (that is, each expert is not required to label all the instances). Furthermore, we extend the proposed algorithm to handle categorical (§ 3), ordinal (§ 4), and regression problems (§ 5). In § 6 section we extensively validate our approach using both simulated data and real data from different domains. 1.3 Related Work and Novel Contributions We first summarize the novel contributions of this work in context of other related work in this emerging new area. There has been a long line of work in the biostatistics and epidemiology literature on latent variable models where the task is to get an estimate of the observer error rates based 7. When there is no clear majority among the multiple experts (that is, yi = 0.5) in CAD domain the final decision is ˆ often made by an adjudicator or a super-expert. When there is no adjudicator a fair coin toss is used. 1299 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY on the results from multiple diagnostic tests without a gold standard (see Dawid and Skene, 1979, Hui and Walter, 1980, Hui and Zhou, 1998, Albert and Dodd, 2004 and references therein). In the machine learning community Smyth et al. (1995) first addressed the same problem in the context of labeling volcanoes in satellite images of Venus. We differ from this previous body of work in the following aspects: 1. Unlike Dawid and Skene (1979) and Smyth et al. (1995) which just focused on estimating the ground truth from multiple noisy labels, we specifically address the issue of learning a classifier. Estimating the ground truth and the annotator/classifer performance is a byproduct of our proposed algorithm. 2. In order to learn a classifier Smyth (1995) proposed to first estimate the ground truth (without using the features) and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly. Our experiments (§ 6.1.1) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. 3. Our solution is more general and can be easily extended to categorical(§ 3), ordinal(§ 4), and continuous data(§ 5). It can also be used in conjunction with any supervised learning algorithm. A preliminary version of this paper (Raykar et al., 2009) mainly discussed the binary classification problem. 4. Our proposed algorithm is also Bayesian—we impose a prior on the experts. The priors can potential capture the skill of different annotators. In this paper we refrain from doing a full Bayesian inference and use the mode of the posterior as a point estimate. A recent complete Bayesian generalization of these kind of models has been developed by Carpenter (2008). 5. The EM approach used in this paper is similar to that proposed by Jin and Ghahramani (2003). However their motivation is somewhat different. In their setting, each training example is annotated with a set of possible labels, only one of which is correct. There has been recent interest in the natural language processing (Sheng et al., 2008; Snow et al., 2008) and computer vision (Sorokin and Forsyth, 2008) communities where they use Amazon’s Mechanical Turk to collect annotations from many people. They show that it can be potentially as good as that provided by an expert. Sheng et al. (2008) analyzed when it is worthwhile to acquire new labels for some of the training examples. There is also some theoretical work (see Lugosi, 1992 and Dekel and Shamir, 2009a) dealing with multiple experts. Recently Dekel and Shamir (2009b) presented an algorithm which does not resort to repeated labeling, that is, each example does not have to be labeled by multiple teachers. Donmez et al. (2009) address the issue of active learning in this scenario—How to jointly learn the accuracy of labeling sources and obtain the most informative labels for the active learning task? There has also been some work in the medical imaging community (Warfield et al., 2004; Cholleti et al., 2008). 2. Binary Classification We first describe our proposed noise model for the annotators. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard. 1300 L EARNING F ROM C ROWDS 2.1 A Two-coin Model for Annotators Let y j ∈ {0, 1} be the label assigned to the instance x by the jth annotator/expert. Let y be the actual (unobserved) label for this instance. Each annotator provides a version of this hidden true label based on two biased coins. If the true label is one, she flips a coin with bias α j (sensitivity). If the true label is zero, she flips a coin with bias β j (specificity). In each case, if she gets heads she keeps the original label, otherwise she flips the label. If the true label is one, the sensitivity (true positive rate) for the jth annotator is defined as the probability that she labels it as one. α j := Pr[y j = 1|y = 1]. (1) On the other hand, if the true label is zero, the specificity (1−false positive rate) is defined as the probability that she labels it as zero. β j := Pr[y j = 0|y = 0]. (2) The assumption introduced is that α j and β j do not depend on the instance x. For example, in the CAD domain, this means that the radiologist’s performance is consistent across different sub-groups of data.8 2.2 Classification Model While the proposed method can be used for any classifier, for ease of exposition, we consider the family of linear discriminating functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. The final classifier can be written in the following form: y = 1 if w⊤ x ≥ γ and 0 otherwise. The ˆ threshold γ determines the operating point of the classifier. The Receiver Operating Characteristic (ROC) curve is obtained as γ is swept from −∞ to ∞. The probability for the positive class is modeled as a logistic sigmoid acting on fw, that is, Pr[y = 1|x, w] = σ(w⊤ x), where the logistic sigmoid function is defined as σ(z) = 1/(1 + e−z ). This classification model is known as logistic regression. 2.3 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R annotators, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and also the sensitivity α = i i i=1 [α1 , . . . , αR ] and the specificity β = [β1 , . . . , βR ] of the R annotators. It is also of interest to get an estimate of the unknown gold standard y1 , . . . , yN . 2.4 Maximum Likelihood Estimator Assuming the training instances are independently sampled, the likelihood function of the parameters θ = {w, α, β} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 8. While this is a reasonable assumption, it is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. 1301 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY j Conditioning on the true label yi , and also using the assumption yi is conditionally independent (of everything else) given α j , β j and yi , the likelihood can be decomposed as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |yi = 1, α]Pr[yi = 1|xi, w] i i i=1 + Pr[y1 , . . . , yR |yi = 0, β]Pr[yi = 0|xi, w] . i i Given the true label yi , we assume that y1 , . . . , yR are independent, that is, the annotators make their i i decisions independently.9 Hence, R R Pr[y1 , . . . , yR |yi = 1, α] = ∏ Pr[yi |yi = 1, α j ] = ∏ [α j ]yi [1 − α j ]1−yi . i i j j=1 j j j=1 Similarly, we have R Pr[y1 , . . . , yR |yi = 0, β] = ∏ [β j ]1−yi [1 − β j ]yi . i i j j j=1 Hence the likelihood can be written as N Pr[D |θ] = ∏ ai pi + bi (1 − pi ) , i=1 where we have defined pi := σ(w⊤ xi ). R ai := ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi := ∏[β j ]1−y [1 − β j ]y . j i j i j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood, that is, ˆ θML = {α, β, w} = arg max{ln Pr[D |θ]}. ˆ ˆ ˆ θ 2.5 The EM Algorithm This maximization problem can be simplified a lot if we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an efficient iterative procedure to compute the maximum-likelihood solution in presence of missing/hidden data. We will use the unknown hidden true label yi as the missing data. If we know the missing data y = [y1 , . . . , yN ] then the complete likelihood can be written as N ln Pr[D , y|θ] = ∑ yi ln pi ai + (1 − yi ) ln(1 − pi )bi . i=1 9. This assumption is not true in general and there is some correlations among the labels assigned by multiple annotators. For example in the CAD domain if the cancer is in advanced stage (which is very easy to detect) almost all the radiologists assign the same label. 1302 L EARNING F ROM C ROWDS Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)step. The M-step involves maximization of a lower bound on the log-likelihood that is refined in each iteration by the E-step. 1. E-step. Given the observation D and the current estimate of the model parameters θ, the conditional expectation (which is a lower bound on the true likelihood) is computed as N E {ln Pr[D , y|θ]} = ∑ µi ln pi ai + (1 − µi ) ln(1 − pi )bi , (3) i=1 where the expectation is with respect to Pr[y|D , θ], and µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. Using i i Bayes’ theorem we can compute µi ∝ Pr[y1 , . . . , yR |yi = 1, θ] · Pr[yi = 1|xi, θ] i i ai pi = . ai pi + bi (1 − pi ) 2. M-step. Based on the current estimate µi and the observations D , the model parameters θ are then estimated by maximizing the conditional expectation. By equating the gradient of (3) to zero we obtain the following estimates for the sensitivity and specificity: j j αj = ∑N µi yi i=1 , ∑N µi i=1 βj = ∑N (1 − µi )(1 − yi ) i=1 . ∑N (1 − µi ) i=1 Due to the non-linearity of the sigmoid, we do not have a closed form solution for w and we have to use gradient ascent based optimization methods. We use the Newton-Raphson update given by wt+1 = wt − ηH −1 g, where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by N g(w) = ∑ µi − σ(w⊤ xi ) xi . i=1 The Hessian matrix is given by N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ . i i=1 Essentially, we are estimating a logistic regression model with probabilistic labels µi . These two steps (the E- and the M-step) can be iterated till convergence. The log-likelihood increases monotonically after every iteration, which in practice implies convergence to a local maximum. The EM algorithm is only guaranteed to converge to a local maximum. In practice multiple restarts with different initializations can potentially mitigate the local maximum problem. In this j paper we use majority voting µi = 1/R ∑R yi as the initialization for µi to start the EM-algorithm. j=1 1303 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2.6 A Bayesian Approach In some applications we may want to trust a particular expert more than the others. One way to achieve this is by imposing priors on the sensitivity and specificity of the experts. Since α j and β j represent the probability of a binary event, a natural choice of prior is the beta prior. The beta prior is also conjugate to the binomial distribution. For any a > 0, b > 0, and δ ∈ [0, 1] the beta distribution is given by δa−1 (1 − δ)b−1 , Beta(δ|a, b) = B(a, b) where B(a, b) = 01 δa−1 (1 − δ)b−1 dδ is the beta function.We assume a beta prior10 for both the sensitivity and the specificity as R j j j j j j j j Pr[α j |a1 , a2 ] = Beta(α j |a1 , a2 ). Pr[β j |b1 , b2 ] = Beta(β j |b1 , b2 ). For sake of completeness we also assume a zero mean Gaussian prior on the weights w with inverse covariance matrix Γ, that is, Pr[w] = N (w|0, Γ−1 ). Assuming that {α j }, {β j }, and w have independent priors, the maximum-a-posteriori (MAP) estimator is found by maximizing the logposterior, that is, ˆ θMAP = arg max{ln Pr[D |θ] + ln Pr[θ]}. θ An EM algorithm can be derived in a similar fashion for MAP estimation by relying on the interpretation of Neal and Hinton (1998). The final algorithm is summarized below: j 1. Initialize µi = (1/R) ∑R yi based on majority voting. j=1 2. Given µi , estimate the sensitivity and specificity of each annotator/expert as follows. j αj = j a1 − 1 + ∑N µi yi i=1 j j a1 + a2 − 2 + ∑N µi i=1 j β j = . j b1 − 1 + ∑N (1 − µi )(1 − yi ) i=1 j j b1 + b2 − 2 + ∑N (1 − µi ) i=1 . (4) The Newton-Raphson update for optimizing w is given by wt+1 = wt − ηH −1 g, with step length η, gradient vector N g(w) = ∑ µi − σ(w⊤ xi ) xi − Γw, i=1 and Hessian matrix N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ − Γ. i i=1 10. It may be convenient to specify a prior in terms of the mean µ and variance σ2 . The mean and the variance for a beta prior are given by µ = a/(a + b) and σ2 = ab/((a + b)2 (a + b + 1)). Solving for a and b we get a = (−µ3 + µ2 − µσ2 )/σ2 and b = a(1 − µ)/µ. 1304 L EARNING F ROM C ROWDS 3. Given the sensitivity and specificity of each annotator and the model parameters, update µi as µi = ai pi , ai pi + bi (1 − pi ) (5) where pi = σ(w⊤ xi ). R ai = ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi = ∏[β j ]1−y [1 − β j ]y . j i j i (6) j=1 Iterate (2) and (3) till convergence. 2.7 Discussions 1. Estimate of the gold standard The value of the posterior probability µi is a soft probabilistic estimate of the actual ground truth yi , that is, µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. The actual i i hidden label yi can be estimated by applying a threshold on µi , that is, yi = 1 if µi ≥ γ and zero otherwise. We can use γ = 0.5 as the threshold. By varying γ we can change the misclassification costs and obtain a ground truth with large sensitivity or large specificity. Because of this in our experimental validation we can actually draw an ROC curve for the estimated ground truth. 2. Log-odds of µ A particularly revealing insight can be obtained in terms of the log-odds or the logit of the posterior probability µi . From (5) the logit of µi can be written as logit(µi ) = ln Pr[yi = 1|y1 , . . . , yR , xi , θ] µi i i = ln 1 − µi Pr[yi = 0|y1 , . . . , yR , xi , θ] i i R = w⊤ xi + c + ∑ yi [logit(α j ) + logit(β j )]. j j=1 j where c = ∑R log 1−α is a constant term which does not depend on i. This indicates that j=1 βj the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. The weight of each expert is the sum of the logit of the sensitivity and specificity. 3. Using any other classifier For ease of exposition we used logistic regression. However, the proposed algorithm can be used with any generalized linear model or in fact with any classifier that can be trained with soft probabilistic labels. In each step of the EM-algorithm, the classifier is trained with instances sampled from µi . This modification is easy for most probabilistic classifiers. For general black-box classifiers where we cannot tweak the training algorithm an alternate approach is to replicate the training examples according to the soft label. For example a probabilistic label µi = 0.8 can be effectively simulated by adding 8 training examples with deterministic label 1 and 2 examples with label 0. 1305 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 4. Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. Here instead of learning a classifier we estimate p which is the prevalence of the positive class, that is, p = Pr[yi = 1]. We further assume a beta prior for the prevalence, that is, Beta(p|p1 , p2 ). The algorithm simplifies as follows. j (a) Initialize µi = (1/R) ∑R yi based on majority voting. j=1 (b) Given µi , estimate the sensitivity and specificity of each annotator using (4). The prevalence of the positive class is estimated as follows. p = p1 − 1 + ∑N µi i=1 . p1 + p2 − 2 + N (c) Given the sensitivity and specificity of each annotator and prevalence, refine µi as follows. ai p µi = . ai p + bi (1 − p) Iterate (2) and (3) till convergence. This algorithm is similar to the one proposed by Dawid and Skene (1979) and Smyth et al. (1995). 5. Handling missing labels The proposed approach can easily handle missing labels, that is, when the labels from some experts are missing for some instances. Let Ri be the number of radiologists labeling the ith instance, and let N j be the number of instances labeled by the jth radiologist. Then in the EM algorithm, we just need to replace N by N j for estimating the sensitivity and specificity in (4), and replace R by Ri for updating µi in (6). 6. Evaluating a classifier We can use the probability scores µi directly to evaluate classifiers. If zi are the labels obtained from any other classifier, then sensitivity and specificity can be estimated as ∑N (1 − µi )(1 − zi ) ∑N µi zi , β = i=1 N . α = i=1 ∑N µi ∑i=1 (1 − µi ) i=1 7. Posterior approximation At the end of each EM iteration a crude approximation to the posterior is obtained as N N α j ∼ Beta α j |a1 + ∑ µi yi , a2 + ∑ µi (1 − yi ) , j j i=1 N j j i=1 N β j ∼ Beta β j |b1 + ∑ (1 − µi )(1 − yi ), b2 + ∑ (1 − µi )yi j j i=1 j j . i=1 3. Multi-class Classification In this section we describe how the proposed approach for binary classification can be extended to categorical data. Suppose there are K ≥ 2 categories. An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity—which are three 1306 L EARNING F ROM C ROWDS different kinds on nodules. We can extend the previous model and introduce a vector of multinomial j j j parameters αc = (αc1 , . . . , αcK ) for each annotator, where αck := Pr[y j = k|y = c] j and ∑K αck = 1. Here αck denotes the probability that the annotator j assigns class k to an instance k=1 j j given the true class is c. When K = 2, α11 and α00 are sensitivity and specificity, respectively. A similar EM algorithm can be derived. In the E-step, we estimate j j R K Pr[yi = c|Y , α] ∝ Pr[yi = c|xi ] ∏ ∏ (αck )δ(yi ,k) , j j j=1 k=1 where δ(u, v) = 1 if u = v and 0 otherwise and in the M-step we learn a multi-class classifier and update the multinomial parameter as j αck = j ∑N Pr[yi = c|Y , α]δ(yi , k) i=1 . ∑N Pr[yi = c|Y , α] i=1 One can also assign a Dirichlet prior for the multinomial parameters, and this results in a smoothing term in the above updates in the MAP estimate. 4. Ordinal Regression We now consider the situation where the outputs are categorical and have an ordering among the labels. In the CAD domain the radiologist often gives a score (for example, 1 to 5 from lowest to highest) to indicate how likely she thinks it is malignant. This is different from a multi-class setting in which we do not have any preference among the multiple class labels. j Let yi ∈ {1, . . . , K} be the label assigned to the ith instance by the jth expert. Note that there is an ordering in the labels 1 < . . . < K. A simple approach is to convert the ordinal data into a series of binary data (Frank and Hall, 2001). Specifically the K class ordinal labels are transformed into K − 1 binary class labels as follows: jc yi = j 1 if yi > c 0 otherwise c = 1, . . . , K − 1. Applying the same procedure used for binary labels we can estimate Pr[yi > c] for c = 1, . . . , K − 1. The probability of the actual class values can then be obtained as Pr[yi = c] = Pr[yi > c − 1 and yi ≤ c] = Pr[yi > c − 1] − Pr[yi > c]. The class with the maximum probability is assigned to the instance. 5. Regression In this section we develop a similar algorithm to learn a regression function using annotations from multiple experts. In the CAD domain as a part of the annotation process a common task for a radiologist is to measure the diameter of a suspicious lesion. 1307 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 5.1 Model for Annotators j Let yi ∈ R be the continuous target value assigned to the ith instance by the jth annotator. Our model is that the annotator provides a noisy version of the actual true value yi . For the jth annotator we will assume a Gaussian noise model with mean yi (the true unknown value) and inverse-variance (precision) τ j , that is, j j Pr[yi |yi , τ j ] = N (yi |yi , 1/τ j ), (7) where the Gaussian distribution is defined as N (z|m, σ2 ) = (2πσ2 )−1/2 exp(−(z − m)2 /2σ2 ). The unknown inverse-variance τ j measures the accuracy of each annotator—the larger the value of τ j the more accurate the annotator. We have assumed that τ j does not depend on the instance xi . For example, in the CAD domain, this means that the radiologist’s accuracy does not depend on the nodule she is measuring. While this a practical assumption, it is not entirely true. It is known that some nodules are harder to measure than others. 5.2 Linear Regression Model for Features As before we consider the family of linear regression functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. We assume that the actual target response yi is given by the deterministic regression function fw with additive Gaussian noise, that is, yi = w⊤ xi + ε, where ε is a zero-mean Gaussian random variable with inverse-variance (precision) γ. Hence Pr[yi |xi , w, γ] = N (yi |w⊤ xi , 1/γ). (8) 5.3 Combined Model Combining both the annotator (7) and the regressor (8) model we have Pr[yi |xi , w, τ j , γ] = j Z Pr[yi |yi , τ j ]Pr[yi |xi , w, γ]dyi = N (yi |w⊤ xi , 1/γ + 1/τ j ). j j Since the two precision terms (γ and τ j ) are grouped together they are not uniquely identifiable. Hence we will define a new precision term λ j as 1 1 1 = + j. j λ γ τ So we have the following model Pr[yi |xi , w, λ j ] = N (yi |w⊤ xi , 1/λ j ). j j (9) 5.4 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R experts, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and the precision λ = [λ1 , . . . , λR ] of i i=1 i all the annotators. 1308 L EARNING F ROM C ROWDS 5.5 Maximum-likelihood Estimator Assuming the instances are independent the likelihood of the parameters θ = {w, λ} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 Conditional on the instance xi we assume that y1 , . . . , yR are independent, that is, the annotators i i provide their responses independently. Hence from (9) the likelihood can be written as N R Pr[D |θ] = ∏ ∏ N (yi |w⊤ xi , 1/λ j ). j i=1 j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood θML = {λ, w} = arg max{ln Pr[D |θ]}. θ By equating the gradient of the log-likelihood to zero we obtain the following update equations for the precision and the weight vector. 1 λj = 1 N ∑ yij − w⊤ xi N i=1 N w = ∑ −1 2 . (10) N ∑R λ j yi j=1 i=1 ∑R λ j j=1 ∑ xi xi x⊤ i i=1 j . (11) As the parameters w and λ are coupled together we iterate these two steps till convergence. 5.6 Discussions 1. Is this standard least-squares? Define the design matrix X = [x1 , . . . , xN ]⊤ and the rej j sponse vector for each annotator as y j = [y1 , . . . , yN ]⊤ . Using matrix notation Equation 11 can be written as w = (X ⊤ X)−1 X ⊤ y where y = ∑R λ j y j j=1 ∑R λ j j=1 . (12) Equation 12 is essentially the solution to a standard linear regression model, except that we are training a linear regression model with y as the ground truth, which is a precision weighted mean of the response vectors from all the annotators. The variance of each annotator is estimated using (10). The final algorithm iteratively establishes a particular gold standard (y), measures the performance of the annotators and learns a regressor given that gold standard, and refines the gold standard based on the performance measures. 2. Are we better than the best annotator? If we assume λ is fixed (i.e., we ignore the variability and assume that it is well estimated) then w is an unbiased estimator of w and the ˆ covariance matrix is given by Cov(w) = Cov(y) X ⊤ X ˆ 1309 −1 = 1 ∑R λ j j=1 X ⊤X −1 . R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Since ∑R λ j > max j (λ j ) the proposed method has a lower variance than the regressor learnt j=1 with the best annotator (i.e., the one with the minimum variance). 3. Are we better than the average? For a fixed X the error in w depends only on the variance of y j . If we know the true λ j then yi is the best linear unbiased estimator for yi which minij mizes the variance. To see this consider any linear estimator of the form yi = ∑ j a j (yi − b j ). The variance is given by Var[yi ] = ∑ j (a j )2 /λ j . Since E[yi ] = yi ∑ j a j , for the bias of this estimator to be zero we require that ∑ j a j = 1. Solving the constrained minimization problem we see that a j = λ j / ∑ j λ j minimizes the variance. 4. Obtaining a consensus without features When no features are available the same algorithm can be simplified to get a consensus estimate of the actual ground truth and also evaluate the annotators. Essentially we have to iterate the following two updates till convergence ∑R λ j yi j=1 1 ∑R λ j j=1 λj j yi = = 1 N ∑ y j − yi N i=1 i 2 . 6. Experimental Validation We now experimentally validate the proposed algorithms on both simulated and real data. 6.1 Classification Experiments We use two CAD and one text data set in our experiments. The CAD data sets include a digital mammography data set and a breast MRI data set, both of which are biopsy proven, that is, the gold standard is available. For the digital mammography data set we simulate the radiologists in order to validate our methods. The breast MRI data has annotations from four radiologists. We also report results on a Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk which has annotations from 164 annotators. 6.1.1 D IGITAL M AMMOGRAPHY WITH S IMULATED R ADIOLOGISTS Mammograms are used as a screening tool to detect early breast cancer. CAD systems search for abnormal areas (lesions) in a digitized mammographic image. These lesions generally indicate the presence of malignant cancer. The CAD system then highlights these areas on the images, alerting the radiologist to the need for a further diagnostic mammogram or a biopsy. In classification terms, given a set of descriptive morphological features for a region on a image, the task is to predict whether it is potentially malignant (1) or not (0). In order to train such a classifier, a set of mammograms is collected from hospitals. The ground truth (whether it is cancer or not) is obtained from biopsy. Since biopsy is an expensive, tedious, and an invasive process, very often CAD systems are built from labels collected from multiple expert radiologists who visually examine the mammograms and mark the lesion locations—this constitutes our ground truth (multiple labels) for learning. In this experiment we use a proprietary biopsy-proven data set (Krishnapuram et al., 2008) containing 497 positive and 1618 negative examples. Each instance is described by a set of 27 morphological features. In order to validate our proposed algorithm, we simulate multiple radiologists according to the two-coin model described in § 2.1. Based on the labels from multiple radiologists, 1310 L EARNING F ROM C ROWDS we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. For the first set of experiments we use 5 radiologists with sensitivity α = [0.90 0.80 0.57 0.60 0.55] and specificity β = [0.95 0.85 0.62 0.65 0.58]. This corresponds to a scenario where the first two radiologists are experts and the last three are novices. Figure 1 summarizes the results. We compare on three different aspects: (1) How good is the learnt classifier? (2) How well can we estimate the sensitivity and specificity of each radiologist? (3) How good is the estimated ground truth? The following observations can be made. 1. Classifier performance Figure 1(a) plots the ROC curve of the learnt classifier on the training set. The dotted (black) line is the ROC curve for the classifier learnt using the actual ground truth. The solid (red) line is the ROC curve for the proposed algorithm and the dashed (blue) line is for the classifier learnt using the majority-voting scheme. The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. The area under the ROC curve (AUC) for the proposed algorithm is around 3.5% greater than that learnt using the majority-voting scheme. 2. Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(b). The end of the solid red line shows the estimates of the sensitivity and specificity from the proposed method. We used a uniform prior on all the parameters. The ellipse plots the contour of one standard deviation as obtained from the beta posterior estimates. The end of the dashed blue line shows the estimate obtained from the majorityvoting algorithm. We see that the proposed method is much closer to the actual values of sensitivity and specificity. 3. Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. From Figure 1(b) we can see that the ROC curve for the proposed method dominates the majority voting ROC curve. Furthermore, the area under the ROC curve (AUC) is around 3% higher. The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). It does not have an idea of who is an expert and who is a novice. The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity. The improvement obtained is quite large in Figure 2 which corresponds a situation where we have only one expert and 7 novices. 4. Joint Estimation To learn a classifier, Smyth et al. (1995) proposed to first estimate the golden ground truth and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. Figure 3 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. 6.1.2 B REAST MRI In this example, each radiologist reviews the breast MRI data and assesses the malignancy of each lesion on a BIRADS scale of 1 to 5. The BIRADS scale is defined as follows: 1 Negative, 2 Benign, 1311 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Majority Voting Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 EM algorithm Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0217 0.5869 0.2391 0.1521 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1785 0.1071 0.2500 0.4642 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0000 0.6957 0.1304 0.1739 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1428 0.0000 0.3214 0.5357 Table 1: The confusion matrix for the estimate obtained using majority voting and the proposed EM algorithm. The x indicates that there was no such category in the true labels (the gold standard). The gold-standard is obtained by the biopsy which can confirm whether it is benign (BIRADS=2) or malignant (BIRADS=5). 3 Probably Benign, 4 Suspicious abnormality, and 5 Highly suggestive of malignancy. Our data set comprises of 75 lesions with annotations from four radiologists, and the true labels from biopsy. Based on eight morphological features, we have to predict whether a lesion is malignant or not. For the first experiment we reduce the BIRADS scale to a binary one: any lesion with a BIRADS > 3 is considered malignant and benign otherwise. The set included 28 malignant and 47 benign lesions. Figure 4 summarizes the results. We show the leave-one-out cross validated ROC for the classifier. The cross-validated AUC of the proposed method is approximately 6% better than the majority voting baseline. We also consider the BIRADS labels as a set of ordinal measurements since there is an ordering among the BIRADS label. The confusion matrix in Table 1 shows that the EM algorithm is significantly superior than the majority voting in estimating the true BIRADS. 6.1.3 R ECOGNIZING T EXTUAL E NTAILMENT Finally we report results on Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk. In this task, the annotator is presented with two sentences and given a choice of whether the second sentence can be inferred from the first. The data has 800 tasks and 164 distinct readers, with 10 annotations per task along with the golden ground truth. The majority of the entries (94 %) in the 800x164 matrix are missing. There is one annotator who has labeled all the tasks. We use this data set to obtain an estimate of the actual ground truth. Figure 5 plots the accuracy of the estimated ground truth as a function of the number of annotators. The proposed EM algorithm achieves a higher accuracy than majority voting. In other words to achieve a desired accuracy the proposed algorithm needs fewer annotators than the majority voting scheme. 1312 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (b) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologists is marked as a ×. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation. 1313 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.906 Majority voting baseline AUC=0.884 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm AUC=0.967 Majority voting baseline AUC=0.872 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 2: Same as Figure 1 except with 8 different radiologist annotations. 1314 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 3: ROC curves comparing the proposed algorithm (solid red line) with the Decoupled Estimation procedure (dotted blue line), which refers to the algorithm where the ground truth is first estimated using just the labels from the five radiologists and then a logistic regression classifier is trained using the soft probabilistic labels. In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm. 1315 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Leave−One−Out ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 0.9 True Positive Rate (sensitivity) 0.8 0.7 0.6 0.5 0.4 Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 4: Breast MRI results. (a) The leave-one-out cross validated ROC. (b) ROC for the estimated ground truth. 1316 L EARNING F ROM C ROWDS 0.95 0.9 0.85 Accuracy 0.8 0.75 0.7 0.65 Majority Voting EM Algorithm 0.6 0.55 0.5 20 40 60 80 100 120 Number of Annotators 140 160 Figure 5: The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. The plot was generated by randomly sampling the annotators 100 times. 6.2 Regression Experiments We first illustrate the algorithm on a toy dataset and then present a case study for automated polyp measurements. 6.2.1 I LLUSTRATION Figure 6 illustrates the the proposed algorithm for regression on a one-dimensional toy data set with three annotators. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 20 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 6.2.2 AUTOMATED P OLYP M EASUREMENTS Colorectal polyps are small colonic findings that may develop into cancer at a later stage. The diameter of the polyp is one of the key factors which decides the malignancy of a suspicious polyp. Hence accurate size estimation is crucial to decide the action to be taken on a polyp. We have developed various algorithms to segment a polyp. Multiple segmentation algorithms give rise to a set of features which are correlated with the diameter of the polyp. We want to learn a regression function which can predict the diameter of a polyp as a function of these features. In order to learn a regression function we collect our ground truth by asking many radiologists to manually measure the the diameter of the polyps from the three-dimensional images. In practice there is a lot of disagreement among the radiologists as to the actual size of the polyp. 1317 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2 1 N=50 examples R=3 annotators 2 2 5 Actual regression model Model with annotator 1 (Precision=0.01) 2 1 Model with annotator 2 (Precision=0.10) 2 Model with annotator 3 (Precision=1.00) 3 2 3 2 Model with average response 2 Proposed algorithm 2 3 1 1 4 3 2 2 y (response) 2 2 1 3 2 3 3 2 3 1 −1 3 3 2 2 3 3 1 2 2 2 3 33 3 1 1 3 3 3 3 2 3 3 2 33 33 2 2 2 3 3 3 32 1 2 3 2 1 3 3 2 1 2 1 2 2 1 1 2 3 2 3 2 3 1 3 3 −3 2 3 3 2 3 3 3 2 3 3 2 1 3 3 2 0 −2 2 2 3 2 3 2 3 1 2 2 1 2 3 −4 1 2 2 1 −5 0 2 0.1 0.2 1 0.3 0.4 0.5 0.6 10.8 2 0.7 x (feature) 0.9 1 1 Figure 6: Illustration of the proposed algorithm on a one-dimensional toy data set. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 50 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 5 5 10 Actual Polyp Diameter (mm) (a) Gold standard model 15 10 5 0 0 Pearson Correlation Coefficient=0.554966 RMSE=2.887740 15 Estimated Polyp Diameter (mm) 10 0 0 Pearson Correlation Coefficient=0.706558 RMSE=1.991576 15 Estimated Polyp Diameter (mm) Estimated Polyp Diameter (mm) Pearson Correlation Coefficient=0.714720 RMSE=1.970815 15 5 10 Actual Polyp Diameter (mm) (b) Proposed model 15 10 5 0 0 5 10 Actual Polyp Diameter (mm) 15 (c) Average model Figure 7: Scatter plot of the actual polyp diameter vs the diameter predicted by the models learnt using (a) the actual gold standard, (b) the proposed algorithm with annotations from five radiologists, and (c) the average of the radiologist’s annotations. (See § 6.2.2 for a description of the experimental setup.) 1318 L EARNING F ROM C ROWDS We use a proprietary data set containing 393 examples (which point to 285 distinct polyps— the segmentation algorithms generally return multiple marks on the same polyp.) along with the measured diameter (ranging from 2mm to 15mm) as our training set. Each example is described by a set of 60 morphological features which are correlated to the diameter of the polyp. In order to validate the feasibility of our proposed algorithm, we simulate five radiologists according to the noisy model described in § 5.1 with τ = [0.001 0.01 0.1 1 10]. This corresponds to a situation where the first three radiologists are extremely noisy and the last two are quite accurate. Based on the measurements from multiple radiologists, we can simultaneously (1) learn a linear regressor and (2) estimate the precision of each radiologist. We compare the results with the classifier trained using the actual golden ground truth as well as the regressor learnt using the average of the radiologists measurements. The results are validated on an independent test set containing 397 examples (which point to 298 distinct polyps). Figure 7 shows the scatter plot of the actual polyp diameter vs the diameter predicted by the three different models. We compare the performance based on the root mean squared error (RMSE) and also the Pearson’s correlation coefficient. The regressor learnt using the proposed iterative algorithm (Figure 7(b)) is almost as good as the one learnt using the golden ground truth (Figure 7(a)). The correlation coefficient for the proposed algorithm is significantly larger than that learnt using the average of the radiologists response. The estimate obtained by averaging is closer to the novices since they form a majority (3/5). The proposed algorithm appropriately weights each radiologist based on their estimated precisions. 7. Conclusions and Future Work In this paper we proposed a probabilistic framework for supervised learning with multiple annotators providing labels but no absolute gold standard. The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. We specifically discussed binary/categorical/ordinal classification and regression problems. We made two key assumptions: (1) the performance of each annotator does not depend on the feature vector for a given instance and (2) conditional on the truth the experts are independent, that is, they make their errors independently. As we pointed out earlier these assumptions are not true in practice. The annotator performance depends on the instance he is labeling and there is some degree of correlation among the annotators. We briefly discuss some strategies to relax these two assumptions. 7.1 Instance Difficulty One drawback of the current model is that it doesn’t estimate difficulty of items. It is often observed that for the easy instances all the annotators agree on the labels—thus violating our conditional independence assumption. The difficulty of annotating an item can be captured by another latent variable γi for each instance—which modulates the annotators performance. Models for this have been developed in the area of item-response theory (Baker and Kim, 2004) and also in epidemiology (Uebersax and Grove, 1993)—see also Whitehill et al. (2009) for a recent paper in the machine learning community. While these models do not take into account the available features our pro1319 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY posed model for sensitivity and specificity can be extended as follows (in place of (1) and (2)): α j (γi ) := Pr[yi = 1|yi = 1, γi ] = σ(a j1 + b j1 γi ). j β j (γi ) := Pr[yi = 0|yi = 0, γi ] = σ(a j0 + b j0 γi ). j Here the parameters a j1 and a j0 are related to the sensitivity and specificity of the jth annotator, while the latent term γi captures the difficulty of the instance. The key assumption here is that the annotators are independent conditional on both yi and γi . Various assumptions can be made on two parameters b j1 and b j0 to simplify these models further—for example we could set b j1 = b1 and b j0 = b0 for all the annotators. 7.2 Annotators Actually Look at the Data In our model we made the assumption that the sensitivity α j and the specificity β j of the jth annotator does not depend on the feature vector xi . For example, in the CAD domain, this meant that the radiologist’s performance is consistent across different sub-groups of data—which is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. We can extend the previous model such that the sensitivity and the specificity depends on the feature vector xi explicitly as follows j⊤ α j (γi , xi ) := Pr[yi = 1|yi = 1, γi , xi ] = σ(a j1 + b j1 γi + wα xi ). j j⊤ α j (γi , xi ) := Pr[yi = 0|yi = 0, γi , xi ] = σ(a j0 + b j0 γi + wβ xi ). j However this change increases the number of parameters to be learned. References P. S. Albert and L. E. Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60:427–435, 2004. F. B. Baker and S. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, 2 edition, 2004. B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. S. R. Cholleti, S. A. Goldman, A. Blum, D. G. Politte, and S. Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 2008 20th IEEE international Conference on Tools with Artificial intelligence, 2008. A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. O. Dekel and O. Shamir. Vox Populi: Collecting high-quality labels from a crowd. In COLT 2009: Proceedings of the 22nd Annual Conference on Learning Theory, 2009a. O. Dekel and O. Shamir. Good learners for evil teachers. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 233–240, 2009b. 1320 L EARNING F ROM C ROWDS A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977. P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 259–268, 2009. E. Frank and M. Hall. A simple approach to ordinal classification. Lecture Notes in Computer Science, pages 145–156, 2001. G. Fung, B. Krishnapuram, J. Bi, M. Dundar, V. C. Raykar, S. Yu, R. Rosales, S. Krishnan, and R. B. Rao. Mining medical images. In Fifteenth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining: Third Workshop on Data Mining Case Studies and Practice Prize, 2009. J. Howe. Crowd sourcing: Why the Power of the Crowd Is Driving the Future of Business. 2008. S. L. Hui and S. D. Walter. Estimating the error rates of diagnostic tests. Biometrics, 36:167–171, 1980. S. L. Hui and X. H. Zhou. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research, 7:354–370, 1998. R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems 15, pages 897–904. 2003. B. Krishnapuram, J. Stoeckel, V. C. Raykar, R. B. Rao, P. Bamberger, E. Ratner, N. Merlet, I. Stainvas, M. Abramov, and A. Manevitch. Multiple-instance learning improves CAD detection of masses in digital mammography. In IWDM 2008: Proceedings of the 9th international workshop on Digital Mammography, pages 350–357. 2008. G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79–87, 1992. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 889–896, 2009. V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. P. Smyth. Learning with probabilistic supervision. In Computational Learning Theory and Natural Learning Systems 3, pages 163–182. MIT Press, 1995. 1321 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085– 1092. 1995. R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast - But is it good? Evaluating nonexpert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008. A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. J. S. Uebersax and W. M. Grove. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 49:823–835, 1993. S. K. Warfield, K. H. Zou, and W. M. Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004. J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. 1322
2 0.42167541 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
3 0.38256922 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.3783682 35 jmlr-2010-Error-Correcting Output Codes Library
Author: Sergio Escalera, Oriol Pujol, Petia Radeva
Abstract: In this paper, we present an open source Error-Correcting Output Codes (ECOC) library. The ECOC framework is a powerful tool to deal with multi-class categorization problems. This library contains both state-of-the-art coding (one-versus-one, one-versus-all, dense random, sparse random, DECOC, forest-ECOC, and ECOC-ONE) and decoding designs (hamming, euclidean, inverse hamming, laplacian, β-density, attenuated, loss-based, probabilistic kernel-based, and lossweighted) with the parameters defined by the authors, as well as the option to include your own coding, decoding, and base classifier. Keywords: error-correcting output codes, multi-class classification, coding, decoding, open source, matlab, octave 1. Error-Correcting Output Codes The Error-Correcting Output Codes (ECOC) framework (Dietterich and Bakiri, 1995) is a simple but powerful framework to deal with the multi-class categorization problem based on the embedding of binary classifiers. Given a set of Nc classes, the basis of the ECOC framework consists of designing a codeword for each of the classes. These codewords encode the membership information of each class for a given binary problem. Arranging the codewords as rows of a matrix, we obtain a ”coding matrix” Mc , where Mc ∈ {−1, 0, 1}Nc ×n , being n the length of the codewords codifying each class. From the point of view of learning, Mc is constructed by considering n binary problems, each one corresponding to a column of the matrix Mc . Each of these binary problems (or dichotomizers) splits the set of classes in two partitions (coded by +1 or -1 in Mc according to their class set membership, or 0 if the class is not considered by the current binary problem). Then, at the decoding step, applying the n trained binary classifiers, a code is obtained for each data point in the test set. This code is compared to the base codewords of each class defined in the matrix Mc , and the data point is assigned to the class with the ”closest” codeword. Several decoding strategies have been proposed in literature. The reader is referred to Escalera et al. (2008) for a more detailed review. An example of an ECOC design is described in Fig. 1. The ECOC designs are independent of the base classifier applied. They involve error-correcting properties (Dietterich and Bakiri, 1995) and have shown to be able to reduce the bias and variance produced by the learning algorithm (Kong and Dietterich, 1995). Because of these reasons, ECOCs have been widely used to deal with multi-class categorization problems. c 2010 Sergio Escalera, Oriol Pujol and Petia Radeva. E SCALERA , P UJOL AND R ADEVA ECOC coding design for a 4-class problem. White, black, and grey positions corresponds to the symbols +1, -1, and 0, respectively. Once the four binary problems are learnt, at the decoding step a new test sample X is tested by the n classifiers. Then, the new codeword x = {x1 , .., xn } is compared with the class codewords {C1 , ..C4 }, classifying the new sample by the class ci which codeword minimizes the decoding measure. Figure 1: ECOC design example. 2. Library Algorithms The ECOCs library is a Matlab/Octave code under the open source GPL license (gpl) with the implementation of the state-of-the-art coding and decoding ECOC designs. A main function defines the multi-class data, coding, decoding, and base classifier. A list of parameters are also included in order to tune the different strategies. In addition to the implemented coding and decoding designs, which are described in the following section, the user can include his own coding, decoding, and base classifier as defined in the user guide. 2.1 Implemented Coding Designs The ECOC designs of the ECOC library cover the state-of-the-art of coding strategies, mainly divided in two main groups: problem-independent approaches, which do not take into account the distribution of the data to define the coding matrix, and the problem-dependent designs, where information of the particular domain is used to guide the coding design. 2.1.1 P ROBLEM - INDEPENDENT ECOC D ESIGNS • One-versus-all (Rifkin and Klautau, 2004): Nc dichotomizers are learnt for Nc classes, where each one splits one class from the rest of classes. • One-versus-one (Nilsson, 1965): n = Nc (Nc − 1)/2 dichotomizers are learnt for Nc classes, splitting each possible pair of classes. • Dense Random (Allwein et al., 2002): n = 10 · log Nc dichotomizers are suggested to be learnt for Nc classes, where P(−1) = 1 − P(+1), being P(−1) and P(+1) the probability of the symbols -1 and +1 to appear, respectively. Then, from a set of defined random matrices, the one which maximizes a decoding measure among all possible rows of Mc is selected. • Sparse Random (Escalera et al., 2009): n = 15 · log Nc dichotomizers are suggested to be learnt for Nc classes, where P(0) = 1 − P(−1) − P(+1), defining a set of random matrices Mc and selecting the one which maximizes a decoding measure among all possible rows of Mc . 662 E RROR -C ORRECTING O UPUT C ODES L IBRARY 2.1.2 P ROBLEM - DEPENDENT ECOC D ESIGNS • DECOC (Pujol et al., 2006): problem-dependent design that uses n = Nc − 1 dichotomizers. The partitions of the problem are learnt by means of a binary tree structure using exhaustive search or a SFFS criterion. Finally, each internal node of the tree is embedded as a column in Mc . • Forest-ECOC (Escalera et al., 2007): problem-dependent design that uses n = (Nc − 1) · T dichotomizers, where T stands for the number of binary tree structures to be embedded. This approach extends the variability of the classifiers of the DECOC design by including extra dichotomizers. • ECOC-ONE (Pujol et al., 2008): problem-dependent design that uses n = 2 · Nc suggested dichotomizers. A validation sub-set is used to extend any initial matrix Mc and to increase its generalization by including new dichotomizers that focus on difficult to split classes. 2.2 Implemented Decoding Designs The software comes with a complete set of ECOC decoding strategies. The notation used refers to that used in (Escalera et al., 2008): j • Hamming decoding: HD(x, yi ) = ∑n (1 − sign(x j · yi ))/2, being x a test codeword and yi a j=1 codeword from Mc corresponding to class Ci . • Inverse Hamming decoding: IHD(x, yi ) = max(∆−1 DT ), where ∆(i1 , i2 ) = HD(yi1 , yi2 ), and D is the vector of Hamming decoding values of the test codeword x for each of the base codewords yi . • Euclidean decoding: ED(x, yi ) = j ∑n (x j − yi )2 . j=1 • Attenuated Euclidean decoding: AED(x, yi ) = j j ∑n | yi || x j | (x j − yi )2 . j=1 • Loss-based decoding: LB(ρ, yi ) = ∑n L(yi · f j (ρ)), where ρ is a test sample, L is a lossj=1 function, and f is a real-valued function f : R n → R . j • Probabilistic-based decoding: PD(yi , x)=−log ∏ j∈[1,..,n]:Mc (i, j)=0 P(x j = Mc (i, j)| f j ) + K , where K is a constant factor that collects the probability mass dispersed on the invalid codes, and the probability P(x j = Mc (i, j)| f j ) j 1 , where vectors υ and ω are obtained by is estimated by means of P(x j = yi | f j ) = j j j j 1+eyi (υ f +ω ) solving an optimization problem (Passerini et al., 2004). αi +1 • Laplacian decoding: LAP(x, yi ) = αi +βi +K , where αi is the number of matched positions between x and yi , βi is the number of miss-matches without considering the positions coded by 0, and K is an integer value that codifies the number of classes considered by the classifier. νi 1 • Pessimistic β-Density Distribution decoding: accuracy si : νi −si ψi (ν, αi , βi )dν = 3 , where 1 ψi (ν, αi , βi ) = K ναi (1 − ν)βi , ψi is the β-Density Distribution between a codeword x and a class codeword yi for class ci , and ν ∈ R : [0, 1]. R j • Loss-Weighted decoding: LW (ρ, i) = ∑n MW (i, j)L(yi · f (ρ, j)), where MW (i, j) = ∑nH(i, j) j) , j=1 H(i, j=1 j 1, if x j = yi , 1 H(i, j) = mi ∑mi ϕ(h j (ρi ), i, j), ϕ(x j , i, j) = , mi is the number of training k k=1 0, otherwise. samples from class Ci , and ρi is the kth sample from class Ci . k 663 E SCALERA , P UJOL AND R ADEVA 3. Implementation Details The ECOCs Library comes with detailed documentation. A user guide describes the usage of the software. All the strategies and parameters used in the functions and files are described in detail. The user guide also presents examples of variable setting and execution, including a demo file. About the computational complexity, the training and testing time depends on the data size, coding and decoding algorithms, as well as the base classifier used in the ECOC design. Acknowledgments This work has been supported in part by projects TIN2009-14404-C02 and CONSOLIDER-INGENIO CSD 2007-00018. References URL http://www.gnu.org/licences/. E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2002. T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–282, 1995. S. Escalera, Oriol Pujol, and Petia Radeva. Boosted landmarks of contextual descriptors and ForestECOC: A novel framework to detect and classify objects in clutter scenes. Pattern Recognition Letters, 28(13):1759–1768, 2007. S. Escalera, O. Pujol, and P. Radeva. On the decoding process in ternary error-correcting output codes. IEEE Transactions in Pattern Analysis and Machine Intelligence, 99, 2008. S. Escalera, O. Pujol, and P. Radeva. Separability of ternary codes for sparse designs of errorcorrecting output codes. Pattern Recognition Letters, 30:285–297, 2009. E. B. Kong and T. G. Dietterich. Error-correcting output coding corrects bias and variance. International Conference of Machine Learning, pages 313–321, 1995. N. J. Nilsson. Learning Machines. McGraw-Hill, 1965. A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks, 15(1):45–54, 2004. O. Pujol, P. Radeva, , and J. Vitri` . Discriminant ECOC: A heuristic method for application depena dent design of error correcting output codes. IEEE Transactions in Pattern Analysis and Machine Intelligence, 28:1001–1007, 2006. O. Pujol, S. Escalera, and P. Radeva. An incremental node embedding technique for error-correcting output codes. Pattern Recognition, 4:713–725, 2008. R. Rifkin and A. Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101–141, 2004. 664
5 0.3632299 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
Author: Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan
Abstract: Many structured information extraction tasks employ collective graphical models that capture interinstance associativity by coupling them with various clique potentials. We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials. We present three families of symmetric potentials—MAX, SUM, and MAJORITY . We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials. Our first message computation algorithm, called α-pass, is sub-quadratic in the clique size, outputs exact messages for 13 MAX , and computes 15 -approximate messages for Potts, a popular member of the SUM family. Empirically, it is upto two orders of magnitude faster than existing algorithms based on graph-cuts or belief propagation. Our second algorithm, based on Lagrangian relaxation, operates on MAJORITY potentials and provides close to exact solutions while being two orders of magnitude faster. We show that the cluster message passing framework is more principled, accurate and converges faster than competing approaches. We extend our collective inference framework to exploit associativity of more general intradomain properties of instance labelings, which opens up interesting applications in domain adaptation. Our approach leads to significant error reduction on unseen domains without incurring any overhead of model retraining. Keywords: passing graphical models, collective inference, clique potentials, cluster graphs, message
6 0.34394976 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
7 0.33780459 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
8 0.32978007 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
9 0.32889104 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
10 0.32185355 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
11 0.28949523 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
12 0.27892604 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
13 0.22738351 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
14 0.22281812 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
15 0.20776057 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
16 0.20664315 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.20634803 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
18 0.20313406 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
19 0.20114891 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
20 0.19860834 80 jmlr-2010-On-Line Sequential Bin Packing
topicId topicWeight
[(4, 0.02), (21, 0.017), (32, 0.053), (36, 0.049), (37, 0.052), (43, 0.475), (75, 0.109), (81, 0.031), (85, 0.072), (96, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.68199533 61 jmlr-2010-Learning From Crowds
Author: Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy
Abstract: For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline. Keywords: multiple annotators, multiple experts, multiple teachers, crowdsourcing 1. Supervised Learning From Multiple Annotators/Experts A typical supervised learning scenario consists of a training set D = {(xi , yi )}N containing N i=1 instances, where xi ∈ X is an instance (typically a d-dimensional feature vector) and yi ∈ Y is the corresponding known label. The task is to learn a function f : X → Y which generalizes well on unseen data. Specifically for binary classification the supervision is from the set Y = {0, 1}, for multi-class classification Y = {1, . . . , K}, for ordinal regression Y = {1, . . . , K} (with an ordering 1 < . . . < K), and Y = R for regression. c 2010 Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo H. Valadez, Charles Florin, Luca Bogoni and Linda Moy. R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY However, for many real life tasks, it may not be possible, or may be too expensive (or tedious) to acquire the actual label yi for training—which we refer to as the gold standard or the objective ground truth. Instead, we may have multiple (possibly noisy) labels y1 , . . . , yR provided by R i i different experts or annotators. In practice, there is a substantial amount of disagreement among the experts, and hence it is of great practical interest to address conventional supervised learning algorithms in this scenario. Our motivation for this work comes from the area of computer-aided diagnosis1 (CAD), where the task is to build a classifier to predict whether a suspicious region on a medical image (like a X-ray, CT scan, or MRI) is malignant (cancerous) or benign. In order to train such a classifier, a set of images is collected from hospitals. The actual gold standard (whether it is cancer or not) can only be obtained from a biopsy of the tissue. Since it is an expensive, invasive, and potentially dangerous process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard.2 The radiologist also annotates various descriptors of the potentially malignant lesion, like the size (a regression problem), shape (a multiclass classification problem), and also degree of malignancy (an ordinal regression problem). The radiologists come from a diverse pool including luminaries, experts, residents, and novices. Very often there is lot of disagreement among the annotations. For a lot of tasks the labels provided by the annotators are inherently subjective and there will be substantial variation among different annotators. The domain of text classification offers such a scenario. In this context the task is to predict the category for a token of text. The labels for training are assigned by human annotators who read the text and attribute their subjective category. With the advent of crowdsourcing (Howe, 2008) services like Amazon’s Mechanical Turk,3 Games with a Purpose,4 and reCAPTCHA5 it is quite inexpensive to acquire labels from a large number of annotators (possibly thousands) in a short time (Sheng et al., 2008; Snow et al., 2008; Sorokin and Forsyth, 2008). Websites such as Galaxy Zoo6 allow the public to label astronomical images over the internet. In situations like these, the performance of different annotators can vary widely (some may even be malicious), and without the actual gold standard, it may not be possible to evaluate the annotators. In this work, we provide principled probabilistic solutions to the following questions: 1. How to adapt conventional supervised learning algorithms when we have multiple annotators providing subjective labels but no objective gold standard? 2. How to evaluate systems when we do not have absolute gold-standard? 3. A closely related problem—particularly relevant when there are a large number of annotators— is to estimate how reliable/trustworthy is each annotator. 1. See Fung et al. (2009) for an overview of the data mining issues in this area. 2. Sometimes even a biopsy cannot confirm whether it is cancer or not and hence all we can hope to get is subjective ground truth. 3. Mechanical Turk found at https://www.mturk.com. 4. Games with a Purpose found at http://www.gwap.com. 5. reCAPTCHA found at http://recaptcha.net/. 6. Galaxy Zoo found at http://galaxyzoo.org. 1298 L EARNING F ROM C ROWDS 1.1 The Problem With Majority Voting When we have multiple labels a commonly used strategy is to use the labels on which the majority of them agree (or average for regression problem) as an estimate of the actual gold standard. For binary classification problems this amounts to using the majority label,7 that is, j yi = ˆ 1 if (1/R) ∑R yi > 0.5 j=1 , j 0 if (1/R) ∑R yi < 0.5 j=1 as an estimate of the hidden true label and use this estimate to learn and evaluate classifiers/annotators. Another strategy is that of considering every pair (instance, label) provided by each expert as a separate example. Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, that is, R Pr[yi = 1|y1 , . . . , yR ] = (1/R) ∑ yi . i i j j=1 Majority voting assumes all experts are equally good. However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. One could address this problem by introducing a weight capturing how good each expert is. But how would one measure the performance of an expert when there is no gold standard available? 1.2 Proposed Approach and Organization To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator that jointly learns the classifier/regressor, the annotator accuracy, and the actual true label. For ease of exposition we start with binary classification problem in § 2. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard (§ 2.1). The proposed algorithm automatically discovers the best experts and assigns a higher weight to them. In order to incorporate prior knowledge about each annotator, we impose a beta prior on the sensitivity and specificity and derive the maximum-a-posteriori estimate (§ 2.6). The final estimation is performed by an Expectation Maximization (EM) algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. While the proposed approach is described using logistic regression as the base classifier (§ 2.2), it is quite general, and can be used with any black-box classifier (§ 2.7), and can also handle missing labels (that is, each expert is not required to label all the instances). Furthermore, we extend the proposed algorithm to handle categorical (§ 3), ordinal (§ 4), and regression problems (§ 5). In § 6 section we extensively validate our approach using both simulated data and real data from different domains. 1.3 Related Work and Novel Contributions We first summarize the novel contributions of this work in context of other related work in this emerging new area. There has been a long line of work in the biostatistics and epidemiology literature on latent variable models where the task is to get an estimate of the observer error rates based 7. When there is no clear majority among the multiple experts (that is, yi = 0.5) in CAD domain the final decision is ˆ often made by an adjudicator or a super-expert. When there is no adjudicator a fair coin toss is used. 1299 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY on the results from multiple diagnostic tests without a gold standard (see Dawid and Skene, 1979, Hui and Walter, 1980, Hui and Zhou, 1998, Albert and Dodd, 2004 and references therein). In the machine learning community Smyth et al. (1995) first addressed the same problem in the context of labeling volcanoes in satellite images of Venus. We differ from this previous body of work in the following aspects: 1. Unlike Dawid and Skene (1979) and Smyth et al. (1995) which just focused on estimating the ground truth from multiple noisy labels, we specifically address the issue of learning a classifier. Estimating the ground truth and the annotator/classifer performance is a byproduct of our proposed algorithm. 2. In order to learn a classifier Smyth (1995) proposed to first estimate the ground truth (without using the features) and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly. Our experiments (§ 6.1.1) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. 3. Our solution is more general and can be easily extended to categorical(§ 3), ordinal(§ 4), and continuous data(§ 5). It can also be used in conjunction with any supervised learning algorithm. A preliminary version of this paper (Raykar et al., 2009) mainly discussed the binary classification problem. 4. Our proposed algorithm is also Bayesian—we impose a prior on the experts. The priors can potential capture the skill of different annotators. In this paper we refrain from doing a full Bayesian inference and use the mode of the posterior as a point estimate. A recent complete Bayesian generalization of these kind of models has been developed by Carpenter (2008). 5. The EM approach used in this paper is similar to that proposed by Jin and Ghahramani (2003). However their motivation is somewhat different. In their setting, each training example is annotated with a set of possible labels, only one of which is correct. There has been recent interest in the natural language processing (Sheng et al., 2008; Snow et al., 2008) and computer vision (Sorokin and Forsyth, 2008) communities where they use Amazon’s Mechanical Turk to collect annotations from many people. They show that it can be potentially as good as that provided by an expert. Sheng et al. (2008) analyzed when it is worthwhile to acquire new labels for some of the training examples. There is also some theoretical work (see Lugosi, 1992 and Dekel and Shamir, 2009a) dealing with multiple experts. Recently Dekel and Shamir (2009b) presented an algorithm which does not resort to repeated labeling, that is, each example does not have to be labeled by multiple teachers. Donmez et al. (2009) address the issue of active learning in this scenario—How to jointly learn the accuracy of labeling sources and obtain the most informative labels for the active learning task? There has also been some work in the medical imaging community (Warfield et al., 2004; Cholleti et al., 2008). 2. Binary Classification We first describe our proposed noise model for the annotators. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the unknown gold standard. 1300 L EARNING F ROM C ROWDS 2.1 A Two-coin Model for Annotators Let y j ∈ {0, 1} be the label assigned to the instance x by the jth annotator/expert. Let y be the actual (unobserved) label for this instance. Each annotator provides a version of this hidden true label based on two biased coins. If the true label is one, she flips a coin with bias α j (sensitivity). If the true label is zero, she flips a coin with bias β j (specificity). In each case, if she gets heads she keeps the original label, otherwise she flips the label. If the true label is one, the sensitivity (true positive rate) for the jth annotator is defined as the probability that she labels it as one. α j := Pr[y j = 1|y = 1]. (1) On the other hand, if the true label is zero, the specificity (1−false positive rate) is defined as the probability that she labels it as zero. β j := Pr[y j = 0|y = 0]. (2) The assumption introduced is that α j and β j do not depend on the instance x. For example, in the CAD domain, this means that the radiologist’s performance is consistent across different sub-groups of data.8 2.2 Classification Model While the proposed method can be used for any classifier, for ease of exposition, we consider the family of linear discriminating functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. The final classifier can be written in the following form: y = 1 if w⊤ x ≥ γ and 0 otherwise. The ˆ threshold γ determines the operating point of the classifier. The Receiver Operating Characteristic (ROC) curve is obtained as γ is swept from −∞ to ∞. The probability for the positive class is modeled as a logistic sigmoid acting on fw, that is, Pr[y = 1|x, w] = σ(w⊤ x), where the logistic sigmoid function is defined as σ(z) = 1/(1 + e−z ). This classification model is known as logistic regression. 2.3 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R annotators, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and also the sensitivity α = i i i=1 [α1 , . . . , αR ] and the specificity β = [β1 , . . . , βR ] of the R annotators. It is also of interest to get an estimate of the unknown gold standard y1 , . . . , yN . 2.4 Maximum Likelihood Estimator Assuming the training instances are independently sampled, the likelihood function of the parameters θ = {w, α, β} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 8. While this is a reasonable assumption, it is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. 1301 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY j Conditioning on the true label yi , and also using the assumption yi is conditionally independent (of everything else) given α j , β j and yi , the likelihood can be decomposed as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |yi = 1, α]Pr[yi = 1|xi, w] i i i=1 + Pr[y1 , . . . , yR |yi = 0, β]Pr[yi = 0|xi, w] . i i Given the true label yi , we assume that y1 , . . . , yR are independent, that is, the annotators make their i i decisions independently.9 Hence, R R Pr[y1 , . . . , yR |yi = 1, α] = ∏ Pr[yi |yi = 1, α j ] = ∏ [α j ]yi [1 − α j ]1−yi . i i j j=1 j j j=1 Similarly, we have R Pr[y1 , . . . , yR |yi = 0, β] = ∏ [β j ]1−yi [1 − β j ]yi . i i j j j=1 Hence the likelihood can be written as N Pr[D |θ] = ∏ ai pi + bi (1 − pi ) , i=1 where we have defined pi := σ(w⊤ xi ). R ai := ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi := ∏[β j ]1−y [1 − β j ]y . j i j i j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood, that is, ˆ θML = {α, β, w} = arg max{ln Pr[D |θ]}. ˆ ˆ ˆ θ 2.5 The EM Algorithm This maximization problem can be simplified a lot if we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an efficient iterative procedure to compute the maximum-likelihood solution in presence of missing/hidden data. We will use the unknown hidden true label yi as the missing data. If we know the missing data y = [y1 , . . . , yN ] then the complete likelihood can be written as N ln Pr[D , y|θ] = ∑ yi ln pi ai + (1 − yi ) ln(1 − pi )bi . i=1 9. This assumption is not true in general and there is some correlations among the labels assigned by multiple annotators. For example in the CAD domain if the cancer is in advanced stage (which is very easy to detect) almost all the radiologists assign the same label. 1302 L EARNING F ROM C ROWDS Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)step. The M-step involves maximization of a lower bound on the log-likelihood that is refined in each iteration by the E-step. 1. E-step. Given the observation D and the current estimate of the model parameters θ, the conditional expectation (which is a lower bound on the true likelihood) is computed as N E {ln Pr[D , y|θ]} = ∑ µi ln pi ai + (1 − µi ) ln(1 − pi )bi , (3) i=1 where the expectation is with respect to Pr[y|D , θ], and µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. Using i i Bayes’ theorem we can compute µi ∝ Pr[y1 , . . . , yR |yi = 1, θ] · Pr[yi = 1|xi, θ] i i ai pi = . ai pi + bi (1 − pi ) 2. M-step. Based on the current estimate µi and the observations D , the model parameters θ are then estimated by maximizing the conditional expectation. By equating the gradient of (3) to zero we obtain the following estimates for the sensitivity and specificity: j j αj = ∑N µi yi i=1 , ∑N µi i=1 βj = ∑N (1 − µi )(1 − yi ) i=1 . ∑N (1 − µi ) i=1 Due to the non-linearity of the sigmoid, we do not have a closed form solution for w and we have to use gradient ascent based optimization methods. We use the Newton-Raphson update given by wt+1 = wt − ηH −1 g, where g is the gradient vector, H is the Hessian matrix, and η is the step length. The gradient vector is given by N g(w) = ∑ µi − σ(w⊤ xi ) xi . i=1 The Hessian matrix is given by N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ . i i=1 Essentially, we are estimating a logistic regression model with probabilistic labels µi . These two steps (the E- and the M-step) can be iterated till convergence. The log-likelihood increases monotonically after every iteration, which in practice implies convergence to a local maximum. The EM algorithm is only guaranteed to converge to a local maximum. In practice multiple restarts with different initializations can potentially mitigate the local maximum problem. In this j paper we use majority voting µi = 1/R ∑R yi as the initialization for µi to start the EM-algorithm. j=1 1303 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2.6 A Bayesian Approach In some applications we may want to trust a particular expert more than the others. One way to achieve this is by imposing priors on the sensitivity and specificity of the experts. Since α j and β j represent the probability of a binary event, a natural choice of prior is the beta prior. The beta prior is also conjugate to the binomial distribution. For any a > 0, b > 0, and δ ∈ [0, 1] the beta distribution is given by δa−1 (1 − δ)b−1 , Beta(δ|a, b) = B(a, b) where B(a, b) = 01 δa−1 (1 − δ)b−1 dδ is the beta function.We assume a beta prior10 for both the sensitivity and the specificity as R j j j j j j j j Pr[α j |a1 , a2 ] = Beta(α j |a1 , a2 ). Pr[β j |b1 , b2 ] = Beta(β j |b1 , b2 ). For sake of completeness we also assume a zero mean Gaussian prior on the weights w with inverse covariance matrix Γ, that is, Pr[w] = N (w|0, Γ−1 ). Assuming that {α j }, {β j }, and w have independent priors, the maximum-a-posteriori (MAP) estimator is found by maximizing the logposterior, that is, ˆ θMAP = arg max{ln Pr[D |θ] + ln Pr[θ]}. θ An EM algorithm can be derived in a similar fashion for MAP estimation by relying on the interpretation of Neal and Hinton (1998). The final algorithm is summarized below: j 1. Initialize µi = (1/R) ∑R yi based on majority voting. j=1 2. Given µi , estimate the sensitivity and specificity of each annotator/expert as follows. j αj = j a1 − 1 + ∑N µi yi i=1 j j a1 + a2 − 2 + ∑N µi i=1 j β j = . j b1 − 1 + ∑N (1 − µi )(1 − yi ) i=1 j j b1 + b2 − 2 + ∑N (1 − µi ) i=1 . (4) The Newton-Raphson update for optimizing w is given by wt+1 = wt − ηH −1 g, with step length η, gradient vector N g(w) = ∑ µi − σ(w⊤ xi ) xi − Γw, i=1 and Hessian matrix N H(w) = − ∑ σ(w⊤ xi ) 1 − σ(w⊤ xi ) xi x⊤ − Γ. i i=1 10. It may be convenient to specify a prior in terms of the mean µ and variance σ2 . The mean and the variance for a beta prior are given by µ = a/(a + b) and σ2 = ab/((a + b)2 (a + b + 1)). Solving for a and b we get a = (−µ3 + µ2 − µσ2 )/σ2 and b = a(1 − µ)/µ. 1304 L EARNING F ROM C ROWDS 3. Given the sensitivity and specificity of each annotator and the model parameters, update µi as µi = ai pi , ai pi + bi (1 − pi ) (5) where pi = σ(w⊤ xi ). R ai = ∏[α j ]y [1 − α j ]1−y . j i j i j=1 R bi = ∏[β j ]1−y [1 − β j ]y . j i j i (6) j=1 Iterate (2) and (3) till convergence. 2.7 Discussions 1. Estimate of the gold standard The value of the posterior probability µi is a soft probabilistic estimate of the actual ground truth yi , that is, µi = Pr[yi = 1|y1 , . . . , yR , xi , θ]. The actual i i hidden label yi can be estimated by applying a threshold on µi , that is, yi = 1 if µi ≥ γ and zero otherwise. We can use γ = 0.5 as the threshold. By varying γ we can change the misclassification costs and obtain a ground truth with large sensitivity or large specificity. Because of this in our experimental validation we can actually draw an ROC curve for the estimated ground truth. 2. Log-odds of µ A particularly revealing insight can be obtained in terms of the log-odds or the logit of the posterior probability µi . From (5) the logit of µi can be written as logit(µi ) = ln Pr[yi = 1|y1 , . . . , yR , xi , θ] µi i i = ln 1 − µi Pr[yi = 0|y1 , . . . , yR , xi , θ] i i R = w⊤ xi + c + ∑ yi [logit(α j ) + logit(β j )]. j j=1 j where c = ∑R log 1−α is a constant term which does not depend on i. This indicates that j=1 βj the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. The weight of each expert is the sum of the logit of the sensitivity and specificity. 3. Using any other classifier For ease of exposition we used logistic regression. However, the proposed algorithm can be used with any generalized linear model or in fact with any classifier that can be trained with soft probabilistic labels. In each step of the EM-algorithm, the classifier is trained with instances sampled from µi . This modification is easy for most probabilistic classifiers. For general black-box classifiers where we cannot tweak the training algorithm an alternate approach is to replicate the training examples according to the soft label. For example a probabilistic label µi = 0.8 can be effectively simulated by adding 8 training examples with deterministic label 1 and 2 examples with label 0. 1305 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 4. Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. Here instead of learning a classifier we estimate p which is the prevalence of the positive class, that is, p = Pr[yi = 1]. We further assume a beta prior for the prevalence, that is, Beta(p|p1 , p2 ). The algorithm simplifies as follows. j (a) Initialize µi = (1/R) ∑R yi based on majority voting. j=1 (b) Given µi , estimate the sensitivity and specificity of each annotator using (4). The prevalence of the positive class is estimated as follows. p = p1 − 1 + ∑N µi i=1 . p1 + p2 − 2 + N (c) Given the sensitivity and specificity of each annotator and prevalence, refine µi as follows. ai p µi = . ai p + bi (1 − p) Iterate (2) and (3) till convergence. This algorithm is similar to the one proposed by Dawid and Skene (1979) and Smyth et al. (1995). 5. Handling missing labels The proposed approach can easily handle missing labels, that is, when the labels from some experts are missing for some instances. Let Ri be the number of radiologists labeling the ith instance, and let N j be the number of instances labeled by the jth radiologist. Then in the EM algorithm, we just need to replace N by N j for estimating the sensitivity and specificity in (4), and replace R by Ri for updating µi in (6). 6. Evaluating a classifier We can use the probability scores µi directly to evaluate classifiers. If zi are the labels obtained from any other classifier, then sensitivity and specificity can be estimated as ∑N (1 − µi )(1 − zi ) ∑N µi zi , β = i=1 N . α = i=1 ∑N µi ∑i=1 (1 − µi ) i=1 7. Posterior approximation At the end of each EM iteration a crude approximation to the posterior is obtained as N N α j ∼ Beta α j |a1 + ∑ µi yi , a2 + ∑ µi (1 − yi ) , j j i=1 N j j i=1 N β j ∼ Beta β j |b1 + ∑ (1 − µi )(1 − yi ), b2 + ∑ (1 − µi )yi j j i=1 j j . i=1 3. Multi-class Classification In this section we describe how the proposed approach for binary classification can be extended to categorical data. Suppose there are K ≥ 2 categories. An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity—which are three 1306 L EARNING F ROM C ROWDS different kinds on nodules. We can extend the previous model and introduce a vector of multinomial j j j parameters αc = (αc1 , . . . , αcK ) for each annotator, where αck := Pr[y j = k|y = c] j and ∑K αck = 1. Here αck denotes the probability that the annotator j assigns class k to an instance k=1 j j given the true class is c. When K = 2, α11 and α00 are sensitivity and specificity, respectively. A similar EM algorithm can be derived. In the E-step, we estimate j j R K Pr[yi = c|Y , α] ∝ Pr[yi = c|xi ] ∏ ∏ (αck )δ(yi ,k) , j j j=1 k=1 where δ(u, v) = 1 if u = v and 0 otherwise and in the M-step we learn a multi-class classifier and update the multinomial parameter as j αck = j ∑N Pr[yi = c|Y , α]δ(yi , k) i=1 . ∑N Pr[yi = c|Y , α] i=1 One can also assign a Dirichlet prior for the multinomial parameters, and this results in a smoothing term in the above updates in the MAP estimate. 4. Ordinal Regression We now consider the situation where the outputs are categorical and have an ordering among the labels. In the CAD domain the radiologist often gives a score (for example, 1 to 5 from lowest to highest) to indicate how likely she thinks it is malignant. This is different from a multi-class setting in which we do not have any preference among the multiple class labels. j Let yi ∈ {1, . . . , K} be the label assigned to the ith instance by the jth expert. Note that there is an ordering in the labels 1 < . . . < K. A simple approach is to convert the ordinal data into a series of binary data (Frank and Hall, 2001). Specifically the K class ordinal labels are transformed into K − 1 binary class labels as follows: jc yi = j 1 if yi > c 0 otherwise c = 1, . . . , K − 1. Applying the same procedure used for binary labels we can estimate Pr[yi > c] for c = 1, . . . , K − 1. The probability of the actual class values can then be obtained as Pr[yi = c] = Pr[yi > c − 1 and yi ≤ c] = Pr[yi > c − 1] − Pr[yi > c]. The class with the maximum probability is assigned to the instance. 5. Regression In this section we develop a similar algorithm to learn a regression function using annotations from multiple experts. In the CAD domain as a part of the annotation process a common task for a radiologist is to measure the diameter of a suspicious lesion. 1307 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 5.1 Model for Annotators j Let yi ∈ R be the continuous target value assigned to the ith instance by the jth annotator. Our model is that the annotator provides a noisy version of the actual true value yi . For the jth annotator we will assume a Gaussian noise model with mean yi (the true unknown value) and inverse-variance (precision) τ j , that is, j j Pr[yi |yi , τ j ] = N (yi |yi , 1/τ j ), (7) where the Gaussian distribution is defined as N (z|m, σ2 ) = (2πσ2 )−1/2 exp(−(z − m)2 /2σ2 ). The unknown inverse-variance τ j measures the accuracy of each annotator—the larger the value of τ j the more accurate the annotator. We have assumed that τ j does not depend on the instance xi . For example, in the CAD domain, this means that the radiologist’s accuracy does not depend on the nodule she is measuring. While this a practical assumption, it is not entirely true. It is known that some nodules are harder to measure than others. 5.2 Linear Regression Model for Features As before we consider the family of linear regression functions: F = { fw}, where for any x, w ∈ Rd , fw(x) = w⊤ x. We assume that the actual target response yi is given by the deterministic regression function fw with additive Gaussian noise, that is, yi = w⊤ xi + ε, where ε is a zero-mean Gaussian random variable with inverse-variance (precision) γ. Hence Pr[yi |xi , w, γ] = N (yi |w⊤ xi , 1/γ). (8) 5.3 Combined Model Combining both the annotator (7) and the regressor (8) model we have Pr[yi |xi , w, τ j , γ] = j Z Pr[yi |yi , τ j ]Pr[yi |xi , w, γ]dyi = N (yi |w⊤ xi , 1/γ + 1/τ j ). j j Since the two precision terms (γ and τ j ) are grouped together they are not uniquely identifiable. Hence we will define a new precision term λ j as 1 1 1 = + j. j λ γ τ So we have the following model Pr[yi |xi , w, λ j ] = N (yi |w⊤ xi , 1/λ j ). j j (9) 5.4 Estimation/Learning Problem Given the training data D consisting of N instances with annotations from R experts, that is, D = {xi , y1 , . . . , yR }N , the task is to estimate the weight vector w and the precision λ = [λ1 , . . . , λR ] of i i=1 i all the annotators. 1308 L EARNING F ROM C ROWDS 5.5 Maximum-likelihood Estimator Assuming the instances are independent the likelihood of the parameters θ = {w, λ} given the observations D can be factored as N Pr[D |θ] = ∏ Pr[y1 , . . . , yR |xi , θ]. i i i=1 Conditional on the instance xi we assume that y1 , . . . , yR are independent, that is, the annotators i i provide their responses independently. Hence from (9) the likelihood can be written as N R Pr[D |θ] = ∏ ∏ N (yi |w⊤ xi , 1/λ j ). j i=1 j=1 The maximum-likelihood estimator is found by maximizing the log-likelihood θML = {λ, w} = arg max{ln Pr[D |θ]}. θ By equating the gradient of the log-likelihood to zero we obtain the following update equations for the precision and the weight vector. 1 λj = 1 N ∑ yij − w⊤ xi N i=1 N w = ∑ −1 2 . (10) N ∑R λ j yi j=1 i=1 ∑R λ j j=1 ∑ xi xi x⊤ i i=1 j . (11) As the parameters w and λ are coupled together we iterate these two steps till convergence. 5.6 Discussions 1. Is this standard least-squares? Define the design matrix X = [x1 , . . . , xN ]⊤ and the rej j sponse vector for each annotator as y j = [y1 , . . . , yN ]⊤ . Using matrix notation Equation 11 can be written as w = (X ⊤ X)−1 X ⊤ y where y = ∑R λ j y j j=1 ∑R λ j j=1 . (12) Equation 12 is essentially the solution to a standard linear regression model, except that we are training a linear regression model with y as the ground truth, which is a precision weighted mean of the response vectors from all the annotators. The variance of each annotator is estimated using (10). The final algorithm iteratively establishes a particular gold standard (y), measures the performance of the annotators and learns a regressor given that gold standard, and refines the gold standard based on the performance measures. 2. Are we better than the best annotator? If we assume λ is fixed (i.e., we ignore the variability and assume that it is well estimated) then w is an unbiased estimator of w and the ˆ covariance matrix is given by Cov(w) = Cov(y) X ⊤ X ˆ 1309 −1 = 1 ∑R λ j j=1 X ⊤X −1 . R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Since ∑R λ j > max j (λ j ) the proposed method has a lower variance than the regressor learnt j=1 with the best annotator (i.e., the one with the minimum variance). 3. Are we better than the average? For a fixed X the error in w depends only on the variance of y j . If we know the true λ j then yi is the best linear unbiased estimator for yi which minij mizes the variance. To see this consider any linear estimator of the form yi = ∑ j a j (yi − b j ). The variance is given by Var[yi ] = ∑ j (a j )2 /λ j . Since E[yi ] = yi ∑ j a j , for the bias of this estimator to be zero we require that ∑ j a j = 1. Solving the constrained minimization problem we see that a j = λ j / ∑ j λ j minimizes the variance. 4. Obtaining a consensus without features When no features are available the same algorithm can be simplified to get a consensus estimate of the actual ground truth and also evaluate the annotators. Essentially we have to iterate the following two updates till convergence ∑R λ j yi j=1 1 ∑R λ j j=1 λj j yi = = 1 N ∑ y j − yi N i=1 i 2 . 6. Experimental Validation We now experimentally validate the proposed algorithms on both simulated and real data. 6.1 Classification Experiments We use two CAD and one text data set in our experiments. The CAD data sets include a digital mammography data set and a breast MRI data set, both of which are biopsy proven, that is, the gold standard is available. For the digital mammography data set we simulate the radiologists in order to validate our methods. The breast MRI data has annotations from four radiologists. We also report results on a Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk which has annotations from 164 annotators. 6.1.1 D IGITAL M AMMOGRAPHY WITH S IMULATED R ADIOLOGISTS Mammograms are used as a screening tool to detect early breast cancer. CAD systems search for abnormal areas (lesions) in a digitized mammographic image. These lesions generally indicate the presence of malignant cancer. The CAD system then highlights these areas on the images, alerting the radiologist to the need for a further diagnostic mammogram or a biopsy. In classification terms, given a set of descriptive morphological features for a region on a image, the task is to predict whether it is potentially malignant (1) or not (0). In order to train such a classifier, a set of mammograms is collected from hospitals. The ground truth (whether it is cancer or not) is obtained from biopsy. Since biopsy is an expensive, tedious, and an invasive process, very often CAD systems are built from labels collected from multiple expert radiologists who visually examine the mammograms and mark the lesion locations—this constitutes our ground truth (multiple labels) for learning. In this experiment we use a proprietary biopsy-proven data set (Krishnapuram et al., 2008) containing 497 positive and 1618 negative examples. Each instance is described by a set of 27 morphological features. In order to validate our proposed algorithm, we simulate multiple radiologists according to the two-coin model described in § 2.1. Based on the labels from multiple radiologists, 1310 L EARNING F ROM C ROWDS we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. For the first set of experiments we use 5 radiologists with sensitivity α = [0.90 0.80 0.57 0.60 0.55] and specificity β = [0.95 0.85 0.62 0.65 0.58]. This corresponds to a scenario where the first two radiologists are experts and the last three are novices. Figure 1 summarizes the results. We compare on three different aspects: (1) How good is the learnt classifier? (2) How well can we estimate the sensitivity and specificity of each radiologist? (3) How good is the estimated ground truth? The following observations can be made. 1. Classifier performance Figure 1(a) plots the ROC curve of the learnt classifier on the training set. The dotted (black) line is the ROC curve for the classifier learnt using the actual ground truth. The solid (red) line is the ROC curve for the proposed algorithm and the dashed (blue) line is for the classifier learnt using the majority-voting scheme. The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. The area under the ROC curve (AUC) for the proposed algorithm is around 3.5% greater than that learnt using the majority-voting scheme. 2. Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(b). The end of the solid red line shows the estimates of the sensitivity and specificity from the proposed method. We used a uniform prior on all the parameters. The ellipse plots the contour of one standard deviation as obtained from the beta posterior estimates. The end of the dashed blue line shows the estimate obtained from the majorityvoting algorithm. We see that the proposed method is much closer to the actual values of sensitivity and specificity. 3. Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. From Figure 1(b) we can see that the ROC curve for the proposed method dominates the majority voting ROC curve. Furthermore, the area under the ROC curve (AUC) is around 3% higher. The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). It does not have an idea of who is an expert and who is a novice. The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity. The improvement obtained is quite large in Figure 2 which corresponds a situation where we have only one expert and 7 novices. 4. Joint Estimation To learn a classifier, Smyth et al. (1995) proposed to first estimate the golden ground truth and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. Figure 3 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. 6.1.2 B REAST MRI In this example, each radiologist reviews the breast MRI data and assesses the malignancy of each lesion on a BIRADS scale of 1 to 5. The BIRADS scale is defined as follows: 1 Negative, 2 Benign, 1311 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Majority Voting Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 EM algorithm Estimated 1 Estimated 2 Estimated 3 Estimated 4 Estimated 5 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0217 0.5869 0.2391 0.1521 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1785 0.1071 0.2500 0.4642 True 1 True 2 True 3 True 4 True 5 x x x x x 0.0000 0.6957 0.1304 0.1739 0.0000 0 0 0 1 0 x x x x x 0.0000 0.1428 0.0000 0.3214 0.5357 Table 1: The confusion matrix for the estimate obtained using majority voting and the proposed EM algorithm. The x indicates that there was no such category in the true labels (the gold standard). The gold-standard is obtained by the biopsy which can confirm whether it is benign (BIRADS=2) or malignant (BIRADS=5). 3 Probably Benign, 4 Suspicious abnormality, and 5 Highly suggestive of malignancy. Our data set comprises of 75 lesions with annotations from four radiologists, and the true labels from biopsy. Based on eight morphological features, we have to predict whether a lesion is malignant or not. For the first experiment we reduce the BIRADS scale to a binary one: any lesion with a BIRADS > 3 is considered malignant and benign otherwise. The set included 28 malignant and 47 benign lesions. Figure 4 summarizes the results. We show the leave-one-out cross validated ROC for the classifier. The cross-validated AUC of the proposed method is approximately 6% better than the majority voting baseline. We also consider the BIRADS labels as a set of ordinal measurements since there is an ordering among the BIRADS label. The confusion matrix in Table 1 shows that the EM algorithm is significantly superior than the majority voting in estimating the true BIRADS. 6.1.3 R ECOGNIZING T EXTUAL E NTAILMENT Finally we report results on Recognizing Textual Entailment data collected by Snow et al. (2008) using the Amazon’s Mechanical Turk. In this task, the annotator is presented with two sentences and given a choice of whether the second sentence can be inferred from the first. The data has 800 tasks and 164 distinct readers, with 10 annotations per task along with the golden ground truth. The majority of the entries (94 %) in the 800x164 matrix are missing. There is one annotator who has labeled all the tasks. We use this data set to obtain an estimate of the actual ground truth. Figure 5 plots the accuracy of the estimated ground truth as a function of the number of annotators. The proposed EM algorithm achieves a higher accuracy than majority voting. In other words to achieve a desired accuracy the proposed algorithm needs fewer annotators than the majority voting scheme. 1312 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 1: Results for the digital mammography data set with annotations from 5 simulated radiologists. (a) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (b) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologists is marked as a ×. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation. 1313 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.906 Majority voting baseline AUC=0.884 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm AUC=0.967 Majority voting baseline AUC=0.872 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 2: Same as Figure 1 except with 8 different radiologist annotations. 1314 L EARNING F ROM C ROWDS ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 3: ROC curves comparing the proposed algorithm (solid red line) with the Decoupled Estimation procedure (dotted blue line), which refers to the algorithm where the ground truth is first estimated using just the labels from the five radiologists and then a logistic regression classifier is trained using the soft probabilistic labels. In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm. 1315 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY Leave−One−Out ROC Curve for the classifier 1 True Positive Rate (sensitivity) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (a) ROC Curve for the estimated true labels 1 0.9 True Positive Rate (sensitivity) 0.8 0.7 0.6 0.5 0.4 Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1−specifcity) 0.8 1 (b) Figure 4: Breast MRI results. (a) The leave-one-out cross validated ROC. (b) ROC for the estimated ground truth. 1316 L EARNING F ROM C ROWDS 0.95 0.9 0.85 Accuracy 0.8 0.75 0.7 0.65 Majority Voting EM Algorithm 0.6 0.55 0.5 20 40 60 80 100 120 Number of Annotators 140 160 Figure 5: The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. The plot was generated by randomly sampling the annotators 100 times. 6.2 Regression Experiments We first illustrate the algorithm on a toy dataset and then present a case study for automated polyp measurements. 6.2.1 I LLUSTRATION Figure 6 illustrates the the proposed algorithm for regression on a one-dimensional toy data set with three annotators. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 20 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 6.2.2 AUTOMATED P OLYP M EASUREMENTS Colorectal polyps are small colonic findings that may develop into cancer at a later stage. The diameter of the polyp is one of the key factors which decides the malignancy of a suspicious polyp. Hence accurate size estimation is crucial to decide the action to be taken on a polyp. We have developed various algorithms to segment a polyp. Multiple segmentation algorithms give rise to a set of features which are correlated with the diameter of the polyp. We want to learn a regression function which can predict the diameter of a polyp as a function of these features. In order to learn a regression function we collect our ground truth by asking many radiologists to manually measure the the diameter of the polyps from the three-dimensional images. In practice there is a lot of disagreement among the radiologists as to the actual size of the polyp. 1317 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY 2 1 N=50 examples R=3 annotators 2 2 5 Actual regression model Model with annotator 1 (Precision=0.01) 2 1 Model with annotator 2 (Precision=0.10) 2 Model with annotator 3 (Precision=1.00) 3 2 3 2 Model with average response 2 Proposed algorithm 2 3 1 1 4 3 2 2 y (response) 2 2 1 3 2 3 3 2 3 1 −1 3 3 2 2 3 3 1 2 2 2 3 33 3 1 1 3 3 3 3 2 3 3 2 33 33 2 2 2 3 3 3 32 1 2 3 2 1 3 3 2 1 2 1 2 2 1 1 2 3 2 3 2 3 1 3 3 −3 2 3 3 2 3 3 3 2 3 3 2 1 3 3 2 0 −2 2 2 3 2 3 2 3 1 2 2 1 2 3 −4 1 2 2 1 −5 0 2 0.1 0.2 1 0.3 0.4 0.5 0.6 10.8 2 0.7 x (feature) 0.9 1 1 Figure 6: Illustration of the proposed algorithm on a one-dimensional toy data set. The actual regression model (shown as a blue dotted line) is given by y = 5x − 2. We simulate 50 samples from three annotators with precisions 0.01, 0.1, and 1.0. The data are shown by the annotators’s number. While we can fit a regression model using each annotators’s response, we see that only the model for annotator three (with highest precision) is close to the true regression model. The green dashed line shows the model learnt using the average response from all the three annotators. The red line shows the model learnt by the proposed algorithm. 5 5 10 Actual Polyp Diameter (mm) (a) Gold standard model 15 10 5 0 0 Pearson Correlation Coefficient=0.554966 RMSE=2.887740 15 Estimated Polyp Diameter (mm) 10 0 0 Pearson Correlation Coefficient=0.706558 RMSE=1.991576 15 Estimated Polyp Diameter (mm) Estimated Polyp Diameter (mm) Pearson Correlation Coefficient=0.714720 RMSE=1.970815 15 5 10 Actual Polyp Diameter (mm) (b) Proposed model 15 10 5 0 0 5 10 Actual Polyp Diameter (mm) 15 (c) Average model Figure 7: Scatter plot of the actual polyp diameter vs the diameter predicted by the models learnt using (a) the actual gold standard, (b) the proposed algorithm with annotations from five radiologists, and (c) the average of the radiologist’s annotations. (See § 6.2.2 for a description of the experimental setup.) 1318 L EARNING F ROM C ROWDS We use a proprietary data set containing 393 examples (which point to 285 distinct polyps— the segmentation algorithms generally return multiple marks on the same polyp.) along with the measured diameter (ranging from 2mm to 15mm) as our training set. Each example is described by a set of 60 morphological features which are correlated to the diameter of the polyp. In order to validate the feasibility of our proposed algorithm, we simulate five radiologists according to the noisy model described in § 5.1 with τ = [0.001 0.01 0.1 1 10]. This corresponds to a situation where the first three radiologists are extremely noisy and the last two are quite accurate. Based on the measurements from multiple radiologists, we can simultaneously (1) learn a linear regressor and (2) estimate the precision of each radiologist. We compare the results with the classifier trained using the actual golden ground truth as well as the regressor learnt using the average of the radiologists measurements. The results are validated on an independent test set containing 397 examples (which point to 298 distinct polyps). Figure 7 shows the scatter plot of the actual polyp diameter vs the diameter predicted by the three different models. We compare the performance based on the root mean squared error (RMSE) and also the Pearson’s correlation coefficient. The regressor learnt using the proposed iterative algorithm (Figure 7(b)) is almost as good as the one learnt using the golden ground truth (Figure 7(a)). The correlation coefficient for the proposed algorithm is significantly larger than that learnt using the average of the radiologists response. The estimate obtained by averaging is closer to the novices since they form a majority (3/5). The proposed algorithm appropriately weights each radiologist based on their estimated precisions. 7. Conclusions and Future Work In this paper we proposed a probabilistic framework for supervised learning with multiple annotators providing labels but no absolute gold standard. The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. We specifically discussed binary/categorical/ordinal classification and regression problems. We made two key assumptions: (1) the performance of each annotator does not depend on the feature vector for a given instance and (2) conditional on the truth the experts are independent, that is, they make their errors independently. As we pointed out earlier these assumptions are not true in practice. The annotator performance depends on the instance he is labeling and there is some degree of correlation among the annotators. We briefly discuss some strategies to relax these two assumptions. 7.1 Instance Difficulty One drawback of the current model is that it doesn’t estimate difficulty of items. It is often observed that for the easy instances all the annotators agree on the labels—thus violating our conditional independence assumption. The difficulty of annotating an item can be captured by another latent variable γi for each instance—which modulates the annotators performance. Models for this have been developed in the area of item-response theory (Baker and Kim, 2004) and also in epidemiology (Uebersax and Grove, 1993)—see also Whitehill et al. (2009) for a recent paper in the machine learning community. While these models do not take into account the available features our pro1319 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY posed model for sensitivity and specificity can be extended as follows (in place of (1) and (2)): α j (γi ) := Pr[yi = 1|yi = 1, γi ] = σ(a j1 + b j1 γi ). j β j (γi ) := Pr[yi = 0|yi = 0, γi ] = σ(a j0 + b j0 γi ). j Here the parameters a j1 and a j0 are related to the sensitivity and specificity of the jth annotator, while the latent term γi captures the difficulty of the instance. The key assumption here is that the annotators are independent conditional on both yi and γi . Various assumptions can be made on two parameters b j1 and b j0 to simplify these models further—for example we could set b j1 = b1 and b j0 = b0 for all the annotators. 7.2 Annotators Actually Look at the Data In our model we made the assumption that the sensitivity α j and the specificity β j of the jth annotator does not depend on the feature vector xi . For example, in the CAD domain, this meant that the radiologist’s performance is consistent across different sub-groups of data—which is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience. We can extend the previous model such that the sensitivity and the specificity depends on the feature vector xi explicitly as follows j⊤ α j (γi , xi ) := Pr[yi = 1|yi = 1, γi , xi ] = σ(a j1 + b j1 γi + wα xi ). j j⊤ α j (γi , xi ) := Pr[yi = 0|yi = 0, γi , xi ] = σ(a j0 + b j0 γi + wβ xi ). j However this change increases the number of parameters to be learned. References P. S. Albert and L. E. Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60:427–435, 2004. F. B. Baker and S. Kim. Item Response Theory: Parameter Estimation Techniques. CRC Press, 2 edition, 2004. B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. S. R. Cholleti, S. A. Goldman, A. Blum, D. G. Politte, and S. Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 2008 20th IEEE international Conference on Tools with Artificial intelligence, 2008. A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. O. Dekel and O. Shamir. Vox Populi: Collecting high-quality labels from a crowd. In COLT 2009: Proceedings of the 22nd Annual Conference on Learning Theory, 2009a. O. Dekel and O. Shamir. Good learners for evil teachers. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 233–240, 2009b. 1320 L EARNING F ROM C ROWDS A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977. P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD 2009: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 259–268, 2009. E. Frank and M. Hall. A simple approach to ordinal classification. Lecture Notes in Computer Science, pages 145–156, 2001. G. Fung, B. Krishnapuram, J. Bi, M. Dundar, V. C. Raykar, S. Yu, R. Rosales, S. Krishnan, and R. B. Rao. Mining medical images. In Fifteenth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining: Third Workshop on Data Mining Case Studies and Practice Prize, 2009. J. Howe. Crowd sourcing: Why the Power of the Crowd Is Driving the Future of Business. 2008. S. L. Hui and S. D. Walter. Estimating the error rates of diagnostic tests. Biometrics, 36:167–171, 1980. S. L. Hui and X. H. Zhou. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research, 7:354–370, 1998. R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems 15, pages 897–904. 2003. B. Krishnapuram, J. Stoeckel, V. C. Raykar, R. B. Rao, P. Bamberger, E. Ratner, N. Merlet, I. Stainvas, M. Abramov, and A. Manevitch. Multiple-instance learning improves CAD detection of masses in digital mammography. In IWDM 2008: Proceedings of the 9th international workshop on Digital Mammography, pages 350–357. 2008. G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79–87, 1992. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML 2009: Proceedings of the 26th International Conference on Machine Learning, pages 889–896, 2009. V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. P. Smyth. Learning with probabilistic supervision. In Computational Learning Theory and Natural Learning Systems 3, pages 163–182. MIT Press, 1995. 1321 R AYKAR , Y U , Z HAO , VALADEZ , F LORIN , B OGONI AND M OY P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085– 1092. 1995. R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast - But is it good? Evaluating nonexpert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008. A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. J. S. Uebersax and W. M. Grove. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 49:823–835, 1993. S. K. Warfield, K. H. Zou, and W. M. Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004. J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. 1322
2 0.30356196 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
3 0.30326378 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
4 0.30215353 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
5 0.3015413 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
Author: Miloš Radovanović, Alexandros Nanopoulos, Mirjana Ivanović
Abstract: Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent “popular” nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families. Keywords: nearest neighbors, curse of dimensionality, classification, semi-supervised learning, clustering
6 0.30013797 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
7 0.29961318 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
8 0.29759106 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
9 0.29707748 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
10 0.29474705 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
11 0.29471007 102 jmlr-2010-Semi-Supervised Novelty Detection
12 0.29466537 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.29459187 63 jmlr-2010-Learning Instance-Specific Predictive Models
15 0.2936644 109 jmlr-2010-Stochastic Composite Likelihood
16 0.29306296 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
17 0.29254028 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
18 0.2921401 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
19 0.29068077 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
20 0.29032549 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization