nips nips2011 nips2011-42 knowledge-graph by maker-knowledge-mining

42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing


Source: pdf

Author: Fabian L. Wauthier, Michael I. Jordan

Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. [sent-7, score-0.509]

2 Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. [sent-11, score-0.917]

3 Our model captures the sources of bias by describing labelers as influenced by shared random effects. [sent-12, score-0.58]

4 This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. [sent-13, score-0.411]

5 We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. [sent-15, score-0.317]

6 Unfortunately, the data collected from crowdsourcing services is often very dirty: Unhelpful labelers may provide incorrect or biased responses that can have major, uncontrolled effects on learning algorithms. [sent-19, score-0.723]

7 Further, as soon as malicious labelers try to exploit incentive schemes in the data collection cycle, yet more forms of bias enter. [sent-21, score-0.631]

8 The researcher farms out the labeling tasks to a crowdsourcing service for annotation and possibly adds a small set of gold standard labels. [sent-23, score-0.675]

9 Since labels from the crowd are contaminated by errors and bias, some filtering is applied to curate the data, possibly using the gold standard provided by the researcher. [sent-25, score-0.45]

10 Although the potential for active learning to make crowdsourcing much more cost effective and goal driven has been appreciated, research on the topic is still in its infancy [4, 9, 17]. [sent-30, score-0.345]

11 We believe that the lack of systematic solutions to these problems can make crowdsourcing brittle in situations where labelers are arbitrarily biased or even malicious, such as when tasks are particularly ambiguous/hard or when opinions or ratings are solicited. [sent-32, score-0.804]

12 The first is a flexible latent feature model that describes each labeler’s idiosyncrasies through multiple shared factors and allows us to combine data curation and learning (steps 2 and 3 above) into one inferential computation. [sent-36, score-0.414]

13 Most of the literature accounts for the effects of labeler bias by assuming a single, true latent labeling from which labelers report noisy observations of some kind [2, 3, 4, 6, 8, 9, 10, 11, 15, 16, 17, 18]. [sent-37, score-1.122]

14 This assumption is inappropriate when labels are solicited on subjective or ambiguous tasks (ratings, opinions, and preferences) or when learning must proceed in the face of arbitrarily biased labelers. [sent-38, score-0.309]

15 Our BBMC framework achieves this by modeling the sources of labeler bias through shared random effects. [sent-40, score-0.553]

16 Since our model requires Gibbs sampling for inference, a straightforward application of active learning is infeasible: Each active learning step relies on many inferential computations and would trigger a multitude of subordinate Gibbs samplers to be run within one large Gibbs sampler. [sent-42, score-0.522]

17 The basic idea is to approximate the stationary distribution of a perturbed Markov chain using that of an unperturbed chain. [sent-44, score-0.316]

18 We specialize this idea to active learning in our model and show that the computations are efficient and that the resulting active learning strategy substantially outperforms other active learning schemes. [sent-45, score-0.565]

19 In Section 3 we propose the latent feature model for labelers and in Section 4 we discuss the inference procedure that combines data curation and learning. [sent-47, score-0.787]

20 Then we present a general method to approximate the stationary distribution of perturbed Markov chains and apply it to derive an efficient active learning criterion in Section 5. [sent-48, score-0.353]

21 [9] use the multiset of current labels, together with a random forest label model, to score which task to solicit a repeat label for next. [sent-52, score-0.374]

22 The quality of the labeler providing the new label does not enter the selection process. [sent-53, score-0.589]

23 [4] actively choose the labeler to query next using a formulation based on interval estimation, utilizing repeated labelings of tasks. [sent-55, score-0.516]

24 In contrast, our BBMC framework can perform meaningful inferences even without repeated labelings of tasks and treats the choices of which labeler to query on which task as a joint choice in a Bayesian framework. [sent-57, score-0.633]

25 [17] account for the effects of labeler bias through a coin flip observation model that filters a latent label assignment, which in turn is modeled through a logistic regression. [sent-59, score-0.689]

26 As in [4], the labeler is chosen separately from the task by solving two optimization problems. [sent-60, score-0.501]

27 [14] require each labeler to first pass a screening test before they are allowed to label any more data. [sent-62, score-0.554]

28 In a similar manner, reputation systems of various forms are used to weed out historically unreliable labelers before collecting data. [sent-63, score-0.519]

29 Consensus voting among multiple labels is a commonly used data curation method [12, 14]. [sent-64, score-0.321]

30 It works well when low levels of bias or noise are expected but becomes unreliable when labelers vary greatly in quality [9]. [sent-65, score-0.592]
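Consensus voting is simple enough to sketch directly. Below is a minimal illustration (not part of BBMC; the function name is ours, and the {-1, 0, +1} label encoding follows the convention used later in this paper):

```python
import numpy as np

def consensus_vote(Y):
    """Majority-vote data curation: Y is a (tasks x labelers) matrix with
    entries in {-1, 0, +1}, where 0 means 'no label given'. The consensus
    label for each task is the sign of its row sum (0 on ties)."""
    return np.sign(Y.sum(axis=1))

# Three tasks, four labelers; zeros mark missing labels.
Y = np.array([[+1, +1, -1,  0],
              [-1, -1, -1, +1],
              [+1, -1,  0,  0]])
print(consensus_vote(Y))  # one consensus label per task; the last row is a tie
```

As the quality caveat above suggests, such a vote degrades quickly once labeler quality varies: a few confident but biased labelers can dominate the sum.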

31 [10] looked at estimating the unknown true label for a task from a set of labelers of varying quality without an external gold standard signal. [sent-67, score-0.896]

32 [8] assign latent variables to labelers, capturing their mislabeling probabilities. [sent-70, score-0.554]

33 [6] pointed out that a biased labeler who systematically mislabels tasks is still more useful than a labeler who reports labels at random. [sent-72, score-1.138]

34 A method is proposed that separates low quality labelers from high quality, but biased labelers. [sent-73, score-0.569]

35 First, they filter labelers by how far they disagree with an estimated true label and then retrain the model on the cleaned data. [sent-75, score-0.604]

36 Work has also focused on using gold standard labels to determine labeler quality. [sent-79, score-0.817]

37 Going beyond simply counting tasks on which labelers disagree with the gold standard, Snow et al. [sent-80, score-0.805]

38 [11] estimate labeler quality in a Bayesian setting by comparing to the gold standard. [sent-81, score-0.736]

39 Given some gold standard labels, collaborative filtering methods could in principle also be used to curate data represented by a sparse label matrix. [sent-83, score-0.425]

40 3 Modeling Labeler Bias In this section we specify a Bayesian latent feature model that accounts for labeler bias and allows us to combine data curation and learning into a single inferential calculation. [sent-86, score-0.861]

41 In practical settings it is unlikely that a task is labeled by more than 3–10 labelers [14]. [sent-89, score-0.528]

42 The label responses are recorded in the matrix Y so that yi,l ∈ {−1, 0, +1} denotes the label given to task i by labeler l. [sent-94, score-0.708]
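As a concrete sketch of this encoding (the helper name and interface are ours, not the paper's):

```python
import numpy as np

def build_label_matrix(n_tasks, n_labelers, triples):
    """Assemble the response matrix Y with y_il in {-1, 0, +1};
    0 marks 'labeler l gave no label for task i'."""
    Y = np.zeros((n_tasks, n_labelers), dtype=int)
    for i, l, y in triples:
        Y[i, l] = y
    return Y

# (task, labeler, label) responses collected from the crowd.
Y = build_label_matrix(3, 2, [(0, 0, +1), (1, 1, -1), (2, 0, -1)])
```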

43 A researcher is interested in learning a model that can be used to predict labels for new tasks. [sent-96, score-0.291]

44 When consensus is lacking among labelers, our desideratum is to predict the labels that the researcher (or some other expert) would have assigned, as opposed to labels from an arbitrary labeler in the crowd. [sent-97, score-0.906]

45 In this situation it makes sense to stratify the labelers in some way. [sent-98, score-0.488]

46 To facilitate this, the researcher r provides gold standard labels in column r of Y to a small subset of the tasks. [sent-99, score-0.531]

47 Loosely speaking, the gold standard allows our model to curate the data by softly combining labels from those labelers whose responses will be useful in predicting r’s remaining labels. [sent-100, score-0.961]

48 If instead we were interested in predicting labels for labeler l, we would treat column l as containing the gold standard labels. [sent-102, score-0.817]

49 To simplify our presentation, we will accordingly refer to labelers in the crowd and the researcher occasionally just as “labelers,” indexed by l, and only use the distinguishing index r when necessary. [sent-104, score-0.709]

50 We account for each labeler l’s idiosyncrasies by assigning a parameter βl ∈ Rd to l and modeling labels yi,l , i = 1, . [sent-105, score-0.609]

51 This section describes a joint Bayesian prior on parameters βl that allows for parameter sharing; two labelers that share parameters have similar responses. [sent-109, score-0.488]

52 In the context of this model, the two-step process of data curation and learning a model that predicts r’s labels is reduced to posterior inference on βr given X and Y . [sent-110, score-0.379]

53 Let zl be a latent binary vector for labeler l whose component zl,b indicates whether the latent factor γb ∈ Rd contributes to βl . [sent-115, score-0.875]

54 , zl is infinitely long), as long as only a finite number of those factors is active (i. [sent-118, score-0.487]

55 Given a labeler’s vector zl and factors γ we define the parameter βl = Σ_{b=1}^{∞} zl,b γb. [sent-122, score-0.31]

56 For multiple labelers we let the infinitely long matrix Z = (z1 , . [sent-123, score-0.488]

57 , zm) collect the vectors zl and define the index set of all observed labels L = {(i, l) : yi,l ≠ 0}, so that the likelihood is p(Y | X, γ, Z) = Π_{(i,l)∈L} p(yi,l | xi, γ, zl) = Π_{(i,l)∈L} Φ(yi,l xi⊤ βl). [sent-126, score-0.717]
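A direct transcription of this probit likelihood, using the finite representation βl = Σb zl,b γb (the array shapes and example values below are our own choices for illustration):

```python
import numpy as np
from math import erf, sqrt, log

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def log_likelihood(Y, X, gamma, Z):
    """log p(Y | X, gamma, Z) = sum over observed (i, l) of
    log Phi(y_il * x_i . beta_l), with beta_l = sum_b Z[l, b] * gamma[b]."""
    beta = Z @ gamma                      # (labelers, d): row l is beta_l
    total = 0.0
    for i in range(Y.shape[0]):
        for l in range(Y.shape[1]):
            if Y[i, l] != 0:              # skip unobserved labels (y = 0)
                total += log(Phi(Y[i, l] * (X[i] @ beta[l])))
    return total

Y = np.array([[ 1, 0],
              [-1, 1]])
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
gamma = np.array([[0.5, -0.5],
                  [1.0,  0.0]])           # K = 2 latent factors in R^2
Z = np.array([[1, 0],
              [1, 1]])                    # which factors each labeler shares
print(log_likelihood(Y, X, gamma, Z))
```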

58 The IBP is a stochastic process on infinite binary matrices consisting of vectors zl . [sent-129, score-0.282]

59 With probability one, the number of columns K(Z) of Z is finite, so we may write βl = Σ_{b=1}^{K(Z)} zl,b γb = Zl γ, with Zl = zl⊤ ⊗ I the Kronecker product of zl and I. [sent-135, score-0.564]

60 4 Inference: Data Curation and Learning We noted before that our model combines data curation and learning in a single inferential computation. [sent-136, score-0.265]

61 Because latent factors can be shared across multiple labelers, the posterior will softly absorb label information from labelers whose latent factors tend to be similar to those of the researcher r. [sent-141, score-1.077]

62 Thus, Bayesian inference p(βr |Y, X) automatically combines data curation and learning by weighting label information through an inferred sharing structure. [sent-142, score-0.326]

63 Importantly, the posterior is informative even when no labeler in the crowd labeled any of the tasks the researcher labeled. [sent-143, score-0.766]

64 The generative model for the label yi,l given xi , γ and zl first samples ti,l from a Gaussian N (βl xi , 1). [sent-147, score-0.476]
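This Gaussian augmentation is the standard Albert–Chib construction for probit models: conditioned on the observed label y, the auxiliary variable t is a truncated normal. A sketch of that conditional draw via inverse-CDF sampling (the function name and interface are ours; mu stands for βl⊤xi):

```python
import random
from statistics import NormalDist

_std = NormalDist()  # standard normal; provides .cdf and .inv_cdf

def sample_truncated_t(mu, y, rng=random):
    """Draw t ~ N(mu, 1) truncated to the half-line agreeing with the
    label y in {-1, +1}: t > 0 when y = +1, t < 0 when y = -1."""
    if y == +1:
        lo, hi = _std.cdf(-mu), 1.0       # t > 0  <=>  (t - mu) > -mu
    else:
        lo, hi = 0.0, _std.cdf(-mu)       # t < 0  <=>  (t - mu) < -mu
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf away from 0 and 1
    return mu + _std.inv_cdf(u)

rng = random.Random(0)
draws = [sample_truncated_t(0.3, +1, rng) for _ in range(5)]
print(draws)  # all positive, since they condition on y = +1
```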

65 Rather than sampling from the lower chain directly (dashed arrows), we transform samples from the top chain to approximate samples from the lower (wavy arrows). [sent-158, score-0.259]

66 One simple way to achieve this is to work with a finite-dimensional approximation to the IBP: We constrain Z to be an m × K matrix, assigning each labeler at most K active latent features. [sent-161, score-0.726]

67 Let m−l,b = Σ_{l′≠l} zl′,b be the number of labelers, excluding l, with feature b active. [sent-163, score-0.307]
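Under the finite Beta(α/K, 1)–Bernoulli approximation to the IBP, these counts yield the collapsed prior probability that a feature is on. A sketch of the prior term only (in a full Gibbs update it would be multiplied by the likelihood from Section 3; the function name is ours):

```python
def prior_prob_feature_on(Z, l, b, alpha):
    """Collapsed prior p(z_lb = 1 | Z_-lb) = (m_-lb + alpha/K) / (m + alpha/K)
    under the finite approximation; Z is an m x K binary matrix (list of lists)."""
    m, K = len(Z), len(Z[0])
    m_neg = sum(Z[j][b] for j in range(m) if j != l)  # labelers != l with b on
    return (m_neg + alpha / K) / (m + alpha / K)

Z = [[1, 0],
     [1, 1],
     [0, 1]]
print(prior_prob_feature_on(Z, 0, 0, alpha=2.0))  # features shared by others are favored
```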

68 5 Active Learning The previous section outlined how, given a small set of gold standard labels from r, the remaining labels can be predicted via posterior inference p(βr |Y, X). [sent-171, score-0.558]

69 In this section we take an active learning approach [1, 7] to incrementally add labels to Y so as to quickly learn about βr while reducing data acquisition costs. [sent-172, score-0.293]

70 Active learning allows us to guide the data collection process through model inferences, thus integrating the data collection, data curation and learning steps of the crowdsourcing pipeline. [sent-173, score-0.44]

71 We envision a unified system that automatically asks for more labels from those labelers on those tasks that are most useful in inferring βr . [sent-174, score-0.658]

72 This is in contrast to [9], where labelers cannot be targeted with tasks. [sent-175, score-0.488]

73 It is also unlike [4] since we can let labelers be arbitrarily unhelpful, and differs from [17] which assumes a single latent truth. [sent-176, score-0.554]

74 A well-known active learning criterion popularized by Lindley [7] is to next label the task that maximizes the prior-posterior reduction in entropy of an inferential quantity of interest. [sent-177, score-0.37]

75 In our particular setup, we wish to infer the parameter βr to predict labels for the researcher r. [sent-180, score-0.291]

76 Suppose we chose to solicit a label for task i′ from labeler l′, which produced label yi′,l′. [sent-181, score-0.763]

77 The average utility of receiving a label on task i′ from labeler l′ is I((i′, l′), p(βr)) = E(U(p(βr | yi′,l′))), where the expectation is taken with respect to the predictive label probabilities p(yi′,l′ | xi′) = ∫ p(yi′,l′ | xi′, βl′) p(βl′) dβl′. [sent-183, score-0.757]
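With posterior samples in hand, a Monte Carlo stand-in for such a score is easy to sketch. The version below is a simplification of the full expected-utility criterion: it scores a candidate query only by the entropy of its posterior-predictive label probability (all names are ours):

```python
import numpy as np
from math import erf, sqrt, log

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def entropy(p):
    """Binary entropy in nats, safe at the endpoints."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log(p) - (1 - p) * log(1 - p)

def info_score(x, beta_samples):
    """Score a candidate (task, labeler) query: estimate the predictive
    p(y = +1 | x) by averaging Phi(x . beta^s) over posterior samples of beta,
    and return its entropy. Higher = the label is less predictable."""
    p = float(np.mean([Phi(x @ b) for b in beta_samples]))
    return entropy(p)

samples = [np.array([0.4, -0.2]), np.array([0.6, 0.1])]
uncertain = info_score(np.array([0.0, 1.0]), samples)   # near the boundary
confident = info_score(np.array([5.0, 0.0]), samples)   # far from it
print(uncertain > confident)
```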

78 To solve this problem, we propose a general purpose strategy to approximate the stationary distribution of a perturbed Markov chain using that of an unperturbed Markov chain. [sent-195, score-0.316]

79 If we are given the stationary distribution p∞(βr) of the unperturbed chain, then we propose to approximate the perturbed stationary distribution by p̂∞(β̂r) ≈ ∫ p̂(β̂r | βr) p∞(βr) dβr. [sent-200, score-0.283]

80 To use this practically with MCMC, we first run the unperturbed MCMC chain to approximate stationarity, and then use samples of p∞(βr) to compute approximate samples from p̂∞(β̂r). [sent-203, score-0.283]

81 To map this idea to our active learning setup we conceptually let the unperturbed chain p(βr^t | βr^{t−1}) be the chain on βr induced by the Gibbs sampler in Section 4. [sent-205, score-0.498]

82 The perturbed chain p̂(β̂r^t | β̂r^{t−1}) represents the chain where we have added a new observation yi′,l′ to the measured data. [sent-206, score-0.258]

83 If we have S samples βr^s from p∞(βr), then we approximate the perturbed distribution as p̂∞(β̂r) ≈ (1/S) Σ_{s=1}^{S} p̂(β̂r | βr^s), (9) and the active learning score as U(p(βr | yi′,l′)) ≈ U(p̂∞(β̂r)). [sent-207, score-0.299]
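The same one-step idea is easy to check on a toy discrete chain, where both stationary distributions can be computed exactly. Here the samples from the unperturbed chain are replaced by its exact stationary vector, which we push through one step of the perturbed kernel in the spirit of Eq. (9) (the matrices are ours, chosen only for illustration):

```python
import numpy as np

def stationary(P, iters=2000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        p = p @ P
    return p

P = np.array([[0.8, 0.1, 0.1],      # unperturbed chain
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
P_hat = P.copy()
P_hat[0] = [0.7, 0.2, 0.1]          # perturb one row (e.g. a new observation)

p_inf = stationary(P)               # stands in for "unperturbed samples"
approx = p_inf @ P_hat              # one step of the perturbed kernel
exact = stationary(P_hat)
print(np.abs(approx - exact).max())  # small: one step already closes most of the gap
```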

84 We asked labelers to determine if the triangle is to the left or above the square. [sent-221, score-0.488]

85 Finally, we use samples from the Gibbs sampler to approximate p(yi′,l′ | xi′) and estimate I((i′, l′), p(βr)) for querying labeler l′ on task i′. [sent-237, score-0.618]

86 6 Experimental Results We evaluated our active learning method on an ambiguous localization task which asked labelers on Amazon Mechanical Turk to determine if a triangle was to the left or above a rectangle. [sent-238, score-0.75]

87 We expected labelers to use centroids, extreme points and object sizes in different ways to solve the tasks, thus leading to structurally biased responses. [sent-242, score-0.534]

88 The gold standard was to compare only the centroids of the two objects. [sent-244, score-0.268]

89 Tasks were solved by 75 labelers with moderate disagreement. [sent-246, score-0.488]

90 We provided about 60 gold standard labels to BBMC and then performed inference and active learning on βr so as to learn a predictive model emulating gold standard labels. [sent-248, score-0.827]

91 Here only about 60 gold standard labels and all the labeler data are available for training. [sent-257, score-0.817]

92 The BBMC scores were computed by running the Gibbs sampler of Section 4 with 2000 iterations of burn-in. (Footnote 1: The test set was similarly constructed by selecting from 2000 tasks those on which three labelers disagreed.) [sent-260, score-0.641]

93 Training on the gold standard only often overfits, and training on the consensus systematically misleads. [sent-296, score-0.278]

94 We repeatedly select a new task for which to receive a gold standard label from the researcher. [sent-299, score-0.373]

95 Of course, in our framework we could have just as easily queried labelers in the crowd. [sent-301, score-0.488]

96 A more realistic alternative to our model is “DIS-ACT,” which picks one of the tasks with the most labeler disagreement to label next. [sent-315, score-0.608]

97 Lastly, the baseline alternatives include “GOLD-ACT” and “CONS-ACT” which pick a random task to label and then learn logistic regressions on the gold standard or consensus labels respectively. [sent-316, score-0.527]

98 7 Conclusions We have presented Bayesian Bias Mitigation for Crowdsourcing (BBMC) as a framework to unify the three main steps in the crowdsourcing pipeline: data collection, data curation and learning. [sent-319, score-0.419]

99 Our model captures labeler bias through a flexible latent feature model and conceives of the entire pipeline in terms of probabilistic inference. [sent-320, score-0.636]

100 An important contribution is a general purpose approximation strategy for Markov chains that allows us to efficiently perform active learning, despite relying on Gibbs sampling for inference. [sent-321, score-0.26]



similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing


2 0.095904864 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems

Author: David R. Karger, Sewoong Oh, Devavrat Shah

Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1

3 0.088489778 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

Author: Jun Zhu, Ning Chen, Eric P. Xing

Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

4 0.073916927 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett

Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bagof-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not. 1

5 0.069335632 229 nips-2011-Query-Aware MCMC

Author: Michael L. Wick, Andrew McCallum

Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1

6 0.06675142 258 nips-2011-Sparse Bayesian Multi-Task Learning

7 0.063844778 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

8 0.060496267 303 nips-2011-Video Annotation and Tracking with Active Learning

9 0.057093315 162 nips-2011-Lower Bounds for Passive and Active Learning

10 0.055862796 66 nips-2011-Crowdclustering

11 0.05395988 123 nips-2011-How biased are maximum entropy models?

12 0.05129762 69 nips-2011-Differentially Private M-Estimators

13 0.051249988 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

14 0.048158802 55 nips-2011-Collective Graphical Models

15 0.046564892 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

16 0.045097534 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

17 0.043795642 131 nips-2011-Inference in continuous-time change-point models

18 0.042063441 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

19 0.041766319 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

20 0.041078061 240 nips-2011-Robust Multi-Class Gaussian Process Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, 0.032), (2, -0.009), (3, 0.007), (4, -0.014), (5, -0.098), (6, 0.02), (7, -0.096), (8, -0.021), (9, 0.046), (10, -0.03), (11, -0.058), (12, 0.039), (13, -0.052), (14, 0.053), (15, 0.015), (16, 0.001), (17, -0.036), (18, 0.091), (19, 0.007), (20, 0.049), (21, -0.051), (22, 0.035), (23, -0.1), (24, -0.008), (25, -0.044), (26, 0.039), (27, -0.034), (28, -0.012), (29, -0.041), (30, 0.111), (31, -0.019), (32, -0.057), (33, 0.035), (34, -0.014), (35, -0.062), (36, -0.091), (37, -0.011), (38, 0.039), (39, -0.088), (40, -0.007), (41, -0.032), (42, -0.015), (43, 0.02), (44, 0.043), (45, 0.102), (46, 0.01), (47, -0.098), (48, -0.053), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91383129 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

Author: Fabian L. Wauthier, Michael I. Jordan

Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. 1

2 0.58311117 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

Author: Jun Zhu, Ning Chen, Eric P. Xing

Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

3 0.57133269 66 nips-2011-Crowdclustering

Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona

Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1

4 0.56656516 240 nips-2011-Robust Multi-Class Gaussian Process Classification

Author: Daniel Hernández-lobato, Jose M. Hernández-lobato, Pierre Dupont

Abstract: Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice.

5 0.56324375 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

Author: Vikas C. Raykar, Shipeng Yu

Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a dataset labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Often we have low quality annotators or spammers: annotators who assign labels randomly (e.g., without actually looking at the instance). Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a high score close to one.

1 Spammers in crowdsourced labeling tasks

Annotating an unlabeled dataset is one of the bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon's Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see [8], [10], and [11] for some computer vision and natural language processing case studies). One drawback of most crowdsourcing services is that we do not have tight control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence, in order to get good quality labels, requestors typically get each instance labeled by multiple annotators, and these multiple annotations are then consolidated either using simple majority voting or more sophisticated methods that model and correct for the annotator biases [3, 9, 6, 7, 14] and/or the task complexity [2, 13, 12].
In this paper we are interested in ranking annotators based on how spammer-like each annotator is. In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator). Spammers can significantly increase the cost of acquiring annotations (since they need to be paid) and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing marketplace. For example, one can give monetary bonuses to good annotators and deny payments to spammers. The main contribution of this paper is to formalize the notion of a spammer for binary, categorical, and ordinal labeling tasks. More specifically, we define a scalar metric which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a score close to one (see Figure 4). We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer-like the annotator is. While this spammer score was implicit for binary labels in earlier works [3, 9, 2, 6], the extension to categorical and ordinal labels is novel and is quite different from the accuracy computed from the confusion rate matrix. An attempt to quantify the quality of the workers based on the confusion matrix was recently made by [4], where the observed labels are transformed into posterior soft labels based on the estimated confusion matrix. While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly defined in terms of the annotator parameters (see § 5 for more details). The rest of the paper is organized as follows. For ease of exposition we start with binary labels (§ 2) and later extend to categorical (§ 3) and ordinal (§ 4) labels.
We first specify the annotator model used, formalize the notion of a spammer, and propose an appropriate score in terms of the annotator model parameters. We do not dwell too much on the estimation of the annotator model parameters. These parameters can either be estimated directly using a known gold standard¹ or via iterative algorithms that estimate the annotator model parameters without actually knowing the gold standard [3, 9, 2, 6, 7]. In the experimental section (§ 6) we obtain rankings for the annotators using the proposed spammer scores on some publicly available data from different domains.

2 Spammer score for crowdsourced binary labels

Annotator model. Let y_i^j ∈ {0, 1} be the label assigned to the ith instance by the jth annotator, and let y_i ∈ {0, 1} be the actual (unobserved) binary label. We model the accuracy of the annotator separately on the positive and the negative examples. If the true label is one, the sensitivity (true positive rate) α^j of the jth annotator is defined as the probability that the annotator labels the instance as one:

    α^j := Pr[y_i^j = 1 | y_i = 1].

If the true label is zero, the specificity (one minus the false positive rate) β^j is defined as the probability that the annotator labels the instance as zero:

    β^j := Pr[y_i^j = 0 | y_i = 0].

Extensions of this basic model have been proposed to include item-level difficulty [2, 13] and to model the annotator performance as a function of the feature vector [14]. For simplicity we use the basic model proposed in [7] in our formulation. Given many instances labeled by multiple annotators, the maximum likelihood estimates of the annotator parameters (α^j, β^j) and of the consensus ground truth (y_i) can be computed iteratively [3, 7] via the Expectation Maximization (EM) algorithm.
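The iterative estimation referenced above can be sketched in a few lines. The following is a minimal NumPy illustration of the two-coin model of [3, 7], not the authors' code: it assumes a complete N × M binary label matrix with no missing entries and a soft majority-vote initialization, and the function name and interface are our own.

```python
import numpy as np

def em_binary(labels, n_iter=50):
    """EM for the two-coin binary annotator model (cf. [3, 7]).

    labels: (N, M) array of 0/1 labels, one column per annotator.
    Returns (alpha, beta, mu): per-annotator sensitivity/specificity
    estimates and the posterior P(y_i = 1) for each instance.
    """
    labels = np.asarray(labels, dtype=float)
    mu = labels.mean(axis=1)  # soft majority-vote initialization
    for _ in range(n_iter):
        # M-step: annotator parameters and class prevalence
        alpha = (mu @ labels) / mu.sum()                    # P(label=1 | y=1)
        beta = ((1.0 - mu) @ (1.0 - labels)) / (1.0 - mu).sum()  # P(label=0 | y=0)
        p = mu.mean()
        # E-step: posterior of the latent true label per instance
        like1 = np.prod(alpha ** labels * (1 - alpha) ** (1 - labels), axis=1)
        like0 = np.prod(beta ** (1 - labels) * (1 - beta) ** labels, axis=1)
        mu = p * like1 / (p * like1 + (1 - p) * like0)
    return alpha, beta, mu
```

On simulated data from reasonably accurate annotators this recovers the (α^j, β^j) used to generate the labels and a consensus close to the truth.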
The EM algorithm iteratively establishes a particular gold standard (initialized via majority voting), measures the performance of the annotators given that gold standard (M-step), and refines the gold standard based on those performance measures (E-step).

Who is a spammer? Intuitively, a spammer assigns labels randomly, maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator. More precisely, an annotator is a spammer if the probability of the observed label y_i^j being one is independent of the true label y_i, i.e.,

    Pr[y_i^j = 1 | y_i] = Pr[y_i^j = 1].    (1)

This means that the annotator assigns labels by flipping a coin with bias Pr[y_i^j = 1] without actually looking at the data. Equivalently, (1) can be written as

    Pr[y_i^j = 1 | y_i = 1] = Pr[y_i^j = 1 | y_i = 0],    (2)

which implies α^j = 1 − β^j. Hence, in the context of the annotator model defined earlier, a perfect spammer is an annotator for whom α^j + β^j − 1 = 0. This corresponds to the diagonal line of the Receiver Operating Characteristic (ROC) plot (see Figure 1(a)).² If α^j + β^j − 1 < 0 then the annotator lies below the diagonal line and is a malicious annotator who flips the labels. Note that a malicious annotator has discriminatory power if we can detect them and flip their labels; in fact, the methods proposed in [3, 7] can automatically flip the labels for the malicious annotators. Hence we define the spammer score of an annotator as

    S^j = (α^j + β^j − 1)².    (3)

An annotator is a spammer if S^j is close to zero. Good annotators have S^j > 0 while a perfect annotator has S^j = 1.

¹ One commonly used strategy to filter out spammers is to inject into the annotations some items with known labels. This is the strategy used by CrowdFlower (http://crowdflower.com/docs/gold).
² Note also that (α^j + β^j)/2 is equal to the shaded area in the plot and can be considered a non-parametric approximation to the area under the ROC curve (AUC) based on one observed point. It is also equal to the Balanced Classification Rate (BCR), so a spammer can equivalently be defined as having BCR or AUC equal to 0.5.

Figure 1: (a) For binary labels an annotator is modeled by his/her sensitivity and specificity; a perfect spammer lies on the diagonal line of the ROC plot. (b) Contours of equal accuracy (4) and (c) of equal spammer score (3).

Accuracy. This notion of a spammer is quite different from that of the accuracy of an annotator. An annotator with high accuracy is a good annotator, but one with low accuracy is not necessarily a spammer. The accuracy is computed as

    Accuracy^j = Pr[y_i^j = y_i] = Σ_{k=0}^{1} Pr[y_i^j = k | y_i = k] Pr[y_i = k] = α^j p + β^j (1 − p),    (4)

where p := Pr[y_i = 1] is the prevalence of the positive class. Note that the accuracy depends on the prevalence. Our proposed spammer score does not depend on the prevalence and essentially quantifies the annotator's inherent discriminatory power. Figure 1(b) shows the contours of equal accuracy on the ROC plot.
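The contrast between the spammer score (3) and the accuracy (4) is easy to verify numerically. A minimal sketch (the function names are ours, NumPy assumed):

```python
import numpy as np

def spammer_score(alpha, beta):
    # Eq. (3): zero for a coin-flipper, one for a perfect annotator,
    # and high for a label-flipping (malicious) annotator as well.
    return (np.asarray(alpha, dtype=float) + np.asarray(beta, dtype=float) - 1.0) ** 2

def annotator_accuracy(alpha, beta, p):
    # Eq. (4): alpha * p + beta * (1 - p); depends on the prevalence p.
    return np.asarray(alpha, dtype=float) * p + np.asarray(beta, dtype=float) * (1.0 - p)
```

A malicious annotator with (α, β) = (0.1, 0.1) receives the same spammer score as a good one with (0.9, 0.9), even though its accuracy at p = 0.5 is only 0.1; and accuracy, unlike the spammer score, moves with the prevalence p.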
Note that annotators below the diagonal line (malicious annotators) have low accuracy. The malicious annotators are actually good annotators who flip their labels; they are not spammers if we can detect them and correct for the flipping. In fact the EM algorithms [3, 7] can correctly flip the labels for the malicious annotators, and hence they should not be treated as spammers. Figure 1(c) shows the contours of equal value for our proposed score: the malicious annotators have a high score, and only annotators along the diagonal have a low score (spammers).

Log-odds. Another interpretation of a spammer can be seen from the log-odds. Using Bayes' rule the posterior log-odds can be written as

    log ( Pr[y_i = 1 | y_i^j] / Pr[y_i = 0 | y_i^j] ) = log ( Pr[y_i^j | y_i = 1] / Pr[y_i^j | y_i = 0] ) + log ( p / (1 − p) ).

If an annotator is a spammer (i.e., (2) holds) then log ( Pr[y_i = 1 | y_i^j] / Pr[y_i = 0 | y_i^j] ) = log ( p / (1 − p) ). Essentially the annotator provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label.

3 Spammer score for categorical labels

Annotator model. Suppose there are K ≥ 2 categories. We introduce a multinomial parameter α_c^j = (α_c1^j, . . . , α_cK^j) for each annotator, where

    α_ck^j := Pr[y_i^j = k | y_i = c]   and   Σ_{k=1}^{K} α_ck^j = 1.

The term α_ck^j denotes the probability that annotator j assigns class k to an instance given that the true class is c. When K = 2, α_11^j and α_00^j are the sensitivity and specificity, respectively.

Who is a spammer? As earlier, a spammer assigns labels randomly, i.e.,

    Pr[y_i^j = k | y_i] = Pr[y_i^j = k], ∀k.

This is equivalent to Pr[y_i^j = k | y_i = c] = Pr[y_i^j = k | y_i = c′], ∀c, c′, k = 1, . . . , K, which means that knowing whether the true class label is c or c′ does not change the distribution of the annotator's assigned label. Hence annotator j is a spammer if

    α_ck^j = α_c′k^j, ∀c, c′, k = 1, . . . , K.    (5)
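The condition above (eq. (5)) says a spammer's confusion matrix has identical rows, i.e., it is (close to) rank one. A hedged sketch of the Frobenius-distance score that the text derives next, using the fact that the minimizing row vector is the mean of the rows (NumPy assumed, function name ours):

```python
import numpy as np

def categorical_spammer_score(A):
    # Squared Frobenius distance from the K x K confusion matrix A to
    # its closest rank-one approximation e v^T with v^T e = 1; the
    # constrained minimizer v is the mean of the rows of A.
    A = np.asarray(A, dtype=float)
    v_hat = A.mean(axis=0)                   # mean of the rows
    return float(np.sum((A - v_hat) ** 2))   # broadcasting forms e v_hat^T
```

A perfect spammer (all rows equal) scores exactly zero, and the result agrees with the pairwise row-difference form (1/K) Σ_{c<c′} Σ_k (α_ck − α_c′k)².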
Let A^j be the K × K confusion rate matrix with entries [A^j]_ck = α_ck^j. A spammer would have all the rows of A^j equal, for example

    A^j = [ 0.50 0.25 0.25
            0.50 0.25 0.25
            0.50 0.25 0.25 ]

for a three-class categorical annotation problem. Essentially A^j is then a rank-one matrix of the form A^j = e v_j^T, for some column vector v_j ∈ R^K that satisfies v_j^T e = 1, where e is a column vector of ones. In the binary case we had the natural notion of a spammer as an annotator for whom α^j + β^j − 1 is close to zero. One natural way to summarize (5) is in terms of the distance (in Frobenius norm) of the confusion matrix to its closest rank-one approximation:

    S^j := ||A^j − e v̂_j^T||_F²,    (6)

where v̂_j solves

    v̂_j = argmin_{v_j} ||A^j − e v_j^T||_F²  subject to  v_j^T e = 1.    (7)

Solving (7) yields v̂_j = (1/K) (A^j)^T e, which is the mean of the rows of A^j. Then from (6) we have

    S^j = ||(I − (1/K) e e^T) A^j||_F² = (1/K) Σ_{c<c′} Σ_k (α_ck^j − α_c′k^j)².    (8)

4 Spammer score for ordinal labels

Here the labels have a natural ordering 1 < 2 < . . . < K.

Annotator model. It is conceptually easier to think of the true label as binary, that is, y_i ∈ {0, 1}. For example, in mammography a lesion is either malignant (1) or benign (0) (which can be confirmed by biopsy), and the BIRADS ordinal scale is a means for the radiologist to quantify the uncertainty based on the digital mammogram; the radiologist assigns a higher value of the label if he/she thinks the true label is closer to one. As earlier, we characterize each annotator by sensitivity and specificity, but the main difference is that we now define them for each ordinal label (or threshold) k ∈ {1, . . . , K}. Let α_k^j and β_k^j be the sensitivity and specificity, respectively, of the jth annotator at threshold k, that is,

    α_k^j = Pr[y_i^j ≥ k | y_i = 1]   and   β_k^j = Pr[y_i^j < k | y_i = 0].

Note that α_1^j = 1, β_1^j = 0 and α_{K+1}^j = 0, β_{K+1}^j = 1 by this definition. Hence each annotator is parameterized by the set of 2(K − 1) parameters [α_2^j, β_2^j, . . . , α_K^j, β_K^j].
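Sticking with the threshold-wise parameterization just defined, the trapezoidal area under the empirical ROC curve and the AUC-based score used below can be sketched as follows. The function name and the convention of passing the boundary values α_1 = 1, β_1 = 0 and α_{K+1} = 0, β_{K+1} = 1 explicitly are our own choices, not the paper's.

```python
import numpy as np

def ordinal_spammer_score(alpha, beta):
    # alpha[k], beta[k] are the per-threshold sensitivity/specificity,
    # including the boundary values alpha = 1, beta = 0 at the first
    # entry and alpha = 0, beta = 1 at the last.  The trapezoid rule
    # over the points (1 - beta_k, alpha_k) gives the empirical AUC;
    # the score is (2 * AUC - 1)^2.
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    auc = 0.5 * np.sum((alpha[:-1] + alpha[1:]) * (beta[1:] - beta[:-1]))
    return float((2.0 * auc - 1.0) ** 2)
```

An annotator whose ROC points all lie on the diagonal (α_k = 1 − β_k) scores zero, a perfect annotator scores one, and with a single interior threshold the score reduces to the binary (α + β − 1)².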
This corresponds to an empirical ROC curve for the annotator (Figure 2).

Figure 2: Ordinal labels: an annotator is modeled by a sensitivity/specificity pair for each threshold.

Who is a spammer? As earlier, we define annotator j to be a spammer if

    Pr[y_i^j = k | y_i = 1] = Pr[y_i^j = k | y_i = 0], ∀k = 1, . . . , K.

From the annotation model we have³ Pr[y_i^j = k | y_i = 1] = α_k^j − α_{k+1}^j and Pr[y_i^j = k | y_i = 0] = β_{k+1}^j − β_k^j. This implies that annotator j is a spammer if α_k^j − α_{k+1}^j = β_{k+1}^j − β_k^j for all k = 1, . . . , K, which leads to α_k^j + β_k^j = α_1^j + β_1^j = 1, ∀k. This means that for every k, the point (1 − β_k^j, α_k^j) lies on the diagonal line of the ROC plot shown in Figure 2. The area under the empirical ROC curve can be computed (see Figure 2) as

    AUC^j = (1/2) Σ_{k=1}^{K} (α_{k+1}^j + α_k^j)(β_{k+1}^j − β_k^j),

and can be used to define the following spammer score, (2 AUC^j − 1)², to rank the different annotators:

    S^j = ( Σ_{k=1}^{K} (α_{k+1}^j + α_k^j)(β_{k+1}^j − β_k^j) − 1 )².    (9)

With two levels this expression defaults to the binary case. An annotator is a spammer if S^j is close to zero. Good annotators have S^j > 0 while a perfect annotator has S^j = 1.

5 Previous work

Recently Ipeirotis et al. [4] proposed a score for categorical labels based on the expected cost of the posterior label. In this section we briefly describe their approach and compare it with our proposed score. For each instance labeled by an annotator they first compute the posterior (soft) label Pr[y_i = c | y_i^j] for c = 1, . . . , K, where y_i^j is the label assigned to the ith instance by the jth annotator and y_i is the true unknown label. The posterior label is computed via Bayes' rule as

    Pr[y_i = c | y_i^j] ∝ Pr[y_i^j | y_i = c] Pr[y_i = c] = Π_k (α_ck^j)^{δ(y_i^j, k)} p_c,

where p_c = Pr[y_i = c] is the prevalence of class c.
The score for a spammer is based on the intuition that the posterior label vector (Pr[y_i = 1 | y_i^j], . . . , Pr[y_i = K | y_i^j]) for a good annotator will have all its probability mass concentrated on a single class. For example, for a three-class problem (with equal prevalence), a posterior label vector of (1, 0, 0) (certain that the class is one) comes from a good annotator, while (1/3, 1/3, 1/3) (complete uncertainty about the class label) comes from a spammer. Based on this they define the following score for each annotator:

    Score^j = (1/N) Σ_{i=1}^{N} Σ_{c=1}^{K} Σ_{k=1}^{K} cost_ck Pr[y_i = k | y_i^j] Pr[y_i = c | y_i^j],    (10)

where cost_ck is the misclassification cost when an instance of class c is classified as k. Essentially this captures a notion of the uncertainty of the posterior label, averaged over all the instances. Perfect workers have a score Score^j = 0 while spammers have a high score. An entropic version of this score based on similar ideas has also been proposed recently [5]. Our proposed spammer score differs from this approach in the following aspects. (1) Implicit in the score defined in (10) is the assumption that an annotator is a spammer when Pr[y_i = c | y_i^j] = Pr[y_i = c], i.e., the estimated posterior labels are simply based on the prevalence and do not depend on the observed labels. By Bayes' rule this is equivalent to Pr[y_i^j | y_i = c] = Pr[y_i^j], which is what we have used to define our spammer score. (2) While both notions of a spammer are equivalent, the approach of [4] first computes the posterior labels based on the observed data, the class prevalence and the annotator³

³ This can be seen as follows: Pr[y_i^j = k | y_i = 1] = Pr[(y_i^j ≥ k) AND (y_i^j < k + 1) | y_i = 1] = Pr[y_i^j ≥ k | y_i = 1] + Pr[y_i^j < k + 1 | y_i = 1] − Pr[(y_i^j ≥ k) OR (y_i^j < k + 1) | y_i = 1] = Pr[y_i^j ≥ k | y_i = 1] − Pr[y_i^j ≥ k + 1 | y_i = 1] = α_k^j − α_{k+1}^j. Here we used the fact that Pr[(y_i^j ≥ k) OR (y_i^j < k + 1)] = 1.
Figure 3: (a) The simulation setup, consisting of 10 good annotators (annotators 1 to 10), 10 spammers (11 to 20), and 10 malicious annotators (21 to 30). (b) The ranking of annotators obtained using the proposed spammer score. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown, obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. (c) and (d) Comparison of the median rank obtained via the spammer score with the rank obtained using (c) accuracy and (d) the method proposed by Ipeirotis et al. [4].

parameters, and then computes the expected cost. Our proposed spammer score does not depend on the prevalence of the class.
Our score is also directly defined only in terms of the annotator confusion matrix and does not need the observed labels. (3) For the score defined in (10), while perfect annotators have a score of 0, it is not clear what a good baseline for a spammer should be; the authors suggest computing the baseline by assuming that a worker assigns as label the class with maximum prevalence. Our proposed score has a natural scale, with a perfect annotator having a score of 1 and a spammer a score of 0. (4) However, one advantage of the approach in [4] is that it can directly incorporate varied misclassification costs.

6 Experiments

Ranking annotators based on the confidence interval. As mentioned earlier, the annotator model parameters can be estimated using the iterative EM algorithms [3, 7], and the estimated parameters can then be used to compute the spammer score, which in turn can be used to rank the annotators. However, one commonly observed phenomenon when working with crowdsourced data is that many annotators label only a very few instances, so their parameters cannot be reliably estimated. In order to factor this uncertainty in the estimation of the model parameters into the ranking, we compute the spammer score over 100 bootstrap replications, and from these compute a 95% confidence interval (CI) for the spammer score of each annotator. We rank the annotators based on the lower limit of the 95% CI. The CIs are wider

Table 1: Datasets. N is the number of instances. M is the number of annotators. M* is the mean/median number of annotators per instance. N* is the mean/median number of instances labeled by each annotator.
Dataset    Type                N     M    M*      N*       Brief description
bluebird   binary              108   39   39/39   108/108  bird identification [12]: the annotator had to identify whether there was an Indigo Bunting or Blue Grosbeak in the image.
temp       binary              462   76   10/10   61/16    event annotation [10]: given a dialogue and a pair of verbs, annotators label whether the event described by the first verb occurs before or after the second.
wsd        categorical/3       177   34   10/10   52/20    word sense disambiguation [10]: the labeler is given a paragraph of text containing the word "president" and asked to label one of three appropriate senses.
sentiment  categorical/3       1660  33   6/6     291/175  irish economic sentiment analysis [1]: articles from three Irish online news sources were annotated by volunteer users as positive, negative, or irrelevant.
wosi       ordinal/[0 10]      30    10   10/10   30/30    word similarity [10]: numeric judgements of word similarity.
valence    ordinal/[-100 100]  100   38   10/10   26/20    affect recognition [10]: each annotator is presented with a short headline and asked to rate it on a scale [-100, 100] to denote the overall positive or negative valence.
Figure 4: Annotator rankings obtained for the datasets in Table 1. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown, obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances.

when the annotator labels only a few instances. For a crowdsourced labeling task the annotator has to be good and also label a reasonable number of instances in order to be reliably identified.
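The bootstrap-CI ranking described above can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes binary labels, scores against given consensus labels rather than re-running the EM consensus on each replicate, and all names are our own.

```python
import numpy as np

def rank_by_bootstrap_ci(labels, truth, n_boot=100, seed=0):
    """Bootstrap the binary spammer score (alpha + beta - 1)^2 over
    instances and rank annotators by the lower limit of the 95% CI.

    labels: (N, M) 0/1 array; truth: (N,) consensus labels.
    Returns (order, lower): annotator indices best-first, and the
    lower CI limits used for the ranking.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    truth = np.asarray(truth)
    N, M = labels.shape
    scores = np.empty((n_boot, M))
    for b in range(n_boot):
        idx = rng.integers(0, N, size=N)           # resample instances
        L, y = labels[idx], truth[idx]
        alpha = L[y == 1].mean(axis=0)             # sensitivity
        beta = 1.0 - L[y == 0].mean(axis=0)        # specificity
        scores[b] = (alpha + beta - 1.0) ** 2
    lower = np.percentile(scores, 2.5, axis=0)     # lower 95% CI limit
    order = np.argsort(-lower)                     # best annotator first
    return order, lower
```

On a toy setup with one good, one spamming, and one malicious annotator, the coin-flipper ends up last, while both the good and the malicious annotator keep a high lower limit.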
Simulated data. We first illustrate our proposed spammer score on simulated binary data (with equal prevalence for both classes) consisting of 500 instances labeled by 30 annotators of varying sensitivity and specificity (see Figure 3(a) for the simulation setup). Of the 30 annotators we have 10 good annotators (annotators 1 to 10, who lie above the diagonal in Figure 3(a)), 10 spammers (annotators 11 to 20, who lie around the diagonal), and 10 malicious annotators (annotators 21 to 30, who lie below the diagonal). Figure 3(b) plots the ranking of annotators obtained using the proposed spammer score, with the annotator model parameters estimated via the EM algorithm [3, 7]. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence interval (CI) obtained via bootstrapping are shown, and the annotators are ranked based on the lower limit of the 95% CI. As can be seen, all the spammers (annotators 11 to 20) have a low spammer score and appear at the bottom of the list. The malicious annotators have a higher score than the spammers since we can correct for their flipping: they are good annotators who flip their labels, and as such are not spammers once detected as malicious. Figure 3(c) compares the (median) rank obtained via the spammer score with the (median) rank obtained using accuracy as the ranking score. While the good annotators are ranked high by both methods, the accuracy score gives a low rank to the malicious annotators; accuracy does not capture the notion of a spammer. Figure 3(d) compares the ranking with the method proposed by Ipeirotis et al. [4], which produces rankings very similar to our proposed score.
Figure 5: Comparison of the rank obtained via the spammer score with the rank obtained using (a) accuracy and (b) the method proposed by Ipeirotis et al. [4] for the bluebird binary dataset. (c) The annotator model parameters as estimated by the EM algorithm [3, 7].
Figure 6: Comparison of the median rank obtained via the spammer score with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for the two categorical datasets in Table 1.

Mechanical Turk data. We report results on some publicly available linguistic and image annotation data collected using Amazon's Mechanical Turk (AMT) and other sources. Table 1 summarizes the datasets. Figure 4 plots the spammer scores and rankings obtained. The mean and the 95% CI obtained via bootstrapping are also shown, and the number at the top of the CI bar shows the number of instances annotated by that annotator. The rankings are based on the lower limit of the 95% CI, which factors the number of instances labeled by each annotator into the ranking. An annotator who labels only a few instances will have a very wide CI; some such annotators may have a high mean spammer score, but the CI will be wide and they will hence be ranked lower.
Ideally we would like to have annotators with a high score who at the same time label a lot of instances, so that we can reliably identify them. For the sentiment dataset, the authors of [1] shared with us some qualitative observations regarding the annotators, and these largely agree with our rankings. For example, the authors made the following comment about Annotator 7: "Quirky annotator - had a lot of debate about what was the meaning of the annotation question. I'd say he changed his labeling strategy at least once during the process". Our proposed score gave a low rank to this annotator.

Comparison with other approaches. Figures 5 and 6 compare the proposed ranking with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for some binary and categorical datasets in Table 1. Our proposed ranking is broadly similar to that obtained by Ipeirotis et al. [4], but accuracy does not quite capture the notion of a spammer. For example, for the bluebird dataset, accuracy ranks annotator 21 (see Figure 5(a)) at the bottom of the list, while the proposed score puts it in the middle. From the estimated model parameters it can be seen that annotator 21 actually flips the labels (below the diagonal in Figure 5(c)) but is otherwise a good annotator.

7 Conclusions

We proposed a score to rank annotators for crowdsourced binary, categorical, and ordinal labeling tasks. The obtained rankings and scores can be used to allocate monetary bonuses to different annotators and to eliminate spammers from further labeling tasks. A mechanism to rank annotators should be a desirable feature of any crowdsourcing service. The proposed score should also be useful for specifying the prior in Bayesian approaches to consolidating annotations.

References

[1] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media.
In Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS'10), 2010. [2] B. Carpenter. Multilevel Bayesian models of categorical data annotation. Technical Report, available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. [3] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. [4] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP'10), pages 64–67, 2010. [5] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labelling tasks. In Proceedings of the Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011. [6] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 889–896, 2009. [7] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010. [8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. [9] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems 7, pages 1085–1092, 1995. [10] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is it Good? Evaluating non-expert annotations for natural language tasks.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 254–263, 2008. [11] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 2008, pages 1–8, 2008. [12] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010. [13] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043, 2009. [14] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 932–939, 2010.

6 0.55020452 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems

7 0.53091711 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

8 0.52088666 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

9 0.50956303 33 nips-2011-An Exact Algorithm for F-Measure Maximization

10 0.49627835 14 nips-2011-A concave regularization technique for sparse mixture models

11 0.48775944 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

12 0.47925678 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

13 0.46771625 104 nips-2011-Generalized Beta Mixtures of Gaussians

14 0.46587715 229 nips-2011-Query-Aware MCMC

15 0.46130794 8 nips-2011-A Model for Temporal Dependencies in Event Streams

16 0.45073614 221 nips-2011-Priors over Recurrent Continuous Time Processes

17 0.44937935 55 nips-2011-Collective Graphical Models

18 0.44215888 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

19 0.43511808 277 nips-2011-Submodular Multi-Label Learning

20 0.43025336 60 nips-2011-Confidence Sets for Network Structure


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (4, 0.044), (9, 0.011), (20, 0.03), (26, 0.045), (31, 0.128), (33, 0.039), (43, 0.063), (45, 0.082), (57, 0.024), (74, 0.041), (80, 0.281), (83, 0.041), (84, 0.012), (86, 0.01), (99, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76815724 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

Author: Fabian L. Wauthier, Michael I. Jordan

Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. 1

2 0.73423171 285 nips-2011-The Kernel Beta Process

Author: Lu Ren, Yingjian Wang, Lawrence Carin, David B. Dunson

Abstract: A new Lévy process prior is proposed for an uncountable collection of covariate-dependent feature-learning measures; the model is called the kernel beta process (KBP). Available covariates are handled efficiently via the kernel construction, with covariates assumed observed with each data sample (“customer”), and latent covariates learned for each feature (“dish”). Each customer selects dishes from an infinite buffet, in a manner analogous to the beta process, with the added constraint that a customer first decides probabilistically whether to “consider” a dish, based on the distance in covariate space between the customer and dish. If a customer does consider a particular dish, that dish is then selected probabilistically as in the beta process. The beta process is recovered as a limiting case of the KBP. An efficient Gibbs sampler is developed for computations, and state-of-the-art results are presented for image processing and music analysis tasks. 1

3 0.55630666 229 nips-2011-Query-Aware MCMC

Author: Michael L. Wick, Andrew McCallum

Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1

4 0.55145425 75 nips-2011-Dynamical segmentation of single trials from population neural data

Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Simultaneous recordings of many neurons embedded within a recurrently connected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1

5 0.54895806 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

Author: Armen Allahverdyan, Aram Galstyan

Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
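To make the contrast between the two objectives concrete, the sketch below scores a fixed observation sequence under a toy two-state HMM in two ways: the full likelihood, summing over all hidden paths (the ML objective), versus the probability of the single best hidden path (the VT objective). The parameters and observation sequence are illustrative inventions, not taken from the paper.

```python
import numpy as np

# Toy 2-state HMM over a binary alphabet {0, 1}.
pi = np.array([0.6, 0.4])            # initial state distribution
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])          # transition matrix A[i, j] = P(j | i)
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])          # emission matrix B[i, o] = P(o | i)

obs = [0, 0, 1, 1, 0]                # a fixed observation sequence

def likelihood(obs):
    """Forward algorithm: P(obs) summed over all hidden paths (ML objective)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi_prob(obs):
    """Probability of the single most likely hidden path (VT objective)."""
    delta = pi * B[:, obs[0]]
    for o in obs[1:]:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return delta.max()

full = likelihood(obs)
best = viterbi_prob(obs)
print(f"full likelihood = {full:.5f}, best-path probability = {best:.5f}")
```

The best path accounts for only part of the total likelihood, which is why maximizing one quantity rather than the other can lead to different parameter estimates.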

6 0.54710042 180 nips-2011-Multiple Instance Filtering

7 0.5456329 241 nips-2011-Scalable Training of Mixture Models via Coresets

8 0.5441013 249 nips-2011-Sequence learning with hidden units in spiking neural networks

9 0.54394752 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

10 0.54125941 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

11 0.54051852 301 nips-2011-Variational Gaussian Process Dynamical Systems

12 0.54027689 221 nips-2011-Priors over Recurrent Continuous Time Processes

13 0.53983271 66 nips-2011-Crowdclustering

14 0.53704494 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

15 0.53702939 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

16 0.53649533 158 nips-2011-Learning unbelievable probabilities

17 0.53605276 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

18 0.53598779 197 nips-2011-On Tracking The Partition Function

19 0.5358181 55 nips-2011-Collective Graphical Models

20 0.53433025 156 nips-2011-Learning to Learn with Compound HD Models