nips nips2011 nips2011-137 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. [sent-3, score-1.04]
2 We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. [sent-7, score-0.966]
3 We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. [sent-8, score-0.531]
4 Crowdsourcing systems such as Amazon Mechanical Turk provide a market where a “taskmaster” can submit batches of small tasks to be completed for a small fee by any worker choosing to pick them up. [sent-11, score-0.541]
5 For example, a worker may be able to earn a few cents by indicating which images from a set of 30 are suitable for children (one of the benefits of crowdsourcing is its applicability to such highly subjective questions). [sent-12, score-0.539]
6 Since typical crowdsourced tasks are tedious and the reward is small, errors are common even among workers who make an effort. [sent-13, score-0.558]
7 At the extreme, some workers are “spammers”, submitting arbitrary answers independent of the question in order to collect their fee. [sent-14, score-0.693]
8 Thus, all crowdsourcers need strategies to ensure the reliability of answers. [sent-15, score-0.24]
9 Because the worker crowd is large, anonymous, and transient, it is generally difficult to build up a trust relationship with particular workers. [sent-16, score-0.415]
10 1 It is also difficult to condition payment on correct answers, as the correct answer may never truly be known and delaying payment can annoy workers and make it harder to recruit them for your future tasks. [sent-17, score-0.653]
11 Instead, most crowdsourcers resort to redundancy, giving each task to multiple workers, paying them all irrespective of their answers, and aggregating the results by some method such as majority voting. [sent-18, score-0.319]
12 1 For certain high-value tasks, crowdsourcers can use entrance exams to “prequalify” workers and block spammers, but this increases the cost and still provides no guarantee that the prequalified workers will try hard. [sent-23, score-0.972]
13 1 taskmaster wishes to achieve a certain reliability in their answers, how can they do so at minimum cost (which is equivalent to asking how they can do so while asking the fewest possible questions)? [sent-24, score-0.359]
14 Workers are neither persistent nor identifiable; each batch of tasks will be solved by a worker who may be completely new and who you may never see again. [sent-26, score-0.484]
15 Nonetheless, by comparing one worker’s answer to others’ on the same question, it is possible to draw conclusions about a worker’s reliability, which can be used to weight their answers to other questions in their batch. [sent-28, score-0.351]
16 We will then describe a scheme for deciding which tasks to assign to which workers and introduce a novel iterative algorithm to infer the correct answers from the workers’ responses. [sent-33, score-1.089]
17 We model a set of m tasks {ti }i∈[m] as each being associated with an unobserved ‘correct’ answer si ∈ {±1}. [sent-35, score-0.284]
18 We assign these tasks to n workers from the crowd, which we denote by {wj }j∈[n] . [sent-38, score-0.607]
19 When a task is assigned to a worker, we get a possibly inaccurate answer from the worker. [sent-39, score-0.214]
20 We use Aij ∈ {±1} to denote the answer if task ti is assigned to worker wj . [sent-40, score-0.68]
21 Some workers are more diligent or have more expertise than others, while some other workers might be spammers. [sent-41, score-0.924]
22 We choose a simple model to capture this diversity in workers’ reliability: we assume that each worker wj is characterized by a reliability pj ∈ [0, 1], and that they make errors randomly on each question they answer. [sent-42, score-0.819]
23 Precisely, if task ti is assigned to worker wj then Aij = si −si with probability pj , with probability 1 − pj , and Aij = 0 if ti is not assigned to wj . [sent-43, score-1.394]
24 The random variable Aij is independent of any other event given pj . [sent-44, score-0.231]
25 ) The underlying assumption here is that the error probability of a worker does not depend on the particular task and all the tasks share an equal level of difficulty. [sent-46, score-0.592]
26 We further assume that the reliability of workers {pj }j∈[n] are independent and identically distributed random variables with a given distribution on [0, 1]. [sent-48, score-0.597]
27 One example is spammer-hammer model where, each worker is either a ‘hammer’ with probability q or is a ‘spammer’ with probability 1 − q. [sent-49, score-0.369]
28 A hammer answers all questions correctly, in which case pj = 1, and a spammer gives random answers, in which case pj = 1/2. [sent-50, score-0.839]
29 Given this random variable pj , we define an important parameter q ∈ [0, 1], which captures the ‘average quality’ of the crowd: q ≡ E[(2pj − 1)2 ]. [sent-51, score-0.231]
30 A value of q close to one indicates that a large proportion of the workers are diligent, whereas q close to zero indicates that there are many spammers in the crowd. [sent-52, score-0.519]
31 We will see later that our bound on the error rate of our inference algorithm holds for any distribution of pj but depends on the distribution only through this parameter q. [sent-54, score-0.36]
32 It is quite realistic to assume the existence of a prior distribution for pj . [sent-55, score-0.231]
33 The model is therefore quite general: in particular, it is met if we simply randomize the order in which we upload our task batches, since this will have the effect of randomizing which workers perform which batches, yielding a distribution that meets our requirements. [sent-56, score-0.521]
34 Under this crowdsourcing model, a taskmaster first decides which tasks should be assigned to which workers, and then estimates the correct solutions {si }i∈[m] once all the answers {Aij } are submitted. [sent-60, score-0.737]
35 We assume a one-shot scenario in which all questions are asked simultaneously and then an estimation is performed after all the answers are obtained. [sent-61, score-0.301]
36 In particular, we do not allow allocating 2 tasks adaptively based on the answers received thus far. [sent-62, score-0.365]
37 Then, assigning tasks to nodes amounts to designing a bipartite graph G({ti }i∈[m] ∪ {wj }j∈[n] , E) with m task and n worker nodes. [sent-63, score-0.778]
38 Each edge (i, j) ∈ E indicates that task ti was assigned to worker wj . [sent-64, score-0.63]
39 A naive approach to identify the correct answer from multiple workers’ responses is to use majority voting. [sent-66, score-0.247]
40 Majority voting simply chooses what the majority of workers agree on. [sent-67, score-0.714]
41 When there are many spammers, majority voting is error-prone since it weights all the workers equally. [sent-68, score-0.714]
42 We will show that majority voting is provably sub-optimal and can be significantly improved upon. [sent-69, score-0.271]
43 To infer the answers of the tasks and also the reliability of workers, Dawid and Skene [1, 2] proposed an algorithm based on expectation maximization (EM) [3]. [sent-70, score-0.546]
44 In this work, we provide a rigorous treatment of designing a crowdsourcing system with the aim of minimizing the budget to achieve completion of a set of tasks with a certain reliability. [sent-80, score-0.409]
45 We provide both an asymptotically optimal graph construction (random regular bipartite graph) and an asymptotically optimal algorithm for inference (iterative algorithm) on that graph. [sent-81, score-0.347]
46 The surprise lies in the fact that the optimality of our algorithm is established by comparing it with the best algorithm, one that is free to choose any graph, regular or irregular, and performs optimal estimation based on the information provided by an oracle about reliability of the workers. [sent-83, score-0.283]
47 None of the prior work on crowdsourcing provides any systematic treatment of the graph construction. [sent-85, score-0.251]
48 The iterative algorithm we introduce operates on real-valued messages whose distribution is a priori difficult to analyze. [sent-88, score-0.254]
49 2 Main result Under the crowdsourcing model introduced, we want to design algorithms to assign tasks and estimate the answers. [sent-91, score-0.334]
50 In what follows, we explain how to assign tasks using a random regular graph and introduce a novel iterative algorithm to infer the correct answers. [sent-92, score-0.496]
51 We state the performance guarantees for our algorithm and provide comparisons to majority voting and an oracle estimator. [sent-93, score-0.341]
52 Assigning tasks amounts to designing a bipartite graph G {ti }i∈[m] ∪ {wj }j∈[n] , E , where each edge corresponds to a task-worker assignment. [sent-95, score-0.303]
53 The taskmaster makes a choice of how many workers to assign to each task (the left degree l) and how many tasks to assign to each worker (the right degree r). [sent-96, score-1.211]
54 Since the total number of edges has to be consistent, the number of workers n directly follows from ml = nr. [sent-97, score-0.478]
55 To generate an (l, r)-regular bipartite graph we use a random graph generation scheme known as the configuration model in random graph literature [9, 10]. [sent-98, score-0.322]
56 In principle, one could use arbitrary bipartite graph G for task allocation. [sent-99, score-0.238]
57 We introduce a novel iterative algorithm which operates on real-valued task messages {xi→j }(i,j)∈E and worker messages {yj→i }(i,j)∈E . [sent-102, score-0.805]
58 The worker messages are initialized as independent Gaussian random variables. [sent-103, score-0.473]
59 Intuitively, a worker message yj→i represents our belief on how ‘reliable’ the worker j is, such that our final estimate is a weighted sum of the answers weighted by each worker’s reliability: si = sign( j∈∂i Aij yj→i ). [sent-105, score-1.134]
60 First, the iterative algorithm does not require any knowledge of the prior distribution of pj , whereas the standard BP requires the knowledge of the distribution. [sent-111, score-0.381]
61 On the otherhand, the iterative algorithm only passes messages that are real numbers regardless of the prior distribution of pj , which is easy to implement. [sent-113, score-0.485]
62 It is also very simple to write down the density evolution equations for the iterative algorithm, but it is not trivial to analyze the densities in this case either. [sent-117, score-0.286]
63 Further, an algorithm independent lower bound that we establish suggests that such an error dependence on lq is unavoidable. [sent-122, score-0.246]
64 For fixed l > 1 and r > 1, assume that m tasks are assigned to n = ml/r workers according to a random (l, r)-regular graph drawn from the configuration model. [sent-132, score-0.691]
65 If the distribution of the worker reliaiblity satisfy µ ≡ E[2pj − 1] > 0 and q 2 > 1/(ˆr), then for any s ∈ {±1}m , the lˆ estimates from k iterations of the iterative algorithm achieve lim sup m→∞ 1 m m P si = si {Aij }(i,j)∈E ˆ i=1 4 ≤ 2 e−lq/(2ρk ) . [sent-133, score-0.908]
66 1, lim sup lim sup k→∞ m→∞ 1 m m ≤ P si = si {Aij }(i,j)∈E ˆ 2 e−lq/(2ρ∞ ) . [sent-138, score-0.45]
67 Even if we fix the value√ q = E[(2pj − 1)2 ], different of distributions of pj can have different values of µ in the range of [q, q]. [sent-140, score-0.231]
68 Notice that the bound in (2) is only meaningful when it is less than a half, whence ˆrq 2 > 1 and lˆ ˆrq 2 < 1 may not be of interest, for the purpose lq > 6 log(2) > 4. [sent-145, score-0.232]
69 2 1 m m P si = si {Aij }(i,j)∈E ˆ 2 ≤ e−lµ /4 . [sent-153, score-0.238]
70 First, the iterative algorithm is efficient with runtime comparable to the simple majority voting which requires O(ml) operations. [sent-155, score-0.45]
71 The runtime is the worst when µ = q, which happens√ √ the spammer-hammer model, and it is the best when µ = q which happens if pj = (1 + q)/2 deterministically. [sent-162, score-0.26]
72 There exists a (non-iterative) polynomial time algorithm with runtime independent of q for computing the estimate which achieves (2), but in practice we expect that the number of iterations needed is small enough that the iterative algorithm will outperform this non-iterative algorithm. [sent-163, score-0.206]
73 If there is no assumption on µ, then we cannot distinguish if the responses came from tasks with {si }i∈[m] and workers with {pj }j∈[n] or tasks with {−si }i∈[m] and workers with {1 − pj }j∈[n] . [sent-167, score-1.347]
74 Third, our algorithm does not require any information on the distribution of pj . [sent-170, score-0.258]
75 Further, unlike other EM based algorithms, the iterative algorithm is not sensitive to initialization and with random initialization converges to a unique estimate with high probability. [sent-171, score-0.243]
76 In our case, the leading left singular vector of A can be used to estimate the correct answers, where A ∈ {0, ±1}m×n is the m × n adjacency matrix of the graph G weighted by the submitted answers. [sent-175, score-0.233]
77 In particular, if we use the leading singular vector of A to estimate s, such that si = sign(ui ), then existing analysis techniques from random matrix theory does not give the strong performance guarantee we have. [sent-181, score-0.229]
78 Since the iterative algorithm introduced in this paper is quite similar to power iteration used to compute the leading singular vectors, this suggests that our analysis may shed light on how to analyze the top singular vectors of a sparse random matrix. [sent-185, score-0.359]
79 4 Optimality of our algorithm As a taskmaster, the natural core optimization problem of our concern is how to achieve a certain reliability in our answers with minimum cost. [sent-187, score-0.476]
80 Here we compute the total budget sufficient to achieve a target error rate using our algorithm and show that this is within a constant factor from the necessary budget to achieve the given target error rate using any graph and the best possible inference algorithm. [sent-189, score-0.459]
81 all task assignments are done simultaneously, then an estimation is performed after all the answers are obtained. [sent-192, score-0.382]
82 To measure accuracy, we use the average probability of error per task denoted by dm (s, s) ≡ ˆ (1/m) i∈[m] P(si = si ). [sent-195, score-0.282]
83 We will show that Ω (1/q) log(1/ ) assignments per task is necesˆ sary and sufficient to achieve the target error rate: dm (s, s) ≤ . [sent-196, score-0.297]
84 Consider the case where nature chooses a set of correct answers s ∈ {±1}m and a distribution of the worker reliability pj ∼ f . [sent-198, score-1.046]
85 Let G(m, l) denote the set of all bipartite graphs, including irregular graphs, that have m task nodes and ml total number of edges. [sent-201, score-0.223]
86 This minimax bound is established by computing the error rate of an oracle esitimator, which makes an optimal decision based on the information provided by an oracle who knows how reliable each worker is. [sent-203, score-0.595]
87 Next, we show that the error rate of majority voting decays significantly slower: the leading term in the error exponent scales like −lq 2 . [sent-204, score-0.378]
88 Then, for q ∈ (0, 1), there exists a numerical constant C1 such that inf sup G∈G(m,l) s,f ∈F (q) dm (s, sMV ) ˆ = e−(C1 lq 2 +O(lq 4 +1)) . [sent-206, score-0.251]
89 (5) The lower bound in (4) does not depend on how many tasks are assigned to each worker. [sent-207, score-0.21]
90 Let sIter be the estimate given by random regular graphs and the ˆ iterative algorithm. [sent-210, score-0.223]
91 2 gives lq ˆ lim sup m→∞ s,f ∈F (q) dm (s, sIter ) ≤ e−C4 lq . [sent-212, score-0.453]
92 We ran numerical experiments with 1000 tasks and 1000 workers from the spammer-hammer model assigned according to random graphs with l = r from the configuration model. [sent-214, score-0.651]
93 4 Figure 1: The iterative algorithm improves over majority voting and EM algorithm [8]. [sent-233, score-0.448]
94 Now, let ∆LB be the minimum cost per task necessary to achieve a target accuracy ∈ (0, 1/2) using any graph and any possible algorithm. [sent-234, score-0.239]
95 If I is a random integer drawn uniformly in [m], then (k) (k) (1/m) i∈[m] P(si = si ) ≤ P(xI ≤ 0), where xi denotes the decision variable for task i after ˆ k iterations of the iterative algorithm. [sent-244, score-0.351]
96 We initialize yp with a Gaussian distribution independent of (0) d p: yp ∼ N (1, 1). [sent-253, score-0.358]
97 }, d (k−1) x(k) = zpi ,i ypi ,i , d (k) (k) yp = i∈[l−1] zp,j xj , (8) j∈[r−1] where xj ’s, pi ’s, and yp,i ’s are independent copies of x, p, and yp , respectively. [sent-258, score-0.488]
98 Applying the Chernoff bound 2 with λ = −mk /(σk ), we get (k) ˆ P x(k) ≤ 0 x ≤ E eλˆ 2 ˆ 2 ≤ e−l mk /(2 l σk ) , (11) 2 Since mk mk−1 /(σk ) ≤ µ2 ˆ2 (qˆr)2k−3 /(3µ2 qˆ3 r2 (qˆr)2k−4 ) = 1/(3ˆ), it is easy to check that l lˆ l ˆ lˆ r |λ| ≤ 1/(2mk−1 r). [sent-273, score-0.419]
99 We can write down a recursive formula for the evolution of the moment generating functions of x and yp as E eλx (k) (k) E eλyp (k−1) = Ep pE[eλyp = pE eλx (k) (k−1) |p] + pE[e−λyp ¯ (k) + pE e−λx ¯ ˆ l |p] , (12) r ˆ , (13) where p = 1 − p and p = 1 − p. [sent-277, score-0.241]
100 Hence, we get a recursion for mk and ˆ ˆ ˆ σk such that (10) holds for |λ| ≤ 1/(2mk−1 r): ˆ mk+1 = qˆrmk , lˆ 2 σk+1 = (3qˆr2 + ˆr)m2 + ˆrσk . [sent-299, score-0.238]
wordName wordTfidf (topN-words)
[('workers', 0.443), ('worker', 0.369), ('answers', 0.25), ('pj', 0.231), ('aij', 0.185), ('yp', 0.179), ('mk', 0.171), ('crowdsourcing', 0.17), ('majority', 0.155), ('reliability', 0.154), ('lq', 0.146), ('iterative', 0.123), ('si', 0.119), ('voting', 0.116), ('tasks', 0.115), ('pe', 0.11), ('taskmaster', 0.108), ('messages', 0.104), ('crowdsourcers', 0.086), ('graph', 0.081), ('bipartite', 0.079), ('task', 0.078), ('spammers', 0.076), ('yj', 0.069), ('em', 0.068), ('ti', 0.066), ('wj', 0.065), ('emk', 0.065), ('pemk', 0.065), ('ypi', 0.065), ('zpi', 0.065), ('singular', 0.063), ('evolution', 0.062), ('regular', 0.059), ('batches', 0.057), ('rq', 0.057), ('lim', 0.056), ('dm', 0.055), ('assignments', 0.054), ('assigned', 0.052), ('labelers', 0.052), ('kmax', 0.052), ('budget', 0.051), ('questions', 0.051), ('sup', 0.05), ('answer', 0.05), ('assign', 0.049), ('leading', 0.047), ('crowd', 0.046), ('bp', 0.045), ('achieve', 0.045), ('oracle', 0.043), ('bound', 0.043), ('siter', 0.043), ('skipped', 0.043), ('smv', 0.043), ('whence', 0.043), ('correct', 0.042), ('graphs', 0.041), ('deciding', 0.04), ('payment', 0.038), ('diligent', 0.038), ('hammer', 0.038), ('spammer', 0.038), ('zp', 0.038), ('reliable', 0.037), ('asymptotically', 0.036), ('analyze', 0.036), ('density', 0.035), ('ml', 0.035), ('dawid', 0.035), ('provost', 0.035), ('sheng', 0.035), ('iter', 0.035), ('lb', 0.035), ('converges', 0.035), ('target', 0.035), ('get', 0.034), ('recursion', 0.033), ('xi', 0.031), ('irregular', 0.031), ('error', 0.03), ('minimax', 0.03), ('densities', 0.03), ('nishes', 0.03), ('corollary', 0.029), ('runtime', 0.029), ('ep', 0.029), ('initialization', 0.029), ('inference', 0.029), ('acquiring', 0.028), ('designing', 0.028), ('substituting', 0.028), ('assigning', 0.028), ('guration', 0.028), ('chernoff', 0.027), ('bc', 0.027), ('message', 0.027), ('algorithm', 0.027), ('asking', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
2 0.40456694 66 nips-2011-Crowdclustering
Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona
Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1
3 0.24609482 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in √ spite of asynchronous delays—scales asymptotically as O(1/ nT ), which is known to be optimal even in the absence of delays. 1
4 0.096064813 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama
Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach. 1
5 0.095904864 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. 1
6 0.092817441 231 nips-2011-Randomized Algorithms for Comparison-based Search
7 0.07128232 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
8 0.070087105 288 nips-2011-Thinning Measurement Models and Questionnaire Design
9 0.067642041 168 nips-2011-Maximum Margin Multi-Instance Learning
10 0.06675224 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
11 0.063732162 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
12 0.059215602 175 nips-2011-Multi-Bandit Best Arm Identification
13 0.057461247 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
14 0.057356678 277 nips-2011-Submodular Multi-Label Learning
15 0.057083301 258 nips-2011-Sparse Bayesian Multi-Task Learning
16 0.056741752 145 nips-2011-Learning Eigenvectors for Free
17 0.056719001 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
18 0.055073418 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
19 0.054518726 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
20 0.052777261 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
topicId topicWeight
[(0, 0.192), (1, -0.009), (2, -0.051), (3, -0.01), (4, -0.019), (5, -0.048), (6, -0.093), (7, -0.086), (8, -0.012), (9, -0.004), (10, -0.05), (11, -0.05), (12, -0.064), (13, -0.156), (14, 0.13), (15, 0.094), (16, -0.032), (17, -0.021), (18, 0.261), (19, 0.184), (20, 0.039), (21, -0.146), (22, 0.045), (23, -0.299), (24, 0.264), (25, -0.261), (26, 0.088), (27, -0.133), (28, -0.067), (29, -0.144), (30, -0.027), (31, 0.073), (32, 0.072), (33, 0.006), (34, -0.001), (35, -0.059), (36, -0.031), (37, -0.046), (38, 0.022), (39, 0.003), (40, 0.044), (41, -0.085), (42, -0.039), (43, 0.114), (44, -0.027), (45, 0.015), (46, 0.008), (47, 0.046), (48, -0.006), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.92432755 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
2 0.82110167 66 nips-2011-Crowdclustering
Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona
Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1
3 0.64828944 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in √ spite of asynchronous delays—scales asymptotically as O(1/ nT ), which is known to be optimal even in the absence of delays. 1
4 0.44082236 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
Author: Fabian L. Wauthier, Michael I. Jordan
Abstract: Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to active learning. Experiments show BBMC to outperform many common heuristics. 1
5 0.37175211 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
Author: Vikas C. Raykar, Shipeng Yu
Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a dataset labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Often we have low quality annotators or spammers–annotators who assign labels randomly (e.g., without actually looking at the instance). Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators—with the spammers having a score close to zero and the good annotators having a high score close to one. 1 Spammers in crowdsourced labeling tasks Annotating an unlabeled dataset is one of the bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon’s Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see [8], [10], and [11] for some computer vision and natural language processing case studies). One drawback of most crowdsourcing services is that we do not have tight control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence in order to get good quality labels requestors typically get each instance labeled by multiple annotators and these multiple annotations are then consolidated either using a simple majority voting or more sophisticated methods that model and correct for the annotator biases [3, 9, 6, 7, 14] and/or task complexity [2, 13, 12]. In this paper we are interested in ranking annotators based on how spammer like each annotator is. In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator). Spammers can significantly increase the cost of acquiring annotations (since they need to be paid) and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing market place. For example one can give monetary bonuses to good annotators and deny payments to spammers. The main contribution of this paper is to formalize the notion of a spammer for binary, categorical, and ordinal labeling tasks. More specifically we define a scalar metric which can be used to rank the annotators—with the spammers having a score close to zero and the good annotators having a score close to one (see Figure 4). We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer like the annotator is. While this spammer score was implicit for binary labels in earlier works [3, 9, 2, 6] the extension to categorical and ordinal labels is novel and is quite different from the accuracy computed from the confusion rate matrix. An attempt to quantify the quality of the workers based on the confusion matrix was recently made by [4] where they transformed the observed labels into posterior soft labels based on the estimated confusion 1 matrix. While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly defined in terms of the annotator parameters (see § 5 for more details). The rest of the paper is organized as follows. For ease of exposition we start with binary labels (§ 2) and later extend it to categorical (§ 3) and ordinal labels (§ 4). We first specify the annotator model used, formalize the notion of a spammer, and propose an appropriate score in terms of the annotator model parameters. We do not dwell too much on the estimation of the annotator model parameters. These parameters can either be estimated directly using known gold standard 1 or the iterative algorithms that estimate the annotator model parameters without actually knowing the gold standard [3, 9, 2, 6, 7]. In the experimental section (§ 6) we obtain rankings for the annotators using the proposed spammer scores on some publicly available data from different domains. 2 Spammer score for crowdsourced binary labels j Annotator model Let yi ∈ {0, 1} be the label assigned to the ith instance by the j th annotator, and let yi ∈ {0, 1} be the actual (unobserved) binary label. We model the accuracy of the annotator separately on the positive and the negative examples. If the true label is one, the sensitivity (true positive rate) αj for the j th annotator is defined as the probability that the annotator labels it as one. j αj := Pr[yi = 1|yi = 1]. On the other hand, if the true label is zero, the specificity (1−false positive rate) β j is defined as the probability that annotator labels it as zero. j β j := Pr[yi = 0|yi = 0]. Extensions of this basic model have been proposed to include item level difficulty [2, 13] and also to model the annotator performance based on the feature vector [14]. For simplicity we use the basic model proposed in [7] in our formulation. Based on many instances labeled by multiple annotators the maximum likelihood estimator for the annotator parameters (αj , β j ) and also the consensus ground truth (yi ) can be estimated iteratively [3, 7] via the Expectation Maximization (EM) algorithm. The EM algorithm iteratively establishes a particular gold standard (initialized via majority voting), measures the performance of the annotators given that gold standard (M-step), and refines the gold standard based on the performance measures (E-step). Who is a spammer? Intuitively, a spammer assigns labels randomly—maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator. More precisely an annotator is a spammer if the probability j of observed label yi being one given the true label yi is independent of the true label, i.e., j j Pr[yi = 1|yi ] = Pr[yi = 1]. This means that the annotator is assigning labels randomly by flipping a coin with bias without actually looking at the data. Equivalently (1) can be written as j j Pr[yi = 1|yi = 1] = Pr[yi = 1|yi = 0] which implies αj = 1 − β j . (1) j Pr[yi = 1] (2) Hence in the context of the annotator model defined earlier a perfect spammer is an annotator for whom αj + β j − 1 = 0. This corresponds to the diagonal line on the Receiver Operating Characteristic (ROC) plot (see Figure 1(a)) 2 . If αj + β j − 1 < 0 then the annotators lies below the diagonal line and is a malicious annotator who flips the labels. Note that a malicious annotator has discriminatory power if we can detect them and flip their labels. In fact the methods proposed in [3, 7] can automatically flip the labels for the malicious annotators. Hence we define the spammer score for an annotator as S j = (αj + β j − 1)2 (3) An annotator is a spammer if S j is close to zero. Good annotators have S j > 0 while a perfect annotator has S j = 1. 1 One of the commonly used strategy to filter out spammers is to inject some items into the annotations with known labels. This is the strategy used by CrowdFlower (http://crowdflower.com/docs/gold). 2 Also note that (αj + β j )/2 is equal to the area shown in the plot and can be considered as a non-parametric approximation to the area under the ROC curve (AUC) based on one observed point. It is also equal to the Balanced Classification Rate (BCR). So a spammer can also be defined as having BCR or AUC equal to 0.5. 2 Equal accuracy contours (prevalence=0.5) 0.9 1 Good Annotators Biased Annotators 0.8 0.9 Sensitivity Sensitivity ( αj ) Spammers 0.5 0.4 j 0.3 j 0.6 0.5 0.4 4 0. 5 0. 3 0. 7 0. 6 0. 4 0. 5 0. 2 0. 3 0. Malicious Annotators 0.2 0.4 0.6 1−Specificity ( βj ) 0.8 0.1 1 0. 0. 2 0. 3 0. 1 0. 0.6 0.5 1 0. 2 0. 2 0. 1 0. 0.4 3 1 0. 0. 2 0. 4 0. 0.2 4 0. 5 0. 0 0 1 3 0. 0.3 0.2 Biased Annotators 4 0.7 6 0. 4 0. 0.8 7 0.3 Area = (α +β )/2 0.2 0 0 0.9 0. 8 0. 8 0. 0.7 6 0. .5 0 5 0. 0.7 [ 1−βj, αj ] 0.6 0.1 6 0. 0.8 0.7 Equal spammer score contours 1 7 0. 8 0. 9 0. Sensitivity 1 (a) Binary annotator model 0.1 1 0. 2 0. 3 0. 0.2 0.4 0.6 1−Specificity 0.8 1 1 0. 0 0 (b) Accuracy 0.2 3 0. 4 0. 0.4 0.6 1−Specificity 5 0. .6 7 0 0. 8 0. 0.8 1 (c) Spammer score Figure 1: (a) For binary labels an annotator is modeled by his/her sensitivity and specificity. A perfect spammer lies on the diagonal line on the ROC plot. (b) Contours of equal accuracy (4) and (c) equal spammer score (3). Accuracy This notion of a spammer is quite different for that of the accuracy of an annotator. An annotator with high accuracy is a good annotator but one with low accuracy is not necessarily a spammer. The accuracy is computed as 1 j Accuracyj = Pr[yi = yi ] = j Pr[yi = 1|yi = k]Pr[yi = k] = αj p + β j (1 − p), (4) k=0 where p := Pr[yi = 1] is the prevalence of the positive class. Note that accuracy depends on prevalence. Our proposed spammer score does not depend on prevalence and essentially quantifies the annotator’s inherent discriminatory power. Figure 1(b) shows the contours of equal accuracy on the ROC plot. Note that annotators below the diagonal line (malicious annotators) have low accuracy. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we can detect them and then correct for the flipping. In fact the EM algorithms [3, 7] can correctly flip the labels for the malicious annotators and hence they should not be treated as spammers. Figure 1(c) also shows the contours of equal score for our proposed score and it can be seen that the malicious annotators have a high score and only annotators along the diagonal have a low score (spammers). Log-odds Another interpretation of a spammer can be seen from the log odds. Using Bayes’ rule the posterior log-odds can be written as log j Pr[yi = 1|yi ] Pr[yi = j 0|yi ] = log j Pr[yi |yi = 1] j Pr[yi |yi = 0] + log p . 1−p Pr[y =1|y j ] p If an annotator is a spammer (i.e., (2) holds) then log Pr[yi =0|yi ] = log 1−p . Essentially the annotator j i i provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label. 3 Spammer score for categorical labels Annotator model Suppose there are K ≥ 2 categories. We introduce a multinomial parameter αj = (αj , . . . , αj ) for each annotator, where c c1 cK K j αj := Pr[yi = k|yi = c] ck αj = 1. ck and k=1 αj ck The term denotes the probability that annotator j assigns class k to an instance given that the true class is c. When K = 2, αj and αj are sensitivity and specificity, respectively. 11 00 Who is a spammer? As earlier a spammer assigns labels randomly, i.e., j j Pr[yi = k|yi ] = Pr[yi = k], ∀k. 3 j j This is equivalent to Pr[yi = k|yi = c] = Pr[yi = k|yi = c ], ∀c, c , k = 1, . . . , K— which means knowing the true class label being c or c does not change the probability of the annotator’s assigned label. This indicates that the annotator j is a spammer if αj = αj k , ∀c, c , k = 1, . . . , K. ck c (5) Let Aj be the K × K confusion rate matrix with entries [Aj ]ck = αck —a spammer would have 0.50 0.50 0.50 all the rows of Aj equal, for example, Aj = 0.25 0.25 0.25 0.25 0.25 0.25 , for a three class categorical annotation problem. Essentially Aj is a rank one matrix of the form Aj = evj , for some column vector vj ∈ RK that satisfies vj e = 1, where e is column vector of ones. In the binary case we had this natural notion of spammer as an annotator for whom αj + β j − 1 was close to zero. One natural way to summarize (5) would be in terms of the distance (Frobenius norm) of the confusion matrix to the closest rank one approximation, i.e, S j := Aj − eˆj v 2 F, (6) where ˆj solves v ˆj = arg min Aj − evj v vj 2 F s.t. vj e = 1. (7) Solving (7) yields ˆj = (1/K)Aj e, which is the mean of the rows of Aj . Then from (6) we have v Sj = I− 1 ee K 2 Aj = F 1 K (αj − αj k )2 . ck c c < . . . < K. Annotator model It is conceptually easier to think of the true label to be binary, that is, yi ∈ {0, 1}. For example in mammography a lesion is either malignant (1) or benign (0) (which can be confirmed by biopsy) and the BIRADS ordinal scale is a means for the radiologist to quantify the uncertainty based on the digital mammogram. The radiologist assigns a higher value of the label if he/she thinks the true label is closer to one. As earlier we characterize each annotator by the sensitivity and the specificity, but the main difference is that we now define the sensitivity and specificity for j each ordinal label (or threshold) k ∈ {1, . . . , K}. Let αj and βk be the sensitivity and specificity k th respectively of the j annotator corresponding to the threshold k, that is, j j j αj = Pr[yi ≥ k | yi = 1] and βk = Pr[yi < k | yi = 0]. k j j Note that αj = 1, β1 = 0 and αj 1 K+1 = 0, βK+1 = 1 from this definition. Hence each annotator j j is parameterized by a set of 2(K − 1) parameters [αj , β2 , . . . , αj , βK ]. This corresponds to an 2 K empirical ROC curve for the annotator (Figure 2). 4 Who is a spammer? As earlier we define an an1 j k=1 notator j to be a spammer if Pr[yi = k|yi = 1] = 0.9 j k=2 0.8 Pr[yi = k|yi = 0] ∀k = 1, . . . , K. Note that from j 0.7 k=3 [ 1−β , α ] the annotation model we have 3 Pr[yi = k | yi = 0.6 j j j 1] = αk − αk+1 and Pr[yi = k | yi = 0] = 0.5 k=4 j j 0.4 βk+1 − βk . This implies that annotator j is a spam0.3 j j mer if αj − αj k k+1 = βk+1 − βk , ∀k = 1, . . . , K, 0.2 j j 0.1 which leads to αj + βk = αj + β1 = 1, ∀k. This 1 k j 0 0 0.2 0.4 0.6 0.8 1 means that for every k, the point (1 − βk , αj ) lies on k 1−Specificity ( β ) the diagonal line in the ROC plot shown in Figure 2. The area under the empirical ROC curve can be comFigure 2: Ordinal labels: An annotator is modK 1 puted as (see Figure 2) AUCj = 2 k=1 (αj + eled by sensitivity/specificity for each threshold. k+1 j j αj )(βk+1 − βk ), and can be used to define the folk lowing spammer score as (2AUCj − 1)2 to rank the different annotators. 3 Sensitivity ( αj ) 3 j 2 K (αj k+1 k=1 j S = + j αj )(βk+1 k − j βk ) −1 (9) With two levels this expression defaults to the binary case. An annotator is a spammer if S j is close to zero. Good annotators have S j > 0 while a perfect annotator has S j = 1. 5 Previous work Recently Ipeirotis et.al. [4] proposed a score for categorical labels based on the expected cost of the posterior label. In this section we briefly describe their approach and compare it with our proposed score. For each instance labeled by the annotator they first compute the posterior (soft) label j j Pr[yi = c|yi ] for c = 1, . . . , K, where yi is the label assigned to the ith instance by the j th annotator and yi is the true unknown label. The posterior label is computed via Bayes’ rule as j j j Pr[yi = c|yi ] ∝ Pr[yi |yi = c]Pr[yi = c] = (αj )δ(yi ,k) pc , where pc = Pr[yi = c] is the prevack lence of class c. The score for a spammer is based on the intuition that the posterior label vector j j (Pr[yi = 1|yi ], . . . , Pr[yi = K|yi ]) for a good annotator will have all the probability mass concentrated on single class. For example for a three class problem (with equal prevalence), a posterior label vector of (1, 0, 0) (certain that the class is one) comes from a good annotator while a (1/3, 1/3, 1/3) (complete uncertainty about the class label) comes from spammer. Based on this they define the following score for each annotator 1 Score = N N K K j j costck Pr[yi = k|yi ]Pr[yi = c|yi ] j i=1 . (10) c=1 k=1 where costck is the misclassification cost when an instance of class c is classified as k. Essentially this is capturing some sort of uncertainty of the posterior label averaged over all the instances. Perfect workers have a score Scorej = 0 while spammers will have high score. An entropic version of this score based on similar ideas has also been recently proposed in [5]. Our proposed spammer score differs from this approach in the following aspects: (1) Implicit in the score defined above (10) j is the assumption that an annotator is a spammer when Pr[yi = c|yi ] = Pr[yi = c], i.e., the estimated posterior labels are simply based on the prevalence and do not depend on the observed labels. By j j Bayes’ rule this is equivalent to Pr[yi |yi = c] = Pr[yi ] which is what we have used to define our spammer score. (2) While both notions of a spammer are equivalent, the approach of [4] first computes the posterior labels based on the observed data, the class prevalence and the annotator j j j j This can be seen as follows: Pr[yi = k | yi = 1] = Pr[(yi ≥ k) AND (yi < k + 1) | yi = 1] = Pr[yi ≥ j j j j k | yi = 1] + Pr[yi < k + 1 | yi = 1] − Pr[(yi ≥ k) OR (yi < k + 1) | yi = 1] = Pr[yi ≥ k | yi = j j j 1] − Pr[yi ≥ k + 1 | yi = 1] = αj − αj . Here we used the fact that Pr[(yi ≥ k) OR (yi < k + 1)] = 1. k k+1 3 5 simulated | 500 instances | 30 annotators simulated | 500 instances | 30 annotators 1 12 0.8 Spammer Score 18 0.6 0.5 22 24 23 25 0.3 29 20 0.2 0.4 0.2 30 16 14 0.1 26 21 27 28 19 0 0 13 0 0.2 0.4 0.6 1−Specificity 0.8 1 500 500 500 500 500 500 500 500 500 500 0.4 0.6 500 500 500 1 0.7 500 500 500 500 500 500 500 500 500 500 500 500 500 500 3 1 500 500 500 2 8 510 7 17 4 9 27 8 30 6 3 28 7 10 2 23 22 26 24 5 1 21 29 25 14 12 17 11 18 20 19 15 16 13 4 0.8 Sensitivity 6 9 0.9 15 11 Annotator (a) Simulation setup (b) Annotator ranking Annotator rank (median) via accuracy simulated | 500 instances | 30 annotators Annotator rank (median) via Ipeirotis et.al.[4] simulated | 500 instances | 30 annotators 27 30 30 28 23 22 26 25 24 21 29 25 20 14 17 18 15 20 16 15 11 13 12 19 1 10 5 10 2 7 6 3 5 8 9 4 0 0 5 10 15 20 25 Annotator rank (median) via spammer score 30 (c) Comparison with accuracy 30 18 20 16 15 19 13 12 25 11 14 17 25 20 29 21 51 15 26 22 23 24 10 2 7 10 28 6 3 8 30 5 27 9 4 0 0 5 10 15 20 25 Annotator rank (median) via spammer score 30 (d) Comparison with Ipeirotis et. al. [4] Figure 3: (a) The simulation setup consisting of 10 good annotators (annotators 1 to 10), 10 spammers (11 to 20), and 10 malicious annotators (21 to 30). (b) The ranking of annotators obtained using the proposed spammer score. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown—obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. (c) and (d) Comparison of the median rank obtained via the spammer score with the rank obtained using (c) accuracy and (d) the method proposed by Ipeirotis et. al. [4]. parameters and then computes the expected cost. Our proposed spammer score does not depend on the prevalence of the class. Our score is also directly defined only in terms of the annotator confusion matrix and does not need the observed labels. (3) For the score defined in (10) while perfect annotators have a score of 0 it is not clear what should be a good baseline for a spammer. The authors suggest to compute the baseline by assuming that a worker assigns as label the class with maximum prevalence. Our proposed score has a natural scale with a perfect annotator having a score of 1 and a spammer having a score of 0. (4) However one advantage of the approach in [4] is that they can directly incorporate varied misclassification costs. 6 Experiments Ranking annotators based on the confidence interval As mentioned earlier the annotator model parameters can be estimated using the iterative EM algorithms [3, 7] and these estimated annotator parameters can then be used to compute the spammer score. The spammer score can then be used to rank the annotators. However one commonly observed phenomenon when working with crowdsourced data is that we have a lot of annotators who label only a very few instances. As a result the annotator parameters cannot be reliably estimated for these annotators. In order to factor this uncertainty in the estimation of the model parameters we compute the spammer score for 100 bootstrap replications. Based on this we compute the 95% confidence intervals (CI) for the spammer score for each annotator. We rank the annotators based on the lower limit of the 95% CI. The CIs are wider 6 Table 1: Datasets N is the number of instances. M is the number of annotators. M ∗ is the mean/median number of annotators per instance. N ∗ is the mean/median number of instances labeled by each annotator. Dataset Type N M M∗ N∗ bluebird binary 108 39 39/39 108/108 temp binary 462 76 10/10 61/16 Brief Description wsd categorical/3 177 34 10/10 52/20 sentiment categorical/3 1660 33 6/6 291/175 30 100 10 38 10/10 10/10 bird identification [12] The annotator had to identify whether there was an Indigo Bunting or Blue Grosbeak in the image. event annotation [10] Given a dialogue and a pair of verbs annotators need to label whether the event described by the first verb occurs before or after the second. 30/30 26/20 30 30 30 30 30 30 3 30 30 1 30 20 20 20 20 20 77 117 20 60 Spammer Score 0.4 10 7 9 8 6 5 0 2 0.2 13 31 10 23 29 1 2 4 6 8 9 14 15 17 22 32 5 18 16 19 11 12 20 21 24 25 26 27 28 30 33 34 7 3 0 0.6 20 20 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 20 20 20 20 17 17 40 20 20 100 Spammer Score 0.4 108 108 108 108 108 108 108 108 108 108 108 108 108 0.8 0.6 0.2 17 8 27 30 25 35 1 12 32 37 38 16 22 9 29 15 20 19 5 39 3 21 23 14 2 10 24 7 33 13 36 31 4 34 28 18 11 6 26 0.2 30 77 77 4 108 108 108 108 0.4 0 wosi | 30 instances | 10 annotators 1 0.8 108 108 0.6 108 108 108 Spammer Score 0.8 wsd | 177 instances | 34 annotators 1 80 177 157 177 157 bluebird | 108 instances | 39 annotators 1 word similarity [10] Numeric judgements of word similarity. affect recognition [10] Each annotator is presented with a short headline and asked to rate it on a scale [-100,100] to denote the overall positive or negative valence. 40 40 20 ordinal/[0 10] ordinal[-100 100] 20 20 20 wosi valence word sense disambiguation [10] The labeler is given a paragraph of text containing the word ”president” and asked to label one of the three appropriate senses. irish economic sentiment analysis [1] Articles from three Irish online news sources were annotated by volunteer users as positive, negative, or irrelevant. 20 20 20 20 20 60 20 20 20 40 40 100 0.4 Annotator Annotator 0 1 26 10 18 28 15 5 36 23 12 8 32 31 38 13 17 27 11 2 35 24 19 9 6 30 33 37 14 29 4 3 20 34 22 25 7 16 21 40 20 0.2 26 2 6 11 5 14 3 20 9 22 31 10 12 18 8 13 30 4 1 29 19 17 27 28 21 15 25 23 7 33 16 24 32 10 132 10 360 10 0 13 18 52 75 33 32 12 74 31 51 41 55 7 14 70 42 58 65 43 1 10 47 61 73 25 37 76 67 24 46 54 48 39 56 15 62 68 44 53 64 40 9 28 6 2 57 3 4 5 8 11 16 17 19 20 21 22 23 26 27 29 30 34 35 36 38 45 49 50 59 60 63 66 69 71 72 442 462 452 10 10 0.6 238 171 75 654 20 0.2 0.2 0 12 77 67 374 249 229 453 346 428 0.4 Spammer Score 43 175 119 541 525 437 0.8 917 104 284 0.4 0.6 1211 1099 10 Spammer Score 0.8 572 30 52 402 60 0.6 30 Spammer Score 0.8 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 60 40 20 15 7 7 11 12 35 29 1 87 10 10 10 10 10 10 10 12 1 30 10 10 10 10 10 10 10 10 10 10 10 10 10 valence | 100 instances | 38 annotators 20 22 10 10 10 10 sentiment | 1660 instances | 33 annotators 10 10 30 20 10 Annotator temp | 462 instances | 76 annotators 1 Annotator 10 50 10 10 40 10 70 350 80 40 100 192 190 40 32 60 70 20 20 40 80 20 50 50 50 30 Annotator Annotator Figure 4: Annotator Rankings The rankings obtained for the datasets in Table 1. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown—obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances. when the annotator labels only a few instances. For a crowdsourced labeling task the annotator has to be good and also label a reasonable number of instances in order to be reliably identified. Simulated data We first illustrate our proposed spammer score on simulated binary data (with equal prevalence for both classes) consisting of 500 instances labeled by 30 annotators of varying sensitivity and specificity (see Figure 3(a) for the simulation setup). Of the 30 annotators we have 10 good annotators (annotators 1 to 10 who lie above the diagonal in Figure 3(a)), 10 spammers (annotators 11 to 20 who lie around the diagonal), and 10 malicious annotators (annotators 21 to 30 who lie below the diagonal). Figure 3(b) plots the ranking of annotators obtained using the proposed spammer score with the annotator model parameters estimated via the EM algorithm [3, 7]. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence interval (CI) obtained via bootstrapping are shown. The annotators are ranked based on the lower limit of the 95% CI. As can be seen all the spammers (annotators 11 to 20) have a low spammer score and appear at the bottom of the list. The malicious annotators have higher score than the spammers since we can correct for their flipping. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we detect that they are malicious. Figure 3(c) compares the (median) rank obtained via the spammer score with the (median) rank obtained using accuracy as the score to rank the annotators. While the good annotators are ranked high by both methods the accuracy score gives a low rank to the malicious annotators. Accuracy does not capture the notion of a spammer. Figure 3(d) compares the ranking with the method proposed by Ipeirotis et. al. [4] which gives almost similar rankings as our proposed score. 7 21 23 10 6 35 4 34 1126 18 147 30 3 31 13 2436 33 25 5 2 20 15 39 19 15 20 28 22 299 12 37 16 38 10 32 1 5 27 25 35 30 8 17 0 0 5 10 15 20 25 30 35 Annotator rank (median) via spammer score 40 bluebird | 108 instances | 39 annotators 40 1 6 34 112618 4 31 1013 7 30 2 28 21 5 20 15 39 19 20 15 22 37 16 299 12 38 10 5 8 17 0 0 27 25 35 30 0.6 0.5 35 32 2 0.4 36 11 13 31 24 10 33 28 21 26 18 0 0 40 34 15 19 39 0.1 (a) 22 37 20 38 29 9 0.2 5 10 15 20 25 30 35 Annotator rank (median) via spammer score 6 4 16 0.3 32 1 7 30 25 1 3 14 3 27 5 0.7 24 33 14 36 23 25 12 8 0.9 17 0.8 35 Sensitivity Annotator rank (median) via accuracy bluebird | 108 instances | 39 annotators Annotator rank (median) via Ipeirotis et.al.[4] bluebird | 108 instances | 39 annotators 40 23 0.2 0.4 0.6 1−Specificity (b) 0.8 1 (c) Figure 5: Comparison of the rank obtained via the spammer score with the rank obtained using (a) accuracy and (b) the method proposed by Ipeirotis et. al. [4] for the bluebird binary dataset. (c) The annotator model parameters as estimated by the EM algorithm [3, 7]. 19 25 12 18 7 3 14 20 32 5 8 1 16 20 9 21 15 34 10 31 29 17 28 22 26 2315 5 2 0 0 4 6 13 10 5 10 15 20 25 30 Annotator rank (median) via spammer score 35 30 25 16 19 7 25 8 9 27 14 3 28 17 18 32 5 10 4 2 10 6 1529 31 23 22 21 15 0 0 33 30 11 1 20 5 sentiment | 1660 instances | 33 annotators 24 35 12 20 24 34 26 13 5 10 15 20 25 30 Annotator rank (median) via spammer score 35 33 7 30 15 17 25 28 2719 2223 20 8 1 4 1812 15 13 10 20 32 30 10 3 29 9 31 16 5 6 2 5 14 11 26 0 0 5 10 15 20 25 30 Annotator rank (median) via spammer score 25 21 Annotator rank (median) via Ipeirotis et.al.[4] 25 24 27 Annotator rank (median) via accuracy Annotator rank (median) via accuracy 30 sentiment | 1660 instances | 33 annotators wsd | 177 instances | 34 annotators 33 30 11 Annotator rank (median) via Ipeirotis et.al.[4] wsd | 177 instances | 34 annotators 35 7 30 15 19 17 27 25 21 25 8 12 4 18 20 24 15 20 33 10 3 13 9 28 1 29 23 10 1632 11 14 5 6 2 5 31 30 22 26 0 0 5 10 15 20 25 30 Annotator rank (median) via spammer score Figure 6: Comparison of the median rank obtained via the spammer score with the rank obtained using accuracy and he method proposed by Ipeirotis et. al. [4] for the two categorial datasets in Table 1. Mechanical Turk data We report results on some publicly available linguistic and image annotation data collected using the Amazon’s Mechanical Turk (AMT) and other sources. Table 1 summarizes the datasets. Figure 4 plots the spammer scores and rankings obtained. The mean and the 95% CI obtained via bootstrapping are also shown. The number at the top of the CI bar shows the number of instances annotated by that annotator. The rankings are based on the lower limit of the 95% CI which factors the number of instances labeled by the annotator into the ranking. An annotator who labels only a few instances will have very wide CI. Some annotators who label only a few instances may have a high mean spammer score but the CI will be wide and hence ranked lower. Ideally we would like to have annotators with a high score and at the same time label a lot of instances so that we can reliablly identify them. The authors [1] for the sentiment dataset shared with us some of the qualitative observations regarding the annotators and they somewhat agree with our rankings. For example the authors made the following comments about Annotator 7 ”Quirky annotator - had a lot of debate about what was the meaning of the annotation question. I’d say he changed his labeling strategy at least once during the process”. Our proposed score gave a low rank to this annotator. Comparison with other approaches Figure 5 and 6 compares the proposed ranking with the rank obtained using accuracy and the method proposed by Ipeirotis et. al. [4] for some binary and categorical datasets in Table 1. Our proposed ranking is somewhat similar to that obtained by Ipeirotis et. al. [4] but accuracy does not quite capture the notion of spammer. For example for the bluebird dataset for annotator 21 (see Figure 5(a)) accuracy ranks it at the bottom of the list while the proposed score puts is in the middle of the list. From the estimated model parameters it can be seen that annotator 21 actually flips the labels (below the diagonal in Figure 5(c)) but is a good annotator. 7 Conclusions We proposed a score to rank annotators for crowdsourced binary, categorical, and ordinal labeling tasks. The obtained rankings and the scores can be used to allocate monetary bonuses to be paid to different annotators and also to eliminate spammers from further labeling tasks. A mechanism to rank annotators should be desirable feature of any crowdsourcing service. The proposed score should also be useful to specify the prior for Bayesian approaches to consolidate annotations. 8 References [1] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS’10), 2010. [2] B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. [3] A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. [4] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP’10), pages 64–67, 2010. [5] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labelling tasks. In Proceedings of the Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011. [6] V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 889– 896, 2009. [7] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010. [8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. [9] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085–1092. 1995. [10] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is it good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 254–263, 2008. [11] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. [12] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432. 2010. [13] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. [14] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 932–939, 2010. 9
6 0.33379221 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
7 0.27393803 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
8 0.26328659 158 nips-2011-Learning unbelievable probabilities
9 0.26111907 277 nips-2011-Submodular Multi-Label Learning
10 0.25683105 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
11 0.2520982 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
12 0.25141236 123 nips-2011-How biased are maximum entropy models?
13 0.25101855 240 nips-2011-Robust Multi-Class Gaussian Process Classification
14 0.25028679 306 nips-2011-t-divergence Based Approximate Inference
15 0.24753273 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
16 0.24744147 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
17 0.24439292 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
18 0.24080345 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
19 0.23949306 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
20 0.23612106 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
topicId topicWeight
[(0, 0.031), (4, 0.029), (20, 0.032), (26, 0.022), (31, 0.505), (33, 0.016), (43, 0.065), (45, 0.091), (57, 0.037), (74, 0.039), (83, 0.019), (99, 0.035)]
simIndex simValue paperId paperTitle
1 0.99371552 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
2 0.99263978 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
3 0.99179679 225 nips-2011-Probabilistic amplitude and frequency demodulation
Author: Richard Turner, Maneesh Sahani
Abstract: A number of recent scientific and engineering problems require signals to be decomposed into a product of a slowly varying positive envelope and a quickly varying carrier whose instantaneous frequency also varies slowly over time. Although signal processing provides algorithms for so-called amplitude- and frequencydemodulation (AFD), there are well known problems with all of the existing methods. Motivated by the fact that AFD is ill-posed, we approach the problem using probabilistic inference. The new approach, called probabilistic amplitude and frequency demodulation (PAFD), models instantaneous frequency using an auto-regressive generalization of the von Mises distribution, and the envelopes using Gaussian auto-regressive dynamics with a positivity constraint. A novel form of expectation propagation is used for inference. We demonstrate that although PAFD is computationally demanding, it outperforms previous approaches on synthetic and real signals in clean, noisy and missing data settings. 1
4 0.96808988 23 nips-2011-Active dendrites: adaptation to spike-based communication
Author: Balazs B. Ujfalussy, Máté Lengyel
Abstract: Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. 1
same-paper 5 0.95381552 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
6 0.9447667 240 nips-2011-Robust Multi-Class Gaussian Process Classification
7 0.87378734 241 nips-2011-Scalable Training of Mixture Models via Coresets
8 0.84906793 249 nips-2011-Sequence learning with hidden units in spiking neural networks
9 0.83542693 131 nips-2011-Inference in continuous-time change-point models
10 0.82664287 229 nips-2011-Query-Aware MCMC
11 0.82400841 75 nips-2011-Dynamical segmentation of single trials from population neural data
12 0.82027775 221 nips-2011-Priors over Recurrent Continuous Time Processes
13 0.81977862 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
14 0.80674285 87 nips-2011-Energetically Optimal Action Potentials
15 0.80289149 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
16 0.80140603 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
17 0.79191482 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
18 0.78713036 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
19 0.78077257 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
20 0.77946121 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions