nips nips2011 nips2011-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier
Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. [sent-10, score-0.163]
2 Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. [sent-11, score-0.081]
3 Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. [sent-12, score-0.133]
4 The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). [sent-14, score-0.1]
5 Compared to measures like error rate in binary classification and Hamming loss in MLC, it enforces a better balance between performance on the minority and the majority class, respectively, and, therefore, it is more suitable in the case of imbalanced data. [sent-17, score-0.06]
6 , hm ) ∈ {0, 1}m of an m-dimensional binary label vector y = (y1 , . [sent-21, score-0.074]
7 , the class labels of a test set of size m in binary classification or the label vector associated with a single instance in MLC), the F-measure is defined as follows: m 2 i=1 yi hi F (y, h) = m ∈ [0, 1] , (1) m i=1 yi + i=1 hi where 0/0 = 1 by definition. [sent-26, score-0.717]
8 This measure essentially corresponds to the harmonic mean of precision prec and recall rec: m m yi hi yi hi prec(y, h) = i=1 , rec(y, h) = i=1 . [sent-27, score-0.618]
9 m m i=1 hi i=1 yi One can generalize the F-measure to a weighted harmonic average of these two values, but for the sake of simplicity, we stick to the unweighted mean, which is often referred to as the F1-score or the F1-measure. [sent-28, score-0.331]
10 In binary classification, the existing algorithms are extensions of support vector machines [2, 3] or logistic regression [4]. [sent-30, score-0.083]
11 However, the most popular methods, including [5], rely on explicit threshold adjustment. [sent-31, score-0.066]
12 Some algorithms have also been proposed for structured output prediction [6, 7, 8] and MLC [9, 10, 11]. [sent-32, score-0.122]
13 , assuming an underlying probability distribution p(Y ) on {0, 1}m , the prediction h∗ that maximizes the expected F-measure is given by F h∗ = arg max Ey∼p(Y ) [F (y, h)] = arg max F h∈{0,1}m h∈{0,1}m p(Y = y) F (y, h). [sent-39, score-0.216]
14 (2) y∈{0,1}m As discussed in Section 2, this setting was mainly examined before by [12], under the assumption of m independence of the Yi , i. [sent-40, score-0.123]
15 , p(Y = y) = i=1 pyi (1 − pi )1−yi with pi = p(Yi =1). [sent-42, score-0.214]
16 Indeed, finding i the maximizer (2) is in general a difficult problem. [sent-43, score-0.081]
17 Apparently, there is no closed-form expression, and a brute-force search is infeasible (it would require checking all 2m combinations of prediction vector h). [sent-44, score-0.061]
18 At first sight, it also seems that information about the entire joint distribution p(Y ) is needed to maximize the F-measure. [sent-45, score-0.087]
19 In Section 3, we present a general algorithm for maximizing the F-measure that requires only m2 + 1 parameters of the joint distribution. [sent-47, score-0.066]
20 In particular, unlike algorithms such as [12], we do not require independence of the binary response variables (labels). [sent-50, score-0.136]
21 While being natural for problems like binary classification, this assumption is indeed not tenable in domains like MLC and structured output prediction. [sent-51, score-0.124]
22 1 Algorithms Based on Label Independence By assuming independence of the random variables Y1 , . [sent-58, score-0.08]
23 It has been shown independently in [13] and [12] that the optimal solution always contains the labels with the highest marginal probabilities pi , or no labels at all. [sent-62, score-0.496]
24 Jansche [12], however, has proposed an exact procedure, called maximum expected utility framework (MEUF), that takes marginal probabilities p1 , p2 , . [sent-65, score-0.197]
25 He noticed that (2) can be solved via outer and inner maximization. [sent-70, score-0.106]
26 Namely, (2) can be transformed into an inner maximization ∗ h(k) = arg max Ey∼p(Y ) [F (y, h)] , (3) h∈Hk where Hk = {h ∈ {0, 1}m | m i=1 hi = k}, followed by an outer maximization h∗ = F arg max ∗ ∗ Ey∼p(Y ) [F (y, h)] . [sent-71, score-0.631]
27 ,h(m) } The outer maximization (4) can be done by simply checking all m + 1 possibilities. [sent-75, score-0.182]
28 The main effort is then devoted for solving the inner maximization (3). [sent-76, score-0.133]
29 1, to solve (3) for a given k, we need to check only one vector h in which hi = 1 for the k labels with highest marginal probabilities pi . [sent-78, score-0.57]
30 The cardinality of its domain is indeed exponential in m, but the cardinality of its range is polynomial in m, so the expectation can be computed in polynomial time. [sent-82, score-0.06]
31 He also presents approximate variants of this procedure, reducing its complexity from cubic to quadratic or even to linear. [sent-84, score-0.049]
32 The results of the quadratic-time approximation, according to [12], are almost indistinguishable in practice from the exact algorithm; but still the overall complexity of the approach is O(m3 ). [sent-85, score-0.07]
33 If the independence assumption is violated, the above methods may produce predictions being far away from the optimal one. [sent-86, score-0.146]
34 , i=1 yi = 1, corresponding to the multinomial distribution. [sent-97, score-0.095]
35 Denote by y(i) a vector for which yi = 1 and all the other entries are zeros. [sent-102, score-0.095]
36 Assume that p(Y ) is a joint distribution such that p(Y = y(i)) = pi . [sent-103, score-0.173]
37 The maximizer h∗ of (2) consists of the k labels with the highest marginal probabilities, where k is the F first integer for which k pj ≥ (1 + k)pk+1 ; j=1 if there is no such integer, then h = 1. [sent-104, score-0.309]
38 Obviously, to find an optimal threshold θ, access to the entire joint distribution is needed. [sent-108, score-0.122]
39 However, this is not the main problem here, since in the next section, we will show that only a polynomial number of parameters of the joint distribution is needed. [sent-109, score-0.096]
40 What is more interesting is the observation that the F-maximizer is in general not consistent with the order of marginal label probabilities. [sent-110, score-0.119]
41 Let hT be a vector of predictions obtained by putting a threshold on sorted marginal probabilities in the optimal way, then the worst-case regret is lower bounded by sup (EY F (Y, h∗ ) − F (Y, hT ) ) ≥ max(0, F p 1 2 − ), 6 m+4 where the supremum is taken over all possible distributions p(Y ). [sent-114, score-0.293]
42 3 This is a rather surprising result in light of the existence of many algorithms that rely on finding a threshold for maximizing the F-measure [5, 9, 10]. [sent-115, score-0.11]
43 3 An Exact Algorithm for F-Measure Maximization We now introduce an exact and efficient algorithm for computing the F-maximizer without using any additional assumption on the probability distribution p(Y ). [sent-119, score-0.088]
44 While adopting the idea of decomposing the problem into an outer and an inner maximization, our algorithm differs from Jansche’s in the way the inner maximization is solved. [sent-120, score-0.239]
45 As a key element, we consider equivalence classes for the labels in terms of the number of ones in the vectors h and y. [sent-121, score-0.087]
46 First, we show that only m2 + 1 parameters of the joint distribution p(Y ) are needed to compute the F-maximizer. [sent-123, score-0.066]
47 The solution of (2) can be computed by solely using p(Y = 0) and the values of pis = p(Yi = 1 , sy = s), i, s ∈ {1, . [sent-127, score-0.232]
48 The inner optimization problem (3) can be formulated as follows: ∗ h(k) = arg max Ey∼p(Y ) [F (y, h)] = arg max h∈Hk h∈Hk p(y) y∈{0,1}m 2 m i=1 yi hi . [sent-132, score-0.483]
49 sy + k The sums can be swapped, resulting in (k)∗ h m = arg max 2 h∈Hk hi i=1 y∈{0,1}m p(y)yi . [sent-133, score-0.384]
50 sy + k (5) Furthermore, one can sum up the probabilities p(y) for all ys with an equal value of sy . [sent-134, score-0.314]
51 3 Finding the exact value of the supremum is an interesting open question. [sent-137, score-0.082]
52 4 Algorithm 1 General F-measure Maximizer INPUT: matrix P and probability p(Y = 0) define matrix W with elements given by Eq. [sent-138, score-0.075]
53 To simplify the notation, let us introduce an m × m matrix W with elements 1 wsk = , s, k ∈ {1, . [sent-143, score-0.051]
54 , m} , (7) s+k The resulting algorithm, referred to as General F-measure Maximizer (GFM), is summarized in Algorithm 1 and its time complexity is analyzed in the following theorem. [sent-146, score-0.054]
55 By introducing the matrix W with elements (7), we can simplify (6) to m ∗ h(k) = arg max 2 h∈Hk hi fik , (8) i=1 where fik are elements of a matrix F = PW. [sent-152, score-0.536]
56 , the elements with the highest values) in the k-th column of matrix F, which can be carried out in linear time [15]. [sent-155, score-0.091]
57 The solution of the outer optimization problem (4) is then straight-forward. [sent-156, score-0.063]
58 Consequently, the complexity of the algorithm is dominated by a matrix multiplication that is solved naively in O(m3 ), but faster algorithms working in O(m2. [sent-157, score-0.097]
59 First of all, MEUF is characterized by a much higher time complexity being O(m4 ) for the exact version. [sent-160, score-0.07]
60 The recommended approximate variant reduces this complexity to O(m3 ). [sent-161, score-0.053]
61 In addition, let us also remark that this complexity can be further decreased if the number of distinct values of sy with non-zero probability mass is smaller than m. [sent-163, score-0.146]
62 Moreover, the MEUF framework will not deliver an exact F-maximizer if the assumption of independence is violated. [sent-164, score-0.173]
63 Our approach needs m2 + 1 parameters, but then computes the maximizer exactly. [sent-167, score-0.081]
64 Our algorithm can also be tailored for finding an optimal threshold. [sent-169, score-0.063]
65 Instead of finding the top k elements in the k-th column, m it is enough to rely on the order of the marginal probabilities pi = s=1 pis . [sent-171, score-0.431]
66 As a result, there is no need to compute the entire matrix F; instead, only the elements that correspond to the k highest marginal probabilities for each column k are needed. [sent-172, score-0.244]
67 To this end, we train a classifier h(x) on a training set {(xi , y i )}N and perform inference for a given i=1 test vector x so as to deliver an optimal prediction under the F-measure (1). [sent-183, score-0.103]
68 We follow an approach similar to Conditional Random Fields (CRFs) [17, 18], which estimates the joint conditional distribution p(Y | x). [sent-185, score-0.09]
69 The underlying idea is to repeatedly apply the product rule of probability to the joint distribution of the labels Y = (Y1 , . [sent-187, score-0.153]
70 Learning in this framework can be considered as a procedure that relies on constructing probabilistic classifiers for estimating p(Yk = yk |x, y1 , . [sent-194, score-0.097]
71 To sample from the conditional joint distribution p(Y | x), one follows the chain and picks the value of label yk by tossing a biased coin with probabilities given by the k-th classifier. [sent-201, score-0.301]
72 Based on a sample of observations generated in this way, our GFM algorithm can be used to perform the optimal inference under F-measure. [sent-202, score-0.067]
73 By plugging the log-linear model into (9), it can be shown that pairwise dependencies between labels yi and yj can be modeled. [sent-204, score-0.182]
74 The first one (H) estimates marginal probabilities pi (x) and predicts 1 for labels with pi (x) ≥ 0. [sent-208, score-0.454]
75 ˆ The second method (MEUF) uses the estimates pi (x) for computing the F-measure by applying the ˆ MEUF method. [sent-210, score-0.107]
76 If the labels are independent, this method computes the F-maximizer exactly. [sent-211, score-0.087]
77 Finally, we use GFM and its variant that finds the optimal threshold (GFM-T). [sent-213, score-0.083]
78 Left: the performance as a function of sample size generated from independent distribution with pi = 0. [sent-232, score-0.152]
79 Center: similarly as above, but the distribution is defined according to (9), where all p(Yi = yi | y1 , . [sent-234, score-0.117]
80 Right: running times as a function of the number of j=1 2 labels with a sample size of 200. [sent-238, score-0.11]
81 For each dataset, we give the number of labels (m) and the size of training and test sets (in parentheses: training/test set). [sent-241, score-0.087]
82 We additionally present results of the binary relevance (BR) approach which trains an independent classifier for each label (we used the same base learner as in PCC). [sent-399, score-0.074]
83 The macro and micro F-measure are presented mainly as a reference. [sent-411, score-0.166]
84 7 5 Discussion The GFM algorithm can be considered for maximizing the macro F-measure, for example, in a similar setting as in [10], where a specific Bayesian on-line model is used. [sent-423, score-0.102]
85 In order to maximize the macro F-measure, the authors sample from the graphical model to find an optimal threshold. [sent-424, score-0.146]
86 The GFM algorithm may solve this problem optimally, since, as stated by the authors, the independence of labels is lost after integrating out the model parameters. [sent-425, score-0.167]
87 Theoretically, one may also consider a direct maximization of the micro F-measure with GFM, but the computational burden is rather high in this case. [sent-426, score-0.176]
88 The situation is slightly different in structured output prediction, where algorithms for instance-wise maximization of the F-measure do exist. [sent-431, score-0.18]
89 These include, for example, struct SVM [6], SEARN [8], and a specific variant of CRFs [7]. [sent-432, score-0.061]
90 Usually, these algorithms are based on additional assumptions, like label independence in struct SVM. [sent-433, score-0.176]
91 The GFM algorithm can also be easily tailored for maximizing the instance-wise F-measure in structured output prediction, in a similar way as presented above. [sent-434, score-0.131]
92 If the structured output classifier is able to model the joint distribution from which we can easily sample observations, then the use of the algorithm is straight-forward. [sent-435, score-0.157]
93 Surprisingly, in both papers [8] and [6], experimental results are reported in terms of micro Fmeasure, although the algorithms maximize the instance-wise F-measure on the training set. [sent-437, score-0.129]
94 Despite being related to each other, these two measures coincide only in the m specific case where i=1 (yi + hi ) is constant for all test examples. [sent-439, score-0.209]
95 For high variability in m i=1 (yi + hi ), a significant difference between the values of these two measures is to be expected. [sent-441, score-0.209]
96 The use of the GFM algorithm in binary classification seems to be superfluous, since in this case, the assumption of label independence is rather reasonable. [sent-442, score-0.176]
97 In this paper, we analyzed the problem of optimal predictive inference from the joint distribution under the F-measure. [sent-446, score-0.11]
98 Our GFM algorithm requires only a polynomial number of parameters of the joint distribution and delivers the exact solution in polynomial time. [sent-448, score-0.17]
99 From a theoretical perspective, GFM should be preferred to existing approaches, which typically perform threshold maximization on marginal probabilities, often relying on the assumption of (conditional) independence of labels. [sent-449, score-0.305]
100 Large margin methods for structured and interdependent output variables. [sent-475, score-0.068]
wordName wordTfidf (topN-words)
[('meuf', 0.573), ('gfm', 0.467), ('pcc', 0.225), ('hi', 0.183), ('mlc', 0.17), ('ey', 0.152), ('sy', 0.12), ('pis', 0.112), ('pi', 0.107), ('marburg', 0.106), ('hk', 0.096), ('yi', 0.095), ('maximization', 0.09), ('labels', 0.087), ('micro', 0.086), ('fik', 0.085), ('maximizer', 0.081), ('approx', 0.08), ('macro', 0.08), ('independence', 0.08), ('marginal', 0.079), ('br', 0.078), ('eyke', 0.075), ('probabilities', 0.074), ('yk', 0.074), ('classi', 0.066), ('pozna', 0.064), ('outer', 0.063), ('weiwei', 0.056), ('jansche', 0.056), ('krzysztof', 0.056), ('arg', 0.055), ('cheng', 0.051), ('thresholding', 0.046), ('ym', 0.045), ('exact', 0.044), ('joint', 0.044), ('structured', 0.044), ('inner', 0.043), ('amming', 0.042), ('coz', 0.042), ('dembczy', 0.042), ('ghent', 0.042), ('nference', 0.042), ('waegeman', 0.042), ('willem', 0.042), ('hamming', 0.041), ('tailored', 0.041), ('label', 0.04), ('highest', 0.04), ('supremum', 0.038), ('prec', 0.037), ('nowadays', 0.034), ('ski', 0.034), ('struct', 0.034), ('threshold', 0.034), ('binary', 0.034), ('del', 0.032), ('rec', 0.032), ('rely', 0.032), ('prediction', 0.032), ('polynomial', 0.03), ('thorsten', 0.029), ('checking', 0.029), ('referred', 0.028), ('daum', 0.028), ('hal', 0.028), ('elements', 0.027), ('logistic', 0.027), ('variant', 0.027), ('deliver', 0.027), ('routinely', 0.027), ('er', 0.027), ('max', 0.026), ('complexity', 0.026), ('measures', 0.026), ('acl', 0.025), ('naively', 0.025), ('hj', 0.025), ('harmonic', 0.025), ('matrix', 0.024), ('conditional', 0.024), ('output', 0.024), ('regret', 0.024), ('regarding', 0.023), ('sample', 0.023), ('cubic', 0.023), ('probabilistic', 0.023), ('crfs', 0.022), ('marginals', 0.022), ('algorithms', 0.022), ('distribution', 0.022), ('predictions', 0.022), ('maximizing', 0.022), ('inference', 0.022), ('optimal', 0.022), ('assumption', 0.022), ('integer', 0.022), ('maximize', 0.021), ('examined', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 33 nips-2011-An Exact Algorithm for F-Measure Maximization
Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier
Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1
2 0.068080001 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
3 0.067850783 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
Author: Christoph H. Lampert
Abstract: We study multi-label prediction for structured output sets, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multilabel classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label set, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds. 1
4 0.063649878 302 nips-2011-Variational Learning for Recurrent Spiking Networks
Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner
Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns. 1
5 0.062697172 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
6 0.061191533 59 nips-2011-Composite Multiclass Losses
7 0.057705563 277 nips-2011-Submodular Multi-Label Learning
8 0.056846887 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
9 0.056807626 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
10 0.05587418 178 nips-2011-Multiclass Boosting: Theory and Algorithms
11 0.051790088 180 nips-2011-Multiple Instance Filtering
12 0.048071839 53 nips-2011-Co-Training for Domain Adaptation
13 0.047317263 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
14 0.046176277 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
15 0.045651685 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
16 0.045582462 28 nips-2011-Agnostic Selective Classification
17 0.045076203 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.045065004 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
19 0.044564493 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
20 0.0435136 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
topicId topicWeight
[(0, 0.145), (1, 0.002), (2, -0.028), (3, -0.013), (4, 0.008), (5, 0.001), (6, -0.004), (7, -0.056), (8, -0.055), (9, 0.022), (10, -0.04), (11, -0.004), (12, 0.003), (13, 0.029), (14, -0.03), (15, -0.038), (16, -0.027), (17, -0.048), (18, 0.051), (19, 0.007), (20, -0.029), (21, 0.033), (22, 0.079), (23, -0.028), (24, 0.001), (25, 0.04), (26, -0.031), (27, -0.017), (28, -0.014), (29, -0.002), (30, 0.021), (31, -0.052), (32, 0.026), (33, -0.049), (34, 0.002), (35, -0.044), (36, -0.014), (37, -0.019), (38, -0.05), (39, -0.083), (40, -0.038), (41, -0.021), (42, -0.026), (43, -0.041), (44, 0.043), (45, 0.123), (46, 0.072), (47, 0.003), (48, 0.002), (49, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.88859177 33 nips-2011-An Exact Algorithm for F-Measure Maximization
Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier
Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1
2 0.65779316 232 nips-2011-Ranking annotators for crowdsourced labeling tasks
Author: Vikas C. Raykar, Shipeng Yu
Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a dataset labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Often we have low quality annotators or spammers–annotators who assign labels randomly (e.g., without actually looking at the instance). Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators—with the spammers having a score close to zero and the good annotators having a high score close to one. 1 Spammers in crowdsourced labeling tasks Annotating an unlabeled dataset is one of the bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon’s Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see [8], [10], and [11] for some computer vision and natural language processing case studies). One drawback of most crowdsourcing services is that we do not have tight control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence in order to get good quality labels requestors typically get each instance labeled by multiple annotators and these multiple annotations are then consolidated either using a simple majority voting or more sophisticated methods that model and correct for the annotator biases [3, 9, 6, 7, 14] and/or task complexity [2, 13, 12]. In this paper we are interested in ranking annotators based on how spammer like each annotator is. In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator). Spammers can significantly increase the cost of acquiring annotations (since they need to be paid) and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing market place. For example one can give monetary bonuses to good annotators and deny payments to spammers. The main contribution of this paper is to formalize the notion of a spammer for binary, categorical, and ordinal labeling tasks. More specifically we define a scalar metric which can be used to rank the annotators—with the spammers having a score close to zero and the good annotators having a score close to one (see Figure 4). We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer like the annotator is. While this spammer score was implicit for binary labels in earlier works [3, 9, 2, 6] the extension to categorical and ordinal labels is novel and is quite different from the accuracy computed from the confusion rate matrix. An attempt to quantify the quality of the workers based on the confusion matrix was recently made by [4] where they transformed the observed labels into posterior soft labels based on the estimated confusion 1 matrix. While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly defined in terms of the annotator parameters (see § 5 for more details). The rest of the paper is organized as follows. For ease of exposition we start with binary labels (§ 2) and later extend it to categorical (§ 3) and ordinal labels (§ 4). We first specify the annotator model used, formalize the notion of a spammer, and propose an appropriate score in terms of the annotator model parameters. We do not dwell too much on the estimation of the annotator model parameters. These parameters can either be estimated directly using known gold standard 1 or the iterative algorithms that estimate the annotator model parameters without actually knowing the gold standard [3, 9, 2, 6, 7]. In the experimental section (§ 6) we obtain rankings for the annotators using the proposed spammer scores on some publicly available data from different domains. 2 Spammer score for crowdsourced binary labels j Annotator model Let yi ∈ {0, 1} be the label assigned to the ith instance by the j th annotator, and let yi ∈ {0, 1} be the actual (unobserved) binary label. We model the accuracy of the annotator separately on the positive and the negative examples. If the true label is one, the sensitivity (true positive rate) αj for the j th annotator is defined as the probability that the annotator labels it as one. j αj := Pr[yi = 1|yi = 1]. On the other hand, if the true label is zero, the specificity (1−false positive rate) β j is defined as the probability that annotator labels it as zero. j β j := Pr[yi = 0|yi = 0]. Extensions of this basic model have been proposed to include item level difficulty [2, 13] and also to model the annotator performance based on the feature vector [14]. For simplicity we use the basic model proposed in [7] in our formulation. Based on many instances labeled by multiple annotators the maximum likelihood estimator for the annotator parameters (αj , β j ) and also the consensus ground truth (yi ) can be estimated iteratively [3, 7] via the Expectation Maximization (EM) algorithm. The EM algorithm iteratively establishes a particular gold standard (initialized via majority voting), measures the performance of the annotators given that gold standard (M-step), and refines the gold standard based on the performance measures (E-step). Who is a spammer? Intuitively, a spammer assigns labels randomly—maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or maybe a bot pretending to be a human annotator. More precisely an annotator is a spammer if the probability j of observed label yi being one given the true label yi is independent of the true label, i.e., j j Pr[yi = 1|yi ] = Pr[yi = 1]. This means that the annotator is assigning labels randomly by flipping a coin with bias without actually looking at the data. Equivalently (1) can be written as j j Pr[yi = 1|yi = 1] = Pr[yi = 1|yi = 0] which implies αj = 1 − β j . (1) j Pr[yi = 1] (2) Hence in the context of the annotator model defined earlier a perfect spammer is an annotator for whom αj + β j − 1 = 0. This corresponds to the diagonal line on the Receiver Operating Characteristic (ROC) plot (see Figure 1(a)) 2 . If αj + β j − 1 < 0 then the annotators lies below the diagonal line and is a malicious annotator who flips the labels. Note that a malicious annotator has discriminatory power if we can detect them and flip their labels. In fact the methods proposed in [3, 7] can automatically flip the labels for the malicious annotators. Hence we define the spammer score for an annotator as S j = (αj + β j − 1)2 (3) An annotator is a spammer if S j is close to zero. Good annotators have S j > 0 while a perfect annotator has S j = 1. 1 One of the commonly used strategy to filter out spammers is to inject some items into the annotations with known labels. This is the strategy used by CrowdFlower (http://crowdflower.com/docs/gold). 2 Also note that (αj + β j )/2 is equal to the area shown in the plot and can be considered as a non-parametric approximation to the area under the ROC curve (AUC) based on one observed point. It is also equal to the Balanced Classification Rate (BCR). So a spammer can also be defined as having BCR or AUC equal to 0.5. 2 Equal accuracy contours (prevalence=0.5) 0.9 1 Good Annotators Biased Annotators 0.8 0.9 Sensitivity Sensitivity ( αj ) Spammers 0.5 0.4 j 0.3 j 0.6 0.5 0.4 4 0. 5 0. 3 0. 7 0. 6 0. 4 0. 5 0. 2 0. 3 0. Malicious Annotators 0.2 0.4 0.6 1−Specificity ( βj ) 0.8 0.1 1 0. 0. 2 0. 3 0. 1 0. 0.6 0.5 1 0. 2 0. 2 0. 1 0. 0.4 3 1 0. 0. 2 0. 4 0. 0.2 4 0. 5 0. 0 0 1 3 0. 0.3 0.2 Biased Annotators 4 0.7 6 0. 4 0. 0.8 7 0.3 Area = (α +β )/2 0.2 0 0 0.9 0. 8 0. 8 0. 0.7 6 0. .5 0 5 0. 0.7 [ 1−βj, αj ] 0.6 0.1 6 0. 0.8 0.7 Equal spammer score contours 1 7 0. 8 0. 9 0. Sensitivity 1 (a) Binary annotator model 0.1 1 0. 2 0. 3 0. 0.2 0.4 0.6 1−Specificity 0.8 1 1 0. 0 0 (b) Accuracy 0.2 3 0. 4 0. 0.4 0.6 1−Specificity 5 0. .6 7 0 0. 8 0. 0.8 1 (c) Spammer score Figure 1: (a) For binary labels an annotator is modeled by his/her sensitivity and specificity. A perfect spammer lies on the diagonal line on the ROC plot. (b) Contours of equal accuracy (4) and (c) equal spammer score (3). Accuracy This notion of a spammer is quite different for that of the accuracy of an annotator. An annotator with high accuracy is a good annotator but one with low accuracy is not necessarily a spammer. The accuracy is computed as 1 j Accuracyj = Pr[yi = yi ] = j Pr[yi = 1|yi = k]Pr[yi = k] = αj p + β j (1 − p), (4) k=0 where p := Pr[yi = 1] is the prevalence of the positive class. Note that accuracy depends on prevalence. Our proposed spammer score does not depend on prevalence and essentially quantifies the annotator’s inherent discriminatory power. Figure 1(b) shows the contours of equal accuracy on the ROC plot. Note that annotators below the diagonal line (malicious annotators) have low accuracy. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we can detect them and then correct for the flipping. In fact the EM algorithms [3, 7] can correctly flip the labels for the malicious annotators and hence they should not be treated as spammers. Figure 1(c) also shows the contours of equal score for our proposed score and it can be seen that the malicious annotators have a high score and only annotators along the diagonal have a low score (spammers). Log-odds Another interpretation of a spammer can be seen from the log odds. Using Bayes’ rule the posterior log-odds can be written as log j Pr[yi = 1|yi ] Pr[yi = j 0|yi ] = log j Pr[yi |yi = 1] j Pr[yi |yi = 0] + log p . 1−p Pr[y =1|y j ] p If an annotator is a spammer (i.e., (2) holds) then log Pr[yi =0|yi ] = log 1−p . Essentially the annotator j i i provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label. 3 Spammer score for categorical labels Annotator model Suppose there are K ≥ 2 categories. We introduce a multinomial parameter αj = (αj , . . . , αj ) for each annotator, where c c1 cK K j αj := Pr[yi = k|yi = c] ck αj = 1. ck and k=1 αj ck The term denotes the probability that annotator j assigns class k to an instance given that the true class is c. When K = 2, αj and αj are sensitivity and specificity, respectively. 11 00 Who is a spammer? As earlier a spammer assigns labels randomly, i.e., j j Pr[yi = k|yi ] = Pr[yi = k], ∀k. 3 j j This is equivalent to Pr[yi = k|yi = c] = Pr[yi = k|yi = c ], ∀c, c , k = 1, . . . , K— which means knowing the true class label being c or c does not change the probability of the annotator’s assigned label. This indicates that the annotator j is a spammer if αj = αj k , ∀c, c , k = 1, . . . , K. ck c (5) Let Aj be the K × K confusion rate matrix with entries [Aj ]ck = αck —a spammer would have 0.50 0.50 0.50 all the rows of Aj equal, for example, Aj = 0.25 0.25 0.25 0.25 0.25 0.25 , for a three class categorical annotation problem. Essentially Aj is a rank one matrix of the form Aj = evj , for some column vector vj ∈ RK that satisfies vj e = 1, where e is column vector of ones. In the binary case we had this natural notion of spammer as an annotator for whom αj + β j − 1 was close to zero. One natural way to summarize (5) would be in terms of the distance (Frobenius norm) of the confusion matrix to the closest rank one approximation, i.e, S j := Aj − eˆj v 2 F, (6) where ˆj solves v ˆj = arg min Aj − evj v vj 2 F s.t. vj e = 1. (7) Solving (7) yields ˆj = (1/K)Aj e, which is the mean of the rows of Aj . Then from (6) we have v Sj = I− 1 ee K 2 Aj = F 1 K (αj − αj k )2 . ck c c < . . . < K. Annotator model It is conceptually easier to think of the true label to be binary, that is, yi ∈ {0, 1}. For example in mammography a lesion is either malignant (1) or benign (0) (which can be confirmed by biopsy) and the BIRADS ordinal scale is a means for the radiologist to quantify the uncertainty based on the digital mammogram. The radiologist assigns a higher value of the label if he/she thinks the true label is closer to one. As earlier we characterize each annotator by the sensitivity and the specificity, but the main difference is that we now define the sensitivity and specificity for j each ordinal label (or threshold) k ∈ {1, . . . , K}. Let αj and βk be the sensitivity and specificity k th respectively of the j annotator corresponding to the threshold k, that is, j j j αj = Pr[yi ≥ k | yi = 1] and βk = Pr[yi < k | yi = 0]. k j j Note that αj = 1, β1 = 0 and αj 1 K+1 = 0, βK+1 = 1 from this definition. Hence each annotator j j is parameterized by a set of 2(K − 1) parameters [αj , β2 , . . . , αj , βK ]. This corresponds to an 2 K empirical ROC curve for the annotator (Figure 2). 4 Who is a spammer? As earlier we define an an1 j k=1 notator j to be a spammer if Pr[yi = k|yi = 1] = 0.9 j k=2 0.8 Pr[yi = k|yi = 0] ∀k = 1, . . . , K. Note that from j 0.7 k=3 [ 1−β , α ] the annotation model we have 3 Pr[yi = k | yi = 0.6 j j j 1] = αk − αk+1 and Pr[yi = k | yi = 0] = 0.5 k=4 j j 0.4 βk+1 − βk . This implies that annotator j is a spam0.3 j j mer if αj − αj k k+1 = βk+1 − βk , ∀k = 1, . . . , K, 0.2 j j 0.1 which leads to αj + βk = αj + β1 = 1, ∀k. This 1 k j 0 0 0.2 0.4 0.6 0.8 1 means that for every k, the point (1 − βk , αj ) lies on k 1−Specificity ( β ) the diagonal line in the ROC plot shown in Figure 2. The area under the empirical ROC curve can be comFigure 2: Ordinal labels: An annotator is modK 1 puted as (see Figure 2) AUCj = 2 k=1 (αj + eled by sensitivity/specificity for each threshold. k+1 j j αj )(βk+1 − βk ), and can be used to define the folk lowing spammer score as (2AUCj − 1)2 to rank the different annotators. 3 Sensitivity ( αj ) 3 j 2 K (αj k+1 k=1 j S = + j αj )(βk+1 k − j βk ) −1 (9) With two levels this expression defaults to the binary case. An annotator is a spammer if S j is close to zero. Good annotators have S j > 0 while a perfect annotator has S j = 1. 5 Previous work Recently Ipeirotis et.al. [4] proposed a score for categorical labels based on the expected cost of the posterior label. In this section we briefly describe their approach and compare it with our proposed score. For each instance labeled by the annotator they first compute the posterior (soft) label j j Pr[yi = c|yi ] for c = 1, . . . , K, where yi is the label assigned to the ith instance by the j th annotator and yi is the true unknown label. The posterior label is computed via Bayes’ rule as j j j Pr[yi = c|yi ] ∝ Pr[yi |yi = c]Pr[yi = c] = (αj )δ(yi ,k) pc , where pc = Pr[yi = c] is the prevack lence of class c. The score for a spammer is based on the intuition that the posterior label vector j j (Pr[yi = 1|yi ], . . . , Pr[yi = K|yi ]) for a good annotator will have all the probability mass concentrated on single class. For example for a three class problem (with equal prevalence), a posterior label vector of (1, 0, 0) (certain that the class is one) comes from a good annotator while a (1/3, 1/3, 1/3) (complete uncertainty about the class label) comes from spammer. Based on this they define the following score for each annotator 1 Score = N N K K j j costck Pr[yi = k|yi ]Pr[yi = c|yi ] j i=1 . (10) c=1 k=1 where costck is the misclassification cost when an instance of class c is classified as k. Essentially this is capturing some sort of uncertainty of the posterior label averaged over all the instances. Perfect workers have a score Scorej = 0 while spammers will have high score. An entropic version of this score based on similar ideas has also been recently proposed in [5]. Our proposed spammer score differs from this approach in the following aspects: (1) Implicit in the score defined above (10) j is the assumption that an annotator is a spammer when Pr[yi = c|yi ] = Pr[yi = c], i.e., the estimated posterior labels are simply based on the prevalence and do not depend on the observed labels. By j j Bayes’ rule this is equivalent to Pr[yi |yi = c] = Pr[yi ] which is what we have used to define our spammer score. (2) While both notions of a spammer are equivalent, the approach of [4] first computes the posterior labels based on the observed data, the class prevalence and the annotator j j j j This can be seen as follows: Pr[yi = k | yi = 1] = Pr[(yi ≥ k) AND (yi < k + 1) | yi = 1] = Pr[yi ≥ j j j j k | yi = 1] + Pr[yi < k + 1 | yi = 1] − Pr[(yi ≥ k) OR (yi < k + 1) | yi = 1] = Pr[yi ≥ k | yi = j j j 1] − Pr[yi ≥ k + 1 | yi = 1] = αj − αj . Here we used the fact that Pr[(yi ≥ k) OR (yi < k + 1)] = 1. k k+1 3 5 simulated | 500 instances | 30 annotators simulated | 500 instances | 30 annotators 1 12 0.8 Spammer Score 18 0.6 0.5 22 24 23 25 0.3 29 20 0.2 0.4 0.2 30 16 14 0.1 26 21 27 28 19 0 0 13 0 0.2 0.4 0.6 1−Specificity 0.8 1 500 500 500 500 500 500 500 500 500 500 0.4 0.6 500 500 500 1 0.7 500 500 500 500 500 500 500 500 500 500 500 500 500 500 3 1 500 500 500 2 8 510 7 17 4 9 27 8 30 6 3 28 7 10 2 23 22 26 24 5 1 21 29 25 14 12 17 11 18 20 19 15 16 13 4 0.8 Sensitivity 6 9 0.9 15 11 Annotator (a) Simulation setup (b) Annotator ranking Annotator rank (median) via accuracy simulated | 500 instances | 30 annotators Annotator rank (median) via Ipeirotis et.al.[4] simulated | 500 instances | 30 annotators 27 30 30 28 23 22 26 25 24 21 29 25 20 14 17 18 15 20 16 15 11 13 12 19 1 10 5 10 2 7 6 3 5 8 9 4 0 0 5 10 15 20 25 Annotator rank (median) via spammer score 30 (c) Comparison with accuracy 30 18 20 16 15 19 13 12 25 11 14 17 25 20 29 21 51 15 26 22 23 24 10 2 7 10 28 6 3 8 30 5 27 9 4 0 0 5 10 15 20 25 Annotator rank (median) via spammer score 30 (d) Comparison with Ipeirotis et. al. [4] Figure 3: (a) The simulation setup consisting of 10 good annotators (annotators 1 to 10), 10 spammers (11 to 20), and 10 malicious annotators (21 to 30). (b) The ranking of annotators obtained using the proposed spammer score. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown—obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. (c) and (d) Comparison of the median rank obtained via the spammer score with the rank obtained using (c) accuracy and (d) the method proposed by Ipeirotis et. al. [4]. parameters and then computes the expected cost. Our proposed spammer score does not depend on the prevalence of the class. Our score is also directly defined only in terms of the annotator confusion matrix and does not need the observed labels. (3) For the score defined in (10) while perfect annotators have a score of 0 it is not clear what should be a good baseline for a spammer. The authors suggest to compute the baseline by assuming that a worker assigns as label the class with maximum prevalence. Our proposed score has a natural scale with a perfect annotator having a score of 1 and a spammer having a score of 0. (4) However one advantage of the approach in [4] is that they can directly incorporate varied misclassification costs. 6 Experiments Ranking annotators based on the confidence interval As mentioned earlier the annotator model parameters can be estimated using the iterative EM algorithms [3, 7] and these estimated annotator parameters can then be used to compute the spammer score. The spammer score can then be used to rank the annotators. However one commonly observed phenomenon when working with crowdsourced data is that we have a lot of annotators who label only a very few instances. As a result the annotator parameters cannot be reliably estimated for these annotators. In order to factor this uncertainty in the estimation of the model parameters we compute the spammer score for 100 bootstrap replications. Based on this we compute the 95% confidence intervals (CI) for the spammer score for each annotator. We rank the annotators based on the lower limit of the 95% CI. The CIs are wider 6 Table 1: Datasets N is the number of instances. M is the number of annotators. M ∗ is the mean/median number of annotators per instance. N ∗ is the mean/median number of instances labeled by each annotator. Dataset Type N M M∗ N∗ bluebird binary 108 39 39/39 108/108 temp binary 462 76 10/10 61/16 Brief Description wsd categorical/3 177 34 10/10 52/20 sentiment categorical/3 1660 33 6/6 291/175 30 100 10 38 10/10 10/10 bird identification [12] The annotator had to identify whether there was an Indigo Bunting or Blue Grosbeak in the image. event annotation [10] Given a dialogue and a pair of verbs annotators need to label whether the event described by the first verb occurs before or after the second. 30/30 26/20 30 30 30 30 30 30 3 30 30 1 30 20 20 20 20 20 77 117 20 60 Spammer Score 0.4 10 7 9 8 6 5 0 2 0.2 13 31 10 23 29 1 2 4 6 8 9 14 15 17 22 32 5 18 16 19 11 12 20 21 24 25 26 27 28 30 33 34 7 3 0 0.6 20 20 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 20 20 20 20 17 17 40 20 20 100 Spammer Score 0.4 108 108 108 108 108 108 108 108 108 108 108 108 108 0.8 0.6 0.2 17 8 27 30 25 35 1 12 32 37 38 16 22 9 29 15 20 19 5 39 3 21 23 14 2 10 24 7 33 13 36 31 4 34 28 18 11 6 26 0.2 30 77 77 4 108 108 108 108 0.4 0 wosi | 30 instances | 10 annotators 1 0.8 108 108 0.6 108 108 108 Spammer Score 0.8 wsd | 177 instances | 34 annotators 1 80 177 157 177 157 bluebird | 108 instances | 39 annotators 1 word similarity [10] Numeric judgements of word similarity. affect recognition [10] Each annotator is presented with a short headline and asked to rate it on a scale [-100,100] to denote the overall positive or negative valence. 40 40 20 ordinal/[0 10] ordinal[-100 100] 20 20 20 wosi valence word sense disambiguation [10] The labeler is given a paragraph of text containing the word ”president” and asked to label one of the three appropriate senses. irish economic sentiment analysis [1] Articles from three Irish online news sources were annotated by volunteer users as positive, negative, or irrelevant. 20 20 20 20 20 60 20 20 20 40 40 100 0.4 Annotator Annotator 0 1 26 10 18 28 15 5 36 23 12 8 32 31 38 13 17 27 11 2 35 24 19 9 6 30 33 37 14 29 4 3 20 34 22 25 7 16 21 40 20 0.2 26 2 6 11 5 14 3 20 9 22 31 10 12 18 8 13 30 4 1 29 19 17 27 28 21 15 25 23 7 33 16 24 32 10 132 10 360 10 0 13 18 52 75 33 32 12 74 31 51 41 55 7 14 70 42 58 65 43 1 10 47 61 73 25 37 76 67 24 46 54 48 39 56 15 62 68 44 53 64 40 9 28 6 2 57 3 4 5 8 11 16 17 19 20 21 22 23 26 27 29 30 34 35 36 38 45 49 50 59 60 63 66 69 71 72 442 462 452 10 10 0.6 238 171 75 654 20 0.2 0.2 0 12 77 67 374 249 229 453 346 428 0.4 Spammer Score 43 175 119 541 525 437 0.8 917 104 284 0.4 0.6 1211 1099 10 Spammer Score 0.8 572 30 52 402 60 0.6 30 Spammer Score 0.8 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 60 40 20 15 7 7 11 12 35 29 1 87 10 10 10 10 10 10 10 12 1 30 10 10 10 10 10 10 10 10 10 10 10 10 10 valence | 100 instances | 38 annotators 20 22 10 10 10 10 sentiment | 1660 instances | 33 annotators 10 10 30 20 10 Annotator temp | 462 instances | 76 annotators 1 Annotator 10 50 10 10 40 10 70 350 80 40 100 192 190 40 32 60 70 20 20 40 80 20 50 50 50 30 Annotator Annotator Figure 4: Annotator Rankings The rankings obtained for the datasets in Table 1. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI) are shown—obtained from 100 bootstrap replications. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances. when the annotator labels only a few instances. For a crowdsourced labeling task the annotator has to be good and also label a reasonable number of instances in order to be reliably identified. Simulated data We first illustrate our proposed spammer score on simulated binary data (with equal prevalence for both classes) consisting of 500 instances labeled by 30 annotators of varying sensitivity and specificity (see Figure 3(a) for the simulation setup). Of the 30 annotators we have 10 good annotators (annotators 1 to 10 who lie above the diagonal in Figure 3(a)), 10 spammers (annotators 11 to 20 who lie around the diagonal), and 10 malicious annotators (annotators 21 to 30 who lie below the diagonal). Figure 3(b) plots the ranking of annotators obtained using the proposed spammer score with the annotator model parameters estimated via the EM algorithm [3, 7]. The spammer score ranges from 0 to 1, the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence interval (CI) obtained via bootstrapping are shown. The annotators are ranked based on the lower limit of the 95% CI. As can be seen all the spammers (annotators 11 to 20) have a low spammer score and appear at the bottom of the list. The malicious annotators have higher score than the spammers since we can correct for their flipping. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we detect that they are malicious. Figure 3(c) compares the (median) rank obtained via the spammer score with the (median) rank obtained using accuracy as the score to rank the annotators. While the good annotators are ranked high by both methods the accuracy score gives a low rank to the malicious annotators. Accuracy does not capture the notion of a spammer. Figure 3(d) compares the ranking with the method proposed by Ipeirotis et. al. [4] which gives almost similar rankings as our proposed score. 7 21 23 10 6 35 4 34 1126 18 147 30 3 31 13 2436 33 25 5 2 20 15 39 19 15 20 28 22 299 12 37 16 38 10 32 1 5 27 25 35 30 8 17 0 0 5 10 15 20 25 30 35 Annotator rank (median) via spammer score 40 bluebird | 108 instances | 39 annotators 40 1 6 34 112618 4 31 1013 7 30 2 28 21 5 20 15 39 19 20 15 22 37 16 299 12 38 10 5 8 17 0 0 27 25 35 30 0.6 0.5 35 32 2 0.4 36 11 13 31 24 10 33 28 21 26 18 0 0 40 34 15 19 39 0.1 (a) 22 37 20 38 29 9 0.2 5 10 15 20 25 30 35 Annotator rank (median) via spammer score 6 4 16 0.3 32 1 7 30 25 1 3 14 3 27 5 0.7 24 33 14 36 23 25 12 8 0.9 17 0.8 35 Sensitivity Annotator rank (median) via accuracy bluebird | 108 instances | 39 annotators Annotator rank (median) via Ipeirotis et.al.[4] bluebird | 108 instances | 39 annotators 40 23 0.2 0.4 0.6 1−Specificity (b) 0.8 1 (c) Figure 5: Comparison of the rank obtained via the spammer score with the rank obtained using (a) accuracy and (b) the method proposed by Ipeirotis et. al. [4] for the bluebird binary dataset. (c) The annotator model parameters as estimated by the EM algorithm [3, 7]. 19 25 12 18 7 3 14 20 32 5 8 1 16 20 9 21 15 34 10 31 29 17 28 22 26 2315 5 2 0 0 4 6 13 10 5 10 15 20 25 30 Annotator rank (median) via spammer score 35 30 25 16 19 7 25 8 9 27 14 3 28 17 18 32 5 10 4 2 10 6 1529 31 23 22 21 15 0 0 33 30 11 1 20 5 sentiment | 1660 instances | 33 annotators 24 35 12 20 24 34 26 13 5 10 15 20 25 30 Annotator rank (median) via spammer score 35 33 7 30 15 17 25 28 2719 2223 20 8 1 4 1812 15 13 10 20 32 30 10 3 29 9 31 16 5 6 2 5 14 11 26 0 0 5 10 15 20 25 30 Annotator rank (median) via spammer score 25 21 Annotator rank (median) via Ipeirotis et.al.[4] 25 24 27 Annotator rank (median) via accuracy Annotator rank (median) via accuracy 30 sentiment | 1660 instances | 33 annotators wsd | 177 instances | 34 annotators 33 30 11 Annotator rank (median) via Ipeirotis et.al.[4] wsd | 177 instances | 34 annotators 35 7 30 15 19 17 27 25 21 25 8 12 4 18 20 24 15 20 33 10 3 13 9 28 1 29 23 10 1632 11 14 5 6 2 5 31 30 22 26 0 0 5 10 15 20 25 30 Annotator rank (median) via spammer score Figure 6: Comparison of the median rank obtained via the spammer score with the rank obtained using accuracy and he method proposed by Ipeirotis et. al. [4] for the two categorial datasets in Table 1. Mechanical Turk data We report results on some publicly available linguistic and image annotation data collected using the Amazon’s Mechanical Turk (AMT) and other sources. Table 1 summarizes the datasets. Figure 4 plots the spammer scores and rankings obtained. The mean and the 95% CI obtained via bootstrapping are also shown. The number at the top of the CI bar shows the number of instances annotated by that annotator. The rankings are based on the lower limit of the 95% CI which factors the number of instances labeled by the annotator into the ranking. An annotator who labels only a few instances will have very wide CI. Some annotators who label only a few instances may have a high mean spammer score but the CI will be wide and hence ranked lower. Ideally we would like to have annotators with a high score and at the same time label a lot of instances so that we can reliablly identify them. The authors [1] for the sentiment dataset shared with us some of the qualitative observations regarding the annotators and they somewhat agree with our rankings. For example the authors made the following comments about Annotator 7 ”Quirky annotator - had a lot of debate about what was the meaning of the annotation question. I’d say he changed his labeling strategy at least once during the process”. Our proposed score gave a low rank to this annotator. Comparison with other approaches Figure 5 and 6 compares the proposed ranking with the rank obtained using accuracy and the method proposed by Ipeirotis et. al. [4] for some binary and categorical datasets in Table 1. Our proposed ranking is somewhat similar to that obtained by Ipeirotis et. al. [4] but accuracy does not quite capture the notion of spammer. For example for the bluebird dataset for annotator 21 (see Figure 5(a)) accuracy ranks it at the bottom of the list while the proposed score puts is in the middle of the list. From the estimated model parameters it can be seen that annotator 21 actually flips the labels (below the diagonal in Figure 5(c)) but is a good annotator. 7 Conclusions We proposed a score to rank annotators for crowdsourced binary, categorical, and ordinal labeling tasks. The obtained rankings and the scores can be used to allocate monetary bonuses to be paid to different annotators and also to eliminate spammers from further labeling tasks. A mechanism to rank annotators should be desirable feature of any crowdsourcing service. The proposed score should also be useful to specify the prior for Bayesian approaches to consolidate annotations. 8 References [1] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS’10), 2010. [2] B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. [3] A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. [4] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP’10), pages 64–67, 2010. [5] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labelling tasks. In Proceedings of the Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011. [6] V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 889– 896, 2009. [7] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010. [8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. [9] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085–1092. 1995. [10] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is it good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 254–263, 2008. [11] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. [12] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432. 2010. [13] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. [14] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 932–939, 2010. 9
3 0.6449579 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
Author: Luca Oneto, Davide Anguita, Alessandro Ghio, Sandro Ridella
Abstract: We derive here new generalization bounds, based on Rademacher Complexity theory, for model selection and error estimation of linear (kernel) classifiers, which exploit the availability of unlabeled samples. In particular, two results are obtained: the first one shows that, using the unlabeled samples, the confidence term of the conventional bound can be reduced by a factor of three; the second one shows that the unlabeled samples can be used to obtain much tighter bounds, by building localized versions of the hypothesis class containing the optimal classifier. 1
4 0.6402927 277 nips-2011-Submodular Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: In this paper we present an algorithm to learn a multi-label classifier which attempts at directly optimising the F -score. The key novelty of our formulation is that we explicitly allow for assortative (submodular) pairwise label interactions, i.e., we can leverage the co-ocurrence of pairs of labels in order to improve the quality of prediction. Prediction in this model consists of minimising a particular submodular set function, what can be accomplished exactly and efficiently via graph-cuts. Learning however is substantially more involved and requires the solution of an intractable combinatorial optimisation problem. We present an approximate algorithm for this problem and prove that it is sound in the sense that it never predicts incorrect labels. We also present a nontrivial test of a sufficient condition for our algorithm to have found an optimal solution. We present experiments on benchmark multi-label datasets, which attest the value of the proposed technique. We also make available source code that enables the reproduction of our experiments. 1
5 0.62116045 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
6 0.59740973 240 nips-2011-Robust Multi-Class Gaussian Process Classification
7 0.5772537 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
8 0.56032258 59 nips-2011-Composite Multiclass Losses
9 0.5596565 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
10 0.55843872 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
11 0.55800331 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions
12 0.54435098 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
13 0.540196 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
14 0.53932357 14 nips-2011-A concave regularization technique for sparse mixture models
15 0.5320825 288 nips-2011-Thinning Measurement Models and Questionnaire Design
16 0.52211207 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
17 0.50757056 4 nips-2011-A Convergence Analysis of Log-Linear Training
18 0.49839124 28 nips-2011-Agnostic Selective Classification
19 0.49837443 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
20 0.49830386 53 nips-2011-Co-Training for Domain Adaptation
topicId topicWeight
[(0, 0.026), (3, 0.295), (4, 0.032), (20, 0.027), (26, 0.014), (31, 0.066), (33, 0.034), (43, 0.07), (45, 0.13), (57, 0.045), (65, 0.016), (74, 0.054), (83, 0.054), (99, 0.049)]
simIndex simValue paperId paperTitle
1 0.76614994 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
Author: David Adametz, Volker Roth
Abstract: A Bayesian approach to partitioning distance matrices is presented. It is inspired by the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call fastTIWD) overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n3 ) in the TIWD to O(n2 ). Our experiments show that the cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to ‘mine’ large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters. 1
2 0.76451302 101 nips-2011-Gaussian process modulated renewal processes
Author: Yee W. Teh, Vinayak Rao
Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1
same-paper 3 0.73174042 33 nips-2011-An Exact Algorithm for F-Measure Maximization
Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier
Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1
4 0.69402432 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
Author: Jun Zhu, Ning Chen, Eric P. Xing
Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
5 0.54764521 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.54515868 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
7 0.54189986 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
8 0.54180646 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.54161316 271 nips-2011-Statistical Tests for Optimization Efficiency
10 0.54149115 258 nips-2011-Sparse Bayesian Multi-Task Learning
11 0.54031187 276 nips-2011-Structured sparse coding via lateral inhibition
12 0.53813279 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
13 0.53734052 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
14 0.53609765 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
15 0.5359205 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
16 0.53590238 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
17 0.53552234 265 nips-2011-Sparse recovery by thresholded non-negative least squares
18 0.53549278 244 nips-2011-Selecting Receptive Fields in Deep Networks
19 0.53538078 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
20 0.53521705 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation