
198 nips-2011-On U-processes and clustering performance

Source: pdf

Author: Stéphan J. Clémençon

Abstract: Many clustering techniques aim at optimizing empirical criteria that take the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within-cluster point scatter over a class of partitions of the feature space. The purpose of this paper is to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order $O_P(1/\sqrt{n})$. Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 On U-processes and clustering performance. Stéphan Clémençon, LTCI UMR Telecom ParisTech/CNRS No. [sent-1, score-0.328]

2 Abstract: Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. [sent-4, score-0.432]

3 Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within cluster point scatter over a class of partitions of the feature space. [sent-5, score-0.375]

4 It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. [sent-6, score-0.371]

5 In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk is proved to be of the order $O_P(1/\sqrt{n})$. [sent-7, score-1.044]

6 Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter rate bounds. [sent-8, score-0.255]

7 Model selection issues, related to the number of clusters forming the data partition in particular, are also considered. [sent-9, score-0.251]

8 1 Introduction. In cluster analysis, the objective is to segment a dataset into subgroups such that data points in the same subgroup are more similar to each other (in a sense that will be specified) than to those in other subgroups. [sent-10, score-0.051]

9 Given the wide range of applications of the clustering paradigm, numerous data segmentation procedures have been introduced in the machine-learning literature (see Chapter 14 in [HTF09] and Chapter 8 in [CFZ09] for recent overviews of "off-the-shelf" clustering techniques). [sent-11, score-0.656]

10 It is the goal of this paper to establish a general statistical framework for investigating clustering performance. [sent-15, score-0.452]

11 The present analysis is based on the observation that many statistical criteria for measuring clustering accuracy are (symmetric) U-statistics of degree two, functions of a matrix of dissimilarities between pairs of data points. [sent-16, score-0.406]

12 paradigm (ERM) can be extended to situations where natural estimates of the risk are U-statistics. [sent-22, score-0.216]

13 In this way, we establish here a rate bound of order $O_P(1/\sqrt{n})$ for the excess of clustering risk of empirical minimizers under adequate complexity assumptions on the cells forming the partition candidates (the bias term is neglected in the present analysis). [sent-23, score-1.456]

14 A linearization technique, combined with sharper tail results in the case of degenerate U-processes, is also used to show that tighter rate bounds can be obtained. [sent-24, score-0.316]

15 Finally, it is shown how to use the upper bounds established in this analysis in order to deal with the problem of automatic model selection, that of selecting the number of clusters in particular, through complexity penalization. [sent-25, score-0.186]

16 In section 2, the notation is set out, a formal description of cluster analysis from the "pairwise dissimilarity" perspective is given, and the main theoretical concepts involved in the present analysis are briefly recalled. [sent-27, score-0.095]

17 In section 3, an upper bound for the performance of empirical minimization of the clustering risk is established in the context of general dissimilarity measures. [sent-28, score-0.903]

18 Section 4 shows how to refine the rate bound previously obtained by means of a recent inequality for degenerate U-processes, while section 5 deals with automatic selection of the optimal number of clusters. [sent-29, score-0.26]

19 Concepts pertaining to this theory and involved in the subsequent analysis are next recalled. [sent-32, score-0.087]

20 random vectors, valued in a high-dimensional feature space $\mathcal{X}$, typically a subset of the Euclidean space $\mathbb{R}^d$ with $d \gg 1$, with common probability distribution $\mu(dx)$. [sent-40, score-0.043]

21 With no loss of generality, we assume that the feature space X coincides with the support of the distribution µ(dx). [sent-41, score-0.043]

22 2 Cluster analysis The goal of clustering techniques is to partition the data (X1 , . [sent-51, score-0.434]

23 When equipped with a (Borelian) measure of dissimilarity $D : \mathcal{X}^2 \to \mathbb{R}_+^*$, the clustering task can be rigorously cast as the problem of minimizing the criterion $W_n(\mathcal{P}) = \frac{2}{n(n-1)} \sum_{k=1}^{K} \sum_{1 \le i < j \le n} D(X_i, X_j) \cdot \mathbb{I}\{(X_i, X_j) \in C_k^2\}$, (1) [...] > 0 for all $t > 0$. [sent-55, score-0.533]

24 This includes the so-termed "standard K-means" setup, where the dissimilarity measure coincides with the squared Euclidean norm (in this case, $p = 2$ and $\Phi(t) = t^2$ for $t \ge 0$). [sent-56, score-0.26]

25 It will be referred to as the clustering risk of the partition P, while its statistical counterpart (1) will be called the empirical clustering risk. [sent-62, score-1.05]
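
The empirical clustering risk (1) is a plain average over pairs of observations falling in the same cell, which the following minimal sketch makes concrete. The function names (`empirical_clustering_risk`, `squared_euclidean`) and the toy data are illustrative assumptions, not taken from the paper; a partition is represented by one cell label per observation.

```python
from itertools import combinations


def squared_euclidean(x, y):
    """Squared Euclidean dissimilarity between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def empirical_clustering_risk(points, labels, dissimilarity):
    """Average of D(X_i, X_j) over pairs i < j lying in the same cell,
    normalised by the total number of pairs n (n - 1) / 2."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:  # indicator that both points share a cell
            total += dissimilarity(points[i], points[j])
    return 2.0 * total / (n * (n - 1))


# Toy data (hypothetical): two well-separated groups on the real line.
points = [(0.0,), (1.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
risk = empirical_clustering_risk(points, labels, squared_euclidean)
```

Only within-cell pairs contribute to the sum, while the normalisation counts all pairs, which is what makes the criterion a U-statistic of degree two.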

26 Optimal partitions of the feature space $\mathcal{X}$ are defined as those that minimize $W(\mathcal{P})$. [sent-63, score-0.114]

27 Suppose we are given a (hopefully sufficiently rich) class $\Pi$ of partitions of the feature space $\mathcal{X}$. [sent-65, score-0.114]

28 Here we consider minimizers of the empirical risk $W_n$ over $\Pi$, i.e. [sent-66, score-0.375]

29 partitions $\mathcal{P}_n^*$ in $\Pi$ such that $W_n(\mathcal{P}_n^*) = \min_{\mathcal{P} \in \Pi} W_n(\mathcal{P})$. (3) [sent-68, score-0.114]

30 The design of practical algorithms for computing (approximately) empirical clustering risk minimizers is beyond the scope of this paper (refer to [HTF09] for an overview of "off-the-shelf" clustering methods). [sent-69, score-1.031]
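
To illustrate empirical risk minimization over a class of partitions, here is a small sketch under strong simplifying assumptions: the data are one-dimensional, the dissimilarity is the squared difference, and the class consists of two-cell partitions defined by a finite list of candidate thresholds. All names are hypothetical, and the brute-force search is not an algorithm proposed in the paper.

```python
from itertools import combinations


def empirical_risk(xs, labels):
    """Empirical clustering risk with squared-difference dissimilarity."""
    n = len(xs)
    total = sum(
        (xs[i] - xs[j]) ** 2
        for i, j in combinations(range(n), 2)
        if labels[i] == labels[j]
    )
    return 2.0 * total / (n * (n - 1))


def erm_threshold_partition(xs, thresholds):
    """Return (threshold, risk) minimising the empirical risk over the
    two-cell partitions {x <= t} and {x > t}, for t in `thresholds`."""
    best = None
    for t in thresholds:
        labels = [0 if x <= t else 1 for x in xs]
        r = empirical_risk(xs, labels)
        if best is None or r < best[1]:
            best = (t, r)
    return best


xs = [0.0, 1.0, 10.0, 12.0]
best_t, best_risk = erm_threshold_partition(xs, [0.5, 5.0, 11.0])
```

With a finite class the minimum in (3) is attained by exhaustive search; richer classes require dedicated optimization procedures, which the text explicitly leaves out of scope.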

31 Precisely, when non degenerate, it is asymptotically normal with limiting variance 4·Var(K(1) (X)) (refer to Chapter 5 in [Ser80] for an account of asymptotic analysis of U -statistics). [sent-85, score-0.112]

32 As shall be seen in section 4, the reduced variance property of U-statistics is crucial when it comes to establishing tight rate bounds. [sent-86, score-0.128]

33 , K} and placing ourselves in the situation where K ≥ 1 is less than the cardinality of $\mathcal{X}$, the U-statistic (1) is always non-degenerate, except in the (sole) case where $\mathcal{X}$ is made of exactly K elements and all the cells of $\mathcal{P}$ are singletons. [sent-91, score-0.164]

34 (6) As µ's support coincides with $\mathcal{X}$ and the separation property is fulfilled by D, the quantity above is zero iff $C_{k(x)} = \{x\}$. [sent-96, score-0.077]

35 In the non-degenerate case, notice finally that the asymptotic variance of $\sqrt{n}\{W_n(\mathcal{P}) - W(\mathcal{P})\}$ is equal to $4 \cdot \mathrm{Var}(D(X, C_{k(X)}))$, where we set $D(x, C) = \int_{x' \in C} D(x, x')\,\mu(dx')$ for all $x \in \mathcal{X}$ and any measurable set $C \subset \mathcal{X}$. [sent-97, score-0.272]

36 By definition, a U-process is a collection of U-statistics; one may refer to [dlPG99] for an account of the theory of U-processes. [sent-98, score-0.043]

37 Echoing the role played by the theory of empirical processes in the study of the ERM principle in binary classification, the control of the fluctuations of the U-process $\{W_n(\mathcal{P}) - W(\mathcal{P}) : \mathcal{P} \in \Pi\}$ indexed by a set $\Pi$ of partition candidates will naturally lie at the heart of the present analysis. [sent-99, score-0.34]

38 As shall be seen below, this can be achieved mainly by the means of the Hoeffding representations of U -statistics, see [Hoe48]. [sent-100, score-0.036]

39 3 A bound for the excess of clustering risk. Here we establish an upper bound for the performance of an empirical minimizer of the clustering risk over a class $\Pi_K$ of partitions of $\mathcal{X}$ with K ≥ 1 cells, K being fixed here and supposed to be smaller than the cardinality of $\mathcal{X}$. [sent-101, score-1.552]

40 We denote by $W_K^*$ the clustering risk minimum over all partitions of $\mathcal{X}$ with K cells. [sent-102, score-0.658]

41 The following theorem reveals that the clustering performance of the empirical minimizer (3) is of the order $O_P(1/\sqrt{n})$, when neglecting the bias term (which depends solely on the richness of $\Pi_K$). [sent-104, score-0.503]

42 Theorem 1. Consider a class $\Pi_K$ of partitions with K ≥ 1 cells and suppose that: • there exists $B < \infty$ such that, for all $\mathcal{P}$ in $\Pi_K$ and any $C$ in $\mathcal{P}$, $\sup_{(x,x') \in C^2} D(x, x') \le B$, • the expectation of the Rademacher average $A_{K,n}$ is of the order $O(n^{-1/2})$. [sent-105, score-0.272]

43 For any empirical clustering risk minimizer $\mathcal{P}_n^*$, we have with probability at least $1 - \delta$: $\forall n \ge 2$, $W(\mathcal{P}_n^*) - W_K^* \le 4K\,\mathbb{E}[A_{K,n}] + 2BK\sqrt{\frac{2\log(1/\delta)}{n}} + \left(\inf_{\mathcal{P} \in \Pi_K} W(\mathcal{P}) - W_K^*\right) \le c(B, \delta) \cdot \frac{K}{\sqrt{n}} + \left(\inf_{\mathcal{P} \in \Pi_K} W(\mathcal{P}) - W_K^*\right)$, (8) for some constant $c(B, \delta) < \infty$, independent from n and K. [sent-107, score-0.798]

44 ’s, as that involved in the Rademacher average (7): 1 Wn (P) = n! [sent-113, score-0.044]

45 The main point lies in the fact that standard techniques in empirical process theory can be then used to control Wn (P) − W (P) uniformly over ΠK under adequate hypotheses, see the proof in the Appendix for technical details. [sent-115, score-0.238]

46 We underline that, naturally, the complexity assumption is also a crucial ingredient of the result stated above, and more generally to clustering consistency results, see Example 1 in [BvL09]. [sent-116, score-0.366]

47 We also point out that the ERM approach is by no means the sole method to obtain error bounds in the clustering context. [sent-117, score-0.408]

48 Just like in binary classification (see [KN02]), one may use a notion of stability of a clustering algorithm to establish such results, see [vL09, ST09] and the references therein. [sent-118, score-0.47]

49 Refer to [vLBD06, vLBD08] for error bounds proved through the stability approach. [sent-119, score-0.178]

50 Before showing how the bound for the excess of risk stated above can be improved, a few remarks are in order. [sent-120, score-0.347]

51 ) We point out that standard entropy metric arguments can be used in order to bound the expected value of the Rademacher average An , see [BBL05] for instance. [sent-122, score-0.038]

52 ) In the standard K-means approach, the dissimilarity measure is $D(x, x') = \|x - x'\|_2^2$ and partition candidates are indexed by a collection c of distinct "centers" $c_1$, . [sent-126, score-0.369]

53 , $C_K$ } with $C_k = \{x \in \mathcal{X} : \|x - c_k\|_2 = \min_{1 \le l \le K} \|x - c_l\|_2\}$ for $1 \le k \le K$ (with adequate distance-tie breaking). [sent-132, score-0.34]
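
The partition induced by a collection of centers can be sketched as a nearest-center assignment. The snippet below is an illustration only: the function name is hypothetical, and ties are broken in favour of the smallest center index, which is one possible choice of the "adequate distance-tie breaking" mentioned above, not necessarily the one used in the paper.

```python
def nearest_center_labels(points, centers):
    """Assign each point to the cell of its closest center (Euclidean norm).

    Ties are resolved by picking the smallest center index, an assumption
    made here for the sake of having a well-defined partition.
    """
    labels = []
    for x in points:
        # Squared distances give the same nearest center as distances.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        labels.append(dists.index(min(dists)))  # first index of the minimum
    return labels


points = [(0.0,), (1.0,), (10.0,), (12.0,), (5.75,)]
centers = [(0.5,), (11.0,)]
labels = nearest_center_labels(points, centers)
```

The last point is equidistant from both centers, so it lands in cell 0 under the tie-breaking rule assumed here.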

54 One may easily check that for this specific collection of partitions $\Pi_K$ and this choice for the dissimilarity measure, the class $F_{\Pi_K}$ is a VC major class with finite VC dimension, see section 19. [sent-133, score-0.288]

55 Additionally, it should be noticed that in most practical clustering procedures, center candidates are picked in a data-driven fashion, being taken as the averages of the observations lying in each cluster/cell. [sent-135, score-0.483]

56 In this respect, the M -estimation problem formulated here can be considered to a certain extent as closer to what is actually achieved by K-means clustering techniques in practice, than the usual formulation of the K-means problem (as an optimization problem over c = (c1 , . [sent-136, score-0.328]

57 ) Notice that, in practice, the measure D involved in (1) may depend on the data. [sent-141, score-0.044]

58 For scaling purposes, one could assign data-dependent weights $\omega = (\omega_i)_{1 \le i \le d}$ in a coordinatewise manner, leading to $D(x, x') = \sum_{i=1}^{d} (x_i - x'_i)^2 / \hat{\sigma}_i^2$ for instance, where $\hat{\sigma}_i^2$ denotes the sample variance related to the i-th coordinate. [sent-142, score-0.135]

59 Although the criterion reflecting the performance is not a U-statistic anymore, the theory we develop here can be straightforwardly used for investigating clustering accuracy in such a case. [sent-143, score-0.508]

60 Indeed, it is easy to control the difference between the latter and the U-statistic (1) with $D(x, x') = \sum_{i=1}^{d} (x_i - x'_i)^2 / \sigma_i^2$, the $\sigma_i^2$'s denoting the theoretical variances of µ's marginals, under adequate moment assumptions. [sent-144, score-0.278]
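
A data-dependent, variance-scaled dissimilarity of the kind discussed in this remark can be sketched as follows. The factory function name is hypothetical, and the sample variance is computed with Python's `statistics.variance` (denominator n - 1), which is an assumption about the estimator rather than something the text specifies.

```python
from statistics import variance


def make_scaled_dissimilarity(sample):
    """Build D(x, y) = sum_i (x_i - y_i)^2 / s_i^2, where s_i^2 is the sample
    variance of the i-th coordinate of `sample` (a list of equal-length tuples).

    Assumes every coordinate has non-zero sample variance.
    """
    d = len(sample[0])
    sigma2 = [variance([x[i] for x in sample]) for i in range(d)]

    def dissimilarity(x, y):
        return sum((a - b) ** 2 / s for a, b, s in zip(x, y, sigma2))

    return dissimilarity


# Toy sample (hypothetical): the second coordinate lives on a 10x larger scale.
sample = [(0.0, 0.0), (2.0, 20.0), (4.0, 40.0)]
scaled_d = make_scaled_dissimilarity(sample)
value = scaled_d(sample[0], sample[1])
```

After scaling, both coordinates contribute equally to the dissimilarity even though their raw scales differ by a factor of ten.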

61 (11) i=1 This supremum clearly has smaller mean and variance than (7). [sent-152, score-0.069]

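62 We also introduce the quantities: $Z = \sup_{C \in \mathcal{P},\, \mathcal{P} \in \Pi_K} \left| \sum_{i,j} \epsilon_i \epsilon_j H_C^{(2)}(X_i, X_j) \right|$, $U = \sup_{C \in \mathcal{P},\, \mathcal{P} \in \Pi_K} \sup_{\alpha:\, \sum_j \alpha_j^2 \le 1} \sum_{i,j} \epsilon_i \alpha_j H_C^{(2)}(X_i, X_j)$, $M = \sup_{C \in \mathcal{P},\, \mathcal{P} \in \Pi_K} \sup_{1 \le j \le n} \left| \sum_{i} \epsilon_i H_C^{(2)}(X_i, X_j) \right|$. [sent-153, score-1.085]

63 Theorem 2. Consider a class $\Pi_K$ of partitions with K cells and suppose that: • there exists $B < \infty$ such that $\sup_{(x,x') \in C^2} D(x, x') \le B$ for all $\mathcal{P} \in \Pi_K$, $C \in \mathcal{P}$. [sent-154, score-0.272]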

64 (13) The result above relies on the moment inequality for degenerate U -processes proved in [CLV08]. [sent-157, score-0.33]

65 5 Model selection - choosing the number of clusters. A crucial issue in data segmentation is to determine the number K of cells that best exhibits the clustering phenomenon in the data. [sent-162, score-0.514]

66 Here we consider a complexity regularization method that avoids recourse to such techniques and uses a data-dependent penalty term based on the analysis carried out above. [sent-164, score-0.038]

67 of collections of partitions of the feature space $\mathcal{X}$ such that, for all K ≥ 1, the elements of $\Pi_K$ are made of K cells and fulfill the assumptions of Theorem 1. [sent-168, score-0.272]

68 In order to avoid overfitting, consider the (data-driven) complexity penalty given by $\mathrm{pen}(n, K) = 3K\,\mathbb{E}_{\epsilon}[A_{K,n}] + \frac{27BK \log K}{n} + \sqrt{(2B \log K)/n}$ and the minimizer $\mathcal{P}_{\widehat{K},n}$ of the penalized empirical clustering risk, with $\widehat{K} = \arg\min_{K \ge 1} \{W_n(\mathcal{P}_{K,n}) + \mathrm{pen}(n, K)\}$ and $W_n(\mathcal{P}_{K,n}) = \min_{\mathcal{P} \in \Pi_K} W_n(\mathcal{P})$. (14) [sent-169, score-0.49]

69 The next result shows that the partition thus selected nearly achieves the performance that would be obtained with the help of an oracle revealing the value of the index K that minimizes $\mathbb{E}[W(\mathcal{P}_{K,n})] - W^*$, with $W^* = \inf_{\mathcal{P}} W(\mathcal{P})$. [sent-170, score-0.171]
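
The penalized selection rule can be sketched as below, reading the penalty as 3K times a Rademacher term, plus 27BK log(K)/n, plus the square root of 2B log(K)/n. This reading of the extracted formula is an assumption, as are the function names and the toy numbers; the Rademacher term is treated as a user-supplied estimate because its definition (7) is not part of this excerpt.

```python
import math


def penalty(n, K, B, rademacher_avg):
    """Complexity penalty for K cells, under the reading stated above.

    `rademacher_avg` stands in for the Rademacher average term and must be
    supplied by the caller (hypothetical input).
    """
    log_k = math.log(K)
    return 3 * K * rademacher_avg + 27 * B * K * log_k / n + math.sqrt(2 * B * log_k / n)


def select_number_of_clusters(empirical_risks, rademacher_avgs, n, B):
    """Pick K minimising (minimal empirical risk over Pi_K) + pen(n, K).

    `empirical_risks[K]` is assumed to hold the minimum of W_n over Pi_K.
    """
    scores = {
        K: empirical_risks[K] + penalty(n, K, B, rademacher_avgs[K])
        for K in empirical_risks
    }
    best_K = min(scores, key=scores.get)
    return best_K, scores


# Toy inputs (hypothetical): the risk drops sharply from K=1 to K=2, then flattens.
risks = {1: 5.0, 2: 1.0, 3: 0.95}
rads = {1: 0.01, 2: 0.01, 3: 0.01}
best_K, scores = select_number_of_clusters(risks, rads, n=1000, B=1.0)
```

In this toy setting the small gain in empirical risk from K=2 to K=3 does not offset the larger penalty, so the rule keeps two clusters.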

70 Theorem 3 (An oracle inequality). Suppose that, for all K ≥ 1, the assumptions of Theorem 1 are fulfilled. [sent-171, score-0.036]

71 The excess of risk of empirical minimizers of the clustering risk is proved to be of the order $O_P(n^{-1/2})$ under mild assumptions on the complexity of the cells forming the partition candidates. [sent-176, score-1.448]

72 It is also shown how to refine slightly this upper bound through a linearization technique and the use of recent inequalities for degenerate U-processes. [sent-177, score-0.264]

73 Although the improvement displayed here can appear as not very significant at first glance, our approach suggests that much sharper data-dependent bounds could be established this way. [sent-178, score-0.127]

74 Therefore, mimicking the argument of Corollary 3 in [CLV08], based on the so-termed ﬁrst Hoeffding’s representation of U -statistics (see Lemma A. [sent-185, score-0.062]

75 1 in [CLV08]), we may straightforwardly derive the lemma below. [sent-186, score-0.08]

76 Proposition 1 (Uniform deviations). Suppose that Theorem 1's assumptions are fulfilled. [sent-187, score-0.036]

77 With probability at least $1 - \delta$, we have: $\forall n \ge 2$, $\sup_{C \in \mathcal{P},\, \mathcal{P} \in \Pi_K} |U_n(C) - u(C)| \le 2\,\mathbb{E}[A_{K,n}] + B\sqrt{\frac{2\log(1/\delta)}{n}}$. [sent-189, score-0.217]

78 The argument follows in the footsteps of Corollary 3’s proof in [CLV08]. [sent-191, score-0.112]

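79 We have: $\mathbb{E}\left[\exp\left(\lambda \cdot \sup_C |U_n(C) - u(C)|\right)\right] \le \mathbb{E}\left[\exp\left(\lambda \cdot \sup_C |A_n(C) - u(C)|\right)\right]$. (18) [sent-195, score-0.506]

80 Now, using standard symmetrization and randomization tricks, one obtains that: $\forall \lambda > 0$, $\mathbb{E}\left[\exp\left(\lambda \cdot \sup_C |A_n(C) - u(C)|\right)\right] \le \mathbb{E}\left[\exp\left(2\lambda \cdot A_{K,n}\right)\right]$. (19) [sent-196, score-0.253]

81 Observing that the value of $A_{K,n}$ cannot change by more than $2B/n$ when one of the $(\epsilon_i, X_i, X_{i+\lfloor n/2 \rfloor})$'s is changed, while the others are kept fixed, the standard bounded differences inequality argument applies and yields: $\mathbb{E}\left[\exp\left(2\lambda \cdot A_{K,n}\right)\right] \le \exp\left(2\lambda \cdot \mathbb{E}[A_{K,n}] + \frac{\lambda^2 B^2}{2n}\right)$. (20) [sent-197, score-0.16]

82 Next, Markov's inequality with $\lambda = (t - 2\mathbb{E}[A_{K,n}])/B^2$ gives: $\mathbb{P}\{\sup_C |A_n(C) - u(C)| > t\} \le \exp\left(-n(t - 2\mathbb{E}[A_{K,n}])^2/(2B^2)\right)$. [sent-198, score-0.062]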
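
The step from this tail bound to a statement holding with probability at least 1 - δ can be spelled out as a short worked inversion. It assumes the bound reads as the exponential of -n(t - 2E[A_{K,n}])^2 / (2B^2) for t at least 2E[A_{K,n}]; setting the right-hand side equal to δ and solving for t gives:

```latex
\exp\!\left(-\frac{n\,\bigl(t - 2\,\mathbb{E}[A_{K,n}]\bigr)^2}{2B^2}\right) = \delta
\;\Longleftrightarrow\;
\bigl(t - 2\,\mathbb{E}[A_{K,n}]\bigr)^2 = \frac{2B^2 \log(1/\delta)}{n}
\;\Longleftrightarrow\;
t = 2\,\mathbb{E}[A_{K,n}] + B\sqrt{\frac{2\log(1/\delta)}{n}} .
```

This value of t matches the deviation term stated in Proposition 1.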

83 The rate bound is ﬁnally established by combining bounds (16) and (17). [sent-200, score-0.122]

84 Proof of Theorem 2 (sketch). The theorem can be proved by using the decomposition (10), applying the argument above in order to control $\sup_{\mathcal{P}} |L_n(\mathcal{P})|$ and the lemma below to handle the degenerate part. [sent-201, score-0.397]

85 The latter is based on a recent moment inequality for degenerate U-processes, proved in [CLV08]. [sent-202, score-0.33]

86 Lemma 2 (see Theorem 11 in [CLV08]). Suppose that Theorem 2's assumptions are fulfilled. [sent-204, score-0.036]

87 There exists a universal constant $C < \infty$ such that for all $\delta \in (0, 1)$, we have with probability at least $1 - \delta$: $\forall n \ge 2$, $\sup_{\mathcal{P} \in \Pi_K} |M_n(\mathcal{P})| \le K\kappa(n, \delta)$. [sent-205, score-0.252]

88 Proof of Theorem 3. The proof mimics the argument of Theorem 8. [sent-206, score-0.062]

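89 We thus obtain that: $\forall K \ge 1$, $\mathbb{E}[W(\mathcal{P}_{\widehat{K},n})] - W^* \le \mathbb{E}[W(\mathcal{P}_{K,n})] - W^* + \mathrm{pen}(n, K) + \sum_{k \ge 1} \mathbb{E}\left[\left(\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\} - \mathrm{pen}(n, k)\right)_+\right]$. [sent-208, score-0.217]

90 Reproducing the argument of Theorem 1's proof, one may easily show that: $\forall k \ge 1$, $\mathbb{E}\left[\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\}\right] \le 2k\,\mathbb{E}[A_{k,n}]$. [sent-209, score-0.279]

91 Thus, for all k ≥ 1, the quantity $\mathbb{P}\{\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\} \ge \mathrm{pen}(n, k) + 2\delta\}$ is bounded by $\mathbb{P}\left\{\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\} \ge \mathbb{E}\left[\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\}\right] + \sqrt{(2B \log k)/n} + \delta\right\} + \mathbb{P}\left\{3k\,\mathbb{E}_{\epsilon}[A_{k,n}] \le 2k\,\mathbb{E}[A_{k,n}] - \frac{27Bk \log k}{n} - \delta\right\}$. [sent-210, score-0.468]

92 By virtue of the bounded differences inequality (jumps being bounded by $2B/n$), the first term is bounded by $\exp(-n\delta^2/(2B^2))/k^2$, while the second term is bounded by $\exp(-n\delta/(9Bk))/k^3$, as shown by Lemma 8. [sent-211, score-0.062]

93 Integrating over δ, one obtains: $\mathbb{E}\left[\left(\sup_{\mathcal{P} \in \Pi_k} \{W(\mathcal{P}) - W_n(\mathcal{P})\} - \mathrm{pen}(n, k)\right)_+\right] \le \left(2B\sqrt{2/n} + 18B/n\right)/k^2$. [sent-213, score-0.217]

94 Summing next the bounds thus obtained over k leads to the oracle inequality stated in the theorem. [sent-214, score-0.147]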

95 A framework for statistical clustering with a constant time approximation algorithms for k-median clustering. [sent-228, score-0.328]

96 Nearest neighbor clustering: A baseline method for consistent clustering with arbitrary objective functions. [sent-241, score-0.328]

97 Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). [sent-298, score-0.323]

98 Bootstrap conﬁdence intervals for the number of clusters in cluster analysis. [sent-304, score-0.115]

99 On the reliability of clustering stability in the large sample regime. [sent-331, score-0.413]

100 Estimating the number of clusters in a data set via the gap statistic. [sent-337, score-0.064]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hc', 0.394), ('clustering', 0.328), ('wn', 0.319), ('sup', 0.217), ('risk', 0.216), ('ck', 0.185), ('dissimilarity', 0.174), ('degenerate', 0.16), ('wk', 0.15), ('pen', 0.142), ('cells', 0.122), ('partitions', 0.114), ('rademacher', 0.114), ('partition', 0.106), ('xi', 0.1), ('excess', 0.093), ('adequate', 0.093), ('pn', 0.091), ('candidates', 0.089), ('un', 0.089), ('minimizers', 0.087), ('stability', 0.085), ('op', 0.084), ('kp', 0.081), ('forming', 0.081), ('dx', 0.08), ('von', 0.076), ('supc', 0.074), ('empirical', 0.072), ('ful', 0.071), ('hoeffding', 0.067), ('investigating', 0.067), ('vc', 0.067), ('inf', 0.065), ('clusters', 0.064), ('inequality', 0.062), ('cl', 0.062), ('argument', 0.062), ('remark', 0.06), ('establish', 0.057), ('biau', 0.056), ('erm', 0.056), ('luxburg', 0.056), ('moment', 0.055), ('proved', 0.053), ('minimizer', 0.052), ('xj', 0.052), ('cluster', 0.051), ('theorem', 0.051), ('mencon', 0.05), ('footsteps', 0.05), ('telecom', 0.05), ('suprema', 0.05), ('devroye', 0.046), ('dissimilarities', 0.046), ('oracle', 0.045), ('mn', 0.045), ('var', 0.045), ('involved', 0.044), ('established', 0.044), ('euclidian', 0.043), ('anymore', 0.043), ('sharper', 0.043), ('coincides', 0.043), ('theory', 0.043), ('non', 0.042), ('lemma', 0.041), ('sole', 0.04), ('bounds', 0.04), ('chapter', 0.039), ('straightforwardly', 0.039), ('complexity', 0.038), ('tighter', 0.038), ('bound', 0.038), ('shall', 0.036), ('exp', 0.036), ('assumptions', 0.036), ('suppose', 0.036), ('supp', 0.036), ('scatter', 0.036), ('averages', 0.036), ('universal', 0.035), ('variance', 0.035), ('asymptotic', 0.035), ('linearization', 0.035), ('quantity', 0.034), ('supremum', 0.034), ('statistics', 0.033), ('cf', 0.033), ('annals', 0.032), ('degree', 0.032), ('criterion', 0.031), ('minimization', 0.031), ('pairwise', 0.031), ('inequalities', 0.031), ('appendix', 0.03), ('therein', 0.03), ('control', 0.03), ('observations', 0.03), ('shamir', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 198 nips-2011-On U-processes and clustering performance

Author: Stéphan J. Clémençon

2 0.17983538 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

3 0.16416098 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to ﬁnd a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efﬁcacy of our proposed approaches.

4 0.10515267 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefﬁcients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. 1

5 0.10379728 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo

Abstract: For many of the state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. As such, several image segmentation algorithms have been proposed, however, with certain reservation due to high computational load and many hand-tuning parameters. Correlation clustering, a graphpartitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We ﬁrst apply the pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible by linear programming relaxation, and also effective parameter learning framework by structured support vector machine is possible. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.

6 0.10112198 28 nips-2011-Agnostic Selective Classification

7 0.099191837 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

8 0.088924855 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

9 0.088275142 162 nips-2011-Lower Bounds for Passive and Active Learning

10 0.087862082 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

11 0.084526099 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning

12 0.084019117 263 nips-2011-Sparse Manifold Clustering and Embedding

13 0.083552361 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

14 0.078994475 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

15 0.077239759 260 nips-2011-Sparse Features for PCA-Like Linear Regression

16 0.075282693 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

17 0.073651612 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

18 0.070855759 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

19 0.070759527 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

20 0.070531182 66 nips-2011-Crowdclustering

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.212), (1, -0.035), (2, -0.072), (3, -0.11), (4, -0.031), (5, 0.036), (6, -0.057), (7, -0.064), (8, 0.026), (9, -0.026), (10, -0.045), (11, 0.24), (12, 0.131), (13, -0.152), (14, 0.018), (15, -0.026), (16, -0.009), (17, -0.057), (18, 0.054), (19, 0.096), (20, 0.056), (21, -0.016), (22, -0.063), (23, 0.034), (24, -0.005), (25, 0.057), (26, 0.004), (27, 0.116), (28, 0.109), (29, -0.03), (30, 0.112), (31, -0.002), (32, 0.078), (33, 0.09), (34, 0.036), (35, 0.078), (36, 0.055), (37, -0.077), (38, 0.016), (39, -0.006), (40, -0.044), (41, -0.01), (42, -0.031), (43, -0.023), (44, 0.045), (45, -0.039), (46, -0.045), (47, -0.055), (48, -0.048), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97128099 198 nips-2011-On U-processes and clustering performance

Author: Stéphan J. Clémençon

2 0.75218475 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

3 0.70563477 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

4 0.5483259 241 nips-2011-Scalable Training of Mixture Models via Coresets

Author: Dan Feldman, Matthew Faulkner, Andreas Krause

Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models ﬁtting the coreset will also provide a good ﬁt for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points sufﬁces for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efﬁciently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1

5 0.53955179 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

Author: Viren Jain, Srinivas C. Turaga, K Briggman, Moritz N. Helmstaedter, Winfried Denk, H. S. Seung

Abstract: An agglomerative clustering algorithm merges the most similar pair of clusters at every iteration. The function that evaluates similarity is traditionally handdesigned, but there has been recent interest in supervised or semisupervised settings in which ground-truth clustered data is available for training. Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state of the boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement. 1

6 0.51857758 95 nips-2011-Fast and Accurate k-means For Large Datasets

7 0.51042062 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

8 0.50967348 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

9 0.50045878 162 nips-2011-Lower Bounds for Passive and Active Learning

10 0.48265925 69 nips-2011-Differentially Private M-Estimators

11 0.47484216 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

12 0.46849969 263 nips-2011-Sparse Manifold Clustering and Embedding

13 0.46718323 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies

14 0.46060729 13 nips-2011-A blind sparse deconvolution method for neural spike identification

15 0.45998663 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

16 0.45920792 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels

17 0.45309803 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

18 0.44598487 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

19 0.44417337 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning

20 0.43913898 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.052), (4, 0.038), (20, 0.05), (26, 0.049), (31, 0.057), (33, 0.046), (43, 0.095), (45, 0.13), (48, 0.232), (57, 0.035), (74, 0.065), (83, 0.031), (99, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92819142 232 nips-2011-Ranking annotators for crowdsourced labeling tasks

Author: Vikas C. Raykar, Shipeng Yu

Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a dataset labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Often we have low quality annotators or spammers, that is, annotators who assign labels randomly (e.g., without actually looking at the instance). Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the consensus labels. In this paper we formalize the notion of a spammer and define a score which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a high score close to one.

1 Spammers in crowdsourced labeling tasks

Annotating an unlabeled dataset is one of the bottlenecks in using supervised learning to build good predictive models. Getting a dataset labeled by experts can be expensive and time consuming. With the advent of crowdsourcing services (Amazon's Mechanical Turk being a prime example) it has become quite easy and inexpensive to acquire labels from a large number of annotators in a short amount of time (see [8], [10], and [11] for some computer vision and natural language processing case studies). One drawback of most crowdsourcing services is that we do not have tight control over the quality of the annotators. The annotators can come from a diverse pool including genuine experts, novices, biased annotators, malicious annotators, and spammers. Hence in order to get good quality labels requestors typically get each instance labeled by multiple annotators, and these multiple annotations are then consolidated either using simple majority voting or more sophisticated methods that model and correct for the annotator biases [3, 9, 6, 7, 14] and/or task complexity [2, 13, 12].
In this paper we are interested in ranking annotators based on how spammer-like each annotator is. In our context a spammer is a low quality annotator who assigns random labels (maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator). Spammers can significantly increase the cost of acquiring annotations (since they need to be paid) and at the same time decrease the accuracy of the final consensus labels. A mechanism to detect and eliminate spammers is a desirable feature for any crowdsourcing marketplace. For example one can give monetary bonuses to good annotators and deny payments to spammers. The main contribution of this paper is to formalize the notion of a spammer for binary, categorical, and ordinal labeling tasks. More specifically we define a scalar metric which can be used to rank the annotators, with the spammers having a score close to zero and the good annotators having a score close to one (see Figure 4). We summarize the multiple parameters corresponding to each annotator into a single score indicative of how spammer-like the annotator is. While this spammer score was implicit for binary labels in earlier works [3, 9, 2, 6], the extension to categorical and ordinal labels is novel and is quite different from the accuracy computed from the confusion rate matrix. An attempt to quantify the quality of the workers based on the confusion matrix was recently made by [4], where they transformed the observed labels into posterior soft labels based on the estimated confusion matrix. While we obtain somewhat similar annotator rankings, we differ from this work in that our score is directly defined in terms of the annotator parameters (see § 5 for more details). The rest of the paper is organized as follows. For ease of exposition we start with binary labels (§ 2) and later extend the score to categorical (§ 3) and ordinal labels (§ 4).
We first specify the annotator model used, formalize the notion of a spammer, and propose an appropriate score in terms of the annotator model parameters. We do not dwell too much on the estimation of the annotator model parameters. These parameters can either be estimated directly using a known gold standard (footnote 1) or by the iterative algorithms that estimate the annotator model parameters without actually knowing the gold standard [3, 9, 2, 6, 7]. In the experimental section (§ 6) we obtain rankings for the annotators using the proposed spammer scores on some publicly available data from different domains.

2 Spammer score for crowdsourced binary labels

Annotator model. Let $y_i^j \in \{0, 1\}$ be the label assigned to the $i$th instance by the $j$th annotator, and let $y_i \in \{0, 1\}$ be the actual (unobserved) binary label. We model the accuracy of the annotator separately on the positive and the negative examples. If the true label is one, the sensitivity (true positive rate) $\alpha^j$ for the $j$th annotator is defined as the probability that the annotator labels it as one:

$\alpha^j := \Pr[y_i^j = 1 \mid y_i = 1]$.

On the other hand, if the true label is zero, the specificity (1 − false positive rate) $\beta^j$ is defined as the probability that the annotator labels it as zero:

$\beta^j := \Pr[y_i^j = 0 \mid y_i = 0]$.

Extensions of this basic model have been proposed to include item level difficulty [2, 13] and also to model the annotator performance based on the feature vector [14]. For simplicity we use the basic model proposed in [7] in our formulation. Based on many instances labeled by multiple annotators, the maximum likelihood estimates of the annotator parameters $(\alpha^j, \beta^j)$ and also the consensus ground truth ($y_i$) can be computed iteratively [3, 7] via the Expectation Maximization (EM) algorithm.
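As an illustration of the kind of EM procedure referred to above, the following is a minimal sketch for the binary model. It is not the implementation of [3, 7]: it assumes every annotator labels every instance, uses the soft majority vote as the initial gold standard, and the function name `em_binary_annotators` and the simulated parameters are hypothetical.

```python
import numpy as np


def em_binary_annotators(labels, n_iter=50, eps=1e-6):
    """Estimate (alpha, beta) per annotator and soft consensus labels.

    labels: (N, M) array with entries in {0, 1}. This sketch assumes that
    every annotator labeled every instance.
    """
    labels = np.asarray(labels, dtype=float)
    mu = labels.mean(axis=1)  # soft majority vote as the initial gold standard
    for _ in range(n_iter):
        # M-step: annotator performance given the current soft gold standard.
        p = np.clip(mu.mean(), eps, 1 - eps)
        alpha = np.clip((mu @ labels) / mu.sum(), eps, 1 - eps)
        beta = np.clip(((1 - mu) @ (1 - labels)) / (1 - mu).sum(), eps, 1 - eps)
        # E-step: refine the gold standard, mu_i = Pr[y_i = 1 | all labels].
        log_pos = np.log(p) + labels @ np.log(alpha) + (1 - labels) @ np.log(1 - alpha)
        log_neg = np.log(1 - p) + (1 - labels) @ np.log(beta) + labels @ np.log(1 - beta)
        mu = 1.0 / (1.0 + np.exp(np.clip(log_neg - log_pos, -500, 500)))
    return alpha, beta, mu


# Simulated data with hypothetical annotator parameters.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
true_alpha = np.array([0.9, 0.8, 0.7, 0.9, 0.8])
true_beta = np.array([0.8, 0.9, 0.7, 0.9, 0.8])
prob_one = np.where(y[:, None] == 1, true_alpha, 1 - true_beta)
labels = (rng.random(prob_one.shape) < prob_one).astype(int)

alpha, beta, mu = em_binary_annotators(labels)
```

The returned `mu` holds the posterior probability that each true label is one; thresholding it at 0.5 gives hard consensus labels.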
The EM algorithm iteratively establishes a particular gold standard (initialized via majority voting), measures the performance of the annotators given that gold standard (M-step), and refines the gold standard based on the performance measures (E-step).

Who is a spammer? Intuitively, a spammer assigns labels randomly, maybe because the annotator does not understand the labeling criteria, does not look at the instances when labeling, or is a bot pretending to be a human annotator. More precisely, an annotator is a spammer if the probability of the observed label $y_i^j$ being one given the true label $y_i$ is independent of the true label, i.e.,

$\Pr[y_i^j = 1 \mid y_i] = \Pr[y_i^j = 1]$. (1)

This means that the annotator is assigning labels randomly by flipping a coin with bias $\Pr[y_i^j = 1]$ without actually looking at the data. Equivalently, (1) can be written as

$\Pr[y_i^j = 1 \mid y_i = 1] = \Pr[y_i^j = 1 \mid y_i = 0]$, which implies $\alpha^j = 1 - \beta^j$. (2)

Hence in the context of the annotator model defined earlier a perfect spammer is an annotator for whom $\alpha^j + \beta^j - 1 = 0$. This corresponds to the diagonal line on the Receiver Operating Characteristic (ROC) plot (see Figure 1(a) and footnote 2). If $\alpha^j + \beta^j - 1 < 0$ then the annotator lies below the diagonal line and is a malicious annotator who flips the labels. Note that a malicious annotator has discriminatory power if we can detect them and flip their labels. In fact the methods proposed in [3, 7] can automatically flip the labels for the malicious annotators. Hence we define the spammer score for an annotator as

$S^j = (\alpha^j + \beta^j - 1)^2$. (3)

An annotator is a spammer if $S^j$ is close to zero. Good annotators have $S^j > 0$ while a perfect annotator has $S^j = 1$.

Footnote 1: One commonly used strategy to filter out spammers is to inject some items with known labels into the annotations. This is the strategy used by CrowdFlower (http://crowdflower.com/docs/gold).
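A small sketch of the binary spammer score in (3) may help make the definition concrete. The annotator names and their sensitivity/specificity values below are hypothetical, chosen only to show one annotator of each kind.

```python
def spammer_score_binary(alpha, beta):
    """Spammer score S = (alpha + beta - 1)^2 from sensitivity and specificity."""
    return (alpha + beta - 1.0) ** 2


# Hypothetical (sensitivity, specificity) pairs, for illustration only.
annotators = {
    "perfect": (1.0, 1.0),
    "good": (0.9, 0.8),
    "spammer": (0.7, 0.3),    # says "1" with probability 0.7 whatever the truth
    "malicious": (0.2, 0.2),  # mostly flips the labels
}

scores = {name: spammer_score_binary(a, b) for name, (a, b) in annotators.items()}

# Rank from least to most spammer-like.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this example the malicious annotator is ranked above the spammer because the score is symmetric around the diagonal of the ROC plot.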
Footnote 2: Also note that $(\alpha^j + \beta^j)/2$ is equal to the area shown in the plot and can be considered as a non-parametric approximation to the area under the ROC curve (AUC) based on one observed point. It is also equal to the Balanced Classification Rate (BCR). So a spammer can also be defined as having BCR or AUC equal to 0.5.

[Figure 1 panels: (a) Binary annotator model; (b) Accuracy: equal accuracy contours (prevalence = 0.5); (c) Spammer score: equal spammer score contours. Axes: 1−Specificity versus Sensitivity.]

Figure 1: (a) For binary labels an annotator is modeled by his/her sensitivity and specificity. A perfect spammer lies on the diagonal line on the ROC plot. (b) Contours of equal accuracy (4) and (c) equal spammer score (3).

Accuracy. This notion of a spammer is quite different from that of the accuracy of an annotator. An annotator with high accuracy is a good annotator but one with low accuracy is not necessarily a spammer. The accuracy is computed as

$\text{Accuracy}^j = \Pr[y_i^j = y_i] = \sum_{k=0}^{1} \Pr[y_i^j = k \mid y_i = k] \Pr[y_i = k] = \alpha^j p + \beta^j (1 - p)$, (4)

where $p := \Pr[y_i = 1]$ is the prevalence of the positive class. Note that accuracy depends on prevalence. Our proposed spammer score does not depend on prevalence and essentially quantifies the annotator's inherent discriminatory power. Figure 1(b) shows the contours of equal accuracy on the ROC plot.
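The contrast between accuracy (4) and the spammer score (3) can be checked numerically. The sketch below uses two hypothetical annotators: a biased one, whose accuracy changes with the prevalence while the score stays fixed, and a malicious one, whose accuracy is low although the score is high.

```python
def accuracy(alpha, beta, prevalence):
    """Accuracy from equation (4): alpha * p + beta * (1 - p)."""
    return alpha * prevalence + beta * (1.0 - prevalence)


def spammer_score(alpha, beta):
    """Spammer score from equation (3); it does not involve the prevalence."""
    return (alpha + beta - 1.0) ** 2


biased = (0.95, 0.60)     # hypothetical: high sensitivity, modest specificity
malicious = (0.10, 0.10)  # hypothetical: flips almost every label

rows = []
for p in (0.5, 0.9):
    for name, (a, b) in (("biased", biased), ("malicious", malicious)):
        rows.append((name, p, accuracy(a, b, p), spammer_score(a, b)))

for name, p, acc, score in rows:
    print(f"{name:9s} prevalence={p:.1f} accuracy={acc:.3f} spammer score={score:.4f}")
```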
Note that annotators below the diagonal line (malicious annotators) have low accuracy. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we can detect them and then correct for the flipping. In fact the EM algorithms [3, 7] can correctly flip the labels for the malicious annotators and hence they should not be treated as spammers. Figure 1(c) also shows the contours of equal score for our proposed score and it can be seen that the malicious annotators have a high score and only annotators along the diagonal have a low score (spammers).

Log-odds. Another interpretation of a spammer can be seen from the log-odds. Using Bayes' rule the posterior log-odds can be written as

$\log \frac{\Pr[y_i = 1 \mid y_i^j]}{\Pr[y_i = 0 \mid y_i^j]} = \log \frac{\Pr[y_i^j \mid y_i = 1]}{\Pr[y_i^j \mid y_i = 0]} + \log \frac{p}{1 - p}$.

If an annotator is a spammer (i.e., (2) holds) then $\log \frac{\Pr[y_i = 1 \mid y_i^j]}{\Pr[y_i = 0 \mid y_i^j]} = \log \frac{p}{1 - p}$. Essentially the annotator provides no information in updating the posterior log-odds and hence does not contribute to the estimation of the actual true label.

3 Spammer score for categorical labels

Annotator model. Suppose there are $K \ge 2$ categories. We introduce a multinomial parameter $\alpha_c^j = (\alpha_{c1}^j, \ldots, \alpha_{cK}^j)$ for each annotator, where

$\alpha_{ck}^j := \Pr[y_i^j = k \mid y_i = c]$ and $\sum_{k=1}^{K} \alpha_{ck}^j = 1$.

The term $\alpha_{ck}^j$ denotes the probability that annotator $j$ assigns class $k$ to an instance given that the true class is $c$. When $K = 2$, $\alpha_{11}^j$ and $\alpha_{00}^j$ are sensitivity and specificity, respectively.

Who is a spammer? As earlier a spammer assigns labels randomly, i.e.,

$\Pr[y_i^j = k \mid y_i] = \Pr[y_i^j = k], \ \forall k$.

This is equivalent to $\Pr[y_i^j = k \mid y_i = c] = \Pr[y_i^j = k \mid y_i = c'], \ \forall c, c', k = 1, \ldots, K$, which means that knowing whether the true class label is $c$ or $c'$ does not change the probability of the annotator's assigned label. This indicates that annotator $j$ is a spammer if

$\alpha_{ck}^j = \alpha_{c'k}^j, \ \forall c, c', k = 1, \ldots, K$. (5)

Let $A^j$ be the $K \times K$ confusion rate matrix with entries $[A^j]_{ck} = \alpha_{ck}^j$. A spammer would have all the rows of $A^j$ equal, for example

$A^j = \begin{bmatrix} 0.50 & 0.25 & 0.25 \\ 0.50 & 0.25 & 0.25 \\ 0.50 & 0.25 & 0.25 \end{bmatrix}$

for a three class categorical annotation problem. Essentially $A^j$ is a rank one matrix of the form $A^j = e v_j^\top$, for some column vector $v_j \in \mathbb{R}^K$ that satisfies $v_j^\top e = 1$, where $e$ is a column vector of ones. In the binary case we had this natural notion of a spammer as an annotator for whom $\alpha^j + \beta^j - 1$ was close to zero. One natural way to summarize (5) would be in terms of the distance (Frobenius norm) of the confusion matrix to the closest rank one approximation, i.e.,

$S^j := \| A^j - e \hat{v}_j^\top \|_F^2$, (6)

where $\hat{v}_j$ solves

$\hat{v}_j = \arg\min_{v_j} \| A^j - e v_j^\top \|_F^2$ s.t. $v_j^\top e = 1$. (7)

Solving (7) yields $\hat{v}_j = (1/K) A^{j\top} e$, which is the mean of the rows of $A^j$. Then from (6) we have

$S^j = \left\| \left( I - \frac{1}{K} e e^\top \right) A^j \right\|_F^2 = \frac{1}{K} \sum_{c < c'} \sum_{k} (\alpha_{ck}^j - \alpha_{c'k}^j)^2$.

[...] $< \ldots < K$.

Annotator model. It is conceptually easier to think of the true label as binary, that is, $y_i \in \{0, 1\}$. For example in mammography a lesion is either malignant (1) or benign (0) (which can be confirmed by biopsy) and the BIRADS ordinal scale is a means for the radiologist to quantify the uncertainty based on the digital mammogram. The radiologist assigns a higher value of the label if he/she thinks the true label is closer to one. As earlier we characterize each annotator by the sensitivity and the specificity, but the main difference is that we now define the sensitivity and specificity for each ordinal label (or threshold) $k \in \{1, \ldots, K\}$. Let $\alpha_k^j$ and $\beta_k^j$ be the sensitivity and specificity respectively of the $j$th annotator corresponding to the threshold $k$, that is,

$\alpha_k^j = \Pr[y_i^j \ge k \mid y_i = 1]$ and $\beta_k^j = \Pr[y_i^j < k \mid y_i = 0]$.

Note that $\alpha_1^j = 1$, $\beta_1^j = 0$ and $\alpha_{K+1}^j = 0$, $\beta_{K+1}^j = 1$ from this definition. Hence each annotator is parameterized by a set of $2(K - 1)$ parameters $[\alpha_2^j, \beta_2^j, \ldots, \alpha_K^j, \beta_K^j]$. This corresponds to an empirical ROC curve for the annotator (Figure 2).

Who is a spammer? As earlier we define an annotator $j$ to be a spammer if $\Pr[y_i^j = k \mid y_i = 1] = \Pr[y_i^j = k \mid y_i = 0]$ for all $k = 1, \ldots, K$. Note that from the annotation model we have (see footnote 3) $\Pr[y_i^j = k \mid y_i = 1] = \alpha_k^j - \alpha_{k+1}^j$ and $\Pr[y_i^j = k \mid y_i = 0] = \beta_{k+1}^j - \beta_k^j$. This implies that annotator $j$ is a spammer if $\alpha_k^j - \alpha_{k+1}^j = \beta_{k+1}^j - \beta_k^j$ for all $k = 1, \ldots, K$, which leads to $\alpha_k^j + \beta_k^j = \alpha_1^j + \beta_1^j = 1$ for all $k$. This means that for every $k$, the point $(1 - \beta_k^j, \alpha_k^j)$ lies on the diagonal line in the ROC plot shown in Figure 2. The area under the empirical ROC curve can be computed as (see Figure 2)

$\text{AUC}^j = \frac{1}{2} \sum_{k=1}^{K} (\alpha_{k+1}^j + \alpha_k^j)(\beta_{k+1}^j - \beta_k^j)$,

and $(2\,\text{AUC}^j - 1)^2$ can be used to define the following spammer score to rank the different annotators:

$S^j = \left[ \sum_{k=1}^{K} (\alpha_{k+1}^j + \alpha_k^j)(\beta_{k+1}^j - \beta_k^j) - 1 \right]^2$. (9)

With two levels this expression defaults to the binary case. An annotator is a spammer if $S^j$ is close to zero. Good annotators have $S^j > 0$ while a perfect annotator has $S^j = 1$.

[Figure 2 plot: empirical ROC curve with the points $[1 - \beta_k^j, \alpha_k^j]$ for thresholds k = 1, ..., 4. Axes: 1−Specificity versus Sensitivity.]

Figure 2: Ordinal labels: An annotator is modeled by sensitivity/specificity for each threshold.

5 Previous work

Recently Ipeirotis et al. [4] proposed a score for categorical labels based on the expected cost of the posterior label. In this section we briefly describe their approach and compare it with our proposed score. For each instance labeled by the annotator they first compute the posterior (soft) label $\Pr[y_i = c \mid y_i^j]$ for $c = 1, \ldots, K$, where $y_i^j$ is the label assigned to the $i$th instance by the $j$th annotator and $y_i$ is the true unknown label. The posterior label is computed via Bayes' rule as $\Pr[y_i = c \mid y_i^j] \propto \Pr[y_i^j \mid y_i = c] \Pr[y_i = c] = \prod_{k=1}^{K} (\alpha_{ck}^j)^{\delta(y_i^j, k)} p_c$, where $\delta(u, v)$ is 1 if $u = v$ and 0 otherwise, and $p_c = \Pr[y_i = c]$ is the prevalence of class $c$.
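The categorical score defined through (6) and (7) and the ordinal score in (9) can be sketched as follows. This is an illustration only: the function names are hypothetical, the categorical function implements the unnormalized distance in (6) (for which an identity confusion matrix gives $K - 1$ rather than 1), and the ordinal function expects the threshold parameters padded with the boundary values $\alpha_1 = 1$, $\beta_1 = 0$, $\alpha_{K+1} = 0$, $\beta_{K+1} = 1$.

```python
import numpy as np


def spammer_score_categorical(confusion):
    """Squared Frobenius distance of a K x K confusion rate matrix (rows sum
    to one) to its closest approximation with all rows equal, as in (6)-(7)."""
    A = np.asarray(confusion, dtype=float)
    v_hat = A.mean(axis=0)  # mean of the rows of A, the minimizer of (7)
    return float(((A - v_hat) ** 2).sum())


def spammer_score_ordinal(alpha, beta):
    """Ordinal score (9). alpha and beta hold the values for thresholds
    k = 1, ..., K + 1, including the fixed boundary values."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    twice_auc = np.sum((alpha[1:] + alpha[:-1]) * (beta[1:] - beta[:-1]))
    return float((twice_auc - 1.0) ** 2)


# Categorical: the spammer example from the text, and a perfect annotator.
spammer = [[0.50, 0.25, 0.25]] * 3
s_spammer = spammer_score_categorical(spammer)
s_perfect = spammer_score_categorical(np.eye(3))

# For K = 2 the categorical distance reduces to (alpha + beta - 1)^2.
a, b = 0.9, 0.8
s_two_class = spammer_score_categorical([[b, 1 - b], [1 - a, a]])

# Ordinal with two levels also reduces to the binary score.
s_ordinal_two = spammer_score_ordinal([1.0, a, 0.0], [0.0, b, 1.0])

# Ordinal spammer with K = 3: every point (1 - beta_k, alpha_k) is on the diagonal.
s_ordinal_spam = spammer_score_ordinal([1.0, 0.7, 0.4, 0.0], [0.0, 0.3, 0.6, 1.0])
```

Both two-level checks recover the binary score $(\alpha^j + \beta^j - 1)^2$, which is a quick consistency test of the reconstructed formulas.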
The score for a spammer is based on the intuition that the posterior label vector $(\Pr[y_i = 1 \mid y_i^j], \ldots, \Pr[y_i = K \mid y_i^j])$ for a good annotator will have all the probability mass concentrated on a single class. For example for a three class problem (with equal prevalence), a posterior label vector of (1, 0, 0) (certain that the class is one) comes from a good annotator while (1/3, 1/3, 1/3) (complete uncertainty about the class label) comes from a spammer. Based on this they define the following score for each annotator:

$\text{Score}^j = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{c=1}^{K} \sum_{k=1}^{K} \text{cost}_{ck} \Pr[y_i = k \mid y_i^j] \Pr[y_i = c \mid y_i^j] \right]$, (10)

where $\text{cost}_{ck}$ is the misclassification cost when an instance of class $c$ is classified as $k$. Essentially this is capturing some sort of uncertainty of the posterior label averaged over all the instances. Perfect workers have a score $\text{Score}^j = 0$ while spammers will have a high score. An entropic version of this score based on similar ideas has also been recently proposed in [5].

Footnote 3: This can be seen as follows: $\Pr[y_i^j = k \mid y_i = 1] = \Pr[(y_i^j \ge k) \text{ AND } (y_i^j < k + 1) \mid y_i = 1] = \Pr[y_i^j \ge k \mid y_i = 1] + \Pr[y_i^j < k + 1 \mid y_i = 1] - \Pr[(y_i^j \ge k) \text{ OR } (y_i^j < k + 1) \mid y_i = 1] = \Pr[y_i^j \ge k \mid y_i = 1] - \Pr[y_i^j \ge k + 1 \mid y_i = 1] = \alpha_k^j - \alpha_{k+1}^j$. Here we used the fact that $\Pr[(y_i^j \ge k) \text{ OR } (y_i^j < k + 1)] = 1$.

[Figure 3 panels, simulated | 500 instances | 30 annotators: (a) Simulation setup (1−Specificity versus Sensitivity); (b) Annotator ranking (Annotator versus Spammer Score); (c) Comparison with accuracy; (d) Comparison with Ipeirotis et al. [4]. Panels (c) and (d) plot the annotator rank (median) via the other method against the annotator rank (median) via spammer score.]

Figure 3: (a) The simulation setup consisting of 10 good annotators (annotators 1 to 10), 10 spammers (11 to 20), and 10 malicious annotators (21 to 30). (b) The ranking of annotators obtained using the proposed spammer score. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI), obtained from 100 bootstrap replications, are shown. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. (c) and (d) Comparison of the median rank obtained via the spammer score with the rank obtained using (c) accuracy and (d) the method proposed by Ipeirotis et al. [4].

Our proposed spammer score differs from this approach in the following aspects: (1) Implicit in the score defined above (10) is the assumption that an annotator is a spammer when $\Pr[y_i = c \mid y_i^j] = \Pr[y_i = c]$, i.e., the estimated posterior labels are simply based on the prevalence and do not depend on the observed labels. By Bayes' rule this is equivalent to $\Pr[y_i^j \mid y_i = c] = \Pr[y_i^j]$, which is what we have used to define our spammer score. (2) While both notions of a spammer are equivalent, the approach of [4] first computes the posterior labels based on the observed data, the class prevalence and the annotator parameters and then computes the expected cost. Our proposed spammer score does not depend on the prevalence of the class.
Our score is also directly defined only in terms of the annotator confusion matrix and does not need the observed labels. (3) For the score defined in (10), while perfect annotators have a score of 0, it is not clear what should be a good baseline for a spammer. The authors suggest computing the baseline by assuming that a worker assigns as label the class with maximum prevalence. Our proposed score has a natural scale with a perfect annotator having a score of 1 and a spammer having a score of 0. (4) However one advantage of the approach in [4] is that they can directly incorporate varied misclassification costs.

6 Experiments

Table 1: Datasets. N is the number of instances. M is the number of annotators. M* is the mean/median number of annotators per instance. N* is the mean/median number of instances labeled by each annotator.

Dataset | Type | N | M | M* | N* | Brief description
bluebird | binary | 108 | 39 | 39/39 | 108/108 | Bird identification [12]. The annotator had to identify whether there was an Indigo Bunting or Blue Grosbeak in the image.
temp | binary | 462 | 76 | 10/10 | 61/16 | Event annotation [10]. Given a dialogue and a pair of verbs annotators need to label whether the event described by the first verb occurs before or after the second.
wsd | categorical/3 | 177 | 34 | 10/10 | 52/20 | Word sense disambiguation [10]. The labeler is given a paragraph of text containing the word "president" and asked to label one of the three appropriate senses.
sentiment | categorical/3 | 1660 | 33 | 6/6 | 291/175 | Irish economic sentiment analysis [1]. Articles from three Irish online news sources were annotated by volunteer users as positive, negative, or irrelevant.
wosi | ordinal/[0 10] | 30 | 10 | 10/10 | 30/30 | Word similarity [10]. Numeric judgements of word similarity.
valence | ordinal/[-100 100] | 100 | 38 | 10/10 | 26/20 | Affect recognition [10]. Each annotator is presented with a short headline and asked to rate it on a scale [-100, 100] to denote the overall positive or negative valence.

[Figure 4 panels (Annotator versus Spammer Score): bluebird | 108 instances | 39 annotators; temp | 462 instances | 76 annotators; wsd | 177 instances | 34 annotators; sentiment | 1660 instances | 33 annotators; wosi | 30 instances | 10 annotators; valence | 100 instances | 38 annotators.]

Figure 4: Annotator rankings. The rankings obtained for the datasets in Table 1. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence intervals (CI), obtained from 100 bootstrap replications, are shown. The annotators are ranked based on the lower limit of the 95% CI. The number at the top of the CI bar shows the number of instances annotated by that annotator. Note that the CIs are wider when the annotator labels only a few instances.

Ranking annotators based on the confidence interval. As mentioned earlier the annotator model parameters can be estimated using the iterative EM algorithms [3, 7] and these estimated annotator parameters can then be used to compute the spammer score. The spammer score can then be used to rank the annotators. However one commonly observed phenomenon when working with crowdsourced data is that we have a lot of annotators who label only a very few instances. As a result the annotator parameters cannot be reliably estimated for these annotators. In order to factor this uncertainty in the estimation of the model parameters into the ranking, we compute the spammer score for 100 bootstrap replications. Based on this we compute the 95% confidence intervals (CI) for the spammer score for each annotator. We rank the annotators based on the lower limit of the 95% CI. The CIs are wider when the annotator labels only a few instances. For a crowdsourced labeling task the annotator has to be good and also label a reasonable number of instances in order to be reliably identified.
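The bootstrap ranking described in this section can be sketched as below. This is a simplified illustration, not the procedure used for the figures: it assumes reference labels are available for each annotator's instances (instead of re-running EM in every replication), takes percentile intervals from 100 replications, and uses two hypothetical annotators with the same empirical sensitivity and specificity (0.9 each) but very different numbers of labeled instances.

```python
import numpy as np


def spammer_score(labels, truth):
    """Binary spammer score from one annotator's labels and reference labels."""
    pos, neg = truth == 1, truth == 0
    if not pos.any() or not neg.any():
        return np.nan  # sensitivity or specificity undefined for this resample
    alpha = np.mean(labels[pos] == 1)
    beta = np.mean(labels[neg] == 0)
    return (alpha + beta - 1.0) ** 2


def bootstrap_ci(labels, truth, n_boot=100, seed=0):
    """Mean score and 95% percentile interval over bootstrap resamples of instances."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample instances with replacement
        scores[b] = spammer_score(labels[idx], truth[idx])
    lower, upper = np.nanpercentile(scores, [2.5, 97.5])
    return float(np.nanmean(scores)), float(lower), float(upper)


def make_annotator(n_per_class, n_errors_per_class):
    """Deterministic labels with a fixed number of mistakes in each class."""
    truth = np.array([1] * n_per_class + [0] * n_per_class)
    labels = truth.copy()
    labels[:n_errors_per_class] = 0                           # misses on positives
    labels[n_per_class:n_per_class + n_errors_per_class] = 1  # false alarms on negatives
    return labels, truth


results = {
    "few_labels": bootstrap_ci(*make_annotator(10, 1)),     # 20 instances
    "many_labels": bootstrap_ci(*make_annotator(250, 25)),  # 500 instances
}

# Rank by the lower limit of the 95% CI, highest first.
ranking = sorted(results, key=lambda name: results[name][1], reverse=True)
```

Although both annotators have the same point estimate, the one with few labels has a much wider interval and therefore a lower rank.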
Simulated data. We first illustrate our proposed spammer score on simulated binary data (with equal prevalence for both classes) consisting of 500 instances labeled by 30 annotators of varying sensitivity and specificity (see Figure 3(a) for the simulation setup). Of the 30 annotators we have 10 good annotators (annotators 1 to 10 who lie above the diagonal in Figure 3(a)), 10 spammers (annotators 11 to 20 who lie around the diagonal), and 10 malicious annotators (annotators 21 to 30 who lie below the diagonal). Figure 3(b) plots the ranking of annotators obtained using the proposed spammer score with the annotator model parameters estimated via the EM algorithm [3, 7]. The spammer score ranges from 0 to 1; the lower the score, the more spammy the annotator. The mean spammer score and the 95% confidence interval (CI) obtained via bootstrapping are shown. The annotators are ranked based on the lower limit of the 95% CI. As can be seen all the spammers (annotators 11 to 20) have a low spammer score and appear at the bottom of the list. The malicious annotators have a higher score than the spammers since we can correct for their flipping. The malicious annotators are good annotators but they flip their labels and as such are not spammers if we detect that they are malicious. Figure 3(c) compares the (median) rank obtained via the spammer score with the (median) rank obtained using accuracy as the score to rank the annotators. While the good annotators are ranked high by both methods, the accuracy score gives a low rank to the malicious annotators. Accuracy does not capture the notion of a spammer. Figure 3(d) compares the ranking with the method proposed by Ipeirotis et al. [4], which gives rankings very similar to those of our proposed score.
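A rough re-creation of this simulation is sketched below. It departs from the setup above in two labeled ways: the sensitivity and specificity ranges for the three groups are hypothetical, and the annotator parameters are estimated directly against the simulated true labels rather than with the EM algorithm, to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 500
truth = rng.integers(0, 2, size=n_instances)  # roughly equal prevalence

# Hypothetical parameter ranges for the three groups of 10 annotators.
spam_bias = rng.uniform(0.1, 0.9, 10)
alpha_true = np.concatenate([
    rng.uniform(0.75, 0.95, 10),  # annotators 1-10: good
    spam_bias,                    # annotators 11-20: spammers (alpha = 1 - beta)
    rng.uniform(0.05, 0.25, 10),  # annotators 21-30: malicious
])
beta_true = np.concatenate([
    rng.uniform(0.75, 0.95, 10),
    1.0 - spam_bias,
    rng.uniform(0.05, 0.25, 10),
])

# Each annotator labels every instance according to the binary annotator model.
prob_one = np.where(truth[:, None] == 1, alpha_true, 1.0 - beta_true)
labels = (rng.random(prob_one.shape) < prob_one).astype(int)

# Estimate sensitivity and specificity against the known true labels.
alpha_hat = labels[truth == 1].mean(axis=0)
beta_hat = 1.0 - labels[truth == 0].mean(axis=0)

score = (alpha_hat + beta_hat - 1.0) ** 2
acc = (labels == truth[:, None]).mean(axis=0)

# Zero-based indices of the 10 lowest-ranked annotators under each criterion.
lowest_by_score = set(np.argsort(score)[:10].tolist())
lowest_by_accuracy = set(np.argsort(acc)[:10].tolist())

print("lowest spammer score:", sorted(i + 1 for i in lowest_by_score))
print("lowest accuracy:     ", sorted(i + 1 for i in lowest_by_accuracy))
```

Under these assumptions the spammer score places the spammers at the bottom, whereas accuracy places the malicious annotators there.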
[Figure 5 panels, bluebird | 108 instances | 39 annotators: (a) annotator rank (median) via accuracy versus annotator rank (median) via spammer score; (b) annotator rank (median) via Ipeirotis et al. [4] versus annotator rank (median) via spammer score; (c) 1−Specificity versus Sensitivity.]

Figure 5: Comparison of the rank obtained via the spammer score with the rank obtained using (a) accuracy and (b) the method proposed by Ipeirotis et al. [4] for the bluebird binary dataset. (c) The annotator model parameters as estimated by the EM algorithm [3, 7].

[Figure 6 panels, wsd | 177 instances | 34 annotators and sentiment | 1660 instances | 33 annotators: annotator rank (median) via accuracy and via Ipeirotis et al. [4], each versus annotator rank (median) via spammer score.]

Figure 6: Comparison of the median rank obtained via the spammer score with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for the two categorical datasets in Table 1.

Mechanical Turk data. We report results on some publicly available linguistic and image annotation data collected using Amazon's Mechanical Turk (AMT) and other sources. Table 1 summarizes the datasets. Figure 4 plots the spammer scores and rankings obtained. The mean and the 95% CI obtained via bootstrapping are also shown. The number at the top of the CI bar shows the number of instances annotated by that annotator. The rankings are based on the lower limit of the 95% CI, which factors the number of instances labeled by the annotator into the ranking. An annotator who labels only a few instances will have a very wide CI. Some annotators who label only a few instances may have a high mean spammer score but the CI will be wide and hence they are ranked lower.
Ideally we would like to have annotators with a high score who at the same time label a lot of instances so that we can reliably identify them. The authors [1] of the sentiment dataset shared with us some of their qualitative observations regarding the annotators and they somewhat agree with our rankings. For example the authors made the following comments about Annotator 7: "Quirky annotator - had a lot of debate about what was the meaning of the annotation question. I'd say he changed his labeling strategy at least once during the process". Our proposed score gave a low rank to this annotator.

Comparison with other approaches. Figures 5 and 6 compare the proposed ranking with the rank obtained using accuracy and the method proposed by Ipeirotis et al. [4] for some binary and categorical datasets in Table 1. Our proposed ranking is somewhat similar to that obtained by Ipeirotis et al. [4] but accuracy does not quite capture the notion of a spammer. For example, for annotator 21 in the bluebird dataset (see Figure 5(a)) accuracy ranks it at the bottom of the list while the proposed score puts it in the middle of the list. From the estimated model parameters it can be seen that annotator 21 actually flips the labels (below the diagonal in Figure 5(c)) but is a good annotator.

7 Conclusions

We proposed a score to rank annotators for crowdsourced binary, categorical, and ordinal labeling tasks. The obtained rankings and the scores can be used to allocate monetary bonuses to be paid to different annotators and also to eliminate spammers from further labeling tasks. A mechanism to rank annotators should be a desirable feature of any crowdsourcing service. The proposed score should also be useful to specify the prior for Bayesian approaches to consolidate annotations.

References

[1] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media.
In Proceedings of the 6th Conference on Prestigious Applications of Intelligent Systems (PAIS’10), 2010. [2] B. Carpenter. Multilevel bayesian models of categorical data annotation. Technical Report available at http://lingpipe-blog.com/lingpipe-white-papers/, 2008. [3] A. P. Dawid and A. M. Skene. Maximum likeihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979. [4] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP’10), pages 64–67, 2010. [5] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labelling tasks. In Proceedings of the Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011. [6] V. C. Raykar, S. Yu, L .H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 889– 896, 2009. [7] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010. [8] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008. [9] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in Neural Information Processing Systems 7, pages 1085–1092. 1995. [10] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—but is it good? Evaluating Non-Expert Annotations for Natural Language Tasks. 
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 254–263, 2008. [11] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08, pages 1–8, 2008. [12] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432. 2010. [13] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. 2009. [14] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS 2010), pages 932–939, 2010. 9

2 0.83994752 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

Author: Patrick O. Perry, Michael W. Mahoney

Abstract: Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which $\ell_2$-regularized or $\ell_1$-regularized $\ell_2$-regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm.

same-paper 3 0.83011538 198 nips-2011-On U-processes and clustering performance

Author: Stéphan J. Clémençon

4 0.80608064 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

Author: Nico Goernitz, Christian Widmer, Georg Zeller, Andre Kahles, Gunnar Rätsch, Sören Sonnenburg

Abstract: We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output prediction often leads to difficult inference problems and hence requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit additional information from related learning tasks by means of hierarchical regularization. Training SO models on the combined set of examples from multiple tasks can easily become infeasible for real-world applications. To be able to solve the optimization problems underlying multitask structured output learning, we propose an efficient algorithm based on bundle methods. We demonstrate the performance of our approach in applications from the domain of computational biology addressing the key problem of gene finding. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) the Hierarchical SO-MTL approach outperforms the considered non-MTL methods.

5 0.80115086 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that an exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

6 0.68650103 80 nips-2011-Efficient Online Learning via Randomized Rounding

7 0.68500113 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

8 0.67086077 186 nips-2011-Noise Thresholds for Spectral Clustering

9 0.66585177 22 nips-2011-Active Ranking using Pairwise Comparisons

10 0.66221148 265 nips-2011-Sparse recovery by thresholded non-negative least squares

11 0.6590578 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

12 0.65773636 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

13 0.65627289 29 nips-2011-Algorithms and hardness results for parallel large margin learning

14 0.65619266 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning

15 0.65512198 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

16 0.65490592 251 nips-2011-Shaping Level Sets with Submodular Functions

17 0.65287626 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

18 0.65286112 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

19 0.65227491 109 nips-2011-Greedy Model Averaging

20 0.65177178 199 nips-2011-On fast approximate submodular minimization