jmlr jmlr2012 jmlr2012-12 knowledge-graph by maker-knowledge-mining

12 jmlr-2012-Active Clustering of Biological Sequences


Source: pdf

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. [sent-10, score-0.595]

2 We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. [sent-13, score-0.613]

3 Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences. [sent-14, score-0.391]

4 In this work we initiate a study of clustering with limited distance information; in particular we consider clustering with a small number of one versus all queries. [sent-25, score-0.684]

5 Motivated by such scenarios, in this paper we consider the problem of clustering a data set with an unknown distance function, given only the capability to ask one versus all distance queries. [sent-41, score-0.485]

6 To formally analyze the correctness of our algorithm we assume that the distance function is a metric, and that our clustering problem satisfies a natural approximation stability property with respect to the k-median objective function for clustering. [sent-43, score-0.41]

7 For an objective function Φ (such as k-median), the (c, ε)-property assumes that any clustering that is a c-approximation of Φ is structurally close to some “target” clustering CT (has error of at most ε with respect to CT ). [sent-46, score-0.622]

8 Our first main contribution is designing an algorithm that given the (1 + α, ε)-property for the k-median objective finds an accurate clustering with probability at least 1 − δ by using only O(k + ln(1/δ)) one versus all queries. [sent-48, score-0.412]

9 Our analysis requires that the clusters of the target clustering have size at least O(εn/α). [sent-49, score-0.442]

10 Our algorithm uses an active selection strategy to choose a small set of landmark points. [sent-56, score-0.415]

11 After selecting a small set of landmarks, we use a robust single-linkage clustering procedure that we call Expand-Landmarks, which constructs a clustering linking only the landmarks that have smin points in an r-ball around them, for an appropriate choice of smin and increasing estimates of r. [sent-60, score-1.452]

12 Our algorithm uses only the distances between landmarks and other points to compute a clustering. [sent-62, score-0.477]

13 We show that using our adaptive procedure it suffices to choose only O(k + ln(1/δ)) landmarks to compute an accurate clustering with probability at least 1 − δ. [sent-66, score-0.673]

14 If we use a non-adaptive selection strategy and simply choose landmarks uniformly at random, we must sample a point from each ground truth cluster and therefore need at least O(k ln(k/δ)) landmarks to find an accurate clustering. [sent-67, score-1.033]

15 Therefore if we simply choose landmarks uniformly at random, performance can degrade significantly if some clusters are much smaller than the average cluster size. [sent-69, score-0.658]

16 Our theoretic assumption does require that the ground truth clusters are large, but O(εn/α) can still be much smaller than the average cluster size, in which case our adaptive selection procedure gives a more significant improvement in runtime and query complexity of the algorithm. [sent-70, score-0.506]

17 A clustering instance is ε-separated if the cost of the optimal k-clustering is at most ε² times the cost of the optimal clustering using k − 1 clusters. [sent-84, score-0.556]

18 However, as the constant factor of these approximations is at least 2, the proposed sampling methods do not necessarily yield clusterings close to the target clustering CT if the (c, ε)-property holds only for some small constant c < 2, which is the interesting case in our setting. [sent-110, score-0.499]

19 Our landmark selection strategy is related to the farthest first traversal used by Dasgupta (2002). [sent-116, score-0.478]

20 To our knowledge, our work is the first to provide theoretical guarantees for active clustering (clustering by adaptively using only some of the distances between the objects) under a natural condition on the input data. [sent-123, score-0.388]

21 (2011) have provided another active clustering procedure with theoretical guarantees for hierarchical clustering under a different condition on the input data. [sent-125, score-0.593]

22 The k-median objective is to minimize Φ(C) = ∑_{i=1}^{k} ∑_{x∈Ci} d(x, ci), where ci is the median of cluster Ci, which is the point y ∈ Ci that minimizes ∑_{x∈Ci} d(x, y). [sent-133, score-0.646]
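
As a concrete illustration of the objective above, here is a minimal Python sketch (the distance-matrix representation and the function name are ours, not the paper's):

    import numpy as np

    def kmedian_cost(D, clusters):
        # D: n x n matrix of pairwise distances; clusters: list of index lists.
        # For each cluster pick the medoid (the member minimizing the summed
        # distance to the other members) and add up those summed distances.
        cost = 0.0
        for C in clusters:
            sub = D[np.ix_(C, C)]            # distances restricted to the cluster
            cost += sub.sum(axis=0).min()    # the medoid column has the smallest sum
        return cost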

23 (2009), we define the distance between two clusterings C and C′ as the fraction of points on which they disagree under the optimal matching of clusters in C to clusters in C′: dist(C, C′) = min_{f∈Fk} (1/n) ∑_{i=1}^{k} |Ci − C′_{f(i)}|, where Fk is the set of bijections f : {1, . . . , k} → {1, . . . , k}. [sent-146, score-0.434]
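
A hedged sketch of this distance, assuming both clusterings are given as label arrays over the same n points; using SciPy's Hungarian-algorithm routine for the optimal matching is our choice, not something prescribed by the paper:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_distance(labels_a, labels_b, k):
        # Fraction of points misclassified under the best bijection between clusters.
        n = len(labels_a)
        agreement = np.zeros((k, k), dtype=int)          # agreement[i, j] = |C_i ∩ C'_j|
        for a, b in zip(labels_a, labels_b):
            agreement[a, b] += 1
        rows, cols = linear_sum_assignment(-agreement)   # maximize matched agreement
        return 1.0 - agreement[rows, cols].sum() / n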

24 We assume that there exists some unknown relevant “target” clustering CT and given a proposed clustering C we define the error of C with respect to CT as dist(C,CT ). [sent-154, score-0.556]

25 Definition 1 We say that the instance (S, d) satisfies the (c, ε)-property for the k-median objective function with respect to the target clustering CT if any clustering of S that approximates OPTΦ within a factor of c is ε-close to CT , that is, Φ(C) ≤ c · OPTΦ ⇒ dist(C,CT ) < ε. [sent-157, score-0.614]

26 Each time we select a new landmark l, we use a one versus all distance query to get the distances between l and all other points in the data set, and update dmin (s) for each point s ∈ S. [sent-175, score-0.819]
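
A minimal sketch of this selection loop, assuming (consistent with the later discussion of the parameter q) that each new landmark is drawn uniformly at random from the q points currently farthest from the selected landmarks; query_one_vs_all is a stand-in for the one versus all distance query:

    import numpy as np

    def landmark_selection(q, n_iter, n, query_one_vs_all, rng=np.random.default_rng()):
        # Returns the indices of n_iter landmarks chosen from a data set of n points.
        d_min = np.full(n, np.inf)
        landmarks = []
        for t in range(n_iter):
            if t == 0:
                l = int(rng.integers(n))              # first landmark: uniform over S
            else:
                farthest = np.argsort(-d_min)[:q]     # q points with the largest d_min
                l = int(rng.choice(farthest))
            landmarks.append(l)
            d = query_one_vs_all(l)                   # distances from l to every point
            d_min = np.minimum(d_min, d)
        return landmarks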

27 We note that our algorithm uses only the distances between landmarks and other points to produce a clustering. [sent-181, score-0.477]

28 Expand-Landmarks then expands a ball Bl around each landmark l ∈ L. [sent-182, score-0.441]

29 We use the variable r to denote the radius of all the balls: for each landmark l ∈ L, Bl = {s ∈ S | d(s, l) ≤ r}. [sent-183, score-0.413]

30 We call a landmark l working if Bl contains at least smin points. [sent-185, score-0.662]

31 The algorithm maintains a graph GB = (VB, EB), where vl ∈ VB represents the ball Bl around working landmark l, and two vertices are connected by an (undirected) edge if the corresponding balls overlap on any point: (vl1, vl2) ∈ EB iff Bl1 ∩ Bl2 ≠ ∅. [sent-186, score-0.688]

32 We emphasize that this graph considers only the balls that have at least smin points in them. [sent-187, score-0.44]
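
A simplified Python sketch of this ball-and-graph construction; it rescans all candidate radii instead of using the paper's incremental updates, and the stopping test (exactly k components covering at least n′ points) follows the description of Expand-Landmarks in the surrounding sentences:

    import numpy as np
    import networkx as nx

    def expand_landmarks(dist_LS, s_min, n_prime, k):
        # dist_LS: |L| x n matrix of distances from each landmark to every point.
        for r in np.sort(np.unique(dist_LS)):
            balls = [set(np.flatnonzero(row <= r)) for row in dist_LS]
            working = [i for i, b in enumerate(balls) if len(b) >= s_min]
            G = nx.Graph()
            G.add_nodes_from(working)
            for a in working:                          # edge iff working balls overlap
                for b in working:
                    if a < b and balls[a] & balls[b]:
                        G.add_edge(a, b)
            comps = list(nx.connected_components(G))
            covered = set().union(*(balls[i] for c in comps for i in c)) if comps else set()
            if len(comps) == k and len(covered) >= n_prime:
                return comps, balls                    # k groups of overlapping balls
        return None                                    # no radius gave a k-clustering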

33 (Landmark-Selection pseudocode fragment) . . . , sn} uniformly at random; L = L ∪ {l}; for each d(l, s) ∈ QUERY-ONE-VS-ALL(l, S) do: if d(l, s) < dmin(s) then dmin(s) = d(l, s); end for; return L. Figure 1: Balls around landmarks are displayed, with the next point to be added to a ball labeled as s∗. [sent-202, score-0.653]

34 We compute a set L′ that contains exactly one working landmark from each cluster Ci′ ∈ C′ (any working landmark is sufficient), and assign each point x ∈ S to the cluster corresponding to the closest landmark in L′ . [sent-209, score-1.56]
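
Continuing the sketch above, the final assignment step is then a nearest-landmark labeling (variable names are ours):

    import numpy as np

    def assign_by_closest_landmark(dist_LS, components):
        # Pick one working landmark per component and label every point with the
        # component of its closest chosen landmark.
        chosen = [next(iter(c)) for c in components]   # any working landmark suffices
        return np.argmin(dist_LS[chosen, :], axis=0)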

35 Because each cluster in the target clustering has at least (4 + 51/α)εn points, and the optimal k-median clustering C∗ differs from the target clustering by ε∗ n ≤ εn points, each cluster in C∗ must have at least (3 + 51/α)εn points. [sent-220, score-1.238]

36 In other words, the good points are those points that are close to their own cluster center and far from any other cluster center. [sent-223, score-0.558]

37 In addition, we will break up the good points into good sets Xi , where Xi is the set of the good points in the optimal cluster Ci∗ . [sent-224, score-0.475]

38 . . . , c∗k}, then we can move them to the clusters corresponding to those centers, producing a clustering that is structurally far from CT, but whose objective value is close to OPT, violating the (1 + α, ε)-property. [sent-240, score-0.481]

39 We first show that almost surely the set of landmarks returned by Landmark-Selection has the property that each of the cluster cores has a landmark near it, which we refer to as the landmark spread property. [sent-245, score-1.317]

40 We conclude with the proof of the theorem, which argues that the clustering returned by the last step of our procedure is a further improved clustering that is very close to C∗ and CT . [sent-247, score-0.585]

41 We say that the landmark spread property holds if there is a landmark closer than 2dcrit to some point in each good set. [sent-249, score-0.854]

42 The following lemma proves that for q = 2b, after selecting only iter = O(k + ln(1/δ)) points the chosen landmarks will have this property with probability at least 1 − δ. [sent-250, score-0.516]

43 Lemma 3 Given a set of landmarks L = Landmark-Selection(2b, 4k + 16 ln(1/δ), S), the landmark spread property is satisfied with probability at least 1 − δ. [sent-251, score-0.781]

44 As in Algorithm 2, we use dmin (s) to denote the minimum distance from point s to any landmark that has been selected so far: dmin (s) = minl∈L′ d(l, s), where L′ is the set of landmarks that have been selected so far. [sent-258, score-1.024]

45 If the former is true, the landmark spread property trivially holds. [sent-260, score-0.409]

46 Because each good set satisfies |Xi | ≥ 2b, it follows that there must be a landmark closer than 2dcrit to some point in each good set. [sent-270, score-0.491]

47 Lemma 4 The probability that fewer than k good points have been chosen as landmarks after t > 2k iterations of Landmark-Selection is less than exp(−t(1 − 2k/t)²/4). [sent-271, score-0.45]
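
For instance, taking t = 4k iterations makes this failure probability at most exp(−4k(1 − 1/2)²/4) = exp(−k/4).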

48 Lemma 5 Given a set of landmarks L that satisfy the landmark spread property, Expand-Landmarks with parameters smin = b + 1 and n′ = n − b returns a k-clustering C′ = {C′1, C′2, . . . [sent-281, score-0.978]

49 If we let σ be a bijection mapping each good set Xi to the cluster C′σ(i) containing points from Xi, the distance between c∗i and any working landmark l in C′σ(i) satisfies d(c∗i, l) < 5dcrit. [sent-285, score-0.797]

50 Lemma 7 argues that because there is a landmark near each good set, there is a value of r∗ < 4dcrit such that each Xi is contained in some ball of radius r∗ . [sent-287, score-0.551]

51 Moreover, because we only consider balls that have more than b points in them, and the number of bad points is at most b, each ball in GB must overlap some good set. [sent-288, score-0.435]

52 We refer to the clustering computed in line 6 of Expand-Landmarks as the clustering induced by GB , in which each cluster contains points in balls that are in the same component in GB . [sent-290, score-0.925]

53 Moreover, because the size of each good set satisfies |Xi | > b, and there are at most b points that are not clustered, each induced cluster must contain points from a distinct Xi (so we cannot split the good sets). [sent-310, score-0.429]

54 Clearly, for any working landmark l in Cσ(i) it must be the case that Bl overlaps Xi . [sent-313, score-0.471]

55 Therefore the distance between c∗i and any working landmark l ∈ C′σ(i) satisfies d(c∗i, l) < 5dcrit. [sent-316, score-0.495]

56 i i Lemma 6 A ball of radius r < 4dcrit cannot contain points from more than one good set Xi , and two balls of radius r < 4dcrit that overlap (intersect) different Xi cannot share any points. [sent-317, score-0.424]

57 Proof To prove the first part, consider a ball Bl of radius r < 4dcrit around landmark l. [sent-318, score-0.476]

58 To prove the second part, consider two balls Bl1 and Bl2 of radius r < 4dcrit around landmarks l1 and l2 . [sent-323, score-0.471]

59 Lemma 7 Given a set of landmarks L that satisfy the landmark spread property, there is some value of r∗ < 4dcrit such that each Xi is contained in some ball Bl around landmark l ∈ L of radius r∗ . [sent-330, score-1.208]

60 Proof For each good set Xi choose a point si ∈ Xi and a landmark li ∈ L that satisfy d(si , li ) < 2dcrit . [sent-331, score-0.468]

61 Lemma 8 Suppose the distance between c∗i and any working landmark l in C′σ(i) satisfies d(c∗i, l) < 5dcrit. [sent-335, score-0.495]

62 Then given a point x ∈ Ci∗ that satisfies w2(x) − w(x) ≥ 17dcrit, for any working landmark l1 ∈ C′σ(i) and any working landmark l2 ∈ C′σ(j) with j ≠ i, it must be the case that d(x, l1) < d(x, l2). [sent-336, score-0.832]

63 Proof [Theorem 2] After using Landmark-Selection to choose O(k + ln(1/δ)) points, with probability at least 1 − δ there is a landmark closer than 2dcrit to some point in each good set. [sent-344, score-0.494]

64 Given a set of landmarks with this property, each cluster in the clustering C′ = {C′1, C′2, . . . [sent-345, score-0.776]

65 If we use σ to denote a bijection mapping each good set Xi to the cluster C′σ(i) containing points from Xi, any working landmark l ∈ C′σ(i) is closer than 5dcrit to c∗i. [sent-354, score-0.739]

66 To find the detectable points using C′ , we choose some working landmark li from each Ci′ . [sent-358, score-0.574]

67 Lemma 8 argues that each detectable point in Ci∗ is closer to every working landmark in C′σ(i) than to any working landmark in C′σ(j) with j ≠ i, so C′′ and C∗ agree on all the detectable points. [sent-360, score-0.614]

68 Therefore using O(k + ln(1/δ)) landmarks we compute an accurate clustering with probability at least 1 − δ. [sent-363, score-0.673]

69 Therefore to compute an accurate clustering with probability at least 1 − δ, the runtime of our algorithm is O((k + ln(1/δ)) n log n). [sent-368, score-0.39]

70 Moreover, we only consider the distances between the landmarks and other points, so we only use O(k + ln(1/δ)) one versus all distance queries. [sent-369, score-0.573]

71 We store points that have been clustered (points in balls of size at least smin ) in the set Clustered. [sent-374, score-0.501]

72 The representative landmark of s is the landmark l of the first large ball Bl that contains s. [sent-376, score-0.819]

73 To efficiently update the components of GB , we maintain a disjoint-set data structure U that contains sets corresponding to the connected components of GB , where each ball Bl is represented by landmark l. [sent-377, score-0.486]

74 During the execution of the algorithm the connected components of GB must correspond to the sets of U (where each ball Bl is represented by landmark l). [sent-383, score-0.486]
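
A standard union-find sketch of the data structure U described here; path compression and union by size are our additions, the paper only requires that the sets track the connected components of GB:

    class DisjointSet:
        # Union-find over landmark ids; each set mirrors one component of GB.
        def __init__(self, landmarks):
            self.parent = {l: l for l in landmarks}
            self.size = {l: 1 for l in landmarks}

        def find(self, l):
            while self.parent[l] != l:
                self.parent[l] = self.parent[self.parent[l]]   # path compression
                l = self.parent[l]
            return l

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.size[ra] < self.size[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra                               # union by size
            self.size[ra] += self.size[rb]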

75 (pseudocode fragment: add(Cluster); end for; return C) Lemma 9 If balls Bl1 and Bl2 are directly connected in GB, then landmarks l1 and l2 must be in the same set in U. [sent-403, score-0.481]

76 Both of these sources classify proteins by their evolutionary relatedness, therefore we can use their classifications as a ground truth to evaluate the clusterings produced by our algorithm and other methods. [sent-429, score-0.417]

77 Once we cluster a particular data set, we compare the clustering to the manual classification using the distance measure from the theoretical part of our work. [sent-439, score-0.563]

78 In addition, we compare clusterings to manual classifications using the F-measure, which is used in another study that clusters protein sequences (Paccanaro et al. [sent-441, score-0.438]

79 In addition, our algorithm uses three parameters (q, smin , n′ ) whose value is set in the proof based on α and ε, assuming that the clustering instance satisfies the (1 + α, ε)-property. [sent-450, score-0.524]

80 In our experiments we set q as a function of the average size of the ground truth clusters (ave-size), smin as a function of the size of the smallest ground truth cluster (min-size), and n′ as a function of the number of points in the data set. [sent-452, score-0.873]

81 Landmark-Clustering is most sensitive to the smin parameter, and will not report a clustering if smin is too small or too large. [sent-457, score-0.77]

82 We recommend trying several values of this parameter, in increasing or decreasing order, until one gets a clustering and none of the clusters are too large. [sent-458, score-0.415]

83 If the user gets a clustering where one of the clusters is very large, this likely means that several ground truth clusters have been merged. [sent-459, score-0.669]

84 This may happen because smin is too small causing balls of outliers to connect different cluster cores, or smin is too large causing balls intersecting different cluster cores to overlap. [sent-460, score-1.1]

85 We start with smin = min-size, and decrement it until we get exactly k clusters and none of the clusters are too large (larger than twice the size of the largest ground truth cluster). [sent-462, score-0.637]
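
A sketch of this search heuristic; the stopping test mirrors the description above, while expand_landmarks refers to the simplified sketch given earlier and max_truth_size is the size of the largest ground truth cluster:

    def tune_s_min(dist_LS, k, n_prime, min_size, max_truth_size, expand_landmarks):
        # Decrement s_min from the smallest ground-truth cluster size until we get
        # exactly k clusters, none larger than twice the largest ground-truth cluster.
        for s_min in range(min_size, 1, -1):
            result = expand_landmarks(dist_LS, s_min, n_prime, k)
            if result is None:
                continue                                   # no k-clustering at this s_min
            components, balls = result
            sizes = [len(set().union(*(balls[i] for i in comp))) for comp in components]
            if max(sizes) <= 2 * max_truth_size:
                return s_min, components
        return None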

86 Again, for some values of n′ the algorithm may not output a clustering, or output a clustering where some of the clusters are too large. [sent-470, score-0.415]

87 The value of q must be large enough to avoid repeatedly choosing outliers (if q = 1 we are likely to choose an outlier in each iteration), but small enough to quickly find a landmark near each cluster core. [sent-472, score-0.553]

88 If we set q = n, the algorithm selects landmarks uniformly at random, and we may need significantly more landmarks to choose one from each cluster core by chance. [sent-473, score-0.844]

89 When the data has the structure that follows from our assumptions, the non-adaptive selection strategy may require significantly more landmarks to cover all cluster cores (especially if the sizes of the ground truth clusters are not well-balanced). [sent-476, score-0.784]

90 As discussed earlier, to test our adaptive landmark selection strategy we compare our algorithm, which is labeled LandmarkClustering-Adaptive, with the same algorithm that chooses landmarks uniformly at random, which we refer to as Landmark-Clustering-Random. [sent-480, score-0.724]

91 landmarks L, |L| = d; embed each point in a d-dimensional space using distances to L; use k-means clustering in this space (with distances given by the Euclidean norm). [sent-503, score-0.747]
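
A compact sketch of that baseline; scikit-learn's KMeans is our choice of k-means implementation, and the d landmarks are drawn uniformly at random:

    import numpy as np
    from sklearn.cluster import KMeans

    def landmark_embedding_kmeans(D, d, k, rng=np.random.default_rng()):
        # D: full n x n distance matrix (only the d landmark columns are used).
        n = D.shape[0]
        L = rng.choice(n, size=d, replace=False)     # the d landmark indices
        X = D[:, L]                                  # point i -> its distances to L
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)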

92 (2006) show that spectral clustering using sequence similarity data works well when applied to the proteins in SCOP. [sent-513, score-0.432]

93 Figure 4: Comparing the performance of Landmark-Clustering and spectral clustering on 10 data sets from SCOP; (a) comparison using fraction of misclassified points, (b) comparison using the F-measure. [sent-533, score-0.427]

94 This is likely because we are using a lot of landmarks (10 times the number of clusters), and selecting landmarks uniformly at random is sufficient to cover the dense groups of points. [sent-542, score-0.669]

95 Unfortunately for these data the algorithm has little success if we use fewer than 10k landmarks (it usually cannot find a clustering altogether), so we cannot test how the two selection strategies perform when we use fewer landmarks. [sent-543, score-0.601]

96 We then look at whether the sampled points are more similar to points in the same cluster than to points in other clusters. [sent-548, score-0.418]

97 In particular, in our experiments when we set n′ to n − smin + 1 as in Algorithm 1, the algorithm usually fails to report a clustering no matter what value of smin we try. [sent-560, score-0.77]

98 We proved that our algorithm yields accurate clusterings with only a small number of one versus all distance queries, given a natural assumption about the structure of the clustering instance. [sent-567, score-0.623]

99 F-measure: The F-measure compares two clusterings C and C′ by matching each cluster in C to a cluster in C′ using a harmonic mean of Precision and Recall, and then computing a “per-point” average. [sent-592, score-0.544]
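
One common form of this measure, sketched under the assumption that each cluster of C is matched to the cluster of C′ maximizing the harmonic mean of precision and recall, with the result weighted by cluster size; the paper's appendix defines the exact variant used:

    import numpy as np

    def f_measure(labels_a, labels_b, k_a, k_b):
        # Per-point F-measure between two labelings of the same n points
        # (clusters are assumed non-empty).
        n = len(labels_a)
        counts = np.zeros((k_a, k_b))
        for a, b in zip(labels_a, labels_b):
            counts[a, b] += 1
        size_a = counts.sum(axis=1)                  # |C_i|
        size_b = counts.sum(axis=0)                  # |C'_j|
        f = 2 * counts / (size_a[:, None] + size_b[None, :])   # harmonic mean of P and R
        return float((size_a * f.max(axis=1)).sum() / n)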

100 Min-sum clustering of protein sequences with limited distance information. [sent-781, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('landmark', 0.378), ('landmarks', 0.323), ('clustering', 0.278), ('smin', 0.246), ('ci', 0.22), ('balcan', 0.2), ('clusterings', 0.194), ('scop', 0.184), ('cluster', 0.175), ('gb', 0.175), ('bl', 0.174), ('pfam', 0.174), ('voevodski', 0.164), ('clusters', 0.137), ('dmin', 0.122), ('balls', 0.113), ('equences', 0.113), ('iological', 0.113), ('oglin', 0.113), ('ct', 0.098), ('blast', 0.092), ('eng', 0.087), ('lustering', 0.087), ('proteins', 0.081), ('points', 0.081), ('distance', 0.079), ('lm', 0.079), ('distances', 0.073), ('paccanaro', 0.072), ('ia', 0.067), ('queries', 0.064), ('ball', 0.063), ('iter', 0.063), ('ostrovsky', 0.061), ('farthest', 0.061), ('clustered', 0.061), ('ground', 0.059), ('truth', 0.058), ('detectable', 0.055), ('overlaps', 0.055), ('xi', 0.053), ('dist', 0.051), ('opt', 0.051), ('overlap', 0.051), ('versus', 0.049), ('ln', 0.049), ('protein', 0.046), ('good', 0.046), ('connected', 0.045), ('finn', 0.041), ('glin', 0.041), ('heiko', 0.041), ('runtime', 0.04), ('traversal', 0.039), ('working', 0.038), ('query', 0.037), ('active', 0.037), ('teng', 0.035), ('dcrit', 0.035), ('murzin', 0.035), ('structurally', 0.035), ('radius', 0.035), ('database', 0.034), ('items', 0.033), ('cores', 0.032), ('alignments', 0.032), ('alignment', 0.031), ('objective', 0.031), ('compi', 0.031), ('kmedian', 0.031), ('konstantin', 0.031), ('spread', 0.031), ('manual', 0.031), ('sequences', 0.03), ('argues', 0.029), ('target', 0.027), ('pr', 0.027), ('altschul', 0.026), ('brenner', 0.026), ('coresets', 0.026), ('evolutionary', 0.025), ('similarity', 0.025), ('spectral', 0.024), ('sequence', 0.024), ('accurate', 0.023), ('triangle', 0.023), ('uniformly', 0.023), ('stability', 0.022), ('dataset', 0.022), ('li', 0.022), ('closer', 0.021), ('satis', 0.021), ('acids', 0.02), ('argmini', 0.02), ('awasthi', 0.02), ('badoiu', 0.02), ('blm', 0.02), ('coreset', 0.02), ('czumaj', 0.02), ('faloutsos', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 12 jmlr-2012-Active Clustering of Biological Sequences

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University.

2 0.062504634 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers

Author: Blaine Nelson, Benjamin I. P. Rubinstein, Ling Huang, Anthony D. Joseph, Steven J. Lee, Satish Rao, J. D. Tygar

Abstract: Classifiers are often used to detect miscreant activities. We study how an adversary can systematically query a classifier to elicit information that allows the attacker to evade detection while incurring a near-minimal cost of modifying their intended malfeasance. We generalize the theory of Lowd and Meek (2005) to the family of convex-inducing classifiers that partition their feature space into two sets, one of which is convex. We present query algorithms for this family that construct undetected instances of approximately minimal cost using only polynomially-many queries in the dimension of the space and in the level of approximation. Our results demonstrate that nearoptimal evasion can be accomplished for this family without reverse engineering the classifier’s decision boundary. We also consider general ℓ p costs and show that near-optimal evasion on the family of convex-inducing classifiers is generally efficient for both positive and negative convexity for all levels of approximation if p = 1. Keywords: query algorithms, evasion, reverse engineering, adversarial learning

3 0.056272812 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences

4 0.055071395 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

Author: Ofer Dekel, Claudio Gentile, Karthik Sridharan

Abstract: We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the effectiveness of our techniques on a real-world Internet search problem. Keywords: online learning, regret, label-efficient, crowdsourcing

5 0.052327219 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann

Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models

6 0.046982363 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

7 0.04687991 109 jmlr-2012-Stability of Density-Based Clustering

8 0.044370536 82 jmlr-2012-On the Necessity of Irrelevant Variables

9 0.040491614 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

10 0.03936848 13 jmlr-2012-Active Learning via Perfect Selective Classification

11 0.038370278 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

12 0.03829312 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

13 0.037518416 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

14 0.037174545 80 jmlr-2012-On Ranking and Generalization Bounds

15 0.03562789 91 jmlr-2012-Plug-in Approach to Active Learning

16 0.035001047 4 jmlr-2012-A Kernel Two-Sample Test

17 0.03488604 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

18 0.034153767 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso

19 0.033063397 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

20 0.032783777 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.148), (1, 0.03), (2, 0.03), (3, -0.007), (4, -0.041), (5, 0.006), (6, 0.091), (7, 0.027), (8, 0.032), (9, 0.051), (10, -0.077), (11, -0.041), (12, -0.139), (13, -0.107), (14, -0.001), (15, 0.122), (16, 0.005), (17, -0.052), (18, 0.168), (19, -0.092), (20, 0.017), (21, -0.049), (22, -0.028), (23, -0.142), (24, -0.08), (25, -0.051), (26, -0.109), (27, -0.003), (28, -0.09), (29, 0.13), (30, 0.234), (31, 0.172), (32, 0.076), (33, -0.158), (34, 0.182), (35, 0.02), (36, 0.122), (37, 0.062), (38, 0.033), (39, 0.11), (40, 0.281), (41, 0.027), (42, -0.014), (43, -0.098), (44, 0.147), (45, -0.032), (46, 0.042), (47, -0.025), (48, -0.167), (49, 0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95755935 12 jmlr-2012-Active Clustering of Biological Sequences

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University.

2 0.50805312 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data

Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann

Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models

3 0.43284732 109 jmlr-2012-Stability of Density-Based Clustering

Author: Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman

Abstract: High density clusters can be characterized by the connected components of a level set L(λ) = {x : p(x) > λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree T = λ L(λ). In this paper, we study the behavior of a density level set estimate L(λ) and cluster tree estimate T based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L(λ) and T as a function of h, and investigate the theoretical properties of these instability measures. Keywords: clustering, density estimation, level sets, stability, model selection

4 0.3619099 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

Author: Marius Kloft, Pavel Laskov

Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection

5 0.31177214 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences

6 0.31125075 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

7 0.29077441 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

8 0.25110707 3 jmlr-2012-A Geometric Approach to Sample Compression

9 0.24955359 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization

10 0.2437672 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

11 0.24124427 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers

12 0.21612352 80 jmlr-2012-On Ranking and Generalization Bounds

13 0.20756137 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

14 0.20443223 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

15 0.19823916 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

16 0.17532144 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage

17 0.17463183 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

18 0.16911927 4 jmlr-2012-A Kernel Two-Sample Test

19 0.16017725 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

20 0.15569973 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.01), (21, 0.02), (26, 0.037), (29, 0.025), (35, 0.573), (56, 0.018), (64, 0.012), (75, 0.032), (79, 0.021), (92, 0.052), (96, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86102033 12 jmlr-2012-Active Clustering of Biological Sequences

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efficient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University.

2 0.8415013 95 jmlr-2012-Random Search for Hyper-Parameter Optimization

Author: James Bergstra, Yoshua Bengio

Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword

3 0.33952212 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

Author: Marius Kloft, Pavel Laskov

Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection

4 0.32107633 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers

Author: Blaine Nelson, Benjamin I. P. Rubinstein, Ling Huang, Anthony D. Joseph, Steven J. Lee, Satish Rao, J. D. Tygar

Abstract: Classifiers are often used to detect miscreant activities. We study how an adversary can systematically query a classifier to elicit information that allows the attacker to evade detection while incurring a near-minimal cost of modifying their intended malfeasance. We generalize the theory of Lowd and Meek (2005) to the family of convex-inducing classifiers that partition their feature space into two sets, one of which is convex. We present query algorithms for this family that construct undetected instances of approximately minimal cost using only polynomially-many queries in the dimension of the space and in the level of approximation. Our results demonstrate that nearoptimal evasion can be accomplished for this family without reverse engineering the classifier’s decision boundary. We also consider general ℓ p costs and show that near-optimal evasion on the family of convex-inducing classifiers is generally efficient for both positive and negative convexity for all levels of approximation if p = 1. Keywords: query algorithms, evasion, reverse engineering, adversarial learning

5 0.31863123 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

Author: Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski

Abstract: In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T 2 ) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly. Keywords: robust optimization, uncertain classification, kernel functions

6 0.31204888 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

7 0.30912653 109 jmlr-2012-Stability of Density-Based Clustering

8 0.29781908 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

9 0.29638883 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

10 0.2938064 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

11 0.29293743 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

12 0.29090554 103 jmlr-2012-Sampling Methods for the Nyström Method

13 0.28848884 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

14 0.28260568 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

15 0.27667826 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

16 0.27629918 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

17 0.27420709 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

18 0.273956 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

19 0.2735945 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

20 0.27237093 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection