jmlr jmlr2012 jmlr2012-12 jmlr2012-12-reference knowledge-graph by maker-knowledge-mining

12 jmlr-2012-Active Clustering of Biological Sequences

Source: pdf

Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia

Abstract: Given a point set S and an unknown metric d on S, we study the problem of efﬁciently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efﬁciently ﬁnd an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely ﬁts our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classiﬁcation. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences ∗. A preliminary version of this article appeared under the title Efﬁcient Clustering with Limited Distance Information in the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artiﬁcial Intelligence, AUAI Press, Corvallis, Oregon, 632-641. †. Most of this work was completed at Boston University. c 2012 Konstantin Voevodski, Maria-Florina Balcan, Heiko R¨ glin, Shang-Hua Teng and Yu Xia. o ¨ VOEVODSKI , BALCAN , R OGLIN , T ENG AND X IA

reference text

N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Advances in Neural Information Processing Systems, 2009. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms, 2007. P. Awasthi, A. Blum, and O. Sheffet. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 42nd ACM Symposium on Theory of Computing, 2010. M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via coresets. In Proceedings of the 34th ACM Symposium on Theory of Computing, 2002. M. F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approximation. In Proceedings of the 20th ACM-SIAM Symposium on Discrete Algorithms, 2009. S. Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2-3):243–257, 2007. S. E. Brenner, C. Chothia, and T. J. Hubbard. Assessing sequence comparison methods with reliable structurally identiﬁed distant evolutionary relationships. Proceedings of the National Academy of Sciences USA, 95(11):6073–6078, 1998. K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, 2010. D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. ACM Transactions on Database Systems, 31(4):1499–1525, 2006. A. Czumaj and C. Sohler. Sublinear-time approximation algorithms for clustering via random sampling. Random Structures and Algorithms, 30(1-2):226–256, 2007. S. Dasgupta. Performance guarantees for hierarchical clustering. In Jyrki Kivinen and Robert Sloan, editors, Computational Learning Theory, volume 2375 of Lecture Notes in Computer Science, pages 235–254. Springer Berlin / Heidelberg, 2002. B. Eriksson, G. Dasarathy, A. Singh, and R. Nowak. Active clustering: Robust and efﬁcient hierarchical clustering using adaptively selected similarities. In Proceedings of the 14th International Conference on Artiﬁcial Intelligence and Statistics, 2011. C. Faloutsos and K. Lin. Fastmap: A fast algorithm for indexing, datamining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 1995. D. Feldman and M. Langberg. A uniﬁed framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, 2011. 224 ACTIVE C LUSTERING OF B IOLOGICAL S EQUENCES R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy, and A. Bateman. The Pfam protein families database. Nucleic Acids Research, 38:D211–222, 2010. T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. A. Kumar, Y. Sabharwal, and S. Sen. Linear time algorithms for clustering problems in any dimensions. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming, 2005. N. Mishra, D. Oblinger, and L Pitt. Sublinear time approximate clustering. In Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, 2001. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. Scop: A structural classiﬁcation of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995. R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th IEEE Symposium on Foundations of Computer Science, 2006. A. Paccanaro, J. A. Casbon, and M. A. S. Saqi. Spectral clustering of protein sequences. Nucleic Acids Research, 34(5):1571–1580, 2006. F. Schalekamp, M. Yu, and A. van Zuylen. Clustering with or without the approximation. In Proceedings of the 16th Annual International Computing and Combinatorics Conference, 2010. L. Tang and M. Crovella. Virtual landmarks for the internet. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, 2003. K. Voevodski, M. F. Balcan, H. R¨ glin, S. Teng, and Y. Xia. Min-sum clustering of protein seo quences with limited distance information. In Proceedings of the 1st International Workshop on Similarity-Based Pattern Analysis and Recognition, 2011. D. Wishart. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Proceedings of the Colloquium on Numerical Taxonomy held in the University of St. Andrews, 1969. 225