
94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

Source: pdf

Author: Maria-Florina Balcan, Steven Ehrlich, Yingyu Liang

Abstract: This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.
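To illustrate the coreset idea mentioned in the abstract, the following is a minimal single-machine sketch, not the distributed construction from the paper. It assumes a generic importance-sampling scheme (half cost-proportional, half uniform) with inverse-probability weights, so that the weighted cost of the sample is an unbiased estimate of the full k-means cost for any set of centers. The function names `kmeans_cost` and `sample_coreset`, and the parameter `rough_centers`, are hypothetical and chosen only for this illustration.

```python
import random


def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of w * squared distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        total += w * nearest
    return total


def sample_coreset(points, rough_centers, size, rng):
    """Draw a weighted sample of `size` points (assumed scheme, not the paper's).

    Each point is picked with probability 0.5 * cost / total_cost + 0.5 / n,
    where cost is measured against `rough_centers`. Each pick gets weight
    1 / (size * probability), which keeps the weighted cost unbiased.
    """
    n = len(points)
    costs = [kmeans_cost([p], rough_centers) for p in points]
    total = sum(costs)
    if total > 0:
        probs = [0.5 * c / total + 0.5 / n for c in costs]
    else:
        # Every point sits on a rough center: fall back to uniform sampling.
        probs = [1.0 / n] * n
    picks = rng.choices(range(n), weights=probs, k=size)
    coreset = [points[i] for i in picks]
    weights = [1.0 / (size * probs[i]) for i in picks]
    return coreset, weights


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    data += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(500)]
    coreset, weights = sample_coreset(data, [(0.0, 0.0), (8.0, 8.0)], 100, rng)
    query = [(1.0, 1.0), (7.0, 7.0)]
    print("full cost   :", kmeans_cost(data, query))
    print("coreset cost:", kmeans_cost(coreset, query, weights))
```

A clustering algorithm can then be run on the small weighted sample instead of the full data set. How well the sample cost tracks the full cost depends on the sample size and on the quality of the rough centers.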

reference text

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 2002.

[2] P. Awasthi and M. Balcan. Center based clustering: A foundational perspective. Survey Chapter in Handbook of Cluster Analysis (Manuscript), 2013.

[3] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In Proceedings of the Conference on Learning Theory, 2012.

[4] J. Considine, F. Li, G. Kollios, and J. Byers. Approximate aggregation techniques for sensor databases. In Proceedings of the International Conference on Data Engineering, 2004.

[5] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google's globally-distributed database. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2012.

[6] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Efficient protocols for distributed classification and optimization. In Algorithmic Learning Theory, pages 154–168. Springer, 2012.

[7] S. Dutta, C. Gianella, and H. Kargupta. K-means clustering over peer-to-peer networks. In Proceedings of the International Workshop on High Performance and Distributed Mining, 2005.

[8] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2011.

[9] D. Feldman, A. Sugaya, and D. Rus. An effective coreset compression algorithm for large scale sensor networks. In Proceedings of the International Conference on Information Processing in Sensor Networks, 2012.

[10] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. ACM SIGKDD Explorations Newsletter, 2000.

[11] S. Greenhill and S. Venkatesh. Distributed query processing for mobile surveillance. In Proceedings of the International Conference on Multimedia, 2007.

[12] M. Greenwald and S. Khanna. Power-conserving computation of order-statistics over sensor networks. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004.

[13] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2004.

[14] E. Januzaj, H. Kriegel, and M. Pfeifle. Towards effective and efficient distributed clustering. In Workshop on Clustering Large Data Sets in the IEEE International Conference on Data Mining, 2003.

[15] R. Kannan and S. Vempala. Nimble algorithms for cloud computing. arXiv preprint arXiv:1304.3162, 2013.

[16] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the Annual Symposium on Computational Geometry, 2002.

[17] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 2001.

[18] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2013.

[19] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2000.

[20] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti. Characterizing web-based video sharing workloads. ACM Transactions on the Web, 2011.

[21] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.

[22] D. Tasoulis and M. Vrahatis. Unsupervised distributed clustering. In Proceedings of the International Conference on Parallel and Distributed Computing and Networks, 2004.

[23] Q. Zhang, J. Liu, and W. Wang. Approximate clustering on distributed data streams. In Proceedings of the IEEE International Conference on Data Engineering, 2008.