nips nips2009 nips2009-234 nips2009-234-reference knowledge-graph by maker-knowledge-mining

234 nips-2009-Streaming k-means approximation

Source: pdf

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (ﬁtting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1

reference text

[ADK09] Ankit Aggarwal, Amit Deshpande and Ravi Kannan: Adaptive Sampling for k-means Clustering. APPROX, 2009. [AL09] Nir Ailon and Edo Liberty: Correlation Clustering Revisited: The ”True” Cost of Error Minimization Problems. To appear in ICALP 2009. [AMS96] Noga Alon, Yossi Matias, and Mario Szegedy.: The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 20–29, 1996. [AV06] David Arthur and Sergei Vassilvitskii: Worst-case and smoothed analyses of the icp algorithm, with an application to the k-means method. FOCS, 2006 [AV07] David Arthur and Sergei Vassilvitskii: k-means++: the advantages of careful seeding. SODA, 2007. [AGKM+04] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit: Local search heuristics for k-median and facility location problems. Siam Journal of Computing, 33(3):544–562, 2004. [AN07] A. Asuncion and D.J. Newman: UCI Machine Learning Repository. http://www.ics.uci.edu/∼mlearn/MLRepository.html, University of California, Irvine, School of Information and Computer Sciences, 2007. [BBG09] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta: Approximate Clustering without the Approximation. SODA, 2009. [BL08] S. Ben-David and U. von Luxburg: Relating clustering stability to properties of cluster boundaries. COLT, 2008 [CCP03] Moses Charikar and Liadan O’Callaghan and Rina Panigrahy: Better streaming algorithms for clustering problems. STOC, 2003. [CG99] Moses Charikar and Sudipto Guha: Improved combinatorial algorithms for the facility location and k-medians problem. FOCS, 1999. [CMTS02] M. Charikar, S. Guha , E Tardos, and D. Shmoys: A Constant Factor Approximation Algorithm for the k-Median Problem. Journal of Computer and System Sciences, 2002. [CR08] Kamalika Chaudhuri and Satish Rao: Learning Mixtures of Product Distributions using Correlations and Independence. COLT, 2008. [Das08] Sanjoy Dasgupta.: Course notes, CSE 291: Topics in unsupervised learning. http://wwwcse.ucsd.edu/ dasgupta/291/index.html, University of California, San Diego, Spring 2008. [Gon85] T. F. Gonzalez: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, pages 293–306, 1985. [GMMM+03] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan: Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15(3): 515–528, 2003. [Ind99] Piotr Indyk: Sublinear Time Algorithms for Metric Space Problems. STOC, 1999. [JV01] K. Jain and Vijay Vazirani: Approximation Algorithms for Metric Facility Location and k-Median Problems Using the Primal-Dual Schema and Lagrangian Relaxation. Journal of the ACM. 2001. [KMNP+04] T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu: A Local Search Approximation Algorithm for k-Means Clustering, Computational Geometry: Theory and Applications, 28, 89-112, 2004. [LV92] J. Lin and J. S. Vitter: Approximation Algorithms for Geometric Median Problems. Information Processing Letters, 1992. [McG07] Andrew McGregor: Processing Data Streams. Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, 2007. [M05] S. Muthukrishnan: Data Streams: Algorithms and Applications, NOW Publishers, Inc., Hanover MA [ORSS06] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy: The effectiveness of Lloyd-type methods for the k-means problem. FOCS, 2006. [VW02] V. Vempala and G. Wang: A spectral algorithm of learning mixtures of distributions. pages 113–123, FOCS, 2002. 9