nips nips2011 nips2011-95 nips2011-95-reference knowledge-graph by maker-knowledge-mining

95 nips-2011-Fast and Accurate k-means For Large Datasets

Source: pdf

Author: Michael Shindler, Alex Wong, Adam W. Meyerson

Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o( nk) (where n is the number of data points; note that computing the cost, given a solution, takes 8(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.

reference text

[1] http://www.cs.uni-paderbom.de/en/fachgebiete/ag-bloemer/research/clustering/streamkmpp.

[2] Marcel R. Ackermann, Christian Lammersen, Marcus Martens, Christoph Raupach, Christian Sohler, and Kamil Swierkot. StreamKM++: A clustering algorithms for data streams. In ALENEX, 2010.

[3] Cham C. Aggarwal, editor. Data Streams: Models and Algorithms. Springer, 2007. 8

[4] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In NIPS, 2009.

[5] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. An efficient k-means clustering algorithm. In HPDM, 1998.

[6] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, January 2008.

[7] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA, 2007.

[8] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In STOC, 2001.

[9] Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and Brian Tagiku. Streaming k-means on Well-Clusterable Data. In SODA, 201l.

[10] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In STOC, 2002.

[11] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC, 2003.

[12] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 2009. .

[13] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In SCG, 2007.

[14] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In STOC, 2005.

[15] A. Frank and A. Asuncion. DCI machine learning repository, 2010.

[16] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. In TDKE, 2003.

[17] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan 0' Callaghan. streams. In FOCS, 2000. Clustering data

[18] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC, 2004.

[19] Anil Kumar Jain, M Narasimha Murty, and Patrick Joseph Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.

[20] Tapas Kanungo, David Mount, Nathan Netanyahu, Christine Piatko, Ruth Silverman, and Angela Wu. A local search approximation algorithm for k-means clustering. In SCG, 2002.

[21] Stuart Lloyd. Least Squares Quantization in PCM. In Special issue on quantization, IEEE Transactions on Information Theory, 1982.

[22] James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[23] Joel Max. Quantizing for minimum distortion. IEEE Transactions on Information Theory, 1960.

[24] Adam Meyerson. Online facility location. In FOCS, 200l.

[25] Rafail Ostrovsky, Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. The Effectiveness of Lloyd-Type Methods for the k-Means Problem. In FOCS, 2006.

[26] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, 2006.

[27] Dan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In KDD, 1999.

[28] Steven 1. Phillips. Acceleration of k-means and related clustering problems. In ALENEX, 2002.

[29] Andrea Vattani. k-means requires exponentially many iterations even in the plane. Discrete Computational Geometry, June 2011. 9