acl acl2011 acl2011-189 acl2011-189-reference knowledge-graph by maker-knowledge-mining

189 acl-2011-K-means Clustering with Feature Hashing

Source: pdf

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to Kmeans, by showing how much will the objective of K-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

reference text

Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671– 687, June. David Arthur and Sergei Vassilvitskii. 2007. k-means++ : The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. 2010. Random Projections for k-means Clustering. In Advances in Neural Information Processing Systems 23, number iii, pages 298–306. Kuzman Ganchev and Mark Dredze. 2008. Small Statistical Models by Random Feature Mixing. In Proceedings of the ACL08 HLT Workshop on Mobile Language Processing, pages 19–20. Stuart P. Lloyd. 1982. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2): 129–137. J MacQueen. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch u¨tze. 2008. Introduction to Information Re- trieval. Cambridge University Press. Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and S.V.N. Vishwanathan. 2009. Hash Kernels for Structured Data. Journal of Machine Learning Research, 10:2615–2637. Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature Hashing for Large Scale Multitask Learning. In Proceedings of the 26th International Conference on Machine Learning.