nips nips2012 nips2012-257 nips2012-257-reference knowledge-graph by maker-knowledge-mining

257 nips-2012-One Permutation Hashing

Source: pdf

Author: Ping Li, Art Owen, Cun-hui Zhang

Abstract: Minwise hashing is a standard procedure in the context of search, for efﬁciently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression conﬁrm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.

reference text

[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Commun. ACM, volume 51, pages 117–122, 2008.

[2] Leon Bottou. http://leon.bottou.org/projects/sgd.

[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.

[4] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In WWW, pages 1157 – 1166, Santa Clara, CA, 1997.

[5] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions (extended abstract). In STOC, pages 106–112, 1977.

[6] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithm, 55(1):58–75, 2005.

[7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classiﬁcation. Journal of Machine Learning Research, 9:1871–1874, 2008.

[8] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.

[9] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for ﬁnding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

[10] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[11] Thorsten Joachims. Training linear svms in linear time. In KDD, pages 217–226, Pittsburgh, PA, 2006.

[12] Ping Li. Very sparse stable random projections for dimension reduction in lα (0 < α ≤ 2) norm. In KDD, San Jose, CA, 2007.

[13] Ping Li and Kenneth W. Church. Using sketches to estimate associations. In HLT/EMNLP, pages 708– 715, Vancouver, BC, Canada, 2005 (The full paper appeared in Commputational Linguistics in 2007).

[14] Ping Li, Kenneth W. Church, and Trevor J. Hastie. One sketch for all: Theory and applications of conditional random sampling. In NIPS, Vancouver, BC, Canada, 2008 (Preliminary results appeared in NIPS 2006).

[15] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In KDD, pages 287–296, Philadelphia, PA, 2006.

[16] Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian K¨ nig. Hashing algorithms for largeo scale learning. In NIPS, Granada, Spain, 2011.

[17] Ping Li, Anshumali Shrivastava, and Arnd Christian K¨ nig. b-bit minwise hashing in practice: Largeo scale batch and online learning and using GPUs for fast preprocessing with simple hash functions. Technical report.

[18] Ping Li and Arnd Christian K¨ nig. b-bit minwise hashing. In WWW, pages 671–680, Raleigh, NC, 2010. o

[19] Ping Li, Arnd Christian K¨ nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way simio larities. In NIPS, Vancouver, BC, 2010.

[20] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, pages 807–814, Corvalis, Oregon, 2007.

[21] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and S.V.N. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615–2637, 2009.

[22] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, 2012.

[23] Josef Sivic and Andrew Zisserman. Video google: a text retrieval approach to object matching in videos. In ICCV, 2003.

[24] Simon Tong. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.

[25] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113–1120, 2009. 9