nips nips2010 nips2010-289 nips2010-289-reference knowledge-graph by maker-knowledge-mining

289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

Source: pdf

Author: Ping Li, Arnd Konig, Wenhao Gui

Abstract: Computing1 two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way similarity is technically non-trivial. We develop the precise estimator which is accurate and very complicated; and we recommend a much simpliﬁed estimator suitable for sparse data. Our analysis shows that b-bit minwise hashing can normally achieve a 10 to 25-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance. 1

reference text

[1] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005.

[2] M. Bendersky and W. B. Croft. Finding text reuse on the web. In WSDM, pages 262–271, Barcelona, Spain, 2009.

[3] A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.

[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157 – 1166, Santa Clara, CA, 1997.

[5] G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, Stanford, CA, 2008.

[6] O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classiﬁcation. 10(5):1055–1064, 1999.

[7] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[8] S. Chaudhuri. An Overview of Query Optimization in Relational Systems. In PODS, pages 34–43, 1998.

[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operatior for similarity joins in data cleaning. In ICDE, 2006.

[10] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identiﬁcation of fuzzy duplicates. In ICDE, pages 865–876, Tokyo, Japan, 2005.

[11] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219–228, Paris, France, 2009.

[12] K. Church. Approximate lexicography and web search. International Journal of Lexicography, 21(3):325–336, 2008.

[13] K. Church and P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22–29, 1991.

[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. on Knowl. and Data Eng., 13(1), 2001.

[15] F. Diaz. Integration of News Content into Web Results. In WSDM, 2009.

[16] Y. Dourisboure, F. Geraci, and M. Pellegrini. Extraction and classiﬁcation of dense implicit communities in the web graph. ACM Trans. Web, 3(2):1–36, 2009.

[17] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.

[18] G. Forman, K. Eshghi, and J. Suermondt. Efﬁcient detection of large-scale redundancy in enterprise ﬁle systems. SIGOPS Oper. Syst. Rev., 43(1):84–91, 2009.

[19] M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, and A. C. K¨ nig. Blews: Using blogs to provide context for news articles. In AAAI o Conference on Weblogs and Social Media, 2008.

[20] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: the Complete Book. Prentice Hall, New York, NY, 2002.

[21] A. Gionis, D. Gunopulos, and N. Koudas. Efﬁcient and tunable similar set retrieval. In SIGMOD, pages 247–258, CA, 2001.

[22] S. Gollapudi and A. Sharma. An axiomatic approach for result diversiﬁcation. In WWW, pages 381–390, Madrid, Spain, 2009.

[23] M. Hein and O. Bousquet. Hilbertian metrics and positive deﬁnite kernels on probability measures. In AISTATS, pages 136–143, Barbados, 2005.

[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[25] Y. E. Ioannidis. The history of histograms (abridged). In VLDB, 2003.

[26] Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.

[27] N. Jindal and B. Liu. Opinion spam and analysis. In WSDM, pages 219–230, Palo Alto, California, USA, 2008.

[28] K. Kalpakis and S. Tang. Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Computer Communications, 31(10):1979–1992, 2008.

[29] A. C. K¨ nig, M. Gamon, and Q. Wu. Click-Through Prediction for News Queries. In SIGIR, 2009. o

[30] H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. In PVLDB, 2009.

[31] P. Li and K. W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 33(3):305– 354, 2007 (Preliminary results appeared in HLT/EMNLP 2005).

[32] P. Li, K. W. Church, and T. J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, BC, Canada, 2006.

[33] P. Li, K. W. Church, and T. J. Hastie. One sketch for all: Theory and applications of conditional random sampling. In NIPS, Vancouver, BC, Canada, 2008.

[34] P. Li, T. J. Hastie, and K. W. Church. Improving random projections using marginal information. In COLT, pages 635–649, Pittsburgh, PA, 2006.

[35] P. Li and A. C. K¨ nig. b-bit minwise hashing. In WWW, pages 671–680, Raleigh, NC, 2010. o

[36] Ludmila, K. Eshghi, C. B. M. III, J. Tucek, and A. Veitch. Probabilistic frequent itemset mining in uncertain databases. In KDD, pages 1087–1096, Paris, France, 2009.

[37] G. S. Manku, A. Jain, and A. D. Sarma. Detecting Near-Duplicates for Web-Crawling. In WWW, Banff, Alberta, Canada, 2007.

[38] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.

[39] M. Najork, S. Gollapudi, and R. Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In WSDM, pages 242–251, Barcelona, Spain, 2009.

[40] S. Sarawagi and A. Kirpal. Efﬁcient set joins on similarity predicates. In SIGMOD, pages 743–754, 2004.

[41] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):1–28, 2008.

[42] X. Wang and C. Zhai. Mining term association patterns from search logs for effective query reformulation. In CIKM, pages 479–488, Napa Valley, California, USA, 2008.

[43] D. Zhou, J. Huang, and B. Sch¨ lkopf. Beyond pairwise classiﬁcation and clustering using hypergraphs. 2006. o