jmlr jmlr2013 jmlr2013-33 jmlr2013-33-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Reza Bosagh Zadeh, Ashish Goel
Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com. Keywords: cosine, Jaccard, overlap, dice, similarity, MapReduce, dimension independent
V. Abhishek and K. Hosanagar. Keyword generation for search engine advertising using semantic similarity between terms. In EC 2007, pages 89–94. ACM, 2007. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB 2006, pages 918–929. VLDB Endowment, 2006. R. Baraglia, G. De Francisci Morales, and C. Lucchese. Document similarity self-join with mapreduce. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 731–736. IEEE, 2010. 1624 D IMENSION I NDEPENDENT S IMILARITY C OMPUTATION MinHash DISCO Similarity 1.8 1.6 avg relative err 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0 0.1 0.2 0.3 0.4 0.5 similarity threshold Figure 8: Average error for all pairs with similarity ≥ ε. DISCO MinHash Jaccard similarity error decreases for more similar pairs. We are computing error with MinHash here, not ground truth. D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In SIGKDD 2000, pages 407–416. ACM, 2000. D. Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 2007. A.Z. Broder. On the resemblance and containment of documents. In CCS 1997, pages 21–29. IEEE, 1997. A.Z. Broder, S.C. Glassman, M.S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13):1157–1166, 1997. A. Campagna and R. Pagh. Finding associations and computing similarity via biased pair sampling. Knowledge and Information Systems, 31(3):505–526, 2012. M.S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC 2002, pages 380–388. ACM, 2002. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE 2006, pages 5–5. IEEE, 2006. S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In WWW 2005, pages 2–11. ACM, 2005. S.L. Chuang and L.F. Chien. Taxonomy generation for text segments: A practical web-based approach. ACM TOIS, 23(4):363–396, 2005. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. ISSN 0001-0782. 1625 B OSAGH Z ADEH AND G OEL T. Elsayed, J. Lin, and D.W. Oard. Pairwise document similarity in large collections with mapreduce. In ACL 2008, pages 265–268. Association for Computational Linguistics, 2008. R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In ACM SIGMOD 2003, pages 301–312. ACM, 2003. A.F. Gates, O. Natkovich, S. Chopra, P. Kamath, S.M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB 1999, pages 518–529. Morgan Kaufmann Publishers Inc., 1999. A Goel and K Munagala. Complexity measures for map-reduce, and comparison to parallel computing. Manuscript http://www.stanford.edu/˜ ashishg/papers/ mapreducecomplexity.pdf, 2012. P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. Wtf: The who to follow service at twitter. The WWW 2013 Conference, 2013. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC 1998, pages 604–613. ACM, 1998. J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. In SIGIR 2009, pages 155–162. ACM, 2009. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008. P. Pantel, E. Crestan, A. Borkovsky, A.M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In EMNLP 2009. ACL, 2009. M. Regelson and D. Fain. Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions, volume 9623. Citeseer, 2006. M. Sahami and T.D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW 2006, pages 377–386. ACM, 2006. S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In ACM SIGMOD 2004, pages 743–754. ACM, 2004. E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social network. In SIGKDD 2005, pages 678–684. ACM, 2005. S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th International Conference on World Wide Web, pages 607–614. ACM, 2011. R.B. Zadeh and G. Carlsson. Dimension independent matrix square using mapreduce. Symposium on Theory of Computing, 2013. 1626