acl acl2010 acl2010-183 acl2010-183-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Van Durme ; Ashwin Lall
Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of STOC. Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI. Wei Dong, Moses Charikar, and Kai Li. 2009. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In Proceedings of SIGIR. Michel X. Goemans and David P. Williamson. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. JACM, 42: 1115–1 145. Amit Goyal, Hal Daum e´ III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language Modeling. In Proceedings of NAACL. Sasa Petrovic, Miles Osborne, and Victor Lavrenko. 2010. Streaming First Story Detection with application to Twitter. In Proceedings of NAACL. Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In Proceedings of ACL. David Talbot. 2009. Succinct approximate counting of skewed data. In Proceedings of IJCAI. Benjamin Van Durme and Ashwin Lall. 2009a. Probabilistic Counting with Randomized Storage. In Proceedings of IJCAI. Benjamin Van Durme and Ashwin Lall. 2009b. Streaming Pointwise Mutual Information. In Advances in Neural Information Processing Systems 22. 235