emnlp emnlp2012 emnlp2012-52 emnlp2012-52-reference knowledge-graph by maker-knowledge-mining

52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

Source: pdf

Author: Amit Goyal ; Hal Daume III ; Raul Guerra

Abstract: Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorithm with novel fast approximate nearest neighbor search algorithms (variants of PLEB). These algorithms return the approximate nearest neighbors quickly. We show our system’s efficiency in both intrinsic and extrinsic experiments. We further evaluate our fast search algorithms both quantitatively and qualitatively on two NLP applications.

reference text

Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pas ¸ca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In NAACL ’09: Proceedings of HLT-NAACL. Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theor. Comput. Sci., 3 12:3–15, January. Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In In Proc. of 34th STOC, pages 380–388. ACM. K. Church and P. Hanks. 1989. Word Association Norms, Mutual Information and Lexicography. In Proceedings of ACL, pages 76–83, Vancouver, Canada, June. Graham Cormode and Marios Hadjieleftheriou. 2008. Finding frequent items in data streams. In VLDB. Graham Cormode and S. Muthukrishnan. 2004. An improved data stream summary: The count-min sketch and its applications. J. Algorithms. Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600–609, Portland, Oregon, USA, June. Association for Computational Linguistics. Cristian Estan and George Varghese. 2002. New directions in traffic measurement and accounting. SIGCOMM Comput. Commun. Rev., 32(4). L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2002. Placing search in context: The concept revisited. In ACM Transactions on Information Systems. Zoubin Ghahramani and Katherine A. Heller. 2005. Bayesian Sets. In in Advances in Neural Information Processing Systems, volume 18. Amit Goyal and Hal Daum e´ III. 2011. Approximate scalable bounded space sketch for large data NLP. In Empirical Methods in Natural Language Processing (EMNLP). Amit Goyal, Hal Daum e´ III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language modeling. In NAACL. Amit Goyal, Graham Cormode, and Hal Daum e´ III. 2012. Sketch algorithms for estimating point queries in NLP. In Empirical Methods in Natural Language Processing (EMNLP). 1079 D. Graff. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, PA, January. Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC ’98, pages 604–613. ACM. Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048–1056, Columbus, Ohio, June. Association for Computational Linguistics. Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’ 10, pages 394–402. Association for Computational Linguistics. Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 287–296. ACM. Ping Li, Kenneth Ward Church, and Trevor Hastie. 2008. One sketch for all: Theory and application of conditional random sampling. In Neural Information Processing Systems, pages 953–960. Tara McIntosh and James R Curran. 2008. Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In Proceedings of the Australasian Language Technology Association Workshop 2008, pages 97–105, December. G.A. Miller and W.G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1): 1–28. Patrick Pantel, Eric Crestan, Arkady Borkovsky, AnaMaria Popescu, and Vishnu Vyas. 2009. Web-scale distributional similarity and entity set expansion. In Proceedings of EMNLP. Delip Rao and Deepak Ravichandran. 2009. Semisupervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 675–682, Athens, Greece, March. Association for Computational Linguistics. Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and nlp: using locality sensitive hash function for high speed noun clustering. In Proceedings of ACL. H. Rubenstein and J.B. Goodenough. 1965. Contextual correlates of synonymy. Computational Linguistics, 8:627–633. Florin Rusu and Alin Dobra. 2007. Statistical analysis of sketch estimators. In SIGMOD ’07. ACM. M. Thelen and E. Riloff. for Learning Semantic 2002. A Bootstrapping Method Lexicons Using Extraction Pat- tern Contexts. In Proceedings ods Language Processing, pages in Natural 214–221 . and Patrick Pantel. 2010. From fremeaning: Vector space models of seman- Peter D. Turney quency to of the Empirical Meth- tics. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 37: 141. Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 680–690. Association for Computational Linguistics. Benjamin Van Durme and Ashwin Lall. 2009a. Probabilistic counting with randomized storage. In IJCAI’09: Proceedings of the 21st international jont conference on Artifical intelligence. Benjamin Van Durme and Ashwin Lall. 2009b. Streaming pointwise mutual information. In Advances in Neural Information Processing Systems 22. Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Proceedings of the ACL 2010 Conference Short Papers, pages 231–235, July. Benjamin Van Durme and Ashwin Lall. 2011. Efficient online locality sensitive hashing via reservoir counting. In Proceedings of the ACL 2011 Conference Short Papers, June. Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald. 2010. The viability ofwebderived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 777–785, Los Angeles, California, June. Association for Computational Linguistics. Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1113–1 120. ACM. 1080