Lexical Co-occurrence, Statistical Significance, and Word Association (EMNLP 2011, paper 86)


Source: pdf

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
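The idea in the abstract, comparing the observed span of a word pair within a document against the spans expected under a random null model, can be illustrated with a small Monte Carlo sketch. This is not the measure defined in the paper. It assumes a simple null model in which the document's tokens are randomly shuffled, and the function names `min_span` and `span_p_value` are hypothetical, introduced only for this illustration.

```python
import random


def min_span(tokens, w1, w2):
    """Smallest positional distance between any occurrence of w1 and of w2."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    if not pos1 or not pos2:
        return None  # the pair does not co-occur in this document
    return min(abs(i - j) for i in pos1 for j in pos2)


def span_p_value(tokens, w1, w2, trials=2000, seed=0):
    """Estimate P(span under the null <= observed span).

    Assumed null model (not taken from the paper): every permutation
    of the document's tokens is equally likely.
    """
    observed = min_span(tokens, w1, w2)
    if observed is None:
        raise ValueError("both words must occur in the document")
    rng = random.Random(seed)
    shuffled = list(tokens)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if min_span(shuffled, w1, w2) <= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


# Toy document: 50 tokens with the two words of interest adjacent.
doc = ["x"] * 50
doc[10], doc[11] = "a", "b"
p_close = span_p_value(doc, "a", "b")
```

A small value of `p_close` indicates that the pair occurs closer together than the shuffled null model would explain, which is the situation in which the abstract hypothesizes a more direct association. Note that only the document containing both words is examined, so global unigram frequencies play no role in this sketch.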


reference text

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In NAACL-HLT.
Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Measuring semantic similarity between words using web search engines. In WWW, pages 757–766.
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.
Hsin-Hsi Chen, Ming-Shun Lin, and Yu-Chuan Wei. 2006. Novel association measures using web search with double checking. In ACL.
Kenneth Ward Church and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL, pages 76–83.
L. R. Dice. 1945. Measures of the amount of ecological association between species. Ecology, 26:297–302.
Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
ESSLLI. 2008. Free association task at lexical semantics workshop esslli 2008. http://wordspace.collocations.de/doku.php/workshop:esslli:task.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1):116–131.
Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI.
Robert Goldfarb and Harvey Halpern. 1984. Word association responses in normal adult subjects. Journal of Psycholinguistic Research, 13(1):37–55.
T. Hughes and D. Ramage. 2007. Lexical semantic relatedness with random graph walks. In EMNLP.
P. Jaccard. 1912. The distribution of the flora of the alpine zone. New Phytologist, 11:37–50.
Svante Janson and Jan Vegelius. 1981. Measures of ecological association. Oecologia, 49:371–376.
Mario Jarmasz. 2003. Roget's thesaurus as a lexical resource for natural language processing. Technical report, University of Ottawa.
G. Kent and A. Rosanoff. 1910. A study of association in insanity. American Journal of Insanity, pages 317–390.
G. Kiss, C. Armstrong, R. Milroy, and J. Piper. 1973. An associative thesaurus of english and its computer analysis. In The Computer and Literary Studies, pages 379–382. Edinburgh University Press.
T. Landauer and S. Dumais. 1997. The latent semantic analysis theory of acquisition, induction, and representation of knowledge. In Psychological Review, volume 104/2, pages 211–240.
Sonya Liberman and Shaul Markovitch. 2009. Compact hierarchical explicit semantic representation. In Proceedings of the IJCAI 2009 Workshop on User-Contributed Knowledge and Artificial Intelligence: An Evolving Synergy (WikiAI09), Pasadena, CA, July.
G. A. Miller and W. G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
David Milne and Ian H. Witten. 2008. An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In ACL.
D. Nelson, C. McEvoy, J. Walling, and J. Wheeler. 1980. The university of south florida homograph norms. Behaviour Research Methods and Instrumentation, 12:16–37.
A. Ochiai. 1957. Zoogeographical studies on the soleoid fishes found in japan and its neighbouring regions-ii. Bulletin of the Japanese Society of Scientific Fisheries, 22.
Pavel Pecina and Pavel Schlesinger. 2006. Combining association measures for collocation extraction. In ACL.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, October.
W. A. Russell and J. J. Jenkins. 1954. The complete minnesota norms for responses to 100 words from the kent-rosanoff word association test. Technical report, Office of Naval Research and University of Minnesota.
StopWordList. 2010. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words. The Information Retrieval Group, University of Glasgow. Accessed: November 15, 2010.
Michael Strube and Simone Paolo Ponzetto. 2006. Wikirelate! computing semantic relatedness using wikipedia. In AAAI, pages 1419–1424.
T. Wandmacher, E. Ovchinnikova, and T. Alexandrov. 2008. Does latent semantic analysis reflect human associations? In European Summer School in Logic, Language and Information (ESSLLI’08).
Justin Washtell and Katja Markert. 2009. A comparison of windowless and window-based computational association measures as predictors of syntagmatic human associations. In EMNLP, pages 628–637.
Katherine K. White and Lise Abrams. 2004. Free associations and dominance ratings of homophones for young and older adults. Behavior Research Methods, Instruments, & Computers, 36(3):408–420.
Wikipedia. April 2008. http://www.wikipedia.org.
Eric Yeh, Daniel Ramage, Chris Manning, Eneko Agirre, and Aitor Soroa. 2009. Wikiwalk: Random walks on wikipedia for semantic relatedness. In ACL workshop “TextGraphs-4: Graph-based Methods for Natural Language Processing”.