320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

Source: pdf

Author: Dirk Hovy ; Chunliang Zhang ; Eduard Hovy ; Anselmo Peñas

Abstract: Learning by Reading (LbR) aims at enabling machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to do inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4m sentences, we learn about 250k simple propositions about American football in the form of predicate-argument structures like “quarterbacks throw passes to receivers”. Using several statistical measures, we show that our model is able to generalize and explain the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.

reference text

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 86–90. Association for Computational Linguistics, Morristown, NJ, USA.
Thorsten Brants and Alex Franz, editors. 2006. The Google Web 1T 5-gram Corpus Version 1.1. Number LDC2006T13. Linguistic Data Consortium, Philadelphia.
Samuel Brody. 2007. Clustering Clauses for High-Level Relation Detection: An Information-theoretic Approach. In Annual Meeting-Association for Computational Linguistics, volume 45, page 448.
Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC 2006. Citeseer.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.
James Fan, David Ferrucci, David Gondek, and Aditya Kalyanpur. 2010. Prismatic: Inducing knowledge from a large scale lexicalized relation resource. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 122–127, Los Angeles, California, June. Association for Computational Linguistics.
Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. the problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549.
Michael Fleischman, Namhee Kwon, and Eduard Hovy. 2003. Maximum entropy models for FrameNet classification. In Proceedings of EMNLP, volume 3.
Daniel Gildea and Dan Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for Computational Linguistics.
Jasper Wilson Holley and Joy Paul Guilford. 1964. A Note on the G-Index of Agreement. Educational and Psychological Measurement, 24(4):749.
Rutu Mulkar-Mehta, James Allen, Jerry Hobbs, Eduard Hovy, Bernardo Magnini, and Christopher Manning, editors. 2010. Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading. Association for Computational Linguistics, Los Angeles, California, June.
Thiago Pardo, Daniel Marcu, and Maria Nunes. 2006. Unsupervised Learning of Verb Argument Structures. Computational Linguistics and Intelligent Text Processing, pages 59–70.
Anselmo Peñas and Eduard Hovy. 2010. Semantic enrichment of text with background knowledge. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 15–23, Los Angeles, California, June. Association for Computational Linguistics.
Simone Paolo Ponzetto and Roberto Navigli. 2010. Knowledge-rich Word Sense Disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531. Association for Computational Linguistics.
Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434, Uppsala, Sweden, July. Association for Computational Linguistics.
Evan Sandhaus, editor. 2008. The New York Times Annotated Corpus. Number LDC2008T19. Linguistic Data Consortium, Philadelphia.
Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.
Stephanie Strassel, Dan Adams, Henry Goldberg, Jonathan Herr, Ron Keesing, Daniel Oblinger, Heather Simpson, Robert Schrag, and Jonathan Wright. 2010. The DARPA Machine Reading Program-Encouraging Linguistic and Reasoning Research with a Series of Reading Tasks. In Proceedings of LREC 2010.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM.