
174 acl-2011-Insights from Network Structure for Text Mining

Source: pdf

Author: Zornitsa Kozareva ; Eduard Hovy

Abstract: Text mining and data harvesting algorithms have become popular in the computational linguistics community. They employ patterns that specify the kind of information to be harvested, and usually bootstrap either the pattern learning or the term harvesting process (or both) in a recursive cycle, using data learned in one step to generate more seeds for the next. They therefore treat the source text corpus as a network, in which words are the nodes and relations linking them are the edges. The results of computational network analysis, especially from the world wide web, are thus applicable. Surprisingly, these results have not yet been broadly introduced into the computational linguistics community. In this paper we show how various results apply to text mining, how they explain some previously observed phenomena, and how they can be helpful for computational linguistics applications.
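The bootstrapping cycle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the "such as ... and ..." pattern, the toy corpus, and the names `bootstrap` and `in_degree` are assumptions made for illustration only. Each harvested term is fed back as a seed, and every "seed found term" event is recorded as a directed edge, so the harvest itself becomes a small network whose node degrees can be inspected.

```python
import re
from collections import defaultdict, deque


def bootstrap(corpus, class_name, seed):
    """Harvest terms breadth-first; record seed -> harvested-term edges."""
    edges = defaultdict(set)   # term -> terms harvested when it was the seed
    seen = {seed}
    queue = deque([seed])
    while queue:
        term = queue.popleft()
        # Hypothetical pattern: "<class> such as <seed> and <new term>"
        pattern = re.compile(
            rf"\b{re.escape(class_name)} such as {re.escape(term)} and (\w+)"
        )
        for sentence in corpus:
            for found in pattern.findall(sentence):
                edges[term].add(found)
                if found not in seen:      # new terms become seeds next round
                    seen.add(found)
                    queue.append(found)
    return edges


def in_degree(edges):
    """Count how many distinct seeds harvested each term."""
    counts = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


# Toy corpus, invented for this example.
TOY_CORPUS = [
    "animals such as lions and tigers",
    "animals such as tigers and bears",
    "animals such as bears and lions",
    "animals such as lions and bears",
]

graph = bootstrap(TOY_CORPUS, "animals", "lions")
degrees = in_degree(graph)
```

On this toy input the single seed "lions" reaches every other term, and `degrees` shows which terms are found most often. Degree counts of this kind are the quantities that network analysis would examine on a real harvest.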

reference text

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. pages 85–94.
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the web. Comput. Netw., 33(1-6):309–320.
Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. pages 101–110.
Peng Chen, Huafeng Xie, Sergei Maslov, and Sid Redner. 2007. Finding scientific gems with Google’s PageRank algorithm. Journal of Informetrics, 1(1):8–15, January.
Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Rev., 51(4):661–703.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91–134, June.
Linton Freeman. 1979. Centrality in social networks: Conceptual clarification. Social Networks, 1(3):215–239.
Michael Gasser and Linda B. Smith. 1998. Learning nouns and adjectives: A connectionist account. In Language and Cognitive Processes, pages 269–306.
Dedre Gentner. 1981. Some interesting differences between nouns and verbs. Cognition and Brain Theory, pages 161–178.
Roxana Girju, Adriana Badulescu, and Dan Moldovan. 2003. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 1–8.
Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539–545.
Boris Katz, Jimmy Lin, Daniel Loreto, Wesley Hildebrandt, Matthew Bilotti, Sue Felshin, Aaron Fernandes, Gregory Marton, and Federico Mora. 2003. Integrating web-based and corpus-based techniques for question answering. In Proceedings of the twelfth text retrieval conference (TREC), pages 426–435.
David Kempe, Jon Kleinberg, and Éva Tardos. 2003. Maximizing the spread of influence through a social network. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146.
Jon Kleinberg and Steve Lawrence. 2001. The structure of the web. Science, 29:1849–1850.
Zornitsa Kozareva and Eduard Hovy. 2010a. Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, pages 1482–1491, July.
Zornitsa Kozareva and Eduard Hovy. 2010b. Not all seeds are equal: Measuring the quality of text mining seeds. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 618–626.
Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics ACL-08: HLT, pages 1048–1056.
Beth Levin and Harold Somers. 1993. English verb classes and alternations: A preliminary investigation.
Lun Li, David Alderson, Reiko Tanaka, John C. Doyle, and Walter Willinger. 2005. Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications (Extended Version). Internet Mathematics, 2(4):431–523.
Dekang Lin and Patrick Pantel. 2002. Concept discovery from text. In Proc. of the 19th international conference on Computational linguistics, pages 1–7.
Mark E. Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical Review, 69(2).
Mark Newman. 2003. Mixing patterns in networks. Physical Review E, 67.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web.
Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. pages 113–120.
Marius Pasca. 2004. Acquisition of categorized named entities for web search. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 137–145.
Marius Pasca. 2007. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, pages 683–690.
Bruno R. Preiss. 1999. Data structures and algorithms with object-oriented design patterns in C++.
Filippo Radicchi, Santo Fortunato, Benjamin Markines, and Alessandro Vespignani. 2009. Diffusion of scientific credits and the ranking of scientists. In Phys. Rev. E 80, 056103.
Deepak Ravichandran and Eduard H. Hovy. 2002. Learning surface text patterns for a question answering system. pages 41–47.
Ellen Riloff and Jessica Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Proceedings of the Empirical Methods for Natural Language Processing, pages 117–124.
Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. pages 811–816.
Peter Mark Roget. 1911. Roget’s Thesaurus of English Words and Phrases. New York: Thomas Y. Crowell Company.
Gert Sabidussi. 1966. The centrality index of a graph. Psychometrika, 31(4):581–603.
Hassan Sayyadi and Lise Getoor. 2009. Future rank: Ranking scientific articles by predicting their future PageRank. In 2009 SIAM International Conference on Data Mining (SDM09).
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49.
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. pages 1297–1304.
Stephen Soderland, Claire Cardie, and Raymond Mooney. 1999. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3), pages 233–272.
Mark Steyvers and Joshua B. Tenenbaum. 2004. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29:41–78.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 697–706.
Partha Pratim Talukdar and Fernando Pereira. 2010. Graph-based weakly-supervised methods for information extraction and integration. pages 1473–1481.
Vishnu Vyas, Patrick Pantel, and Eric Crestan. 2009. Helping editors choose better seed sets for entity set expansion. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM, pages 225–234.
Dylan Walker, Huafeng Xie, Koon-Kiu Yan, and Sergei Maslov. 2006. Ranking scientific publications using a simple model of network traffic. December.
Duncan Watts and Steven Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442.
Fabio Massimo Zanzotto, Marco Pennacchiotti, and Maria Teresa Pazienza. 2006. Discovering asymmetric entailment relations between verbs using selectional preferences. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 849–856.
Dmitry Zelenko, Chinatsu Aone, Anthony Richardella, Jaz K, Thomas Hofmann, Tomaso Poggio, and John Shawe-Taylor. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research 3.