125 emnlp-2012-Towards Efficient Named-Entity Rule Induction for Customizability

Source: pdf

Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya

Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process, and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.

reference text

S. Abiteboul, R. Hull, and V. Vianu. 1995. Foundations of Databases. Addison Wesley Publishing Co.
Douglas E. Appelt and Boyan Onyshkevych. 1998. The common pattern specification language. In TIPSTER workshop.
Mary Elaine Califf and Raymond J. Mooney. 1997. Applying ILP-based techniques to natural language information extraction: An experiment in relational learning. In IJCAI Workshop on Frontiers of Inductive Logic Programming.
Mary Elaine Califf and Raymond J. Mooney. 1999. Relational learning of pattern-match rules for information extraction. In AAAI.
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick R. Reiss, and Shivakumar Vaithyanathan. 2010a. SystemT: an algebraic approach to declarative information extraction. In ACL.
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan. 2010b. Domain adaptation of rule-based annotators for named-entity recognition tasks. In EMNLP.
Fabio Ciravegna. 2001. (LP)2, an adaptive algorithm for information extraction from web-related texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).
J. Fürnkranz and G. Widmer. 1994. Incremental reduced error pruning. pages 70–77.
Johannes Fürnkranz. 1999. Separate-and-conquer rule learning. Artif. Intell. Rev., 13(1):3–54, February.
B. R. Gaines and P. Compton. 1995. Induction of ripple-down rules applied to modeling large databases. J. Intell. Inf. Syst., 5:211–228, November.
IBM. 2012. IBM InfoSphere BigInsights - Annotation Query Language (AQL) reference. http://publib.boulder.ibm.com/infocenter/bigins/v1r3/topic/com.ibm.swg.im.infosphere.biginsights.doc/doc/biginsights_aqlref_con_aql-overview.html.
Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular expression learning for information extraction. In EMNLP.
Bin Liu, Laura Chiticariu, Vivian Chu, H. V. Jagadish, and Frederick R. Reiss. 2010. Automatic rule refinement for information extraction. Proc. VLDB Endow., 3:588–597.
Diana Maynard, Kalina Bontcheva, and Hamish Cunningham. 2003. Towards a semantic extraction of named entities. In Recent Advances in Natural Language Processing.
Stephen Muggleton and C. Feng. 1992. Efficient induction in logic programs. In ILP.
D. Nadeau and S. Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30:3–26.
Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. 1997. Foundations of Inductive Logic Programming.
Anup Patel, Ganesh Ramakrishnan, and Pushpak Bhattacharyya. 2009. Incorporating linguistic expertise using ILP for named entity recognition in data hungry Indian languages. In ILP.
Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. 2008. An algebraic approach to rule-based information extraction. In ICDE.
Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In AAAI.
Stephen Soderland. 1999. Learning information extraction rules for semi-structured and free text. Mach. Learn., 34:233–272.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In HLT-NAACL.
Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam, 3rd edition.