jmlr jmlr2013 jmlr2013-91 jmlr2013-91-reference knowledge-graph by maker-knowledge-mining

91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

Source: pdf

Author: Joachim Niehren, Jérôme Champavère, Aurélien Lemay, Rémi Gilleron

Abstract: Inference algorithms for tree automata that deﬁne node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schemaguided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justiﬁed by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction. Keywords: XML information extraction, XML schemas, interactive learning, tree automata, grammatical inference

reference text

Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987. Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with lixto. In 28th International Conference on Very Large Data Bases (VLDB), pages 119–128, 2001. 961 N IEHREN , C HAMPAVÈRE , G ILLERON AND L EMAY Geert J. Bex, Frank Neven, Thomas Schwentick, and Karl Tuyls. Inference of concise DTDs from XML data. In 32nd International Conference on Very Large Data Bases (VLDB), pages 115–126, 2006. Julien Carme, Joachim Niehren, and Marc Tommasi. Querying unranked trees with stepwise tree automata. In 19th International Conference on Rewriting Techniques and Applications (RTA), pages 105–118, 2004. Julien Carme, Michal Ceresna, Oliver Frölich, Georg Gottlob, Tamir Hassan, Marcus Herzog, Wolfgang Holzinger, and Bernhard Krüpl. The Lixto project: exploring new frontiers of web data extraction. In 23rd International Information Systems Conference (BNCOD), pages 1–15, 2006. Julien Carme, Michal Ceresna, and Max Goebel. Query-Based learning of XPath expressions. In 8th International Colloquium on Grammatical Inference (ICGI), pages 342–343, 2006. Julien Carme, Rémi Gilleron, Aurélien Lemay, and Joachim Niehren. Interactive learning of node selecting tree transducers. Machine Learning, 66(1):33–67, 2007. Jérôme Champavère. Induction de requêtes guidée par schémas. PhD thesis, Université des Sciences et Technologies de Lille 1, 2010. Jérôme Champavère, Rémi Gilleron, Aurélien Lemay, and Joachim Niehren. Schema-Guided Induction of Monadic Queries. In 9th International Colloquium on Grammatical Inference (ICGI), pages 15–28, 2008. Jérôme Champavère, Rémi Gilleron, Aurélien Lemay, and Joachim Niehren. Efﬁcient inclusion checking for deterministic tree automata and XML schemas. Information and Computation, 207 (11):1181–1208, 2009. William W. Cohen, Matthew Hurst, and Lee S. Jensen. A ﬂexible learning system for wrapping tables and lists in HTML documents. In 11th International Conference on World Wide Web (WWW), pages 232–241, 2002. Hubert Comon, Max Dauchet, Remi Gilleron, Florent Jacquemard, Denis Lugiez, Christof Löding, Sophie Tison, and Marc Tommasi. Tree automata techniques and applications. Electronic book available at http://tata.gforge.inria.fr/. Revised 2007. François Coste, Daniel Fredouille, Christopher Kermorvant, and Colin de la Higuera. Introducing domain and typing bias in automata inference. In 7th International Colloquium on Grammatical Inference (ICGI), pages 115–126, 2004. Massimo Franceschet. XPathMark: An XPath benchmark for the XMark generated data. In 3rd International Conference on Database and XML (XSym), pages 129–143, 2005. Dayne Freitag and Andrew K. McCallum. Information extraction with HMMs and shrinkage. In AAAI Workshop on Machine Learning for Information Extraction, pages 31–36, 1999. Pedro García and Jose Oncina. Inference of recognizable tree sets. Technical Report DSIC-II/47/93, Universidad de Alicante, 1993. 962 Q UERY I NDUCTION WITH S CHEMA -G UIDED P RUNING S TRATEGIES Rémi Gilleron, Florent Jousse, Isabelle Tellier, and Marc Tommasi. XML document transformation with conditional random ﬁelds. In 5th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), pages 525–539, 2006. Rémi Gilleron, Patrick Marty, Marc Tommasi, and Fabien Torre. Interactive tuples extraction from semi-structured data. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 997–1004, 2006. Georg Gottlob and Christoph Koch. Monadic queries over tree-structured data. In 17th Annual IEEE Symposium on Logic in Computer Science (LICS), pages 189–202, 2002. Georg Gottlob, Erich Grädel, and Helmut Veith. Datalog LITE: a deductive query language with linear time model Checking. ACM Transactions on Computational Logics, 3(1):42–79, 2002. Nicholas Kushmerick. Wrapper induction: efﬁciency and expressiveness. Artiﬁcial Intelligence, 118(1-2):15–68, 2000. Aurélien Lemay, Joachim Niehren, and Rémi Gilleron. Learning n-ary node selecting tree transducers from completely annotated examples. In 8th International Colloquium on Grammatical Inference (ICGI), pages 253–267, 2006. Aurélien Lemay, Sebastian Maneth, and Joachim Niehren. A learning algorithm for top-down XML transformations. In 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 285–296, 2010. Wolfgang May. Information extraction and integration with F LORID: the Technical Report 131, Universität Freiburg, Institut für Informatik, 1999. MONDIAL case study. Michel Minoux. LTUR: a simpliﬁed linear-time unit resolution algorithm for Horn formulae and computer implementation. Information Processing Letters, 29(1):1–12, 1988. Ion Muslea, Steve Minton, and Craig Knoblock. Active learning with strong and weak views: a case study on wrapper induction. In 18th International Joint Conferences on Artiﬁcial Intelligence (IJCAI), pages 415–420, 2003. Jose Oncina and Miguel A. Varó. Using domain information during the learning of a subsequential transducer. In 3rd International Colloquium on Grammatical Inference (ICGI), pages 301–312, 1996. David Pinto, Andrew McCallum, Xing Lee, and W .Bruce Croft. Table extraction using conditional random ﬁelds. In 26th ACM SIGIR, pages 235–242, 2003. Stefan Raeymaekers. Information extraction from Web pages based on tree automata induction. PhD thesis, Katholieke Universiteit Leuven, 2008. Stefan Raeymaekers, Maurice Bruynooghe, and Jan Van den Bussche. Learning (k, l)-contextual tree languages for information extraction from web pages. Machine Learning, 71(2-3):155–183, 2008. 963 N IEHREN , C HAMPAVÈRE , G ILLERON AND L EMAY Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse. XMark: A benchmark for XML data management. In 28th International Conference on Very Large Data Bases (VLDB), pages 974–985, 2002. Andrew J. Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Taking the OXPath down the deep web. In 14th International Conference on Extending Database Technology (EDBT), pages 542–545, 2011. Andrew J. Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. OXPath: little language, little memory, great value. In WWW (Companion Volume), pages 261–264, 2011. Slawomir Staworko and Piotr Wieczorek. Learning twig and path queries. In 15th International Conference on Database Theory (ICDT), 2012. Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2D conditional random ﬁelds for web information extraction. In 22nd International Conference on Machine Learning (ICML), pages 1044–1051, 2005. 964