
12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents

Source: pdf

Author: Yufan Guo ; Anna Korhonen ; Thierry Poibeau

Abstract: Argumentative Zoning (AZ), the analysis of the argumentative structure of a scientific paper, has proved useful for a number of information access tasks. Current approaches to AZ rely on supervised machine learning (ML). Requiring large amounts of annotated data, these approaches are expensive to develop and port to different domains and tasks. A potential solution to this problem is to use weakly-supervised ML instead. We investigate the performance of four weakly-supervised classifiers on scientific abstract data annotated for multiple AZ classes. Our best classifier, based on the combination of active learning and self-training, outperforms our best supervised classifier, yielding a high accuracy of 81% when using just 10% of the labeled data. This result suggests that weakly-supervised learning could be employed to improve the practical applicability and portability of AZ across different information access tasks.

reference text

Steven Abney. 2008. Semi-supervised learning for computational linguistics. Chapman & Hall / CRC.
Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817–1853.
Klaus Brinker. 2006. On active learning in multi-label classification. In From Data and Information Analysis to Knowledge Engineering, pages 206–213.
Olivier Chapelle and Alexander Zien. 2005. Semi-supervised classification by low density separation.
J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. 2006. Trading convexity for scalability. In Proceedings of the 23rd international conference on Machine learning.
J. R. Curran, S. Clark, and J. Bos. 2007. Linguistically motivated large-scale nlp with c&c and boxer. In Proceedings of the ACL 2007 Demonstrations Session.
Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput., 10:1895–1923.
Andrea Esuli and Fabrizio Sebastiani. 2009. Active learning strategies for multi-label text classification. In Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval.
Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins Karolinska, Lin Sun, and Ulla Stenius. 2010. Identifying the information structure of scientific abstracts: an investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing.
Ben Hachey and Claire Grover. 2006. Extractive summarisation of legal texts. Artif. Intell. Law, 14:305–345.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The weka data mining software: an update. SIGKDD Explor. Newsl., 11:10–18.
T. Hastie and R. Tibshirani. 1998. Classification by pairwise coupling. Advances in Neural Information Processing Systems, 10.
Hugo Hernault, Danushka Bollegala, and Mitsuru Ishizuka. 2011. Semi-supervised discourse relation classification with structural learning. In CICLing (1).
K. Hirohata, N. Okazaki, S. Ananiadou, and M. Ishizuka. 2008. Identifying sections in scientific abstracts using conditional random fields. In Proceedings of 3rd International Joint Conference on Natural Language Processing.
Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. 2006. Large-scale text categorization by batch mode active learning. In Proceedings of the 15th international conference on World Wide Web.
F. Jiao, S. Wang, C. Lee, R. Greiner, and D. Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL.
Carsten Lanquillon. 2000. Learning from labeled and unlabeled documents: A comparative study on semi-supervised text classification. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery.
David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval.
M. Liakata, S. Teufel, A. Siddharthan, and C. Batchelor. 2010. Corpora for the conceptualisation and zoning of scientific papers. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10).
J. Lin, D. Karakos, D. Demner-Fushman, and S. Khudanpur. 2006. Generative content models for structural analysis of medical abstracts. In Proceedings of BioNLP-06.
G. S. Mann and A. McCallum. 2007. Efficient computation of entropy gradient for semi-supervised conditional random fields. In HLT-NAACL.
Andrew McCallum and Kamal Nigam. 1998. Employing em and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning.
A. K. McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153–157.
S. Merity, T. Murphy, and J. R. Curran. 2009. Accurate argumentative zoning with maximum entropy models. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries.
Y. Mizuta, A. Korhonen, T. Mullen, and N. Collier. 2006. Zone analysis in biology articles as a basis for information extraction. International Journal of Medical Informatics on Natural Language Processing in Biomedicine and Its Applications, 75(6):468–487.
T. Mullen, Y. Mizuta, and N. Collier. 2005. A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. Natural language processing and text mining, 7(1):52–58.
Ion Muslea, Steven Minton, and Craig A. Knoblock. 2002. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the Nineteenth International Conference on Machine Learning.
Jorge Nocedal. 1980. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation, 35(151):773–782.
Blaž Novak, Dunja Mladenić, and Marko Grobelnik. 2006. Text classification with active learning. In From Data and Information Analysis to Knowledge Engineering, pages 398–405.
J. C. Platt. 1999a. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, pages 61–74.
John C. Platt. 1999b. Using analytic qp and sparseness to speed training of support vector machines. In Proceedings of the 1998 conference on Advances in neural information processing systems II.
Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. 2010. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing.
P. Ruch, C. Boyer, C. Chichester, I. Tbahriti, A. Geissbuhler, P. Fabry, J. Gobeill, V. Pillet, D. Rebholz-Schuhmann, C. Lovis, and A. L. Veuthey. 2007. Using argumentation to extract key sentences from biomedical abstracts. Int J Med Inform, 76(2-3):195–200.
Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden markov models for information extraction. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis.
H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory.
H. Shatkay, F. Pan, A. Rzhetsky, and W. J. Wilbur. 2008. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24(18):2086–2093.
F. Sinz. 2011. UniverSVM Support Vector Machine with Large Scale CCCP Functionality. http://www.kyb.mpg.de/bs/people/fabee/universvm.html.
L. Sun and A. Korhonen. 2009. Improving verb clustering with automatically acquired selectional preference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
I. Tbahriti, C. Chichester, Frederique Lisacek, and P. Ruch. 2006. Using argumentation to retrieve articles with similar citations. Int J Med Inform, 75(6):488–495.
S. Teufel and M. Moens. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28:409–445.
S. Teufel, A. Siddharthan, and C. Batchelor. 2009. Towards domain-independent argumentative zoning: Evidence from chemistry and computational linguistics. In Proceedings of EMNLP.
S. Tong and D. Koller. 2001. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66.
Simon Tong and Daphne Koller. 2002. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66.
V. N. Vapnik. 1998. Statistical learning theory. Wiley, New York.
Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. 2009. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.