
117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms


Source: pdf

Author: Zhili Wu ; Katja Markert ; Serge Sharoff

Abstract: Prior work applying machine learning to genre classification has treated the classification categories as a flat list of labels. However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. In this paper we present a method that uses the hierarchy of labels to improve classification accuracy. As a testbed for this approach we use the Brown Corpus as well as a range of other corpora, including the BNC, HGC and Syracuse. The results are not encouraging: apart from the Brown Corpus, the improvements of our structural classifier over the flat one are not statistically significant. We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning.


reference text

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA. ACM.

Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292.

Cristianini, N., Shawe-Taylor, J., and Kandola, J. (2002). On kernel target alignment. In Proceedings of the Neural Information Processing Systems, NIPS’01, pages 367–373. MIT Press.

Crowston, K., Kwasnik, B., and Rubleske, J. (2009). Problems in the use-centered development of a taxonomy of web genres. In Mehler, A., Sharoff, S., and Santini, M., editors, Genres on the Web: Computational Models and Empirical Studies. Springer, Berlin/New York.

Dekel, O., Keshet, J., and Singer, Y. (2004). Large margin hierarchical classification. In ICML ’04: Proceedings of the twenty-first international conference on Machine learning, page 27, New York, NY, USA. ACM.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1923.

Giesbrecht, E. and Evert, S. (2009). Part-of-Speech (POS) Tagging - a solved task? An evaluation of POS taggers for the Web as corpus. In Proceedings of the Fifth Web as Corpus Workshop (WAC5), pages 27–35, Donostia-San Sebastián.

Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. CoRR, cmp-lg/9709008.

Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods – Support Vector Learning, pages 41–56. MIT Press.

Joachims, T., Finley, T., and Yu, C.-N. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59.

Kanaris, I. and Stamatatos, E. (2009). Learning to recognize webpage genres. Information Processing and Management, 45:499–512.

Karlgren, J. and Cutting, D. (1994). Recognizing text genres with simple metrics using discriminant analysis. In Proc. of the 15th. International Conference on Computational Linguistics (COLING 94), pages 1071–1075, Kyoto, Japan.

Keerthi, S. S., Sundararajan, S., Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. (2008). A sequential dual method for large scale multiclass linear SVMs. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 408–416, New York, NY, USA. ACM.

Kessler, B., Nunberg, G., and Schütze, H. (1997). Automatic detection of text genre. In Proceedings of the 35th ACL/8th EACL, pages 32–38.

Kučera, H. and Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press, Providence.

Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification, pages 305–332. In C. Fellbaum (Ed.), MIT Press.

Lee, D. (2001). Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, 5(3):37–72.

Lin, D. (1998). An information-theoretic definition of similarity. In ICML ’98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 296–304, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Meyer zu Eissen, S. and Stein, B. (2004). Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence, Ulm, Germany.

Pedersen, T., Pakhomov, S. V. S., Patwardhan, S., and Chute, C. G. (2007). Measures of semantic similarity and relatedness in the biomedical domain. J. of Biomedical Informatics, 40(3):288–299.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In IJCAI’95: Proceedings of the 14th international joint conference on Artificial intelligence, pages 448–453, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Santini, M. (2007). Automatic Identification of Genre in Web Pages. PhD thesis, University of Brighton.

Shao, K.-T. and Sokal, R. R. (1990). Tree balance. Systematic Zoology, 39(3):266–276.

Sharoff, S., Wu, Z., and Markert, K. (2010). The Web library of Babel: evaluating genre collections. In Proc. of the Seventh Language Resources and Evaluation Conference, LREC 2010, Malta.

Stubbe, A. and Ringlstetter, C. (2007). Recognizing genres. In Santini, M. and Sharoff, S., editors, Proc. Towards a Reference Corpus of Web Genres.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484.

Vidulin, V., Luštrek, M., and Gams, M. (2007). Using genres to improve search engines. In Proc. Towards Genre-Enabled Search Engines: The Impact of NLP. RANLP-07.

Webber, B. (2009). Genre distinctions for discourse in the Penn TreeBank. In Proc. of the 47th Annual Meeting of the ACL, pages 674–682.

Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138, Morristown, NJ, USA. Association for Computational Linguistics.