
78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

Source: pdf

Author: Burak Kerim Akkuş ; Ruket Cakici

Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on the text categorization task in Turkish. We use stems and word categories extracted with morphological analysis as the main features and compare them with fixed-length stemmers in a bag-of-words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.
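The fixed-length stemming baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a fixed-length stemmer simply keeps the first few characters of each word. The prefix length of 5, the whitespace tokenizer, the function names, and the sample words are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def fixed_length_stem(word: str, length: int = 5) -> str:
    """Approximate a stem by keeping only the first `length` characters."""
    return word[:length]


def bag_of_words(text: str, length: int = 5) -> Counter:
    """Count truncated, lower-cased tokens of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(fixed_length_stem(tok, length) for tok in tokens)


# Hypothetical input: four inflected forms of the Turkish word "kitap" (book).
doc = "kitap kitaplar kitaplarda kitabı"
features = bag_of_words(doc, length=5)
print(features)
```

In this sketch three of the four forms collapse into the feature "kitap", while "kitabı" is truncated to "kitab" because of the p/b consonant alternation. A morphological analyzer could map all four forms to the same stem, which is the kind of difference the study sets out to measure.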

reference text

Charu C. Aggarwal and Philip S. Yu. 2000. Finding generalized projected clusters in high dimensional spaces. SIGMOD Rec., 29(2):70–81.
M. Fatih Amasyalı and Banu Diri. 2006. Automatic Turkish text categorization in terms of author, genre and gender. In Proceedings of the 11th international conference on Applications of Natural Language to Information Systems, NLDB’06, pages 221–226, Berlin, Heidelberg. Springer-Verlag.
Florian Beil, Martin Ester, and Xiaowei Xu. 2002. Frequent term-based text clustering. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, pages 436–442, New York, NY, USA. ACM.
Fazlı Can, Seyit Koçberber, Erman Balçık, Cihan Kaynak, H. Çağdaş Öcalan, and Onur M. Vursavaş. 2008. Information retrieval on Turkish texts. JASIST, 59(3):407–421.
Ruket Çakıcı and Jason Baldridge. 2006. Projective and non-projective Turkish parsing. In Proceedings of the 5th International Treebanks and Linguistic Theories Conference, pages 43–54.
Özlem Çetinoğlu and Kemal Oflazer. 2006. Morphology-syntax interface for Turkish LFG. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 153–160, Stroudsburg, PA, USA. Association for Computational Linguistics.
Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1923.
Gülşen Eryiğit. 2012. The impact of automatic morphological analysis & disambiguation on dependency parsing of Turkish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 23-25 May.
Gülşen Eryiğit, Joakim Nivre, and Kemal Oflazer. 2008. Dependency parsing of Turkish. Comput. Linguist., 34(3):357–389, September.
George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res., 3:1289–1305, March.
Dilek Z. Hakkani-Tür, Kemal Oflazer, and Gökhan Tür. 2000. Statistical morphological disambiguation for agglutinative languages. In Proceedings of the 18th conference on Computational linguistics - Volume 1, COLING ’00, pages 285–291, Stroudsburg, PA, USA. Association for Computational Linguistics.
Zelig Harris. 1970. Distributional structure. In Papers in Structural and Transformational Linguistics, pages 775–794. D. Reidel Publishing Company, Dordrecht, Holland.
M. Ikonomakis, S. Kotsiantis, and V. Tampakas. 2005. Text classification: a recent overview. In Proceedings of the 9th WSEAS International Conference on Computers, ICCOMP’05, pages 1–6, Stevens Point, Wisconsin, USA. World Scientific and Engineering Academy and Society (WSEAS).
Thorsten Joachims. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 143–151, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
D. E. Johnson, F. J. Oles, T. Zhang, and T. Goetz. 2002. A decision-tree-based symbolic rule induction system for text categorization. IBM Syst. J., 41(3):428–437, July.
Heui-Seok Lim. 2004. Improving kNN based text classification with well estimated parameters. In Nikhil R. Pal, Nikola Kasabov, Rajani K. Mudi, Srimanta Pal, and Swapan K. Parui, editors, Neural Information Processing, 11th International Conference, ICONIP 2004, Calcutta, India, November 22-25, 2004, Proceedings, volume 3316 of Lecture Notes in Computer Science, pages 516–523. Springer.
Tao Liu, Shengping Liu, and Zheng Chen. 2003. An evaluation on feature selection for text clustering. In ICML, pages 488–495.
Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. In Proceedings of the ACL02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1, ETMTNLP ’02, pages 63–70, Stroudsburg, PA, USA. Association for Computational Linguistics.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive Bayes text classification. In Proceedings of the Workshop on learning for text categorization, AAAI’98, pages 41–48.
Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153–157.
Hwee Tou Ng, Wei Boon Goh, and Kok Leong Low. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’97, pages 67–73, New York, NY, USA. ACM.
Kemal Oflazer. 1993. Two-level description of Turkish morphology. In Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, EACL ’93, pages 472–472, Stroudsburg, PA, USA. Association for Computational Linguistics.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2007. Morphological disambiguation of Turkish text with perceptron algorithm. In Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing ’07, pages 107–118, Berlin, Heidelberg. Springer-Verlag.
Karl-Michael Schneider. 2005. Techniques for improving the performance of naive Bayes for text classification. In Proceedings of CICLing 2005, pages 682–693.
Sam Scott and Stan Matwin. 1998. Text classification using WordNet hypernyms. In Workshop: Usage of WordNet in Natural Language Processing Systems, ACL’98, pages 45–52.
James G. Shanahan and Norbert Roma. 2003. Boosting support vector machines for text classification through parameter-free threshold relaxation. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 247–254, New York, NY, USA. ACM.
Karen Sparck Jones. 1988. A statistical interpretation of term specificity and its application in retrieval. In Peter Willett, editor, Document retrieval systems, pages 132–142. Taylor Graham Publishing, London, UK.
Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 42–49, New York, NY, USA. ACM.
Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 412–420, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.