
40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers


Author: Vipul Mittal

Abstract: In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string encoded either as Unicode or as a Roman transliteration, and the output is a set of possible splits, each with an associated weight. We followed two different approaches to segment a Sanskrit text using sandhi rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.
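The second approach (generate candidate segmentations, then validate each constituent) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the two reverse sandhi rules, the four-word lexicon standing in for a morph analyzer, the SLP1-style transliteration (with "A" for long a), and the names `REVERSE_SANDHI`, `LEXICON`, and `segment` are all assumptions made for illustration. Weights are left out because the abstract does not say how they are computed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical reverse sandhi rules: surface form -> possible (left ending, right beginning).
# These two toy rules are illustrative only; the paper extracts its rules from a corpus.
REVERSE_SANDHI: Dict[str, List[Tuple[str, str]]] = {
    "A": [("a", "a")],  # a + a -> A (long a)
    "e": [("a", "i")],  # a + i -> e
}

# Toy lexicon standing in for a real morphological analyzer (assumption).
LEXICON = {"na", "asti", "ca", "iha"}


def segment(
    text: str,
    is_valid: Callable[[str], bool],
    rules: Dict[str, List[Tuple[str, str]]],
) -> List[List[str]]:
    """Return every split of `text` whose constituents all pass `is_valid`."""
    results: List[List[str]] = []

    # Case 1: the whole remaining string is itself a valid word.
    if is_valid(text):
        results.append([text])

    for i in range(1, len(text)):
        # Case 2: plain split at position i, with no sandhi change.
        left = text[:i]
        if is_valid(left):
            for tail in segment(text[i:], is_valid, rules):
                results.append([left] + tail)

        # Case 3: undo a sandhi rule whose surface form starts at position i.
        for surface, pairs in rules.items():
            if not text.startswith(surface, i):
                continue
            for left_end, right_start in pairs:
                left = text[:i] + left_end
                rest = right_start + text[i + len(surface):]
                if is_valid(left):
                    for tail in segment(rest, is_valid, rules):
                        results.append([left] + tail)

    return results


def in_lexicon(word: str) -> bool:
    """Validation step: here a set lookup, in practice a morph analyzer call."""
    return word in LEXICON


if __name__ == "__main__":
    print(segment("nAsti", in_lexicon, REVERSE_SANDHI))
    print(segment("ceha", in_lexicon, REVERSE_SANDHI))
```

Starting the loop at position 1 means every split consumes at least one character of the input, so the recursion terminates for these toy rules. A real system would also need to rank the surviving splits, which is what the weights mentioned in the abstract are for.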

reference text

Akshar Bharati, Amba P. Kulkarni, and V Sheeba. 2006. Building a wide coverage Sanskrit morphological analyzer: A practical approach. The First National Symposium on Modelling and Shallow Parsing of Indian Languages, IIT-Bombay.

Alan Prince and Paul Smolensky. 1993. Optimality Theory: Constraint Interaction in Generative Grammar. RuCCS Technical Report 2 at Center for Cognitive Science, Rutgers University, Piscataway.

Amba Kulkarni and Devanand Shukla. 2009. Sanskrit Morphological analyzer: Some Issues. To appear in Bh.K Festschrift volume by LSI.

Choochart Haruechaiyasak, Sarawoot Kongyoung, and Matthew N. Dailey. 2008. A Comparative Study on Thai Word Segmentation Approaches. ECTI-CON, Krabi.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A General and Efficient Weighted Finite-State Transducer Library. CIAA’07, Prague, Czech Republic.

Deniz Yuret and Ergun Biçici. 2009. Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies. ACL-IJCNLP’09, Singapore.

DINH Q. Thang, LE H. Phuong, NGUYEN T. M. Huyen, NGUYEN C. Tu, Mathias Rossignol, and VU X. Luong. 2008. Word Segmentation of Vietnamese Texts: a Comparison of Approaches. LREC’08, Marrakech, Morocco.

Gérard Huet. 2009. Formal structure of Sanskrit text: Requirements analysis for a mechanical Sanskrit processor. Sanskrit Computational Linguistics 1 & 2, pages 266-277, Springer-Verlag LNAI 5402.

John C. J. Hoeks and Petra Hendriks. 2005. Optimality Theory and Human Sentence Processing: The Case of Coordination. Proceedings of the 27th Annual Meeting of the Cognitive Science Society, Erlbaum, Mahwah, NJ, pp. 959–964.

Kenneth R. Beesley. 1998. Arabic morphology using only finite-state operations. Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Montréal, Québec.

Leonardo Badino. 2004. Chinese Text Word-Segmentation Considering Semantic Links among Sentences. INTERSPEECH 2004 - ICSLP, Jeju, Korea.