
26 emnlp-2013-Assembling the Kazakh Language Corpus



Author: Olzhas Makhambetov ; Aibek Makazhanov ; Zhandos Yessenbayev ; Bakhyt Matkarimov ; Islam Sabyrgaliyev ; Anuar Sharafudinov

Abstract: This paper presents the Kazakh Language Corpus (KLC), one of the first attempts within a local research community to assemble a Kazakh corpus. KLC is designed as a large-scale corpus containing over 135 million words and covering five stylistic genres: literary, publicistic, official, scientific, and informal. Alongside its primary part, KLC comprises (i) an annotated sub-corpus of segmented documents encoded in the eXtensible Markup Language (XML), marking the complete morphological, syntactic, and structural characteristics of the texts; and (ii) a sub-corpus of annotated speech data. KLC has a web-based corpus management system that helps users navigate the data and retrieve the information they need. KLC is also open to contributors who are willing to make suggestions, donate texts, and help annotate existing materials.

