emnlp emnlp2013 emnlp2013-26 emnlp2013-26-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olzhas Makhambetov ; Aibek Makazhanov ; Zhandos Yessenbayev ; Bakhyt Matkarimov ; Islam Sabyrgaliyev ; Anuar Sharafudinov
Abstract: This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part KLC comprises such parts as: (i) annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; (ii) as well as a sub-corpus with the annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open for contributors, who are willing to make suggestions, donate texts and help with annotation of existing materials.