Building Corpora for Low-Resource Kenyan Languages
DOI:
https://doi.org/10.55492/v6i02.6747
Keywords:
Natural language processing, Low-resource languages, African languages, Corpus building, Crowd sourcing language data
Abstract
Natural Language Processing is a crucial frontier in artificial intelligence, with broad application across public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This article presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording and transcribing conversations and translating the resulting text into Kiswahili, creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible on open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thereby facilitating ongoing contributions and developer access to train models and develop Natural Language Processing applications.
License
Copyright (c) 2025 Audrey Mbogho, Quin Awuor, Andrew Kipkebut, Lilian Wanzare, Vivian Oloo

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.