Building Corpora for Low-Resource Kenyan Languages

DOI:

https://doi.org/10.55492/v6i02.6747

Keywords:

Natural language processing, Low-resource languages, African languages, Corpus building, Crowd sourcing language data

Abstract

Natural Language Processing is a crucial frontier in artificial intelligence, with broad application across public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This article presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording and transcribing conversations and translating the resulting text into Kiswahili, creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible on open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thereby facilitating ongoing contributions and developer access to train models and develop Natural Language Processing applications.

Published

2025-12-31

Section

Articles

How to Cite

Building Corpora for Low-Resource Kenyan Languages. (2025). Journal of the Digital Humanities Association of Southern Africa (DHASA), 6(2). https://doi.org/10.55492/v6i02.6747