Building Corpora for Low-Resource Kenyan Languages

Audrey Mbogho; Quin Awuor; Andrew Kipkebut; Lilian Wanzare; Vivian Oloo

doi:10.55492/v6i02.6747

Authors

Audrey Mbogho, United States International University Africa
Quin Awuor, United States International University Africa
Andrew Kipkebut, Kabarak University
Lilian Wanzare, Maseno University
Vivian Oloo, Maseno University

Keywords:

Natural language processing, Low-resource languages, African languages, Corpus building, Crowd sourcing language data

Abstract

Natural Language Processing is a crucial frontier in artificial intelligence, with broad application across public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This article presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording and transcribing conversations and translating the resulting text into Kiswahili, creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible on open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thereby facilitating ongoing contributions and developer access to train models and develop Natural Language Processing applications.
