The Analysis of the Sepedi-English Code-switched Radio News Corpus

Simon Ramalepe; Thipe I. Modipa; Marelie H. Davel

doi:10.55492/dhasa.v4i01.4444

Authors

Simon Ramalepe, Computer Science Department, University of Limpopo
Thipe I. Modipa, Computer Science Department, University of Limpopo & Centre for Artificial Intelligence Research (CAIR)
Marelie H. Davel, Faculty of Engineering, North-West University & Centre for Artificial Intelligence Research (CAIR)

Keywords:

Code-switching, text generation, radio news, transformers, Sepedi

Abstract

Code-switching is a phenomenon that occurs mostly in multilingual countries, where multilingual speakers often switch between languages within a conversation. The lack of large-scale code-switched corpora hampers the development and training of language models for generating code-switched text. In this study, we explore the initial phase of collecting and creating a Sepedi-English code-switched corpus for generating synthetic news. We considered radio news and analysed the frequency of code-switching in read news. We developed and trained a Transformer-based language model on the collected code-switched dataset. The frequency of code-switched data in the dataset was very low, at 1.1%. We therefore complemented our dataset with a news headlines dataset to create a new dataset. Although the frequency of code-switching remained low, the model obtained an optimal loss of 2.361 with an accuracy of 66%.
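
The abstract reports a code-switching frequency without stating how it is computed. As a rough illustration only, the sketch below shows one plausible sentence-level measure: the percentage of sentences that contain both English and non-English tokens. The function names, the `english_vocab` word list, and the sentence-level definition are all assumptions made for this example, not the authors' method.

```python
from typing import Iterable, Set


def is_code_switched(sentence: str, english_vocab: Set[str]) -> bool:
    """Return True if the sentence mixes English-listed tokens with other tokens.

    Assumption: any token not in `english_vocab` is treated as non-English
    (for example Sepedi). A real pipeline would need proper language
    identification rather than a word list.
    """
    tokens = [t.strip(".,;:!?\"'()").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    has_english = any(t in english_vocab for t in tokens)
    has_other = any(t not in english_vocab for t in tokens)
    return has_english and has_other


def code_switch_frequency(sentences: Iterable[str], english_vocab: Set[str]) -> float:
    """Percentage of sentences flagged as code-switched (0.0 for empty input)."""
    sentences = list(sentences)
    if not sentences:
        return 0.0
    switched = sum(is_code_switched(s, english_vocab) for s in sentences)
    return 100.0 * switched / len(sentences)
```

Calling `code_switch_frequency(sentences, english_vocab)` on a list of transcribed news sentences returns a percentage comparable in form to the figure quoted above. A token-level measure (English tokens divided by all tokens) would be an equally reasonable alternative definition.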
