Topic modelling to support English text selection for translation into South Africa's other official languages

Jocelyn Mazarura; Febe de Wet

doi:10.55492/dhasa.v4i01.4447

Authors

Jocelyn Mazarura, Department of Statistics, University of Pretoria
Febe de Wet, Department of Electrical and Electronic Engineering, Stellenbosch University & School of Electrical, Electronic and Computer Engineering, North-West University

DOI:

https://doi.org/10.55492/dhasa.v4i01.4447

Keywords:

DHASA, topic modelling, text selection, translation, under-resourced languages

Abstract

Appropriate training data is a prerequisite for the development of natural language processing (NLP) techniques. Vast amounts of language data are typically required to develop NLP tools that perform at state-of-the-art level. Such abundant resources are currently only available in a few languages. The remaining languages have to find alternative ways to become "NLP-enabled". The aim of the study reported on here is to make more language data available to support NLP development in the official languages of South Africa. In this paper we present the idea of generating text data by means of translation. We also propose the use of topic modelling to identify text in a highly resourced source language that will yield meaningful translations in under-resourced target languages. More specifically, the paper describes how topic modelling was used to identify English Wikipedia articles that should be suitable for translation into South Africa's 10 other official languages.

Published

2023-01-25

Issue

Vol. 4 No. 01 (2022): Proceedings of the 3rd workshop on Resources for African Indigenous Languages (RAIL)

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

How to Cite

Topic modelling to support English text selection for translation into South Africa’s other official languages. (2023). Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01). https://doi.org/10.55492/dhasa.v4i01.4447
