Generation of segmented isiZulu text

Authors

  • Sthembiso Mkhwanazi
  • Laurette Marais

DOI:

https://doi.org/10.55492/dhasa.v5i1.5034

Keywords:

Nguni languages, agglutinative languages, morphological segmentation, language models

Abstract

The complex morphology, conjunctive orthography and widespread occurrence of morphophonological alternation in the Nguni languages have given rise to several efforts towards morphological segmentation of tokens of Nguni languages. For supervised methods, annotated data is required, which currently exists as canonically segmented data in the NCHLT corpus and surface segmented data in the Ukwabelana corpus. In this paper, we present a method and segmentation strategy based on a computational grammar for isiZulu. The grammar, which itself has some limitations in processing speed and robustness to unexpected input, is used to create a new set of segmentations for the tokens of the Ukwabelana corpus.

By training various models with the same architecture but on different datasets, we first show that our approach enables us to match the performance of a model trained on pre-existing data. We also show that our approach provides the flexibility to determine a suitable segmentation strategy and to generate data that reflects this strategy.

Published

2024-02-19

How to Cite

Generation of segmented isiZulu text. (2024). Journal of the Digital Humanities Association of Southern Africa, 5(1). https://doi.org/10.55492/dhasa.v5i1.5034