Canonical Segmentation and Syntactic Morpheme Tagging of Four Resource-scarce Nguni Languages

Authors

  • Martin Puttkammer, Centre for Text Technology (CTexT®), North-West University, South Africa
  • Jakobus S. Du Toit, Centre for Text Technology (CTexT®), North-West University, South Africa

DOI:

https://doi.org/10.55492/dhasa.v3i03.3818

Keywords:

morphological analysis, canonical segmentation, syntactic morpheme tagging

Abstract

Morphological analysis involves investigating the syntactic class of a word, but it can also extend to the decomposition and syntactic analysis of the word's underlying morpheme composition. This is especially relevant to languages with an agglutinative writing system, where multiple linguistic words are expressed as a single orthographic word. In this paper, we propose a memory-based approach to canonical segmentation that uses windowing to recover the uncondensed morphemes that differ from the surface form of a word. Additionally, we propose treating the syntactic labelling of morphemes as a sequence labelling task, similar to part-of-speech tagging. This approach leverages the internal morpheme composition of a word as local context, in much the same way that the sentence surrounding a word serves in the disambiguation of its part of speech. The two tasks are modelled separately but performed sequentially, by cascading the decomposed morphemes of a word into the task of syntactic labelling. When evaluated on four resource-scarce, conjunctively written Nguni languages, the proposed approach achieves an overall accuracy ranging between 82% and 92%, which outperforms previously developed rule-based analysers for the same languages.
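
The windowing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the window width, the label scheme, or the similarity metric, so everything here is an assumption. The sketch assumes a window of two characters on each side, a per-character label that is either "=" (copy the surface character) or a replacement string (the canonical material for that position), and a 1-nearest-neighbour lookup with a simple overlap count. The function names (windows, train, segment) and the toy strings are invented for illustration and are not real Nguni data.

```python
from collections import Counter

PAD = "_"  # padding symbol for positions outside the word (assumed)

def windows(word, left=2, right=2):
    """Return one fixed-width character window per position in `word`."""
    padded = PAD * left + word + PAD * right
    width = left + 1 + right
    return [tuple(padded[i:i + width]) for i in range(len(word))]

def overlap(a, b):
    """Count the positions at which two windows hold the same character."""
    return sum(x == y for x, y in zip(a, b))

def train(pairs, left=2, right=2):
    """Memory-based 'training': store every (window, label) instance."""
    memory = []
    for word, labels in pairs:
        for win, label in zip(windows(word, left, right), labels):
            memory.append((win, label))
    return memory

def classify(window, memory, k=1):
    """Majority label among the k stored instances with the highest overlap."""
    ranked = sorted(memory, key=lambda inst: overlap(window, inst[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def segment(word, memory, left=2, right=2):
    """Rewrite each surface character according to its predicted label."""
    out = []
    for ch, win in zip(word, windows(word, left, right)):
        label = classify(win, memory)
        out.append(ch if label == "=" else label)  # "=" means copy unchanged
    return "".join(out)

# Invented toy data: "a-" inserts a boundary after "a"; "a-i" restores two
# underlying vowels (with a boundary) from a single surface "e".
memory = train([
    ("kato", ["=", "a-", "=", "="]),
    ("keto", ["=", "a-i", "=", "="]),
])
print(segment("meto", memory))
```

In a cascade of the kind the abstract describes, the segmented output would then be split on the boundary symbol and each morpheme tagged by a separate sequence labeller, with the neighbouring morphemes of the same word as context.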

Published

2022-02-24

How to Cite

Canonical Segmentation and Syntactic Morpheme Tagging of Four Resource-scarce Nguni Languages. (2022). Journal of the Digital Humanities Association of Southern Africa, 3(03). https://doi.org/10.55492/dhasa.v3i03.3818