Canonical Segmentation and Syntactic Morpheme Tagging of Four Resource-scarce Nguni Languages
DOI: https://doi.org/10.55492/dhasa.v3i03.3818
Keywords: morphological analysis, canonical segmentation, syntactic morpheme tagging
Abstract
Morphological analysis involves investigating the syntactic class of a word but can also extend to the decomposition and syntactic analysis of its underlying morpheme composition. This is especially relevant to languages with an agglutinative writing system where multiple linguistic words are expressed as a single orthographic word. In this paper, we propose a memory-based approach to canonical segmentation that uses windowing to recover the uncondensed morphemes that differ from the surface form of a word. Additionally, we propose treating the syntactic labelling of morphemes as a sequence labelling task, similar to part-of-speech tagging. This approach leverages the internal morpheme composition of a word as local context in much the same way that the surrounding sentence of a word serves in the disambiguation of its part of speech. Both tasks are modelled separately but performed sequentially by cascading the decomposed morphemes of a word into the task of syntactic labelling. When evaluated on four resource-scarce, conjunctively written Nguni languages, the proposed approach achieves an overall accuracy ranging between 82% and 92%, which outperforms previously developed rule-based analysers for the same languages.
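The windowing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the window size, feature set, label scheme, or similarity metric, so everything below is an assumption made for illustration: the names `char_windows`, `overlap`, and `predict` are hypothetical, the padding symbol and context widths are arbitrary, and the strings are toy data rather than real Nguni words. Each character position becomes one fixed-width instance, and a new instance receives the label of the most similar stored instance.

```python
from typing import List, Sequence, Tuple

PAD = "_"  # hypothetical padding symbol for positions outside the word


def char_windows(word: str, left: int = 2, right: int = 2) -> List[Tuple[str, ...]]:
    """Return one fixed-width character window per position in `word`."""
    padded = PAD * left + word + PAD * right
    width = left + 1 + right
    return [tuple(padded[i:i + width]) for i in range(len(word))]


def overlap(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two windows hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def predict(window: Sequence[str], memory: List[Tuple[Tuple[str, ...], str]]) -> str:
    """Return the label of the stored window most similar to `window` (1-nearest neighbour)."""
    best_window, best_label = max(memory, key=lambda item: overlap(window, item[0]))
    return best_label


# Toy usage: labels "X" and "Y" stand in for whatever per-character
# output a real segmenter would store; they are placeholders only.
memory = [(("_", "a", "b"), "X"), (("a", "b", "a"), "Y")]
label = predict(("a", "b", "c"), memory)
```

In a cascaded setup like the one the abstract describes, the same windowing could be applied a second time over the recovered morphemes, with neighbouring morphemes as context and syntactic tags as labels; that extension is likewise an assumption and not a description of the authors' implementation.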
License
Copyright (c) 2022 Journal of the Digital Humanities Association of Southern Africa
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.