Development of linguistically annotated parallel language resources for four South African languages

Authors

  • Tanja Gaustad, Centre for Text Technology (CTexT®), North-West University, South Africa
  • Martin Puttkammer, Centre for Text Technology (CTexT®), North-West University, South Africa

DOI:

https://doi.org/10.55492/dhasa.v3i03.3815

Keywords:

Resource development, Nguni languages, linguistically annotated corpora, parallel corpora, annotation procedure

Abstract

For this project, we collected and annotated data to develop language resources for the four official South African Nguni languages written with a conjunctive orthography. The data are parallel across the four languages to allow for comparative (computational) linguistic studies. The corpora have been annotated for three types of linguistic information: morphology, part of speech and lemma. The article focuses on the annotation procedure, the design choices made along the way, and the quality control steps used. We hope this description will offer guidance for similar future projects on under-resourced languages.

Published

2022-02-24

How to Cite

Development of linguistically annotated parallel language resources for four South African languages. (2022). Journal of the Digital Humanities Association of Southern Africa, 3(03). https://doi.org/10.55492/dhasa.v3i03.3815