Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora

DOI:

https://doi.org/10.55492/v6i02.6739

Keywords:

corpus development, under-resourced languages, domain-specific data, agriculture

Abstract

This paper presents new multilingual corpora from the agricultural domain for seven South African languages, namely Afrikaans, English, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, and Setswana, based on the Pula/Imvula magazine. After pre-processing, the data was automatically segmented into sentences, tokenized, lemmatized, and annotated with part-of-speech information using the services available at https://v-ctx-lnx7.nwu.ac.za/. The final resources, comprising between 774k and 1.38M tokens per language, are included as searchable corpora on the Corpus Cooperative at North-West University (COCO@NWU) corpus platform at https://coco.nwu.ac.za/. In addition, the data can be made available as text files for research purposes upon request. To highlight the value of this agricultural domain-specific data collection in relation to more general data, we also include some corpus-based statistics and comparisons with previous research.

Published

2025-12-31

Issue

Section

Articles

How to Cite

Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora. (2025). Journal of the Digital Humanities Association of Southern Africa (DHASA), 6(2). https://doi.org/10.55492/v6i02.6739