Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora
DOI: https://doi.org/10.55492/v6i02.6739
Keywords: corpus development, under-resourced languages, domain-specific data, agriculture
Abstract
This paper presents new multilingual corpora from the agricultural domain for seven South African languages, namely Afrikaans, English, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, and Setswana, based on the Pula/Imvula magazine. After pre-processing, the data was automatically sentencized, tokenized, lemmatized, and annotated with part-of-speech information using the services available at https://v-ctx-lnx7.nwu.ac.za/. The final resources, comprising between 774k and 1.38M tokens per language, are included as searchable corpora on the Corpus Cooperative at North-West University (COCO@NWU) corpus platform at https://coco.nwu.ac.za/. In addition, the data can be made available as text files for research purposes upon request. To highlight the value of this agricultural domain-specific data collection in relation to more general data, we also include some corpus-based statistics and comparisons with previous research.
License
Copyright (c) 2025 Tanja Gaustad, Cindy McKellar, Martin J. Puttkammer

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.