Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora

Tanja Gaustad; Cindy McKellar; Martin J. Puttkammer

doi:10.55492/v6i02.6739

Authors

Tanja Gaustad, North-West University
Cindy McKellar, North-West University
Martin J. Puttkammer, North-West University

DOI:

https://doi.org/10.55492/v6i02.6739

Keywords:

corpus development, under-resourced languages, domain-specific data, agriculture

Abstract

This paper presents new multilingual corpora from the agricultural domain for seven South African languages, namely Afrikaans, English, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, and Setswana, based on the Pula/Imvula magazine. After pre-processing, the data was automatically segmented into sentences, tokenized, lemmatized, and annotated with part-of-speech information using the services available at https://v-ctx-lnx7.nwu.ac.za/. The final resources, comprising between 774k and 1.38M tokens per language, are available as searchable corpora on the Corpus Cooperative at North-West University (COCO@NWU) corpus platform at https://coco.nwu.ac.za/. In addition, the data can be made available as text files for research purposes upon request. To highlight the value of this agricultural domain-specific data collection in relation to more general data, we also include some corpus-based statistics and comparisons with previous research.
