New uses for old books: Description of digitised corpora-based on the Setswana language collection in the WITS Cullen Africana Collection

Malebogo Thabong; Nina Lewin; Taariq Surtee

doi:10.55492/dhasa.v3i03.3819

Authors

Malebogo Thabong The University of the Witwatersrand, Johannesburg
Nina Lewin The University of the Witwatersrand, Johannesburg
Taariq Surtee The University of the Witwatersrand, Johannesburg

Keywords:

Setswana, Library, digitisation, genres, corpus

Abstract

This paper describes a corpus of 104 books separated from a larger collection of African language books. The books were catalogued using standard library and archival metadata. A subset was digitised and cleaned. The books were then divided into five subsets and compared against each other and against the entire corpus. We also created tables of collocates and word frequencies and computed basic statistics on those words (see the tables in the appendix). We speculate that the corpus as a whole could serve roughly as a general language register. We also give some examples of the characteristics of the genre subsets. The paper aims to introduce the corpus to NLP researchers and offer it for further research.

Published

2022-02-24

Issue

Vol. 3 No. 03 (2021): Proceedings of the 2nd workshop on Resources for African Indigenous Language (RAIL) at DHASA 2021

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

How to Cite

New uses for old books: Description of digitised corpora-based on the Setswana language collection in the WITS Cullen Africana Collection. (2022). Journal of the Digital Humanities Association of Southern Africa (DHASA), 3(03). https://doi.org/10.55492/dhasa.v3i03.3819
