Towards Including South African Hansard Papers in the ParlaMint schema


  • Maciej Ogrodniczuk



Keywords: parliamentary debates, parliamentary records, parliamentary corpora, ParlaMint


The ParlaMint project, a CLARIN flagship initiative, seeks to standardize the representation of parliamentary data across diverse languages and regions. Version 3.0 of ParlaMint encompasses corpora from 26 European countries and autonomous regions, available for download and search under the CC-BY license. These corpora adhere to a common XML encoding schema, ensuring interoperability. This study evaluates the feasibility of applying the ParlaMint schema to the proceedings of the Parliament of the Republic of South Africa. By converting one randomly selected parliamentary session, we examine how its elements can be modelled and outline the steps needed to begin a full-scale encoding effort.
The experiment starts with data retrieval by downloading Hansard records from the South African Parliament website. An English session was selected to streamline processing for non-South African researchers. The original format consisted of session headers, metadata, introductory messages, and debate records. Speeches were identified by uppercase headers and segmented into paragraphs.
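The segmentation step described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each speech opens with a header of the form `The SPEAKER:` or `Mr A BROWN:` (a short prefix containing at least one all-uppercase word, followed by a colon) and that the following lines are that speaker's paragraphs. The function name, the regular expressions, and the sample lines are illustrative assumptions, not the procedure actually used in the study.

```python
import re

# Assumption: a speech header is a short prefix ending in a colon ...
HEADER_RE = re.compile(r"^(?P<speaker>[^:\n]{2,80}):\s*(?P<rest>.*)$")
# ... whose prefix contains at least one word written fully in uppercase.
UPPER_WORD_RE = re.compile(r"\b[A-Z]{2,}\b")


def split_speeches(lines):
    """Group transcript lines into speeches keyed by an uppercase header."""
    speeches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines only separate paragraphs
        match = HEADER_RE.match(line)
        if match and UPPER_WORD_RE.search(match.group("speaker")):
            # A new header starts a new speech.
            speeches.append(
                {"speaker": match.group("speaker").strip(), "paragraphs": []}
            )
            if match.group("rest"):
                speeches[-1]["paragraphs"].append(match.group("rest"))
        elif speeches:
            # Any other line continues the current speech.
            speeches[-1]["paragraphs"].append(line)
    return speeches


sample = [
    "The SPEAKER: Order, hon members.",
    "We now proceed.",
    "",
    "Mr A BROWN: Thank you, Speaker.",
]
speeches = split_speeches(sample)
```

A heuristic of this kind would misread a paragraph that itself begins with an uppercase abbreviation followed by a colon, so manual review of the detected headers would still be needed.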
Transcript conversion entailed extracting data from the PDF, eliminating technical elements, and ensuring continuity of utterances. Speaker names and functions were identified, and the session was transformed into ParlaMint-compliant TEI XML format. Meta-comments, including applause, laughter, and interruptions, were categorized based on typical phrases.
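Phrase-based categorization of meta-comments might look like the sketch below. The trigger phrases are hypothetical examples, and the element and type pairs (`kinesic`/`applause`, `vocal`/`laughter`, `vocal`/`interruption`, with `note` as fallback) reflect one reading of ParlaMint conventions that should be checked against the schema itself; none of this is taken from the study's actual rule set.

```python
import re

# Ordered rules: (pattern over the comment text, (element name, type value)).
# Both the phrases and the element/type pairs are assumptions for illustration.
RULES = [
    (re.compile(r"\bapplause\b", re.I), ("kinesic", "applause")),
    (re.compile(r"\blaughter\b", re.I), ("vocal", "laughter")),
    (re.compile(r"\binterjections?\b|\binterrupt", re.I), ("vocal", "interruption")),
]


def classify_comment(text):
    """Return (element, type) for a meta-comment; fall back to a plain note."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return ("note", None)


labels = [classify_comment(c) for c in ["[Applause.]", "[Laughter.]", "[Time expired.]"]]
```

Keeping the rules in an ordered list makes it easy to add further phrases as they are encountered in other sessions.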
Quotations, marked with indentation in the original transcript, were manually encoded as TEI elements. Foreign-language fragments were treated as gaps, with English translations provided. Multi-paragraph foreign utterances were encoded paragraph by paragraph.
Speaker metadata was stored in separate XML files, listing organizations and individual speakers. Speaker names and roles were converted into XML IDs, and web pages were linked for additional information. Speaker type was designated based on metadata, distinguishing between chairs, guests, and regular speakers.
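Turning speaker names and roles into XML IDs could be sketched as follows. The naming scheme (capitalized words joined without separators, with an underscore prefix when the result would not start with a letter) is an assumption chosen so that the output is a valid XML name; the study does not specify its scheme, and `to_xml_id` is a hypothetical helper.

```python
import re


def to_xml_id(name):
    """Derive an XML-ID-safe string from a speaker name or role."""
    # Runs of letters and digits (Unicode-aware); punctuation and spaces are dropped.
    parts = re.findall(r"[^\W_]+", name)
    base = "".join(part.capitalize() for part in parts) or "Unknown"
    # An XML ID must not start with a digit, so prefix an underscore if needed.
    if not base[0].isalpha():
        base = "_" + base
    return base


ids = [to_xml_id(n) for n in ["Mr A BROWN", "The Speaker"]]
```

Two different people with the same name would receive the same ID under this scheme, so a real conversion would need an additional disambiguation step.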
The encoded session comprised 95 utterances, with varying distributions among speakers. The proposed conversion process serves as a starting point for the larger endeavour of encoding South African parliamentary data in the ParlaMint schema. While not exhaustive, this study lays the groundwork for expanding the ParlaMint dataset to include African parliamentary records.




How to Cite

Towards Including South African Hansard Papers in the ParlaMint schema. (2024). Journal of the Digital Humanities Association of Southern Africa, 5(1).