Towards Including South African Hansard Papers in the ParlaMint schema

The ParlaMint project, a CLARIN flagship initiative, seeks to standardize the representation of parliamentary data across diverse languages and regions. Version 3.0 of ParlaMint encompasses corpora from 26 European countries and autonomous regions, available for download and search under the CC-BY license. These corpora adhere to a common XML encoding schema, ensuring interoperability. This study evaluates the feasibility of applying the ParlaMint schema to the proceedings of the Parliament of the Republic of South Africa. Through the conversion of a randomly selected parliamentary session, we scrutinize how various elements are modelled, delineating the steps required to initiate a comprehensive encoding endeavour. The experiment starts with data retrieval by downloading Hansard records from the South African Parliament website. An English session was selected to streamline processing for non-South African researchers. The original format consisted of session headers, metadata, introductory messages, and debate records. Speeches were identified by uppercase headers and segmented into paragraphs. Transcript conversion entailed extracting data from the PDF, eliminating technical elements, and ensuring continuity of utterances. Speaker names and functions were identified, and the session was transformed into ParlaMint-compliant TEI XML format. Meta-comments, including applause, laughter, and interruptions, were categorized based on typical phrases.


Introduction
Parliamentary debates possess unique content, structure, and language, making them significant subjects of study in fields like political science, sociology, history, discourse analysis, sociolinguistics, and information technology. The CLARIN research infrastructure has played a longstanding role in organizing work around parliamentary data. Earlier recommendations informed the development of ParlaMint (Erjavec et al. 2023), a CLARIN flagship project aimed at harmonizing the representation of parliamentary data across languages. ParlaMint in its current version (Erjavec et al. 2023a,b, Kuzman et al. 2023) contains corpora for 26 European countries and autonomous regions, openly available under the CC-BY license for download and search.
The corpora are intended to be interoperable by being encoded according to a common ParlaMint schema. This paper tests the applicability of the ParlaMint schema to the proceedings of the Parliament of the Republic of South Africa. During the conversion of one randomly selected parliamentary session we analyse how various phenomena are modelled and propose the steps which need to be taken to start a full-scale encoding project.

Methodology
Based on previous work in ParlaMint, a detailed methodology for adding new corpora to the dataset has been compiled. Most parts of the process are performed locally, i.e. by a project partner interested in adding their data to the ParlaMint infrastructure. The integration steps are carried out by the ParlaMint technical managers.
1. First, the parliamentary data and metadata are acquired by the interested party (most likely, a research organisation interested in creating the corpus). This may involve various methods, depending on the availability and format of the original data, e.g. scraping it from the parliamentary websites, obtaining it via a parliamentary or third-party API, or even retrieving it from an already maintained parliamentary corpus.
2. Then, the data is converted into the ParlaMint schema. This can be done in many ways, also depending on the format, e.g. from HTML to basic TEI XML (TEI Consortium 2017) and then to the ParlaMint format, through XSLT stylesheets or by writing custom scripts with heuristics for difficult parts.
7. The documentation of the corpus is created using the template common to all ParlaMint corpora, with sections describing general corpus information, information on data sources, the process of data acquisition and the method used for linguistic annotation.

Data retrieval
The Parliament of South Africa is bicameral and comprises a National Assembly (400 seats) and a National Council of Provinces (90 seats). The Hansard of the sessions of both houses ("a substantially verbatim report – with repetitions and redundancies omitted and obvious mistakes corrected – of parliamentary proceedings"; Parliament of The Republic of South Africa 2023e) is available online and can be used to "view, copy, download to a local drive, print and distribute the content of this website, or any part thereof for informational or reference purposes only and for non-commercial purposes" (Parliament of The Republic of South Africa 2023c).
The proceedings are published in various official languages of South Africa: Ndebele, Pedi, Sotho, Swati, Tsonga, Tswana, Venda, Xhosa, Zulu, and English (most of the records and all after 2015). The selected session was in English, which facilitated further processing by a non-South African researcher. At the same time, a quick look at Hansard reports in other languages showed the same structure as in the English ones, which suggests that the current experiment is applicable to other session data.
The selected file (HAN-MPS-BV14-2019-07-16.pdf) represented a session from July 16, 2019, and was available as a 95-page PDF.
Other sources of the same session were also consulted: one by the Parliamentary Monitoring Group (Parliamentary Monitoring Group 2023) and another one by the People's Assembly website (Parliament of The Republic of South Africa 2023a), but both turned out to be derivative.

The original format
Each page of the file contained a header with the session identifier, date, page number and the total number of pages in the file: MPS-BV14 16 JULY 2019 Page: 1 of 95.
The file contained some metadata and introductory messages (see Fig. 1), which were followed by the record of the debate as a series of speeches. After the closing remarks about adjourning the session, a link to "Announcements, Tablings and Committee Reports" for the day was included.
Each speech begins with a header identifying the speaker (name, function or both), printed in uppercase, e.g. The HOUSE CHAIRPERSON (Mr M L D Ntombela), which is separated from the content by a colon. The speech text is split into paragraphs.
Figure 1: The first page of the retrieved transcript.
Meta-comments are given in square brackets. They can correspond to kinesic events such as [Applause.] but also to regular comments.
The text can also contain foreign-language fragments, uttered without notification, as part of the speech. When a longer fragment is spoken in a language different from the main language of the document, it is marked in parentheses, e.g. "(Translation of isiZulu paragraph follows.)". The translation is given in square brackets.
Shorter fragments may not be marked at all, as in Mamelani. [Listen.]

Transcript conversion
The data were first extracted from the PDF file and saved in plain text format. Technical introductory notes and page headers were removed. Since page headers could appear in the middle of an utterance, a simple heuristic was used: when the first utterance on a page started with a lowercase letter, it was joined to the last utterance of the previous page, forming one continuous paragraph.
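The page-joining heuristic can be sketched as follows. This is a minimal illustration rather than the script used in the experiment: the header pattern is an assumption modelled on the sample header "MPS-BV14 16 JULY 2019 Page: 1 of 95", and the function name and input shape (a list of pages, each a list of paragraph strings) are hypothetical.

```python
import re

# Assumed header pattern, modelled on "MPS-BV14 16 JULY 2019 Page: 1 of 95".
PAGE_HEADER = re.compile(r"^\S+ \d{1,2} [A-Z]+ \d{4} Page: \d+ of \d+$")


def merge_pages(pages):
    """Drop page headers and join paragraphs that were split by a page break.

    `pages` is a list of pages; each page is a list of paragraph strings.
    """
    paragraphs = []
    for page in pages:
        # Keep non-empty paragraphs that are not page headers.
        body = [p.strip() for p in page
                if p.strip() and not PAGE_HEADER.match(p.strip())]
        for index, paragraph in enumerate(body):
            continues_previous = (
                index == 0 and paragraphs and paragraph[0].islower()
            )
            if continues_previous:
                # First paragraph on the page starts in lowercase:
                # treat it as the continuation of the previous paragraph.
                paragraphs[-1] = paragraphs[-1] + " " + paragraph
            else:
                paragraphs.append(paragraph)
    return paragraphs
```

A paragraph that happens to begin a new sentence in lowercase would be joined by mistake, so the output of such a step is best spot-checked against the PDF.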
Since speaker names or functions are printed in uppercase, the session was split into turns by looking for at least two uppercase fragments followed by a colon. Individual paragraphs forming one turn were extracted and written to the target XML in the ParlaMint format: utterances are modelled as <u> elements containing separate <seg> elements, with speakers identified in the who attribute:

    <u who="#The_MINISTER_OF_BASIC_EDUCATION" xml:id="ParlaMint-ZA_HAN-MPS-BV14-2019-07-16.u0" ana="#guest">
      <seg xml:id="seg1">Hon Chairperson, let me acknowledge my Cabinet colleagues that are here, ... </seg>
    </u>

Meta-comments (together with some distinctive pieces of text not marked separately as meta-comments) were converted into Parla-CLARIN elements according to their function, using simple heuristics, mostly based on typical phrases used in the text (see Table 1).
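The turn-splitting step might look roughly like the sketch below. It assumes that an "uppercase fragment" is a word of two or more capital letters and that the header ends at the first colon of a paragraph; both are illustrative assumptions, and the function names are hypothetical.

```python
import re

# Assumption: an "uppercase fragment" is a word of two or more capitals.
UPPER_FRAGMENT = re.compile(r"\b[A-Z]{2,}\b")


def split_header(paragraph):
    """Return (speaker, text) if the paragraph opens a turn, else (None, paragraph)."""
    head, colon, rest = paragraph.partition(":")
    if colon and len(UPPER_FRAGMENT.findall(head)) >= 2:
        return head.strip(), rest.strip()
    return None, paragraph


def split_into_turns(paragraphs):
    """Group paragraphs into (speaker, [paragraph, ...]) turns."""
    turns = []
    for paragraph in paragraphs:
        speaker, text = split_header(paragraph)
        if speaker is not None:
            # A new speaker header starts a new turn.
            turns.append((speaker, [text] if text else []))
        elif turns:
            # A paragraph without a header continues the current turn.
            turns[-1][1].append(text)
    return turns
```

Each resulting turn maps to one <u> element and each paragraph in it to one <seg>. A sentence that contains two acronyms before a colon would be misread as a header, so a real pipeline would probably need a list of exceptions.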
One issue worth pointing out is the distinction between notes about the time when the meeting started and adjourned and the information about time expiration. Initially the latter had been modelled as time notes, but eventually the decision was made to model them as vocal interruptions. This was motivated by the presence of non-English equivalents of the phrase "Time expired", which may suggest that it was uttered by the chair rather than added as a comment by the stenographer.
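A phrase-based classification of meta-comments could be sketched as below. Only two decisions come from the text: [Applause.] is a kinesic event and "Time expired" is modelled as a vocal interruption. The remaining rule, the type labels and the fallback to a plain note are assumptions made for illustration and do not reproduce the actual rules of Table 1.

```python
# Illustrative rules only: (lowercase phrase, element name, type value).
RULES = [
    ("applause", "kinesic", "applause"),
    ("laughter", "vocal", "laughter"),            # assumed mapping
    ("time expired", "vocal", "interruption"),
]


def classify_meta_comment(comment):
    """Map a bracketed meta-comment to an (element, type) pair."""
    # Remove surrounding brackets, full stops and spaces, then lowercase.
    text = comment.strip("[]. ").lower()
    for phrase, element, type_value in RULES:
        if phrase in text:
            return element, type_value
    # Assumed fallback: anything unrecognised becomes a plain note.
    return "note", None
```

Handling non-English variants such as "Kwaphela isikhathi." would require adding the corresponding phrases to the rule list.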
Apart from these removals, the text was not modified in any other way (including punctuation), which may help reproduce the original minutes from the ParlaMint-compliant XML.

Quotations
Quotations were marked in the original transcript with an indentation, which was difficult to detect automatically. Since the ParlaMint schema does not allow the TEI <quote> element to appear in <seg> elements, the two quotations present in the session transcript were encoded manually as <note> elements with a non-standard type quote:

    <seg xml:id="seg237"> I quote Phuti Seloba:
      <note type="quote">I can put my head on the block and say that all the outstanding toilets will be delivered before the end of 2014. We just need our people to be patient with us!</note>

Other language fragments
Fragments in other languages are marked as gaps, with the English translation available as the main content. This is in line with the ParlaMint guidelines (Erjavec et al. 2023), which motivate this decision in the following way: "Sometimes a passage of the transcription is in a foreign language, and, especially as the corpus is to be linguistically annotated, the passage is best left out of the transcription proper. This can be achieved by encoding it as a gap in the transcription with the reason foreign, while the <desc> should contain the omitted text."
In this way, the content:

    Sifisa ukuncoma nokuwubonga uMnyango ngalolu hlelo oluhle abalwenzela izwe lakithi eNingizimu Afrika. [We wish to applaud and thank the department for this programme that they have put in place for our country, South Africa.]

is represented as:

    <gap reason="foreign">
      <desc xml:lang="zu">Sifisa ukuncoma nokuwubonga uMnyango ngalolu hlelo oluhle abalwenzela izwe lakithi eNingizimu Afrika.</desc>
    </gap>
    <seg xml:id="seg224">We wish to applaud and thank the department for this programme that they have put in place for our country, South Africa.</seg>

When multi-paragraph foreign utterances are encountered, they are modelled paragraph by paragraph.
What is unusual is that meta-comments related to time expiration which immediately follow a longer foreign-language fragment can also be rendered in a language other than English, e.g. "Kwaphela isikhathi." ("Time expired" in Zulu). Such cases were modelled in the standard way, with the original Zulu text kept for reference:

    <vocal type="interruption">
      <desc xml:lang="zu">Kwaphela isikhathi.</desc>
      <desc xml:lang="en">Time expired.</desc>
    </vocal>

The time expiration note is placed after both fragments, i.e. after the original-language text and its English translation.
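The gap encoding described in this section can be sketched with Python's standard xml.etree.ElementTree module. The sketch assumes that a paragraph consists of the foreign text followed by its translation in square brackets, and that the language code is supplied by the caller, since the transcript does not always state it. The function name and parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the reserved "xml" prefix, used for xml:lang and xml:id.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Assumed layout: foreign text, then the translation in square brackets.
PAIR = re.compile(r"^(?P<foreign>.+?)\s*\[(?P<translation>.+)\]$", re.DOTALL)


def encode_foreign(paragraph, lang, seg_id):
    """Return (gap, seg) elements for a 'foreign [translation]' paragraph.

    Returns None when the paragraph does not match the assumed layout.
    """
    match = PAIR.match(paragraph.strip())
    if match is None:
        return None
    # The omitted foreign text goes into <gap reason="foreign"><desc>.
    gap = ET.Element("gap", {"reason": "foreign"})
    desc = ET.SubElement(gap, "desc", {XML_NS + "lang": lang})
    desc.text = match.group("foreign")
    # The English translation becomes the content of the <seg>.
    seg = ET.Element("seg", {XML_NS + "id": seg_id})
    seg.text = match.group("translation")
    return gap, seg
```

Calling ET.tostring(gap, encoding="unicode") on the result serialises the element. Multi-paragraph fragments would be handled by calling the function once per paragraph, mirroring the paragraph-by-paragraph modelling described above.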

Speaker metadata
Following the ParlaMint model, speaker-related information (for MPs and guests) was stored in separate files. The organisations involved in the process are listed in ParlaMint-ZA-listOrg.xml. In our experiment only two organisations were involved: the South African Government (represented by the Minister of Basic Education and her deputy) and the National Assembly Chamber of the 27th South African Parliament:

    <org xml:id="government.ZA" role="government">
      <orgName xml:lang="en" full="yes">South African Government</orgName>
    </org>
    <org xml:id="parliament" role="parliament" ana="#epc">
      <orgName full="yes" xml:lang="en">National Assembly Chamber</orgName>
      <listEvent>
        <event xml:id="epc.27" from="2019-05-22">
          <label xml:lang="en">27th South African Parliament</label>
        </event>
      </listEvent>
    </org>

The information about individual speakers was stored in the ParlaMint-ZA-listPerson.xml file. There were 24 distinct strings identifying speakers in the sample session. One of them was "An HON MEMBER" (original spelling), which corresponded to an unidentified speaker whose speech was encoded as a vocal message of some regular MP:

    <vocal type="clarification" ana="#regular">
      <desc>It is a point of debate.</desc>
    </vocal>

One name (Patamedi Ronald Moroatshehla) was twice misspelled as Moroatsehla, so eventually 22 speakers were encoded in the person file (the 24 strings minus the unidentified member and the misspelled variant).
The names and roles of the speakers used in the transcript were converted into XML IDs by replacing spaces with underscores, deleting courtesy titles (Mr, Ms, Mrs, Dr etc.) and removing brackets to maintain compliance with the xml:id type model, e.g. The HOUSE CHAIRPERSON (Mr M L D Ntombela) becomes The HOUSE CHAIRPERSON M L D Ntombela. The surnames and initials of the first names of the MPs were then matched against the relevant web pages on the parliament website (Parliament of The Republic of South Africa 2023b). Various types of information were available on the MPs' web pages, such as their contact and social media details, political party and committee membership, parliament membership and committee membership history, political leadership background, education, interests and ambitions. Marital status could also be retrieved: while only "Ms" was used in the transcript, the website contains both "Ms" and "Mrs" designations. However, these data are not consistent across MPs (cf. e.g. the information available for the chairperson Madala Louis David Ntombela, just basic data, and for Nomsa Innocencia Tarabella Marchesi, a very detailed description), so they were not included in the current experiment.
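The identifier normalisation described above (an xml:id value may not contain spaces or parentheses) can be sketched as follows. The title list covers only the four titles named in the text, and the function name is hypothetical.

```python
import re

# Titles named in the text; the original list ends with "etc.",
# so a real script would likely need more entries.
COURTESY_TITLES = ("Mr", "Ms", "Mrs", "Dr")


def speaker_id(label):
    """Turn a transcript speaker label into an xml:id-friendly string."""
    # Remove round and square brackets.
    label = re.sub(r"[()\[\]]", "", label)
    # Drop courtesy titles, tolerating a trailing full stop ("Dr.").
    tokens = [t for t in label.split() if t.rstrip(".") not in COURTESY_TITLES]
    # Replace the remaining spaces with underscores.
    return "_".join(tokens)
```

For the label The MINISTER OF BASIC EDUCATION this yields The_MINISTER_OF_BASIC_EDUCATION, the form used in the who attribute of the utterance example.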
When only speaker roles were given in the text instead of person names (the Minister of Basic Education and the Deputy Minister of Basic Education), the actual names had to be manually retrieved from another source (Wikipedia). Since these speakers also had their records on the parliament website, links to their pages were included as well. When both the role and the actual name were included, e.g. The HOUSE CHAIRPERSON (Mr M L D Ntombela), both expressions were used.
The utterance information was additionally modified based on speaker metadata by setting the speaker type to:
• chair for all utterances marked as #The HOUSE CHAIRPERSON Mr M L D Ntombela
• guest for all utterances of the Minister of Basic Education and her deputy
• regular for all other utterances.
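The speaker-type assignment listed above amounts to a small lookup. In the sketch below the minister's identifier comes from the utterance example earlier in the paper, while the deputy's identifier and the substring test for the chairperson are assumptions made for illustration.

```python
GUEST_IDS = {
    "The_MINISTER_OF_BASIC_EDUCATION",
    # Assumed identifier for the deputy; not shown in the paper.
    "The_DEPUTY_MINISTER_OF_BASIC_EDUCATION",
}


def speaker_type(who):
    """Return 'chair', 'guest' or 'regular' for a who reference."""
    # Drop the leading '#' and normalise spaces to underscores.
    speaker = who.lstrip("#").replace(" ", "_")
    if "HOUSE_CHAIRPERSON" in speaker:
        return "chair"
    if speaker in GUEST_IDS:
        return "guest"
    return "regular"
```

The returned value corresponds to the ana attribute of the <u> element, e.g. ana="#guest" for the minister.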

Summary and conclusions
The encoded session comprised 95 utterances: 36 were by the chairperson and 16 by one of the MPs (Thlologelo Malatji), while two speakers had 7 utterances each, one had 5, two had 3, three had 2 and twelve had only one utterance.
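As a quick consistency check, the per-speaker counts above can be added up; they give the 95 utterances reported and 22 speakers, which matches the number of speakers encoded in the person file. The variable names are illustrative.

```python
# (utterances per speaker, number of speakers with that count)
distribution = [(36, 1), (16, 1), (7, 2), (5, 1), (3, 2), (2, 3), (1, 12)]

total_utterances = sum(count * speakers for count, speakers in distribution)
total_speakers = sum(speakers for _, speakers in distribution)

print(total_utterances, total_speakers)  # prints "95 22"
```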
Table 2 shows the distribution of individual constructs.
The proposed conversion process was intended only to illustrate how various decisions could be taken in a full-scale project of encoding South African parliamentary data in the ParlaMint schema. It by no means offers a final and definitive decision-making model and will have to be adjusted in the future. We are aware that it might not cover all phenomena encountered in real-world data, since the encoding experiment covered only one session, one house and one main language. Still, we believe it can pave the way towards extending the European ParlaMint data to a new continent.
website, or any part thereof for informational or reference purposes only and for non-commercial purposes" (Parliament of The Republic of South Africa 2023c). South African Hansard records can be downloaded from the Hansard papers section (Parliament of The Republic of South Africa 2023d) of the South African Parliament website. The sample session was selected by looking up the most recently produced one at the time of the experiment (August 8, 2023).

Table 1: Heuristics used to convert meta-comments into Parla-CLARIN element types

Table 2: Statistics of ParlaMint elements in the encoded session