Harnessing Google Translations to Develop a Readability Corpus for Sesotho: An Exploratory Study

This article addresses the scarcity of gold-standard annotated corpora for readability assessment in Sesotho, a low-resource language. As a solution, we propose using translated texts to construct a readability-labelled corpus. Specifically, we investigate the feasibility of using Google Translate to translate texts from Sesotho to English and then manually post-editing the texts. We then evaluate the effectiveness of the Google translations by comparing them to the human-post-edited versions. We utilised the Ghent University readability demo to extract the readability levels of both the Google translations and the human-post-edited translations. The translations are then evaluated using three evaluation metrics, namely, BLEU, NIST, and RIBES scores. The translation evaluation results reveal substantial similarities between the machine translations and the corresponding human-post-edited texts. Moreover, the results of the readability assessment and the comparison of text properties demonstrate a high level of consistency between machine translations and human-post-edited texts. These findings suggest that Google Translations show promise in addressing the challenges of developing readability-labelled parallel datasets in low-resource languages like Sesotho, highlighting the potential of leveraging machine translation techniques to develop translated corpora for such languages. By evaluating Google Translations of educational texts in Sesotho and demonstrating the feasibility of using machine translations for readability research, this article contributes to the quest for developing Sesotho text readability measures.


Introduction
The investigation into text readability levels has been a subject of scholarly examination spanning over a century (De Clercq & Hoste 2016, Zamanian & Heydari 2012). However, there remains a notable dearth of research pertaining to text readability in African indigenous languages (Sibeko & De Clercq 2023, Sibeko & Van Zaanen 2021). To put it simply, readability pertains to the ease with which a given text can be read (Dahl 2004).
Text readability measures serve as valuable tools for estimating the level of ease associated with reading particular texts. These estimations hold significant importance, particularly in language learning, wherein language learners rely on appropriately levelled texts to enhance their reading proficiency (Sibeko 2023). Thus, the exploration of readability measures assumes great significance (Collins-Thompson 2014), particularly within the context of African indigenous languages. This is especially relevant in social contexts where texts are scarce and necessitate reuse for varied purposes.
While the available research has primarily concentrated on higher-resourced languages like English, there have been explorations of Sesotho text readability (Krige & Reid 2017, Reid et al. 2019, Sibanda 2019, 2014). However, these studies relied on English text readability measures since there are no known measures for text readability in Sesotho (Sibeko & Van Zaanen 2021). To address this lack, one possible solution is to adapt existing traditional readability measures to indigenous South African languages such as Sesotho (Leopeng 2019).
A gold-standard corpus with clear levels of text difficulty is needed to develop an automated readability model (Van Oosten et al. 2010, François & Fairon 2012). In fact, previous studies have emphasised the necessity of gold-standard corpora when adapting text readability measures from high-resource languages such as English to low-resource languages such as Sesotho (Van Oosten & Hoste 2011, François & Fairon 2012, Curto et al. 2014, 2015). This article is part of a larger research project focused on developing readability measures for Sesotho. To achieve this goal, we require a gold-standard English-Sesotho parallel corpus that is annotated with text readability levels, which can be used to train machine learning models.
In our search for resources for this article, we found a lack of English-Sesotho-aligned corpora labelled for text readability, which hinders the creation of gold-standard corpora (Sibeko 2023, p. 120). Consequently, a crucial aspect of our larger research project involves the development of a parallel English-Sesotho text readability corpus.
Unfortunately, while human-translated and edited texts would be ideal for creating a parallel corpus, they can be costly (Serhani et al. 2011). Given this constraint, we propose an alternative approach: the utilisation of translated texts to gauge text readability in Sesotho.
Essentially, our suggestion is to leverage machine-translated texts to establish a gold-standard corpus. We acknowledge, however, that the effectiveness of using machine-translated texts to train models for adapting English traditional readability measures to Sesotho remains uncertain. Ideally, one would correlate the English readability scores with readability scores for known Sesotho documents; however, we do not have access to data of that kind.
Overall, this article explores the feasibility of using Google-translated Sesotho source texts and English target texts in training models for the development of traditional readability measures in Sesotho. To achieve this, the article will:
• examine the similarities between unedited English texts, which are machine-translated from Sesotho, and their human post-edited English counterparts, and
• investigate whether the selected machine-translated texts demonstrate consistent levels of readability when compared to the human post-edited versions.

Machine Translation
Machine translation (MT) involves the automated conversion of texts from one language to another. MT systems rely on extensive corpora containing pairs of translations between the source and target languages (Tsai 2019). In this article, we have chosen Google Translate (see translate.google.com), one of the most widely used machine translation service providers (Latief et al. 2020).
Although Google Translate benefits from a vast corpus of texts, it is still prone to errors that are unlikely to be made by human translators (Tsai 2019). Consequently, human translations remain more accurate (Way & Hearne 2011). As such, while Google Translate continues to improve its grammatical accuracy, its translation accuracy, especially for African languages, remains a concern (Patil & Davies 2014, Tsai 2019).
Nevertheless, several studies have demonstrated the utility of Google Translate for translating texts in various research domains, such as health (Patil & Davies 2014), English Foreign Language learning (Tsai 2019), academic language improvement (Groves & Mundt 2015), and English teaching (Medvedev 2016). In this article, we employ Google Translate to translate reading comprehension and summary texts (Sibeko & Van Zaanen 2022) that were extracted from previous examination papers (see Sibeko & Van Zaanen 2023 for an overview of the texts).

Translation evaluation
This article pays attention to three algorithms for comparing the machine translations to the post-edited versions, namely, the Bilingual Evaluation Understudy (BLEU) score (Palotti et al. 2016), the National Institute of Standards and Technology (NIST) score (Doddington 2002), and the Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al. 2010).
It is important to note here that these algorithms were not meant for comparing machine translations to human post-edited translations but rather to compare machine translations to human translations. In this way, they are best suited to comparatively evaluating the precision of different machine translation systems or comparing machine translations to translations produced by professional human translators. For instance, the underlying principle of the BLEU score is that machine translations are expected to contain many n-grams that are similar to the human translations (Lin & Och 2004, Song et al. 2013). Consequently, these similarities are more easily observed in longer texts, such as full documents (Specia et al. 2010). That is, sentence-level comparisons may yield misleading results. Given the focus of this article on document-level comparisons, the metrics are anticipated to produce optimal results.
According to Long et al. (2017), RIBES and BLEU are popular in machine translation evaluation. In fact, the BLEU score has become the standard evaluation metric since its introduction in 2002 (Song et al. 2013, Specia et al. 2010), and it continues to be used widely across various languages other than English (Chauhan et al. 2021).
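To make the n-gram overlap principle concrete, the following minimal sketch computes a BLEU-style score in pure Python. This is an illustrative simplification (single reference, no smoothing) written for this explanation, not the official implementation; real evaluations would use an established toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """BLEU-style score: geometric mean of modified n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no overlap at this order: score collapses to zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical machine translation and post-edited text would score 1.0, while texts sharing no n-grams score 0.0, which mirrors why document-level comparisons give the metric more overlapping material to work with.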

Text readability
This article examines the application of eight traditional text readability measures: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog index, Simple Measure of Gobbledygook (SMOG), Automated Readability index, Lasbarhetsindex, Coleman-Liau index, and Dale-Chall index (Barry & Stevenson 1975, Coleman & Liau 1975, Dale & Chall 1948, Flesch 1948, Gunning 1969, 2003, Kincaid et al. 1975, Mc Laughlin 1969, Smith & Senter 1967). The eight text readability measures and their respective equations are presented in Table 1. Note that polysyllabic words in the SMOG index and complex words in the Gunning Fog index refer to words with more than two syllables (Londoner 1967, Mc Laughlin 1969, Hedman 2008, Christanti et al. 2017).
In the Dale-Chall index, difficult words are defined as words that are not featured in the list of 3000 common words compiled for the measure (MacDiarmid et al. 1998, Nyman et al. 1961, Stocker 1971).
The texts used in this article are concise, which eliminated the need for sampling and enabled complete texts to be analysed even for measures, such as the Coleman-Liau index, that ordinarily require sampling.
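As an illustration of how such measures combine surface counts, the two Flesch formulas can be computed directly from word, sentence, and syllable totals. The coefficients below are the standard published ones; the helper functions themselves are our own sketch, and a full pipeline would also need a syllable counter.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch (1948) Reading Ease: higher values indicate easier text."""
    asl = words / sentences      # average sentence length (words per sentence)
    asw = syllables / words      # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid (Kincaid et al. 1975) Grade Level: an estimated
    US school grade required to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For a 100-word passage with 10 sentences and 130 syllables, for instance, the Reading Ease is about 86.7 (easy) and the grade level about 3.7.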

Theoretical Underpinning
This article is based on Skopos theory, which views translations as purposeful actions guided by specific intentions (Vermeer 1989, Vermeer & Chesterman 2021, Nord 2018, 2016). According to the theory, translations are considered acceptable only when they align with their intended purposes (Vermeer 1989, Reiss & Vermeer 2014, Tanrikulu 2017, Vermeer & Chesterman 2021). In this approach, translators are bound by the purpose of the translation, limiting their freedom to follow unnecessary impulses (Vermeer 1989, Du 2012, Vermeer & Chesterman 2021).
While Skopos theory may deviate from traditional rules of translation (Koller 1995), it allows for translations that fulfil their intended purposes, as the purpose of a translation, and its usability, is considered the most important criterion for translators (Reiss & Vermeer 2014). As a result, assessing the usefulness of translations based on their fulfilment of purpose allows for mechanical translations that closely adhere to the form and formal properties of the source language, sometimes at the expense of meaning (Lu & Fang 2012, Tanrikulu 2017, Suzani & Khoub 2019). Conversely, the meaning conveyed in the source text may be considered less important than its formal properties (Tanrikulu 2017, Odinye 2019). However, machine translations have been associated with negative perceptions, such as producing senseless texts (Läubli & Orrego-Carmona 2017, Latief et al. 2020).
In this article, machine translations are compared to post-edited versions to investigate whether machine translations can be harnessed to develop a readability corpus for Sesotho. Despite the negative perceptions associated with machine translations, they are considered appropriate here if they produce the same levels of readability as translations that involve human intervention, such as our post-edited dataset.
The rest of this article presents the method in Section 2, the findings in Section 3, and the discussion and conclusions in Section 4.

The texts
This article utilises the Sesotho reading comprehension and summary writing texts from the corpus of grade twelve examination texts (Sibeko & Van Zaanen 2023). These texts are extracted from the first examination paper, which focuses on reading comprehension, visual literacy, and language conventions (Department of Basic Education 2011a,b, Van der Walt 2018). The texts have been previously used in studies on text readability (Sibeko & Van Zaanen 2021), as well as linguistic complexity (Sibeko 2021).

Data processing
The data collected were pre-processed through tokenisation and sentence segmentation using Ucto (version 0.14). An overview of the Sesotho texts used in this article is presented in Table 2. Note that the disparity in word counts among examination texts across different years is due to the availability of examination papers. That is, certain years provided multiple examination opportunities, making all of the texts accessible for analysis. However, in some years, only one exam paper could be located, resulting in a limited number of texts for this investigation.
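The pre-processing step can be pictured with a naive sketch. This is illustrative only: Ucto itself applies language-aware rules for abbreviations and punctuation that this regex version does not.

```python
import re

def segment_sentences(text):
    """Split text into sentences at whitespace that follows terminal
    punctuation (a naive stand-in for Ucto's sentence segmenter)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenise(sentence):
    """Split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```

For example, `segment_sentences("Bana ba bala buka. Ba ithuta hantle!")` yields two sentences, and tokenising the first yields `["Bana", "ba", "bala", "buka", "."]`, separating the full stop as its own token.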
In generating the machine translation data set, the sentence markers, denoted as <utt>, were removed from the texts. Subsequently, each file was translated into English using Google Translate (Note 1), which facilitates whole-file translations. A comprehensive summary of the resulting machine translations is provided in Table 3. These translations form the first data set under discussion in this article.
For the post-edited data set, the machine translations underwent a post-editing process to consider cultural nuances and enhance translation quality.
The translation brief explicitly guided the post-editing, limiting it to cases where meaning was lost. Consequently, word choices and grammatical constructions that did not affect the meaning were deliberately retained, even if alternative options were available.
By adhering to the editing brief and avoiding unnecessary modifications, the accuracy of the translations remained intact, as this approach prioritised readability enhancement over imposing the personal preferences of translators. The summary of the post-edited data set is presented in Table 3. These post-edited versions of the machine translations form the second data set under scrutiny in this article.
Note that the differences in token counts between the machine translations and the human post-edited texts, as presented in Table 3, may be due to various factors, including translationese (Baroni & Bernardini 2006, Toral 2019), over-translation, and a lack of contextual comprehension in the Google translations of idiomatic expressions (Sibeko & Lemeko 2023). As can be observed from Table 3, addressing these translation issues directly influences the final type-token ratios and results in higher type-token ratios for the post-edited texts. This is primarily because post-editing introduced idiomatic language (Sibeko & Lemeko 2023) which is missing from the unedited texts.
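The type-token ratio referred to here is a simple measure of lexical variety. A minimal sketch (our own helper, not the tool used in the study):

```python
def type_token_ratio(tokens):
    """TTR = distinct word forms (types) / running words (tokens).
    Higher values indicate greater lexical variety, such as that
    introduced by idiomatic post-editing."""
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)
```

For instance, `type_token_ratio("the cat saw the dog".split())` gives 0.8, because "the" repeats among the five tokens.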

Data analysis
Two main steps were followed in the data analysis. The evaluation scores for comparing the data sets were computed using RStudio (version 2023.06.0+421). The BLEU, NIST, and RIBES scores are cross-compared to get a sense of the similarities between the data sets.
The text readability scores were computed using the web-based readability tool developed by the Language and Translation Technology Team (LT3) at Ghent University (Note 2). The results obtained through the LT3 readability demo were then extracted and organised into a spreadsheet for further analysis using RStudio.

Findings
The findings of this investigation are presented in two steps. The evaluation of the translations is presented first, followed by the results obtained from the readability assessment.

Translation evaluation
The comparison of the data sets reveals a commendable level of similarity. More specifically, as illustrated in Figure 1, the BLEU scores were generally high, ranging from 0.32 to 0.98. It is important to note that the BLEU score measures similarity on a scale from zero to one, where scores closer to zero indicate minimal similarity, while scores closer to one indicate substantial resemblance between the data sets under comparison.
Similar to the BLEU score, the RIBES scores range from zero to one. As can be observed from Figure 1, the RIBES scores are predominantly clustered between zero and 0.3, with a maximum score of 0.94. The mean score of 0.23 indicates a generally low level of similarity between the machine translations and the post-edited versions.
On the other hand, the NIST scores represent the performance of machine translation systems on a scale from zero to ten. Figure 1 visually demonstrates that the NIST scores predominantly cluster around six and seven, with a minimum of 4.45.

Furthermore, the Pearson correlation among the three measures was examined, and the results are presented in Table 4. As can be observed in Table 4, the correlation coefficient between BLEU and RIBES is 0.69 (p < .0001), indicating a moderate positive correlation. This suggests some degree of similarity between these two measures, although they are not perfectly aligned; an increase in the BLEU score is likely to be accompanied by an increase in the RIBES score. Additionally, the correlation coefficient between BLEU and NIST is 0.88 (p < .0001), indicating a strong positive correlation and thus a high degree of association between the scores of these two metrics, relatively stronger than the relationship between BLEU and RIBES. Finally, the correlation coefficient between NIST and RIBES is 0.60 (p < .0001), indicating a moderate positive correlation.
Overall, the scores for BLEU (M = 0.78, SD = 0.16) and NIST (M = 6.3, SD = 0.72) demonstrate a high similarity index between the two data sets. Conversely, the scores for RIBES (M = 0.25, SD = 0.28) illustrate only a limited level of similarity between the two data sets.
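The correlations reported in Table 4 follow the standard Pearson product-moment formula. Although the article computed them in RStudio, the calculation can be sketched in Python; the score lists below are illustrative, not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Perfectly co-varying scores yield r = 1, perfectly opposed scores yield r = -1, and values around 0.6 to 0.9, as in Table 4, indicate moderate to strong positive association.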

Text readability analysis
To get a sense of the distribution of data for each data set, density plots were employed. As observed in Figure 2, similarities are evident in the distribution of the data: the data have similar spreads and consistently peak around the same locations. Additionally, the DCI and ARI measures show comparable directional biases.
Paired t-tests were used to compare the readability values of the two data sets and gain deeper insights into their behaviour. The descriptive outcomes of the t-tests are presented in Table 5, including means and standard deviations in brackets. Additionally, the ASL, which measures the average number of words in each sentence, and the AWL, which measures the average number of letters in each word, are included. The post-edited data sets exhibit slightly higher means, which are associated with harder-to-read texts, although there is some degree of variability observed within both data sets. Nevertheless, the results indicate that none of the measures displays statistically significant distinctions between the two data sets (p > .05). In this way, the findings suggest that there are no real disparities in text readability between the two data sets.
Additionally, there is close proximity in the mean values of average sentence length (ASL) (M = 15.59 for machine-translated vs. M = 15.76 for post-edited texts, p = .78) and average word length (AWL) (M = 4.39 vs. M = 4.45, p = .32). In fact, the mean differences between the two data sets are not statistically significant.
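The paired t-statistic underlying Table 5 compares matched readability scores for the same texts. A minimal sketch follows (the study used RStudio; the data in the usage note are illustrative):

```python
import math

def paired_t(x, y):
    """Paired t-test statistic and degrees of freedom for matched
    samples, e.g. machine-translated vs post-edited scores per text."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1
```

The resulting statistic is then compared against the t distribution with n - 1 degrees of freedom to obtain p-values such as those reported in Table 5.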

Discussion and Conclusions
This article aimed to investigate the feasibility of using texts translated from Sesotho to English using the Google Translate platform in the creation of a text readability corpus for Sesotho that is annotated with levels of text readability. For this, two sets of data were compiled, namely, the machine translations data set and the human-post-edited version of the machine translations data set.
First, the data sets were evaluated for their similarity, based on the assumption that the machine-translated data set would only be usable if it demonstrated a certain level of accuracy when the post-edited data set was used as the target standard. For this purpose, we utilised three metrics, namely, the RIBES, BLEU, and NIST scores. The comparison of the three sets of translation evaluation results suggests that the Google Translate-generated translations of the examination texts can be utilised in developing a readability corpus for Sesotho. This finding is based on the results for the BLEU and NIST scores, which indicate high precision. Note that the close correlation between NIST and BLEU is not surprising given that the NIST metric was based on the BLEU score.
Furthermore, this article aimed to investigate whether unedited translations generated by Google Translate would exhibit similar characteristics to human-post-edited versions, thereby justifying the levels of readability that are attached to the texts in the data set proposed for use in developing a readability corpus for Sesotho. To achieve this, the readability of the texts was evaluated using eight traditional readability measures along with sentence and word lengths. The analysis revealed no significant differences in the means of the various measures and text properties.
Based on the findings of this article, that the data sets are highly similar and that they produce texts of similar readability levels, there is justification for utilising translations generated by Google Translate from Sesotho to English in the training of traditional readability models for Sesotho. This conclusion is supported by Skopos theory, which emphasises that translations should effectively serve their intended purpose. The use of translations generated through Google Translate aligns with this principle. The comparison conducted between machine-generated translations and human-post-edited translations aimed to assess the similarities between these two translation types.
The results suggest that the machine translations can be deemed valid for the intended purpose of developing a readability corpus.
However, it is essential to note that the discussion in this article is limited, as it only covers the comparison between the post-edited translated texts and the machine translations. Consequently, it does not account for or extensively discuss translation issues, which would have provided valuable insights. Furthermore, this article did not investigate whether translations of First Additional Language (FAL) texts (Note 3) and Home Language (HL) texts exhibited the same level of accuracy, or whether the text types contained in the data sets affected the accuracy of the results, as the data sets were considered in their entirety.
Finally, the findings of this article highlight the need for future research. For instance, while this article has demonstrated the consistency of readability levels between machine translations and human-post-edited translations, it has not addressed the development of an aligned English-Sesotho corpus annotated with readability levels. This represents an avenue for further investigation and exploration.

Notes
1 Google Translate is an online machine translation platform accessible at https://translate.google.com.
3 The First Additional Language subject is intended for learners with lower language proficiencies than those in the Home Language classes. Note that previous research such as Sibeko (2021) and Sibeko & Van Zaanen (2021) has clearly shown significant differences in linguistic complexity and text readability levels between these language subjects.

Figure 1 :
Figure 1: The distribution of machine translation evaluation scores.

Figure 2 :
Figure 2: The distribution of readability scores for the machine translated (mt) and post-edited data sets.

Table 2 :
An overview of Sesotho source texts

Table 3 :
An overview of translated texts

Table 4 :
Pearson correlations between the different scores for BLEU, RIBES, and NIST.

Table 5 :
Results of the t-tests for the machine translated and the post-edited texts indicating the means, standard deviations and p-values.