Exploring ASR fine-tuning on limited domain-specific data for low-resource languages

The majority of South Africa's twelve official languages are low-resourced, posing a major challenge to Automatic Speech Recognition (ASR) development. Modern ASR systems require an extensive amount of data that is extremely difficult to find for low-resourced languages. In addition, available speech and text corpora for these languages predominantly revolve around government, political and biblical content. Consequently, this hinders the ability of ASR systems developed for these languages to perform well, especially when evaluated on data outside of these domains. To alleviate this problem, the Icefall Kaldi II toolkit introduced new transformer model scripts, facilitating the adaptation of pre-trained models using limited adaptation data. In this paper, we explored the technique of using pre-trained ASR models from a domain where more data is available (government data) and adapting them to an entirely different domain with limited data (broadcast news data). The objective was to assess whether such techniques can surpass the accuracy of prior ASR models developed for these languages. Our results showed that the Conformer connectionist temporal classification (CTC) model obtained lower word error rates by a large margin in comparison to previous TDNN-F models evaluated on the same datasets. This research signifies a step forward in mitigating data scarcity challenges and enhancing ASR performance for low-resourced languages in South Africa.


Introduction
Automatic Speech Recognition, the technology that converts spoken language into text, is becoming increasingly integrated into our daily lives. ASR nowadays plays a pivotal role in powering voice assistants, transcription and dictation services, and most importantly has become a transformative tool that can enable communities to communicate with devices in several languages. However, the development of ASR has been largely skewed towards high-resourced languages that often cater for Western cultures, creating an accessibility gap for communities communicating in low-resourced languages. In South Africa, this divide is pronounced: 10 of the 12 official languages of South Africa are low-resourced and face a critical shortage of labelled data for training effective ASR models Badenhorst & de Wet (2022). In addition, existing speech and text corpora for these languages are primarily centred around government, political and biblical content. This limited coverage leads to high error rates when attempting to transcribe content in domains outside of this scope. Our research aims to tackle this challenge by exploring methodologies where we can leverage existing data, using adaptation and fine-tuning techniques, to increase accuracy on out-of-domain content, which will in turn contribute to creating new domain-specific labelled datasets. For example, radio broadcast data is a largely untapped resource in South Africa that offers a dynamic and ever-growing pool of audio content in most of South Africa's languages. Leveraging this resource can pave the way for enhanced ASR models specifically catered to radio broadcast news.
In this work, we have focused on three South African languages, namely Afrikaans (Afr), isiZulu (Zul) and Sesotho (Sot). Each of these languages belongs to a different language family in South Africa: Afrikaans is a West Germanic language, isiZulu is a Nguni language and Sesotho is one of the Sotho languages. Since there are distinct acoustic and linguistic differences between these languages, evaluating them will provide meaningful insights into whether the techniques explored in this paper would yield high-quality transcriptions when applied to the remaining low-resourced languages in South Africa, which generally fall under the Nguni and Sotho language families.

Previous Work
The feasibility of harvesting radio broadcast speech data for the development of ASR systems for South African languages was first evaluated in Badenhorst & de Wet (2021). In this work, a semi-automatic data harvesting procedure was proposed. Factorised time-delay neural network (TDNN-F) models were used to generate phone-level transcriptions of speech data harvested from different domains. The results showed that, when evaluating on speech data from a new domain for Afrikaans, phone error rates (PERs) of approximately 20% were measured. At these PERs, follow-up experiments confirmed a potential word transcription error rate within the thirties for news data (Badenhorst & de Wet 2023). Transcription error rates in the other languages, however, measured higher, and so for the purpose of creating correctly-annotated speech data the first publication already concluded that better acoustic modelling techniques needed to be sought. Badenhorst & de Wet (2023) further investigated the possibility of creating diverse speech resources from unannotated radio broadcast data using a refined version of the semi-automatic data harvesting procedure proposed in Badenhorst & de Wet (2021). The work focused on the adaptation of ASR models in two domains within broadcast data, namely news bulletins and radio dramas. Baseline models were trained using NCHLT (Barnard et al. 2014) data, which mostly includes short read speech utterances. Adapting these baseline models to news and drama data, as expected, resulted in more transcription errors. The main reason for this is that the news data differs significantly from NCHLT speech data in that it contains much longer utterances. The radio drama data differed significantly from NCHLT data in speaking style, as drama data corresponds more to conversational speech than to the read speech found in the NCHLT data. In addition, the drama data contained background noise. However, improvements in error rates were achieved by adapting acoustic models with less than 10 hours of manually annotated data from the same domain, for speaking styles and acoustic conditions that are not represented in any of the existing speech corpora.

Training Data
The National Centre for Human Language Technology (NCHLT) speech corpus is the largest publicly available corpus of text and speech data for South African languages, comprising approximately 55 hours of short-form audio segments from approximately 200 speakers in each of the 11 official written languages of South Africa (Barnard et al. 2014). This dataset was used to train the baseline models in this work, referred to as pre-trained models. The dataset consists of mono-channel audio files with a sampling frequency of 16 kHz at a 16-bit rate. The speech data contains clear, read speech prompts with minimal background noise. Table 1 summarizes the audio statistics of the NCHLT speech corpus.

Adaptation Data
The broadcast news dataset used as adaptation data for fine-tuning in this work was obtained as part of an ongoing data harvesting project to record new audio data from South African radio stations. The corresponding transcriptions for this data were automatically generated using the data harvesting procedure proposed in Badenhorst & de Wet (2023), resulting in datasets comprising approximately 150 hours (for each language) of transcribed audio for six South African languages with coverage of news bulletins. The six languages are: Afrikaans, isiZulu, Sesotho, isiXhosa, Tshivenda, and Sesotho sa Leboa.
A limited subset of these automatically generated transcriptions was manually verified to correct any transcription errors present, which resulted in 15 hours of verified labelled audio data per language.
During manual verification, inaccuracies in each transcription were marked within square brackets. For example, all unintelligible words were marked with "[?]", all words from other languages were marked with "['foreign'+transcription word]" and all partial or incomplete words were marked as word fragments. After manual verification, an analysis was performed on the output transcriptions to compile a list of all the transcribed inaccuracies present in the dataset. The results showed that the radio news readers frequently use English words in the broadcast, otherwise known as code-switching. Since the model development in this work focused only on language-specific models comprising a single language source, only the audio segments for which the corresponding transcriptions contained no inaccuracies were used. In this way, the adaptation data did not contain any foreign or unintelligible words. Table 2 summarizes the audio statistics for the broadcast news adaptation dataset.
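The filtering step described above can be sketched as a simple check for bracketed annotation markers. This is an illustrative sketch, not the project's actual tooling: the function names and the assumption that every inaccuracy marker appears inside square brackets follow the marking convention described above.

```python
import re

# Any square-bracketed annotation ("[?]" for unintelligible words,
# "['foreign'+word]" for code-switched words, bracketed word fragments)
# flags an inaccuracy, so a segment is kept only if its transcription
# contains no bracketed markers at all.
INACCURACY = re.compile(r"\[[^\]]*\]")

def is_clean(transcription: str) -> bool:
    """Return True if the transcription contains no inaccuracy markers."""
    return INACCURACY.search(transcription) is None

def filter_segments(segments):
    """Keep only (utterance_id, transcription) pairs without markers."""
    return [(uid, text) for uid, text in segments if is_clean(text)]
```

Applied to the verified transcriptions, this yields the marker-free subset used as adaptation data.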

Model Selection
Kaldi is an open-source toolkit for speech recognition and signal processing. In Badenhorst & de Wet (2019), TDNN-F models were trained following recipes from the Kaldi toolkit (Povey et al. 2011). The Kaldi toolkit has since been updated to Kaldi 2 (K2).

Conformer CTC 3:
A Conformer that implements a streaming mode supporting a symbol delay penalty. The delay penalty is a method that lowers streaming latency while having minimal impact on recognition accuracy (Kang et al. 2022).
Models for each of the new versions of the Conformer CTC model were trained using the same NCHLT isiZulu dataset. Across the three models compared, the Conformer CTC 2 performed the best. The same model selection process was repeated for the NCHLT Afrikaans dataset, and the same trend as for isiZulu emerged. For this reason, the Conformer CTC 2 model was chosen for the work that followed. To commence fine-tuning, the aim was to first obtain a pre-trained NCHLT baseline model that was most optimised for the broadcast news domain, so the test set used to evaluate the performance of each baseline epoch was one hour of broadcast news data. The epoch with the lowest PER was then selected. During fine-tuning, these pre-trained baseline models were further trained using the broadcast news adaptation data until the model converged again.
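The epoch selection step above reduces to decoding each epoch's checkpoint on the one-hour broadcast news test set and keeping the epoch with the lowest PER. A minimal sketch follows; `decode_and_score` stands in for the actual icefall decoding and scoring run and is an assumption of this illustration.

```python
def select_best_epoch(epochs, decode_and_score):
    """Decode each epoch checkpoint on the held-out news set and
    return (best_epoch, best_per), where best_per is the lowest PER."""
    scores = {epoch: decode_and_score(epoch) for epoch in epochs}
    best_epoch = min(scores, key=scores.get)
    return best_epoch, scores[best_epoch]
```

The selected checkpoint then serves as the starting point for fine-tuning on the adaptation data.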

Test data
The news adaptation dataset given in Table 2 is small. Reserving a portion of the already limited dataset for use as a test set might not serve as a good indicator of the model's actual performance. Therefore, 5-fold cross-validation was used to evaluate the performance of the Conformer CTC 2 model after fine-tuning. The news adaptation data was randomly split into 5 equal subsets. The NCHLT baseline model was fine-tuned through additional training on four of the subsets, with the fifth subset serving as the test set. The model was evaluated, and then the fine-tuning procedure was restarted with the same NCHLT baseline but with a different subset used as the test set. This process was repeated until each subset had served as the test set. The average recognition result over the five models served as the overall performance metric of the model.
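The cross-validation protocol above can be sketched as follows. This is a minimal illustration of the fold logic only: `fine_tune_and_score` is a stand-in for the actual fine-tuning and decoding pipeline, and the key detail is that every fold restarts from the same pre-trained baseline rather than continuing from the previous fold.

```python
import random

def five_fold_scores(utterances, fine_tune_and_score, seed=0):
    """Split the adaptation data into 5 equal subsets; for each fold,
    fine-tune the (same) baseline on 4 subsets, evaluate on the 5th,
    and return the average score over the five folds."""
    data = list(utterances)
    random.Random(seed).shuffle(data)
    k = 5
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [u for j, fold in enumerate(folds) if j != i for u in fold]
        scores.append(fine_tune_and_score(train_set, test_set))
    return sum(scores) / k
```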
A separate, smaller dataset of broadcast news data was also available for model evaluation. This is an older news dataset that was used in Badenhorst & de Wet (2023) to evaluate the performance of previous TDNN-F models. The source of this dataset is the same as that of the news adaptation data described in Table 2, but there is no overlap. In this paper, this older test set is only used to compare the results of the new Conformer CTC 2 models against previously obtained TDNN-F results.

Baseline Models
To create a pre-trained NCHLT baseline model, the Conformer CTC 2 recipe was applied to the NCHLT training data. Table 5 shows the best recognition performance obtained for each baseline model. All word error rates and phone error rates in this paper were obtained from a 1-best decode, which extracts the best path from the decoding lattice as the decoding result.
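For reference, the WER reported on each 1-best hypothesis is the standard word-level Levenshtein distance normalised by the reference length; the same routine yields PER when applied to phone sequences instead of words. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```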
During training it was observed that the number of epochs required for the NCHLT models to converge on broadcast news recognition differed per language. The model for Afrikaans optimised after only 6 epochs, while isiZulu and Sesotho both required 17 epochs before the model started to over-fit towards NCHLT.

Fine-tuned Models
The best performing epoch from pre-training shown in Table 5 was chosen for further training by fine-tuning it on the broadcast news adaptation dataset for another 25 epochs. As mentioned in Section 3.3.1, a 5-fold cross-validation strategy was employed to evaluate the performance of the model after fine-tuning on the broadcast news domain. Table 6 shows the best performance of each fine-tuned model evaluated on its respective test subset.
From the results presented in Table 6, it is evident that PER and WER improved after fine-tuning the baseline model on the broadcast news adaptation data.The standard deviation for PER and WER from each of the five subsets is small, which indicates that the model performance is consistent and stable.
In Table 7 below, we compare the PER and WER of the TDNN-F model trained in previous work (Badenhorst & de Wet 2023) against the fine-tuned Conformer CTC 2 model. The test set used here is the older news dataset described in Table 4 above. The Conformer CTC 2 models from Subset 2 of the 5-fold cross-validation set from Table 6 were used for evaluation, because they achieved low PERs for the three languages.

Impact of domain specific data
The experimental results above showed that fine-tuning a baseline NCHLT model on additional news adaptation data resulted in improved recognition when evaluated on broadcast news data. However, during pre-training ASR models learn the characteristics of the NCHLT domain, and thus perform poorly on unseen data outside of it. It was also observed that the model converged fastest for Afrikaans, whilst for isiZulu and Sesotho the model took longer to converge.
We hypothesise that there could be a number of reasons for the differences. First and foremost, there are morphological differences between the languages themselves. Niesler et al. (2005) performed an analysis of the phone sets and phone transcriptions of four South African languages (Afrikaans, English, isiZulu and isiXhosa), and found that isiZulu was substantially more phonetically complex and diverse than Afrikaans. They concluded that speech recognition may be expected to be intrinsically more difficult for isiZulu and isiXhosa compared to Afrikaans and English, which could be why the Conformer CTC 2 model was able to learn the acoustic properties of Afrikaans more easily.
From previous work it was also evident that acoustic motivations exist for ASR performance differences given the NCHLT speech corpora. The TDNN-F 1d model analysis in Badenhorst & de Wet (2019) showed that isiZulu and Sesotho NCHLT test PERs were approximately double that of Afrikaans. In fact, the Sesotho PER was the second highest of all the languages. Therefore, acoustically, the NCHLT training data for isiZulu and Sesotho did not seem to be as effective even for in-domain NCHLT ASR compared to Afrikaans. It is not entirely clear why the error rates were so different, but in addition, as shown in Table 1, isiZulu and Sesotho NCHLT training segments have a longer average duration than Afrikaans. The lengthier segments could, for instance, include additional acoustic variability such as various background effects.
To improve transcription accuracy, we investigated the ability of the model to generalise to other domains by exposing it to a limited amount of adaptation data. Thus, we fine-tuned the baseline models by training them further on the broadcast news adaptation data from which our test data is derived. Due to the limited size of this adaptation data, a 5-fold cross-validation method was utilised to evaluate the overall performance of the Conformer CTC 2 model. The results in Table 6 showed that improvements in the recognition metrics with low standard deviations were achieved across all three languages. The model from Subset 2 was selected for further evaluation on an older broadcast news test set in Table 7, and showed improved recognition performance over previous TDNN-F model results from Badenhorst & de Wet (2023).
An alternative explanation for the improved recognition metrics after fine-tuning could simply be that more training data was introduced to the baseline model. To investigate the impact of domain-specific data, the experiment described in Section 4.3 was performed. Creating a 38-hour baseline NCHLT model and adapting it with 15 hours of in-domain and out-of-domain data showed that introducing the additional 15 hours of out-of-domain NCHLT data had a very small impact on the recognition metrics of the news test set. However, introducing 15 hours of in-domain news data successfully halved the error rates.

Conclusion
Developing ASR models for low-resource languages in South Africa is faced with challenges arising from the scarcity of data. It is even more challenging to develop ASR systems that are fit for purpose in domains outside the ones in which existing data resides. The work presented in this paper explored methodologies to address this challenge, specifically investigating the viability of fine-tuning baseline models, initially trained on existing data for our low-resource languages, with limited adaptation data in targeted domains, with the aim of enhancing transcription accuracy.
In this pursuit, radio broadcast news data emerged as a valuable audio resource. The acquisition of correctly annotated transcriptions for this data proves to be a laborious and resource-intensive undertaking. By being able to generate accurate transcriptions of radio broadcast data, we anticipate a significantly positive impact on data harvesting endeavours in the future, as this in turn contributes to the development of automatically transcribed labelled data that can support continuous advancements in ASR in the country.
The new modelling techniques presented in this work, such as Conformer CTC 2 together with additional fine-tuning on limited adaptation data, have demonstrated promising outcomes in reducing transcription errors on radio broadcast data. This marks a meaningful stride towards successfully generating more accurate labelled corpora in our local languages, signifying progress in the ongoing development of localising ASR to our South African languages.

Table 1 :
Summary statistics for NCHLT data

Table 2 :
Summary statistics for broadcast news data after removing all segments containing inaccuracies

Table 3 :
Model hyperparameters during training

Table 4 :
Audio statistics of the older news test set

Table 5 :
Best phone-error-rate and word-error-rate for each baseline model evaluated on one hour of broadcast news test set