Exploring ASR fine-tuning on limited domain-specific data for low-resource languages
DOI:
https://doi.org/10.55492/dhasa.v5i1.5024

Keywords:
automatic speech recognition, fine-tuning, low-resource languages, data harvesting, broadcast news data

Abstract
The majority of South Africa’s eleven official languages are low-resourced, posing a major challenge to automatic speech recognition (ASR) development. Modern ASR systems require extensive amounts of data that are extremely difficult to find for low-resourced languages. In addition, the available speech and text corpora for these languages predominantly revolve around government, political, and biblical content. Consequently, ASR systems developed for these languages struggle to perform well, especially when evaluated on data outside these domains. To alleviate this problem, the Icefall (Kaldi II) toolkit introduced new transformer model scripts that facilitate the adaptation of pre-trained models using limited adaptation data. In this paper, we explored pre-training ASR models on a domain where more data is available (government data) and adapting them to an entirely different domain with limited data (broadcast news data). The objective was to assess whether such techniques can surpass the accuracy of prior ASR models developed for these languages. Our results showed that the Conformer connectionist temporal classification (CTC) model obtained word error rates that were lower by a large margin than those of previous TDNN-F models evaluated on the same datasets. This research marks a step forward in mitigating data scarcity challenges and enhancing ASR performance for low-resourced languages in South Africa.
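The adaptation recipes themselves live in the icefall scripts; purely as an illustration of the warm-start-and-adapt pattern the abstract describes, the sketch below shows the general idea in plain PyTorch. The model class, checkpoint path, and synthetic batch are hypothetical placeholders and do not reproduce the paper's actual Conformer CTC setup.

```python
import torch
import torch.nn as nn

# Stand-in acoustic model; a real recipe would use icefall's Conformer encoder.
class AcousticModel(nn.Module):
    def __init__(self, num_features=80, num_tokens=500):
        super().__init__()
        self.encoder = nn.LSTM(num_features, 256, num_layers=2, batch_first=True)
        self.output = nn.Linear(256, num_tokens)

    def forward(self, feats):
        hidden, _ = self.encoder(feats)
        return self.output(hidden).log_softmax(dim=-1)

model = AcousticModel()
# Warm-start from the source-domain (government data) checkpoint
# (illustrative path; uncomment when such a checkpoint exists):
# model.load_state_dict(torch.load("pretrained_government.pt"))

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# Low learning rate: with limited adaptation data we adapt rather than retrain.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Synthetic stand-in for one small broadcast-news batch: 4 utterances of
# 200 feature frames each, with 20-token target transcripts.
feats = torch.randn(4, 200, 80)
feat_lens = torch.full((4,), 200, dtype=torch.long)
tokens = torch.randint(1, 500, (4, 20))   # token 0 is reserved for the CTC blank
token_lens = torch.full((4,), 20, dtype=torch.long)

log_probs = model(feats)                    # (batch, time, tokens)
loss = ctc_loss(log_probs.transpose(0, 1),  # CTCLoss expects (time, batch, tokens)
                tokens, feat_lens, token_lens)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the actual experiments this loop would run over the full limited broadcast-news set, with the adapted model then scored by word error rate against the TDNN-F baselines.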
License
Copyright (c) 2024 Franco Mak, Avashna Govender, Jaco Badenhorst
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.