Multimodal Classification System for Hausa Using LLMs and Vision Transformers
DOI: https://doi.org/10.55492/dhasa.v6i02.6737
Keywords: Hausa, multimodal data, question-answering, Transformer, vision
Abstract
This paper presents a classification-based Visual Question Answering (VQA) system for the Hausa language, integrating Large Language Models (LLMs) and vision transformers. By fine-tuning LLMs on monolingual Hausa text and fusing their representations with those of state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Experiments conducted on the HaVQA dataset, under offline text-image augmentation regimes tailored to the specificity of Hausa as a low-resource language, show that this augmentation strategy yields the best performance over the baseline, achieving 35.85% accuracy, 35.89% Wu-Palmer similarity, and 15.32% F1-score.
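The sketch below illustrates, under stated assumptions, the kind of fusion classifier the abstract describes: a text encoder (standing in for the Hausa-fine-tuned LLM) and a vision transformer whose pooled representations are concatenated and fed to a classification head over a fixed answer vocabulary. The model names, fusion by concatenation, and vocabulary size are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a classification-based VQA model with text-vision fusion.
# Backbone choices (xlm-roberta-base, ViT) and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel


class FusionVQAClassifier(nn.Module):
    def __init__(self,
                 text_model="xlm-roberta-base",                      # placeholder for a Hausa-fine-tuned LLM
                 vision_model="google/vit-base-patch16-224-in21k",   # placeholder vision transformer
                 num_answers=1000):                                  # size of the fixed answer vocabulary
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.vision_encoder = AutoModel.from_pretrained(vision_model)
        hidden = (self.text_encoder.config.hidden_size
                  + self.vision_encoder.config.hidden_size)
        # Concatenation fusion followed by a classifier over the answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_answers),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # [CLS]-token representations of the question and the image.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        image_feat = self.vision_encoder(
            pixel_values=pixel_values
        ).last_hidden_state[:, 0]
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)  # logits over the fixed answer set
```

Training such a model reduces VQA to standard cross-entropy classification, which matches the paper's framing of answer prediction over a fixed vocabulary.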
License
Copyright (c) 2025 Ali Mijiyawa, Fatiha Sadat

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.