Multimodal Classification System for Hausa Using LLMs and Vision Transformers

Authors

  • Ali Mijiyawa
  • Fatiha Sadat

DOI:

https://doi.org/10.55492/dhasa.v6i02.6737

Keywords:

Hausa, multimodal data, question-answering, Transformer, vision

Abstract

This paper presents a classification-based Visual Question Answering (VQA) system for the Hausa language, integrating Large Language Models (LLMs) and vision transformers. By fine-tuning LLMs on monolingual Hausa text and fusing their representations with those of state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Experiments conducted on the HaVQA dataset, under offline text–image augmentation regimes tailored to the characteristics of Hausa as a low-resource language, show that this augmentation strategy outperforms the baseline, achieving 35.85% accuracy, 35.89% Wu-Palmer similarity, and 15.32% F1-score.
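
To make the fuse-and-classify setup concrete, below is a minimal sketch in PyTorch of the kind of head the abstract describes: pooled question and image embeddings are projected to a shared size, fused, and mapped to logits over a fixed answer vocabulary. The class name, dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's exact encoders or configuration.

```python
import torch
import torch.nn as nn

class FusionVQAClassifier(nn.Module):
    """Hypothetical sketch of a classification-based VQA head.

    Text and image embeddings are projected to a shared hidden size,
    concatenated, and classified over a fixed answer vocabulary.
    All dimensions below are illustrative assumptions.
    """

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_answers),  # one logit per candidate answer
        )

    def forward(self, text_emb, image_emb):
        # text_emb: (batch, text_dim) pooled question embedding from the LLM
        # image_emb: (batch, image_dim) pooled embedding from the vision transformer
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.classifier(fused)  # (batch, num_answers) logits

# Usage: pick the most likely answer index for one question-image pair.
model = FusionVQAClassifier()
logits = model(torch.randn(1, 768), torch.randn(1, 768))
answer_idx = logits.argmax(dim=-1)
```

Training such a head with cross-entropy over the fixed answer set turns open-ended VQA into ordinary multi-class classification, which is the framing the abstract adopts.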

Published

2025-12-31

Issue

Vol. 6 No. 2 (2025)

Section

Articles

How to Cite

Mijiyawa, A., & Sadat, F. (2025). Multimodal Classification System for Hausa Using LLMs and Vision Transformers. Journal of the Digital Humanities Association of Southern Africa (DHASA), 6(2). https://doi.org/10.55492/dhasa.v6i02.6737