Steering a Standard Arab Language Processing Model Towards Accurate Saudi Dialect Sentiment Analysis Using Generative AI

Sulaiman Aftan, Yu Zhuang, Ahmad O. Aseeri, Habib Shah

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Sentiment analysis (SA) is crucial for many NLP applications across various domains. While Arabic is one of the world's major languages, high-quality NLP models developed for standard Arabic often underperform on regional dialects like the Saudi Dialect (SD) due to a lack of SD-specific training data. This paper presents a novel approach to adapting a high-resource language model, AraBERT, for low-resource dialect sentiment analysis by combining minimal SD data collection with generative AI. In the absence of openly accessible SD datasets, we augmented a small amount of collected SD data with GPT-generated SD data to fine-tune AraBERT for sentiment analysis in SD. Our contributions include (1) demonstrating the feasibility of low-effort data collection of a low-resource dialect for adapting existing high-resource NLP models and (2) leveraging GPT-generated data to augment collected data to enhance a high-resource language model for sentiment classification in a low-resource dialect, achieving significant improvements over the pre-trained high-resource model. These two contributions imply a potentially replicable approach that can serve as a template for future research in other low-resource NLP tasks. This paper presents a promising solution for enhancing model performance in low-resource dialects and has implications for similar under-resourced languages.

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
Editors: Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5891-5900
Number of pages: 10
ISBN (Electronic): 9798350362480
State: Published - 2024
Event: 2024 IEEE International Conference on Big Data, BigData 2024 - Washington, United States
Duration: 15 Dec 2024 → 18 Dec 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024

Conference

Conference: 2024 IEEE International Conference on Big Data, BigData 2024
Country/Territory: United States
City: Washington
Period: 15/12/24 → 18/12/24

Keywords

  • Generative AI
  • NLP
  • Saudi Dialect
  • Sentiment Analysis
