CIDAR: Culturally Relevant Instruction Dataset For Arabic

  • Zaid Alyafeai
  • Khalid Almubarak
  • Ahmed Ashraf
  • Deema Alnuhait
  • Saied Alshahrani
  • Gubran A.Q. Abdulrahman
  • Gamil Ahmed
  • Qais Gawah
  • Zead Saleh
  • Mustafa Ghaleb
  • Yousef Ali
  • Maged S. Al-Shaibani

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

4 Scopus citations

Abstract

Instruction tuning has emerged as a prominent methodology for teaching Large Language Models (LLMs) to follow instructions. However, current instruction datasets predominantly cater to English or are derived from English-dominated LLMs, leading to inherent biases toward Western culture. This bias negatively impacts non-English languages such as Arabic and the unique culture of the Arab region. This paper addresses this limitation by introducing CIDAR, the first open Arabic instruction-tuning dataset culturally aligned by native Arabic speakers. CIDAR contains 10,000 instruction and output pairs that represent the Arab region. We discuss the cultural relevance of CIDAR through analysis and comparison with a few models fine-tuned on other datasets. Our experiments indicate that models fine-tuned on CIDAR achieve better cultural alignment than those fine-tuned on 30x more data. The dataset is available on HuggingFace at https://huggingface.co/datasets/arbml/CIDAR.

Original language: English
Title of host publication: 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics (ACL)
Pages: 12878-12901
Number of pages: 24
ISBN (Electronic): 9798891760998
State: Published - 2024
Externally published: Yes
Event: Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand
Duration: 11 Aug 2024 to 16 Aug 2024

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/Territory: Thailand
City: Hybrid, Bangkok
Period: 11/08/24 to 16/08/24