Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts

  • Aqil M. Azmi
  • Rehab M. Alnefaie
  • Hatim A. Aboalsamh

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Diacritic restoration (also known as diacritization or vowelization) is the process of inserting the correct diacritical markings into a text. Modern Arabic text, such as newspaper copy, is typically written without diacritics. This lack of diacritical markings often causes ambiguity, and although native speakers are adept at resolving it, they occasionally fail. Diacritic restoration is a classical problem in computer science. Most prior work tackles the full (heavy) diacritization of text; we are instead interested in diacritizing text with fewer diacritics. Studies have shown that a fully diacritized text is visually displeasing and slows reading. This article proposes a system that diacritizes homographs using the least number of diacritics, hence the name "light." A large class of words falls under the homograph category; we deal with the class of words that share a spelling but not a meaning. With fewer diacritics, we do not expect any effect on reading speed, while eye strain is reduced. The system combines a morphological analyzer with context similarities. The morphological analyzer generates all candidate diacritizations of a word. Then, through a statistical approach and context similarities, we resolve the homographs. Experimentally, the system shows very promising results, and our best accuracy is 85.6%.
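
The pipeline described in the abstract (generate candidates, pick one from context, then mark it with as few diacritics as possible) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy lexicon `CANDIDATES` stands in for a morphological analyzer, the `CONTEXT` profiles and the cosine score are assumed stand-ins for the statistical context-similarity step, and the subset search in `lighten` is only one possible reading of "the least number of diacritics." All names and data here are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Arabic combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}
FATHA, KASRA, SUKUN = "\u064E", "\u0650", "\u0652"

# Two readings of the bare spelling "علم" (hypothetical toy lexicon).
ILM = "ع" + KASRA + "ل" + SUKUN + "م"    # 'ilm, "knowledge"
ALAM = "ع" + FATHA + "ل" + FATHA + "م"   # 'alam, "flag"
CANDIDATES = {"علم": [ILM, ALAM]}

# Assumed context profiles: words that tend to co-occur with each reading.
CONTEXT = {
    ILM: Counter({"طالب": 2, "جامعة": 2, "كتاب": 1}),
    ALAM: Counter({"دولة": 2, "رفع": 2, "ساحة": 1}),
}

def strip_marks(word):
    """Remove all diacritics from a word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def split_marks(word):
    """Split a diacritized word into [letter, marks] pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return pairs

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[k] * b[k] for k in a)  # a Counter returns 0 for missing keys
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def resolve(word, context_words):
    """Pick the candidate whose context profile best matches the sentence."""
    ctx = Counter(context_words)
    return max(CANDIDATES[word], key=lambda c: cosine(ctx, CONTEXT[c]))

def lighten(chosen, candidates):
    """Keep the smallest set of the chosen reading's diacritics that still
    differs from every rival reading at one of the kept positions.
    Assumes all candidates share the same bare spelling."""
    target = split_marks(chosen)
    rivals = [split_marks(c) for c in candidates if c != chosen]
    marked = [i for i, (_, marks) in enumerate(target) if marks]
    for size in range(len(marked) + 1):
        for keep in combinations(marked, size):
            if all(any(r[i][1] != target[i][1] for i in keep) for r in rivals):
                return "".join(
                    letter + (marks if i in keep else "")
                    for i, (letter, marks) in enumerate(target)
                )
    return chosen  # no subset separates the readings; keep full marking

sentence = ["رفع", "الجندي", "علم", "دولة"]
chosen = resolve("علم", sentence)
light = lighten(chosen, CANDIDATES["علم"])
print(light)
```

In this toy example the context words favour the "flag" reading, and a single fatha on the first letter is enough to separate it from the "knowledge" reading, so only that mark is kept. A word with a single candidate comes back with no diacritics at all.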

Original language: English
Article number: 60
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 21
Issue number: 3
State: Published - May 2022
Externally published: Yes

Keywords

  • Arabic language
  • automatic diacritization
  • disambiguation
  • homographs
  • morphological analysis
