Automatic minimal diacritization of Arabic texts

  • Rehab Alnefaie
  • Aqil M. Azmi

Research output: Contribution to journal › Conference article › peer-review

17 Scopus citations

Abstract

Modern Standard Arabic (MSA) is typically written without short vowels, even though these help clarify the sense and meaning of a word. The short vowels are omitted because experienced Arabic readers can infer the meaning from context. There are, however, cases of ambiguity that even native Arabic speakers cannot resolve. The process of restoring the diacritical marks (short vowels) is known as diacritization. Most existing diacritization algorithms fully restore all the markings, many of which are trivial or unnecessary. In this paper, we present a system that restores diacritical markings only where they are most needed, to resolve ambiguity. This is a more challenging problem than fully restoring all the diacritics. The system combines morphological analyzers and context similarities. The morphological analyzers generate all diacritized candidates for each word, and the model then eliminates word ambiguity through a statistical approach and context similarities. Out of 80 paragraphs, our system resolved 57 cases.
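
The pipeline described in the abstract (generate candidates, then disambiguate by context, and mark only the ambiguous words) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the rough Latin transliteration, the function names, and the use of Jaccard overlap as the "context similarity" are all assumptions made purely for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon standing in for a morphological analyzer.
# Keys are undiacritized words (rough Latin transliteration); values map
# each diacritized candidate to context words it tends to co-occur with.
LEXICON: Dict[str, Dict[str, Set[str]]] = {
    "ktb": {
        "kataba": {"alwld", "rsAlp"},  # "he wrote"
        "kutub": {"mktbp", "qrA"},     # "books"
    },
    "fy": {"fy": set()},               # single candidate: not ambiguous
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minimally_diacritize(words: List[str]) -> List[str]:
    """Diacritize only words that have more than one candidate reading."""
    result = []
    for i, word in enumerate(words):
        candidates = LEXICON.get(word, {})
        if len(candidates) < 2:
            # Unknown or unambiguous word: leave it unmarked.
            result.append(word)
            continue
        # Context is every other word in the sentence.
        context = set(words[:i] + words[i + 1:])
        # Pick the candidate whose typical context best matches this one.
        best = max(candidates, key=lambda c: jaccard(candidates[c], context))
        result.append(best)
    return result
```

For example, `minimally_diacritize(["alwld", "ktb", "rsAlp"])` leaves the first and last words untouched and replaces only the ambiguous `ktb` with the candidate whose associated context overlaps most with the sentence.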

Original language: English
Pages (from-to): 169-174
Number of pages: 6
Journal: Procedia Computer Science
Volume: 117
State: Published - 2017
Externally published: Yes
Event: 3rd International Conference on Arabic Computational Linguistics, ACLing 2017 - Dubai, United Arab Emirates
Duration: 5 Nov 2017 to 6 Nov 2017

Keywords

  • Ambiguity
  • Arabic language
  • Automatic vowelization
  • Diacritization
  • Morphological analysis
  • Statistical methods
