Preprocessing Arabic text on social media

Mohamed Osman Hegazi; Yasser Al-Dossari; Abdullah Al-Yahy; Abdulaziz Al-Sumari; Anwer Hilal

doi:10.1016/j.heliyon.2021.e06191

Computer Sciences

Prince Sattam Bin Abdulaziz University

Research output: Contribution to journal › Article › peer-review

68 Scopus citations

Abstract

Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presented the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
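
The cleaning, enrichment, and availability stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual rules, features, or schema, so everything here is an assumption made for illustration: the regular expressions, the function names `clean_arabic`, `enrich`, and `store`, the two example features, and the `posts` table layout are all hypothetical and use only the Python standard library.

```python
import re
import sqlite3

# Hypothetical cleaning rules (assumptions, not the paper's rule set).
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # links
MENTION_RE = re.compile(r"@\w+")                   # user mentions
# Arabic diacritics (U+064B to U+0652) plus tatweel (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")
ELONGATION_RE = re.compile(r"(.)\1{2,}")           # same character 3+ times
WHITESPACE_RE = re.compile(r"\s+")


def clean_arabic(text: str) -> str:
    """Cleaning stage: strip links, mentions, diacritics, and elongation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = ELONGATION_RE.sub(r"\1", text)          # collapse the run to one char
    return WHITESPACE_RE.sub(" ", text).strip()


def enrich(cleaned: str) -> dict:
    """Enrichment stage: derive two simple illustrative features."""
    return {"token_count": len(cleaned.split()), "char_count": len(cleaned)}


def store(conn: sqlite3.Connection, raw: str, cleaned: str) -> int:
    """Availability stage: save the text and its features in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, raw_text TEXT, clean_text TEXT, "
        "token_count INTEGER, char_count INTEGER)"
    )
    feats = enrich(cleaned)
    cur = conn.execute(
        "INSERT INTO posts (raw_text, clean_text, token_count, char_count) "
        "VALUES (?, ?, ?, ?)",
        (raw, cleaned, feats["token_count"], feats["char_count"]),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")             # throwaway in-memory database
    raw = "@user \u0645\u0631\u062d\u0628\u0627\u0627\u0627\u0627 http://t.co/x"
    row_id = store(conn, raw, clean_arabic(raw))
    print(conn.execute("SELECT * FROM posts WHERE id = ?", (row_id,)).fetchone())
```

Collapsing every run of three or more identical characters to a single character is a deliberately blunt choice that can damage legitimate words; a production pipeline would likely need language-aware rules and a richer feature set than this sketch assumes.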

Original language	English
Article number	e06191
Journal	Heliyon
Volume	7
Issue number	2
DOIs	https://doi.org/10.1016/j.heliyon.2021.e06191
State	Published - Feb 2021

Keywords

Arabic text
Data analysis
Database
Document and text processing
Information extraction
Information retrieval
Knowledge discovery
Natural language processing
Sentiment analysis

Cite this

@article{0601ba65bd0f4652a4c36d9b7415188a,

title = "Preprocessing Arabic text on social media",

abstract = "Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presented the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.",

keywords = "Arabic text, Data analysis, Database, Document and text processing, Information extraction, Information retrieval, Knowledge discovery, Natural language processing, Sentiment analysis",

author = "Hegazi, {Mohamed Osman} and Yasser Al-Dossari and Abdullah Al-Yahy and Abdulaziz Al-Sumari and Anwer Hilal",

note = "Publisher Copyright: {\textcopyright} 2021 The Authors",

year = "2021",

month = feb,

doi = "10.1016/j.heliyon.2021.e06191",

language = "English",

volume = "7",

journal = "Heliyon",

issn = "2405-8440",

publisher = "Elsevier Ltd",

number = "2",

}

TY - JOUR

T1 - Preprocessing Arabic text on social media

AU - Hegazi, Mohamed Osman

AU - Al-Dossari, Yasser

AU - Al-Yahy, Abdullah

AU - Al-Sumari, Abdulaziz

AU - Hilal, Anwer

PY - 2021/2

Y1 - 2021/2

N2 - Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presented the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.

AB - Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presented the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.

KW - Arabic text

KW - Data analysis

KW - Database

KW - Document and text processing

KW - Information extraction

KW - Information retrieval

KW - Knowledge discovery

KW - Natural language processing

KW - Sentiment analysis

UR - https://www.scopus.com/pages/publications/85100702366

U2 - 10.1016/j.heliyon.2021.e06191

DO - 10.1016/j.heliyon.2021.e06191

M3 - Article

AN - SCOPUS:85100702366

SN - 2405-8440

VL - 7

JO - Heliyon

JF - Heliyon

IS - 2

M1 - e06191

ER -
