TY - JOUR
T1 - Preprocessing Arabic text on social media
AU - Hegazi, Mohamed Osman
AU - Al-Dossari, Yasser
AU - Al-Yahy, Abdullah
AU - Al-Sumari, Abdulaziz
AU - Hilal, Anwer
N1 - Publisher Copyright: © 2021 The Authors
PY - 2021/2
Y1 - 2021/2
N2 - Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presents the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
AB - Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presents the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
KW - Arabic text
KW - Data analysis
KW - Database
KW - Document and text processing
KW - Information extraction
KW - Information retrieval
KW - Knowledge discovery
KW - Natural language processing
KW - Sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85100702366&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2021.e06191
DO - 10.1016/j.heliyon.2021.e06191
M3 - Article
AN - SCOPUS:85100702366
SN - 2405-8440
VL - 7
JO - Heliyon
JF - Heliyon
IS - 2
M1 - e06191
ER -