TY - GEN
T1 - Comparing Open Arabic Named Entity Recognition Tools
AU - Aldumaykhi, Abdullah
AU - Otai, Saad
AU - Alsudais, Abdulkareem
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The main objective of this paper is to compare and evaluate the performances of three open Arabic Named Entity Recognition (NER) tools: CAMeL, Hatmi, and Stanza. We collected a corpus consisting of 30 articles written in Modern Standard Arabic (MSA) and manually annotated all the entities of the person, organization, and location types at the article (document) level. Our results suggest a similarity between Stanza and Hatmi with the latter receiving the highest F1 score for the three entity types. However, CAMeL achieved the highest precision values for names of people and organizations. Following this, we implemented a 'merge' method that combined the results from the three tools and a 'vote' method that tagged named entities only when two of the three identified them as entities. Our results showed that merging achieved the highest overall F1 scores. Moreover, merging had the highest recall values while voting had the highest precision values for the three entity types. This indicates that merging is more suitable when recall is desired, while voting is optimal when precision is required. Finally, we collected a corpus of 21,635 articles related to COVID-19 and applied the merge and vote methods. Our analysis demonstrates the tradeoff between precision and recall for the two methods.
KW - Named Entity Recognition
KW - Natural Language Processing
KW - Platforms and Tools
KW - Software and Systems Reuse and Reusability
UR - https://www.scopus.com/pages/publications/85171854008
U2 - 10.1109/IRI58017.2023.00016
DO - 10.1109/IRI58017.2023.00016
M3 - Conference contribution
AN - SCOPUS:85171854008
T3 - Proceedings - 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science, IRI 2023
SP - 46
EP - 51
BT - Proceedings - 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science, IRI 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2023
Y2 - 4 August 2023 through 6 August 2023
ER -