An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

Abdul Jabbar; Sajid Iqbal; Manzoor Ilahi Tamimy; Amjad Rehman; Saeed Ali Bahaj; Tanzila Saba

doi:10.1109/ACCESS.2023.3332710


Management Information Systems

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

The exponential increase in unstructured digital text creates significant demand for advanced, smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP) tasks. Text stemming (TS), an important step, can significantly improve performance in such systems. Text-stemming methods developed to date leave room for improvement and can produce errors of different types, degrading the performance of the applications in which they are used. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. It reviews the studied stemming algorithms along multiple dimensions: methodology, data source, performance, and evaluation methods. The stemmers chosen for this study can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid. The study shows that linguistic knowledge-based stemming techniques were widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications that use stemming methods. Finally, we list future work directions for interested researchers.

Original language	English
Pages (from-to)	133681-133702
Number of pages	22
Journal	IEEE Access
Volume	11
DOIs	https://doi.org/10.1109/ACCESS.2023.3332710
State	Published - 2023

Keywords

Text stemming
information retrieval (IR) systems
natural language processing (NLP)
stemmer evaluation
technological development
text classification

Access to Document

10.1109/ACCESS.2023.3332710

Cite this

@article{f9c0e6208e2d4ece9e8021ded5b803bd,

title = "An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems",

abstract = "The exponential increase in textual unstructured digital data creates significant demand for advanced and smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP). Text stemming (TS), an important step, can significantly improve performance in such systems. Text-stemming methods developed till now could be better in their results and can produce errors of different types leading to degraded performance of the applications in which these are used. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. The work presents a multidimensional review of studied stemming algorithms i.e., methodology, data source, performance, and evaluation methods. For this study, we have chosen different stemmers, which can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid stemmers. The study shows that linguistic knowledge-based stemming techniques were widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications utilizing stemming methods. At the end, we list the future work directions for interested researchers.",

keywords = "Text stemming, information retrieval (IR) systems, natural language processing (NLP), stemmer evaluation, technological development, text classification",

author = "Abdul Jabbar and Sajid Iqbal and Tamimy, {Manzoor Ilahi} and Amjad Rehman and Bahaj, {Saeed Ali} and Tanzila Saba",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2023",

doi = "10.1109/ACCESS.2023.3332710",

language = "English",

volume = "11",

pages = "133681--133702",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

AU - Jabbar, Abdul

AU - Iqbal, Sajid

AU - Tamimy, Manzoor Ilahi

AU - Rehman, Amjad

AU - Bahaj, Saeed Ali

AU - Saba, Tanzila

PY - 2023

Y1 - 2023

N2 - The exponential increase in textual unstructured digital data creates significant demand for advanced and smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP). Text stemming (TS), an important step, can significantly improve performance in such systems. Text-stemming methods developed till now could be better in their results and can produce errors of different types leading to degraded performance of the applications in which these are used. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. The work presents a multidimensional review of studied stemming algorithms i.e., methodology, data source, performance, and evaluation methods. For this study, we have chosen different stemmers, which can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid stemmers. The study shows that linguistic knowledge-based stemming techniques were widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications utilizing stemming methods. At the end, we list the future work directions for interested researchers.

KW - Text stemming

KW - information retrieval (IR) systems

KW - natural language processing (NLP)

KW - stemmer evaluation

KW - technological development

KW - text classification

UR - https://www.scopus.com/pages/publications/85177056493

U2 - 10.1109/ACCESS.2023.3332710

DO - 10.1109/ACCESS.2023.3332710

M3 - Article

AN - SCOPUS:85177056493

SN - 2169-3536

VL - 11

SP - 133681

EP - 133702

JO - IEEE Access

JF - IEEE Access

ER -
