Paraphrase detection for Urdu language text using fine-tune BiLSTM framework

Muhammad Ali Aslam; Khairullah Khan; Wahab Khan; Sajid Ullah Khan; Abdullah Albanyan; Shabbab Ali Algamdi

doi:10.1038/s41598-025-93260-6

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Automated paraphrase detection is crucial for natural language processing (NLP) applications like text summarization, plagiarism detection, and question-answering systems. Detecting paraphrases in Urdu text remains challenging due to the language’s complex morphology, distinctive script, and lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address Urdu paraphrase detection’s intricacies. Our approach employs word embeddings and text preprocessing techniques like tokenization, stop-word removal, and label encoding to effectively handle Urdu’s morphological variations. The BiLSTM network sequentially processes the input, leveraging both forward and backward contextual information to encode the complex syntactic and semantic patterns inherent in Urdu text. An essential contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential sentence pair duplicates, with 150,000 pairs manually identified as paraphrases. Our findings reveal a significant improvement in paraphrase detection performance compared to existing methods. We provide insights into the underlying linguistic features and patterns that contribute to the robustness of our framework. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the custom UPC dataset demonstrate our BiLSTM model’s superiority, achieving 94.14% accuracy and outperforming state-of-the-art methods like CNN (83.43%) and LSTM (88.09%). Our model attains an impressive 95.34% accuracy on the benchmark Quora dataset. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, ensuring robust performance across diverse contexts.
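
The preprocessing steps named in the abstract (tokenization, stop-word removal, label encoding) could look roughly like the sketch below. It is an illustration only, not the authors' pipeline: the stop-word list is a tiny hand-picked sample, the label names are hypothetical, and "label encoding" is read here as mapping class labels to integers, which the abstract does not spell out.

```python
import re

# Tiny illustrative stop-word sample (assumption; not the list used in the paper).
URDU_STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور", "سے", "کو"}

# Hypothetical label names; the abstract only says pairs are identified as paraphrases or not.
LABEL_TO_ID = {"non_paraphrase": 0, "paraphrase": 1}


def tokenize(sentence):
    """Split on whitespace and on common Urdu and Latin punctuation marks."""
    return [tok for tok in re.split(r"[\s۔،؟!.,?]+", sentence) if tok]


def preprocess(sentence):
    """Tokenize a sentence and drop tokens found in the stop-word sample."""
    return [tok for tok in tokenize(sentence) if tok not in URDU_STOP_WORDS]


def encode_pair(sentence_a, sentence_b, label):
    """Return both preprocessed sentences plus the integer-encoded label."""
    return preprocess(sentence_a), preprocess(sentence_b), LABEL_TO_ID[label]


if __name__ == "__main__":
    # Two short example sentences written for this sketch (not taken from the UPC).
    print(encode_pair("یہ کتاب اچھی ہے۔", "یہ اچھی کتاب ہے۔", "paraphrase"))
```

The token lists returned by `encode_pair` would still need to be mapped to word-embedding indices before being fed to a BiLSTM; that step depends on the embedding vocabulary, which the abstract does not describe.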

Original language	English
Article number	15383
Journal	Scientific Reports
Volume	15
Issue number	1
DOIs	https://doi.org/10.1038/s41598-025-93260-6
State	Published - Dec 2025

Keywords

BiLSTM
CNN
LSTM
NLP
Paraphrase detection
Urdu text

Cite this

@article{d6fb4fb42f404df39adbefda92de8ff6,

title = "Paraphrase detection for Urdu language text using fine-tune BiLSTM framework",

abstract = "Automated paraphrase detection is crucial for natural language processing (NLP) applications like text summarization, plagiarism detection, and question-answering systems. Detecting paraphrases in Urdu text remains challenging due to the language{\textquoteright}s complex morphology, distinctive script, and lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address Urdu paraphrase detection{\textquoteright}s intricacies. Our approach employs word embeddings and text preprocessing techniques like tokenization, stop-word removal, and label encoding to effectively handle Urdu{\textquoteright}s morphological variations. The BiLSTM network sequentially processes the input, leveraging both forward and backward contextual information to encode the complex syntactic and semantic patterns inherent in Urdu text. An essential contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential sentence pair duplicates, with 150,000 pairs manually identified as paraphrases. Our findings reveal a significant improvement in paraphrase detection performance compared to existing methods. We provide insights into the underlying linguistic features and patterns that contribute to the robustness of our framework. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the custom UPC dataset demonstrate our BiLSTM model{\textquoteright}s superiority, achieving 94.14\% accuracy and outperforming state-of-the-art methods like CNN (83.43\%) and LSTM (88.09\%). Our model attains an impressive 95.34\% accuracy on the benchmark Quora dataset. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, ensuring robust performance across diverse contexts.",

keywords = "BiLSTM, CNN, LSTM, NLP, Paraphrase detection, Urdu text",

author = "Aslam, {Muhammad Ali} and Khairullah Khan and Wahab Khan and Khan, {Sajid Ullah} and Abdullah Albanyan and Algamdi, {Shabbab Ali}",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2025.",

year = "2025",

month = dec,

doi = "10.1038/s41598-025-93260-6",

language = "English",

volume = "15",

journal = "Scientific Reports",

issn = "2045-2322",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - Paraphrase detection for Urdu language text using fine-tune BiLSTM framework

AU - Aslam, Muhammad Ali

AU - Khan, Khairullah

AU - Khan, Wahab

AU - Khan, Sajid Ullah

AU - Albanyan, Abdullah

AU - Algamdi, Shabbab Ali

N1 - Publisher Copyright: © The Author(s) 2025.

PY - 2025/12

Y1 - 2025/12

N2 - Automated paraphrase detection is crucial for natural language processing (NLP) applications like text summarization, plagiarism detection, and question-answering systems. Detecting paraphrases in Urdu text remains challenging due to the language’s complex morphology, distinctive script, and lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address Urdu paraphrase detection’s intricacies. Our approach employs word embeddings and text preprocessing techniques like tokenization, stop-word removal, and label encoding to effectively handle Urdu’s morphological variations. The BiLSTM network sequentially processes the input, leveraging both forward and backward contextual information to encode the complex syntactic and semantic patterns inherent in Urdu text. An essential contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential sentence pair duplicates, with 150,000 pairs manually identified as paraphrases. Our findings reveal a significant improvement in paraphrase detection performance compared to existing methods. We provide insights into the underlying linguistic features and patterns that contribute to the robustness of our framework. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the custom UPC dataset demonstrate our BiLSTM model’s superiority, achieving 94.14% accuracy and outperforming state-of-the-art methods like CNN (83.43%) and LSTM (88.09%). Our model attains an impressive 95.34% accuracy on the benchmark Quora dataset. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, ensuring robust performance across diverse contexts.

KW - BiLSTM

KW - CNN

KW - LSTM

KW - NLP

KW - Paraphrase detection

KW - Urdu text

UR - http://www.scopus.com/inward/record.url?scp=105003940372&partnerID=8YFLogxK

U2 - 10.1038/s41598-025-93260-6

DO - 10.1038/s41598-025-93260-6

M3 - Article

C2 - 40316633

AN - SCOPUS:105003940372

SN - 2045-2322

VL - 15

JO - Scientific Reports

JF - Scientific Reports

IS - 1

M1 - 15383

ER -
