An Enhanced Machine Learning Framework for Type 2 Diabetes Classification Using Imbalanced Data with Missing Values

Kumarmangal Roy; Muneer Ahmad; Kinza Waqar; Kirthanaah Priyaah; Jamel Nebhen; Sultan S. Alshamrani; Muhammad Ahsan Raza; Ihsan Ali

doi:10.1155/2021/9953314

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Diabetes is one of the most common metabolic diseases that cause high blood sugar. Early diagnosis is challenging because the condition depends on many interrelated factors. Decision support systems are needed to assist medical practitioners in the diagnosis process. This research proposes a predictive model that achieves high classification accuracy for type 2 diabetes. The study consisted of two fundamental parts. First, it investigated handling missing data with three imputation methods: median value imputation, K-nearest neighbor imputation, and iterative imputation. It then evaluated the effect of each imputation method on classification accuracy using linear, tree-based, and ensemble classification algorithms. Second, an Artificial Neural Network was used to model the best-performing imputed data, balanced with SMOTETomek so that each class is represented fairly. This approach achieved the best accuracy, 98% on the test data, outperforming accuracies reported in prior studies on the same dataset. The dataset used in this study is limited to a specific gender and population. As future work, the study recommends adopting a larger population sample without geographic boundaries. Additionally, because the developed Artificial Neural Network model did not undergo any specific hyperparameter tuning, it would be interesting to explore tuning on top of normalized data to further optimize accuracy.
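To make two of the imputation strategies named in the abstract concrete, here is a minimal standard-library sketch of median value imputation and K-nearest neighbor imputation. It is an illustration under stated assumptions, not the authors' implementation: the function names (`median_impute`, `knn_impute`), the use of `None` to mark a missing value, the rescaled Euclidean distance, and the toy rows are all hypothetical choices made for this example.

```python
from math import sqrt
from statistics import median


def median_impute(rows):
    """Replace each None with the median of the observed values in its column."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Euclidean distance over features observed in both rows (None if no overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    # Rescale by the share of usable features so sparse overlaps are not favored.
    return sqrt(sum((x - y) ** 2 for x, y in shared) * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest rows."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = _distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [donor for _, donor in candidates[:k]]
            if nearest:
                result[i][j] = sum(nearest) / len(nearest)
    return result


# Toy data (hypothetical): the second row is missing its second feature.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
    [10.0, 100.0],
]

print(median_impute(rows)[1])    # missing value filled with the column median
print(knn_impute(rows, k=2)[1])  # missing value filled from the two closest rows
```

On this toy data the two strategies fill the same gap differently: the median uses every observed value in the column, while the nearest-neighbor variant uses only the rows most similar to the incomplete one. That difference is the kind of effect the study compares across downstream classifiers.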

Original language	English
Article number	9953314
Journal	Complexity
Volume	2021
DOIs	https://doi.org/10.1155/2021/9953314
State	Published - 2021
Externally published	Yes

Cite this

@article{f6d1011305f44996bcd0f808e267ba37,

title = "An Enhanced Machine Learning Framework for Type 2 Diabetes Classification Using Imbalanced Data with Missing Values",

abstract = "Diabetes is one of the most common metabolic diseases that cause high blood sugar. Early diagnosis of such a condition is challenging due to its complex interdependence on various factors. There is a need to develop critical decision support systems to assist medical practitioners in the diagnosis process. This research proposes developing a predictive model that can achieve a high classification accuracy of type 2 diabetes. The study consisted of two fundamental parts. Firstly, the study investigated handling missing data adopting data imputation, namely, median value imputation, K-nearest neighbor imputation, and iterative imputation. Consequently, the study validated the implications of these imputations using various classification algorithms, i.e., linear, tree-based, and ensemble algorithms, to see how each method affected classification accuracy. Secondly, Artificial Neural Network was employed to model the best performing imputed data, balanced with SMOTETomek ensuring each class is represented fairly. This approach provided the best accuracy of 98\% on the test data, outperforming accuracies achieved in prior studies using the same dataset. The dataset used in this study is concerned with gender and population. As a prospect, the study recommends adopting a larger population sample without geographic boundaries. Additionally, as the developed Artificial Neural Network model did not undergo any specific hyperparameter tuning, it would be interesting to explore tuning on top of normalized data to optimize accuracy further.",

author = "Kumarmangal Roy and Muneer Ahmad and Kinza Waqar and Kirthanaah Priyaah and Jamel Nebhen and Alshamrani, {Sultan S.} and Raza, {Muhammad Ahsan} and Ihsan Ali",

note = "Publisher Copyright: {\textcopyright} 2021 Kumarmangal Roy et al.",

year = "2021",

doi = "10.1155/2021/9953314",

language = "English",

volume = "2021",

journal = "Complexity",

issn = "1076-2787",

publisher = "John Wiley and Sons Inc",

}

TY - JOUR

T1 - An Enhanced Machine Learning Framework for Type 2 Diabetes Classification Using Imbalanced Data with Missing Values

AU - Roy, Kumarmangal

AU - Ahmad, Muneer

AU - Waqar, Kinza

AU - Priyaah, Kirthanaah

AU - Nebhen, Jamel

AU - Alshamrani, Sultan S.

AU - Raza, Muhammad Ahsan

AU - Ali, Ihsan

PY - 2021

Y1 - 2021

AB - Diabetes is one of the most common metabolic diseases that cause high blood sugar. Early diagnosis of such a condition is challenging due to its complex interdependence on various factors. There is a need to develop critical decision support systems to assist medical practitioners in the diagnosis process. This research proposes developing a predictive model that can achieve a high classification accuracy of type 2 diabetes. The study consisted of two fundamental parts. Firstly, the study investigated handling missing data adopting data imputation, namely, median value imputation, K-nearest neighbor imputation, and iterative imputation. Consequently, the study validated the implications of these imputations using various classification algorithms, i.e., linear, tree-based, and ensemble algorithms, to see how each method affected classification accuracy. Secondly, Artificial Neural Network was employed to model the best performing imputed data, balanced with SMOTETomek ensuring each class is represented fairly. This approach provided the best accuracy of 98% on the test data, outperforming accuracies achieved in prior studies using the same dataset. The dataset used in this study is concerned with gender and population. As a prospect, the study recommends adopting a larger population sample without geographic boundaries. Additionally, as the developed Artificial Neural Network model did not undergo any specific hyperparameter tuning, it would be interesting to explore tuning on top of normalized data to optimize accuracy further.

UR - https://www.scopus.com/pages/publications/85111301850

U2 - 10.1155/2021/9953314

DO - 10.1155/2021/9953314

M3 - Article

AN - SCOPUS:85111301850

SN - 1076-2787

VL - 2021

JO - Complexity

JF - Complexity

M1 - 9953314

ER -
