TY - JOUR
T1 - Explainable machine learning framework for assessing groundwater quality and trace element contamination in Eastern Saudi Arabia
AU - Aldrees, Ali
AU - Jibrin, Abdulhayat M.
AU - Dan’azumi, Salisu
AU - Mahmoud, Ismail A.
AU - Aliyu, Usman U.
AU - Abba, Sani I.
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Groundwater quality in arid regions like Al Hassa, Saudi Arabia, is increasingly threatened by trace element contamination driven by human activity and natural geology. This study addresses the urgent need for data-driven tools to assess groundwater pollution in the region’s multi-aquifer system. Groundwater samples were analyzed for trace elements and main physicochemical parameters. Using supervised machine learning (ML) models—Linear Regression (LR), Random Forest (RF), K-Nearest Neighbors (KNN), and Gradient Boosting Machine (GBM)—the Water Pollution Index (WPI) was predicted as a holistic metric of contamination. The GBM model outperformed all others, achieving a training coefficient of determination (DC) of 0.9970 and a mean absolute error (MAE) of 0.0017. During testing, it maintained a high DC of 0.9372 and an MAE of 0.0063, confirming its strong generalization ability. SHapley Additive exPlanations (SHAP) were used to rank feature importance and enhance model transparency. The most influential variables for WPI prediction were chromium (Cr, SHAP = 0.0214), aluminum (Al, SHAP = 0.0136), and strontium (Sr, SHAP = 0.0053), followed by Fe (0.0031), V (0.0028), and Se (0.0017). Despite generally acceptable water quality, elements such as Cr and Fe exceeded safe limits in several samples. This study presents a transparent, high-performing framework for groundwater quality assessment in arid conditions. The integration of explainable ML offers clear, actionable insights into sustainable water management and environmental decision-making.
KW - Explainable machine learning
KW - Groundwater quality
KW - SHAP analysis
KW - Trace elements
KW - Water pollution index
UR - https://www.scopus.com/pages/publications/105026289375
U2 - 10.1038/s41598-025-29598-8
DO - 10.1038/s41598-025-29598-8
M3 - Article
C2 - 41276649
AN - SCOPUS:105026289375
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 45333
ER -