TY - JOUR
T1 - Machine learning model for random forest acute oral toxicity prediction
AU - Elsayad, A. M.
AU - Zeghid, M.
AU - Elsayad, K. A.
AU - Khan, A. N.
AU - Baareh, A. K. M.
AU - Sadiq, A.
AU - Mukhtar, S. A.
AU - Ali, H. F.
AU - Abd El-kader, S.
N1 - Publisher Copyright: © 2025 The author(s).
PY - 2025/12
Y1 - 2025/12
AB - BACKGROUND AND OBJECTIVES: Reliable and precise prediction of acute oral toxicity is important for strengthening chemical safety and advancing the Sustainable Development Goals (SDGs), particularly SDG 3 (good health and well-being), SDG 6 (clean water and sanitation), and SDG 12 (responsible consumption and production). Traditional toxicity assessments are often time-consuming and costly, which motivates the search for more efficient approaches. This study aims to establish the most efficient method for constructing reliable and precise toxicity prediction models. METHODS: Random forests, a robust ensemble method, were evaluated for predicting acute oral toxicity using a comprehensive dataset from the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods and the Environmental Protection Agency National Center for Computational Toxicology. The dataset was strongly imbalanced: 8 percent of compounds were very toxic and 92 percent were not very toxic. To address this imbalance, cost-sensitive learning and data resampling techniques, including both undersampling and oversampling, were applied. A diverse set of two-dimensional molecular descriptors generated with RDKit (rational discovery kit) was used as input features, and preprocessing involved normalization, validation, and feature selection. Hyperparameter tuning was conducted using Bayesian optimization and cross-validation, and the performance of random forests was compared with that of gradient boosting, extreme gradient boosting, artificial neural networks, and the generalized linear model. FINDINGS: The random forest models, particularly those using undersampling and cost-sensitive learning, demonstrated superior performance, achieving a sensitivity of 0.81, a specificity of 0.85, an accuracy of 0.85, and an area under the receiver operating characteristic curve of 0.89 on an independent test set. Feature importance analysis showed that the primary molecular descriptors are those related to the van der Waals surface area and molecular quantum numbers. A surrogate decision tree developed from the random forest predictions reached an area under the curve of 0.929. CONCLUSION: Random forest models effectively predicted acute oral toxicity, particularly when class imbalance was addressed through cost-sensitive learning and resampling. By leveraging explainable artificial intelligence techniques, including permutation feature importance, surrogate decision tree analysis, and local interpretable model-agnostic explanations, this study identified key molecular descriptors driving toxicity. This advancement improves model interpretability and represents a significant step toward enhancing chemical safety while supporting the Sustainable Development Goals.
KW - Explainable artificial intelligence
KW - Machine learning (ML)
KW - Random forest (RF)
KW - Rational discovery kit (RDKit)
KW - Sustainable development goals (SDGs)
KW - Toxicity
UR - https://www.scopus.com/pages/publications/85212615483
U2 - 10.22034/gjesm.2025.01.02
DO - 10.22034/gjesm.2025.01.02
M3 - Article
AN - SCOPUS:85212615483
SN - 2383-3572
VL - 11
SP - 21
EP - 38
JO - Global Journal of Environmental Science and Management
JF - Global Journal of Environmental Science and Management
IS - 1
ER -