Machine learning model for random forest acute oral toxicity prediction

A. M. Elsayad; M. Zeghid; K. A. Elsayad; A. N. Khan; A. K.M. Baareh; A. Sadiq; S. A. Mukhtar; H. F. Ali; S. Abd El-kader

doi:10.22034/gjesm.2025.01.02

Electrical Engineering

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

BACKGROUND AND OBJECTIVES: Reliable and precise prediction of acute oral toxicity is important for strengthening chemical safety and advancing the sustainable development goals (SDGs), particularly SDG 3 (good health and well-being), SDG 6 (clean water and sanitation), and SDG 12 (responsible consumption and production). Traditional toxicity assessments are often time-consuming and costly, which motivates the exploration of more efficient approaches. This study aims to establish the most efficient method for constructing reliable and precise toxicity prediction models. METHODS: Random forests, a robust ensemble method, were evaluated for predicting acute oral toxicity using a comprehensive dataset from the National Toxicology Program/Interagency Center for the Evaluation of Alternative Toxicological Methods and the Environmental Protection Agency/National Center for Computational Toxicology, which presented significant class imbalance: 8 percent very toxic and 92 percent not very toxic. To address this imbalance, cost-sensitive learning and data resampling techniques, including both undersampling and oversampling, were used. A diverse set of two-dimensional molecular descriptors generated with the rational discovery kit (RDKit) was used as input features, and preprocessing involved normalization, validation, and feature selection. Hyperparameter tuning was conducted using Bayesian optimization and cross-validation, and the performance of random forests was compared with that of gradient boosting, extreme gradient boosting, artificial neural networks, and the generalized linear model.
FINDINGS: The random forest models, particularly those using undersampling and cost-sensitive learning, demonstrated superior performance, achieving a sensitivity of 0.81, a specificity of 0.85, an accuracy of 0.85, and an area under the receiver operating characteristic curve of 0.89 on an independent test set. Feature importance analysis showed that the primary molecular descriptors are those related to the van der Waals surface area and molecular quantum numbers. A surrogate decision tree developed from the random forest predictions reached an area under the curve of 0.929. CONCLUSION: Random forest models effectively predicted acute oral toxicity, particularly when class imbalance was addressed through cost-sensitive learning and resampling. By leveraging explainable artificial intelligence techniques, including permutation feature importance, surrogate decision tree analysis, and local interpretable model-agnostic explanations, this study identified key molecular descriptors driving toxicity. This advancement improves model interpretability and represents a significant step toward enhancing chemical safety while supporting the sustainable development goals.
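The abstract reports sensitivity and specificity alongside accuracy. With an 8/92 class split, accuracy alone can look strong while every very toxic compound is missed. The following is a minimal illustrative sketch, not the authors' code: the toy labels, the function names, and the "balanced" weighting heuristic (total samples divided by number of classes times class count) are assumptions made here for illustration. It computes the three metrics for a trivial classifier, draws a random undersample of the majority class, and derives one possible set of class weights for cost-sensitive learning.

```python
import random
from collections import Counter


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for 0/1 labels (1 = very toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


def undersample(labels, seed=0):
    """Indices keeping every minority sample plus an equal-size random majority subset."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Hypothetical toy labels mirroring the 8 percent / 92 percent split.
labels = [1] * 8 + [0] * 92

# A classifier that always predicts "not very toxic".
baseline = binary_metrics(labels, [0] * len(labels))

balanced_labels = [labels[i] for i in undersample(labels)]
weights = balanced_class_weights(labels)

print(baseline)
print(Counter(balanced_labels))
print(weights)
```

On these toy labels the always-negative classifier reaches an accuracy of 0.92 with a sensitivity of 0.0, which is why the study's evaluation on sensitivity, specificity, and area under the curve is more informative than accuracy alone for this dataset.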

Original language	English
Pages (from-to)	21-38
Number of pages	18
Journal	Global Journal of Environmental Science and Management
Volume	11
Issue number	1
DOIs	https://doi.org/10.22034/gjesm.2025.01.02
State	Published - Dec 2025

Keywords

Explainable artificial intelligence
Machine Learning (ML)
Random forest (RF)
Rational discovery kit (RDKit)
Sustainable development goals (SDGs)
Toxicity

Access to Document

10.22034/gjesm.2025.01.02

Cite this

@article{bcd83ea4d5dc473a9a5fb1c850435211,

title = "Machine learning model for random forest acute oral toxicity prediction",

keywords = "Explainable artificial intelligence, Machine Learning (ML), Random forest (RF), Rational discovery kit (RDKit), Sustainable development goals (SDGs), Toxicity",

author = "Elsayad, {A. M.} and M. Zeghid and Elsayad, {K. A.} and Khan, {A. N.} and Baareh, {A. K.M.} and A. Sadiq and Mukhtar, {S. A.} and Ali, {H. F.} and {Abd El-kader}, S.",

note = "Publisher Copyright: {\textcopyright} 2025 The author(s).",

year = "2025",

month = dec,

doi = "10.22034/gjesm.2025.01.02",

language = "English",

volume = "11",

pages = "21--38",

journal = "Global Journal of Environmental Science and Management",

issn = "2383-3572",

publisher = "GJESM Publication",

number = "1",

}

TY - JOUR

T1 - Machine learning model for random forest acute oral toxicity prediction

AU - Elsayad, A. M.

AU - Zeghid, M.

AU - Elsayad, K. A.

AU - Khan, A. N.

AU - Baareh, A. K.M.

AU - Sadiq, A.

AU - Mukhtar, S. A.

AU - Ali, H. F.

AU - Abd El-kader, S.

PY - 2025/12

Y1 - 2025/12

KW - Explainable artificial intelligence

KW - Machine Learning (ML)

KW - Random forest (RF)

KW - Rational discovery kit (RDKit)

KW - Sustainable development goals (SDGs)

KW - Toxicity

UR - https://www.scopus.com/pages/publications/85212615483

U2 - 10.22034/gjesm.2025.01.02

DO - 10.22034/gjesm.2025.01.02

M3 - Article

AN - SCOPUS:85212615483

SN - 2383-3572

VL - 11

SP - 21

EP - 38

JO - Global Journal of Environmental Science and Management

JF - Global Journal of Environmental Science and Management

IS - 1

ER -
