TY - JOUR
T1 - Enhancing lysine 2-hydroxyisobutyrylation site prediction using LightGBM and hybrid sequence features
AU - Elreify, Heba M.
AU - Abd El-Samie, Fathi E.
AU - Dessouky, Moawad I.
AU - Torkey, Hanaa
AU - El-Khamy, Said E.
AU - Shalaby, Wafaa A.
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Lysine 2-hydroxyisobutyrylation (Khib), a post-translational modification (PTM), plays a pivotal role in regulating protein structure and function, with emerging evidence highlighting its significance in cellular metabolism, transcriptional regulation, and disease pathways. However, the experimental identification of Khib sites is hindered by labour-intensive methods and the dynamic nature of these modifications. To address this challenge, we propose a computational framework that uses a Light Gradient Boosting Machine (LightGBM) to predict Khib sites. Peptide sequences of 37 amino acids are represented using a hybrid feature set that combines Evolutionary Scale Modeling (ESM), Composition-Transition-Distribution (CTD), and AAindex descriptors. These features capture both the evolutionary and physicochemical properties of the protein sequences. Mutual information-based feature selection improves model performance, and LightGBM outperforms alternative classifiers, including Support Vector Machines (SVM) and XGBoost. Validation on Homo sapiens, Toxoplasma gondii, and Oryza sativa datasets yielded area under the ROC curve (AUC) values of 0.846, 0.836, and 0.788, respectively, surpassing existing predictors such as iLys-Khib and KhibPred. Additionally, sequence analysis revealed species-specific amino acid preferences surrounding Khib sites, providing insight into the biological determinants of this modification and advancing the prediction of Khib sites across species.
KW - ESM
KW - LightGBM
KW - Lysine 2-hydroxyisobutyrylation
KW - Mutual information
KW - Post-translational modification
UR - https://www.scopus.com/pages/publications/105021251636
U2 - 10.1007/s13721-025-00572-8
DO - 10.1007/s13721-025-00572-8
M3 - Article
AN - SCOPUS:105021251636
SN - 2192-6662
VL - 14
JO - Network Modeling Analysis in Health Informatics and Bioinformatics
JF - Network Modeling Analysis in Health Informatics and Bioinformatics
IS - 1
M1 - 139
ER -