
Dr Anthony Onoja
About
Biography
I hold a PhD in Data Science, as well as an MSc and a First Class BSc (Hons) in Statistics.
My work focuses on applying Artificial Intelligence (AI), Data Science, and statistical techniques to tackle critical challenges in health and medical sciences. Key areas of research include interpretable and explainable AI for chronic conditions, multimorbidity, patient stratification, biomarker identification, and multi-omics data analysis. I am also strongly committed to promoting reproducible and transparent research practices - an increasingly important priority in the era of generative AI.
Beyond research, I am actively engaged in mentoring and consultancy in the fields of Statistics, Data Science, and AI ethics and explainability.
Areas of specialism
University roles and responsibilities
- Research Fellow
My qualifications
Research
Research interests
My research focuses on applying Artificial Intelligence (AI), Data Science, and statistical techniques to tackle critical challenges in health and medical sciences. Key areas of research include interpretable and explainable AI for chronic conditions, multimorbidity, patient stratification, biomarker identification, and multi-omics data analysis.
Research projects
In this project, we performed bioinformatics analyses such as statistical modelling and machine learning (supervised and unsupervised clustering) with data from the NURTuRE biorepository, which includes 2996 patients from the CKD cohort and 25 biomarkers measured in plasma and urine.
Publications
Highlights
- Onoja, A., Zahid, A., Elomaa, K., & Geifman, N. (2025). An Interpretable Model for Predicting Acute Myocardial Infarction in Distinct Patient Profiles. Studies in Health Technology and Informatics, 327, 452–456. https://doi.org/10.3233/SHTI250378
- Gerichten, J., Saunders, K., Bailey, M. J., Gethings, L. A., Onoja, A., Geifman, N., & Spick, M. (2024). Challenges in Lipidomics Biomarker Identification: Avoiding the Pitfalls and Improving Reproducibility. Metabolites, 14(8), 461. https://doi.org/10.3390/metabo14080461
- Onoja, A., Von Gerichten, J., Lewis, H.M., Bailey, M.J., Skene, D.J., Geifman, N., & Spick, M. (2023). Meta-Analysis of COVID-19 Metabolomics Identifies Variations in Robustness of Biomarkers. International Journal of Molecular Sciences, 24(18), 14371. https://doi.org/10.3390/ijms241814371
- Ramone, T., Romei, C., Ciampi, R., Casalini, R., Valetto, A., Bertini, V., Raimondi, F., Onoja, A., et al. (2023). Chromosomal alterations in sporadic medullary thyroid carcinoma and correlation with outcome. Endocrine-Related Cancer, 30(9). https://doi.org/10.1530/ERC-22-0251
- Onoja, A., & Raimondi, F. (2023). Interpretability from a New Lens: Integrating Stratification and Domain Knowledge for Biomedical Applications. arXiv preprint arXiv:2303.09322.
- Onoja, A., Picchiotti, N., Fallerini, C., et al. (2022). An Explainable Model of Host Genetic Interactions Linked to COVID-19 Severity. Communications Biology, 5(1), 1133. https://doi.org/10.1038/s42003-022-04073-6
- Onoja, A., Ogundare, O.C., & Tejuoso, F.E. (2022). An Optimization of Outpatients’ Waiting Time and Health-Related Risks. Journal of Natural Sciences and Mathematics of UT, 7(13–14), 21–35. https://doi.org/10.21203/rs.3.rs-1565090/v1
- Onoja, A., Raimondi, F., & Nanni, M. (2023). An Explainable Host Genetic Severity Predictor Model for COVID-19 Patients. medRxiv preprint. https://doi.org/10.1101/2023.03.06.23286869
- Onoja, A.A., Babasola, O.L., & Ojiambo, V. (2018). Application of Discriminant Analysis in the Classification of Food Security Status. International Journal of Scientific & Engineering Research, 9(2), 1474–1489.
- Onoja, A.A., Babasola, O.L., & Ojiambo, V. (2018). Chi-Square Automatic Interaction Detection Modeling of the Effects of Social Media Networks on Students' Academic Performance. Journal of Statistics and Mathematical Sciences, 4, 32–39. https://doi.org/10.9790/487X-2007024351
- Babasola, O.L., Irakoze, I., & Onoja, A.A. (2018). Valuation of European Options within the Black-Scholes Framework Using the Hermite Polynomial. Journal of Scientific and Engineering Research, 5(2), 200–213.
- Onoja, A., Kembe, M.M., Bwebum, C.D., & Obilikwu, P. (2022). An Integrated Big Data Model to Salvage Nigeria's Insecurity Challenges. Nigerian Annals of Pure and Applied Sciences, 5(1), 125–136. https://doi.org/10.5281/zenodo.6962868
The global COVID-19 pandemic resulted in widespread harms but also rapid advances in vaccine development, diagnostic testing, and treatment. As the disease moves to endemic status, the need to identify characteristic biomarkers of the disease for diagnostics or therapeutics has lessened, but lessons can still be learned to inform biomarker research in dealing with future pathogens. In this work, we test five sets of research-derived biomarkers against an independent targeted and quantitative Liquid Chromatography-Mass Spectrometry metabolomics dataset to evaluate how robustly these proposed panels would distinguish between COVID-19-positive and negative patients in a hospital setting. We further evaluate a crowdsourced panel comprising the COVID-19 metabolomics biomarkers most commonly mentioned in the literature between 2020 and 2023. The best-performing panel in the independent dataset, measured by F1 score (0.76) and AUROC (0.77), included nine biomarkers: lactic acid, glutamate, aspartate, phenylalanine, β-alanine, ornithine, arachidonic acid, choline, and hypoxanthine. Panels comprising fewer metabolites performed less well, showing weaker statistical significance in the independent cohort than originally reported in their respective discovery studies. Whilst the studies reviewed here were small and may be subject to confounders, it is desirable that biomarker panels be resilient across cohorts if they are to find use in the clinic, highlighting the importance of assessing the robustness and reproducibility of metabolomics analyses in independent populations.
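As a rough illustration of the two metrics quoted above, the following sketch computes an F1 score and an AUROC from first principles for a binary classifier. The labels and scores are made-up toy values, not data from the study, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = COVID-19 positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

print(round(f1_score(y_true, y_pred), 3))
print(round(auroc(y_true, scores), 3))
```

F1 depends on the chosen decision threshold, whereas AUROC summarises ranking quality across all thresholds, which is one reason the two are often reported together.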
Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in biomarker identification. In this work, we illustrate the reproducibility gap for two open-access lipidomics platforms, MS DIAL and Lipostar, finding just 14.0% identification agreement when analyzing identical LC-MS spectra using default settings. Whilst the software platforms performed more consistently using fragmentation data, agreement was still only 36.1% for MS2 spectra. This highlights the critical importance of validation across positive and negative LC-MS modes, as well as the manual curation of spectra and lipidomics software outputs, in order to reduce identification errors caused by closely related lipids and co-elution issues. This curation process can be supplemented by data-driven outlier detection in assessing spectral outputs, which is demonstrated here using a novel machine learning approach based on support vector machine regression combined with leave-one-out cross-validation. These steps are essential to reduce the frequency of false positive identifications and close the reproducibility gap, including between software platforms, which, for downstream users such as bioinformaticians and clinicians, can be an underappreciated source of biomarker identification errors.
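The idea of leave-one-out outlier detection mentioned above can be sketched as follows. This is only an illustration under stated assumptions: to stay dependency-free it swaps in ordinary least-squares regression for the support vector machine regression used in the paper, the data are toy values, and the flagging rule (residual larger than three times the median absolute residual) is an assumption rather than the authors' criterion.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_residuals(xs, ys):
    """For each point, fit on all other points and record the held-out residual."""
    residuals = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        residuals.append(ys[i] - (slope * xs[i] + intercept))
    return residuals


def flag_outliers(xs, ys, threshold=3.0):
    """Flag points whose held-out residual is large relative to the median residual."""
    abs_res = [abs(r) for r in loo_residuals(xs, ys)]
    scale = median(abs_res)
    if scale == 0:
        return [False] * len(xs)
    return [a > threshold * scale for a in abs_res]


# Toy data (hypothetical): y = 2x everywhere except one deviating point at x = 4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0, 4.0, 6.0, 16.0, 10.0, 12.0, 14.0]
flags = flag_outliers(xs, ys)
print(flags)
```

Because each point is predicted by a model that never saw it, a suspect identification cannot mask itself by pulling the fit towards its own value.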
This study aimed to ascertain, using statistical feature selection methods and interpretable machine learning (ML) models, the features that best predict risk status (“Low”, “Medium”, “High”) for COVID-19 infection. It uses a publicly available dataset obtained via an online web-based risk assessment calculator. Of the 59 features, 57 were retained after filtering for multicollinearity using the Pearson correlation coefficient, and these were further reduced to 55 by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The best classifier's predictions were saved as a new instance and perturbed using a single decision tree classifier. To further build trust in and explainability of the best model, an XGBoost classifier was trained as a global surrogate model on the predictions of the best model. Individual XGBoost predictions were explained using the SHAP explainable AI framework. The Random Forest classifier, with a validation accuracy of 96.35% on the 55 selected features, emerged as the best classifier. The decision tree classifier approximated the best classifier with a prediction accuracy of 92.23% and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier with a prediction score of 99.7%. This study identified COVID-19 positive, COVID-19 contacts, COVID-19 symptoms, Health workers, and Public transport count as the five most consistent features that predict an individual’s risk exposure to COVID-19.
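The multicollinearity filtering step described above can be sketched as a simple greedy filter on pairwise Pearson correlations. The feature names, toy values, and the 0.9 cut-off are all assumptions for illustration; the abstract does not state which threshold was used.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (which would give zero variance).
    """
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def drop_collinear(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `features` maps a feature name to its list of values; insertion order
    decides which member of a correlated pair survives.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): "age_months" duplicates the information in "age"
features = {
    "age": [30, 40, 50, 60, 70],
    "age_months": [360, 480, 600, 720, 840],
    "contacts": [5, 1, 4, 2, 3],
}
kept = drop_collinear(features)
print(kept)  # ['age', 'contacts']
```

A filter like this only removes pairwise redundancy; a penalised model such as LASSO can then shrink the remaining set further, as the abstract describes.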
Understanding COVID-19 severity and why it differs significantly among patients is a matter of concern to the scientific community. The major contribution of this study is a voting ensemble host genetic severity predictor (HGSP) model that we developed by combining several state-of-the-art machine learning algorithms (decision tree-based models: Random Forest and XGBoost classifiers). These models were trained on a genetic Whole Exome Sequencing (WES) dataset and clinical covariates (age and gender), using a 5-fold stratified cross-validation strategy to randomly split the dataset and overcome model instability. Our study validated the HGSP model based on the 18 features (i.e., 16 identified candidate genetic variants and 2 covariates) identified in a prior study. We provided post-hoc model explanations through ExplainerDashboard, an open-source Python library, allowing for deeper insight into the prediction results. We applied the Enrichr and Open Targets Genetics interactive bioinformatics tools to link the genetic variants to plausible biological insights and domain interpretations such as pathways, ontologies, and diseases/drugs. Through unsupervised clustering of the SHAP feature importance values, we visualized the complex genetic mechanisms. Our findings show that while age and gender mainly influence COVID-19 severity, a specific group of patients experiences severity due to complex genetic interactions.
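A 5-fold stratified split of the kind mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' pipeline: the class labels, sample counts, and helper names are hypothetical, and samples are dealt round-robin within each class so that every fold keeps roughly the overall class proportions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices with approximately equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    for test in stratified_kfold(labels, k, seed):
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


# Toy labels (hypothetical): 10 severe and 20 mild patients
labels = ["severe"] * 10 + ["mild"] * 20
folds = stratified_kfold(labels, k=5, seed=0)
for train, test in train_test_splits(labels):
    n_severe = sum(labels[i] == "severe" for i in test)
    print(len(train), len(test), n_severe)
```

Keeping the severe-to-mild ratio constant across folds means each trained model sees a comparable class balance, which is what helps reduce fold-to-fold instability.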
Somatic Copy Number Alterations (SCNA) involving either a whole chromosome or just one of the arms, or even smaller parts, have been described in about 88% of human tumors. This study investigated the SCNA profile in 40 well-characterized sporadic medullary thyroid carcinomas (MTC) by comparative genomic hybridization array. We found that 26/40 (65%) cases had at least one SCNA. The prevalence of SCNA, and in particular of chromosomes 3 and 10, was significantly higher in cases with a RET somatic mutation. Similarly, SCNA of chromosomes 3, 9, 10 and 16 were more frequent in cases with a worse outcome and an advanced disease. Through pathway enrichment analysis, we found a mutually exclusive distribution of biological pathways in metastatic, biochemically persistent and cured patients. In particular, we found gain of regions involved in intracellular signaling and loss of regions involved in DNA repair and TP53 pathways in the group of metastatic patients. Gain of regions involved in cell cycle and senescence was observed in patients with biochemical disease. Finally, gain of regions associated with the immune system and loss of regions involved in the apoptosis pathway were observed in cured patients, suggesting a role of specific SCNA and corresponding altered pathways in the outcome of sporadic MTC.
The use of machine learning (ML) techniques in the biomedical field has become increasingly important, particularly with the large amounts of data generated by the aftermath of the COVID-19 pandemic. However, due to the complex nature of biomedical datasets and the use of black-box ML models, a lack of trust and adoption by domain experts can arise. In response, interpretable ML (IML) approaches have been developed, but the curse of dimensionality in biomedical datasets can lead to model instability. This paper proposes a novel computational strategy for the stratification of biomedical problem datasets into k-fold cross-validation (CVs) and integrating domain knowledge interpretation techniques embedded into the current state-of-the-art IML frameworks. This approach can improve model stability, establish trust, and provide explanations for outcomes generated by trained IML models. Specifically, the model outcome, such as aggregated feature weight importance, can be linked to further domain knowledge interpretations using techniques like pathway functional enrichment, drug targeting, and repurposing databases. Additionally, involving end-users and clinicians in focus group discussions before and after the choice of IML framework can help guide testable hypotheses, improve performance metrics, and build trustworthy and usable IML solutions in the biomedical field. Overall, this study highlights the potential of combining advanced computational techniques with domain knowledge interpretation to enhance the effectiveness of IML solutions in the context of complex biomedical datasets.
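The aggregated feature weight importance mentioned above can be illustrated by averaging per-fold importances and ranking the result. This is a sketch under stated assumptions: the feature names and weights are invented, a plain mean is used as the aggregation rule, and a feature missing from a fold is treated as having zero importance there.

```python
from collections import defaultdict


def aggregate_importance(fold_importances):
    """Average feature importances across folds and rank them, highest first.

    `fold_importances` is a list of dicts, one per cross-validation fold,
    each mapping a feature name to its importance in that fold.
    """
    totals = defaultdict(float)
    for importances in fold_importances:
        for name, weight in importances.items():
            totals[name] += weight
    n_folds = len(fold_importances)
    means = {name: total / n_folds for name, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy importances from three folds (hypothetical features and values)
fold_importances = [
    {"age": 0.5, "variant_A": 0.3, "variant_B": 0.2},
    {"age": 0.4, "variant_A": 0.2, "variant_B": 0.4},
    {"age": 0.6, "variant_A": 0.1, "variant_B": 0.3},
]
ranking = aggregate_importance(fold_importances)
for name, mean_weight in ranking:
    print(name, round(mean_weight, 3))
```

The ranked names could then be passed to downstream domain-knowledge tools, such as pathway enrichment, in the way the abstract outlines.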
We employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We coupled a stratified k-fold screening, to rank variants more associated with severity, with the training of multiple supervised classifiers, to predict severity based on screened features. Feature importance analysis from tree-based models allowed us to identify 16 variants with the highest support which, together with age and gender covariates, were found to be most predictive of COVID-19 severity. When tested on a follow-up cohort, our ensemble of models predicted severity with high accuracy (ACC = 81.88%; AUCROC = 96%; MCC = 61.55%). Our model recapitulated a vast literature of emerging molecular mechanisms and genetic factors linked to COVID-19 response and extends previous landmark Genome-Wide Association Studies (GWAS). It revealed a network of interplaying genetic signatures converging on established immune system and inflammatory processes linked to viral infection response. It also identified additional processes cross-talking with immune pathways, such as GPCR signaling, which might offer additional opportunities for therapeutic intervention and patient stratification. Publicly available PheWAS datasets revealed that several variants were significantly associated with phenotypic traits such as "Respiratory or thoracic disease", supporting their link with COVID-19 severity outcome.
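For readers less familiar with the metrics reported above, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) for a binary severity classifier. The labels are toy values, not study data, and the function names are hypothetical. Note that this function returns MCC on its native scale from -1 to 1, whereas the abstract quotes it as a percentage.

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when any marginal count is zero."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example (hypothetical): 1 = severe, 0 = not severe
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
print(mcc(y_true, y_pred))       # 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the severe and non-severe groups differ in size.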