Dr Anthony Onoja


Research Fellow
PhD Data Science, MSc Mathematical Statistics, BSc (Hons) Statistics

Academic and research departments

Digital health, School of Health Sciences.

About

Areas of specialism

Data Science; Artificial Intelligence; Statistics

University roles and responsibilities

  • Research Fellow

    My qualifications

    PhD in Data Science
    Scuola Normale Superiore, Pisa, Italy
    MSc. in Mathematical Statistics
    Pan African University and Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
    Bachelor of Science (Honours) First Class in Statistics
    University of Jos, Nigeria

    Research

    Sustainable development goals

    My research interests are related to the following:

    Good Health and Well-being (UN Sustainable Development Goal 3)
    Quality Education (UN Sustainable Development Goal 4)
    Reduced Inequalities (UN Sustainable Development Goal 10)
    Climate Action (UN Sustainable Development Goal 13)

    Publications

    Highlights

    1. Onoja, A., Zahid, A., Elomaa, K., & Geifman, N. (2025).
      An Interpretable Model for Predicting Acute Myocardial Infarction in Distinct Patient Profiles. Studies in Health Technology and Informatics, 327, 452–456. https://doi.org/10.3233/SHTI250378
    2. von Gerichten, J., Saunders, K., Bailey, M. J., Gethings, L. A., Onoja, A., Geifman, N., & Spick, M. (2024).
      Challenges in Lipidomics Biomarker Identification: Avoiding the Pitfalls and Improving Reproducibility. Metabolites, 14(8), 461.
      https://doi.org/10.3390/metabo14080461
    3. Onoja, A., von Gerichten, J., Lewis, H.M., Bailey, M.J., Skene, D.J., Geifman, N., & Spick, M. (2023).
      Meta-Analysis of COVID-19 Metabolomics Identifies Variations in Robustness of Biomarkers. International Journal of Molecular Sciences, 24(18), 14371. https://doi.org/10.3390/ijms241814371
    4. Ramone, T., Romei, C., Ciampi, R., Casalini, R., Valetto, A., Bertini, V., Raimondi, F., Onoja, A., et al. (2023).
      Chromosomal alterations in sporadic medullary thyroid carcinoma and correlation with outcome. Endocrine-Related Cancer, 30(9). https://doi.org/10.1530/ERC-22-0251
    5. Onoja, A., & Raimondi, F. (2023).
      Interpretability from a New Lens: Integrating Stratification and Domain Knowledge for Biomedical Applications. arXiv preprint arXiv:2303.09322.
    6. Onoja, A., Picchiotti, N., Fallerini, C., et al. (2022).
      An Explainable Model of Host Genetic Interactions Linked to COVID-19 Severity. Communications Biology, 5(1), 1133. https://doi.org/10.1038/s42003-022-04073-6
    7. Onoja, A., Ogundare, O.C., & Tejuoso, F.E. (2022). 
      An Optimization of Outpatients’ Waiting Time and Health-Related Risks. Journal of Natural Sciences and Mathematics of UT, 7(13–14), 21–35. https://doi.org/10.21203/rs.3.rs-1565090/v1
    8. Onoja, A., Raimondi, F., & Nanni, M. (2023).
      An Explainable Host Genetic Severity Predictor Model for COVID-19 Patients. medRxiv preprint. https://doi.org/10.1101/2023.03.06.23286869
    9. Onoja, A.A., Babasola, O.L., & Ojiambo, V. (2018).
      Application of Discriminant Analysis in the Classification of Food Security Status. International Journal of Scientific & Engineering Research, 9(2), 1474–1489.
    10. Onoja, A.A., Babasola, O.L., & Ojiambo, V. (2018).
      Chi-Square Automatic Interaction Detection Modeling of the Effects of Social Media Networks on Students' Academic Performance. Journal of Statistics and Mathematical Sciences, 4, 32–39. DOI: 10.9790/487X-2007024351
    11. Babasola, O.L., Irakoze, I., & Onoja, A.A. (2018).
      Valuation of European Options within the Black-Scholes Framework Using the Hermite Polynomial. Journal of Scientific and Engineering Research, 5(2), 200–213.
    12. Onoja, A., Kembe, M.M., Bwebum, C.D., & Obilikwu, P. (2022).
      An Integrated Big Data Model to Salvage Nigeria's Insecurity Challenges. Nigerian Annals of Pure and Applied Sciences, 5(1), 125–136. DOI: 10.5281/zenodo.6962868

    Anthony Onoja, Johanna von Gerichten, Holly-May Lewis, Melanie Jane Bailey, Debra Jean Skene, Nophar Geifman, Matt Spick (2023) Meta-Analysis of COVID-19 Metabolomics Identifies Variations in Robustness of Biomarkers, In: International Journal of Molecular Sciences 24(18) 14371, MDPI

    The global COVID-19 pandemic resulted in widespread harms but also rapid advances in vaccine development, diagnostic testing, and treatment. As the disease moves to endemic status, the need to identify characteristic biomarkers of the disease for diagnostics or therapeutics has lessened, but lessons can still be learned to inform biomarker research in dealing with future pathogens. In this work, we test five sets of research-derived biomarkers against an independent targeted and quantitative Liquid Chromatography-Mass Spectrometry metabolomics dataset to evaluate how robustly these proposed panels would distinguish between COVID-19-positive and negative patients in a hospital setting. We further evaluate a crowdsourced panel comprising the COVID-19 metabolomics biomarkers most commonly mentioned in the literature between 2020 and 2023. The best-performing panel in the independent dataset, measured by F1 score (0.76) and AUROC (0.77), included nine biomarkers: lactic acid, glutamate, aspartate, phenylalanine, β-alanine, ornithine, arachidonic acid, choline, and hypoxanthine. Panels comprising fewer metabolites performed less well, showing weaker statistical significance in the independent cohort than originally reported in their respective discovery studies. Whilst the studies reviewed here were small and may be subject to confounders, it is desirable that biomarker panels be resilient across cohorts if they are to find use in the clinic, highlighting the importance of assessing the robustness and reproducibility of metabolomics analyses in independent populations.
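
    The two metrics quoted above, F1 score and AUROC, can be illustrated with a minimal sketch in plain Python. The labels, scores, and the 0.5 decision threshold below are hypothetical toy values, not data or settings from the study.

```python
def f1_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """AUROC: probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = COVID-19 positive, 0 = negative), not from the study
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(round(f1_score(y_true, y_pred), 3), round(auroc(y_true, scores), 3))
```

    F1 depends on the chosen threshold, whereas AUROC summarises ranking quality across all thresholds, which is one reason the two are often reported together.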

    Johanna von Gerichten, Kyle Saunders, Melanie J. Bailey, Lee A. Gethings, Anthony Onoja, Nophar Geifman, Matt Spick (2024) Challenges in Lipidomics Biomarker Identification: Avoiding the Pitfalls and Improving Reproducibility, In: Metabolites 14(8) 461, MDPI

    Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in biomarker identification. In this work, we illustrate the reproducibility gap for two open-access lipidomics platforms, MS DIAL and Lipostar, finding just 14.0% identification agreement when analyzing identical LC-MS spectra using default settings. Whilst the software platforms performed more consistently using fragmentation data, agreement was still only 36.1% for MS2 spectra. This highlights the critical importance of validation across positive and negative LC-MS modes, as well as the manual curation of spectra and lipidomics software outputs, in order to reduce identification errors caused by closely related lipids and co-elution issues. This curation process can be supplemented by data-driven outlier detection in assessing spectral outputs, which is demonstrated here using a novel machine learning approach based on support vector machine regression combined with leave-one-out cross-validation. These steps are essential to reduce the frequency of false positive identifications and close the reproducibility gap, including between software platforms, which, for downstream users such as bioinformaticians and clinicians, can be an underappreciated source of biomarker identification errors.
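
    The idea of leave-one-out outlier detection mentioned above can be sketched as follows. This is an illustration only: it swaps the support vector machine regression used in the paper for an ordinary least-squares line so that it runs with the standard library alone, and the descriptor values, retention times, and the cutoff factor of 4 are all hypothetical.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x (a simple stand-in for SVR)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def loo_residuals(xs, ys):
    """For each point, fit on all other points and record the absolute held-out error."""
    residuals = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        residuals.append(abs(ys[i] - (a + b * xs[i])))
    return residuals


def flag_outliers(xs, ys, factor=4.0):
    """Flag points whose held-out error exceeds `factor` times the median held-out error."""
    res = loo_residuals(xs, ys)
    cutoff = factor * median(res)
    return [i for i, r in enumerate(res) if r > cutoff]


# Hypothetical data: a lipid descriptor vs. retention time,
# with one inconsistent identification at index 4
descriptor = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0]
retention = [5.0, 6.0, 7.0, 8.0, 14.0, 10.0, 11.0]

print(flag_outliers(descriptor, retention))
```

    Because each point is predicted by a model that never saw it, an inconsistent identification cannot pull the fit towards itself, so its held-out error stands out from the rest.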

    Anthony Onoja, Mary Oyinlade Ejiwale, Ayesan Rewane (2021) Interpretable machine learning approach for predicting COVID-19 risk status of an individual, In: Transactions on Networks and Communications 9(2) pp. 1-14

    This study aimed to ascertain, using statistical feature selection methods and interpretable machine learning models, the best features that predict risk status (“Low”, “Medium”, “High”) for COVID-19 infection. The study used a publicly available dataset obtained via an online web-based risk assessment calculator to ascertain the risk status of COVID-19 infection. Of the 59 features, 57 were retained after filtering for multicollinearity using the Pearson correlation coefficient, and these were further reduced to 55 by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The best classifier's predictions were saved as a new instance and perturbed using a single decision tree classifier. To further build trust in and explainability of the best model, the XGBoost classifier was used as a global surrogate model trained on the predictions of the best model. Individual explanations of the XGBoost model were produced using the SHAP explainable AI framework. The Random Forest classifier, with a validation accuracy of 96.35% on the 55 features retained by feature selection, emerged as the best classifier. The decision tree classifier approximated the best classifier with a prediction accuracy of 92.23% and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier with a prediction score of 99.7%. This study identified COVID-19 positive, COVID-19 contacts, COVID-19 symptoms, Health workers, and Public transport count as the five most consistent features that predict an individual’s risk exposure to COVID-19.
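
    How closely a surrogate approximates a best model, as reported above, can be scored by comparing the two sets of predictions. The sketch below computes simple agreement (fidelity) and the multi-class Matthews correlation coefficient from label counts; the prediction lists are hypothetical and the function names are illustrative, not taken from the paper.

```python
from collections import Counter
from math import sqrt


def fidelity(black_box_preds, surrogate_preds):
    """Share of instances where the surrogate reproduces the black-box prediction."""
    matches = sum(b == s for b, s in zip(black_box_preds, surrogate_preds))
    return matches / len(black_box_preds)


def multiclass_mcc(reference, predicted):
    """Multi-class Matthews correlation coefficient computed from label counts."""
    n = len(reference)
    correct = sum(r == p for r, p in zip(reference, predicted))
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    labels = set(ref_counts) | set(pred_counts)
    cov_rp = correct * n - sum(ref_counts[k] * pred_counts[k] for k in labels)
    cov_rr = n * n - sum(ref_counts[k] ** 2 for k in labels)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in labels)
    if cov_rr == 0 or cov_pp == 0:
        return 0.0  # undefined when either side predicts a single class
    return cov_rp / sqrt(cov_rr * cov_pp)


# Hypothetical predictions: the black-box model vs. its surrogate
black_box = ["Low", "Low", "Medium", "Medium", "High", "High", "High", "Low"]
surrogate = ["Low", "Low", "Medium", "High", "High", "High", "High", "Low"]

print(fidelity(black_box, surrogate), round(multiclass_mcc(black_box, surrogate), 3))
```

    Unlike plain agreement, the Matthews coefficient accounts for how the classes are distributed, so it is less flattering when one risk class dominates.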

    Anthony Onoja, Francesco Raimondi, Mirco Nanni (2023) An Explainable Host Genetic Severity Predictor Model for COVID-19 Patients, In: medRxiv, Cold Spring Harbor Laboratory Press

    Understanding COVID-19 severity and why it differs significantly among patients is a concern for the scientific community. The major contribution of this study is a voting ensemble host genetic severity predictor (HGSP) model, which we developed by combining several state-of-the-art machine learning algorithms (decision tree-based models: Random Forest and XGBoost classifiers). These models were trained on a genetic Whole Exome Sequencing (WES) dataset and clinical covariates (age and gender), using a 5-fold stratified cross-validation strategy to randomly split the dataset and overcome model instability. Our study validated the HGSP model based on the 18 features (i.e., 16 identified candidate genetic variants and 2 covariates) identified in a prior study. We provided post-hoc model explanations through ExplainerDashboard, an open-source Python library, allowing for deeper insight into the prediction results. We applied the Enrichr and OpenTarget genetics interactive bioinformatics tools to associate the genetic variants with plausible biological insights and domain interpretations such as pathways, ontologies, and diseases/drugs. Through unsupervised clustering of the SHAP feature importance values, we visualized the complex genetic mechanisms. Our findings show that while age and gender mainly influence COVID-19 severity, a specific group of patients experiences severity due to complex genetic interactions.
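
    A voting ensemble of the kind described above combines the outputs of several classifiers into one prediction. The abstract does not say whether hard or soft voting was used; the sketch below assumes soft voting (averaging class probabilities), and the probability values and the `soft_vote` helper are hypothetical.

```python
def soft_vote(model_probs, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    model_probs: list of dicts mapping class label -> probability, one dict per model.
    Returns the winning label and the averaged probabilities.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    labels = model_probs[0].keys()
    averaged = {
        label: sum(w * probs[label] for w, probs in zip(weights, model_probs)) / total
        for label in labels
    }
    winner = max(averaged, key=averaged.get)
    return winner, averaged


# Hypothetical outputs of two tree-based models for a single patient
random_forest = {"severe": 0.40, "mild": 0.60}
xgboost = {"severe": 0.75, "mild": 0.25}

label, probs = soft_vote([random_forest, xgboost])
print(label, probs)
```

    In this toy case the two models disagree, and the more confident one decides the outcome; with hard voting the result would instead be a tie that needs a tie-breaking rule.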

    Teresa Ramone, Cristina Romei, Raffaele Ciampi, Roberta Casalini, Angelo Valetto, Veronica Bertini, Francesco Raimondi, Anthony Onoja, Alessandro Prete, Antonio Matrone, Carla Gambale, Paolo Piaggi, Liborio Torregrossa, Clara Ugolini, Rossella Elisei (2023) Chromosomal alterations in sporadic medullary thyroid carcinoma and correlation with outcome, In: Endocrine-Related Cancer

    Somatic Copy Number Alterations (SCNA) involving either a whole chromosome, just one of its arms, or even smaller parts have been described in about 88% of human tumors. This study investigated the SCNA profile in 40 well-characterized sporadic medullary thyroid carcinomas (MTC) by comparative genomic hybridization array. We found that 26/40 (65%) cases had at least one SCNA. The prevalence of SCNA, and in particular of chromosomes 3 and 10, was significantly higher in cases with a RET somatic mutation. Similarly, SCNA of chromosomes 3, 9, 10 and 16 were more frequent in cases with a worse outcome and advanced disease. By pathway enrichment analysis, we found a mutually exclusive distribution of biological pathways in metastatic, biochemically persistent and cured patients. In particular, we found gain of regions involved in intracellular signaling and loss of regions involved in DNA repair and TP53 pathways in the group of metastatic patients. Gain of regions involved in cell cycle and senescence was observed in patients with biochemical disease. Finally, gain of regions associated with the immune system and loss of regions involved in the apoptosis pathway were observed in cured patients, suggesting a role of specific SCNA and the corresponding altered pathways in the outcome of sporadic MTC.

    Anthony Onoja, Francesco Raimondi (2023) Interpretability from a New Lens: Integrating Stratification and Domain Knowledge for Biomedical Applications, In: arXiv preprint arXiv:2303.09322

    The use of machine learning (ML) techniques in the biomedical field has become increasingly important, particularly with the large amounts of data generated in the aftermath of the COVID-19 pandemic. However, due to the complex nature of biomedical datasets and the use of black-box ML models, a lack of trust and adoption by domain experts can arise. In response, interpretable ML (IML) approaches have been developed, but the curse of dimensionality in biomedical datasets can lead to model instability. This paper proposes a novel computational strategy that stratifies biomedical datasets into k-fold cross-validation (CV) splits and integrates domain knowledge interpretation techniques into current state-of-the-art IML frameworks. This approach can improve model stability, establish trust, and provide explanations for outcomes generated by trained IML models. Specifically, the model outcome, such as aggregated feature weight importance, can be linked to further domain knowledge interpretations using techniques like pathway functional enrichment, drug targeting, and repurposing databases. Additionally, involving end-users and clinicians in focus group discussions before and after the choice of IML framework can help guide testable hypotheses, improve performance metrics, and build trustworthy and usable IML solutions in the biomedical field. Overall, this study highlights the potential of combining advanced computational techniques with domain knowledge interpretation to enhance the effectiveness of IML solutions in the context of complex biomedical datasets.
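
    The stratification into k folds and the aggregation of feature importance described above can be sketched in plain Python. This is a simplified illustration rather than the paper's implementation: indices are dealt round-robin without shuffling so the result is deterministic, and the labels, feature names, and importance values are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, preserving class proportions.

    Indices of each class are dealt round-robin across folds, so every fold
    receives a near-equal share of every class. (A real pipeline would usually
    shuffle within each class first.)
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def aggregate_importance(per_fold_importance):
    """Average feature importances over folds (a feature absent from a fold counts as 0)."""
    totals = defaultdict(float)
    for importance in per_fold_importance:
        for feature, value in importance.items():
            totals[feature] += value
    n_folds = len(per_fold_importance)
    return {feature: total / n_folds for feature, total in totals.items()}


# Hypothetical labels: 6 "severe" and 9 "mild" samples split into 3 folds
labels = ["severe"] * 6 + ["mild"] * 9
folds = stratified_folds(labels, k=3)

# Hypothetical importances from models trained on two different folds
mean_importance = aggregate_importance([
    {"age": 0.6, "variant_A": 0.4},
    {"age": 0.8, "variant_A": 0.2},
])
print(folds, mean_importance)
```

    Averaging importance over folds dampens the effect of any single unlucky split, which is the stability argument made in the abstract.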

    Anthony Onoja, Nicola Picchiotti, Chiara Fallerini, Margherita Baldassarri, Francesca Fava, Francesca Colombo, Francesca Chiaromonte, Alessandra Renieri, Simone Furini, Francesco Raimondi (2022) An explainable model of host genetic interactions linked to COVID-19 severity, In: Communications Biology 5(1) 1133

    We employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We coupled a stratified k-fold screening, to rank variants more associated with severity, with the training of multiple supervised classifiers, to predict severity based on screened features. Feature importance analysis from tree-based models allowed us to identify 16 variants with the highest support which, together with age and gender covariates, were found to be most predictive of COVID-19 severity. When tested on a follow-up cohort, our ensemble of models predicted severity with high accuracy (ACC = 81.88%; AUCROC = 96%; MCC = 61.55%). Our model recapitulated a vast literature of emerging molecular mechanisms and genetic factors linked to COVID-19 response and extends previous landmark Genome-Wide Association Studies (GWAS). It revealed a network of interplaying genetic signatures converging on established immune system and inflammatory processes linked to viral infection response. It also identified additional processes cross-talking with immune pathways, such as GPCR signaling, which might offer additional opportunities for therapeutic intervention and patient stratification. Publicly available PheWAS datasets revealed that several variants were significantly associated with phenotypic traits such as "Respiratory or thoracic disease", supporting their link with COVID-19 severity outcome.