18–21 May 2026
Europe/Warsaw timezone

MACHINE LEARNING ALGORITHM TO PREDICT BIOMARKER LEVELS USING METABOLOMICS DATA

19 May 2026, 14:03
18m
Room 14

Speaker

Fabiola Del Greco M. (Institute of Biomedicine - Eurac research)

Description

Introduction
Metabolomics measures small molecules (metabolites) in cells, tissues, and biofluids that represent intermediates and/or end-products of biochemical and cellular processes. As a result, metabolomics has proven useful for predicting disease risk or associated biomarkers. Given the size and complexity of the data, machine learning (ML) is an appropriate statistical and computational tool for building such predictive models.
Through a data-driven investigation, our goal is to provide a supervised ML workflow that improves the prediction of disease-related biomarkers from targeted metabolomics data in a general population. In particular, we identify and address ML issues that might affect the results.
Method
We explore two algorithms based on feature selection, paying attention to the high dimensionality of the feature space under different degrees of correlation (low, moderate, high) and considering both sparse and dense modeling frameworks. Using a real-data example and a set of simulated scenarios that mirror it, we compare: (1) a three-stage algorithm, consisting of univariable selection of associated metabolites based on Bonferroni-corrected p-values, followed by collinearity reduction (Variance Inflation Factor ≤ 5) and multivariable feature selection with popular regularized methods (Lasso, Ridge, Elastic net); (2) a one-stage algorithm, consisting solely of multivariable regularized regressions. We evaluate their predictive performance in terms of estimation accuracy, feature selection stability, and generalization error. The real-data example includes 172 targeted serum metabolites measured by liquid chromatography (LC)–electrospray ionization–tandem MS and flow injection electrospray ionization–tandem MS profiling (AbsoluteIDQ p180 kit, Biocrates Life Sciences AG), and the levels of four iron-related biomarkers, from a subsample of the Cooperative Health Research In South Tyrol (CHRIS) study with around 5,000 participants.
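As an illustration only, the three-stage algorithm described above could be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the normal approximation to the univariable test, the hand-written coordinate-descent Lasso, the penalty value `lam`, and the simulated data are all hypothetical choices made for the example.

```python
import math
import numpy as np


def bonferroni_screen(X, y, alpha=0.05):
    """Stage 1: keep features whose univariable association with y survives
    Bonferroni correction (assumption: normal approximation to the t-test)."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    r = Xs.T @ ys / n                                  # Pearson correlations
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))          # test statistics
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])
    return np.flatnonzero(pvals < alpha / p)


def vif_filter(X, cols, threshold=5.0):
    """Stage 2: drop the feature with the largest VIF until all VIFs are
    at or below the threshold. VIFs are the diagonal of the inverse
    correlation matrix."""
    cols = list(cols)
    while len(cols) > 1:
        corr = np.corrcoef(X[:, cols], rowvar=False)
        vif = np.diag(np.linalg.inv(corr))
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        cols.pop(worst)
    return np.array(cols, dtype=int)


def lasso_cd(X, y, lam, n_iter=200):
    """Stage 3: Lasso by cyclic coordinate descent on standardized features,
    minimizing (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    resid = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid += Xs[:, j] * beta[j]                # remove feature j's contribution
            rho = Xs[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-thresholding
            resid -= Xs[:, j] * beta[j]
    return beta


def three_stage(X, y, lam=0.05):
    """Run the three stages in sequence and return selected columns and coefficients."""
    screened = bonferroni_screen(X, y)
    kept = vif_filter(X, screened)
    beta = lasso_cd(X[:, kept], y, lam)
    return kept[beta != 0], beta[beta != 0]


# Hypothetical simulated data: feature 1 is a near-copy of feature 0,
# and only features 0 and 5 drive the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)
y = 2.0 * X[:, 0] + 1.5 * X[:, 5] + rng.normal(size=500)

selected, coefs = three_stage(X, y)
print("selected columns:", selected)
```

In this toy setting the collinearity step keeps only one of the two near-duplicate features before the Lasso is fitted, which is the motivation for placing it between the univariable screen and the regularized regression. The one-stage alternative would correspond to calling `lasso_cd` directly on all columns.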
Results
In both the real-data example and the simulations, the three-stage ML algorithm yields more accurate predictions, with a lower loss function value and a lower root mean square error. For feature selection, Variance Inflation Factor reduction followed by Lasso regression is relatively stable across scenarios with different degrees of correlation. The modelling framework (sparse or dense) does not affect performance.
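The abstract does not state how feature selection stability was quantified. One common option, shown here purely as an assumption, is the mean pairwise Jaccard similarity between the feature sets selected on different resamples or scenarios. The metabolite labels and selection runs below are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of selected features."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Mean pairwise Jaccard similarity across repeated selection runs."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resamples of the same data.
runs = [{"Gly", "Ser", "C0"}, {"Gly", "Ser"}, {"Gly", "Ser", "C0"}]
stability = selection_stability(runs)
print(round(stability, 3))
```

A value near 1 means almost the same metabolites are selected every time, while a value near 0 means the selected subsets barely overlap.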
Conclusions
This workflow should provide a robust strategy for integrating metabolomics data (or any similarly large, complex dataset) into an ML algorithm to improve prediction. Such integration supports earlier and more precise disease diagnosis and enables better tailoring of therapies through more reliable patient monitoring, both of which are central to precision medicine. Given the cost and effort required to measure metabolites, it is useful to select a subset of them without reducing predictive performance.

Author

Fabiola Del Greco M. (Institute of Biomedicine - Eurac research)

Co-authors

Daniele Giardiello (School of Medicine and Surgery, University of Milano-Bicocca)
Fabiola Signorini (Laboratory of Clinical Epidemiology, Istituto Di Ricerche Farmacologiche Mario Negri IRCCS)
