Description
Introduction
Metabolomics measures small molecules (metabolites) in cells, tissues, and biofluids that represent intermediates and/or end-products of biochemical and cellular processes. As a result, metabolomics has proven useful for predicting disease risk and identifying associated biomarkers. Given the complexity and size of the data, machine learning (ML) offers appropriate statistical and computational tools for building such predictive models.
Through a data-driven investigation, we aim to provide a supervised ML workflow that improves the prediction of disease-related biomarkers from targeted metabolomics data collected in a general population. In particular, we identify and address ML issues that may affect the results.
Method
We explore two algorithms based on feature selection, focusing on high-dimensional feature spaces with different degrees of correlation (low, moderate, high) and considering both sparse and dense modeling frameworks. Using a real data example and a set of simulated scenarios that mirror it, we compare: (1) a three-stage algorithm, consisting of a univariable selection of associated metabolites based on Bonferroni-corrected p-values, followed by collinearity reduction (Variance Inflation Factor ≤5) and multivariable feature selection using popular regularized methods (Lasso, Ridge, Elastic net); and (2) a one-stage algorithm, consisting solely of multivariable regularized regressions. We evaluate their predictive performance in terms of estimation accuracy, feature selection stability, and generalization error. The real data example includes 172 targeted serum metabolites measured by liquid chromatography (LC)–electrospray ionization–tandem MS and flow injection electrospray ionization–tandem MS profiling (AbsoluteIDQ p180 kit, Biocrates Life Sciences AG), together with the levels of four iron-related biomarkers, from a subsample of the Cooperative Health Research In South Tyrol (CHRIS) study with around 5,000 participants.
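The three-stage algorithm described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the univariable test, so a Pearson correlation test is assumed; the collinearity step is assumed to drop the feature with the largest Variance Inflation Factor one at a time; and the Lasso penalty is assumed to be chosen by 5-fold cross-validation. All function names are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def univariable_filter(X, y, alpha=0.05):
    """Stage 1: keep features whose univariable p-value passes Bonferroni.

    Assumption: a Pearson correlation test stands in for the univariable test.
    """
    n_features = X.shape[1]
    pvals = np.empty(n_features)
    for j in range(n_features):
        _, pvals[j] = stats.pearsonr(X[:, j], y)
    return np.flatnonzero(pvals < alpha / n_features)


def vif(X):
    """VIF of each column, computed as the diagonal of the inverse correlation matrix."""
    corr = np.atleast_2d(np.corrcoef(X, rowvar=False))
    # A tiny ridge keeps the inverse defined when columns are exactly collinear.
    corr = corr + 1e-8 * np.eye(corr.shape[0])
    return np.diag(np.linalg.inv(corr))


def vif_filter(X, cols, threshold=5.0):
    """Stage 2: drop the feature with the largest VIF until all VIFs are <= threshold."""
    cols = list(cols)
    while len(cols) > 1:
        v = vif(X[:, cols])
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        cols.pop(worst)
    return np.array(cols, dtype=int)


def three_stage_lasso(X, y, alpha=0.05, vif_threshold=5.0, seed=0):
    """Run the three stages and return (selected column indices, fitted model)."""
    Xs = StandardScaler().fit_transform(X)
    keep = univariable_filter(Xs, y, alpha)
    if keep.size == 0:
        raise ValueError("no feature passed the univariable stage")
    keep = vif_filter(Xs, keep, vif_threshold)
    # Stage 3: multivariable selection with a cross-validated Lasso penalty.
    model = LassoCV(cv=5, random_state=seed).fit(Xs[:, keep], y)
    return keep[model.coef_ != 0], model
```

The one-stage comparator would correspond to fitting the regularized regression directly on all standardized features, skipping the first two stages. Ridge or Elastic net could be substituted in the third stage, with the caveat that Ridge does not set coefficients exactly to zero.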
Results
In both the real data example and the simulations, the three-stage ML algorithm yields more accurate predictions, with a lower loss function value and a lower root mean square error. For feature selection, Variance Inflation Factor reduction followed by Lasso regression is relatively stable across scenarios with varying degrees of correlation. The modeling framework (sparse or dense) does not affect performance.
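The two kinds of summary mentioned here, prediction error and selection stability, can be computed along the following lines. The abstract does not say how stability was quantified, so the mean pairwise Jaccard index over feature sets selected on resampled data is an assumption made only for illustration; the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def jaccard(a, b):
    """Jaccard index of two selected-feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selected_sets):
    """Mean pairwise Jaccard index across feature sets from repeated fits.

    Assumption: each element of `selected_sets` holds the features selected
    on one resample (for example, one bootstrap sample) of the data.
    """
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two selected-feature sets")
    return float(np.mean([jaccard(a, b) for a, b in pairs]))
```

A stability value close to 1 indicates that nearly the same metabolites are selected on every resample, while a value close to 0 indicates that the selection changes substantially with the data.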
Conclusions
This workflow should provide a robust strategy for integrating metabolomics, or any similarly large and complex dataset, into an ML algorithm to improve prediction. Such integration supports earlier and more precise disease diagnosis and enables better tailoring of therapies through more reliable patient monitoring, both of which are central to precision medicine. Given the cost and effort required to measure metabolites, it is useful to select a subset without reducing predictive performance.