Description
Background: Post-COVID Condition (PCC) affects a substantial proportion of individuals following SARS-CoV-2 infection, and the mechanisms driving symptom persistence remain an area of active research. Identifying risk factors associated with PCC development is important for targeted prevention strategies and clinical management. Machine learning (ML) models offer powerful tools for prediction in epidemiological settings, but understanding which features drive these predictions requires careful interpretation. As part of the RESOLVE-PCC project, we apply model-agnostic feature importance methods to identify and characterize risk factors for PCC from data of the German National Cohort (NAKO).
Methods: We distinguish unconditional association (whether a feature relates to PCC at all) from conditional association (whether a feature provides unique predictive information given the other features), and we employ complementary feature importance methods that target each. For unconditional associations we use permutation feature importance (PFI) and Shapley additive global importance (SAGE) values. For conditional associations we use conditional feature importance (CFI), conditional SAGE values, and leave-one-covariate-out (LOCO), which refits the model without the feature of interest. To enable this analysis, we developed xplainfi, a new R package that implements these feature importance methods and integrates natively with the mlr3 machine learning framework. The package includes multiple approaches for conditional feature importance not previously available in R, supporting both conditional sampling strategies and flexible model refitting approaches.
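To make the distinction between unconditional and conditional importance concrete, the following is a minimal sketch on simulated data, written in plain Python rather than with xplainfi, whose API is not shown here. Everything in it is an assumption for illustration: the data-generating process, the hand-specified `model` function standing in for a fitted learner, and the conditional sampler, which is known exactly only because the data are simulated. The feature `x2` is a noisy copy of `x1`, so it carries no information about `y` beyond what `x1` provides.

```python
import random


def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def model(x1, x2):
    # Hypothetical stand-in for a fitted model that splits its weight
    # across two strongly correlated features (an assumption, not a fit).
    return 0.5 * x1 + 0.5 * x2


rng = random.Random(0)
n = 2000

# Simulated data: x2 is a proxy for x1; y depends on x1 only.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]
y = [a + rng.gauss(0, 0.5) for a in x1]

baseline = mse(y, [model(a, b) for a, b in zip(x1, x2)])

# PFI: permute x2 marginally, which breaks its dependence on x1 and y.
x2_perm = x2[:]
rng.shuffle(x2_perm)
pfi_x2 = mse(y, [model(a, b) for a, b in zip(x1, x2_perm)]) - baseline

# CFI: resample x2 from its conditional distribution given x1.
# The sampler is exact here only because we simulated the data;
# in practice it has to be estimated.
x2_cond = [a + rng.gauss(0, 0.1) for a in x1]
cfi_x2 = mse(y, [model(a, b) for a, b in zip(x1, x2_cond)]) - baseline

print(f"baseline MSE: {baseline:.3f}")
print(f"PFI(x2): {pfi_x2:.3f}")
print(f"CFI(x2): {cfi_x2:.3f}")
```

In this setup the marginal permutation raises the loss substantially, whereas the conditional resampling leaves it nearly unchanged. This is the pattern expected for a feature that is associated with the outcome but adds no unique predictive information given the other features.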
Results: By comparing results across methods, we distinguish features that are merely correlated with other risk factors (high PFI but low CFI) from those that provide independent predictive value. This differentiation has epidemiological implications: features showing only unconditional associations may be proxies for unmeasured factors, while conditionally associated features could represent potentially modifiable risk factors or targets for intervention. Our analysis framework addresses practical challenges in epidemiological ML applications, including handling mixed-type data and quantifying uncertainty in importance estimates.
Conclusions: Feature importance methods provide interpretable insights into the complex etiology of PCC by revealing different types of feature-target relationships beyond simple prediction performance. We demonstrate how applying model-agnostic interpretability techniques can support epidemiological inference from ML models, helping to bridge the gap between predictive modeling and mechanistic understanding. The xplainfi package gives researchers accessible tools for conducting such analyses in R, which is particularly valuable for epidemiological studies that require rigorous feature importance assessment. The methodological framework developed in this project contributes to ongoing RESOLVE-PCC research efforts and provides generalizable approaches for risk factor identification in other complex health outcomes.