6th CEN Conference

Europe/Warsaw
Description

CEN2026 is a joint meeting of the Austro-Swiss Region (ROeS), the German Region (DR), the Polish Region of the International Biometric Society (IBS) and the Polish Biometric Society.

This year’s theme — “Power of Data — Shaping the Future of Life Sciences” — highlights the growing importance of data in transforming and advancing the life sciences. As access to vast amounts of information becomes increasingly common, the ability to interpret and apply it effectively is essential for informed decision-making that influences human health, quality of life, and the environment.

The CEN2026 conference will provide a unique platform to present the latest achievements in biometrics and its applications in the life sciences, including biostatistics, bioinformatics, medicine, pharmacology, pharmaceutical research and development, environmental statistics, genomics, machine learning, and digital medical technologies. We will focus on how data shapes clinical practice, drug development, environmental risk assessment, and population health strategies.

The event will be devoted to the methodological aspects and the practical implementation of statistical methods in the life sciences. An essential objective of the conference is to promote innovative solutions that improve the practical usefulness of scientific research.

    • 09:00 10:30
      A practical introduction to simulating complex trial designs: Full-day course - part 1 Room 13 B

    • 09:00 10:30
      An introduction to Bayesian nonparametrics for causal inference: Full-day course - part 1 Room 14

    • 09:00 10:30
      Bayesian borrowing in clinical trials: design choices, assessment of operating characteristics and reporting: Half-day course - part 1 Room 13 A

    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:30
      A practical introduction to simulating complex trial designs: Full-day course - part 2 Room 13 B

    • 11:00 12:30
      An introduction to Bayesian nonparametrics for causal inference: Full-day course - part 2 Room 14

    • 11:00 12:30
      Bayesian borrowing in clinical trials: design choices, assessment of operating characteristics and reporting: Half-day course - part 2 Room 13 A

    • 12:30 14:00
      Lunch 1h 30m
    • 14:00 15:30
      A practical introduction to simulating complex trial designs: Full-day course - part 3 Room 13 B

    • 14:00 15:30
      An introduction to Bayesian nonparametrics for causal inference: Full-day course - part 3 Room 14

    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:30
      A practical introduction to simulating complex trial designs: Full-day course - part 4 Room 13 B

    • 16:00 17:30
      An introduction to Bayesian nonparametrics for causal inference: Full-day course - part 4 Room 14

    • 08:30 09:00
      Opening Ceremony Room 1 A

    • 09:00 10:00
      P1 Plenary Session: Tim Morris: Keynote lecture Room 1 A

      Convener: Anne-Laure Boulesteix (LMU Munich)
      • 09:00
        Simulation studies: bringing method to the madness 1h

        Simulation studies involve drawing random numbers to understand the properties and behaviour of statistical methods. Statisticians have been using simulation studies since before computers existed (e.g. ‘Student’ in 1908). However, when it comes to simulation studies, we are largely self-taught. It is often hard to understand a simulation study, or even its objective. Indeed, the rationale for many simulation studies seems to be ‘that is what other people do’. With the above definition of a simulation study (rather than statistical simulation more generally), we can see a common structure underlying simulation studies across disparate settings. This talk will sketch the structure before diving into some interesting issues, including choices around data generation, the interplay between mathematical results and simulation studies’ results, searching for truth, the contribution of simulation results to methods' fitness-for-use, and reporting of complex simulation studies.

        Speaker: Tim Morris (Novartis Pharmaceuticals UK Ltd)
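
As a minimal illustration of the kind of simulation study the abstract describes (it is not taken from the talk), the sketch below repeatedly draws normal samples and estimates the coverage of a nominal 95% confidence interval for the mean, together with its Monte Carlo standard error. The data-generating mechanism, the sample size, the number of repetitions and the known-variance z-interval are all assumptions chosen for brevity.

```python
import math
import random
import statistics


def coverage_of_z_interval(n_obs=30, n_sim=2000, true_mean=0.0, sd=1.0, seed=1):
    """Estimate the coverage of the known-variance 95% z-interval for a mean."""
    rng = random.Random(seed)
    z = 1.959964  # 97.5% quantile of the standard normal distribution
    half_width = z * sd / math.sqrt(n_obs)
    hits = 0
    for _ in range(n_sim):
        # Data generation: independent normal observations (an assumption).
        sample = [rng.gauss(true_mean, sd) for _ in range(n_obs)]
        estimate = statistics.fmean(sample)
        # The interval covers the truth when the estimate is within half_width.
        if abs(estimate - true_mean) <= half_width:
            hits += 1
    coverage = hits / n_sim
    # Monte Carlo standard error of the estimated coverage.
    mcse = math.sqrt(coverage * (1 - coverage) / n_sim)
    return coverage, mcse


if __name__ == "__main__":
    cov, mcse = coverage_of_z_interval()
    print(f"coverage = {cov:.3f} (MCSE {mcse:.3f})")
```

Reporting the Monte Carlo standard error alongside the performance measure makes it easier to judge whether the number of repetitions was adequate.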
    • 10:00 10:45
      Coffee break / Poster session 45m Foyer

    • 10:00 17:15
      Poster session x Poster display area

      • 10:00
        [01] A hybrid nonparametric framework for outlier detection in functional time series 5m

        Outlier detection in functional time series is challenging due to temporal dependence and the coexistence of magnitude, shape, and partially contaminated anomalies. Existing methods often assume independence or rely on model-based approaches, such as the Standard Smoothed Bootstrap on Residuals (SmBoR), which may perform poorly under model misspecification. Model-free alternatives, such as the Moving Block Bootstrap (MBBo), improve robustness but may show modest true positive rates for magnitude anomalies. This work proposes a fully model-free pipeline with two components. First, the Directional Outlyingness (DirOut) framework is extended by recalibrating its cutoff via MBBo, improving detection of shape and partial outliers while controlling false positives. Second, a Sliding Window Functional Boxplot (SWOD) is introduced to exploit local temporal neighborhoods and detect magnitude anomalies that global summaries may miss. Simulations show that SWOD achieves high detection rates for magnitude outliers, while MBBo-calibrated DirOut attains near-perfect detection for shape and partial anomalies, outperforming SmBoR. The approach is further validated on a real temperature dataset, demonstrating its practical effectiveness.

        Speaker: David Solano (Research Group on Statistics, Econometrics and Health (GRECS). University of Girona)
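
The moving block bootstrap mentioned in the abstract can be sketched as follows. This is a generic version under stated assumptions: each element of `series` stands for one observation (for functional data, one curve or its outlyingness score). The abstract does not say how the DirOut cutoff is recalibrated, so `bootstrap_cutoff` uses a hypothetical rule, the median over resamples of an upper quantile, purely as a placeholder.

```python
import random


def moving_block_bootstrap(series, block_length, rng):
    """Return one moving-block-bootstrap resample of `series`.

    Overlapping blocks of `block_length` consecutive observations are drawn
    with replacement and concatenated until the original length is reached,
    so short-range temporal dependence is preserved within blocks.
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    n_starts = n - block_length + 1  # number of admissible block start indices
    resample = []
    while len(resample) < n:
        start = rng.randrange(n_starts)
        resample.extend(series[start:start + block_length])
    return resample[:n]


def bootstrap_cutoff(scores, block_length, n_boot=500, quantile=0.99, seed=0):
    """Hypothetical calibration rule: median of bootstrap upper quantiles."""
    rng = random.Random(seed)
    upper = []
    for _ in range(n_boot):
        boot = sorted(moving_block_bootstrap(scores, block_length, rng))
        index = min(len(boot) - 1, int(quantile * len(boot)))
        upper.append(boot[index])
    upper.sort()
    return upper[len(upper) // 2]
```

The block length controls how much dependence is retained; choosing it is a separate tuning question that this sketch does not address.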
      • 10:05
        [02] Adjust for Positional Uncertainty in Spatial Modeling 5m

        Spatial analyses in epidemiology often rely on accurate geolocation of individuals to estimate spatially structured health outcomes. However, routinely collected surveillance data frequently lack precise residential coordinates, introducing positional uncertainty that can bias spatial inference. This study examines the impact of uncertainty in patient location on the estimated spatial distribution of COVID-19 vaccination.
        Because individual-level residential coordinates were unavailable, we implemented a probabilistic geolocation imputation strategy based on the hierarchical distribution of known administrative units (e.g., neighbourhoods). For each individual, a random spatial location was sampled from a density function defined by population-weighted spatial priors. To propagate uncertainty into the estimation process, the imputed coordinates were integrated within a Bayesian spatial modelling framework using the Stochastic Partial Differential Equation (SPDE) approach [1] implemented through the Integrated Nested Laplace Approximation (INLA) method [2].
        We quantified the impact of positional uncertainty by comparing posterior estimates of vaccination odds surfaces under multiple imputation replicates and uncertainty-weighted spatial priors. The results indicate that ignoring location uncertainty leads to spatial over-smoothing and attenuated spatial gradients in vaccination odds, particularly in high-density urban areas. Incorporating imputation uncertainty within the SPDE-INLA framework yielded more conservative and stable posterior estimates, improving spatial risk characterisation while preserving credible interval coverage.
        This work highlights the importance of formally accounting for spatial uncertainty in epidemiological modelling when exact geocoding is incomplete or unavailable. The proposed framework provides a generalizable Bayesian approach for integrating positional uncertainty into spatial health models, enhancing the robustness and interpretability of public health spatial analyses.

        Speaker: Manuel Moreno (Universitat de Girona)
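
The imputation step described above can be illustrated with a small sketch. It assumes that each administrative unit is tiled by rectangular grid cells with known population counts. The cell layout, the tuple format and the function names are hypothetical, and the SPDE-INLA model that would be refitted on each replicate is not shown.

```python
import random


def impute_location(cells, rng):
    """Sample one point for an individual known only to live in a given unit.

    `cells` is a list of (xmin, ymin, xmax, ymax, population) tuples tiling
    the unit. A cell is chosen with probability proportional to its
    population, then a point is drawn uniformly inside that cell.
    """
    weights = [cell[4] for cell in cells]
    xmin, ymin, xmax, ymax, _ = rng.choices(cells, weights=weights, k=1)[0]
    return rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)


def imputation_replicates(cells, n_replicates, seed=0):
    """Draw several imputed locations, one per imputation replicate."""
    rng = random.Random(seed)
    return [impute_location(cells, rng) for _ in range(n_replicates)]
```

Fitting the spatial model once per replicate and combining the posteriors is one way to carry the positional uncertainty through to the final estimates.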
      • 10:10
        [03] An Evaluation Metric for Detection-Based Object Counting 5m

        In automatic object detection, reliably counting objects remains challenging, particularly in scenarios with densely packed objects, overlapping instances, large scene variability, or multi-class cases.
        Common evaluation metrics for object detection are based on Intersection over Union (IoU) and do not directly measure the correctness of the number of detected objects. Consequently, a model may achieve high detection scores while substantially overestimating or underestimating the true object count. This discrepancy limits the usefulness of standard detection metrics in applications where accurate quantitative assessment is crucial.
        Several approaches for automatic object counting have been proposed in the literature, including post-processing of detector outputs (Chattopadhyay et al., 2017), density-map regression (Sindagi & Patel, 2018), and hybrid detection–regression methods (Liu et al., 2019). Each of these approaches comes with specific requirements regarding data volume, annotation detail, or robustness to detector inaccuracies.
        Our work was motivated by practical challenges encountered during the analysis of microscopic images of pollen grains from selected plant taxa. While the detector used in the study produced high-quality detections according to standard metrics, it did not provide a reliable estimate of the actual number of grains. The gap between seemingly strong detection performance and inaccurate object counts highlighted a methodological limitation and inspired the development of a more robust metric for counting evaluation.
        Therefore, we propose evaluation metrics that take into account the detector’s confidence score and cases where multiple labels have been assigned to a single object. Our method may be particularly useful in applications where false positive counts should be limited. The work includes a case study comparing the proposed approach to traditional evaluation metrics.
        References:
        • Chattopadhyay, P., Vedantam, R., Selvaraju, R. R., Batra, D., & Parikh, D. (2017). Counting Everyday Objects in Everyday Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1135–1144).
        • Liu, L., Jiang, J., Jia, W., Amirgholipour, S., Zeibots, M., & He, X. (2019). DENet: A Universal Network for Counting Crowd with Varying Densities and Scales. arXiv preprint.
        • Sindagi, V. A., & Patel, V. M. (2018). A Survey of Recent Advances in CNN‑Based Single Image Crowd Counting and Density Estimation. Pattern Recognition Letters, 107, 3–16.

        Speaker: Agnieszka Kubik-Komar (University of Life Sciences in Lublin)
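
To make the gap between IoU-based matching and counting concrete, the sketch below computes IoU for axis-aligned boxes and summarises how duplicate detections of one object inflate the predicted count. It is not the metric proposed in the abstract, which is not specified there. The box format `(x1, y1, x2, y2)` and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_summary(truth, detections, iou_threshold=0.5):
    """Compare the true and predicted object counts.

    Every detection is matched to its best-overlapping true box. A second
    detection matched to an already-found object is counted as a duplicate.
    """
    found = set()
    duplicates = 0
    for det in detections:
        best = max(range(len(truth)), key=lambda i: iou(truth[i], det), default=None)
        if best is not None and iou(truth[best], det) >= iou_threshold:
            if best in found:
                duplicates += 1
            found.add(best)
    return {
        "true_count": len(truth),
        "predicted_count": len(detections),
        "objects_found": len(found),
        "duplicates": duplicates,
    }
```

In this toy setting every true object can be found with a high IoU while the predicted count still exceeds the true count, which is the discrepancy the abstract points to.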
      • 10:15
        [04] Bridging Data Gaps for Better Groundwater Protection: Advanced Imputation Meets Spatial Modelling 5m

        Missing data is one of the most persistent challenges in environmental monitoring, undermining the reliability of analyses and limiting effective resource management. This issue is particularly critical under European regulations such as the Nitrates Directive (91/676/EEC), which requires accurate monitoring of nitrate concentrations in groundwater to protect ecosystems and public health. Yet, frequent gaps in monitoring data make compliance and risk assessment challenging.
        Although numerous imputation techniques exist, there is no clear consensus on how to evaluate their quality. To address this, we adopted a comprehensive framework that goes beyond simple error metrics. We examined the effect of imputation on the data structure by comparing the distributions of the percentage of stations that exceed nitrate thresholds before and after gap-filling. Marginal distributions were assessed using estimated density functions, with agreement quantified through divergence measures such as Hellinger and Kullback–Leibler distances.
        Instead of relying on global linear correlation, which often fails to capture complex dependencies, we applied monotonic dependence measures. The generalised Lorenz curve provided a robust tool for revealing dependence patterns, thereby offering deeper insight into how imputation reshapes relationships within the data.
        Using two decades of national groundwater monitoring data (2000–2022), we tested six gap-filling strategies, spanning geostatistical methods and modern predictive algorithms. Our findings reveal that indicator kriging consistently outperforms other approaches, preserving spatial patterns and enabling reliable risk forecasting. By leveraging spatial relationships and introducing advanced evaluation tools, this approach strengthens predictive models and supports more informed decision-making.
        Why does this matter? Reliable data is the foundation of effective water management. Addressing missing information and improving evaluation practices can enhance compliance with EU directives, optimise monitoring systems, and protect communities from nitrate pollution.
        (Analyses were performed in R 4.3.1 and ArcGIS Pro 3.4.0.)

        Speaker: Urszula Bronowicka-Mielniczuk (University of Life Sciences in Lublin/Department of Applied Mathematics and Computer Science)
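
The two divergence measures named in the abstract can be written down for discrete distributions as in the sketch below. The analyses themselves were performed in R and ArcGIS Pro, so this Python version is only an illustration. It assumes both inputs are probability vectors over the same bins, for example binned density estimates before and after gap-filling.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p, in nats.

    Requires q[i] > 0 wherever p[i] > 0. Unlike the Hellinger distance it is
    not symmetric, so the direction (original versus imputed) must be stated.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b <= 0:
                raise ValueError("q must be positive wherever p is positive")
            total += a * math.log(a / b)
    return total
```

Both measures equal zero when the distribution after imputation matches the original exactly.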
      • 10:20
        [05] Comparing existing methods to address the multiplicity of analysis strategies: A case study in pharmacoepidemiology 5m

        When addressing a particular research question using observational data, many decisions must be made during the conceptualization of the statistical analysis plan. This multiplicity of analysis strategies is a well-known problem that leads to high variation in research findings and associated low replicability, since each decision can lead to different results, even if each decision on its own was scientifically justifiable.

        Several approaches for reporting and visualizing this variation of effects have been proposed to address the consequences of the multiplicity of analysis strategies. For example, the social science literature advocates the multi-model approach of Young and Holsteen, which presents a preferred model estimate alongside results from other plausible models. This concept is similar to sensitivity analysis, which is often used in epidemiological research, but it mainly focuses on aspects of the main analysis model, such as variations in the adjustment variables or the definitions of the exposure and outcome variables. Similarly, the vibrations of effects framework of Patel is mainly concerned with changes in the effect measure estimate due to different adjustment sets and offers a neat visual representation in the form of volcano plots. Specification curves by Simonsohn et al. are an alternative suggestion to visualize the variation in results due to different decisions made at various stages of a statistical analysis, including the data pre-processing phase. Multiverse-style methods by Steegen et al., which have been developed within the data science community, also aim to neutrally present the results of “all” possible analysis decisions.

        We exemplify these methods for addressing multiplicity in analysis strategies using a pharmacoepidemiological case study. In particular, we aimed to compare the effects of clopidogrel, ticagrelor, and prasugrel in a cohort of 12 000 stented acute coronary syndrome (ACS) patients regarding time to first ACS-related readmission or death. The data were supplied by the Austrian Health Insurance Fund and included basic demographic information as well as information on all hospitalizations between 2019 and 2023, individual medical service codes and all drug redemptions in that time period.

        We present our preferred and alternative analysis strategies, and visualize and numerically quantify the variation of results according to the approaches mentioned above in addition to reflections regarding their implementation.

        Speaker: Moritz Pamminger (Medical University of Vienna, Center for Medical Data Science, Institute of Clinical Biometrics)
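
The common core of the multiverse and specification-curve approaches is to enumerate every combination of analysis decisions and then order the resulting estimates. A minimal sketch follows. The decision names are hypothetical, and `estimate` stands for a user-supplied function that fits the chosen model for one specification.

```python
from itertools import product


def enumerate_specifications(decisions):
    """Return one dict per combination of analysis decisions.

    `decisions` maps a decision name to its list of admissible options.
    """
    keys = list(decisions)
    combos = product(*(decisions[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def specification_curve(specs, estimate):
    """Pair each specification with its estimate, sorted by the estimate."""
    pairs = [(estimate(spec), spec) for spec in specs]
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical decision grid, for illustration only.
EXAMPLE_DECISIONS = {
    "adjustment_set": ["minimal", "extended"],
    "exposure_definition": ["first_redemption", "any_redemption"],
    "outcome_window_days": [180, 365],
}
```

With the example grid, `enumerate_specifications(EXAMPLE_DECISIONS)` yields eight specifications. The number grows multiplicatively with every added decision, which is why the choice of which decisions to vary needs justification.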
      • 10:25
        [06] Comparison of statistical methods for analysis of treatment effects on repeated patient reported outcomes in randomized clinical trials 5m

        Patient reported outcomes (PROs) are routinely used in randomized clinical trials (RCTs) to capture patients’ health status. Symptom-related PROs represent patients’ subjective perception of their health and are often collected multiple times during a clinical trial. For instance, in COPD, breathlessness or cough scores are captured using a small-range ordinal scale (0-4), representing breathing difficulty and coughing severity. With the possibility of collecting PROs on a daily basis with e-diaries in RCTs, it is important to evaluate the statistical methods used for analysis of treatment effects on PRO-related endpoints. The standard approach using linear mixed models for change from baseline in weekly averages of symptom scores neglects the ordinal structure of the scores and does not consider more sophisticated dynamics and heterogeneity between patients in short and long-term fluctuations that are usually observed in these data. The ordinal structure of the scores can be accounted for using a latent process model allowing for a nonlinear link function between the scores and their underlying process. The heterogeneity in the residual variance allowing for variability in patients’ short-term fluctuations can be analyzed using a location-scale latent process model in which the variance is expressed as a linear structure of a treatment covariate and a patient-specific random intercept. In this work, we compare various statistical methods for the analysis of ordinal scores using a simulation study and clinical trial data. We consider different assumptions regarding the variance of the residual errors, distribution of the scores, as well as using univariate scores or composite scores versus multivariate scores. The goal of this work is to contribute to the discussion on the trade-off between simplicity of the analysis used for the scores and accuracy as well as completeness in evaluating the treatment effect on the PROs from a longitudinal perspective.

        Speaker: Agnieszka Król (AstraZeneca)
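
One common way to write a location-scale latent process model for an ordinal score on the 0-4 scale is sketched below. All symbols are assumptions introduced here for illustration. In particular, the log link for the residual variance is one usual choice for a "linear structure", and the abstract does not state which parameterisation the authors use.

```latex
% Latent process for patient i at measurement time t_{ij}
\Lambda_i(t_{ij}) = X_{ij}^{\top}\beta + Z_{ij}^{\top} b_i,
\qquad b_i \sim N(0, B)

% Noisy latent score with patient-specific residual variance
\tilde{Y}_{ij} = \Lambda_i(t_{ij}) + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma_i^{2})

% Nonlinear (threshold) link to the observed ordinal score, k = 0, \dots, 4
Y_{ij} = k \iff \eta_k < \tilde{Y}_{ij} \le \eta_{k+1},
\qquad \eta_0 = -\infty,\ \eta_5 = +\infty

% Scale part: treatment covariate and patient-specific random intercept
\log \sigma_i^{2} = \tau_0 + \tau_1\,\mathrm{trt}_i + \omega_i,
\qquad \omega_i \sim N(0, \psi^{2})
```

Under this parameterisation, $\tau_1 \neq 0$ would indicate that treatment changes the size of short-term fluctuations, separately from any effect on the mean trajectory through $\beta$.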
      • 10:30
        [07] Designing an integrated longitudinal data platform for exercise-based management of patients with multiple chronic conditions 5m

        Background. Managing patients with multiple chronic conditions is a major challenge in modern health systems, particularly when exercise and lifestyle interventions are delivered in real-world settings. Robust statistical and machine-learning models require carefully designed data structures that capture the complexity of patients’ trajectories, comorbidities, and treatment exposures. In this abstract, we describe the design and implementation of a longitudinal data platform embedded in a real-world exercise rehabilitation center to support exercise-based management and advanced biostatistical and data-science research for this population.

        Methods. The platform integrates routine data from an exercise-oriented rehabilitation program in which assessments are collected to tailor and monitor exercise prescriptions for adults with diverse musculoskeletal, cardiometabolic, and other chronic conditions. We constructed an event-based, key–value data model with three linked components: (i) a profile table containing demographics, socioeconomic factors, work characteristics, and baseline medical history; (ii) a longitudinal timeline of clinical, sleep, pain, training, hydrotherapy, and referral events; and (iii) detailed body-composition measurements from bioimpedance analyzers. All components are linked through a unique user identifier and harmonised time stamps. We implemented systematic coding of comorbidities and treatments, parsing of complex fields, quality-control rules (range checks, internal consistency, soft-delete flags), and de-identification procedures. We then show how this structure can be reshaped into patient-level and time-indexed analytical datasets suitable for joint longitudinal models, multistate and survival analyses, and machine-learning pipelines.

        Results. The current registry comprises several thousand patients with tens of thousands of clinical and training events and a large subset with repeated body-composition assessments. The platform allows reconstruction of individual trajectories of pain, sleep quality, anthropometry, and exercise exposure, while retaining detailed information on multiple chronic conditions and referrals. We illustrate how derived features such as cumulative exercise dose, changes in visceral fat level, and multimorbidity indices can be obtained from the platform and prepared for future statistical modelling and prediction tasks.

        Conclusions. This work demonstrates how a purpose-built, event-based longitudinal data platform can bridge everyday clinical exercise practice and advanced biostatistical or machine-learning methods in the management of patients with multiple conditions. By explicitly designing the data structure around future modelling needs, the registry provides a scalable foundation for studies on prognosis, treatment response, and personalized exercise prescriptions.

        Speaker: Nasrin Salimian (Pardis Specialized Wellness Institute)
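
The reshaping step described in the Methods can be illustrated with a small sketch that turns an event-based key-value timeline into per-patient trajectories of cumulative exercise dose. The field names (`user_id`, `timestamp`, `key`, `value`, `deleted`) and the key `training_minutes` are hypothetical; the abstract does not give the actual schema.

```python
from collections import defaultdict


def cumulative_exercise_dose(events):
    """Build per-patient cumulative training minutes from timeline events.

    Each event is a dict with user_id, timestamp (ISO date string), key,
    value and an optional soft-delete flag `deleted`. Soft-deleted rows and
    events with other keys are ignored.
    """
    by_user = defaultdict(list)
    for event in events:
        if event["key"] == "training_minutes" and not event.get("deleted", False):
            by_user[event["user_id"]].append((event["timestamp"], float(event["value"])))

    trajectories = {}
    for user_id, rows in by_user.items():
        total = 0.0
        trajectory = []
        # ISO date strings sort chronologically when compared as text.
        for timestamp, minutes in sorted(rows):
            total += minutes
            trajectory.append((timestamp, total))
        trajectories[user_id] = trajectory
    return trajectories
```

The resulting time-indexed series could then be joined to the profile table through the same user identifier to form an analysis dataset.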
      • 10:35
        [08] Developing Statistical Power Estimation Models for Spatial Transcriptomics 5m

        Spatial transcriptomics (ST) is a methodological suite that facilitates the in situ, high-resolution measurement of the transcriptome across a designated tissue section. By integrating transcriptional data with spatial coordinates, ST techniques enable the elucidation of key biological phenomena, including cell-type-specific gene regulatory networks, the spatial patterning of cellular architecture, and the mechanisms governing intercellular communication. This capability provides transformative value for research in oncology, neurobiology, developmental biology, and immunology.
        Analyses of ST data typically focus on features of tissue organization such as cell type abundance, spatial co-localization, and neighborhood structure. For statistical methods that quantify changes in cell type proportions, and for methods that characterize spatial organization and local tissue microenvironments, sample size and statistical power calculations remain complicated.
        Recent work has begun to address this gap through simulation-based frameworks and non-parametric spatial resampling techniques designed to estimate power under realistic spatial constraints [1]. These strategies leverage generative models and synthetic ST data to approximate experimental variability and spatial heterogeneity. However, these approaches have not been systematically extended to analyses of differential cell type abundance or spatial co-localization. In addition, recent methodologies are computationally heavy because of bootstrapping, and P-splines cannot enforce 3D monotonicity.
        Here, we review emerging methods for statistical power estimation in ST and evaluate their relevance for studies focused on tissue composition and spatial organization [2]. Specifically, we aim to contribute to the development of a novel, theoretically-grounded methodology for power estimation that minimizes or eliminates the dependence on computationally intensive bootstrapping procedures. We discuss key conceptual considerations, methodological limitations, and opportunities for extending power analysis to abundance- and co-localization-based workflows, with the goal of guiding more robust experimental design and interpretation in future ST studies.

        Speaker: Miglė Gervytė (Institute of Data Science and Digital Technologies, Faculty of Mathematics and Informatics, Vilnius University)
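
For orientation, a simulation-based power estimate for a difference in cell type proportion between two groups of tissue samples might look like the sketch below. It is a deliberately simplified stand-in: cells are treated as independent binomial draws, spatial structure is ignored entirely, and the test is a normal-approximation comparison of per-sample proportions. None of this is the methodology reviewed or proposed in the abstract.

```python
import random
from statistics import NormalDist, fmean, variance


def simulate_power(p_control, p_case, n_samples, n_cells,
                   n_sim=500, alpha=0.05, seed=0):
    """Estimate power to detect a difference in a cell type's proportion.

    Each group has `n_samples` tissue samples with `n_cells` cells each.
    A simulated experiment rejects when the standardised difference of the
    mean per-sample proportions exceeds the two-sided normal critical value.
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    def sample_proportions(p):
        return [sum(rng.random() < p for _ in range(n_cells)) / n_cells
                for _ in range(n_samples)]

    rejections = 0
    for _ in range(n_sim):
        control = sample_proportions(p_control)
        case = sample_proportions(p_case)
        se = ((variance(control) + variance(case)) / n_samples) ** 0.5
        if se > 0 and abs(fmean(control) - fmean(case)) / se > z_crit:
            rejections += 1
    return rejections / n_sim
```

Even this toy version needs `n_sim * 2 * n_samples * n_cells` random draws, which hints at why simulation-based power analysis becomes expensive once realistic spatial models and resampling are involved.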
      • 10:40
        [09] Early molecular deviations as predictors of non-communicable diseases 5m

        Non-communicable diseases (NCDs) impose the largest global burden of morbidity, premature mortality, and healthcare expenditure. To shift from reactive to preventive care, early detection of pre-symptomatic molecular changes is essential. We propose a statistical framework for identifying the most sensitive and robust early molecular predictors of prevalent NCDs — including cardiovascular disease, cancer, chronic respiratory disorders, and diabetes — using longitudinal profiling of blood-based parameters. Many molecular variables in human blood are tightly regulated within narrow, person-specific ranges and react sensitively to health perturbations. Detecting early deviations from these individualized baselines may allow NCD identification years before symptom onset.
        To enable such early detection, a study design is required that captures both individualized molecular stability and subtle deviations over extended periods. A new cohort design (repeated baselines and a 10-year follow-up of 15,000 healthy individuals) provides accurate personal reference ranges and sufficient temporal resolution to capture subtle pre-diagnostic molecular changes. What makes the cohort design truly novel is the integration of standard clinical laboratory parameters with high-throughput Fourier-transform infrared spectroscopy, offering broad, sensitive, and cost-efficient coverage of molecular signatures responsive to early pathophysiological changes. This combined framework captures both established biochemical markers and fine-grained spectral features, enabling the identification of early molecular deviation patterns that may be shared across diseases as well as those specific to individual NCDs.
        Our approach focuses on identifying which molecular variables show the earliest, most consistent signals of deviation during the silent development of NCDs — even before diagnoses exist. To achieve this, we quantify molecular dynamics using linear mixed-effects models (LMM), modeling time as a continuous function, adjusting for relevant covariates to separate population-level trends, between-subject heterogeneity, and within-subject fluctuations. For each variable, standardized residuals are computed across all individuals to detect statistically significant and biologically relevant deviations from expected longitudinal trajectories. Variables with systematically elevated frequencies of such deviations are considered promising early indicators.
        This knowledge will lead to the construction of a targeted, cost-effective NCD screening panel and enable optimized acquisition protocols. Ultimately, linking early molecular deviations with eventual clinical diagnoses will allow development of a multi-parametric screening algorithm capable of stratifying risk across multiple NCDs before symptoms arise. This research outlines a pathway toward population-level molecular precision prevention, combining statistical rigor, personalized baselines, and comprehensive molecular phenotyping to discover the most powerful pre-symptomatic predictors of prevalent NCDs.

        Speaker: Zita Zarándy (Center for Molecular Fingerprinting, Semmelweis University)
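
The idea of flagging deviations from an individualized baseline can be illustrated with a much simpler stand-in than the linear mixed-effects models used in the abstract: standardise each follow-up value against the person's own repeated baseline measurements. The threshold of three standard deviations and the function name are assumptions. No covariate adjustment or time trend is modelled here.

```python
from statistics import fmean, stdev


def flag_deviations(baseline, follow_up, threshold=3.0):
    """Flag follow-up values that deviate from a person's own baseline.

    Returns one (value, z_score, flagged) tuple per follow-up measurement,
    where z_score is the distance from the personal baseline mean in units
    of the personal baseline standard deviation.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    centre = fmean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline measurements show no variability")
    results = []
    for value in follow_up:
        z_score = (value - centre) / spread
        results.append((value, z_score, abs(z_score) > threshold))
    return results
```

A mixed model would replace the per-person mean and standard deviation with model-based expected trajectories and residual variances, which allows information to be shared across individuals.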
      • 10:45
        [10] Ensuring Quality in Preclinical Research: The Importance of Being Human 5m

        Preclinical experiments form the empirical foundation of translational medicine by assessing the feasibility, safety, and efficacy of new therapeutic approaches. Yet, unlike the highly regulated standards of clinical trials, preclinical research often exhibits substantial methodological heterogeneity, leading to concerns about reproducibility, bias, and the robustness of conclusions. These challenges are further intensified by the emerging use of artificial intelligence (AI). While AI has the potential to increase efficiency and support data analysis, uncontrolled or poorly understood applications can amplify existing weaknesses in study quality. In this paper, we discuss the critical importance of human judgment, statistical rigor, and transparent study design for ensuring reliable preclinical evidence. We examine key dimensions of analytical quality across study design, data generation, and evaluation, and consider how AI - particularly large language models (LLMs) and foundation models - can support, rather than undermine, methodological integrity. From a biometric perspective, safeguards for uncertainty quantification, error control, and interpretability are essential when integrating AI into preclinical research. Ultimately, methodological progress depends on a careful balance between human expertise, statistical inference, and computational tools. Illustrative examples highlight both opportunities and pitfalls at the interface of biostatistics, data science, and translational medicine, emphasizing that maintaining human oversight is crucial for generating reproducible and interpretable evidence.

        Speaker: Tina Lang (Bayer AG)
      • 10:50
        [11] Exact conditional likelihood inference in meta-analysis for estimating risk ratio 5m

        Conventional two-stage procedures for binary-outcome meta-analysis use fixed plug-in estimates of within-study variances and depend heavily on large-sample normal approximations. These assumptions are often untenable and can lead to inaccurate inference, especially in sparse settings. Likelihood-based random-effects models, including the binomial–normal and the hypergeometric–normal (HGN) formulations, address several of these limitations but remain confined to the odds-ratio metric, which hampers clinical interpretability and prevents direct estimation of the risk ratio. To address this gap, we introduce a likelihood framework that extends the HGN model to permit direct inference on the risk ratio. The key innovation is a pseudo-observation augmentation strategy that maintains the conditional-likelihood properties of the HGN model while producing an unbiased estimating equation for the log–risk ratio. We further enhance finite-sample performance through jackknife-based adjustments for bias and variance. Real-world examples illustrate that the method provides transparent, interpretable, and computationally efficient inference for risk-ratio meta-analysis.

        Speaker: Hisashi Noma (The Institute of Statistical Mathematics)
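
For contrast with the proposed method, the conventional two-stage procedure that the abstract criticises can be sketched as follows: a plug-in log risk ratio and its large-sample variance per study, followed by inverse-variance pooling. This is the textbook common-effect version, not the hypergeometric-normal extension described above. The tuple format `(events_t, n_t, events_c, n_c)` is an assumption.

```python
import math


def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Plug-in log risk ratio and its large-sample variance for one study."""
    if events_t == 0 or events_c == 0:
        # Sparse data: the plug-in estimate does not exist without a
        # continuity correction, one weakness of the two-stage approach.
        raise ValueError("zero events: the plug-in log risk ratio is undefined")
    estimate = math.log((events_t / n_t) / (events_c / n_c))
    var = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c
    return estimate, var


def pooled_risk_ratio(studies):
    """Inverse-variance pooled risk ratio with a 95% Wald interval."""
    stats = [log_risk_ratio(*study) for study in studies]
    weights = [1 / var for _, var in stats]
    pooled = sum(w * est for w, (est, _) in zip(weights, stats)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    z = 1.959964  # 97.5% standard normal quantile
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)
```

The within-study variances are treated as known constants in the pooling step, which is the fixed plug-in assumption the abstract describes as often untenable.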
      • 10:55
        [12] Fraction-of-Time and robust periodic ARMA for improved analysis of digital health monitoring data 5m

        Physiological monitoring often generates data characterized by strong cyclostationarity (circadian rhythm) and sensor artifacts (irregular noise). Conventional models (e.g. ARIMA) often fail to capture the time-varying dependencies or conflate behavioral rhythms with noise. We propose a signal processing framework adapted for digital health data: the Fraction-of-Time (FOT) probability approach, alongside robust periodic ARMA (PARMA) modeling.
        In contrast with ensemble-based methods that average across populations, the FOT approach shifts the focus toward temporal occupancy, i.e. how much time a variable spends in a given range. It treats physiological variables as cyclostationary processes and defines probability through this temporal occupancy relative to the diurnal cycle. To capture the underlying dynamics, we utilize robust PARMA modeling. The method accounts for the periodic correlation structure of biological rhythms while mitigating the impact of impulsive sensor outliers.
        We demonstrate the feasibility of the FOT–robust PARMA approach through use cases related to digital health applications (e.g. circadian glucose patterns in diabetes management, heart rate variability monitoring with intermittent disruptions). The framework provides more stable representations of cyclical dynamics than standard models and retains clinically relevant observations that may be overlooked through heavy smoothing. Combining time-occupancy probabilities with robust periodic dynamics offers an alternative direction for extracting longitudinal information from digital health monitoring. This approach complements the existing biomedical modeling techniques where data irregularity and periodicity are dominant features.
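
        The temporal-occupancy idea can be illustrated with a small sketch. It estimates, for each phase of an assumed 24-sample cycle, the fraction of time a signal spends inside a target range. The synthetic signal, the range limits and the function name `fot_occupancy` are illustrative assumptions; this is a simplified empirical version and does not include the robust PARMA step.

```python
import math

def fot_occupancy(series, period, low, high):
    """Fraction of time the signal lies in [low, high], per phase of the cycle."""
    inside = [0] * period
    total = [0] * period
    for t, x in enumerate(series):
        phase = t % period
        total[phase] += 1
        inside[phase] += low <= x <= high   # bool counts as 0 or 1
    return [i / n if n else float("nan") for i, n in zip(inside, total)]

# Hypothetical hourly glucose-like signal over 14 days with a 24 h rhythm.
period = 24
series = [110 + 40 * math.sin(2 * math.pi * t / period) for t in range(14 * period)]
occupancy = fot_occupancy(series, period, low=70, high=140)
print([round(p, 2) for p in occupancy])
```

        Each entry is a time-based probability for one hour of the day, computed from a single subject's record and not by averaging across a population.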

        Speaker: Stanislaw Leskow (Warsaw School of Economics (SGH))
      • 11:00
        [13] Impact of Tumor-Assessment Schedules on the Projected Timing of PFS Analyses in Oncology Trials 5m

        Meaningful prediction of when the target number of events will be reached is essential for both sample-size determination and operational planning in event-driven clinical trials. In oncology studies, progression-free survival (PFS) based on RECIST assessments is one of the most commonly used endpoints. Tumor evaluations for determining progression are typically performed at pre-scheduled visits, often every 6–12 weeks depending on the specific trial design, disease setting, and patient burden considerations. Because progression can generally only be confirmed at these discrete assessment times, the time at which an event is observed is effectively interval-censored. As a consequence, critical design parameters such as treatment-effect estimates (e.g., hazard ratios) and the resulting statistical power may depend not only on the underlying disease dynamics but also on the particular assessment schedule chosen (Tanase et al., 2017).

        In this work, we illustrate how different tumor-assessment schedules influence not only point estimates and statistical inference but also the projected timing of reaching the required number of PFS events. Using realistic oncology trial settings, we show how varying assessment timings and frequencies can produce distinct patterns of event accrual.

        Overall, our results suggest that the choice of tumor-assessment schedule may play a role beyond logistical considerations, with potential statistical and operational implications. Accounting for this dependency in sample-size planning and event-accrual forecasting may help provide more realistic timelines.
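
        The dependence of event timing on the assessment schedule can be illustrated with a minimal simulation sketch. The abstract does not give simulation details, so everything here is an assumption: exponential progression times, uniform accrual, no dropout or death, and assessments counted from each patient's entry date.

```python
import math
import random

def observed_event_time(true_time, interval_weeks):
    """Progression is only detected at the next scheduled tumor assessment."""
    return math.ceil(true_time / interval_weeks) * interval_weeks

def weeks_to_target(n_patients, target_events, interval_weeks,
                    median_pfs_weeks=26.0, accrual_weeks=52.0, seed=1):
    """Calendar week at which the target number of PFS events has been observed."""
    rng = random.Random(seed)
    rate = math.log(2) / median_pfs_weeks           # exponential hazard
    calendar_times = []
    for _ in range(n_patients):
        entry = rng.uniform(0, accrual_weeks)        # uniform accrual
        true_pfs = rng.expovariate(rate)             # latent progression time
        calendar_times.append(entry + observed_event_time(true_pfs, interval_weeks))
    return sorted(calendar_times)[target_events - 1]

for interval in (6, 12):
    print(interval, round(weeks_to_target(300, 200, interval), 1))
```

        Because the same seed produces the same latent progression times for both schedules, any difference in the projected analysis time comes from the assessment interval alone.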

        Tanase et al. (2017). A proposal for progression-free survival assessment in patients with early progressive cancer. Anticancer Research, 37(10), 5851-5855.

        Speaker: Laura Kohlhas (Cogitars)
      • 11:05
        [14] Influence of documentation quality on the reproducibility of animal research 5m

        Standardized reporting guidelines have long been a response to inadequate study descriptions, starting with the CONSORT statement for clinical trials in the 1990s (Begg et al., 1996). For animal research, the ARRIVE guideline (Percie du Sert et al., 2020) is a major effort to improve transparency and methodological rigor. However, several analyses, including Lin et al. (2024), have shown that the implementation of the ARRIVE guideline remains incomplete. Inadequate or missing documentation particularly affects statistical aspects such as sample size planning, thereby impairing validity, reproducibility, and compliance with the 3Rs (Replacement, Reduction, Refinement). Despite these known deficiencies, a detailed evaluation of documentation quality in publications claiming adherence to ARRIVE is still lacking.
        This study addresses this gap by systematically assessing how comprehensively and consistently the ARRIVE Guideline 2.0 is implemented in published animal intervention studies, with a particular focus on the reporting and justification of sample size planning.
        Here we build upon the systematic review by Lin et al. (2024), which investigated the overall adherence of animal intervention studies to the ARRIVE Guideline 2.0. Based on publications selected from the Lin et al. collection, seventy-five peer-reviewed articles reporting interventional animal experiments with quantitative endpoints were analyzed in greater detail. The evaluation focused specifically on the completeness, transparency, and methodological consistency of reported parameters related to sample size planning. According to selected key criteria of the ARRIVE Guideline, information on effect size, power, significance level, statistical test, and justification of the sample size was systematically assessed.
        A significant proportion of the publications examined contains only incomplete or incomprehensible information on sample size planning, particularly regarding the justification of the selected sample size, the handling of dropouts, or the documentation of effect sizes. Deviations between the planned and actual statistical tests occur frequently without sufficient justification. Future simulation-based analyses are intended to explore how such deviations and incorrect or unverified assumptions could affect statistical power and increase the risk of type I or type II errors.
        The work highlights persistent gaps in the documentation of animal studies, even though comprehensive reporting standards such as the ARRIVE Guideline exist. The findings point to specific areas where improving the implementation of ARRIVE in sample size planning could enhance methodological rigor and promote the consistent application of the 3Rs principles.

        Literature
        Begg, C., et al., doi:10.1001/jama.276.8.637
        Percie du Sert, N., et al., doi:10.1371/journal.pbio.3000410
        Lin, Y., et al., doi:10.21037/cdt-24-413

        Speaker: Tomke Eiben (University of Applied Sciences and Arts, Hannover, Germany)
      • 11:10
        [15] Joint modelling of general and mental health using copula models: a simulation-based evaluation for COVID-19 health research. 5m

        COVID-19, the disease caused by the SARS-CoV-2 coronavirus, led to a global pandemic that began in December 2019. In the UK, government-mandated lockdowns were imposed to reduce the spread of the disease and understanding the impact of these actions on the general and mental health of the population has become of increasing interest to medical scientists. However, most analyses treat these outcomes separately, especially when they differ in measurement scales. The aim of this study was to model the general health score and mental health score of the population following COVID-19 infection.

        We propose and evaluate copula-based frameworks for jointly modelling correlated health outcomes with mixed data types. Using simulation studies, we assess model performance across three scenarios: continuous–continuous, binary–continuous and categorical–continuous. The simulations mimic pre-existing UK data from the ‘Next Step’ national longitudinal cohort, adjusted by results from the COVID-19 survey that was conducted during three waves of the pandemic. We analyse how general health score (continuous, binary or categorical) and mental health score (continuous) are influenced by common covariates such as sex, COVID-19, smoking status, alcohol consumption, exercising, long-lasting health conditions and vaccination status. Models are fitted using the GJRM package in R, employing Gaussian, Ali-Mikhail-Haq (AMH) and Plackett copulas with normal, log-normal and probit marginal distributions. To evaluate model performance in each scenario, bias, coverage probability, mean square error, and standard error were calculated.
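
        To make the binary–continuous scenario concrete, the following sketch draws a binary and a continuous outcome linked by a Gaussian copula. It is written in Python for illustration and is not the GJRM-based pipeline described above; the parameter values are hypothetical and covariates are omitted.

```python
import math
import random
from statistics import NormalDist

def simulate_binary_continuous(n, rho, p_good, mu, sigma, seed=0):
    """Draw (binary, continuous) pairs linked by a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1 - p_good)    # probit margin for the binary outcome
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)   # corr(z1, z2) = rho
        general_good = int(z1 > threshold)           # P(general_good = 1) = p_good
        mental_score = mu + sigma * z2               # normal margin
        pairs.append((general_good, mental_score))
    return pairs

pairs = simulate_binary_continuous(n=5000, rho=0.5, p_good=0.7, mu=50, sigma=10)
share_good = sum(g for g, _ in pairs) / len(pairs)
print(round(share_good, 3))
```

        Repeating such draws many times and refitting the model each time is what allows bias, coverage and mean square error to be estimated.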

        Our results showed that the way the outcomes were modelled in the different scenarios affected the accuracy of the copula models. Across all scenarios, data from the second and third waves were best represented by the copula models. Wave one behaved differently from the other waves, and across all models confidence-interval coverage was the most erratic measure, in some cases being very poor.

        Keywords: joint modelling, copula models, simulation study, COVID-19

        Speaker: Katarzyna Jagoda (University of Plymouth)
      • 11:15
        [16] MISSING OBSERVATIONS IN RESEARCH ON NITROGEN DIOXIDE CONTENT IN THE AIR 5m

        One of the most serious effects of globalisation and human activity on the environment is air pollution. Nitrogen dioxide is particularly harmful to human health. Monitoring its content in the air over a long period of time allows trends to be assessed, and appropriate measures to be taken to improve air quality. In Poland, the Chief Inspectorate for Environmental Protection and its regional branches are responsible for monitoring air pollution levels. Measurements are taken continuously, and the results are recorded as hourly averages. Unfortunately, the data collected is sometimes incomplete. The most common reasons for missing data are power outages, analyser damage, data transmission disruptions, maintenance, and calibration procedures. However, in many cases, missing data can be approximated.
        The MICE algorithm was used to impute the missing data. In the majority of cases analysed, the algorithm provided a good approximation of the actual nitrogen dioxide values. However, in some situations, the real values were underestimated.

        Speaker: Dorota Domagała (University of Life Sciences in Lublin)
      • 11:20
        [17] Machine Learning Imputation of Air Pollutant Concentrations Using XGBoost 5m

        This study evaluated the performance of the XGBoost method for imputing missing values in air quality data. The analysis used complete measurements of PM2.5, PM10, SO₂, NO, NO₂, and C6H6 recorded in Lublin in January 2020. To simulate missing data, 15%, 20%, and 25% of observations were randomly removed from each variable and imputed using XGBoost trained on the remaining data. Additionally, missing values were generated in a non-random manner to test the method under more challenging settings. The accuracy of imputations was assessed using the sum of absolute differences between observed and imputed values. Results show that XGBoost effectively reconstructs missing data under both random and non-random patterns, with minimal deviation from true measurements.
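
        The evaluation protocol described above (random masking followed by the sum of absolute differences) can be sketched as follows. To keep the example self-contained, the imputer is a plain linear interpolation standing in for XGBoost, and the data series is hypothetical.

```python
import random

def mask_at_random(values, fraction, seed=0):
    """Return a copy with `fraction` of entries set to None, plus the masked indices."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(values)), round(fraction * len(values))))
    masked = list(values)
    for i in idx:
        masked[i] = None
    return masked, idx

def interpolate(masked):
    """Stand-in imputer (NOT XGBoost): linear interpolation between observed neighbours."""
    known = [i for i, v in enumerate(masked) if v is not None]
    filled = list(masked)
    for i, v in enumerate(masked):
        if v is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None:
                filled[i] = masked[right]
            elif right is None:
                filled[i] = masked[left]
            else:
                w = (i - left) / (right - left)
                filled[i] = (1 - w) * masked[left] + w * masked[right]
    return filled

def sum_abs_diff(truth, imputed, idx):
    """Sum of absolute differences over the masked positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in idx)

# Hypothetical hourly pollutant series (one value per hour for 31 days).
truth = [20 + 10 * ((h % 24) / 24) for h in range(31 * 24)]
for fraction in (0.15, 0.20, 0.25):
    masked, idx = mask_at_random(truth, fraction)
    print(fraction, round(sum_abs_diff(truth, interpolate(masked), idx), 2))
```

        Replacing `interpolate` with a model-based imputer leaves the masking and scoring steps unchanged, so different methods can be compared on the same masked positions.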

        Speaker: Dorota Domagała (University of Life Sciences in Lublin)
      • 11:25
        [18] Modelling of alder pollen concentration 5m

        The present study investigates the spatial variability of alder (Alnus) pollen concentrations across different regions of Poland during the period 2001–2020. The primary objective was to identify and classify areas within the country that exhibit similar levels of alder pollen occurrence. The analytical results enabled the delineation of zones characterized by comparable concentrations of the examined pollen. The application of advanced statistical methods facilitated the detection of spatial distribution patterns, thereby allowing the identification of regions with homogeneous aeroallergen characteristics. Moreover, the obtained findings provide a basis for modeling alder pollen concentrations in relation to prevailing meteorological conditions.

        Speaker: Małgorzata Graczyk (Poznań University of Life Sciences)
      • 11:30
        [19] Optimal Grouping for Weibull Modeling of Fertilizer Granule Strength 5m

        This paper presents an analysis of the impact of data clustering on the accuracy of Weibull distribution parameter estimation in strength tests of mineral fertilizer granules. Two approaches are compared: traditional clustering into fixed-width intervals and optimal clustering, derived from a correctly constructed Fisher information matrix for clustered data. Maximum likelihood estimators for data clustered according to optimal limits are also developed. The study was conducted on actual measurement data for three commercial fertilizers. Model fit was assessed using the chi-square test, and the Asymptotic Relative Efficiency (ARE) was calculated to compare the precision of scale and shape parameter estimation. The results showed that optimal clustering consistently increased the efficiency of shape parameter estimation and improved the p-values in the goodness-of-fit tests. It was also demonstrated that the choice of clustering method influences the assessment of distribution skewness and, consequently, the interpretation of phenomena related to granule failure mechanics. The obtained results confirm that in analyses based on grouped data, the use of optimal intervals allows for reducing information loss and improving the quality of statistical inference.
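
        For readers unfamiliar with grouped-data estimation, the sketch below shows the multinomial log-likelihood of a Weibull model when only interval counts are observed. The interval boundaries and counts are hypothetical, and the crude grid search merely illustrates maximization; it is not the optimal-boundary construction discussed in the abstract.

```python
import math

def weibull_sf(x, shape, scale):
    """Survival function S(x) = exp(-(x / scale) ** shape)."""
    return 0.0 if math.isinf(x) else math.exp(-((x / scale) ** shape))

def grouped_loglik(boundaries, counts, shape, scale):
    """Log-likelihood for counts observed in the intervals [b_j, b_{j+1}).

    `boundaries` has len(counts) + 1 entries; the last may be math.inf.
    """
    ll = 0.0
    for lo, hi, n in zip(boundaries, boundaries[1:], counts):
        prob = weibull_sf(lo, shape, scale) - weibull_sf(hi, shape, scale)
        ll += n * math.log(prob)
    return ll

# Hypothetical granule crushing forces (N) grouped into fixed-width intervals.
boundaries = [0, 10, 20, 30, 40, math.inf]
counts = [5, 22, 41, 25, 7]

# Crude grid search for the maximum-likelihood estimates (illustration only).
best = max(
    (grouped_loglik(boundaries, counts, k / 10, s), k / 10, s)
    for k in range(10, 60) for s in range(15, 40)
)
print(f"shape ~ {best[1]:.1f}, scale ~ {best[2]}")
```

        Changing `boundaries` while keeping the underlying measurements fixed changes how much information about the shape and scale parameters the counts retain, which is the effect the abstract quantifies.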

        Speaker: Paweł Kurasiński (University of Life Sciences in Lublin)
      • 11:35
        [20] Optimal cutoff selection for U-shaped prognostic variable in survival data with competing risk 5m

        Selecting clinically meaningful cutoff values for continuous prognostic variables is challenging when the association with risk is U-shaped and competing risks are present. We propose a C-index-based method to estimate an optimal pair of cutoff values (c_1,c_2) by directly targeting discriminative accuracy. Our approach first fits a smoothing spline to the log relative hazard from the Fine and Gray (FG) and cause-specific hazard (CSH) models to confirm the U-shaped relationship. This model then generates candidate cutoff pairs at equal heights on either side of the nadir, partitioning patients into the central low-risk group and the high-risk tail groups. From these candidates, we select the optimal pair that maximizes the inverse probability of censoring weighted (IPCW) C-index. For comparison, we also evaluated Gönen–Heller’s concordance probability estimate (CPE), the minimum p-value (Min-P), and the percentile (Q1, Q3) methods. Monte Carlo simulations spanning symmetric, moderately asymmetric, and severely asymmetric U-shapes with 20% and 50% censoring rates show that the IPCW C-index using the FG model consistently achieves the best Bayesian information criterion (BIC) and small standard errors (SE), and the IPCW C-index using the CSH model is the second best. The percentile method is highly stable but modestly inferior by BIC. Min-P tends to select more variable, wider cutoff values, and CPE often yields extreme or unstable cutoff values. Overall, IPCW C-index-based cutoff value selection (particularly with the FG model) offers a stable, discriminative, and well-fitting strategy for risk stratification when the continuous prognostic variable exhibits a U-shaped relationship in survival data with competing risks. The proposed method is also illustrated with real kidney-transplant data.
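
        The candidate-generation step (cutoff pairs at equal heights on either side of the nadir) can be sketched as below. The curve is a hypothetical stand-in for the fitted spline, and the helper names are illustrative; scoring the candidates by the IPCW C-index is not shown.

```python
def bisect_root(f, lo, hi, tol=1e-8):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def equal_height_pairs(log_hazard, x_min, x_max, nadir, heights):
    """Candidate (c1, c2) pairs where the curve reaches the same height on both sides."""
    base = log_hazard(nadir)
    pairs = []
    for h in heights:
        def g(x, target=base + h):
            return log_hazard(x) - target
        if g(x_min) > 0 and g(x_max) > 0:      # height must be reached on both tails
            pairs.append((bisect_root(g, x_min, nadir), bisect_root(g, nadir, x_max)))
    return pairs

def curve(x):
    """Hypothetical asymmetric U-shaped log relative hazard with nadir at x = 25."""
    return 0.02 * (x - 25) ** 2 if x < 25 else 0.005 * (x - 25) ** 2

pairs = equal_height_pairs(curve, 10, 60, 25, heights=[0.5, 1.0, 2.0])
print([(round(c1, 2), round(c2, 2)) for c1, c2 in pairs])
```

        With an asymmetric curve the two cutoffs sit at different distances from the nadir, which a fixed percentile rule cannot reproduce.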

        Speaker: Sook-young Woo (Samsung Medical Center)
      • 11:45
        [22] Polygenic Risk Modeling in Periodontitis: Insights from a Case-Control Study of 4,243 European Cases and Current Limitations 5m

        Genetic susceptibility plays a particularly important role in early-onset (EO) and severe periodontitis (PD). The genetic risk remains largely unexplained, due to limited sample sizes and heterogeneous phenotypes in genome-wide association studies (GWAS). This study investigates whether current GWAS data can be used to construct a polygenic score (PGS) capturing genetic susceptibility to severe PD. A PGS was developed in a three-step design, using a German EO-III/IV-C-PD GWAS (n=692 cases, ≤35 years at diagnosis) as the base dataset, a Spanish EO-III/IV-C-PD GWAS (n=441 cases) to optimize the score, and as validation a Dutch EO-III/IV-C-PD GWAS (n=171 cases) and a German population-based GWAS with later-onset PD (SHIP, n=2,941 cases). The PGS showed a trend toward association with disease status in the Spanish sample, but not in the smaller Dutch or in the SHIP dataset. Case-control distributions overlapped substantially. Genetic correlation analyses revealed no strong overlap with other associated traits. Our results indicate that current PGS models have limited case-control discriminative ability for PD. Larger, harmonized studies are needed to enhance genetic risk prediction and clarify pleiotropic relationships.

        Speaker: Gesa Richter (Department of Periodontology, Oral Medicine and Oral Surgery, Charité – Universitätsmedizin Berlin, Berlin, Germany)
      • 11:50
        [23] Predictive multivariate models and ordination analyses reveal drivers of degradation and plant invasion in the Biała and Czarna Łada river valleys 5m

        The riverine ecosystems of the Biała and Czarna Łada river valleys are undergoing progressive degradation, accompanied by the spread of invasive plant species. To identify the factors driving these processes, we combined predictive modelling with multivariate analyses. To estimate the odds of habitat invasion and degradation, we used logistic regression and classification trees, which revealed that the dominant predictors were soil nutrient content (e.g., nitrogen, phosphorus), soil pH, and disturbed vegetation cover, characteristics that often result from direct human activity. We used principal component analysis (PCA) to identify key ecological gradients, enabling clearer distinction between habitat conditions associated with invasion and degradation. Furthermore, using non-metric multidimensional scaling (NMDS) based on Bray-Curtis dissimilarity, we revealed distinct changes in plant community composition between sites in the studied river valleys, reflecting differences in the degree of degradation and the intensity of alien species invasion.
        Despite its pilot nature, the research conducted can provide a solid basis for further research and protective measures aimed at preserving biodiversity and reducing the negative effects of anthropogenic pressure.
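
        As a reminder of what the NMDS ordination takes as input, the sketch below computes Bray-Curtis dissimilarities between sites. The abundance values and site labels are hypothetical; the ordination step itself is not shown.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both sites are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical species abundances at three sites (positions = species).
sites = {"A": [10, 0, 5, 3], "B": [8, 2, 5, 0], "C": [0, 12, 0, 9]}
names = sorted(sites)
matrix = {(p, q): bray_curtis(sites[p], sites[q]) for p in names for q in names}
for p in names:
    print(p, [round(matrix[(p, q)], 3) for q in names])
```

        Values near 0 indicate similar community composition and values near 1 indicate almost no shared abundance.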

        Speaker: Monika Różańska-Boczula (University of Life Sciences in Lublin)
      • 11:55
        [24] Preprocessing of MALDI-TOF mass spectrometry data for prediction of antimicrobial resistance via machine learning 5m

        Introduction
        Rapid results from antimicrobial susceptibility testing (AST) are essential to guide the antimicrobial therapy of critically ill patients. Recent developments have revealed that readily available matrix-assisted laser desorption-ionization-time of flight (MALDI-TOF) mass spectrometry data, which is routinely used for bacterial species identification, can also be used to predict antimicrobial resistance with machine learning (ML) techniques (1,2). The required preprocessing of mass spectral data typically follows the seminal work of Weis et al. (1), but numerous possible variants exist and their potential impact on the predictive performance of ML models remains to be studied.

        Objectives
        We systematically study various preprocessing strategies for mass spectral data with respect to
        a) their influence on the performance of different ML algorithms predicting antimicrobial resistance (AMR).
        b) their impact on model transferability to external data sets.
        The methods under study include different binning strategies (such as linear, exponential and dynamic binning) and parameters as well as peak alignment with respect to different reference sets.

        Materials & Methods
        We benchmark different ML pipelines using nested cross-validation on approximately 10,000 E. coli MALDI-TOF spectra from clinical routine at the University Hospital Münster. The pipelines include various preprocessing strategies and different learners, such as eXtreme Gradient Boosting, Random Forest, and Elastic Nets, as well as hyperparameter tuning. We evaluate and compare the performance of the resulting models for prediction of cefotaxime resistance in E. coli both on local and on publicly available external data.
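
        Of the preprocessing variants under study, linear binning is the simplest to illustrate. The sketch below sums intensities into fixed-width m/z bins; the mass range, the bin width and the function name `bin_spectrum` are assumptions chosen for illustration, not the settings of the benchmark.

```python
def bin_spectrum(mz, intensity, mz_min=2000.0, mz_max=20000.0, bin_width=3.0):
    """Sum intensities into fixed-width m/z bins; returns one feature per bin."""
    n_bins = int((mz_max - mz_min) / bin_width)
    features = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:                       # peaks outside the range are dropped
            features[min(int((m - mz_min) / bin_width), n_bins - 1)] += i
    return features

# Hypothetical peak list: m/z values (Da) and intensities.
features = bin_spectrum([2000.0, 2001.5, 2003.0, 25000.0], [1.0, 2.0, 4.0, 8.0])
print(len(features), features[:3])
```

        Every spectrum is mapped to a vector of the same length, which is what allows standard learners to be applied regardless of how many peaks a spectrum contains.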

        Results
        We present the results of our extensive benchmark experiment and make suggestions for the preprocessing of mass spectral data. We discuss the impact on predictive performance and generalizability of ML models.

        Summary
        Data preprocessing is an important step in developing ML models for AMR prediction from mass spectral data which requires specific methods. Our study helps to improve and robustify such models by identifying optimal preprocessing strategies.

        References:
        [1] Weis, C., Cuénod, A., Rieck, B. et al. Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nat Med 28, 164–174 (2022).
        [2] Wiesmann, N., Enders, D., Westendorf, A., Koch, R., and Schaumburg, F., Prediction of Antimicrobial Resistance from MALDI-TOF Mass Spectra Using Machine Learning: A Validation Study. Accepted for publication in J. Clin. Microbiol.

        Speaker: Dominic Enders (Institute of Biostatistics and Clinical Research, University of Münster)
      • 12:00
        [25] Realise D: compRehensive mEthodological and operational Approach to cLinical trIalS in ultra-rarE Diseases 5m

        RealiseD is a public-private partnership of almost 40 partners from academia, regulatory bodies, clinical research institutes and hospitals, patient organizations, pharmaceutical companies, methodologists, and European Research Infrastructures. RealiseD is part of the European Union’s Innovative Health Initiative and is funded jointly by the EU and industry. The project started officially in January last year and will run for five years.

        RealiseD is based on clinical use cases emerging from four European Reference Networks underlining the need for new innovative methodologies, especially in ultra-rare diseases.
        We will present an overview of the RealiseD project with a clear focus on the work packages “Innovative clinical trial designs” and “Innovative analysis and data use strategies”. Both cover strategies for using external information to improve decision making in trials of ultra-rare diseases.

        Speaker: Christoph Gerlinger (Bayer AG, Clinical Statistics and Analytics, Berlin and Department of Gynecology, Obstetrics and Reproductive Medicine, University Medical School of Saarland)
      • 12:05
        [26] Reanalysis and sensitivity analysis of overall survival and progression-free survival of pivotal randomized controlled trials in oncology 5m

        In oncology trials, tumour-based endpoints such as Progression-Free Survival (PFS), Disease-Free Survival (DFS) or Relapse-Free Survival (RFS) are widely accepted. However, their use is more controversial than that of Overall Survival (OS) because tumour assessment is subjective and results are sensitive to the type of primary analysis and the censoring rules employed, which makes these endpoints more prone to bias.
        Sensitivity analyses reassess study outcomes with varying methodologies, assumptions, and/or censoring rules. Therefore, they are used to evaluate how dependent the primary results are on methodological choices, assessing their robustness and the potential impact of bias.
        However, the choice of analyses to include in the study remains a subjective one.

        In our meta-study, we assess the reproducibility of the main studies that supported EMA authorisations of oncological medicines. We also aim to evaluate the robustness of their conclusions under different censoring rules and, where possible, to analyse the differences in results between the primary and the sensitivity analyses in order to identify possible trends in the choice of primary analysis.

        We identified all oncologic indications that received a positive opinion from the Committee for Medicinal Products for Human Use (CHMP) between January 2020 and December 2024. Using the European public assessment report (EPAR), we then identified the RCTs referred to as ‘main studies’ and randomly sampled 60 of them.
        For each sampled RCT, we requested the individual patient data (IPD), protocols, and information on how the study was conducted in order to reanalyse the two main efficacy endpoints (PFS and OS) and assess the reproducibility of their results. Both endpoints will be analysed according to the original primary analysis and then using different ways of handling censoring to evaluate the robustness of the study findings.
        Moreover, for each sampled RCT, we extracted publicly available information about the Hazard Ratios (HRs) and confidence intervals (CIs) for PFS and OS, as well as their sensitivity analyses.
        The median number of reported sensitivity analyses for PFS is 2, with a range of 0 to 12, while for OS, the median is 0, with a range of 0 to 6.
        We are also in the process of comparing the differences in HRs and CIs between the primary and sensitivity analyses for each endpoint and assessing the presence of possible trends in the choice of primary analysis.
        Studies of this type are necessary to improve the transparency and trustworthiness of RCT results.

        Speaker: Giulia Varvarà (University of Rennes)
      • 12:10
        [27] Social capital, depression, place of residence and periods before and after the COVID-19 pandemic: a logistic regression analysis 5m

        The aim of the study was to verify changes in the association between place of residence and depression prevalence before and after the onset of the COVID-19 pandemic. The second objective was to identify indicators of social capital as determinants of the prevalence of depression among people aged 50 years or older living in rural and urban areas.

        The analysis used data from two cross-sectional studies: the Polish part of COURAGE in Europe and COURAGE-CAD, conducted in 2011-2012 and 2024, respectively. It was based on a pooled dataset and included 2917 participants from the first study and 1802 from the second. Individuals aged 50 years or older were randomly selected from the general population living in Poland. The same methodology was applied in both studies. To generalize the study samples to the reference populations for 2011-2012 and 2024, the data were weighted according to the study design. In each survey, face-to-face computer-assisted interviews were conducted. The structured questionnaire included, among others, a scale measuring social support (OSLO-3 SSS), measures of social trust and of formal and informal social participation, and a social network index (COURAGE-SNI). It also incorporated an assessment of depression prevalence based on the DSM-IV algorithm for Major Depressive Disorder.

        The results of the logistic regression model with interaction indicated that the pandemic period (before vs after) moderates the association between place of residence and depression. After the pandemic, the odds of experiencing depression were higher in urban than in rural areas (OR=1.93, 95% CI=(1.04;3.59)). There were no statistically significant differences in the pre-pandemic period.
        Logistic regression models examining the association between indicators of social capital and depression over time, both before and after the pandemic in rural and urban areas, indicated that a higher level of social trust, formal participation and a higher level of social network were associated with lower odds of experiencing depression. Only in the subgroup of urban residents after the pandemic was the level of trust associated with higher odds of depression (OR=1.02, 95% CI=(>1.00;1.04), fully adjusted model).
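
        As a quick consistency check on the reported urban versus rural estimate, the standard error and Wald statistic can be recovered from the confidence interval. The sketch assumes the interval is a symmetric Wald interval on the log-odds scale with a critical value of 1.96, which the abstract does not state.

```python
import math

def wald_from_ci(or_point, lower, upper, z_crit=1.96):
    """Recover log(OR), its standard error and the Wald z statistic from a 95% CI."""
    log_or = math.log(or_point)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    return log_or, se, log_or / se

log_or, se, z = wald_from_ci(1.93, 1.04, 3.59)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

        The geometric mean of the interval limits is close to the reported point estimate, as expected for an interval that is symmetric on the log scale.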

        The results confirmed a significant beneficial effect of some social capital indicators on depression before the pandemic; in the post-pandemic period, however, the association was ambiguous. After the pandemic, the pattern of association between place of residence and prevalence of depression changed in favor of rural areas.

        Financing COURAGE-CAD: National Science Centre, Poland, OPUS23 grant UMO-2022/45/B/NZ7/04030.

        Speaker: Karolina Majdak (Jagiellonian University Medical College, Chair of Epidemiology and Preventive Medicine, Department of Medical Sociology)
      • 12:15
        [28] Survival of breast cancer patients treated at Tikur Anbessa specialized and Teaching Hospital, Addis Ababa Ethiopia. 5m

        Fatuma Hassen, Fikre Enquselassie, Ahmed Ali, Adamu Addissie, Girma Taye, Mathewos Assefa, Aster Tsegaye
        Abstract
        Purpose: Globally, breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related deaths in women. The purpose of this study was to determine the survival of breast cancer patients and associated factors.
        Methods: This study was conducted among breast cancer patients treated at the Oncology Center of Tikur Anbessa Hospital, Addis Ababa, Ethiopia. Clinical data were collected from patient files. Median age at diagnosis and interquartile range (IQR) were calculated. Based on life table analysis, one-, two-, three-, five-, and ten-year overall survival rates were calculated. Median survival estimates were obtained using the Kaplan-Meier method. Survival curves were compared using the log-rank statistic. Bivariate and multivariate analyses were performed using Cox's proportional hazards model.
        Results: Our study included a total of 402 patients followed over 10 years. Median age at diagnosis was 43.4 years (IQR 35-50). The median follow-up time was 58.3 months, and the total follow-up time was 22,998 person-months. By the end of follow-up, 233 (58%) of the patients had died. The one-, two-, three-, five-, and ten-year survival rates were 85, 75, 62, 50, and 34%, respectively. Based on multivariate Cox regression analysis, a more advanced stage at diagnosis (HR=3.84; 95% CI 2.00-7.35, P<0.001) and metastatic cancer (HR=1.79; 95% CI 1.13-2.83, P=0.012) were significantly associated with an increased risk of death.
        Conclusion: Our study indicated a relatively poor survival rate, which was associated with late-stage diagnosis and metastatic cancer. Strengthening public awareness and mass screening is needed to promote early detection and initiation of treatment and to reduce the proportion of advanced-stage breast cancer.
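
        For readers less familiar with the Kaplan-Meier method mentioned in the Methods, a minimal product-limit estimator is sketched below. The follow-up times and event indicators are hypothetical and are not the study data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time; events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # group ties at the same time
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk       # product-limit update
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical follow-up times in months.
times = [6, 12, 12, 20, 24, 30, 36, 48, 60, 60]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

        Censored patients leave the risk set without changing the survival estimate, which is how incomplete follow-up is accommodated.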

        Speaker: Fatuma Hassen (Department of Medical Laboratory Sciences, College of Health Sciences, Addis Ababa University, Addis Ababa, Ethiopia.)
      • 12:20
        [29] The assessment of construct validity using Item Response Theory models: An example based on the Online Social Capital Index (OSCI) for middle-aged and older people 5m

        The growing role of social network sites in building and maintaining social relationships, generating social support, and providing information implies the need to develop a tool for measuring online social networks, intended for use in population studies related to health and quality of life. Models based on Item Response Theory are widely used in test development and have proven advantages over classical test theory methods.

        The aim of the study was to create a simplified, easily implementable multidimensional instrument to assess all relevant elements of the structure and function of personal online social networks.

        Data were collected from a nationally representative cross-sectional survey of Polish adults aged 50 and older (N = 1,802) as part of the COURAGE-CAD study. Face-to-face CAPI interviews were conducted. Items measuring online social networks covered network characteristics, including quantitative dimensions (e.g. network size, structure, contact frequency), qualitative dimensions (e.g. emotional bond, social support, reciprocity) and alter members (e.g. family, friends). The type of social media used (e.g. social media sites, messaging applications, e-mail) was also distinguished.

        Dimensionality was explored using exploratory factor analysis with polychoric correlations and robust weighted least squares estimation (WLSMV) to identify latent structures of online social connectivity, and the resulting factor structure was subsequently confirmed in a 30% holdout sample using confirmatory factor analysis, evaluating model fit with standard indices including CFI, TLI, RMSEA, and SRMR. Additionally, Mokken analysis was employed to assess the robustness of the latent structure. To account for the ordinal nature of the items and capture multidimensional latent traits, Item Response Theory models, specifically graded response and generalized partial credit models, were employed to estimate item parameters such as difficulty and discrimination, while generating individual factor scores. Differential item functioning analyses were conducted to detect potential biases across age and gender. The final OSCI was constructed as factor scores scaled from 0 to 100. Validation analyses included correlations with measures of loneliness, quality of life and the Social Network Index (SNI). Analyses were conducted using Mplus and R.
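
        As a rough illustration of the graded response model mentioned above, the sketch below computes category probabilities for one ordinal item and rescales a set of scores to 0-100. The item parameters and scores are invented, and the min-max rescaling is only an assumption about how a 0-100 index could be formed; the abstract does not state the scaling used.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under Samejima's graded response model.

    theta      : latent trait value
    a          : item discrimination
    thresholds : increasing difficulty thresholds b_1 < ... < b_{K-1}
    """
    # Cumulative probabilities P(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def rescale_0_100(scores):
    """Linear min-max rescaling (an assumed choice, not the authors' stated one)."""
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


# Hypothetical item with four response categories.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.2])
scaled = rescale_0_100([-2.1, -0.3, 0.0, 1.7])
```

        A larger discrimination `a` makes the category probabilities change more sharply around each threshold.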

        The resulting index provides a validated, multidimensional measure of an individual’s online social network, capable of distinguishing between maintaining offline social networks and building new ones. The tool is suitable for population research, public health monitoring, and automated implementation in online surveys.

        This research is funded in whole by the National Science Centre, Poland, OPUS23 grant UMO-2022/45/B/NZ7/04030.

        Speaker: Michalina Gajdzica (Jagiellonian University - Medical College, Chair of Epidemiology and Preventive Medicine)
      • 12:25
        [30] Therapeutic Intervention Effects in Children with Autism Spectrum Disorder: A Pre–Post Statistical Analysis Using Respondent-Driven Sampling (RDS) 5m

        Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by difficulties in social interaction, communication, and restricted and repetitive behavior patterns. In Brazil, according to the Instituto Brasileiro de Geografia e Estatística (IBGE, 2022), approximately 2.4 million individuals have been diagnosed, and this number may reach six million when unidentified cases are considered. This study is a continuation of an investigation published in 2025, in which behavioral improvements resulting from therapeutic interactions were previously demonstrated. In this new phase, an age-stratified analysis was incorporated to examine differences in the pace and magnitude of therapeutic evolution. This approach makes it possible to investigate the impact of early intervention by comparing progress across different age groups and contributes to the understanding of periods of greater neural plasticity. The main objective was to statistically quantify the effects of specific therapeutic interventions on the development of children with ASD. The sample was selected using Respondent-Driven Sampling (RDS), a technique suitable for hard-to-reach populations that increases representativeness and reduces selection bias. Behavioral information was provided by legal guardians through a questionnaire composed of 16 evaluative items on an ordinal scale from 0 to 3, applied at two time points: pre- and post-intervention. Given that the responses were categorical and dependent, McNemar’s test was used to identify significant changes in behavioral patterns. In addition, a Net Promoter Score (NPS) assessment was conducted to measure the perceived level of satisfaction among legal guardians and therapists, providing complementary insight into the perceived impact of the interventions. 
        The results showed statistically significant improvements in several behavioral domains, confirming the effectiveness of the therapeutic practices applied, with indications of greater progress among younger age groups. These findings reinforce the importance of public policies that expand access to evidence-based therapies, especially when initiated early, and highlight the value of robust statistical analyses for measuring clinical outcomes.
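
        The paired pre/post comparison can be sketched with an exact McNemar test. The sketch assumes each ordinal item has first been dichotomized (for example, behavior present vs. absent), which is an assumption made here for illustration; the responses below are invented.

```python
from math import comb


def mcnemar_exact(pre, post):
    """Exact two-sided McNemar test for paired binary responses.

    Returns (b, c, p_value), where b counts 0 -> 1 changes and
    c counts 1 -> 0 changes between the two time points.
    """
    b = sum(1 for x, y in zip(pre, post) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(pre, post) if x == 1 and y == 0)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of change
    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical dichotomized item for 12 children (1 = skill shown).
pre = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
post = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0]
b, c, p_value = mcnemar_exact(pre, post)
```

        Only the discordant pairs (children whose response changed) contribute to the test; concordant pairs carry no information about change.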

        Speaker: Syntia Souza (UFRPE)
      • 12:35
        [32] Twenty-Five Years of Oncology Clinical Trials: Trends from ClinicalTrials.gov (2000–2025) 5m

        Background:
        Oncology trials remain the largest sector of global drug development. However, their complexity, resource demands, and modest success rates underscore the need for more efficient, patient-centred, and methodologically innovative designs.
        Objective:
        To provide a longitudinal assessment of interventional oncology trial characteristics and design trends from 2000 through 2025 using the Aggregate Analysis of ClinicalTrials.gov (AACT) database.
        Methods:
        Drug and biologic interventional oncology trials initiated between January 1, 2000, and September 30, 2025, were analysed using a static version of the AACT database. Key features including trial phase, design type, sponsor, primary endpoints, and cancer indication were summarised by initiation year and trial status.
        Results:
        Among 62,703 interventional oncology trials identified, most were single-arm (48%) and early phase (I–II, 78%), with non-industry sponsors (67%). The use of surrogate endpoints (e.g. PFS, ORR) rose more than four-fold from 2000 to 2024, while adoption of master protocols and adaptive designs increased modestly in the past decade. Completed trials showed shorter enrolment periods and durations over time, and result reporting improved, though timely disclosure (≤ 12 months) remained uncommon. Ongoing trials most frequently studied gastric cancer, breast cancer, leukaemia, non–small cell lung cancer, and non-Hodgkin lymphoma.
        Conclusions:
        Oncology clinical trial activity expanded substantially over the past 25 years, with incremental adoption of innovative designs and surrogate endpoints. Despite improved reporting, timeliness and integration of patient-reported outcomes remain limited. These findings highlight ongoing gaps between methodological advancement and implementation, reinforcing the need for a more efficient, transparent, and patient-focused oncology trial enterprise.

        Speaker: Beth McDougall (Department of Statistics, Phastar)
      • 12:40
        [33] Win statistics in design and analysis of non-inferiority trials 5m

        The win ratio statistic has gained prominence as an interpretable method for analyzing composite endpoints in clinical trials, typically with a superiority objective. The use of the win ratio requires simulation to estimate the necessary sample size (1). Adapting win statistics to non-inferiority trials and incorporating covariate adjustment remain unresolved methodological challenges (2).

        We use the hierarchical endpoints from a non-inferiority trial in the field of neurology, evaluating de-prescription of anti-seizure treatment, with the hierarchy: (i) survival, (ii) functional outcome, (iii) freedom from unprovoked seizures, and (iv) quality of life.

        In non-inferiority trials, a substantial proportion of ties is expected and should ideally inform the treatment effect, as proposed in the win odds statistic (3). Varying the proportion of ties, we compare power to detect non-inferiority between the win ratio, win odds, and alternative U-statistic kernels (4). Ties are induced by coarsening outcomes 2 and 4. Power is plotted against the non-inferiority margin, described in terms of the minimum accepted probability of the experimental treatment winning. Furthermore, using the full hierarchical endpoint, we compare the power of unstratified and stratified win statistics. As a reference, we use only a single continuous outcome to compare the stratified win statistics to a standard linear regression model with the stratification variable as a covariate. Simulations follow the ADEMP framework (5).
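
        The difference between the win ratio and the win odds in how they treat ties can be seen in a small sketch. It assumes fully observed outcomes (no censoring) coded so that higher is better at every level of the hierarchy; the patient tuples are invented, and the function names are hypothetical.

```python
import math


def compare(treated, control):
    """Walk down the hierarchy; the first level that differs decides the pair."""
    for x, y in zip(treated, control):
        if x > y:
            return 1   # win for the treated patient
        if x < y:
            return -1  # loss for the treated patient
    return 0           # tie on every level


def win_statistics(treated_group, control_group):
    wins = losses = ties = 0
    for t in treated_group:
        for c in control_group:
            result = compare(t, c)
            wins += result == 1
            losses += result == -1
            ties += result == 0
    total = wins + losses + ties
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        # The win ratio ignores ties entirely.
        "win_ratio": wins / losses if losses else math.inf,
        # The win odds split each tie evenly between the arms.
        "win_odds": (wins + 0.5 * ties) / (losses + 0.5 * ties),
        "net_benefit": (wins - losses) / total,
    }


# (survival, functional outcome, seizure freedom, quality of life)
treated = [(1, 2, 1, 70), (1, 1, 1, 60), (0, 0, 0, 0)]
control = [(1, 2, 1, 70), (1, 1, 0, 65), (1, 1, 1, 60)]
stats = win_statistics(treated, control)
```

        As the share of tied pairs grows, the win odds move toward 1 while the win ratio is unchanged, which is the behaviour the comparison above is designed to probe.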

        Our simulation results quantify how the win ratio, win odds, and related U-statistics perform as the proportion of ties increases and how the stratification affects analysis results. The findings inform the design of future non-inferiority trials and clarify when stratification-based approaches can improve efficiency.

        References
        1. Kronthaler D, Schwenkglenks M, Beuschlein F, Held U. The win ratio at the design stage of clinical trials. 2025;
        2. Pocock SJ, Gregson J, Collier TJ, Ferreira JP, Stone GW. The win ratio in cardiology trials: Lessons learnt, new developments, and wise future use. European Heart Journal. 2024;45(44).
        3. Dong G, Hoaglin DC, Qiu J, Matsouaka RA, Chang YW, Wang J, et al. The win ratio: On interpretation and handling of ties. Statistics in Biopharmaceutical Research. 2020;12(1).
        4. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Statistics in Medicine. 2010;29.
        5. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019;38.

        Speaker: Johanna Ledoux (University of Zurich, Department of Biostatistics)
    • 10:45 12:15
      Censored data 1 Room 13 B

      Room 13 B

      Convener: Caroline Dietrich (Karolinska Institutet)
      • 10:45
        Overview of methods for planning studies with multiple time to event endpoints 18m

        Introduction:
        In clinical trials, time-to-event endpoints such as time to death, time to hospitalization, or time to myocardial infarction are often of primary interest. Although multiple events might be observed per individual, only the time to the first occurring event is considered in the primary analysis. One reason for this could be that guidelines recommend analyzing the data using the same method that was used to calculate the sample size. In most cases, sample size calculation in time-to-event settings is based on methods where only the time to first event is considered. However, there can be a gain in power if all events that might occur per patient are analyzed. Hence, sample size calculation for multiple time-to-event endpoints is of interest. The associated methods may not be as familiar to applied researchers. We therefore aim to give an overview of methods that were described in the literature for planning a study where multiple event times are of interest. Advantages and disadvantages of the methods are described. We further aim to illustrate the methods using data from a cardiovascular trial.

        Methods:
        We started a first literature search in PubMed, Embase, the Cochrane Library, and Google Scholar with the following search terms: “multiple events”, “sample size calculation”, “recurrent events”, “composite endpoints”, “power calculation”, and “multiple time-to-event”.

        Results:
        So far 29 publications were screened and 11 selected for further review. Different approaches are described: methods for recurrent events using time to event approaches (e.g. Andersen-Gill model) or event counts (e.g. negative binomial model), approaches using additive models or frailty models, and methods for the win ratio. Furthermore, approaches that consider adaptive designs are also described. Some of the methods found involve quite complex formulas which are not easy to implement. Advantages and disadvantages depend on the primary research aim.

        Discussion:
        Planning a clinical trial with multiple time-to-event endpoints can be difficult. Therefore, we try to give a comprehensive overview of methods that were described in the literature for planning such studies. Since quite different approaches were found that can also lead to different sample sizes, researchers should be very clear about their primary study aim in order to select the appropriate method, e.g. for sample size calculation.

        Speaker: Ann-kathrin Ozga (Institute for Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf)
      • 11:03
        The impact of the number and the size of clusters on prediction performance of the stratified and the conditional shared gamma frailty Cox proportional hazards models 18m

        Researchers in biomedical research often analyse data that are subject to clustering. Independence among observations is generally assumed when developing and validating risk prediction models. For survival outcomes, the Cox proportional hazards regression model is commonly used to estimate an individual’s risk at fixed time horizons. The stratified Cox proportional hazards and the shared gamma frailty Cox proportional hazards regression models are two common approaches to account for the presence of clustering in the data. The accuracy of the predictions of these two approaches has not been examined. We conducted a set of Monte Carlo simulations to assess the impact of the number of clusters, the size of the clusters, and the within-cluster correlation in outcomes on the accuracy of the conditional predictions developed using the stratified and the shared gamma frailty Cox proportional hazards regression models. We compared the accuracy of the predictions in terms of discrimination, calibration and overall performance metrics. We found that the stratified and the shared gamma frailty models provided similar performance, especially with larger and more numerous clusters. For small cluster sizes, we observed slightly better discrimination and overall performance for the stratified model and better calibration for the shared gamma frailty model, especially at shorter prediction horizons. However, the practical applicability of the stratified Cox proportional hazards model for estimating predictions is limited, especially for high within-cluster correlation and small clusters, and more so at longer prediction horizons.
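
        One way such clustered survival data could be generated is sketched below. It is not the authors' simulation design: the exponential baseline hazard, the single standard-normal covariate, the administrative censoring time and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def simulate_clustered_survival(n_clusters, cluster_size, theta,
                                beta=0.5, base_rate=0.1,
                                censor_time=10.0, seed=1):
    """Simulate (cluster, covariate, time, status) rows with a shared gamma frailty.

    theta is the frailty variance; the frailty has mean 1. For a shared
    gamma frailty, Kendall's tau within clusters equals theta / (theta + 2).
    """
    rng = random.Random(seed)
    rows = []
    for cluster in range(n_clusters):
        # Gamma(shape = 1/theta, scale = theta): mean 1, variance theta.
        frailty = rng.gammavariate(1.0 / theta, theta)
        for _ in range(cluster_size):
            x = rng.gauss(0.0, 1.0)
            # Conditional hazard: base_rate * frailty * exp(beta * x).
            rate = base_rate * frailty * math.exp(beta * x)
            event_time = rng.expovariate(rate)
            observed = min(event_time, censor_time)
            status = int(event_time <= censor_time)  # 1 = event, 0 = censored
            rows.append((cluster, x, observed, status))
    return rows


rows = simulate_clustered_survival(n_clusters=20, cluster_size=5, theta=0.5)
```

        Varying `n_clusters`, `cluster_size` and `theta` corresponds to the three factors the abstract studies: number of clusters, cluster size and within-cluster correlation.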

        Speaker: Daniele Giardiello (Bicocca Bioinformatics Biostatistics and Bioimaging B4 Center, School of Medicine and Surgery, University of Milano-Bicocca)
      • 11:21
        Evaluating Statistical Methods for Multiple Time-to-Event Endpoints: A Simulation Study on Recurrent and Competing Events 18m

        Accurate analysis of multiple time-to-event endpoints is a persistent challenge in clinical research, where patients may experience several recurrent non-fatal events alongside a competing fatal event. Conventional survival analysis approaches, such as time-to-first-event analyses or the Cox proportional hazards model, often neglect recurrent events or assume independence between event types, which can lead to biased estimates and a loss of clinical information. Although several methods have been proposed to address these issues, including extensions of Cox models, frailty-based models, multistate approaches, and composite endpoint frameworks, their relative performance in realistic data settings remains insufficiently understood.

        In this simulation study, we aim to systematically evaluate and compare statistical methods for analyzing multiple time-to-event endpoints in the presence of recurrent and competing events. We focus on Cox-based approaches (e.g., Andersen–Gill and Prentice–Williams–Peterson models), methods that account for ordered endpoints (e.g., Win Ratio and Win Odds), and weight-based composite endpoint methods (e.g., the Wei–Lachin approach and weighted all-cause hazard ratio). We generate synthetic individual patient data under a range of clinically motivated scenarios that vary in event rates, degree of dependence between events, treatment effects, and censoring levels. Event times are simulated using flexible models that allow control over recurrence intensity and dependence between non-fatal and fatal events.

        Each method will be assessed according to its ability to recover true treatment effects under different conditions. Key performance metrics will include bias, empirical power, coverage probability, and mean squared error. We will also evaluate robustness under deviations from key assumptions, such as proportional hazards or event independence. The simulation design follows established recommendations for transparent and reproducible simulation studies (Burton et al., 2006; Morris et al., 2019).

        Preliminary findings suggest that commonly used Cox-based approaches perform well in simple scenarios but tend to underestimate treatment effects when strong dependencies exist between recurrent and fatal events. More flexible modeling strategies that explicitly represent such dependencies are expected to yield more accurate and interpretable estimates.

        This study will provide a comprehensive and transparent evaluation of current methods for analyzing multiple time-to-event endpoints and identify areas where methodological development is needed. The results will support researchers in choosing appropriate analytical approaches and contribute to improving the planning and interpretation of clinical studies that involve recurrent and competing events.

        Speaker: Duoerkongjiang Alidan (Institute of Medical Biometry and Epidemiology; University Medical Center Hamburg-Eppendorf (UKE))
      • 11:39
        Prognostic Models for Recurrent Event Data 18m

        Dr Victoria Watson1,2, Prof Catrin Tudur Smith2, Dr Laura Bonnett2
        1 Phastar, London, UK
        2 University of Liverpool, Department of Health Data Sciences

        Background / Introduction
        Prognostic models predict outcome for people with an underlying medical condition. Many conditions are typified by recurrent events such as seizures in epilepsy. Prognostic models for recurrent events can be utilised to predict individual patient risk of disease recurrence or outcome at certain time points.
        Methods for analysing recurrent event data are not widely known or applied in research. Most analyses use survival analysis to consider time until the first event, meaning subsequent events are not analysed and key information is lost. An alternative is to analyse the event count using Poisson or Negative Binomial regression. However, this ignores the timing of events. Recurrent event methods analyse both the event count and the timing between events meaning key information is not discarded. Various methods to analyse recurrent event data exist, but evidence is lacking regarding which recurrent method is most appropriate under different scenarios.

        Methods
        A systematic review on methodology for analysing recurrent event data in prognostic models was conducted. Results from this review identified methods commonly used in practice to analyse recurrent event data. A simulation study was then conducted which evaluated the most frequently identified methods in the systematic review with respect to the underlying event rate. The event rates were categorised into low, medium and high based on data collected in the systematic review to best represent a variety of chronic conditions or illnesses where recurrent events are typically seen.

        Results
        The simulation study provided evidence to determine if model choice may be influenced by the underlying event rate in the data. This was assessed by deriving statistics suitable for recurrent event methods to assess the model fit and predictive performance of the recurrent event methods. These statistics were used to determine if certain methods identified tended to perform better than others under different scenarios.

        Conclusion
        Results from the systematic review and simulation study will be presented including a summary of each method identified. The results will be the first step towards a toolkit for future analysis of recurrent event data, providing evidence which recurrent event analysis method may be better suited given the data being modelled.

        Speaker: Victoria Watson (Phastar)
      • 11:57
        Unsure about the Markov assumption? A comparison of transition probability estimators in multi-state models 18m

        Various estimators for modelling the transition probabilities in multi-state models have been proposed, e.g., the Aalen-Johansen estimator, the landmark Aalen-Johansen estimator, and a hybrid Aalen-Johansen estimator. While the Aalen-Johansen estimator is generally only consistent under the rather restrictive Markov assumption, the landmark Aalen-Johansen estimator can handle non-Markov multi-state models. However, the landmark Aalen-Johansen estimator leads to a strict data reduction and, thus, to an increased variance. The hybrid Aalen-Johansen estimator serves as a compromise by, firstly, checking with a log-rank-based test whether the Markov assumption is satisfied. Secondly, landmarking is only applied if the Markov assumption is rejected.
        In this work, we propose a new hybrid Aalen-Johansen estimator which uses a Cox model instead of the log-rank-based test to check the Markov assumption in the first step. Furthermore, we compare the four estimators in an extensive simulation study across Markov, semi-Markov, and distinct non-Markov settings. In order to get deep insights into the performance of the estimators, we consider four different measures: bias, variance, root mean squared error, and coverage rate. Additionally, further influential factors on the estimators such as the form and degree of non-Markov behaviour, the different transitions, and the starting time are analysed. The main result of the simulation study is that the hybrid Aalen-Johansen estimators yield favourable results across various measures and settings.

        Speaker: Merle Munko (Otto-von-Guericke University Magdeburg)
    • 10:45 12:15
      High dimensional data 1 Room 14

      Room 14

      Convener: Bart Mertens (Leiden University Medical Centre)
      • 10:45
        Unraveling Aging in the CSF Proteome: A Systematic Comparison of Variable-Selection Methods for Protein Risk Score Modeling 18m

        Aging is the dominant risk factor for neurodegenerative and systemic diseases, yet its molecular signatures remain obscured within high-dimensional, noisy, and strongly correlated proteomes. To address this challenge, we introduce the Protein Risk Score (ProtRS) framework—a systematic evaluation framework for ProtRS modeling that assesses how different multivariate approaches extract age-associated signals from cerebrospinal fluid (CSF) proteomics.

        Using the Emory CSF cohort (n = 504, 2,067 proteins), we applied a stringent normalization pipeline that minimizes technical artifacts while preserving biological structure. Four variable-selection strategies were evaluated for predicting chronological age: univariate regression, LASSO, Elastic Net, and Bayesian regression with a regularized horseshoe (RHS) prior. Elastic Net achieved the highest predictive accuracy (Pearson correlation r = 0.73), effectively leveraging correlated protein clusters. RHS performed comparably (r = 0.72) while offering superior parsimony and principled uncertainty quantification. In contrast, LASSO (r = 0.68) and univariate screening (r < 0.60) underperformed, largely due to their inability to model joint proteomic structure.

        Biological fidelity was evaluated using overlap with established aging-associated markers from the UK Biobank plasma aging clock. Elastic Net and RHS recovered the largest subset of known aging-related proteins, demonstrating strong pathway relevance and improved interpretability.

        To generalize these findings beyond a single dataset, we developed a realistic proteomic simulator capable of replicating empirical correlation structures or generating synthetic ones with controlled sparsity, dimensionality, and multicollinearity. A comprehensive evaluation across 561 high-dimensional scenarios revealed consistent trends: sample size is the primary determinant of predictive performance, while extreme p ≫ n regimes and strong correlations challenge all methods. Across settings, Elastic Net and RHS demonstrated notable resilience—maintaining strong predictive accuracy, stable support recovery, and robustness under multicollinearity. RHS achieved the best false-discovery control, whereas Elastic Net provided substantial computational efficiency. Both consistently outperformed LASSO and univariate approaches in support recovery and predictive stability.

        Together, these empirical and simulation-based results form the building blocks of ProtRS and underscore a central principle: accurate proteomic aging models require multivariate regularization that embraces—rather than suppresses—the inherent correlation structure of the proteome. As a systematic evaluation framework for ProtRS modeling, ProtRS offers a scalable, interpretable foundation for constructing precision aging clocks, identifying mechanistic pathways, and integrating proteomics into multimodal aging research. By bridging statistical rigor with biological insight, ProtRS advances the development of next-generation diagnostics and interventions for age-related disease.

        Speaker: Sathish Ravindranth (TU Dortmund University)
      • 11:03
        Detecting gene-environment interactions to guide personalized intervention: boosting distributional regression for polygenic scores 18m

        Polygenic risk scores can be used to model the individual genetic liability for human traits. Current methods primarily focus on modeling the mean of a phenotype, neglecting the variance. However, genetic variants associated with phenotypic variance can provide important insights for gene-environment interaction studies. To overcome this, we propose snpboostlss, a cyclical gradient boosting algorithm for a Gaussian location-scale model to jointly derive sparse polygenic models for both the mean and the variance of a quantitative phenotype. To improve computational efficiency on high-dimensional and large-scale genotype data (large n and large p), we only consider a batch of the most relevant variants in each boosting step. We investigate the effect of statin therapy (the environmental factor) on low-density lipoprotein in the UK Biobank cohort using the new snpboostlss algorithm. We are able to verify the interaction between statin use and the polygenic risk scores for phenotypic variance in both cross-sectional and longitudinal analyses. In particular, following the spirit of target trial emulation, we observe that the treatment effect of statins is more substantial in people with higher polygenic risk scores for phenotypic variance, indicating gene-environment interaction. When applied to body mass index, the newly constructed polygenic risk scores for variance show significant interaction with physical activity and sedentary behavior. Therefore, the polygenic risk scores for phenotypic variance derived by snpboostlss have the potential to identify individuals who could benefit more from environmental changes (e.g. medical intervention and lifestyle changes).

        Speaker: Qiong Wu (University of Marburg)
      • 11:21
        Regularized Multi-Omics Regression Modelling for Transcriptomic–Proteomic Integration in Mice with induced liver Damage. 18m

        Toxicological compounds exert complex effects on tissues and organisms, which can be investigated using genomic, transcriptomic, and proteomic data. A central challenge lies in understanding the relationship between RNA and protein levels. While these are expected to be positively correlated, transcriptomic variation often explains only a small fraction of protein abundance. This discrepancy motivates the use of advanced regression approaches to improve prediction. Previous work analysed transcriptomic and proteomic data from CCl₄ exposed mice, using regression and DiPa plots to group genes and enhance predictive accuracy. Although mRNA–protein correlations are frequently studied on a genome wide scale, individual mRNA–protein relationships are rarely explored in a regression modeling context that incorporates additional omics data.

        Building on earlier analyses, this study aims to reapply and expand an established regression pipeline to new mouse datasets. Specifically, we seek to predict protein intensities on a new, large proteome-wide scale under different treatments, evaluate cross-dataset comparability, and assess the benefits of combining datasets to increase sample size and improve predictive performance.

        The new dataset comprises 18 mice divided into three groups (control, BDL treatment, and BDL ABSTi remedy), with approximately 1,436 matched gene–protein pairs. A stringent normalization pipeline was applied to minimize technical artifacts while preserving biological structure. Given the high-dimensional context (p >> n), we followed the established pipeline and applied a random forest preselection step prior to LASSO regression. The top 10 covariates were selected based on variable importance, followed by LASSO and post-LASSO regression modeling. Cross-dataset validation was performed to assess model generalizability.

        Using protein covariates, the models achieved an average correlation of ~0.7 between predicted and true protein levels in out of sample validation. This performance demonstrates the utility of random forest preselection combined with LASSO regression in high dimensional omics data. The approach successfully reduced noise, improved predictive accuracy, and highlighted biologically meaningful gene–protein relationships.

        These results indicate that protein intensities of new samples can be predicted with high precision, provided that covariate data are available. In practice, this framework can be applied to impute missing values in proteomic datasets or to detect implausible outliers. Furthermore, combining datasets to increase sample size enhances predictive performance and provides deeper biological insight into toxicological responses. Future work will focus on refining variable selection strategies to fully exploit the potential of integrative omics modeling.

        Speaker: Ngoune Darwin (TU Dortmund University)
      • 11:39
        StabCell: A stability-selection framework with error control for clustering and differential expression analysis of scRNA-seq data 18m

        Single-cell RNA sequencing has given researchers unparalleled insight into biological systems. It enables the identification of distinct cellular subpopulations, the characterization of differences between them, and the assessment of overall tissue heterogeneity. Conventional analysis pipelines first cluster individual cells into similar groups and then test for differentially expressed genes between these groups to identify cell types. Using the same data for clustering and testing, however, poses a selective inference problem and can result in overconfidence in differences that may not truly exist. We introduce StabCell, a novel stability-selection framework which integrates clustering and detection of differentially expressed marker genes. By performing clustering and detection of differentially expressed marker genes in randomly selected subsamples of the full dataset, StabCell provides an assessment of clustering stability. Furthermore, it provides finite sample error control over the expected number of falsely selected marker genes per cell subpopulation. In simulation studies, we show that StabCell outperforms conventional analysis pipelines with fewer false positive findings, especially in cases with low signal-to-noise ratio and low sequencing depth. Applying the method to a cell differentiation dataset from induced pluripotent stem cells (iPSCs) to cardiomyocytes and cardiac fibroblasts reveals that meaningful marker genes are consistently in the set of top-ranked genes. This demonstrates the potential of StabCell to enhance the interpretability and robustness of scRNA-seq analyses, supporting more reliable biological discovery and greater confidence in downstream insights.

        96432309606

        Speaker: Niklas Lück (TU Dortmund University, IUF - Leibniz Research Institute for Environmental Medicine)
      • 11:57
        A Nested Cross-Validation Framework for Leakage-Free Calibration of Adaptive Elastic-Net Regression in High-Dimensional Data 18m

        Penalized regression models such as Lasso, Elastic-Net and their adaptive extensions are widely used for simultaneous variable selection and prediction in high-dimensional data analysis. However, conventional implementations of adaptive Elastic-Net (AdaENet) regression often estimate the adaptive hyper-parameter for the Elastic-Net penalty term using the entire dataset before dividing it into training and testing sets. This practice inadvertently introduces data leakage, allowing information from the test set to influence model training and leading to optimistically biased performance estimates.

        To address this common issue, we propose a Nested Cross-Validation (Nested CV) framework for calibrating the AdaENet model. In our approach, all data-dependent operations, including adaptive weight estimation and hyper-parameter tuning, are performed strictly within the training folds of the inner CV loop. This ensures complete separation between training and testing data, yielding unbiased estimates of predictive performance and variable selection stability. The method jointly identifies the optimal model settings while rigorously assessing generalization accuracy on unseen data.
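
        The fold structure described above can be sketched as follows. This is a minimal illustration of the index bookkeeping only, not the authors' implementation: the function names (`k_fold_indices`, `nested_cv_splits`) and the fold counts are assumptions, folds are contiguous rather than shuffled, and the model-fitting steps are indicated only in comments.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k contiguous, nearly equal folds."""
    indices = list(indices)
    n = len(indices)
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits holds (inner_train, inner_valid) pairs that are built
    only from outer_train, so outer_test never enters tuning.
    """
    all_idx = list(range(n_samples))
    for outer_test in k_fold_indices(all_idx, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_splits = []
        for inner_valid in k_fold_indices(outer_train, inner_k):
            valid = set(inner_valid)
            inner_train = [i for i in outer_train if i not in valid]
            # Adaptive weights would be estimated on inner_train and the
            # candidate hyper-parameters scored on inner_valid.
            inner_splits.append((inner_train, inner_valid))
        # The tuned model would be refit on outer_train and evaluated
        # once on outer_test.
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(20, outer_k=5, inner_k=4))
```

        In practice the indices would be shuffled (or stratified) before splitting; the point of the sketch is that every data-dependent step sees only indices from the current training portion.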

        We systematically present the mathematical formulations of key penalized regression techniques (Lasso, Elastic-Net, Adaptive Lasso and AdaENet) and compare the standard K-fold CV procedure, susceptible to data leakage, with our proposed leakage-free Nested CV framework. Using comprehensive simulations across varying sample sizes, feature correlations and signal strengths, we evaluate and contrast model performance in terms of signed support accuracy, false positive rate and test mean squared error.

        Results demonstrate that conventional adaptive Elastic-Net approaches substantially underestimate prediction error and overstate variable selection accuracy, while the proposed Nested CV framework achieves superior predictive reliability and robustness across diverse experimental conditions. This advancement provides a generalizable methodology for calibrating adaptive penalized regression models without information leakage. Particularly in complex, high-dimensional domains such as genetic studies, where data exhibit numerous interrelated predictors, our framework ensures reproducible, trustworthy and unbiased model evaluation.

        85717605319

        Speaker: Gul Inan (Koc University)
    • 10:45 12:15
      IS1: Past and Future of Bayesian Biostatistics Room 1 A

      Room 1 A

      Convener: Reinhard Vonthein (Universität zu Lübeck)
      • 10:45
        Bayesian Highs and Lows: A Pharma Industry Perspective 30m

        The pharmaceutical industry has a long history with Bayesian statistics. Already in 1986, Racine, Grieve, Flühler and Smith wrote an article entitled Bayesian Methods in Practice: Experiences in the Pharmaceutical Industry[1], highlighting four typical applications they encountered at that time. Since then, Bayesian methods have been applied to many more problems in the pharmaceutical industry, and statisticians working in this industry have made substantial contributions leading to important advancements in the field. Some particularly appealing areas for invoking the Bayesian approach are (a) prediction problems such as probability of success; (b) multiple imputation; (c) data-sparse problems (as in phase I oncology studies); (d) shrinkage estimation; and (e) the use of external information. While the use of Bayesian statistics for (a)–(c) is widely accepted nowadays, (d) and especially (e) have led to much more controversial discussions. In the confirmatory setting, the strong focus on (frequentist) error control has proven to be a high hurdle, thus limiting even the consideration of Bayesian designs for this purpose.
        In this talk, I will discuss some of the successes (the “Highs”) of Bayesian statistics in the pharmaceutical industry, while also elaborating on some of the real (or perceived) challenges (the “Lows”). I will enrich these reflections with my personal perspective on uncertainty quantification and why I believe that the Bayesian framework provides a unique advantage in this regard.

        [1] Racine, A., Grieve, A. P., Flühler, H., & Smith, A. F. M. (1986). Bayesian Methods in Practice: Experiences in the Pharmaceutical Industry. Journal of the Royal Statistical Society: Series C (Applied Statistics), 35(2), 93-120.

        64288202604

        Speaker: Simon Wandel (Novartis Pharma AG)
      • 11:15
        Bayesian approaches in time of ML and AI 20m

        The Bayesian approach in general has a lot to offer in times of Machine Learning (ML) and Artificial Intelligence (AI). The Bayesian framework itself offers a learning environment in which the prior and, subsequently, the posterior distributions can be updated sequentially, and in which human expertise can be incorporated. The approach allows for uncertainty quantification of all quantities of interest. Moreover, it safeguards against overfitting. These advantages point to a promising near future for the Bayesian approach in times of ML and AI.

        The approach is computationally challenging and requires not only statistical and programming skills, but also awareness of the decisions made in the data analysis process. Here, the workflow of applied Bayesian statistics helps with iterative model building, model validation and comparison, and with diagnosing computational problems.

        The choice of the prior distributions is a critical part of the Bayesian analysis. In the talk, we will briefly discuss suitable prior choices for important learning problems like Bayesian variable selection in high dimensions and Bayesian neural nets.

        75002906248

        Speaker: Katja Ickstadt (Department of Statistics, TU Dortmund University)
      • 11:35
        Lessons learned in the last 25 years 20m

        Gerhard Nehmiz, consultant for Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Germany

        gerhard.nehmiz.ext@boehringer-ingelheim.com

        The Working Group "Bayes methods" of the IBS / German Region was founded in 2001, and met a need. It had two roots: The WG "Prognosis and decision making" of the GMDS (U. Mansmann) and the German BUGS User Group (DeBUG) (G. Sawitzki, J. König, U. Mansmann, G. Nehmiz).

        The aim of the WG was to support the Bayes approach in theory and application. This would then also strengthen its role in the outside world.

        The activities of the Working Group over these 25 years were, in summary: to connect people, to organize yearly workshops (partly with other WGs), and to organize sessions at the Biometric colloquia or GMDS meetings. A short overview will be given.

        The main "lessons learned" were:
        (1) The WG grew from 16 (2001) through 48 (2006), 61 (2014) to 90 (2025), out of 800-1000 members of the IBS / German Region. Most of these were passive listeners and not active contributors. It was never easy to recruit presenters.
        (2) Progress continued at a remarkable pace even after the Axial era of Bayesian statistics (1990-1995). The WG could not keep up with all of it by far. It picked up topics that were relevant from an algorithmic, methodological, medical or environmental point of view.
        (3) Finding the right words helps one's own understanding and helps send clear messages to the outside world. Communication benefited from separating the terms "probability [of data]" and "degree of certainty [of statements]".
        (4) Good data and meta-data are indispensable. The Bayes approach cannot save a badly designed experiment.

        Sources:
        Webpage of the Working Group
        Webpage of the International Biometric Society, repository of "Biometric Bulletin"

        75002903455

        Speaker: Gerhard Nehmiz (Consultant to Boehringer Ingelheim Pharma GmbH & Co. KG)
      • 11:55
        Discussant 20m
        Speaker: Reinhard Vonthein
    • 10:45 12:15
      Other 2 Room 12

      Room 12

      Convener: Els Goetghebeur (Universiteit Gent)
      • 10:45
        Nonparametric analysis of covariance in Mann-Whitney effects 18m

        Analysis of covariance (ANCOVA) assesses the effect of a group factor on a response while accounting for covariate information. We propose a nonparametric ANCOVA based on Mann-Whitney effects, specifically designed for randomized trials. Unlike classical ANCOVA, our approach does not rely on distributional assumptions or metric-scale data; ordinal measurements (such as Likert-scale items) are sufficient, as the proposed estimators are rank-based. Through theoretical derivations and extensive simulations, we demonstrate that the proposed nonparametric ANCOVA reliably controls type-I error rates across challenging scenarios - including heteroscedasticity, small samples, and imbalanced group designs. The method integrates seamlessly with modern multiple testing procedures such as multiple contrast tests and naturally extends to cases with multivariate responses.
        This integration enables us to derive simultaneous confidence intervals for Mann-Whitney effects that are substantially narrower than established intervals, which do not use covariate information.

        42858809555

        Speaker: Konstantin Emil Thiel (Paracelsus Medical University, Salzburg)
      • 11:03
        Enabling Inference in Small Samples: Bayesian Estimation of Nonparametric Effects 18m

        A common goal in medical research is to estimate a difference between treatment groups and quantify its uncertainty, or to infer a population-level difference. The most commonly used nonparametric group difference measure is the Mann-Whitney (MW) effect. It applies to a broad range of outcomes, including skewed, heteroskedastic and ordinal distributions, since it does not assume a parametric form of the data. This is especially useful in situations where sample sizes are too small to do empirical checks of parametric assumptions. In addition, the MW effect appears in diagnostic trials as the popular target parameter Area under the Receiver Operating Characteristic Curve (AUC).
        Despite decades of research on the inference of MW effects, few Bayesian methods have been developed. Advantages of Bayesian estimation include (a) full uncertainty quantification via the posterior distribution, (b) improved numerical stability, and (c) the capability to include prior knowledge. Modeling external knowledge as a prior distribution can raise study power to a sufficient level in small-sample settings. This permits inference, for example, in rare diseases where recruitment rates are low, and enables more precise AUC estimates in the most interesting parameter ranges near its upper boundary.
        We investigate whether Bayesian methods can improve inference for the MW effect in the unpaired, paired, and longitudinal two-group designs. For the unpaired design, an existing algorithm is extended to improve performance for small samples and parameter values near the boundary. For the paired and longitudinal designs novel approaches are developed. Empirical Likelihood is a nonparametric data model that has been widely used for Bayesian estimation over the last 20 years.
        Extensive simulation studies demonstrate that credible interval coverage is approximately 95% across a broad range of data distributions. The influence of prior specification on coverage and precision (interval length) will be investigated. Additionally, the type-I error and power of Bayesian credible intervals when used as a statistical testing criterion will be evaluated and compared with those of frequentist methods.

        85717611977

        Speaker: Levin Wiebelt (Charité - Universitätsmedizin Berlin)
      • 11:21
        A New Approach to the Nonparametric Behrens-Fisher Problem with Compatible Confidence Intervals 18m

        In the context of a two-group comparison, when the assumption of equal variances between groups is doubtful or the data may be skewed or ordinal, the classical t-test and an effect measure parameterized in terms of means may no longer be suitable. In such cases, it appears more appropriate to formulate the problem as the nonparametric Behrens-Fisher problem of testing H0: θ = 1/2, where θ = P(X < Y) + (1/2) P(X = Y) represents the Mann-Whitney effect. This parameter offers a meaningful assessment of treatment effects regardless of the true underlying distribution.
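
        As an illustration of the parameter defined above, the following minimal sketch computes the plug-in estimate of θ by comparing all pairs of observations, with ties counted as one half. The function name `mann_whitney_effect` and the example data are hypothetical; the sketch shows the estimand only, not the test proposed in this talk.

```python
from itertools import product


def mann_whitney_effect(x, y):
    """Plug-in estimate of theta = P(X < Y) + 1/2 * P(X = Y)."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    score = 0.0
    for xi, yj in product(x, y):
        if xi < yj:
            score += 1.0   # pair supports X < Y
        elif xi == yj:
            score += 0.5   # ties contribute one half
    return score / (len(x) * len(y))


# Hypothetical ordinal scores for two groups
control = [1, 2, 2, 3]
treated = [2, 3, 4, 4]
theta_hat = mann_whitney_effect(control, treated)
```

        Because only order comparisons are used, the estimate is well defined for ordinal data with ties; a value of 1/2 corresponds to the null hypothesis H0.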

        While many methods exist to test the aforementioned hypothesis, several impose distributional restrictions on the underlying data, such as assuming continuity or excluding discrete data (such as ordered categorical data), thereby limiting their flexibility in practical applications. To date, the well-known Brunner–Munzel test has been regarded as the standard method for testing this hypothesis, owing to its minimal assumptions and reasonable performance with larger sample sizes. The Brunner–Munzel test, however, struggles to control the Type-1 error rate at significance levels α < 0.05, which may be considered a limitation given recent developments advocating for more stringent significance thresholds and the frequent need to adjust for multiplicity, leading to reduced α-levels. Moreover, the confidence intervals compatible with the Brunner–Munzel test are not range-preserving and tend to exhibit liberal behavior when effect sizes are large.

        In this talk, we present an alternative method to address the nonparametric Behrens-Fisher problem. The test is derived by considering the ratio of the true variance of the Mann-Whitney effect estimator to its theoretical maximum, as derived from the Birnbaum-Klose inequality. Through simulations, we demonstrate that the proposed test effectively controls the Type-1 error rate under various conditions, including small and unbalanced sample sizes, and different data-generating mechanisms. Notably, it provides better control of the Type-1 error rate than the widely used Brunner–Munzel test, particularly at small significance levels such as α = 0.005. We further construct compatible range-preserving confidence intervals and show that they exhibit improved coverage compared to the confidence intervals compatible with the Brunner–Munzel test. Finally, we illustrate the application of the method in a clinical trial example.

        32144104244

        Speaker: Stephen Schüürhuis (Institute of Biometry and Clinical Epidemiology, Charité - Universitätsmedizin Berlin)
      • 11:39
        Graph-theoretic determinants of causal discovery performance in feedback-driven biological networks 18m

        Feedback is pervasive in biological and biomedical systems, yet many causal discovery methods, including widely used score-based approaches such as NOTEARS, impose acyclicity and may therefore misrepresent gene regulatory, pharmacological, or cellular processes. Building on recent advances in cyclic causal inference, such as the intervention-capable Bicycle method, we investigate how graph-theoretic structure governs the feasibility and accuracy of causal discovery in directed networks with feedback.
        We use biologically-inspired directed random graph models and tune their parameters to control key structural invariants, including spectral radius, cycle density, v-structure frequency, and degree heterogeneity. Using these networks as ground truth, we simulate data from linear and nonlinear structural equation models with controlled feedback characteristics. Causal structure is then inferred using both acyclicity-constrained (NOTEARS) and cyclic-capable (Bicycle) algorithms.
        By regressing inference performance on the underlying graph invariants, we identify how specific topological features contribute to the recoverability of causal structure.

        96432310717

        Speaker: Markus Schepers (Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University)
      • 11:57
        The Chicken or The Egg? Causal Inference Methods for Cross-Lagged Effects in Longitudinal Panel Data 18m

        Background
        Longitudinal observational data frequently involve time-varying confounding, autoregressive dependence, and potential reciprocal feedback between processes. These features complicate the estimation of cross-lagged causal effects and challenge the assumptions underlying standard modelling approaches. Methodological evaluation requires transparent, reproducible simulation frameworks to verify that candidate methods can recover the underlying causal data structure.
        Objective
        To construct an empirically informed, reproducible simulation framework following the ADEMP (Aims, Data-generating mechanism, Estimands, Methods, Performance measures) structure, and to assess the ability of Structural Equation Modelling (SEM) and Bayesian dynamic multivariate panel modelling (dynamite) to recover predefined cross-lagged causal effects.
        Methods
        A data-generating mechanism was developed for N=1000 individuals measured at five timepoints (baseline and four follow-ups). It incorporated a time-varying confounder with deterministic drift and gamma-distributed increments, along with autoregressive processes for the exposure (X) and outcome (Y). Three scenarios were examined: (1) no cross-lagged effect, (2) a unidirectional effect from X→Y, and (3) bidirectional feedback. For each scenario, 500 Monte Carlo replications were generated. SEM and dynamite were applied to each dataset. Performance was evaluated using bias, root mean squared error (RMSE), and 95% interval coverage. All code, simulations, and analysis scripts were written for transparency and reproducibility. The computational time required was also recorded.
        Results
        Dynamite consistently recovered the true cross-lagged parameters across all scenarios, with almost zero bias, low RMSE, and coverage near nominal levels. SEM displayed systematic negative bias and near-zero coverage across all scenarios, including when no true effect was present. Runtime benchmarking based on 2,000 Monte-Carlo replications showed that SEM executed all scenarios in approximately 1.3 hours. Dynamite required 6.2 hours for Scenario 1, 5.9 hours for Scenario 2, and 55.3 hours for Scenario 3, driven by the increased computational burden of posterior sampling in reciprocal-feedback structures. While the operational overhead for dynamite was substantially higher, its parameter recovery was materially superior.
        Conclusions
        Model performance depended critically on alignment between methodological assumptions and the data-generating mechanism. The framework demonstrated that Bayesian dynamic panel models offer robust estimation under time-varying confounding and temporal feedback, whereas SEM failed to recover the true causal effects. These results underscore the importance of rigorous simulation-based validation before applying causal methods to longitudinal observational data.

        85717617289

        Speaker: Tanya Toluay (Charité - Universitätsmedizin Berlin, Institute of Biometry and Clinical Epidemiology, Berlin)
    • 10:45 12:15
      Statistical hypothesis testing 1 Room 13 A

      Room 13 A

      Convener: Michael Lauseker-Hao (IBE)
      • 10:45
        Bootstrap calibration: A flexible tool to obtain (simultaneous) confidence, prediction and tolerance intervals 18m

        Bootstrap calibration rests on a simple idea: based on a bootstrap sample, one can compute the bootstrap coverage probability of the desired interval. One can then adjust the interval limits until the bootstrap coverage probability approaches the nominal level, e.g. by adjusting the α-level used for interval calculation. Finally, the desired interval is calculated by replacing the nominal α-level with its bootstrap-calibrated counterpart.
        This idea was first proposed in the 1980s and has since been adapted by several authors to yield (simultaneous) confidence, prediction or tolerance intervals. However, a unified framework that enables the calibration of these different intervals is still missing. Closing this gap is the aim of this talk.
        The presented algorithm was initially proposed to enable the computation of prediction intervals for different scales and models and is the foundation of the R package predint.
        It will be shown that this approach can easily be generalized to yield (simultaneous) confidence intervals as well as equal-tailed tolerance intervals. The idea behind this approach is to calibrate the lower and upper limits of Wald-type intervals individually, which adapts the calibrated interval to possible skewness of the underlying distribution, improves the interval's coverage probability and/or adjusts for the multiple testing problem.
        The proposed bootstrap calibration is especially promising for multiple testing of hypotheses other than differences between parameters. The computation of simultaneous confidence intervals for ratios between the means of several treatment groups and a control will be demonstrated based on a clinical multi-arm study regarding the reduction of brain infarct size following mechanical thrombectomy.
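
        The basic calibration loop described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the predint package: it uses a nonparametric bootstrap, a symmetric Wald-type prediction interval for one future observation, a single α for both limits (rather than separate calibration of the lower and upper limit), and bisection as the search strategy. The function name `calibrate_alpha` and the data are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def calibrate_alpha(data, nominal=0.95, n_boot=2000, steps=30, seed=1):
    """Find an alpha whose bootstrap coverage reaches the nominal level."""
    rng = random.Random(seed)
    n = len(data)
    scale = (1 + 1 / n) ** 0.5  # prediction-interval inflation factor
    boot = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)  # bootstrap sample
        future = rng.choice(data)          # bootstrap "future" observation
        boot.append((mean(resample), stdev(resample) * scale, future))

    def coverage(alpha):
        # Share of bootstrap intervals that contain their future value
        q = NormalDist().inv_cdf(1 - alpha / 2)
        return sum(abs(y - m) <= q * s for m, s, y in boot) / n_boot

    lo, hi = 1e-6, 0.5  # coverage is non-increasing in alpha
    for _ in range(steps):
        mid = (lo + hi) / 2
        if coverage(mid) >= nominal:
            lo = mid  # still covers enough: a larger alpha is admissible
        else:
            hi = mid
    return lo, coverage(lo)


# Hypothetical, mildly right-skewed measurements
data = [4.1, 4.6, 4.8, 5.0, 5.1, 5.3, 5.4, 5.6,
        5.7, 5.9, 6.2, 6.4, 6.9, 7.5, 8.3]
alpha_cal, boot_cov = calibrate_alpha(data)

# Final interval: nominal alpha replaced by its calibrated counterpart
q_cal = NormalDist().inv_cdf(1 - alpha_cal / 2)
half = q_cal * stdev(data) * (1 + 1 / len(data)) ** 0.5
interval = (mean(data) - half, mean(data) + half)
```

        Calibrating the two limits separately, as described in the abstract, would replace the single search by one search per limit, each targeting half of the allowed non-coverage.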

        53573510568

        Speaker: Max Menssen (University Medical Center Göttingen, Department of Medical Statistics, Göttingen, Germany)
      • 11:03
        Multiple-use prediction and calibration for all future values: exact simultaneous tolerance bands for regression 18m

        Multiple-use prediction and calibration for all future values play a valuable role in many areas including health and medical research. Simultaneous tolerance bands (STBs) can be used for these purposes. Motivated by real-world problems in health research, this study focuses on the construction of exact STBs for multiple regression over any given rectangular covariate region and for polynomial regression over any given covariate interval.

        We first address a key gap in the literature by constructing exact STBs for multiple regression over a given rectangular covariate region. A new simultaneous tolerance band (STB) is also proposed for both multiple and polynomial regression models. Unlike approximate or conservative methods, the exact STB rigorously guarantees the pre-specified confidence level. This new STB is compared systematically with existing approaches under the average shift criterion.

        Our numerical results show that the new STB outperforms existing alternatives under the average shift criterion and is thus recommended. We apply the new STB to two critical health applications: predicting blood pressure levels in infants and calibrating gestational ages. By leveraging exact STBs, based on a single training dataset, our method enables precise predictions and calibrations for infinitely many future observations.

        64288208897

        Speaker: Yang Han (University of Manchester)
      • 11:21
        Minimum area confidence set optimality for simultaneous confidence bands for percentiles with applications to drug shelf‐life estimation 18m

        Background: The stability of a drug product over time is a critical property in pharmaceutical development. A key objective in drug stability studies is to estimate the shelf-life of a drug, involving a suitable definition of the true shelf-life and the construction of an appropriate estimate of the true shelf-life. Simultaneous confidence bands (SCBs) for percentiles in linear regression are valuable tools for determining drug shelf-life in drug stability studies.

        Methods: In this paper, we propose a novel criterion, the Minimum Area Confidence Set (MACS), for identifying the optimal SCB for percentile regression lines. This criterion focuses on the area of the constrained regions for the newly proposed pivotal quantities, which are generated from the confidence set for the unknown parameters of a SCB. We employ the new pivotal quantities to construct exact SCBs over any finite covariate intervals and use the MACS criterion to compare several SCBs of different forms. Additionally, we introduce a computationally efficient method for calculating the critical constants of exact SCBs for percentile regression lines.

        Results: The optimal SCB under the MACS criterion is demonstrated to effectively construct interval estimates of the true shelf-life. The proposed method for calculating critical constants significantly improves computational efficiency. A real-world drug stability dataset is used to illustrate the application and advantages of the proposed approach.

        75002911487

        Speaker: Lingjiao Wang (Warwick Medical School, University of Warwick)
      • 11:39
        Detecting day-to-day effects in concentration-response experiments in toxicology 18m

        In toxicology, concentration-response experiments are conducted to investigate the toxic behaviour of a given substance. Typically, a parametric model is fitted and effective concentrations to a viability level p (EC_p) are estimated, which are used e.g. in further experiments. However, in previous research, it was observed that the estimated EC_p of the same experiment conducted in the same manner on different days differ substantially. The question arises whether regular variability within the experimental day can explain these differences, or whether there are structural changes, so-called day-to-day effects, between the different days.
        To detect potential day-to-day effects, we evaluate the EC_p of one day and the corresponding viability of another day at this concentration as a nested function and derive the asymptotic distribution. We use this result to construct appropriate confidence intervals (CIs) and hypothesis tests. Starting with the situation in which two different experimental days are considered, we generalize the results to the situation of several experimental days using the Dunnett procedure. An extensive simulation study evaluates the developed methods. Moreover, we apply the testing procedure to a real cytotoxicity dataset to detect potential day-to-day effects.

        53573503048

        Speaker: Julia Eichhorn (TU Dortmund University)
      • 11:57
        Choice of the hypothesis matrix for usual quadratic forms 18m

        Linear hypotheses Hp = y regarding a parameter vector p arise in a wide range of scientific fields, including life sciences, psychology, economics, environmental sciences, and other areas of applied statistics, due to their ability to encode a wide variety of scientific questions using a simple algebraic framework. The unknown parameter vector p can represent, for example, an expectation vector, a vector of regression coefficients, a quantile vector, the vector of nonparametric relative effects, or an upper-triangular vectorised covariance matrix. This general formulation allows both classical parametric models and modern semi- or nonparametric frameworks to be analysed.

        There exist numerous ways to express the same null hypothesis through different matrices H and corresponding vectors y. While for y = 0 there exists a unique projection matrix P representing the same hypothesis, such a unique matrix does not necessarily exist otherwise. Furthermore, practical implementations often use matrices that are not of full rank, which can increase computational cost and reduce numerical stability, particularly in high-dimensional settings.

        Linear hypotheses are typically tested using quadratic forms, resulting in univariate test statistics; such quadratic forms possess numerous desirable properties. Among the most widely used are the Wald-type statistic (WTS) and the ANOVA-type statistic (ATS), which are commonly employed and considered fundamental tools for testing linear hypotheses. This raises important methodological questions: To what extent does the value of the quadratic form based test statistic depend on the specific choice of the pair (H,y)? How can one obtain a representation that is both unique and minimal in dimension, avoiding redundancy? And how can additional structure be imposed to improve computational and interpretational properties?

        In this contribution, we show that for WTS, ATS and quadratic forms, a companion hypothesis matrix with the minimal number of rows can be constructed for each hypothesis, that formulates the same null hypothesis, while always yielding identical test decisions. Moreover, these minimal matrices can be derived constructively, with explicit formulas available for key matrices such as the centering matrix. Finally, our approach yields a procedure for selecting the hypothesis matrix when y is not equal to zero, thereby improving reproducibility.

        75002905844

        Speaker: Paavo Sattler (TU Dortmund; RWTH Aachen)
    • 10:45 12:15
      TC1: 50th anniversary of the closed testing procedure Room 1 B

      Room 1 B

      Convener: Bjorn Bornkamp (Novartis)
      • 10:45
        The invention of the closed testing procedure and early developments 18m

        This talk will present the closed testing procedure as it was first introduced by Marcus et al. (1976). We discuss further developments and early contributions published in the following years. A special focus is given to conferences such as those in Oberwolfach, Bad Ischl and especially Gerolstein. We also report from the first International Conferences on Multiple Comparison Procedures (MCP) held in Tel Aviv and Berlin.
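
        For readers unfamiliar with the principle, the following minimal sketch illustrates closed testing: an elementary hypothesis is rejected only if every intersection hypothesis containing it is rejected by a local level-α test. The choice of Bonferroni local tests, the function name `closed_testing_bonferroni` and the p-values are assumptions made for illustration; with this choice the procedure reproduces Holm's step-down test.

```python
from itertools import combinations


def closed_testing_bonferroni(p_values, alpha=0.05):
    """Indices of elementary hypotheses rejected by closed testing."""
    m = len(p_values)
    # Local level-alpha Bonferroni test for each non-empty intersection
    local_reject = {}
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            smallest = min(p_values[i] for i in subset)
            local_reject[subset] = smallest <= alpha / size
    # H_i is rejected iff all intersections containing i are rejected
    return {
        i for i in range(m)
        if all(rej for subset, rej in local_reject.items() if i in subset)
    }


rejected = closed_testing_bonferroni([0.01, 0.04, 0.03], alpha=0.05)
```

        The sketch enumerates all 2^m - 1 intersections and is therefore only practical for a small number of hypotheses; shortcut procedures avoid the full enumeration.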

        32144106669

        Speaker: Markus Neuhäuser (Koblenz University of Applied Sciences)
      • 11:03
        Closed Testing, Interim Analysis and Adaptations – History, Methods, and Modern Challenges 18m

        The strict control of the studywise Type I error rate has long been a cornerstone of confirmatory clinical trials. Closed testing and adaptive designs are two influential ideas in modern trial methodology, yet they emerged from different motivations: one from the need to rigorously control multiplicity when testing multiple hypotheses, the other from the desire to build flexibility into study conduct. This presentation explores how these two frameworks interact, complement, and occasionally conflict.
        Even the test of a single hypothesis in an adaptive design relying on combination tests can be embedded in, or interpreted through, a closed-testing perspective when viewed as testing an intersection hypothesis. We examine how adaptive features can be embedded within a closed-testing structure. By using adaptive closed testing, clinical trial designs can not only allow early stopping for efficacy but also incorporate design adaptations such as treatment-arm selection or population enrichment.
        Depending on how multiplicity and adaptations are addressed, different challenges arise. For example, if multiplicity is handled stage-wise first and the stagewise evidence is then combined using an adaptive combination test, the resulting procedure may become non-consonant. Another issue related to closed testing is that deriving both informative selective confidence intervals and simultaneous confidence intervals is challenging. However, depending on the adaptation rule, it may be possible to achieve both goals. Another challenge concerns how differences in the timing at which information on various endpoints becomes available may interfere with the practicability of implementing closed testing. For example, the rejection of a short-term endpoint may depend on hypotheses involving endpoints that are analysed only at a later time point, such as PFS and OS in oncology trials.
        Finally, we consider forward-looking issues: how closed testing extends or fails to extend to adaptive perpetual platform trials; what happens when the number of hypotheses eventually being tested are unknown, e.g., as hypotheses are added or removed during the conduct of the platform study. How do methodological developments continue to push the boundaries of both adaptability and the concept of studywise error control in complex, evolving trial settings.

        Speaker: Franz König (Medical University of Vienna)
      • 11:21
        Principles in Harmony: Closed Testing Meets the Partitioning and Projection Principle 18m

        In this talk, we will explore the relationship between the closed testing principle for multiple tests with family-wise error rate (FWER) control and the partitioning plus projection principle for constructing simultaneous confidence intervals. Starting with the simple observation that a multiple test with FWER control is formally equivalent to a one-sided simultaneous confidence interval for the vector of binary parameters indicating the true null and alternative hypotheses, we will see that the closed testing principle can be understood as a special case of the partitioning plus projection principle. We will then utilise this relationship to extend some common closed testing procedures to simultaneous confidence intervals, referencing the existing literature on compatible and informative simultaneous confidence intervals. Another key contribution of our talk and research is the extension of the concept of consonance for closed tests to the partitioning plus projection principle, with the aim of deriving computationally efficient algorithms for calculating simultaneous confidence intervals. These relationships and extensions will be illustrated using simple, instructive examples.

        Speaker: Werner Brannath (University of Bremen)
      • 11:39
        Graphical Approaches for Transparent Closed Testing 18m

        The closed testing principle is a fundamental framework to construct multiple testing procedures controlling the familywise error rate in the strong sense. However, a major challenge in the application of the principle is the number of intersection hypothesis tests that need to be specified, which increases exponentially in the number of elementary hypotheses tested and makes it difficult to communicate the resulting multiple testing strategy. Graph-based tests are a transparent approach to specify the intersection hypothesis tests in a closed test. They were initially proposed for Bonferroni-type tests, for which they formalise an intuitive alpha recycling mechanism. In this setting they also yield a simple sequentially rejective testing algorithm and unify well-known multiple testing procedures, such as gatekeeping, fixed sequence, and fallback procedures, making the underlying testing strategy transparent. More generally, the graphical approach can be used to define weighted hypothesis tests for all intersection hypotheses of a closed test. This allows graphs to be used for the construction of closed tests that account for the correlation structure of the test statistics and thereby improve efficiency.

        In this talk, we review graph-based closed tests and the rationale of the underlying algorithm, and we discuss the potential loss of consonance of the resulting closed test that can occur if correlations are taken into account. Furthermore, we discuss applications to group-sequential and adaptive designs based on p-value combination and conditional error approaches.
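
        The sequentially rejective algorithm for Bonferroni-type graphs mentioned above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `graphical_test` and its arguments are hypothetical, and the update rules are the standard ones for weighted Bonferroni graphs (a rejected hypothesis passes its weight along its outgoing edges, and the edges among the remaining hypotheses are rewired).

```python
def graphical_test(p, w, G, alpha=0.05):
    """Sequentially rejective graphical procedure (Bonferroni-type sketch).

    p: unadjusted p-values, one per elementary hypothesis
    w: initial weights (non-negative, summing to at most 1)
    G: transition matrix; G[i][j] is the fraction of the local level of
       hypothesis i that is passed to hypothesis j once i is rejected
    Returns the sorted indices of the rejected hypotheses.
    """
    m = len(p)
    w = list(w)
    G = [list(row) for row in G]
    active = set(range(m))
    rejected = set()
    while True:
        # Hypotheses that can be rejected at their current local level.
        candidates = [i for i in sorted(active) if p[i] <= w[i] * alpha]
        if not candidates:
            break
        i = candidates[0]
        active.remove(i)
        rejected.add(i)
        new_w = list(w)
        new_G = [[0.0] * m for _ in range(m)]
        for j in active:
            # Recycle the weight of the rejected hypothesis.
            new_w[j] = w[j] + w[i] * G[i][j]
            for k in active:
                if j == k:
                    continue
                denom = 1.0 - G[j][i] * G[i][j]
                if denom > 0:
                    # Rewire the edge j -> k around the removed node i.
                    new_G[j][k] = (G[j][k] + G[j][i] * G[i][k]) / denom
        new_w[i] = 0.0
        w, G = new_w, new_G
    return sorted(rejected)


# Holm's procedure for three hypotheses expressed as a graph:
# equal initial weights, each node splits its level equally on rejection.
holm_graph = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
print(graphical_test([0.01, 0.02, 0.04], [1 / 3] * 3, holm_graph))
```

        A fixed-sequence test is obtained with weights `[1, 0]` and a single edge from the first to the second hypothesis, which illustrates how one algorithm covers several well-known procedures.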

        Speaker: Martin Posch (Medical University of Vienna)
      • 11:57
        Multiple hypotheses testing in clinical trials beyond familywise error rate control 18m

        We consider the problem of testing multiple null hypotheses, where a decision to reject or retain must be made for each one and where incorrect decisions may, in a real-life context, incur different losses. We argue that traditional methods controlling the Type I error rate may be too restrictive in this situation and that the standard familywise error rate may not be appropriate. For example, when disjoint sub-populations are considered, no multiplicity adjustment appears necessary, since a claim in one sub-population does not affect patients in another. Maurer et al. (2023) formalized this perspective by introducing familywise expected loss control, which is based on suitable loss functions for a given decision rule and allows incorrect decisions to be treated unequally by assigning different loss values. For intersecting sub-populations, Brannath et al. (2023) proposed the population-wise error rate, defined as the probability that a randomly selected patient will be exposed to an inefficient treatment. In this talk, we review these approaches, discuss their connections, and explore possible extensions based on the generalized closed testing procedure recently introduced by Xu et al. (2025).

        References
        Brannath, Hillner, Rohmeyer (2023). The population-wise error rate for clinical trials with overlapping populations. Statistical Methods in Medical Research 32:334–352.
        Maurer, Bretz, Xun (2023). Optimal test procedures for multiple hypotheses controlling the familywise expected loss (with Discussion). Biometrics 79:2781–2793.
        Xu, Solari, Fischer, de Heide, Ramdas, Goeman (2025). Bringing Closure to False Discovery Rate Control: A General Principle for Multiple Testing. arXiv:2509.02517v1 [stat.ME]

        Speaker: Frank Bretz (Novartis)
    • 12:15 13:45
      Lunch break 1h 30m
    • 13:45 15:15
      Censored data 2 Room 13 B

      Convener: Sarah Friedrich-Welz (University of Augsburg)
      • 13:45
        mRNA COVID-19 vaccination in pregnancy and risk of pregnancy loss: a progressive multistate model to account for time-dependent exposure 18m

        The recently conducted observational Embryotox cohort study on mRNA COVID-19 vaccination aimed to assess the safety of mRNA COVID-19 vaccines in pregnancy. Here, we focus on the methodological approach used to assess the effect of the vaccination on adverse pregnancy outcomes such as spontaneous abortion and stillbirth. The data featured delayed study entry and cohort crossover as well as associated immortal time which had a substantial impact on the results if not appropriately modeled.
        Among pregnancies with a contact date from January 1st 2021 and an estimated date of birth before October 2022, the final cohort comprised 8,146 prospectively ascertained pregnant women who were vaccinated with an mRNA vaccine in pregnancy or up to 30 days before the last menstrual period. Among them, 1,478 were vaccinated only after study entry. The unvaccinated prospective comparison cohort consisted of 1,955 pregnant women. There were 102 spontaneous abortions observed in the vaccinated cohort and 211 in the unvaccinated cohort.
        To model time-dependent vaccination status and different pregnancy outcomes we used a progressive multistate model with unvaccinated/vaccinated as transient states and competing absorbing states for pregnancy outcomes split by previous transient state. Probability estimation based on the data via Aalen-Johansen was used to arrive at relative risk estimates. However, relative risks are more difficult to interpret as a consequence of the complex timing of events, where, e.g. spontaneous abortion may precede vaccination. To aid interpretation, we contrasted results with an overly simplified two-group comparison subject to immortal time bias and to a hypothetical probability estimation where vaccination does not impact hazards. We found that vaccination was associated with a smaller probability of spontaneous abortion. This observed protective association was substantially smaller than in the immortal time biased analysis and nicely illustrated in the comparison with the hypothetical scenario.
        This work was funded by the German Federal Ministry of Health and the Paul-Ehrlich-Institute (PEI).

        Speaker: Lukas Lohse (Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Clinical Pharmacology and Toxicology, Embryotox Center of Clinical Teratology and Drug Safety in Pregnancy)
      • 14:03
        Estimating a treatment effect for survival when membership of the target population is partially hidden 18m

        We consider a two-arm randomized clinical trial in precision oncology with a time-to-event endpoint. Patients in the control arm receive standard of care (SOC) treatment, whereas patients in the experimental arm are offered personalized treatment, e.g. on the basis of molecular characterization of the disease. However, some patients in the experimental arm will not receive personalized treatment, for example because no personalized treatment is available or paid for. These patients also receive SOC. A goal is to formulate a model that identifies and allows for many variants of the implementation rate of the treatment.
        Classical Intention-to-treat analysis compares the outcomes of the two trial arms irrespective of the actual treatment. This is meaningful to evaluate the effect of a policy.
        An alternative estimand, which may be more interesting in practice, is the treatment effect in the target population of all patients who would actually implement the personalized treatment if it were offered to them.
        Direct estimation of this effect is not feasible, since information on which patients in the SOC arm would have received personalized treatment if randomized to the experimental arm is missing.
        In particular, the survival distribution in the SOC arm is a mixture of distributions for patients who would and patients who would not potentially receive personalized treatment. We applied a nonparametric approach inspired by Patra and Sen [1] combined with isotonic regression to estimate the survival curves of both components in the mixture. The estimated survival curves are then compared using the restricted mean survival time and hypothesis testing is performed via bootstrap resampling. As an alternative, a semi-parametric approach is proposed.
        Power and Type I error of the two methods are compared in detailed simulation studies. Both approaches are further evaluated on real clinical trial data.
        This work presents a novel, rigorous model for the analysis of treatment effects in the presence of mixtures or asymmetric trials. Practical guidelines on when this approach is necessary or appropriate are provided.

        [1] Patra, R. K. and Sen, B. (2015), Estimation of a Two-component Mixture Model with Applications to Multiple Testing, Journal of the Royal Statistical Society, Series B, Vol. 78 (4), 869-893

        Speaker: Marilena Müller (German Cancer Research Center)
      • 14:21
        Comparison of survival outcomes with a partly unknown, time-dependent, exogenous treatment – a problem in paediatric SCT-studies 18m

        The assessment of allogeneic stem cell transplantation (SCT) over standard continued chemotherapy in a clinical trial of childhood leukaemia is not straightforward. Standard chemotherapy will be stopped and SCT performed if a donor search identifies a suitable stem cell donor in registries of potential donors. Randomization to SCT or continued chemotherapy is usually not feasible due to ethical considerations. Nevertheless, a fair comparison can be based on the two groups formed by the availability or non-availability of suitable stem cell donors. Thereby, donor availability is a temporarily unknown external baseline variable whose actual values will become known either when a suitable stem cell donor is identified from the registry or after the end of an unsuccessful donor search. However, donor search can be prematurely ceased, e.g. due to patients’ death or deterioration of patients’ health. Unfortunately, then donor availability and thus group membership will remain unknown. There is currently no approach available to correctly illustrate survival probabilities over time for the two groups and compare them at interesting time-points, especially in case of non-proportional hazards that are mainly due to early toxicities after SCT.
        For each patient with prematurely ceased donor search, it is possible to calculate the probabilities that a suitable donor might or might not be identified after the ceasing of the donor search, respectively. These probabilities are utilized to develop adjusted Kaplan-Meier curves for visual group comparison over time. These curves have a valid survival probability interpretation unlike the commonly applied Simon and Makuch curves where patients are allowed to change the group over time.
        These estimated survival probabilities derived from adjusted Kaplan-Meier curves can also be used to assess group differences at selected long-term time points, which is especially interesting in case of non-proportional hazards. A corresponding statistical test is proposed and compared to other test strategies.
        Data from an international study of children with newly diagnosed Philadelphia chromosome-positive acute lymphoblastic leukaemia are used to exemplify the satisfactory performance of the new approach. Other methods are only able to estimate survival probabilities at selected time points. Results of the different methods are compared and discussed.
        The newly proposed method makes it possible, for the first time, to show Kaplan-Meier curves when group membership at baseline is unknown and becomes only partly known over time.

        Speaker: Martina Mittlböck (Institute of Clinical Biometrics, CeDAS, Medical University of Vienna)
      • 14:39
        Detecting Early and Late Divergences in Survival Curves Using Nonparametric Effect Measures 18m

        Clinical trials often show treatment curves that diverge early and converge later, or vice versa—patterns that are poorly captured by the proportional-hazards assumption. We develop a joint inferential framework for two nonparametric functionals of censored survival data: the Kaplan–Meier–based Mann–Whitney effect and a novel temporal contrast separating early and late differences. The approach provides interpretable, probability-scale effect measures and enables joint inference for global and temporal contrasts under right censoring. In simulation studies, the method outperforms the log-rank test under non-proportional hazards while maintaining nominal type-I error. A real-world application illustrates how the temporal contrast reveals clinically meaningful early treatment advantages that remain hidden in standard analyses.

        Speaker: Jonas Beck (DKFZ)
      • 14:57
        Evaluation of Z-tests to compare fixed time survival probabilities using stratified Kaplan-Meier estimates with different variance estimators and weights 18m

        Time-to-event variables are among the most relevant primary efficacy endpoints in clinical trials, particularly in later phase oncology trials. When the proportional hazards assumption is expected to be severely violated, an alternative to the log-rank test is needed. Testing for differences in survival probabilities at a pre-defined time point offers one such option and has already been employed in some studies. However, in trials with stratified randomization, using the Kaplan-Meier estimate poses particular challenges: many common variance estimators are not evaluable or estimate a value of zero in strata with no events or in strata where all patients had the event, and the optimal stratum weighting strategy remains unclear.
        Through a simulation study mimicking clinical trials with stratified randomization and various non-proportional hazards scenarios, we compared the stratified log-rank test and Kaplan-Meier-based Z-tests using different variance estimators and stratum weights, evaluating type-I error and power. While the log-rank test remained optimal under proportional hazards, some Z-tests provided robust performance across a broad range of scenarios and outperformed the log-rank test in situations with delayed treatment effects or crossing survival curves.
        For stratified analyses under non-proportional hazards, we recommend a Kaplan-Meier-based Z-test with Borkowf's adjusted variance estimator and Mantel-Haenszel type weights, as inverse variance weights can lead to alpha-inflation. For non-stratified analyses, Greenwood-based or complementary log-log Z-tests are viable alternatives when zero variance can be excluded.
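
        For the simplest, non-stratified case, a Greenwood-based Z-test at a fixed time point can be sketched as below. This is only an illustration under stated assumptions: the function names `km_at` and `z_test_fixed_time` are hypothetical, the plain Greenwood variance is used rather than Borkowf's adjusted estimator, and no stratum weighting is included. The sketch also shows where the variance estimator is not evaluable (all patients at risk have the event) or zero (no events before the time point).

```python
import math


def km_at(times, events, tau):
    """Kaplan-Meier estimate of S(tau) and its Greenwood variance estimate.

    times: follow-up times; events: 1 = event, 0 = censored.
    Returns (survival, variance); the variance is NaN when the last
    patients at risk all have the event, because Greenwood's formula
    is not evaluable in that case.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, gw_sum = 1.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = removed = 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            if deaths == n_at_risk:
                return 0.0, float("nan")
            surv *= 1.0 - deaths / n_at_risk
            gw_sum += deaths / (n_at_risk * (n_at_risk - deaths))
        n_at_risk -= removed
    return surv, surv ** 2 * gw_sum


def z_test_fixed_time(times1, events1, times0, events0, tau):
    """Two-sided Z-test for a difference in survival probabilities at tau."""
    s1, v1 = km_at(times1, events1, tau)
    s0, v0 = km_at(times0, events0, tau)
    if not (v1 + v0 > 0):
        raise ValueError("variance estimate is zero or not evaluable")
    z = (s1 - s0) / math.sqrt(v1 + v0)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Toy data (hypothetical): five patients per arm, no censoring.
z, p_value = z_test_fixed_time([1, 2, 3, 4, 5], [1] * 5,
                               [1, 1, 3, 4, 5], [1] * 5, tau=1.5)
print(round(z, 3), round(p_value, 3))
```

        The explicit check on the summed variance corresponds to the condition in the abstract that zero variance must be excluded before a Greenwood-based test is used.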

        Speaker: Hannes Buchner (Staburo GmbH)
    • 13:45 15:15
      IS2: Multiple tests beyond parametric assumptions Room 1 A

      Convener: Paavo Sattler (RWTH Aachen University)
      • 13:45
        Multiple Comparison Procedures for Simultaneous Inference in General Factorial Design for Multivariate Functional Data under Non-parametric Assumptions 30m

        Functional Data Analysis (FDA), focusing on data composed of functions or curves, has become increasingly popular. We study reliable methods for comparing multiple groups of functional data, especially in studies involving several factors or complex designs. We introduce a new statistical approach designed for multivariate functional data. Our methods are reliable because they allow us to compare mean functions without needing strict assumptions about the error distribution (non-Gaussian errors) or requiring that the variation patterns (covariance functions) are the same across all groups (heteroscedasticity). This makes our approach broadly useful in functional multivariate analysis of variance settings. The main idea is creating tests that perform simultaneous inference for both overall group effects (global hypotheses) and specific group comparisons (local multiple hypotheses), which is essential for detailed post-hoc testing. The test statistic is determined by taking the supremum over the pointwise Hotelling's test statistic across the function domain. We show that these resulting global and multiple tests work correctly when the sample size is large (asymptotic validity). To find the critical values, we use a specific resampling technique called the parametric bootstrap, and we confirm its theoretical correctness. Extensive simulation studies confirm that our new methods work very well even with small samples. They accurately control the type I error rate and Family-Wise Error Rate (FWER) across different scenarios, often performing better than existing methods. Simulation studies also show that the supremum-based method often achieves higher power compared to other integration-based methods and other competitors. Finally, we demonstrate the practical use of our tests by analyzing a multivariate functional air pollution data set. The complete set of proposed tests is implemented in the R package gmtFD available on CRAN.

        Speaker: Łukasz Smaga (Adam Mickiewicz University)
      • 14:15
        From group-sequential to flexible adaptive designs for time-to-event endpoints: Assumptions for valid multiplicity control 20m

        Adaptive and, in particular, group-sequential designs are well-established in clinical trials. Time-to-event endpoints pose particular challenges because individual participants can contribute data to multiple stages of the trial. Nevertheless, the log-rank test, the standard analysis method for time-to-event data, can be embedded in flexible adaptive designs (e.g. with sample-size recalculation, SSR) as long as the hypothesis test and the SSR are based on a single time-to-event endpoint. This relies on the (asymptotic) property of independent increments of the score/log-rank process in calendar time under the null hypothesis and standard censoring assumptions.
        Complexity increases substantially when multiple time-to-event endpoints are to be considered simultaneously. This involves multiplicity across endpoints as well as the use of surrogate endpoints different from the primary endpoint to guide adaptations. Such situations arise, for example, for the prominent endpoints progression-free survival (PFS) and overall survival (OS) in oncology. It has been shown that the family-wise error rate (FWER) can be inflated in this context.
        We illustrate why pure group-sequential designs can control the FWER without assumptions about the joint distribution of the endpoints, whereas more flexible adaptive designs with SSR generally require additional assumptions. We support this with mathematical arguments and simulations for the PFS/OS setting. We further show that assuming a Markov multi-state model for PFS and OS is sufficient to permit the desired flexibility of adaptive designs while maintaining FWER control.

        Speaker: Moritz Fabian Danzer (University of Münster)
      • 14:35
        Wild bootstrapping the (asymptotic) joint distribution of rank-based statistics 20m

        Quadratic forms, such as the rank-based Wald-type statistic or the rank-based ANOVA-type statistic, are widely used to compare multivariate distributions without the necessity of parametric assumptions (like multivariate normality). These tests have two major limitations, however:
        i) They are, by construction, omnibus tests and thus not able to locate which specific dimensions (variables) are driving an indicated overall difference between the distributions.
        ii) They fundamentally require the sample size to be much larger than the dimensionality of the data for reliable asymptotic inference.
        The latter is particularly challenging in applications where there are many outcomes but only a few independent observations, such as genomics, rare disease studies, and pre-clinical studies.
        In contrast, maximum statistics avoid aggregating squared differences across dimensions and instead focus on the single largest studentized difference. This structure provides local test results and therefore identifies which dimensions are driving an indicated difference, overcoming limitation i) above. In addition, they can provide simultaneous confidence intervals.
        Nevertheless, maximum tests remain subject to limitation ii) above: standard inference based on the asymptotic distribution requires the sample size to be much larger than the dimensionality of the data.

        In this talk, we propose a multiplier (wild) bootstrap with Rademacher weights to approximate the distribution of a maximum rank-based statistic. Specifically, we consider the classic comparison of distribution functions (Wilcoxon-Mann-Whitney test statistics) as well as the more flexible comparison of relative effects (Brunner-Munzel test statistics). We employ different standard error specifications for the corresponding test statistics and discuss their performance in terms of Type I error control and Type II error control in a comprehensive simulation study.
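
        The resampling idea can be illustrated with a deliberately simplified sketch. Note the assumptions: the code below applies a Rademacher multiplier bootstrap to a maximum of studentized one-sample means, not to the rank-based Wilcoxon-Mann-Whitney or Brunner-Munzel statistics of the talk, and the function name `max_stat_multiplier_bootstrap` is hypothetical. It only shows how random signs applied to centred observations approximate the null distribution of a maximum statistic while keeping the dependence between dimensions intact.

```python
import math
import random


def max_stat_multiplier_bootstrap(X, n_boot=2000, alpha=0.05, seed=1):
    """Max-type test of H0: all d means are zero, via Rademacher multipliers.

    X: list of n observations, each a list of d values (no constant columns).
    Returns (max statistic, bootstrap critical value, p-value, local decisions).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
           for j in range(d)]
    t_obs = [math.sqrt(n) * means[j] / sds[j] for j in range(d)]
    max_obs = max(abs(t) for t in t_obs)
    # Centred and scaled observations: they mimic the null distribution.
    Z = [[(row[j] - means[j]) / sds[j] for j in range(d)] for row in X]
    boot_max = []
    for _ in range(n_boot):
        # One random sign per observation, shared across all dimensions,
        # so that the correlation between dimensions is preserved.
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        stat = max(abs(sum(s * z[j] for s, z in zip(signs, Z))) / math.sqrt(n)
                   for j in range(d))
        boot_max.append(stat)
    boot_max.sort()
    crit = boot_max[min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)]
    p_value = sum(b >= max_obs for b in boot_max) / n_boot
    # Local decisions: which dimensions exceed the common critical value.
    local = [abs(t) > crit for t in t_obs]
    return max_obs, crit, p_value, local


# Toy data (hypothetical): dimension 0 has a clear shift, dimension 1 none.
X = [[5 + ((i % 3) - 1), (i % 5) - 2] for i in range(30)]
max_obs, crit, p_value, local = max_stat_multiplier_bootstrap(X)
print(local)
```

        Because every dimension is compared with the same bootstrap critical value, the global test and the local decisions come from one resampling run.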

        Speaker: Lukas Mödl (Institut für Biometrie und Klinische Epidemiologie, Charité -- Universitätsmedizin Berlin)
      • 14:55
        Beyond Independence: A Unified Approach to the Multiple Nonparametric Behrens-Fisher Problem 20m

        In many trials and experiments, subjects are not only observed once, but multiple times, resulting in a cluster of possibly correlated observations. Mice sharing the same cage or students of the same class are typical examples of clustered data. Typically, under the assumption of normally distributed data, mixed models are used for analysis.
        However, this model assumption is rather strict and hard to justify in most real data analyses. Furthermore, skewed data (e.g. waiting times), discrete data (e.g. count data) or ordered categorical data measured on an ordinal scale are typical endpoints in a variety of trials. This motivates the use of nonparametric methods which do not rely on any specific data distribution. For the two-sample case, several nonparametric procedures exist. For binary clustered data, a chi-square-test for contingency tables can be used. Furthermore, generalizations of the Wilcoxon-Mann-Whitney-test exist for testing the null hypothesis of equal distributions of clustered data. An extension is provided by a procedure under a less strict null hypothesis formulated in terms of the Wilcoxon-Mann-Whitney effect.
        Here, we aim to generalize the procedures for the analysis of several samples. Thus, we propose a general nonparametric framework for comparing multiple groups of clustered data under mild assumptions. We present different inference methods, namely ANOVA-type test statistics and a multiple contrast test procedure and investigate their asymptotic behavior. Extensive simulation studies indicate that the methods control the type-1 error level well, even with small sample sizes. A real data example illustrates the application of the proposed methods.

        Speaker: Erin Sprünken (Charité - Universitätsmedizin Berlin)
    • 13:45 15:15
      Machine learning and data science 1 Room 14

      Convener: Mar Rodriguez-Girondo (Leiden University Medical Center)
      • 13:45
        Identifying Post-COVID Risk Factors with Model-Agnostic Feature Importance 18m

        Background: Post-COVID Condition (PCC) affects a substantial proportion of individuals following SARS-CoV-2 infection, and the mechanisms driving symptom persistence remain an area of active research. Identifying risk factors associated with PCC development is important for targeted prevention strategies and clinical management. Machine learning (ML) models offer powerful tools for prediction in epidemiological settings, but understanding which features drive these predictions requires careful interpretation. As part of the RESOLVE-PCC project, we apply model-agnostic feature importance methods to identify and characterize risk factors for PCC from data of the German National Cohort (NAKO).

        Methods: We employ multiple complementary feature importance methods that capture different aspects of feature-target associations: For unconditional associations we use permutation feature importance (PFI), leave-one-covariate-out (LOCO), and Shapley additive global importance (SAGE) values, whereas conditional feature importance (CFI) and conditional SAGE values are used for conditional associations. We distinguish unconditional association (whether a feature relates to PCC at all) from conditional association (whether a feature provides unique predictive information given other features). To enable this analysis, we developed xplainfi, a new R package implementing these feature importance methods natively integrating with the mlr3 machine learning framework. The package includes multiple approaches for conditional feature importance not previously available in R, supporting both conditional sampling strategies and flexible model refitting approaches.

        Results: By comparing results across methods, we distinguish between features that are merely correlated with other risk factors (high PFI but low CFI) versus those providing independent predictive value. This differentiation has epidemiological implications: features showing only unconditional associations may be proxies for unmeasured factors, while conditionally associated features could represent potentially modifiable risk factors or targets for intervention. Our analysis framework addresses practical challenges in epidemiological ML applications, including mixed-type data and quantifying uncertainty in importance estimates.

        Conclusions: Feature importance methods provide interpretable insights into the complex etiology of PCC by revealing different types of feature-target relationships beyond simple prediction performance. We demonstrate how application of model-agnostic interpretability techniques can support epidemiological inference from ML models, helping to bridge the gap between predictive modeling and mechanistic understanding. The xplainfi package provides researchers with accessible tools for conducting such analyses in R, particularly valuable for epidemiological studies requiring rigorous feature importance assessment. The methodological framework developed in this project contributes to ongoing RESOLVE-PCC research efforts and provides generalizable approaches for risk factor identification in other complex health outcomes.
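
        As an illustration of the unconditional case, permutation feature importance (PFI) can be sketched in a few lines. This is a generic sketch, not the xplainfi or mlr3 API: the function `permutation_importance` and its arguments are hypothetical. Importance is measured as the average increase in loss after one feature column is shuffled, which breaks the association of that feature with the target and with the other features.

```python
import random


def permutation_importance(predict, X, y, loss, n_repeats=10, seed=0):
    """Unconditional permutation feature importance.

    predict: function mapping one feature row (list) to a prediction
    X: list of feature rows; y: list of targets
    loss: function (prediction, target) -> non-negative loss
    Returns one importance value per feature: the mean increase in
    average loss after permuting that feature.
    """
    rng = random.Random(seed)
    n = len(y)
    base = sum(loss(predict(row), t) for row, t in zip(X, y)) / n
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            # Replace feature j by its shuffled version, keep the rest.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            perm = sum(loss(predict(row), t) for row, t in zip(X_perm, y)) / n
            increases.append(perm - base)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy example (hypothetical): the model uses feature 0 only.
X = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [2.0 * i for i in range(10)]
imp = permutation_importance(lambda row: 2.0 * row[0], X, y,
                             lambda pred, t: (pred - t) ** 2)
print(imp)
```

        A conditional variant would replace the plain shuffle by sampling the feature given the remaining features, which is what separates the conditional importance measures from PFI.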

        Speaker: Lukas Burk (Leibniz Institute for Prevention Research and Epidemiology - BIPS)
      • 14:03
        Machine learning algorithm to predict biomarker levels using metabolomics data 18m

        Introduction
        Metabolomics measures small molecules (called metabolites) in cells, tissues, and biofluids that represent intermediates and/or end-products of biochemical/cellular processes. As a result, metabolomics has been shown to be useful for predicting disease risks or associated biomarkers. Given the large data complexity and size, machine learning (ML) represents an appropriate statistical and computational tool for building such predictive models.
        Through a data-driven investigation, our goal is to provide a supervised ML workflow that improves the prediction of disease-related biomarkers through targeted metabolomics data from a general population. In particular, we identify and address some related ML issues that might affect results.
        Method
        We explore two algorithms based on feature selection, paying attention to the presence of high dimensionality of the feature space under different degrees of correlation (low, moderate, high) and considering both sparse and dense modeling frameworks. Using a real data example and a set of simulated scenarios mirroring it, we compare: 1. a three-stage algorithm, consisting of a univariable selection of associated metabolites based on Bonferroni-corrected p-values, with collinearity reduction (Variance Inflation Factor ≤5) and multivariable feature selection applying popular regularized methods (Lasso; Ridge; Elastic net); 2. a one-stage algorithm, consisting solely of multivariable regularized regressions. We evaluate their predictive performances in terms of estimation accuracy, feature selection stability, and generalization error. The real data example includes 172 serum targeted metabolites measured by liquid chromatography (LC)–electrospray ionization–tandem MS and flow injection electrospray ionization–tandem MS profiling (AbsoluteIDQ p180 kit, Biocrates Life Sciences AG), and four iron-related biomarker levels, from a subsample of the Cooperative Health Research In South Tyrol (CHRIS) study with around 5,000 participants.
        Results
        In both the real data example and the simulations, the three-stage ML algorithm leads to more accurate predictions, with lower loss function and root mean square error. For feature selection, the Variance Inflation Factor reduction followed by Lasso regression is relatively stable across scenarios with different degrees of correlation. The modelling framework (sparse or dense) does not affect the performance.
        Conclusions
        This workflow should provide a robust strategy for integrating metabolomics, or any similar large and complex dataset, into an ML algorithm for improving prediction. As a result, this integration supports earlier and more precise disease diagnosis, and enables better tailoring of therapies through more reliable patient monitoring, which are central to precision medicine. Considering the cost and effort required to measure metabolites, it is useful to select a subset without reducing predictive performance.

        75002910269

        Speaker: Fabiola Del Greco M. (Institute of Biomedicine - Eurac research)
      • 14:21
        From claims to care: Machine learning algorithm to classify urinary tract infection cases using Swiss health insurance data 18m

        Objectives: To evaluate whether machine learning (ML) applied to comprehensive claims data without diagnostic codes can distinguish a high proportion of antibiotic treatment episodes as urinary tract infection (UTI) or non-UTI cases. Such approaches may be valuable for antimicrobial stewardship when diagnosis-linked datasets are unavailable.
        Methods: Outpatient antibiotic prescription claims from three major Swiss insurers (2017–2020; ~40% of the Swiss population) were analyzed. Based on clinical input, specific constellations of claims codes (e.g. positive urine culture plus typical antibiotic) were a priori assigned as indicating UTI episodes, providing the reference classification. Predictors included sex, age group, comorbidity, and diagnostic tests ordered during the episode. Four ML classifiers were tested; performance and interpretability were evaluated, with XGBoost prioritized.
        Results: After cleaning and balancing, 38,982 records (19,491 UTI; 19,491 non-UTI) were included. XGBoost achieved an AUC of 0.94, accuracy of 87.6%, sensitivity of 79.2%, and specificity of 96.1%. Misclassification was asymmetric: 11% of non-UTI cases were labeled UTI, while 2% of UTI cases were misclassified as non-UTI. Diagnostics ordered were the strongest predictors, followed by female sex and older age.
        Conclusions: Even in the absence of diagnosis codes, ML applied to claims data can reliably identify UTI-related prescriptions. This supports the feasibility of claims-based surveillance tools for stewardship, while in parallel highlighting the need for scalable, low-burden approaches to improve direct diagnostic coding in routine data.

        21429400324

        Speaker: Soheila Aghlmandi (University of Basel)
      • 14:39
        Machine Learning Model for Multi-Ancestry Fine-Mapping from Summary Statistics 18m

        Genome-wide association studies (GWAS) often identify genomic regions containing hundreds or thousands of genetic variants with comparable statistical evidence. Extensive linkage disequilibrium (LD) and the sparsity of causal variants obscure association signals, hindering the identification of true causal variants underlying complex traits. Fine-mapping approaches are introduced to distinguish causal variants from closely correlated non-causal ones. Although most previous GWAS and fine-mapping studies have focused on individuals of European ancestry, cross-population fine-mapping can improve causal resolution and discovery power by leveraging broader genetic diversity. We introduce a machine learning–based Bayesian framework that integrates GWAS z-scores and LD matrices from multiple populations. By modelling shared causal configurations across ancestries, the model efficiently estimates posterior inclusion probabilities and identifies credible sets for multiple causal variants. We comprehensively evaluated the performance of the proposed model through simulations, comparing it against single-population fine-mapping with post hoc aggregation and the state-of-the-art SuSiEx method under varying numbers of causal variants, cross-population genetic correlations, and noise levels. We further applied the model to summary statistics from the UK Biobank and China Kadoorie Biobank, representing European and Asian ancestries, to identify causal variants associated with different subtypes of breast cancer. Compared with baseline methods, our model generally achieves better power with comparable coverage, assigns higher posterior inclusion probabilities to putative causal variants, and successfully identifies variants missed by other approaches due to infinitesimal effects from non-causal signals. This deep learning–driven Bayesian inference framework enables scalable fine-mapping across diverse biobanks, offering new opportunities for biological discovery.

        21429412105

        Speaker: Shizhe Xu (University of Oxford)
      • 14:57
        Metabolite discovery in tumor tissue with a self-supervised deep learning approach on MALDI-MSI data 18m

        Metabolite discovery can provide insights into disease mechanisms and help to identify potential biomarkers that contribute to the development of new treatments. We present a self-supervised deep learning approach for metabolite discovery. Molecular intensity distributions obtained via MALDI-MSI (matrix-assisted laser desorption/ionization mass spectrometry imaging) are compared with histological tissue coloration patterns in breast cancer samples from mice. While most deep learning studies in pathology focus on image classification, our work addresses the less common task of image similarity, linking MSI-derived molecular maps with visual features in stained tissue sections. Biomarker discovery is achieved by identifying specific ions or m/z-values overrepresented in tumor regions and exploring how these molecular markers vary across disease stages, including primary tumors, recurrences, and lung metastases derived from the same breast cancer cell line.

        Each tumor sample comprises one reference histological image and approximately 900 MALDI-MSI ion-intensity images. HER2 staining serves as the spatial reference for aligning molecular and histological data. However, alignment is complicated by artifacts introduced during tissue preparation such as shearing, tearing or folding. To address these challenges, we adapted the self-supervised method Bootstrap Your Own Latent (BYOL) for comparing MSI-derived molecular distributions with stained tissue sections. This self-supervised setup eliminates the need for manual labeling of hundreds of MSI images per tumor, enabling efficient representation learning directly from raw data. Our methodological contribution lies in designing domain-specific augmentations that improve robustness to structural distortions and typical MALDI noise while preserving biologically relevant spatial information.

        We systematically evaluated three configurations: (i) a pretrained ResNet without additional training; (ii) the standard BYOL model with default augmentations; and (iii) customized BYOL variants incorporating subsets of MSI-specific augmentations. Performance was assessed using receiver operating characteristic (ROC) curves on validation and test sets comprising nine tumors unseen during training, which were independently labelled by experts to define a ground truth. The customized augmentation strategy achieved area-under-curve (AUC) values between 0.8 and 0.95 on test data.

        This capability enables large-scale comparison of unlabeled MSI datasets with histological references, is generalizable to a wide range of staining types, and paves the way for identifying molecules associated with specific pathological features.

        85717608808

        Speaker: Annalena Weissert (TU Dortmund University)
    • 13:45 15:15
      Statistical hypothesis testing 2 Room 13 A

      Room 13 A

      Convener: Frank Bretz (Novartis)
      • 13:45
        Test of independence in a three-level model 18m

        In this talk we present a statistical approach to evaluate the relationship between variables observed in a two-factor experiment. We consider a three-level model with covariance structure ${\bf \Sigma} \otimes {\bf \Psi}_1 \otimes {\bf \Psi}_2$, where ${\bf \Sigma}$ is an arbitrary positive definite covariance matrix, and ${\bf \Psi}_1$ and ${\bf \Psi}_2$ are both correlation matrices with a compound symmetric structure corresponding to two different factors. Rao's score test is used to test the hypotheses that observations grouped by one or two factors are uncorrelated. We analyze a fermentation process to illustrate the results.
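
        For orientation, a compound symmetric correlation matrix can be written as below. The symbols $\rho_i$ (common within-factor correlation) and $n_i$ (number of levels of factor $i$) are notation assumed here for illustration; the abstract does not state the authors' parametrisation or the exact form of the hypotheses.

$$
{\bf \Psi}_i = (1-\rho_i)\,{\bf I}_{n_i} + \rho_i\,{\bf 1}_{n_i}{\bf 1}_{n_i}^{\top}, \qquad -\frac{1}{n_i-1} < \rho_i < 1, \qquad i = 1, 2.
$$

        Under this notation, "uncorrelated within factor $i$" would correspond to $\rho_i = 0$, so that ${\bf \Psi}_i = {\bf I}_{n_i}$; testing one factor would then read $H_0\colon \rho_1 = 0$, and testing both $H_0\colon \rho_1 = \rho_2 = 0$.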

        75002911205

        Speaker: Anna Szczepańska-Alvarez (Poznań University of Life Sciences)
      • 14:03
        Testing Independence in Functional Data Using the Distance of Mean Embedding 18m

        Testing independence between functional observations remains a fundamental challenge in modern statistics, particularly in settings involving high-dimensional or infinite-dimensional random objects. The presented work introduces a new framework for independence testing in functional data based on the distance of mean embedding (DIME), a metric recently proposed as a flexible alternative to classical kernel-based measures such as the Hilbert–Schmidt independence criterion (HSIC).
        The methodology consists of two main steps. First, functional observations (univariate or multivariate) are represented through basis expansion, reducing infinite-dimensional functions to finite-dimensional coefficient vectors. Second, independence is assessed using DIME, which offers greater flexibility than HSIC by allowing freedom in the choice of characteristic kernels and of the embedding measure.

        The procedure further incorporates marginal aggregation to improve performance in pairwise independence testing and extends naturally to mutual independence through symmetric and asymmetric aggregation schemes. Simulation studies demonstrate that the proposed tests maintain nominal type I error rates and often achieve higher power than methods based on distance covariance or HSIC.

        The new independence testing procedures were applied to two real-world examples: air pollution and chemometric sugar spectra data. For the U.S. air pollution data, the methods were employed to verify pairwise and mutual independence among various pollutants (NO2, O3, SO2, and CO). Significant dependence between the air pollutants was detected. In the chemometric sugar spectra data, the independence tests were used to check the independence between a group of three excitation wavelengths, previously found useful for predicting ash content, and the remaining four wavelengths. The tests confirmed strong dependence between these two sets of functional variables. This indicates the correctness of the proposed functional regression model and that not all excitation wavelengths have to be used in the analysis.

        Overall, the DIME-based framework provides a robust and powerful alternative for independence testing in functional data analysis.

        32144107608

        Speaker: Jędrzej Wydra (Adam Mickiewicz University)
      • 14:21
        Alert identification in time-dose-response data 18m

        Evaluating a response variable in relation to exposure time or dose is a pivotal objective in the assessment of a compound's effect, particularly when determining toxicity in pre-clinical research or pharmacokinetics in clinical trials. The determination of an alert, such as the EC50 value, at which a pre-specified threshold of the response variable is crossed, is an important tool for the evaluation process. In practice, response data might be available for combinations of different exposure times and doses and the alert in relation to both is of interest. In this case, it is crucial to use all available information and extrapolate between cases to ensure the optimal utilisation of the data.

        In this talk, we propose a parametric method that allows the determination of alert–doses for a fixed time, even in the absence of measurements for the specific time, and vice versa, or to discern the full time–dose–alert relationship using all available data. This is achieved by fitting a parametric time–dose–response model and constructing either a confidence band for the two-dimensional curve given a fixed time or dose or a confidence plane for the three-dimensional model fit. Both are derived by a two-step bootstrap approach. The result is summarised in terms of a hypothesis test that can be adjusted to accommodate a variety of alert types. Rejecting the null hypothesis means detecting an alert–dose/time, or the time–dose–alert relationship. The initial model fit is achieved by the flexible framework of a Generalised Additive Model for Location, Scale and Shape (GAMLSS), which is then used to generate the bootstrap samples. This offers the possibility to account for a plethora of complex three-dimensional data structures.

        We demonstrate the validity of our approach through a simulation study and present an application to data from a study investigating the relevance of the exposure duration on cytotoxicity in primary human hepatocytes.

        85717606939

        Speaker: Lucia Ameis (Institute of Medical Statistics and Computational Biology (IMSB), Faculty of Medicine, University of Cologne)
      • 14:39
        Assessing the Impact of Distributional Assumption Violations on Outlier-Detection Methods in Clinical Audits 18m

        Introduction
        Monitoring the clinical performance of healthcare units (e.g. hospitals, surgeons) is the main component of national audits, enabling identification of ‘outlier’ units whose clinical performance, e.g. in-hospital mortality, deviates significantly from expected performance. Accurate detection and subsequent management of outliers are critical for improving healthcare quality.
        Two frequently implemented statistical frameworks for outlier detection are Common Mean Model (CMM) and Random-Effects Logistic Regression (RELR). Our study evaluates their performance through simulation under violations of their underlying distributional assumptions and provides recommendations for their appropriate use.

        Methods
        CMM assumes that the probability of death is the same in all units, attributing any differences to random variation. As the observed variability is often larger than expected (overdispersion), CMM is applied with an overdispersion correction. Outliers are detected using test-statistics based on differences between observed and expected unit probabilities. In contrast, RELR uses test-statistics based on estimated random effects on the logit scale. Both methods assume that the test-statistics for in-control units follow a normal distribution, but they are on different scales: probability scale for CMM and logit scale for RELR. Due to the non-linearity of the logit function, both assumptions cannot hold simultaneously unless outcome prevalence is near 0.5.
        We simulated scenarios varying the number of units, unit sizes, outcome prevalences, and between-unit variability. Two data-generating mechanisms (DGMs) were used, based on CMM and RELR, respectively. For each method, we assessed the overall false positive rate (FPR) and the FPR for ‘good’ (low mortality) and ‘bad’ (high mortality) outliers separately. We further evaluated the performance of using QQ plots for selecting a method whose normality assumption was best satisfied in each scenario.

        Results
        Both methods maintained nominal overall FPR. However, FPRs for good and bad outliers deviated from nominal levels when the DGM and outlier-detection method were misaligned. Under low outcome prevalence, applying CMM to RELR-DGM data caused severe over-detection of bad outliers and under-detection of good outliers (and vice versa). These discrepancies increased with smaller unit size and greater between-unit variability. Application to real datasets with low prevalence showed consistent patterns. Quantifying departures from normality in QQ plots was effective in identifying the most appropriate method for a given scenario.

        Conclusion
        Violating the normality assumption in CMM or RELR can have serious implications, potentially leading to unfair scrutiny of healthcare units or failure to detect underperformance. The most appropriate method can be chosen through checks of the distribution of test-statistics.
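
        The point in the Methods that normality on the probability scale and on the logit scale cannot both hold away from a prevalence of 0.5 can be illustrated with a minimal sketch. It maps symmetric limits on the logit scale back to the probability scale; the spread `sd_logit=0.3` and the multiplier `z=1.96` are assumed values chosen for illustration, and the code does not reproduce either the CMM or the RELR test statistic.

```python
from math import exp, log


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return log(p / (1 - p))


def expit(x):
    """Inverse of the logit function."""
    return 1 / (1 + exp(-x))


def limits_on_probability_scale(prevalence, sd_logit, z=1.96):
    """Symmetric limits on the logit scale, mapped back to probabilities."""
    centre = logit(prevalence)
    lower = expit(centre - z * sd_logit)
    upper = expit(centre + z * sd_logit)
    return lower, upper


for prevalence in (0.5, 0.05):
    lower, upper = limits_on_probability_scale(prevalence, sd_logit=0.3)
    # Distance below and above the centre; equal only when prevalence is 0.5.
    print(prevalence, prevalence - lower, upper - prevalence)
```

        At a prevalence of 0.5 the two distances coincide, whereas at 0.05 the upper limit lies further from the centre than the lower one, so a statistic that is symmetric on one scale is skewed on the other.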

        85717610367

        Speaker: Anqi Sui (University College London)
    • 13:45 15:15
      Statistical modelling 1 Room 12

      Room 12

      Convener: Iuliana Ionita-Laza (Columbia University)
      • 13:45
        The effect of attentional control on postural stability in young and older adults 18m

        Maintaining balance is a crucial daily skill, and impairments in postural control increase the risk of falls, particularly among older adults. This study aimed to assess the effect of attentional control on postural stability in young and older adults. The sample consisted of 43 participants (16 older adults, 12 women; 27 young adults, 13 women). Participants performed a series of 60-second standing trials on an AMTI posturographic platform. The protocol included four control trials (quiet standing), two dual-task trials (cognitive task: digit counting), and two biofeedback trials (visual tracking of the Center of Pressure [COP] on a screen), with standardized rest intervals.

        The AMTI platform generates over 30 quantitative variables related to COP trajectory. Given the complex covariance structures of these variables, dimensionality reduction techniques were applied. Specifically, correlation analysis and Principal Component Analysis (PCA) were utilized to extract principal components describing postural dynamics and stability. Subsequently, statistical tests, including nonparametric multivariate analysis of variance (MANOVA), were employed to examine the effects of age, gender, and experimental condition on these reduced components.

        Preliminary results indicate that PCA effectively identifies the primary dimensions of postural variability, specifically those related to movement amplitude, path dynamics, and directional asymmetry. Further analysis is currently underway to fully quantify the interaction effects between age groups and attentional conditions.

        85717616299

        Speaker: Jakub Malik ([1] Poznan University of Physical Education - Faculty of Sport Sciences; [2] Adam Mickiewicz University - Faculty of Mathematics and Computer Science)
      • 14:03
        Lowering the Barrier: An Intuitive Framework for Optimal Design in Immunization Studies 18m

        Our systematic review indicates that optimal design methods are not yet applied in immunization studies in which modeling the antibody kinetics, i.e. the change of antibody concentration over time, is the main objective. We argue that this substantial underutilization is driven by several factors, including limited awareness of the advantages of optimal design and limited availability of convenient software solutions. Additional barriers arise from the challenges of integrating complex mathematical models with optimal design theory, particularly when these models include parameters that are difficult or impossible to measure.

        We introduce an easily accessible tool that lowers the barrier of applying optimal design methods to immunization studies by reducing the need for mathematical knowledge or advanced programming skills, and by being based on easily understandable and interpretable information.

        The framework models the antibody kinetics via the density of the beta distribution, restricting both shape parameters to be greater than one, with the assumption that the antibody levels reach a plateau. The framework is therefore suitable for designing studies in which an initial increase is followed by a decay, a characteristic pattern in many immunization studies. Based on the property of the mode of the beta distribution, we can optimize the sampling schedule using easily understandable information: where one expects to start, when the maximum is expected and how high it is, and when and at which level the plateau is reached. To assess the robustness of this framework against misspecification of the initial information, we defined 12 scenarios and examined single- and double-parameter misspecification. We also outline potential extensions of this framework, including the incorporation of other functional forms, and highlight the trade-off between practical applicability and model complexity. Furthermore, we developed an interactive R-Shiny application to increase the accessibility of this framework.

        When misspecifying one parameter at a time, the median D‐efficiencies exceeded 0.95 and the first quartiles were greater than or equal to 0.9 for all parameters, highlighting the robustness of the framework. Our analysis indicates that the height of the plateau and the time of maximum are the most sensitive to misspecification.

        In conclusion, this framework provides a good starting point for applying optimal design theory in immunization studies, where describing antibody kinetics is the main objective. Its major advantage is that it uses interpretable information in a convenient implementation, thereby making it accessible for healthcare professionals.
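
        The role of the mode in the beta-density kinetics described above can be sketched as follows. This is an illustration only: time is assumed to be rescaled to the interval (0, 1), the plateau is not modelled, and `shapes_from_peak` with its `concentration` argument is a hypothetical reparameterisation, not the one used in the authors' framework or R-Shiny application.

```python
from math import exp, lgamma, log


def beta_density(t, a, b):
    """Beta(a, b) density at t in (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))


def beta_mode(a, b):
    """Location of the maximum; valid when both shape parameters exceed one."""
    return (a - 1) / (a + b - 2)


def shapes_from_peak(peak_time, concentration):
    """Hypothetical helper: shape parameters a, b > 1 with a given mode.

    concentration = a + b - 2 > 0 controls how sharp the peak is.
    """
    a = 1 + peak_time * concentration
    b = 1 + (1 - peak_time) * concentration
    return a, b


# Assumed example: the peak is expected at 20% of the rescaled follow-up time.
a, b = shapes_from_peak(peak_time=0.2, concentration=6.0)
grid = [i / 1000 for i in range(1, 1000)]
t_max = max(grid, key=lambda t: beta_density(t, a, b))
print(a, b, beta_mode(a, b), t_max)
```

        With both shape parameters above one the curve rises to a single interior maximum and then decays, which is why an expected time of maximum translates directly into a constraint on the two shape parameters.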

        42858807637

        Speaker: Stefan Embacher (Medical University of Graz)
      • 14:21
        Deriving Duration Time from Occupancy Data – A case study in the length of stay in Intensive Care Units for COVID-19 patients 18m

        This paper focuses on drawing information on underlying processes, which are not directly observed in the data. In particular, we work with data in which only the total count of units in a system at a given time point is observed, but the underlying process of inflows, length of stay, and outflows is not. The particular data example looked at in this paper is the occupancy of intensive care units (ICU) during the COVID-19 pandemic, where the aggregated numbers of occupied beds in ICUs on the district level (‘Landkreis’) are recorded, but not the number of incoming and outgoing patients. The Skellam distribution allows us to infer the number of incoming and outgoing patients from the occupancy in the ICUs.

        This paper goes a step further and addresses the question of whether we can also estimate the average length of stay of ICU patients. Hence, the task is to derive not only the number of incoming and outgoing units from a total net count but also to gain information on the duration of patients' stays in ICUs. We make use of a stochastic Expectation-Maximisation algorithm and additionally include exogenous information that is assumed to explain the intensity of inflow.
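
        The Skellam step mentioned above rests on a textbook property: if inflow and outflow are independent Poisson counts, the net change in occupancy follows a Skellam distribution whose mean is the difference of the two rates and whose variance is their sum. The sketch below checks this numerically and recovers the rates from the two moments. The rates 4.0 and 3.0 are assumed values, and the moment-based recovery is a simplification for illustration, not the authors' estimation procedure.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def skellam_pmf(d, lam_in, lam_out, max_count=200):
    """P(inflow - outflow = d) for independent Poisson inflow and outflow."""
    total = 0.0
    # Sum over the outflow count; the inflow count is then out + d >= 0.
    for out in range(max(0, -d), max_count):
        total += poisson_pmf(out + d, lam_in) * poisson_pmf(out, lam_out)
    return total


def rates_from_moments(mean_change, var_change):
    """Solve mean = lam_in - lam_out and variance = lam_in + lam_out."""
    return (var_change + mean_change) / 2, (var_change - mean_change) / 2


lam_in, lam_out = 4.0, 3.0  # hypothetical daily admission and discharge rates
support = range(-60, 61)
probs = [skellam_pmf(d, lam_in, lam_out) for d in support]
mean_change = sum(d * q for d, q in zip(support, probs))
var_change = sum((d - mean_change) ** 2 * q for d, q in zip(support, probs))
print(rates_from_moments(mean_change, var_change))
```

        The net change alone identifies the two rates only through its mean and variance, which is why recovering the length of stay requires the additional modelling described in the abstract.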

        64288216037

        Speaker: Göran Kauermann (Ludwig-Maximilians-Universität München)
      • 14:39
        Novel Sample Size Calculation Approaches for Risk Prediction Model Development with Clustered Binary Data 18m

        Background: Risk prediction models are increasingly being used in clinical practice to predict health outcomes. These models are often developed using data from multiple centres (clustered data) where patient outcomes within a centre are likely to be correlated. It is important that the dataset used to develop a risk model is of an appropriate size, to avoid model overfitting problems and poor predictions in new data. Wynants et al. recommended using at least 10 events per variable (including the random parameter) to minimise bias in the regression coefficients and obtain acceptable C-statistic values when applying a random-effects model to clustered data. This approach focused only on ‘median predictions’ where the random effect is ignored. More recently, Riley et al. (2020) and Pavlou et al. (2024) have proposed sample size methods for independent data that directly target the predictive performance of models; however, these methods may not be appropriate for clustered data.

        Methods: We conducted full-factorial simulations to assess whether the Wynants method provides sufficient sample sizes for developing prediction models with good predictive performance. We also evaluated the applicability of the sample size methods proposed by Riley and Pavlou for clustered data. Simulation scenarios varied by degree of clustering, number of clusters and predictors, model strength, and outcome prevalence. Model performance was assessed using mean absolute prediction error (MAPE), calibration slope (CS), and the c-statistic. Cluster-specific performance measures were applied, and acceptable target values were prespecified. In addition, we propose two new sample size calculation methods for clustered data: a meta-model based method and another that adapts the Riley and Pavlou approaches through the application of shrinkage. Both approaches directly target model performance measures based on cluster-specific predictions.

        Results: None of the existing methods achieved the target MAPE values. The Wynants and Riley methods failed to attain a CS of at least 0.9 when outcome prevalence was ≥15%. All methods generally yielded c-statistics within 0.02 of their true values. The new methods consistently achieved the target MAPE values and produced CS ≥0.9, with c-statistics within 0.02 of their true values when prevalence was ≤25%.

        Conclusions: Current sample size calculation methods for developing binary risk models often failed to ensure adequate predictive performance of models and may therefore be unsuitable for clustered data. We propose new sample size calculation approaches that consistently achieve strong predictive performance across a wide range of clustered data scenarios.

        75002911349

        Speaker: Rumana Omar (University College London)
    • 13:45 15:15
      TC2: Interpretable Bayesian modelling of high-dimensional / complex problems in molecular biomedicine Room 1 B

      Room 1 B

      Convener: Manuela Zucknick (University of Oslo)
      • 13:45
        Probabilistic Variable Importance: A Bayesian Perspective on Interpretable Machine Learning 18m

        Understanding the relative importance of genetic, molecular and environmental factors is crucial for interpretable prediction models in biomedicine and for targeted prevention. While classical regression-based approaches provide direct interpretability through model coefficients, flexible machine learning (ML) approaches such as random forests and neural networks typically rely on post-hoc importance measures. Many state-of-the-art interpretability tools, including Shapley values, Banzhaf values and Beta Shapley values, can be viewed as probabilistic measures of variable importance motivated by coalition game theory, yet their conceptual links to Bayesian modelling have remained underexplored.

        In this work, we develop a unifying Bayesian variable selection perspective on probabilistic variable importance measures in ML. We show that classical importance measures arise naturally under different priors on the model (coalition) space that encode different preferences over model complexity: a uniform prior on the model space leads to Banzhaf values; a uniform prior on the model size corresponds to Shapley values; and hierarchical Beta-Binomial priors give rise to Beta Shapley values. The Bayesian viewpoint clarifies the assumptions, trade-offs and interpretability properties underlying each measure. Furthermore, we introduce novel probabilistic importance measures including empirical Bayes formulations. Finally, we illustrate how the Bayesian variable selection perspective on interpretable ML can facilitate computations via Markov Chain Monte Carlo (MCMC) approaches for approximating probabilistic importance measures in high-dimensional biomedical data applications.
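
        The correspondence between priors on the coalition space and the two classical measures can be illustrated with a small sketch. Each importance value is a weighted average of marginal contributions: a uniform prior over all coalitions of the other variables gives the Banzhaf value, while a uniform prior over coalition size (and uniform within each size) gives the Shapley value. The value function `toy_value` is a hypothetical stand-in for a model-fit criterion, and the exhaustive enumeration is only feasible for a handful of variables; it is not the MCMC approach mentioned in the abstract.

```python
from itertools import combinations
from math import comb


def importance(value, p, i, weight):
    """Prior-weighted average marginal contribution of variable i.

    weight(s) is the prior probability of one specific coalition of size s
    formed from the other p - 1 variables.
    """
    others = [j for j in range(p) if j != i]
    total = 0.0
    for s in range(p):
        for subset in combinations(others, s):
            coalition = frozenset(subset)
            total += weight(s) * (value(coalition | {i}) - value(coalition))
    return total


def banzhaf_weight(p):
    """Uniform prior over all 2^(p-1) coalitions of the other variables."""
    return lambda s: 1.0 / 2 ** (p - 1)


def shapley_weight(p):
    """Uniform prior over coalition size (1/p), then uniform within a size."""
    return lambda s: 1.0 / (p * comb(p - 1, s))


def toy_value(coalition):
    """Hypothetical fit criterion: variable 0 helps alone, all three interact."""
    return 1.0 * ({0, 1, 2} <= coalition) + 0.5 * (0 in coalition)


p = 3
shapley = [importance(toy_value, p, i, shapley_weight(p)) for i in range(p)]
banzhaf = [importance(toy_value, p, i, banzhaf_weight(p)) for i in range(p)]
print(shapley, banzhaf)
```

        In this toy example the two priors rank the variables identically but weight the three-way interaction differently, because the size-uniform prior places more mass on the empty and the full coalition than the coalition-uniform prior does.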

        85717603055

        Speaker: Christian Staerk (IUF – Leibniz Research Institute for Environmental Medicine & TU Dortmund University)
      • 14:03
        An interpretable varying coefficients approach to non-linear regression 18m

        Non-linear regression models are flexible approaches used to model complex associations. In many recent proposals, additional flexibility comes at the cost of loss of interpretability of the model's parameters and, consequently, of the data analysis results. We introduce a flexible model whose parameters are easily interpretable. In particular, the model incorporates non-linear effects through a semi-parametric spline-based representation that separates linear and non-linear effects via an orthogonal basis decomposition. We introduce a covariate-dependent regression coefficient to enhance flexibility and show the proposed approach's equivalence with a non-linear interaction model. In the proposed approach, the order of the covariates might appear to be relevant; however, we demonstrate that the model is invariant to this ordering. The proposed model performs competitively in simulation studies compared with state-of-the-art approaches. Finally, we illustrate the practical utility of the proposed approach through two applications that show varying degrees of non-linear associations. This is joint work with Davide Fabbrico and Matteo Pedone.

        64288201884

        Speaker: Francesco Stingo (University of Florence)
      • 14:21
        A modelling framework for detecting and leveraging node-level information in Bayesian network inference 18m

        Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modelling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximisation algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases.

        96432301055

        Speaker: Hélène Ruffieux (University of Cambridge)
      • 14:39
        Generalized promotion time cure model: A new modeling framework to identify cell-type-specific genes and improve survival prognosis 18m

        Single-cell technologies provide an unprecedented opportunity for dissecting the interplay between the cancer cells and the associated tumor microenvironment, and the produced high-dimensional omics data should also augment existing survival modeling approaches for identifying tumor cell type-specific genes predictive of cancer patient survival. However, there is no statistical model to integrate multiscale data including individual-level survival data, multicellular-level cell composition data and cellular-level single-cell omics covariates. We propose a class of Bayesian generalized promotion time cure models (GPTCMs) for the multiscale data integration to identify cell-type-specific genes and improve cancer prognosis. We demonstrate with simulations in both low- and high-dimensional settings that the proposed Bayesian GPTCMs are able to identify cell-type-associated covariates and improve survival prediction. A case study will be selected from the nodal B-cell non-Hodgkin lymphoma (B-NHL) patient data whose cancer cells are differentiated from various subtypes of B cells.

        32144101866

        Speaker: Zhi Zhao (University of Oslo)
      • 14:57
        Discussant 18m
        Speaker: Manuela Zucknick
    • 15:15 15:45
      Coffee break 30m
    • 15:45 17:15
      Bayesian methods 1 Room 12

      Room 12

      Convener: Gerhard Nehmiz (Boehringer Ingelheim Pharma GmbH & Co. KG)
      • 15:45
        A Bayesian approach to decision making in early development clinical trials: An Open-source solution 18m

        Early clinical trials play a critical role in drug development. Their main purpose is to determine whether a novel treatment demonstrates sufficient safety and efficacy signals to warrant further investment (Lee & Liu, 2008). The new open-source R package phase1b is a flexible toolkit that calculates many properties to this end, especially in the oncology therapeutic area. The primary focus of the package is on binary endpoints. The benefit of a Bayesian approach is the possibility to account for prior data (Thall & Simon, 1994), in that a new drug may have shown some signals of efficacy owing to its proposed mode of action, or similar activity based on prior data. The concept of the phase1b package is to evaluate the posterior probability that the response rate with a novel drug is better than with the current standard-of-care treatment in early phase trials such as Phase I. The package provides a facility for early development study teams to decide on further development of a drug, either by designing for Phase 2 or 3 or by expanding current cohorts. The prior distribution can incorporate any previous data via mixtures of beta distributions. Furthermore, based on an assumed true response rate if the novel drug were administered in the wider population, the package calculates the frequentist probability that a current clinical trial would be stopped for efficacy or futility conditional on true values of the response, otherwise known as operating characteristics. The package is intended for early clinical trial statisticians in the design and interim stages of their study and offers a flexible approach to setting priors and weighting. Impactful graphical communication is demonstrated to illustrate decision-making paradigms to statisticians and non-statisticians, making phase1b a package fit for purpose to facilitate key discussions in early drug development for binary endpoints.
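
        The posterior probability described above can be sketched in a few lines. The following is an illustrative Python sketch, not the phase1b API: the function names, the example prior and the example data are all assumptions made here. It updates a mixture-of-betas prior with binomial data and evaluates the probability that the response rate exceeds a fixed standard-of-care rate.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior_mixture(prior, x, n):
    """Update a mixture-of-betas prior with x responders out of n patients.

    prior is a list of (weight, a, b) tuples; the same structure is returned.
    """
    updated = []
    for w, a, b in prior:
        # Marginal likelihood of the data under this component; the binomial
        # coefficient is omitted because it cancels when weights are normalised.
        log_ml = log_beta(a + x, b + n - x) - log_beta(a, b)
        updated.append((w * math.exp(log_ml), a + x, b + n - x))
    total = sum(w for w, _, _ in updated)
    return [(w / total, a, b) for w, a, b in updated]

def prob_greater(mixture, p0, steps=20000):
    """P(response rate > p0) under a beta mixture, by midpoint integration."""
    h = (1.0 - p0) / steps
    prob = 0.0
    for w, a, b in mixture:
        lb = log_beta(a, b)
        area = 0.0
        for i in range(steps):
            p = p0 + (i + 0.5) * h
            area += math.exp((a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - lb)
        prob += w * area * h
    return prob

# Hypothetical example: a vague component plus an informative component,
# updated with 8 responders out of 20 patients, compared against a 25% rate.
prior = [(0.6, 1, 1), (0.4, 4, 6)]
posterior = posterior_mixture(prior, x=8, n=20)
print(prob_greater(posterior, p0=0.25))
```

        Numerical integration is used only to keep the sketch free of dependencies; it assumes beta parameters of at least 1, and a library routine for the regularised incomplete beta function would be the more usual choice.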


        Speaker: Audrey Yeo (Finc Research)
      • 16:03
        Bayesian Methods in Registered Clinical Trials: A Systematic Review of Studies on ClinicalTrials.gov Through 2024 18m

        Giles Partington & Christina Geyer: Phastar
        Bayesian methods are increasingly being incorporated into clinical trial designs to improve flexibility, efficiency, and interpretability. Earlier reviews of published studies (Lee & Chu, 2012) illustrated how these approaches were applied across selected examples. Building on that foundation, we conducted the first registry-based review to describe how Bayesian methods are currently represented in interventional drug trials.
        All interventional trials registered on ClinicalTrials.gov through December 2024 were screened for mention of a Bayesian component. After removing non-drug and duplicate records, more than four hundred interventional trials with a Bayesian element were identified. Each was categorized by therapeutic area, trial size, trial phase, design type, and the role of the Bayesian methods within the analysis framework, allowing changes in trial characteristics over time to be investigated.
        The review shows a steady increase in the adoption of Bayesian methods over time, with early-phase oncology dose-finding studies remaining the most frequent application.
        Although uptake is still modest relative to all registered interventional studies, the trajectory suggests sustained expansion and diversification of Bayesian methods. Capturing these methodological elements more explicitly in registry data would enhance reproducibility, transparency, and acceptance.
        This work provides the first quantitative snapshot of Bayesian clinical trial activity derived directly from registry data. It demonstrates consistent growth and underscores the opportunity to strengthen how Bayesian design elements are documented and communicated across clinical research.


        Speaker: Giles Partington (Phastar)
      • 16:21
        Robust calibrated priors for the BLRM that reconcile implicit design beliefs 18m

        The Bayesian Logistic Regression Model (BLRM) with Escalation With Overdose Control (EWOC) is widely used in Phase I oncology trials. Recently, several publications have highlighted a recurring issue: escalation can be blocked even when observed data strongly suggest safety, i.e., the posterior overdose probability at the next dose remains above the EWOC threshold despite no dose-limiting toxicities (DLTs) at the current dose or below. In addition, early extremes, such as one DLT event in a one-patient cohort, may have a disproportionate impact on trial operating characteristics.
        These issues highlight the importance of an adequate choice of prior for dose escalation with the BLRM method. The starting dose and the planned escalation grid already encode a strong, asymmetric belief that implies a prior on the dose-DLT response curve: the starting dose is chosen such that it is likely to be safe, whereas uncertainty grows towards higher dose levels.
        We propose multivariate priors on the log-intercept and log-slope that make the implicit design prior explicit. The prior’s location / reference dose, variance and correlation are calibrated by prior predictive checks against a library of “reasonable” escalation paths (e.g., repeated 0 out of 3 DLT events should allow for further escalation, an early single DLT should not stop the trial for toxicity) that reflect the team’s expectations given the chosen starting dose and grid. Calibration targets include: escalation coherence when current data are safe, avoiding overreaction to single early DLT events, and early decisions qualitatively aligned with 3+3 in the first cohort.
        The design uses standard EWOC (no relaxation of the feasibility bound) and overdose interval probabilities. While the approach is deliberately non-mixture for simplicity and interpretability, it admits a straightforward extension to mixtures if additional flexibility is desired.
        We simulated canonical dose-toxicity scenarios across shallow/steep curves, low/high MTDs, and early extreme outcomes. With the calibrated prior and unchanged EWOC, we observed: elimination of safe data lock, robustness to early DLTs without stalling and improved compatibility with 3+3 early on, while maintaining control of overdose allocation and overdose MTD declaration.
        By aligning BLRM priors with the implicit beliefs induced by the starting dose and grid, we can prevent unnecessary stalls and preserve patient safety under standard EWOC while retaining simplicity of implementation (e.g., via the OncoBayes2 R package). Our priors offer a pragmatic upgrade for users seeking reliability in the opening moves in the absence of historical data while maintaining rigor later.
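
        To make the role of the prior concrete, the sketch below evaluates the overdose probability implied by a bivariate normal prior on the log-intercept and log-slope before any data are observed, and applies the EWOC rule to it. It is an illustration only, not the authors' calibration procedure and not the OncoBayes2 API. The dose grid, reference dose, prior parameters, the 0.33 overdose cutoff and the 0.25 feasibility bound are assumptions chosen here, the last two being commonly used conventions.

```python
import math
import random

def sample_prior(n_draws, mean, sd, rho, seed=1):
    """Draw (log_alpha, log_beta) pairs from a bivariate normal prior."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        log_alpha = mean[0] + sd[0] * z1
        # Cholesky factor of the 2x2 correlation matrix induces correlation rho.
        log_beta = mean[1] + sd[1] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        draws.append((log_alpha, log_beta))
    return draws

def dlt_prob(dose, ref_dose, log_alpha, log_beta):
    """DLT probability under logit(p) = log_alpha + exp(log_beta) * log(dose / ref_dose)."""
    eta = log_alpha + math.exp(log_beta) * math.log(dose / ref_dose)
    # Logistic function written via tanh to avoid overflow for extreme eta.
    return 0.5 * (1.0 + math.tanh(eta / 2.0))

def overdose_prob(draws, dose, ref_dose, cutoff=0.33):
    """Share of prior draws in which the DLT probability exceeds the cutoff."""
    hits = sum(dlt_prob(dose, ref_dose, la, lb) > cutoff for la, lb in draws)
    return hits / len(draws)

doses = [10, 20, 40, 80, 160]   # hypothetical dose grid
ref_dose = 80                   # hypothetical reference dose
draws = sample_prior(4000, mean=(-1.0, 0.0), sd=(2.0, 1.0), rho=0.0)
probs = [overdose_prob(draws, d, ref_dose) for d in doses]
# EWOC: a dose is admissible only if its overdose probability is below the bound.
eligible = [d for d, p in zip(doses, probs) if p < 0.25]
print(list(zip(doses, probs)), eligible)
```

        Replacing the prior draws with posterior draws gives the usual EWOC check during the trial, and varying the prior mean, standard deviations and correlation shows how strongly the opening escalation decisions depend on them.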


        Speaker: Lukas Widmer (Novartis Pharma AG)
      • 16:39
        Assessing Air Quality in Nigerian States Using a Bayesian Hierarchical Environmetrics Model 18m

        Effective air quality regulation and climate change mitigation depend on reducing greenhouse gas emissions and air pollutants. To achieve sustainable green development, this study constructs a spatio-temporal hierarchical model to assess the Air Quality Index (AQI) across Nigerian states and to derive actionable insights for environmental sustainability. Specifically, this study develops a three-level Bayesian hierarchical model based on latent Gaussian likelihoods to capture both random effects and systematic differences among states in Nigeria. The model structure allows for state-level variation, covariate effects, additional unexplained variation and variance decomposition. Air Quality Index data from five geopolitical zones (covering 12 states) were used to validate the developed model. The study revealed the priority level and the actions required for each state, as well as the important contributors to air pollution in Nigeria. All industries must endeavour to reduce emissions and prepare for changes in air quality associated with rising temperature. Stakeholders should mitigate the effects of industry pollution on land biodiversity and preserve natural habitats against the degradation caused by air pollution. The model accuracy is very high, indicating high correlation between the observed and the predicted values.
        The developed Bayesian model was properly validated; its strong fit supports the analytical approach and affirms the ability of the identified predictors to explain variations in air quality across Nigerian states. States in the South South region require an emergency response to achieve sustainability. This study supports applying the model to future policy interventions and projections.


        Speaker: Oladapo Oladoja (Abiola Ajimobi Technical University, Ibadan, Nigeria)
    • 15:45 17:15
      Clinical trials 1 Room 13 B

      Room 13 B

      Convener: Franz König (Medical University of Vienna)
      • 15:45
        A novel method for inserting dose levels mid-trial in early phase combination studies 18m

        The use of combination treatments in early phase oncology trials is growing. The objective of these trials is to search for the maximum tolerated dose combination from a pre-defined set. However, cases in which the initial set of combinations does not contain one close to the target toxicity level pose a significant challenge. There is uncertainty around how to handle these situations effectively in practice and the literature does not fully evaluate potential solutions.

        To address this, we propose a novel method for inserting dose levels mid-trial. The idea is based on evaluating contours that partition the set of combinations into ones above and below the target toxicity. Dose insertions are made only if a single contour is highly probable, indicating an absence of combinations to explore around the target toxicity.

        We examine our proposed approach applied to two established designs, although any model-based or model-assisted design is an appropriate candidate. Results from our comprehensive simulation study demonstrate that the insertion method can increase the probability of selecting combinations close to the target toxicity, whilst controlling for selecting overly toxic combinations. These methods can be extended to more complex settings, such as trials with joint toxicity and efficacy endpoints.


        Speaker: Matthew George (Phastar)
      • 16:03
        Adaptive Designs in Fast-Track Registration Processes for Digital Health Applications 18m

        Fast-track procedures play an important role in the registration of health products, such as registration processes for digital health applications. These procedures offer the potential for patients to access innovative products earlier. The procedures involve two registration steps. Applicants can first apply for conditional registration. A successful conditional registration provides a limited funding or approval period and time to prepare the application for permanent registration. Products typically only have to fulfil weaker requirements for conditional registration than for permanent registration. The motivating example of the talk is the German two-stage fast-track registration process for digital health applications (DiGA) for reimbursement by statutory health insurances. This procedure has served as a pioneer for other countries such as France, Belgium, Austria, Korea, and the UK, where similar procedures exist or are planned (see also Chapman, 2025).
        The talk addresses valid and efficient study designs for fast-track procedures. The current standard is to conduct two separate studies. Instead, we suggest using two-stage adaptive designs that permit the use of data from both stages in the application for permanent registration. They also allow learning from the first-stage data (which we suggest using for the application for conditional registration) by performing design modifications after the first stage while controlling the Type I error rate. We consider designs where the second-stage sample size is recalculated such that a specific conditional power is reached to ensure that the second stage will likely be successful. We also assume that a sufficient overall success probability for achieving permanent registration is targeted. Having sufficient overall and conditional power requires a minimum and maximum second-stage sample size, respectively. By investigating these parameters and the expected sample sizes, we will demonstrate that in most cases, adaptive designs bring a clear advantage over the current standard of two separate studies. The registration requirements and their consequences will also be discussed. The results presented are based on numerical calculations supported by mathematical arguments. The talk will be based on our arXiv paper (Kluge & Brannath, 2025).
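
        The sample size recalculation described above can be illustrated with a short sketch. It assumes a one-sided two-arm comparison of means with known unit variance and an inverse normal combination test with pre-specified weights; the abstract does not state which combination test is used, so this choice, the function names and all numerical values are assumptions made here.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def conditional_power(z1, n2, delta, alpha=0.025, w1=math.sqrt(0.5)):
    """Conditional power of an inverse normal combination test.

    z1: first-stage z-statistic; n2: second-stage sample size per arm;
    delta: assumed standardised mean difference; w1: pre-specified stage-1 weight.
    """
    w2 = math.sqrt(1 - w1**2)
    # Second-stage z-value needed so that w1*z1 + w2*z2 >= z_(1-alpha).
    needed = (_N.inv_cdf(1 - alpha) - w1 * z1) / w2
    return 1 - _N.cdf(needed - delta * math.sqrt(n2 / 2))

def second_stage_n(z1, delta, target_cp=0.8, n_min=10, n_max=500, **kw):
    """Smallest per-arm n2 in [n_min, n_max] that reaches the target conditional power."""
    for n2 in range(n_min, n_max + 1):
        if conditional_power(z1, n2, delta, **kw) >= target_cp:
            return n2
    return n_max  # cap reached: the target is not attainable within the bounds

# Hypothetical interim result: z1 = 1.0 with an assumed effect size of 0.3.
n2 = second_stage_n(z1=1.0, delta=0.3)
print(n2, conditional_power(1.0, n2, 0.3))
```

        The bounds n_min and n_max play the role of the minimum and maximum second-stage sample sizes mentioned in the abstract; how they should be chosen to secure overall power is the subject of the talk and is not addressed by this sketch.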

        References:
        Suzannah Chapman. Towards identifying good practices in the assessment of digital medical devices. OECD Health Working Papers, 2025.

        Liane Kluge and Werner Brannath. Adaptive Designs in Fast-Track Registration Processes for Digital Health Applications. arXiv preprint arXiv:2507.04092v3, 2025.


        Speaker: Liane Kluge (University of Bremen)
      • 16:21
        adagraph: An R Package for Graph-Based Multiple Testing in Adaptive Trial Designs 18m

        Graph-based multiple testing procedures provide an intuitive way to define closed testing strategies that control the family-wise error rate (FWER) in fixed sample settings [1]. They have been extended to adaptive trial designs based on the (partial) conditional error rate (CER) method [2]. These procedures control the FWER in two-stage designs where the trial is adapted after an interim analysis based on unblinded data. When (part of) the correlation structure between test statistics is known, it can be directly incorporated into the testing procedure to improve efficiency [2, 3].
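
        As background, the fixed-sample graphical procedure of [1] can be written compactly. The sketch below is an illustrative Python implementation of that sequentially rejective algorithm; it is not the adagraph or gMCP interface, and the function name and example values are assumptions made here.

```python
def graph_test(p_values, weights, transitions, alpha=0.025):
    """Sequentially rejective graph-based multiple test (fixed-sample case).

    weights: initial share of alpha per hypothesis (summing to at most 1).
    transitions: matrix g[i][j], the share of H_i's weight passed to H_j
    once H_i is rejected. Returns the set of rejected hypothesis indices.
    """
    w = list(weights)
    g = [list(row) for row in transitions]
    active = set(range(len(p_values)))
    rejected = set()
    while True:
        # Pick any active hypothesis whose p-value meets its local level.
        hit = next((i for i in sorted(active) if p_values[i] <= w[i] * alpha), None)
        if hit is None:
            return rejected
        active.remove(hit)
        rejected.add(hit)
        # Build the updated graph from the old one before overwriting anything.
        new_g = [row[:] for row in g]
        for l in active:
            w[l] += w[hit] * g[hit][l]
            for k in active:
                denom = 1 - g[l][hit] * g[hit][l]
                if l != k and denom > 0:
                    new_g[l][k] = (g[l][k] + g[l][hit] * g[hit][k]) / denom
                else:
                    new_g[l][k] = 0.0
        w[hit] = 0.0
        g = new_g

# Hypothetical example: the Holm procedure for two hypotheses expressed as a graph.
holm = graph_test([0.01, 0.04], [0.5, 0.5], [[0, 1], [1, 0]], alpha=0.05)
print(holm)
```

        In the example, rejecting the first hypothesis passes its full weight to the second, which is then tested at the full level; this is how the graph encodes the Holm procedure.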

        Building on these methods, we introduce adagraph, an R package implementing graph-based (partial) CER tests for adaptive two-stage trial designs, extending current R packages for graph-based multiple testing, such as [4]. The package constructs adaptive closed testing procedures from any user-specified graph-based fixed sample multiple test. It covers tests for trials with multiple arms, confirmatory subgroup analyses, multiple endpoints and combinations thereof, as well as different endpoint types. The implemented approach accounts for stochastically dependent test statistics under the assumption that they are (approximately) multivariate normally distributed with a known correlation structure, and uses tests based on the Bonferroni inequality otherwise. The package allows for a range of adaptations based on unblinded interim data, such as sample size reassessment, the selection of arms, endpoints or subgroups, and changes to the testing strategy.

        adagraph supports arbitrary correlation structures between test statistics and provides functions to compute those for several common trial designs. This includes trials comparing multiple arms to a shared control group and trials testing pre-specified subgroups alongside the full population. To explore the operating characteristics of the defined adaptive trial designs, adagraph can simulate these trials, providing methods for data generation and trial adaptations. We demonstrate adagraph with several case studies illustrating its practical use for planning adaptive trials with multiple hypotheses.

        References

        [1] F. Bretz et al. “A graphical approach to sequentially rejective multiple test procedures”. In: Statistics in Medicine 28.4 (2009), pp. 586–604.
        [2] F. Klinglmüller, M. Posch, and F. König. “Adaptive graph-based multiple testing procedures”. In: Pharmaceutical Statistics 13.6 (2014), pp. 345–356.
        [3] C. Mehta, A. Mukhopadhyay, and M. Posch. Graph Based, Adaptive, Multi Arm, Multiple Endpoint, Two Stage Design. arXiv:2501.03197 (2025).
        [4] K. Rohmeyer and F. Klinglmüller. gMCP: Graph Based Multiple Test Procedures. R package version 0.8-17. 2024.


        Speaker: Benjamin Fallmann (Medical University of Vienna, Center for Medical Data Science, Institute of Medical Statistics)
      • 16:39
        An adaptive design for interim dose selection and addition for the treatment of worm infections 18m

        We propose a frequentist, adaptive trial design to investigate the safety and efficacy of three dose levels compared to placebo for the treatment of worm infections. As the safety of the highest dose is not yet established, the study starts with the two lower doses and the control arm. Based on safety and efficacy endpoints observed in an interim analysis, it is decided to either continue with the two lower doses or to drop one or both of these doses and to start an arm with the highest dose instead.
        The proposed adaptive design addresses several challenges: First, the adaptation must rely on an early surrogate endpoint, as the primary endpoint is assessed 12 months after recruitment and is therefore unavailable at the interim analysis. Second, the primary outcome variable follows a mixture distribution with a lognormal component and a point mass at zero. To control the familywise error rate in the adaptive design, we extend the partial conditional error approach to accommodate the addition of new hypotheses after the interim analysis.
        In a comprehensive simulation study a range of design options and analysis strategies are compared and the robustness of the design with respect to design assumptions and parameter values is investigated. The relative effect and confidence intervals across both stages for each dose are estimated using the inverse normal method. The simulation results demonstrate under which conditions the adaptive design enhances the trial's efficiency to identify the optimal dose. Adaptive dose selection allows for resource allocation to promising treatment arms and thereby can increase the chance to select the optimal dose while reducing the required overall sample size and trial duration.


        Speaker: Sonja Zehetmayer (Medical University of Vienna)
      • 16:57
        Confidence intervals for two-stage adaptive designs with subpopulation selection 18m

        Background: We consider clinical trials in which the experimental treatment may have heterogeneous effects across pre-specified patient subpopulations. In such settings, two-stage adaptive enrichment designs allow the enrolled population to be modified at an interim analysis. In stage 1, patients are enrolled from the full population, and based on interim data and preplanned selection rules, stage 2 enrolment may be restricted to subpopulations most likely to benefit. Because these interim decisions are data-dependent, valid statistical inference must account for the adaptation. While hypothesis testing and point estimation methods for adaptive enrichment designs have been well established, corresponding confidence interval methods remain limited. We focus on constructing confidence intervals for the treatment effect in the selected population.

        Method: Confidence intervals that ignore population adaptation may fail to achieve nominal coverage. We propose a new approach that constructs confidence intervals with exact 100(1−α)% coverage conditional on the interim decision, ensuring that unconditional coverage is also exact. Our method applies to a broad class of adaptive enrichment designs. Given the interim selection, we derive the conditional distribution of the naive treatment effect estimator and invert uniformly most powerful unbiased tests to obtain the uniformly most accurate unbiased confidence interval. An efficient computational procedure is provided.

        Results: We conduct extensive simulations which confirm that the proposed intervals achieve the desired conditional coverage with moderate width inflation compared to the standard confidence interval.

        Contribution: This approach offers a rigorous framework for post-selection inference in adaptive clinical trials and contributes to the development of statistically principled adaptive design methodology.


        Speaker: Enyu Li (Clinical Trials Unit, University of Warwick)
    • 15:45 17:15
      IS3: Survival Analysis for Excess Mortality: Insights and Innovations Room 1 A

      Room 1 A

      Convener: Dennis Dobler (RWTH Aachen University)
      • 15:45
        Model extensions for relative survival to assess excess mortality during the COVID-19 pandemic 30m

        The COVID-19 pandemic has led to excess mortality worldwide. Notably, the reported numbers of excess deaths are different from the numbers of deaths from COVID-19. Evaluating pandemic-related mortality should therefore not only be based on cause-of-death data but also on external life tables to enable calculation of population-based measures of the difference between observed and expected mortality. We investigated the impact of infection with and vaccination against COVID-19 on excess mortality during 2020-2021 in persons aged 65 years or older in the Netherlands.
        We did this by incorporating relative survival modelling into a multi-state model considering vaccination, the acute and post-acute phase of (re)infection, and death based on the recent methodology by Manevski et al. (2022). The key assumption of relative survival is that the observed hazard of death is the sum of the (unobserved) background and excess hazards where the former is derived from life tables or other historical data. The multi-state model is a time-inhomogeneous Markov model, stratified by sex and age. In this model, transition probabilities can be estimated by the Aalen-Johansen estimator. Different from the standard relative survival setting, the excess ‘hazard’ was a negative quantity for some groups during some periods, making regression modelling by means of a proportional hazards model infeasible. Instead, we explored additive hazards models. Modelling a pandemic made the choice of calendar time as timescale the most obvious, the consequences of which will be discussed, especially violation of the Markov property. In an extended model, we split different reported causes of death into a background and an excess part. We applied the model to real-world, nationwide and unselected individual data from Statistics Netherlands (CBS).
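
        The key assumption of relative survival stated above can be written out as follows. The notation is chosen here for illustration and is not taken from the abstract: the subscripts O, P and E denote the observed, background (population) and excess quantities.

```latex
\lambda_O(t) = \lambda_P(t) + \lambda_E(t),
\qquad
S_O(t) = \exp\!\Big(-\int_0^t \lambda_O(u)\,\mathrm{d}u\Big) = S_P(t)\,S_E(t),
\qquad
S_E(t) = \exp\!\Big(-\int_0^t \lambda_E(u)\,\mathrm{d}u\Big)
```

        When the excess term is negative over some period, as reported for some groups, the factor S_E(t) can exceed 1 and the excess term is no longer a hazard in the usual sense, which is why a proportional hazards model for it becomes infeasible.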
        From 1 January 2020 until the end of 2021, the total cumulative probability of excess mortality was 0.3%, corresponding to 4.4% of all mortality. This percentage was markedly higher for older persons, especially men, both in absolute terms and as a percentage of subgroup-specific observed mortality. Almost all excess mortality took place after an infection but its probability was much lower for persons who had received a vaccination.
        The novel multi-state model incorporating relative survival makes it possible to split all mortality into background and excess mortality with and without intermediate events. The current application shows its value outside the traditional context of relative survival for cancer patients. Further extensions incorporating, among others, regression modelling of background and excess hazard, underreported infections and a mechanistic model for disease transmission are under development.


        Speaker: Liesbeth C De Wreede (Department of Biomedical Data Sciences, Leiden University Medical Center)
      • 16:15
        Modelling excess mortality comparing to a control population: A combined additive and relative hazards model 20m

        Regression models for the hazard function have been proposed on both a multiplicative and an additive scale. In medical research the former is often suitable, but in some instances, it is more biologically plausible to assume an additive effect on the mortality rate. The best-known example is in population-based cancer patient survival, where the presence of cancer is assumed to have an additive effect on mortality. This excess mortality rate is typically measured using relative survival, where the observed mortality rate in the cancer population is compared to that in a similar cancer-free population (the expected mortality rate). In such analyses, a publicly available population mortality file stratified on sex, year, and age, is matched to the cancer population, and included in the relative survival model as an offset, and hence assumed to be measured exactly (i.e., without uncertainty). However, situations exist where this standard approach is not optimal or even possible. For example, it might be necessary to stratify the expected mortality on additional factors, such as socio-economy or comorbidity. Alternatively, a suitable population mortality file might simply not exist for the population at hand.

        Here, we propose a flexible parametric excess hazard model on the log hazard scale, incorporating a modelled expected rate from a control population (e.g., matched comparators). By modelling the expected rate, we appropriately allow for uncertainty. Covariate effects are assumed to be multiplicative within the expected and the excess hazard, while the presence of disease among the studied population (e.g., cancer patients) has an additive effect. Following estimation, results are quantified through prediction of the survival, hazard, and cumulative incidence functions, as well as transformations of these, and crucially with associated confidence intervals on all measures.
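
        One way to read the model structure described above is sketched below. The notation is an assumption made here for illustration and is not taken from the abstract or from the stexcess documentation: D is 1 for diseased individuals and 0 for comparators, x is a covariate vector, and the two baseline functions are the flexible parametric expected and excess baseline hazards.

```latex
h(t \mid x, D) = h_0^{\mathrm{exp}}(t)\,\exp\!\big(x^\top \beta\big)
               + D \, h_0^{\mathrm{exc}}(t)\,\exp\!\big(x^\top \gamma\big)
```

        Under this reading, comparators contribute only to the expected term, so the expected rate is estimated from the data rather than fixed as an offset, and its uncertainty carries through to the predictions.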

        Bias and coverage of predictions are evaluated using simulated data mimicking a 1:5 matched cohort study, with cancer cases followed from cancer diagnosis and matched comparators followed from matching date (which corresponds to the diagnosis date for the corresponding cancer case). We further illustrate the method using a population-based dataset of rectal cancer patients diagnosed 2007-2016, with comparators matched 1:6 on: sex, age, country, and being cancer-free using the Colorectal Cancer DataBase Sweden (CRCBaSe).

        The proposed method offers an alternative in situations when standard relative survival methods do not suffice, and is implemented in the Stata package stexcess (github.com/RedDoorAnalytics/stexcess).


        Speaker: Caroline Dietrich (Clinical Epidemiology Division, Department of Medicine Solna, Karolinska Institutet)
      • 16:35
        Regression models in the extended multi-state model using relative survival 20m

        In many medical applications of event-history analysis, individuals may experience several intermediate events before death, and a non-negligible proportion of deaths is unrelated to the disease under study. While standard multi-state models evaluate the occurrence of different events over time, they do not explicitly model mortality from disease-related causes and from other (population) causes as separate outcomes. To address this, we developed an extended multi-state model based on relative survival [1], which decomposes total (observed) mortality into disease-related (excess) and non-disease-related (population) components with and without intermediate events, in the case when cause of death is unavailable or uncertain.
        Within this framework, we can define transition hazards and transition probabilities and derive non-parametric estimators that incorporate population mortality tables into the estimation process. These estimators enable the estimation of hazards and probabilities of population and excess death with and without intermediate events, and their associated uncertainty.
        For incorporating covariates and individual prediction, we develop regression models for the excess hazards in the multi-state setting, applying both the multiplicative Cox-type and the additive Aalen modelling frameworks. Two main challenges arise in this context: handling delayed entry (relevant for intermediate states) and addressing small excess death rates (common in later follow-up periods). Both challenges can also appear in simpler settings (where only overall survival is considered or there are no intermediate events). We demonstrate how both challenges are handled within the two modelling frameworks and use simulations to investigate their performance under various scenarios.
        As the total mortality in the multi-state model is split into population and excess components, questions arise on the covariate effects and long-term patient predictions that can be obtained from such a model. The two regression approaches address these questions and provide a further understanding of the studied disease. All methods are implemented in the R packages mstate and relsurv, ensuring practical usability.

        [1] Manevski D, Putter H, …, de Wreede LC. Integrating relative survival in multi-state models-a non-parametric approach. Stat Methods Med Res. 2022;31(6):997-1012.


        Speaker: Damjan Manevski (Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana)
      • 16:55
        A penalized spline additive hazards model for modeling excess mortality 20m

        Relative survival techniques are often used to assess excess mortality in a specific study population by splitting observed mortality into background and excess components. These methods have been widely used to estimate cancer-specific mortality without the need for precise cause-of-death data for cancer patients. However, applying these techniques to other settings, such as pandemics or other temporal crises, presents important challenges. Standard relative survival methods assume excess hazards are always positive and that background mortality is specified externally, based solely on demographic factors and independent of the study data. Pandemics, in contrast, may involve periods of negative excess hazards due to the protective effects of public health measures, as well as substantial heterogeneity in background mortality across subgroups. Moreover, unlike conventional applications where the study population is a small subset of the reference population, in pandemic and other temporal crisis settings the background and study populations represent the same group observed at different time points (e.g. pre-pandemic vs. pandemic periods).
        To address these challenges, we propose a novel flexible parametric additive hazards model that simultaneously incorporates pre-pandemic and pandemic data, using penalized splines to estimate time-dependent covariate effects, with the crisis period included as a covariate. This unified framework for excess-mortality estimation accommodates negative excess hazards and integrates relevant risk factors directly into the background mortality hazard, defined as the pre-pandemic hazard and estimated from the data. In addition, the use of penalized splines is less prone to overfitting than the classical non-parametric Aalen’s method. We investigated two estimation strategies: one adapting the least-squares approach used in Aalen’s method to the penalized spline setting, and the other based on penalized likelihood maximization. The performance of the two approaches is compared through a simulation study based on the COVID-19 pandemic and scenarios mimicking possible pandemic conditions.

        Speaker: Mar Rodriguez-Girondo (LUMC)
    • 15:45 17:15
      Machine learning and data science 2 Room 14

      Room 14

      Convener: Elżbieta Kubera (University of Life Sciences in Lublin)
      • 15:45
        Supervised Machine Learning Models for Longitudinal and Clustered Data 18m

        Longitudinal or clustered data often arise in clinical research, potentially violating the independent and identically distributed (i.i.d.) assumption. In regression, (generalized) linear mixed-effect models are frequently used to account for the correlation structure of the data, but these come with restrictions such as the linearity assumption and pre-specification of predictors and their interactions. In contrast, machine learning based algorithms, such as regression trees, random forests and neural networks, offer greater flexibility. Among other things, they can automatically select the most relevant variables and capture non-linear relationships and interactions between variables. Nevertheless, these models usually assume i.i.d. observations, at least implicitly. To be used with clustered or longitudinal data, machine learning models need to account for such correlations.
        To address this limitation, several extensions to machine learning algorithms have been proposed to incorporate random effects, drawing on the idea from (generalized) linear mixed effect models (Sela and Simonoff 2012; Hajjem et al. 2017; Ngufor et al. 2019). Alternative approaches include Multi-Task Learning and Recurrent Models for longitudinal data (Cascarano et al. 2023), and domain adaptation and domain generalization techniques for clustered data (Nguyen et al. 2023). However, a comprehensive review of supervised machine-learning algorithms capable of handling correlated data in both longitudinal and clustered formats is currently lacking.
        This work aims to fill this gap by systematically identifying and comparing modelling strategies and methods for the analysis of longitudinal and clustered data with supervised machine learning algorithms. A secondary objective is to assess their use in biomedical research by examining applications in appropriate journals. To achieve this, we employ a scoping review methodology, which involves a comprehensive literature search to map the current knowledge base.
        In this talk, we will present the findings of the scoping review, providing an overview of the available modelling strategies and comparing their strengths and limitations in the context of biomedical research. We will also provide insights into the current use of these modelling strategies, shedding light on their adoption in the field.

        Speaker: Maxi Schulz (University Medical Center Göttingen, Department of Medical Statistics)
      • 16:03
        Can We Trust Interpretable Machine Learning Methods for Longitudinal Risk Prediction? 18m

        Machine learning (ML) models have emerged as a powerful alternative to traditional statistical methods due to their flexibility and ability to leverage large-scale, high-dimensional datasets. However, in sensitive application areas such as clinical and prognostic modeling, deploying ML models requires interpretability in order to reveal underlying model behavior, identify influential risk factors and detect potential biases. Although interpretable machine learning (IML) techniques are increasingly used to illuminate these “black box” models, the systematic evaluation of interpretability in non-standard prediction settings remains limited.

        In this study, we examine IML for risk prediction modeling using longitudinal features such as diagnosis (ICD) and medication (ATC) codes, where high dimensionality and sparsity present major methodological challenges. Our focus is on prediction tasks with binary and time-to-event outcomes. We conduct a simulation study to evaluate the effectiveness of different IML techniques in explaining ML-based prediction models under increasing data complexity, including varying degrees of sparsity, dimensionality, and outcome types.

        We fit several prediction models with an emphasis on deep neural network architectures tailored for longitudinal data, and apply a set of model-agnostic and model-specific IML techniques. We assess the accuracy with which these methods recover known data-generating relationships and the alignment of interpretability with predictive accuracy. Finally, we apply the evaluation framework to real-world health insurance data to assess generalizability. This study is the first to systematically evaluate and compare IML techniques for longitudinal prediction modeling. It offers practical guidance for method selection and advances understanding of IML’s role in risk prediction and clinical decision support within healthcare and biomedical contexts.

        Speaker: Julia Höpler (Leibniz Institute for Prevention Research and Epidemiology – BIPS, Faculty of Mathematics and Computer Science – University of Bremen)
      • 16:21
        Flexible nonparametric bootstrapping for machine learning validation studies based on hierarchical data 18m

        Introduction:
        Machine learning (ML) validation studies can often be tackled with standard statistical inference methods, i.e. confidence intervals and statistical tests. While this is reasonable in many situations, there are also conditions under which the usual IID assumption is not met, and operating characteristics (coverage probability, type 1 error rate) may thus deteriorate. For instance, hierarchical data structures (multiple medical images per patient, multiple cells per blood sample, …) may introduce dependencies between (lower level) observations.

        Methods:
        We applied a flexible hierarchical bootstrap approach, which has been proposed before in other contexts, to the outlined problem (1, 2). In this approach, samples are drawn with replacement sequentially for each level of the (assumed) hierarchical data structure (e.g. first patients, then medical images per patient). To compare confidence intervals based on different variants of this approach and the traditional bootstrap with regard to relevant operating characteristics (coverage probability, average width), we conducted a simulation study in the ADEMP framework (3). We investigated different data generating mechanisms in the context of ML validation studies.
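
        The sequential resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a two-level structure (patients, then images), accuracy as the performance metric, and a simple percentile interval. The function name `hierarchical_bootstrap_ci` and the toy data are hypothetical.

```python
import numpy as np


def hierarchical_bootstrap_ci(correct_by_patient, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy under a two-level hierarchical bootstrap.

    correct_by_patient: list of 1-D arrays; each array holds 0/1 indicators
    (prediction correct or not) for the images of one patient.
    """
    rng = np.random.default_rng(seed)
    n_patients = len(correct_by_patient)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: draw patients with replacement.
        patient_idx = rng.integers(0, n_patients, size=n_patients)
        resampled = []
        for i in patient_idx:
            images = correct_by_patient[i]
            # Level 2: draw images with replacement within the drawn patient.
            resampled.append(rng.choice(images, size=len(images), replace=True))
        stats[b] = np.concatenate(resampled).mean()
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Toy data (hypothetical): 30 patients with patient-specific accuracy
# and between 3 and 9 images each.
toy_rng = np.random.default_rng(1)
patient_accuracy = toy_rng.beta(8, 2, size=30)
toy = [toy_rng.binomial(1, p, size=toy_rng.integers(3, 10)) for p in patient_accuracy]

print(hierarchical_bootstrap_ci(toy, n_boot=500))
```

        Any other performance metric can be substituted for the mean in the line that fills `stats[b]`; the resampling scheme itself stays the same.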

        Results:
        Our simulation results indicate that the hierarchical bootstrap usually outperforms the standard bootstrap, as its coverage probability is usually much closer to the target level. The only exception is the IID case where a hierarchical data structure is assumed but not truly part of the data generating process. In this scenario, the hierarchical bootstrap is rather conservative.

        Discussion:
        We conclude that the hierarchical bootstrap investigated in this work is valuable for ML practitioners, as it shows promising results while still being simple to apply. It is also widely applicable to arbitrary performance or error metrics. We currently see this method in phase 2 of the methodological research framework (4). Consequently, more extensive and diverse simulation studies will be needed in the future to better characterize and understand the operating characteristics in various scenarios.

        References:
        (1) Ren, Shiquan, et al. "Nonparametric bootstrapping for hierarchical data." Journal of Applied Statistics 37.9 (2010): 1487-1498.
        (2) Saravanan, Varun, et al. "Application of the hierarchical bootstrap to multi-level data in neuroscience." Neurons, Behavior, Data Analysis, and Theory 3.5 (2020).
        (3) Morris, Tim P., Ian R. White, and Michael J. Crowther. "Using simulation studies to evaluate statistical methods." Statistics in Medicine 38.11 (2019): 2074-2102.
        (4) Heinze, Georg, et al. "Phases of methodological research in biostatistics—building the evidence base for new methods." Biometrical Journal 66.1 (2024): 2200222.

        Speaker: Max Westphal (Fraunhofer MEVIS)
      • 16:39
        Explaining Mixed Effect Machine Learning with Shapley Value 18m

        In many real-world datasets, observations are hierarchically structured, such as students nested within classrooms, hospitals within cities, or repeated measurements from the same patient. Performing machine learning without accounting for this clustered structure can lead to biased predictions and misleading interpretations of feature effects.

        Recently, Mixed Effect Machine Learning, an extension of the traditional Linear Mixed Effect Model, has gained popularity for analyzing clustered and longitudinal data, particularly in healthcare applications, due to its ability to capture both population-level and group-specific variations. However, despite its predictive advantages, such models often remain black boxes, making interpretation difficult.

        Applying explainable AI (XAI) tools such as Shapley Values directly to mixed effect models has limited effectiveness because clustered data contain both cluster-level and observation-level features. Moreover, mixed effect models inherently separate structure into fixed effects, shared across all observations, and random effects, which vary across clusters. Standard SHAP values cannot distinguish how contributions operate at these different hierarchical levels, leading to incomplete explanations of model behavior.

        This study proposes an extension of the SHAP framework tailored specifically for Mixed Effect Machine Learning. The proposed approach enables a clear decomposition of feature contributions across cluster and observation levels, offering interpretable insights into how models use features in structured data. Beyond quantifying how much each feature contributes, this method reveals at what level, cluster or individual, the model utilizes each feature. Consequently, it also provides diagnostic guidance on whether additional random effects should be incorporated and how the model’s hierarchical structure should be refined.

        Speaker: Pat Vatiwutipong (Mahidol University)
      • 16:57
        ML with U-smile Visualization: A Novel Approach to Imbalanced Classification Without Thresholds 18m

        Background: Traditional binary classification assessment in machine learning relies heavily on decision thresholds, limiting interpretability and performance in imbalanced scenarios. While metrics like the area under the receiver operating characteristic (ROC) curve (AUC) provide overall performance measures, they fail to deliver class-specific insights, which are crucial for real-world applications with uneven class distributions.
        Methods: We introduce the U-smile method [1, 2], a novel machine learning framework featuring threshold-free visualization that decomposes the relative likelihood ratio (rLR) into event and non-event components displayed through characteristic U-shaped plots. Comprehensive experiments employed six synthetic datasets with varying predictive power and class imbalance ratios (balanced to 90/10). The method was compared against AUC-based variable selection using bidirectional stepwise selection with 10-fold cross-validation.
        Results: In severely imbalanced scenarios (90/10 distribution), U-smile methods demonstrated superior performance, selecting more relevant variables (4-5 vs 3) and achieving substantial improvements in minority class detection. Key metrics showed significant enhancement: AUC-PR (area under Precision-Recall curve) increased by 16% (0.701→0.812) and F1-score by 21% (0.662→0.798). The framework adheres to Explainable Machine Learning (EML) principles by providing intuitive graphical assessment tools. Evolutionary analysis of U-smile patterns revealed progressive symmetry achievement, with the non-event variant attaining near-optimal balance (rLR₀=0.535, rLR₁=0.569) despite extreme class imbalance.
        Conclusion: The U-smile framework offers a threshold-free, class-specific evaluation approach that outperforms conventional AUC-based methods in imbalanced classification. Its visual interpretability and capacity to identify minority-class-beneficial variables make it particularly valuable for practical applications where class imbalance is prevalent, while simultaneously advancing explainable AI through transparent model assessment.

        [1] Więckowska B, Kubiak KB, Guzik P. Evaluating the three-level approach of the U-smile method for imbalanced binary classification. PLOS ONE 2025;20:e0321661.
        [2] Kubiak KB, Więckowska B, Jodłowska-Siewert E, Guzik P. Visualising and quantifying the usefulness of new predictors stratified by outcome class: The U-smile method. PLOS ONE 2024;19:e0303276.

        Speaker: Barbara Więckowska (Department of Computer Science and Statistics, Poznan University of Medical Sciences, Poznan, Poland)
    • 15:45 17:15
      Streitberg- and Lienert-Awards Room 13 A

      Room 13 A

      Conveners: Anne-Laure Boulesteix (LMU Munich), Jan Beyersmann (Ulm University)
      • 15:45
        Optimal designs for non-linear segmented regression 18m

        Optimal designs maximize experimental efficiency and precision but are sometimes difficult to obtain, especially when the underlying model function is non-trivial. The motivating example comes from toxicology: liver carcinoma cells are modelled as a function of valproic acid (VPA) concentration using the common four-parameter log-logistic (4PLL) model. To prevent modelling implausible negative values, the model function is split into two segments: the standard 4PLL segment and a constant zero segment where the 4PLL function would be negative. The result is a non-linear continuous segmented regression model with unknown segment borders, for which it is difficult to define optimal designs. I derived and analyzed Bayesian A- and D-optimal segmented designs, along with a newly proposed A-mixed criterion to obtain a more accurate estimation of the parameter that defines the location of the segment border.
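
        The segmented model function described above can be illustrated with a short sketch. It assumes one common 4PLL parameterisation, f(x) = c + (d - c) / (1 + (x / e)^b), with slope b > 0, lower asymptote c < 0 and upper asymptote d > 0; the abstract does not state which parameterisation is used, so the parameter names and the helper `segment_border` are hypothetical.

```python
import numpy as np


def four_pll(x, b, c, d, e):
    """4PLL curve (assumed parameterisation): c + (d - c) / (1 + (x / e)**b)."""
    x = np.asarray(x, dtype=float)
    return c + (d - c) / (1.0 + (x / e) ** b)


def truncated_four_pll(x, b, c, d, e):
    """Segmented model: the 4PLL segment, then a constant zero segment."""
    return np.maximum(four_pll(x, b, c, d, e), 0.0)


def segment_border(b, c, d, e):
    """Concentration at which the 4PLL curve reaches zero.

    Solving c + (d - c) / (1 + (x / e)**b) = 0 gives x = e * (-d / c)**(1 / b).
    """
    if not (c < 0 < d and b > 0):
        raise ValueError("a border exists only for c < 0 < d and b > 0")
    return e * (-d / c) ** (1.0 / b)


# Hypothetical parameter values, chosen for illustration only.
params = dict(b=2.0, c=-0.2, d=1.0, e=10.0)
border = segment_border(**params)
print(border, truncated_four_pll([0.0, border, 2 * border], **params))
```

        Because the border depends on the unknown parameters b, c, d and e, its location has to be estimated along with them, which is what makes the design problem non-standard.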

        Compared to conventional A- and D-optimal designs that ignore the segmented model structure, segmented designs improve precision, as demonstrated in the VPA application and in various simulation studies based on the truncated 4PLL model. However, segmented designs may lack robustness under misspecified priors due to the model’s truncated structure. Therefore, a sequential approach is recommended: if data from a non-segmented design suggests a truncated 4PLL structure, combining it with a segmented design yields more stable and precise results.

        Speaker: Jan-Bernd Igelmann (TU Dortmund University)
      • 16:03
        Adaptive group sequential designs for event-driven survival trials 18m

        Interim analyses and design adaptations are increasingly used in clinical trials. Furthermore, trials often have several endpoints, such as progression-free survival (PFS) and overall survival (OS), a setting discussed in the recent paper by Danzer et al. (2022). There, adaptive group sequential one-sample tests for hypotheses on the distribution of PFS and OS are developed. The authors assume entirely random censoring, but in practice censoring is often event-driven, both at the interim and at the final analysis. The aim of this work is to extend the existing proposal for the PFS and OS model to such event-driven censoring in adaptive designs.

        A purely counting-process-oriented approach simplifies the notation and initially allows for general censoring mechanisms such as event-driven censoring.
        In a simulation study, we compare our results under different forms of event-driven censoring and detect no negative effect.

        Speaker: Alexandra Nagel (University of Ulm, METRONOMIA Clinical Research GmbH)
      • 16:21
        A Statistical Approach to Latent Dynamic Modeling with Differential Equations 18m

        In clinical registries, longitudinal patient data are often sparse, noisy, and irregularly sampled, yet an important practical question is how an individual patient’s health status is likely to change from the current visit to the next.

        Regression-based approaches to longitudinal data analysis typically provide a global fit over the full observed time course, but are not directly aimed at modeling such temporally local, patient-specific changes. Ordinary differential equations (ODEs) provide a natural framework for describing local dynamics based on current status, where parameters can be informed by external knowledge, but their use in clinical cohort settings remains limited. This is potentially due to a larger number of variables to be modeled and a higher noise level, as an ODE solution strongly depends on the initial value and jointly modeling many variables is challenging.

        To address this, we propose an ODE-based modeling approach for multivariate longitudinal cohort data. Each observation is used as an initial value to obtain multiple local ODE solutions, which are then combined into a single estimator, enabling prediction from arbitrary time points while improving robustness to noise. To accommodate a larger number of observed variables, we learn a low-dimensional latent space using neural networks, and infer individual-specific ODE parameters from patients' baseline characteristics. Differentiable programming allows for simultaneous estimation of the latent representation and the dynamic model.

        We illustrate the approach in an application on data from patients with spinal muscular atrophy and compare it with global regression-based function fitting. The application highlights how different modeling strategies address different scientific and clinical questions, and shows the potential of combining mechanistic and statistical modeling ideas for longitudinal registry data.

        Speaker: Maren Hackenberg (Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg. Freiburg Center for Data Analysis, Modeling, and AI, University of Freiburg)
      • 16:39
        Variable Selection via Fused Sparse-Group Lasso Penalized Multi-state Models Incorporating Molecular Data 18m

        In the era of precision medicine with increasing molecular information, the use of a multi-state model is required to capture the individual disease pathway along with underlying etiologies with greater precision. In particular, the availability of big data with numerous covariates poses several statistical challenges for model building. For multi-state models based on high-dimensional data, effective modeling strategies are required to determine an optimal, ideally parsimonious model.
        Standard methods integrate regularization into the fitting procedure to conduct variable selection. In the multi-state framework, linking covariate effects across transitions is needed to conduct joint variable selection. A useful technique to reduce model complexity is to address homogeneous covariate effects for distinct transitions. We integrate this approach to data-driven variable selection by extended regularization methods within multi-state model building. We propose the fused sparse-group lasso (FSGL) penalized Cox-type regression in the framework of multi-state models combining the penalization concepts of pairwise differences of covariate effects along with transition grouping. For optimization, we adapt the alternating direction method of multipliers (ADMM) algorithm to transition-specific hazards regression in the multi-state setting.
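
        To make the combined penalty concrete, the sketch below evaluates a fused sparse-group lasso penalty for a matrix of transition-specific coefficients. The exact penalty form and weighting used in the talk are not given in the abstract, so this is an assumption: a lasso term, a fused term on pairwise differences of one covariate's effects across transitions, and a group term that treats one covariate's effects over all transitions as a group. The function name and the three tuning parameters are hypothetical.

```python
from itertools import combinations

import numpy as np


def fsgl_penalty(beta, lam_lasso, lam_fused, lam_group):
    """Fused sparse-group lasso penalty (assumed form, for illustration).

    beta: array of shape (n_transitions, n_covariates); beta[q, p] is the
    effect of covariate p on the hazard of transition q.
    """
    beta = np.asarray(beta, dtype=float)
    n_transitions = beta.shape[0]

    # Lasso term: sparsity in individual transition-specific effects.
    lasso = np.abs(beta).sum()

    # Fused term: shrinks pairwise differences of the same covariate's
    # effects between transitions towards zero (homogeneous effects).
    fused = sum(
        np.abs(beta[q] - beta[r]).sum()
        for q, r in combinations(range(n_transitions), 2)
    )

    # Group term: one group per covariate, spanning all transitions.
    group = np.sqrt(n_transitions) * np.linalg.norm(beta, axis=0).sum()

    return lam_lasso * lasso + lam_fused * fused + lam_group * group


# Three transitions, two covariates: covariate 0 acts equally on the first
# two transitions, covariate 1 has no effect at all.
example_beta = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
print(fsgl_penalty(example_beta, lam_lasso=0.1, lam_fused=0.1, lam_group=0.1))
```

        In this sketch the fused term is zero for the pair of transitions that share the same effect, which is the mechanism that rewards similar cross-transition effects.
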
        In a simulation study and application to acute myeloid leukemia (AML) data, we evaluate the algorithm's ability to select a sparse model incorporating relevant transition-specific effects and similar cross-transition effects of biomarkers. We investigate settings in which the combined penalty is beneficial compared to global lasso regularization.
        Thus, effective model selection strategies in multi-state survival analysis are required for enhancing comprehension and interpretation of individual disease pathways, distinct oncological entities and tailored precision therapies, leading to improved personalized prognoses.

        Speaker: Kaya Miah (Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany)
    • 15:45 17:15
      TC3: Estimation after Adaptive Testing Room 1 B

      Room 1 B

      Convener: Werner Brannath (University of Bremen)
      • 15:45
        Estimation after an adaptive design 18m

        When estimating the treatment effect after a group sequential test or a more complex adaptive design, the maximum likelihood estimate is liable to be biased. The ICH E20: Guideline on Adaptive Designs for Clinical Trials has “reliability of estimation” as a key topic. Methods have been developed to reduce the bias in estimators after group sequential and adaptive designs – or even eliminate bias completely. Proposals include Whitehead’s bias-adjusted estimator and Rao-Blackwell unbiased estimators. The resulting estimators have been compared in terms of bias, variance and mean square error. However, these results do not tell the whole story. We shall look more closely at some of these estimators, make recommendations, and note areas where future research is required.
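
        The bias of the maximum likelihood estimate can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the talk: a one-sample, two-stage design with unit-variance normal observations and early stopping for efficacy when the first-stage z-statistic exceeds a boundary c. The function names are hypothetical. In this simple setting the bias has the closed form n2 / (n1 + n2) * phi(c - theta * sqrt(n1)) / sqrt(n1), where phi is the standard normal density, and the simulation can be checked against it.

```python
from math import exp, pi, sqrt

import numpy as np


def simulate_mle_bias(theta, n1, n2, c, n_sim=200_000, seed=0):
    """Monte Carlo bias of the MLE in a two-stage design with early stopping."""
    rng = np.random.default_rng(seed)
    # Stage-wise sample means of unit-variance normal observations.
    mean1 = rng.normal(theta, 1.0 / sqrt(n1), size=n_sim)
    mean2 = rng.normal(theta, 1.0 / sqrt(n2), size=n_sim)
    # Stop after stage 1 if the z-statistic crosses the efficacy boundary.
    stop = sqrt(n1) * mean1 > c
    pooled = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    # The MLE is the mean of all data observed up to the stopping time.
    mle = np.where(stop, mean1, pooled)
    return mle.mean() - theta


def exact_mle_bias(theta, n1, n2, c):
    """Closed-form bias for the same two-stage design."""
    phi = exp(-0.5 * (c - theta * sqrt(n1)) ** 2) / sqrt(2.0 * pi)
    return n2 / (n1 + n2) * phi / sqrt(n1)


print(simulate_mle_bias(0.2, 50, 50, 2.0), exact_mle_bias(0.2, 50, 50, 2.0))
```

        The bias is positive because large first-stage estimates are kept as they are, while small ones are diluted by second-stage data.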

        Speaker: Chris Jennison (University of Bath, UK)
      • 16:03
        Estimation in confirmatory clinical trials planned with an adaptive design - a regulatory perspective 18m

        Reliable estimation of treatment effects is essential for the benefit–risk assessment supporting the approval of new drugs and for the communication of trial results in the European Public Assessment Report (EPAR) and Summary of Product Characteristics (SmPC). In adaptive designs, where trial adaptations such as sample size re-assessment, population enrichment, or treatment arm selection are implemented, estimation becomes particularly challenging. These adaptations may introduce complexities in the estimation of treatment effects, both in terms of bias and precision, but also in the definition of what effect is being estimated—especially when design modifications alter the target population or other components of the estimand.

        This presentation will summarise the current regulatory guidance relevant to estimation following adaptive designs, including the EMA Reflection Paper on Methodological Issues in Confirmatory Clinical Trials Planned with an Adaptive Design, the draft ICH E20 Guideline on Adaptive Clinical Trials (EMA/CHMP/ICH/82035/2023, draft 2023), and related guidance. These documents highlight the importance of pre-specifying estimands, maintaining interpretability of estimates, and ensuring transparency about the impact of adaptations on estimation and inference.

        Examples from recent regulatory procedures will illustrate how estimation issues were addressed in practice, including the communication of treatment effects in the EPAR and SmPC. The presentation will also discuss areas where methodological developments and harmonised regulatory expectations are still needed to support the use of adaptive designs for confirmatory evidence generation.

        Speaker: Florian Klinglmueller (Austrian Agency for Health and Food Safety)
      • 16:21
        Estimation for adaptive designs - a methodological overview 18m

        In adaptive clinical trials, the conventional point estimators of the treatment effect are prone to bias. Similarly, the conventional confidence intervals are prone to incorrect coverage, as well as other undesirable statistical properties. Recent regulatory guidance, such as ICH E20, has highlighted the need to use adjusted estimators and confidence intervals for adaptive designs in order to address these issues.

        In this talk, we provide a comprehensive review of available methods for adaptive designs to 1) remove or reduce the potential bias in point estimators for adaptive designs and 2) construct confidence intervals with the desired coverage. We describe several classes of methodological techniques and provide a classification of adjusted estimators and confidence intervals by the type of adaptive design. We also highlight available software and code and discuss the remaining methodological gaps in the literature.

        Speaker: David Robertson (MRC Biostatistics Unit, University of Cambridge)
      • 16:39
        Point and interval estimation of treatment effects in adaptive clinical trials – past experiences and future developments 18m

        Clinical trials have become more complex in recent decades. They have gradually become longer, involving more centers and more patients. With these tendencies, interim analyses of ongoing trials have become much more common, and much research has been done on the design and analysis of data from adaptive trials. By now, group-sequential trials are probably more frequent than single-stage trials.

        Until very recently, however, type I error control of the related statistical test decisions was the only result from the research in adaptive designs that has really influenced clinical trial practice. Obviously, the selection of treatment arms at an interim analysis or the narrowing of the patient population to a subpopulation may induce biases as well as affect the coverage probability of confidence intervals. However, these aspects of adaptive designs are largely ignored in practice.
        The talk will start out by presenting some examples of trials where this happened and will compare the reported results with various adjusted estimates that were calculated post hoc. It will illustrate how certain subtleties (such as the difference between conditional and unconditional bias-adjusted estimates) influence the adjusted estimates and hence may explain why the research into this topic has not been picked up by practitioners of clinical trials. Part of the explanation lies in the fact that “bias” is a vague term and that selection bias from selecting a dose should not be treated with the same formal techniques as the bias from stopping the trial at a specific point in time. Furthermore, conditional selection biases can become very large, whereas unconditional selection biases oftentimes remain modest. The talk will conclude with a discussion of what we may want to change in the future.

        Speaker: Ekkehard Glimm (Novartis Pharma)
      • 16:57
        Discussant 18m
        Speaker: Frank Bretz
    • 17:30 19:30
      Welcome Reception 2h
    • 09:00 10:00
      P2 Plenary Session: Kirsten Schorning: Keynote lecture Room 1 A

      Room 1 A

      Convener: Łukasz Smaga (Adam Mickiewicz University)
      • 09:00
        Optimal design and Analysis in cytotoxicity experiments -- Bridging the gap between statistics and toxicology 1h

        Concentration-dependent cytotoxicity experiments are frequently used in toxicology. Although it has been reported that an adequate choice of concentrations, i.e., the design, substantially improves the quality of statistical inference, a recent literature review of three major toxicological journals showed that these methods are rarely used in toxicological practice.
        In this talk, we address the optimal design problem in cytotoxicity experiments from both an applied and a theoretical perspective. On the one hand, we present strategies and concrete examples for making established statistical methodology more accessible to potential users, especially biologists. On the other hand, we consider specific biological challenges in cytotoxicity experiments from the statistician’s point of view: identifying alert concentrations where a pre-specified threshold of the response variable is exceeded. We develop a model-based testing procedure for that purpose and address the corresponding optimal design problem. We construct an optimal design criterion to improve the power of the model-based testing procedure. Thus, an optimal design minimises the maximum variance of the alert concentration estimator. Optimal design theory is developed, and the results are illustrated in several examples in which the alert concentration is identified under different concentration-response relationships. In particular, it is demonstrated within a simulation study that using the optimal design results in more powerful tests for identifying alerts than using other “non-optimal” designs. In a further step, we extend the results to the situation of mixture experiments. Here, the combination of several toxic substances is considered, which requires a new definition of alert concentration (combinations) and new optimal design methodology.

        Speaker: Kirsten Schorning (TU Dortmund University)
    • 10:00 10:45
      Coffee break / Poster session 45m Foyer

      Foyer

    • 10:00 15:15
      Poster session - continued x Poster display area

      x Poster display area

    • 10:45 12:15
      Censored data 3 Room 13 B

      Room 13 B

      Convener: Jan Beyersmann (Ulm University)
      • 10:45
        Inference in pseudo-observation-based regression using (biased) covariance estimation and naive bootstrapping 18m

        The pseudo-observation regression approach provides a flexible alternative to the omnipresent proportional hazards model when modeling time-to-event outcomes. In this approach, estimands representable as expectations are fitted to regression models using covariates of interest. Exemplary estimands that fit this framework are the restricted mean time lost (in competing risks models) or the survival function at a fixed time-point (in simple survival models).
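
        As a brief illustration of the idea, the sketch below computes jackknife pseudo-observations for the survival function at a fixed time point, one of the estimands mentioned above. It assumes the standard leave-one-out construction, n * S(t0) - (n - 1) * S_{-i}(t0), with S the Kaplan-Meier estimate; the function names and the toy data are hypothetical, and the regression and variance estimation steps discussed in the talk are not shown.

```python
import numpy as np


def km_survival(time, event, t0):
    """Kaplan-Meier estimate of the survival probability S(t0)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Multiply the conditional survival factors over event times up to t0.
    for t in np.unique(time[event & (time <= t0)]):
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        surv *= 1.0 - deaths / at_risk
    return surv


def pseudo_observations(time, event, t0):
    """Jackknife pseudo-observations for S(t0), one per individual."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n = len(time)
    full = km_survival(time, event, t0)
    pseudo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # leave individual i out
        pseudo[i] = n * full - (n - 1) * km_survival(time[keep], event[keep], t0)
    return pseudo


# Toy data without censoring: each pseudo-observation then equals the
# indicator that the individual survives beyond t0.
toy_time = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_event = np.array([1, 1, 1, 1, 1])
print(pseudo_observations(toy_time, toy_event, t0=2.5))
```

        With censored data the pseudo-observations are no longer 0/1 valued, but they can still be used as responses in a regression model for the survival probability at t0.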

        Even though consistent parameter estimates are readily obtained using standard statistical software, variance estimation turns out to be a more intricate task: We verify the longstanding conjecture that the usual Huber-White estimator is not consistent. By confirming that a plug-in estimator can be used instead, we obtain asymptotically exact and consistent tests for general linear hypotheses in the parameters of the model. Additionally, we confirm that naive bootstrapping cannot be used for covariance estimation in the pseudo-observation approach either. However, it can still be used for hypothesis testing by applying a suitable studentization. These methods are evaluated in an extensive simulation study and exemplified with a real data analysis.

        Speaker: Simon Mack (RWTH Aachen University)
      • 11:03
        Evaluating selective‐inference methods for Lasso in survival analysis: a comparative simulation study 18m

        Occam’s Razor suggests that, among several plausible explanations for a phenomenon, the simplest is preferable. Applied to regression analysis, this implies that the smallest model that fits the data is best. Therefore, analyzing high-dimensional time-to-event data requires variable selection techniques if we want to follow the principle of Occam’s Razor. A widely used approach is Lasso regularization, but inference after Lasso selection remains challenging, particularly for complex models such as the Cox proportional hazards model, where standard confidence intervals and p-values are not readily available.
        We compare proposals for selective inference targeting the submodel parameters of the Lasso and its extension, the adaptive Lasso, including sample splitting, selective inference conditional on the Lasso selection, and the debiased Lasso. Using a neutral simulation design motivated by characteristics commonly observed in biomedical time-to-event datasets, we evaluate the empirical properties of selective confidence intervals. The methods are additionally demonstrated using a real-world biomedical dataset.

        Speaker: Lena Schemet (University of Augsburg)
      • 11:21
        Evaluation of Joint Models and Related Approaches for Long-Term Risk Prediction from Short-Term Data 18m

        We consider the following prediction problem using observational data obtained from routine health-care visits. Biomarkers such as blood pressure and cholesterol are repeatedly measured over time, resulting in sparse and irregular longitudinal data for thousands of individuals. In addition, we observe corresponding survival outcomes, such as the time to cardiovascular disease or death, which are correlated with underlying biomarker trajectories. Unlike traditional survival settings, individuals do not share a common starting time; instead, they enter the study at varying ages and without an intervention. Furthermore, individual observation windows are relatively short (e.g., five years), either because long-term data are not available or because older data might not reflect current patient characteristics. Given these constraints, can such short-term longitudinal data be used to make reliable long-term risk predictions, projecting 10-20 years into the future?

        From a methodological perspective, there is a way to approach this seemingly contradictory problem. Assuming proportional hazards throughout the whole age range, we can employ proportional hazards models that use age as the underlying time scale. In this setting, individuals enter the study at various ages, resulting in left-truncated survival data and an assumed baseline hazard that spans the entire observed age range. Consequently, the prediction horizon for a new patient can extend up to the maximum age in the training data, potentially decades beyond the short observation windows of individuals. What remains unclear, however, is how to adequately harness the longitudinal information for survival prediction. In more traditional time-on-study settings, different methods have been proposed for longitudinal and associated survival data. Among them, joint models have been shown to reduce bias and improve efficiency in parameter estimation. However, these advantages may come at the cost of substantial computational demands. This burden, as well as the quality of resulting predictions, may further be challenged by the left-truncation in the survival data, the long prediction horizon, and the sparsity of visit times.

        In this study, we explored the applicability of joint models and related approaches for long-term risk prediction in data with left-truncated survival times and multiple longitudinal markers as predictors. We use simulation studies to assess the methods, starting with an ideal data scenario with ample repeated measurements per individual, and gradually moving towards a setting that mimics a real-world example of routine health-care data from Austria. We evaluate the methods in terms of prediction accuracy as well as the feasibility of model estimation.
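
        As a rough illustration of the age-as-time-scale idea in this abstract, the sketch below computes a Nelson-Aalen cumulative hazard with delayed entry: an individual counts as at risk at a given age only after entering the study and before exiting. It is a minimal sketch, not the joint models evaluated by the authors; the function name and the toy data are hypothetical.

```python
def nelson_aalen_left_truncated(entry, exit_, event):
    """Cumulative hazard on the age scale with delayed entry.

    entry[i] : age at which individual i enters the study (left truncation)
    exit_[i] : age at event or censoring
    event[i] : 1 if the event was observed at exit_[i], else 0
    """
    event_ages = sorted({t for t, d in zip(exit_, event) if d == 1})
    cum_hazard = 0.0
    curve = []
    for age in event_ages:
        # At risk at this age: already entered and not yet exited.
        at_risk = sum(1 for a, t in zip(entry, exit_) if a < age <= t)
        events = sum(1 for t, d in zip(exit_, event) if d == 1 and t == age)
        if at_risk > 0:
            cum_hazard += events / at_risk
        curve.append((age, cum_hazard))
    return curve


# Hypothetical toy data: four individuals, each observed for about five years.
entry = [50, 55, 60, 62]
exit_ = [55, 60, 65, 66]
event = [1, 0, 1, 1]
for age, hazard in nelson_aalen_left_truncated(entry, exit_, event):
    print(age, hazard)
```

        Although no individual is followed for more than five years, the estimated curve spans ages 55 to 66, which is the mechanism that lets the prediction horizon exceed the individual observation windows.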

        Speaker: Moritz Madern (Medical University of Vienna, Center for Medical Data Science, Institute of Clinical Biometrics)
      • 11:39
        Comparing variable selection in Cox and accelerated failure time models: noncollapsibility, the phantom hazard 18m

        In descriptive studies, where the primary goal is to identify key predictors of a time-to-event outcome, and in predictive research involving numerous candidate predictors, data-driven variable selection methods are often employed to narrow down the pool of variables. This is particularly necessary when domain expertise is limited or when the practical utility of a prediction model is compromised by an overly complex structure. Despite their utility, variable selection methods may introduce challenges, especially in the context of survival analysis, where the choice of modeling framework can influence the results.

        One such challenge probably arises from the noncollapsibility of the Cox proportional hazards model. Noncollapsibility refers to the property whereby the hazard ratio for a predictor in the Cox model does not represent a marginal association when other variables are included in the model, even if they are independent of the predictor of interest (1). In contrast, the accelerated failure time (AFT) model does not exhibit noncollapsibility. This raises the question of whether the noncollapsibility of the Cox model impacts the operating characteristics of variable selection methods compared to the AFT model.

        In order to investigate this question, we applied backward elimination with different stopping criteria to Cox and AFT (Weibull) models aiming at cardiovascular event prediction with a previously published data set of health screening examinations with short follow-up (2). We also applied variable selection to bootstrap resamples to investigate selection stability. Our results indicate that the selected models and selection stability were very similar despite noncollapsibility of Cox models. We are currently running a larger simulation study to investigate this issue further.

        Selection bias in risk sets may be a reason for noncollapsibility of Cox models, because it affects the correlation structures between covariates within risk sets over time. Our data set exhibited a high proportion of censoring (95.7%), which dominated this possible selection bias by introducing a random element into the composition of risk sets attenuating the correlation of covariates. We preliminarily conclude that the noncollapsibility of Cox models may be negligible for the purpose of variable selection in observational studies with high censoring proportions.

        1. Martinussen, T., Vansteelandt, S., 2013. On collapsibility and confounding bias in Cox and Aalen regression models. Lifetime Data Anal 19, 279–296.
        2. Wallisch, C., et al., 2021. Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling. Statistics in Medicine 40, 369–381.

        Speaker: Lorena Hafermann (Institute of Biometry and Clinical Epidemiology, Charité - Universitätsmedizin Berlin)
    • 10:45 12:15
      High dimensional data 2 Room 14

      Room 14

      Convener: Jaroslaw Harezlak (Indiana University Bloomington)
      • 10:45
        Individuality and information content of infrared molecular profiles: insights from a large longitudinal health-profiling study 18m

        In this study, we investigate the individuality and information content of infrared molecular profiles derived from blood samples in a large, longitudinal health-profiling cohort and compare them to a standard clinical laboratory panel. Using Fourier-transform infrared spectroscopy, we obtained comprehensive molecular fingerprints from 4,704 self-reported healthy individuals over five visits spanning 1.5 years, alongside routine clinical laboratory measurements. We show that infrared profiles are highly individual-specific and remarkably stable over time, with intra-individual variability significantly lower than inter-individual differences, paralleling the characteristics observed in clinical laboratory data. To quantify and compare the information content of these molecular datasets, we employ individual identification as a proxy for Shannon entropy. In this framework, higher identification accuracy reflects a higher amount of information. Infrared profiles outperform the clinical laboratory panel in identifying individuals at scale, suggesting higher intrinsic information content. Furthermore, combining infrared and clinical laboratory data substantially improves identification performance (the clinical laboratory panel alone identifies fewer than 3000 individuals, which rises to more than 4000 once the infrared spectroscopic markers are incorporated), highlighting the value of integrating complementary data modalities. These findings suggest a practical framework, rooted in information theory, for comparing molecular profiling approaches and emphasize the potential of infrared spectroscopy as a complementary tool in personalized medicine.
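
        One simple way to make the identification task concrete is nearest-neighbour matching: each follow-up profile is assigned to the individual whose baseline profile is closest, and the share of correct assignments is the identification accuracy. The abstract does not state which matching rule was used, so the Euclidean nearest-neighbour rule, the function name and the three toy profiles below are assumptions for illustration only.

```python
import math


def identification_accuracy(baseline, follow_up):
    """Fraction of follow-up profiles whose nearest baseline profile
    (Euclidean distance) belongs to the same individual."""
    correct = 0
    for person, profile in follow_up.items():
        nearest = min(baseline, key=lambda other: math.dist(profile, baseline[other]))
        correct += nearest == person
    return correct / len(follow_up)


# Hypothetical two-feature profiles for three individuals at two visits.
baseline = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (5.0, 5.0)}
follow_up = {"a": (0.1, -0.1), "b": (0.4, 0.4), "c": (4.8, 5.1)}
print(identification_accuracy(baseline, follow_up))
```

        In this toy example individual "b" drifts closer to the baseline of "a" and is misidentified, so two of three individuals are matched correctly. Comparing such accuracies across data modalities mirrors the comparison described in the abstract.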

        Speaker: Kosmas Kepesidis (Ludwig-Maximilians-Universität München)
      • 11:03
        Statistical basis for precision screening with infrared molecular fingerprints: functional data decomposition and lung cancer signals 18m

        Early detection of lethal diseases such as lung cancer requires resolving faint signals amid biological heterogeneity. Precision screening aims to sensitively detect meaningful departures from an individual’s baseline by considering individual-level rather than population-level variability. This work investigates whether infrared molecular fingerprinting (IMF) - mid-infrared vibrational spectroscopy of blood plasma that captures broad molecular information - can support such precision screening. We find that IMFs exhibit strong variability stemming from different sources such as health status, population-level and individual-level variation. We show that functional principal component analysis can disentangle these components, allowing subsequent investigation of various IMF characteristics and application scenarios.
        Two clinical datasets are analyzed: a cross‑sectional lung‑cancer case-control study (511 cases, 993 controls) and a longitudinal healthy cohort (5830 participants with five visits). Each observation comprises an IMF spectrum linked to demographic, lifestyle and clinical covariates. The spectra are treated as functional data over wavenumbers. We estimate between‑ and within‑person covariance components and perform FPCA separately on each to extract dominant modes of variation and reduce data dimensionality. Analogously, lung cancer FPCs are estimated from the diseased cohort. Assuming IMFs follow a Gaussian process, the Karhunen-Loève expansion is used to simulate spectra for different variability scenarios based on which confidence bands are extracted via the modified band depth. Disease detectability is assessed with logistic regression and functional anomaly‑detection methods across representations (full spectra, leading FPCA scores, residuals after projection onto a healthy FPC subspace), summarized by ROC‑AUC. The disease effect in the case-control study, i.e. the average treatment effect (ATE) of cancer on IMF profiles, is approximated via propensity‑score matched cohorts. The ATE estimate is then contrasted with variance components and confidence bands to gauge detectability in precision versus cross‑sectional settings.
        The first 4 FPCs explain 93.1 % of spectral variance in healthy cohorts, with reproducible eigenfunction patterns across healthy datasets. Between‑person variability substantially exceeds within‑person variability and within‑person variation concentrates in specific spectral subregions. The dominant variation modes of lung cancer patients show patterns well distinguishable from healthy controls, indicating a clear disease signal. Binary classification achieves ROC‑AUCs up to 0.89 and anomaly detection up to 0.78. The ATE is comparable to overall variability scales yet clearly exceeds the average within‑person variability. Limitations of this work include covariate availability and site heterogeneity. These findings provide initial evidence that IMF-based precision screening has the potential to detect subtler perturbations under the observed variance structure and merits further evaluation.

        Speaker: Lea Gigou (Department of Statistics, Ludwig Maximilian University of Munich; Max Planck Institute of Quantum Optics; Center for Molecular Fingerprinting)
      • 11:21
        Integrative Prediction Models for Multi-Omics Data with Missing Modalities 18m

        Personalized medicine aims to improve the treatment of complex diseases by tailoring therapies to the individual molecular characteristics of patients. This is possible by using multi-omics data, which combine different molecular modalities from the same individuals. Integrating these modalities allows more comprehensive and powerful modeling. However, their unique characteristics make integration challenging, particularly when specific modalities are missing for some individuals.

        Many existing approaches require complete datasets, excluding individuals with incomplete modalities. This substantially reduces the sample size and may impair predictive performance.

        A promising approach suitable for such missingness involves training modality-specific prediction models separately. Their predictions are used as input for a meta-model that delivers the final predictions and handles missing values. However, it is currently unclear which meta-learners are optimal for a specific research question and combination of partially missing modalities.

        We systematically evaluated the prediction performance of different meta-learners for multi-omics data with missing modalities in a simulation study. We simulated a binary outcome and methylation values, gene expression, and protein abundance using R package InterSIM based on parameters derived from the TCGA Breast Invasive Carcinoma data set, preserving realistic correlations within and across modalities. We considered scenarios varying in the number and combination of modalities exhibiting an effect, which could be independent or dependent across modalities. Effect sizes in each modality ranged from absent to weak, moderate, or strong. For each modality, individuals with missing data were randomly selected. In total, we generated 30 simulation settings, each repeated 100 times.

        The evaluated meta-learners included weighted average, best modality-specific learner, logistic regression, least absolute shrinkage and selection operator (LASSO), the combined regression alternative (COBRA), and random forest. Their performance was assessed using the Brier Score, F1 score, and AUC based on predicted probabilities. Our results show that complex meta-learners, including logistic regression, LASSO, random forest, and COBRA, consistently outperform simpler approaches (weighted average and best modality-specific learner), particularly in settings with stronger effects.
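
        The simplest of the meta-learners listed above, the weighted average, can be sketched in a few lines, together with the Brier score used for evaluation. This is a minimal sketch under stated assumptions: missing modalities are represented by None and the weights are renormalised over the modalities that are present. The function names and this particular handling of missingness are illustrative and not taken from the study.

```python
def weighted_average_meta(predictions, weights):
    """Combine modality-specific probabilities, skipping missing ones (None)."""
    available = [(p, w) for p, w in zip(predictions, weights) if p is not None]
    if not available:
        raise ValueError("no modality available for this individual")
    total = sum(w for _, w in available)
    # Renormalise so that the weights of the observed modalities sum to one.
    return sum(p * w for p, w in available) / total


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical predictions from methylation, expression and protein models;
# the expression modality is missing for the first individual.
individuals = [[0.8, None, 0.4], [0.2, 0.3, 0.1]]
weights = [1.0, 1.0, 1.0]
combined = [weighted_average_meta(p, weights) for p in individuals]
print(combined, brier_score(combined, [1, 0]))
```

        More complex meta-learners such as logistic regression or random forests replace the fixed weights with a model fitted to the modality-specific predictions, which requires an explicit strategy for the missing inputs.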

        Speaker: Marina Bleskina (Institute of Medical Biometry and Statistics, University of Lübeck, University Hospital of Schleswig-Holstein, Campus Lübeck)
      • 11:39
        Inference for Functional Matched Pairs Designs with Missingness 18m

        Functional data analysis (FDA) has become increasingly popular in medical biometry and statistics. It is often appropriate to model observations by smooth curves or functions, for example when observations are sampled densely over time or space, or in the case of high-dimensional repeated measurements, as FDA methods allow flexible modelling. Furthermore, they assume neither a particular correlation structure between sampled cases or time points nor equally spaced time points. Despite many methodological developments in the last few years, there are still methodologically unsolved problems. One of them is missingness in the context of functional data.

        That is why we present a method for testing in the presence of missing values in a functional matched pairs design by adapting the approach of [1] for functional data. The method assumes a missing completely at random (MCAR) mechanism and works without any distributional or heteroscedasticity assumption. Two permutation approaches were used to carry out the testing in the presence of missing values and to achieve good small-sample performance. We compare the new method with the bootstrap-based approach of [2] and with the imputation methods of [3].

        [1] Amro, L., & Pauly, M. (2016). Permuting incomplete paired data: A novel exact and asymptotic correct randomization test. Journal of Statistical Computation and Simulation, 87(6), 1148–1159.
        [2] Crainiceanu, C. M., Staicu, A.-M., Ray, S., & Punjabi, N. (2012). Bootstrap-based inference on the difference in the means of two correlated functional processes. Statistics in Medicine, 31(26), 3223–3240.
        [3] Jang, J. H., Manatunga, A. K., Chang, C., & Long, Q. (2021). A Bayesian multiple imputation approach to bivariate functional data with missing components. Statistics in Medicine, 40(22), 4772–4793.

        Speaker: Marléne Baumeister (TU Dortmund University, Department of Statistics)
      • 11:57
        Informative Subsampling via Optimal Design for Efficient Training on Large and High-Dimensional Genomic Data 18m

        Large and high-dimensional biomedical datasets (large n and p) such as genotype data containing hundreds of thousands of genetic variants (SNPs) measured across many individuals require scalable algorithms to enable efficient model training. In this work, we address this challenge by leveraging principles of optimal design and informative subsampling.

        We investigate the applicability of existing optimal-design-based informative sampling techniques, including the D-optimality-based IBOSS (information-based optimal subdata selection) and the A-optimality-based OSMAC (optimal subsampling motivated from the A-optimality criterion) methods for regression, as a means to reduce training time while maintaining prediction and estimation accuracy. Although these methods perform well in classical low-dimensional settings, we find that their effectiveness tends to diminish in complex real-world scenarios characterized by heterogeneous signal-to-noise ratios and correlation structures. Furthermore, these methods are not directly applicable to high-dimensional settings with a larger number of variables than observations (p>n).

        To address these limitations, we propose new strategies that integrate optimal-design-based subsampling with variable selection methods for high-dimensional data. In this framework, the subsample is selected based on a reduced (screened) set of variables, while subsequent model training can still leverage the full set of covariates. Finally, as subsampling reduces computation time but often cannot fully match full-data accuracy, we extend these subsampling-based approaches to ensemble methods based on multiple subsamples, demonstrating competitive performance while maintaining low model training times. Our findings also underscore the need for new subsampling strategies that account for linkage disequilibrium (LD) patterns across diverse populations, which is an important direction for future work.

        Speaker: Subhadra Dasgupta (IUF - Leibniz Research Institute for Environmental Medicine)
    • 10:45 12:15
      IS4: Questions of Experimental Design and Statistical Analysis in Preclinical Research Room 1 A

      Room 1 A

      Convener: Florian Frommlet (Medical University Vienna)
      • 10:45
        From design to decision: A case study in preclinical pooled CRISPR demonstrating statistical opportunity 30m

        Preclinical research is rich with nontrivial design problems that demand statistical leadership. These opportunities exist across in vivo and high-throughput in vitro research systems where statisticians can materially improve translational fidelity by aligning biological questions, design, and analysis to support decision making. In this talk, I will discuss examples that have arisen from a collaboration focusing on a pooled CRISPR research strategy, a technique used to explore library-scale perturbations and analyse their impact at the gene level. The potential of this technology, and the complex biological questions being posed, are leading to a rapid increase in the complexity of the experiments being implemented. A collaboration between researcher and statistician opened with a support request for advice on how many samples researchers needed to design equitable in vivo pooled CRISPR experiments that include male and female samples. Answering this required detective work, which uncovered that common statistical pitfalls were rife across the research portfolio. These included mis-specified observational/biological/experimental units leading to pseudo-replication, pipelines that ignore blocking and repeated measures, and decision making based on differences in nominal significance error. There is also a need to embrace design complexity to improve generalisability. Together, these issues require a partnership to help the community approach design and analysis with a robust strategy that is scalable and can evolve to keep pace with the biological questions being asked. I will share the progress made in embracing hierarchical models with simulations to explore sensitivity and evaluate performance.
        The aim is to demonstrate the opportunities that can arise within preclinical research for statisticians to design smarter experiments, derive defensible sample sizes, and deliver analyses that move preclinical findings closer to the clinic.

        Speaker: Natasha Karp (AstraZeneca)
      • 11:15
        Optimizing Preclinical Research: Sample Size Planning and Sequential Designs 20m

        Preclinical studies often operate under strict ethical, logistical, and financial constraints, resulting in experiments with very small sample sizes. These limitations pose substantial challenges for statistical inference, reproducibility, and the reliability of decision-making in early-phase biomedical research. This talk provides an overview of key design and analysis issues in preclinical experiments, with a particular focus on approaches that help maximize information gain while upholding the principles of reduction and refinement in animal research. After briefly outlining common pitfalls in experimental design, we introduce fundamental considerations for sample size planning in small-scale animal trials, including variance estimation, precision-based approaches, and the role of prior information. Special emphasis is placed on sequential and adaptive designs, which offer flexible interim decision rules, allow for early termination for efficacy or futility, and can substantially reduce the number of animals required without compromising scientific validity. We discuss practical implementation strategies, statistical properties, and typical use cases for sequential methods in preclinical settings.

        Speaker: Frank Konietschke (Institute of Biometry, Charité - Universitätsmedizin Berlin)
      • 11:35
        Statistical Planning and Reporting for Multi-Laboratory Preclinical Trials 20m

        Translating preclinical results into effective clinical treatments remains a major challenge in biomedical research. Too often, findings from single-laboratory studies fail to replicate in the preclinical context and, further, fail to show effectiveness in clinical trials. One promising approach to validate exploratory findings is through confirmatory multi-laboratory preclinical trials, that is, studies that test the same intervention across several independent labs using harmonized protocols, much like randomized controlled clinical trials (RCTs). Given the complexity of the experimental design, these trials tend to adopt rigorous experimental practices such as sample size calculation and randomization of animals into treatment groups, which provide robust and reliable evidence for deciding whether to move forward into clinical trials.

        The added complexity comes with a major challenge: how to plan, analyse, and report these studies properly. This talk presents general recommendations on how to plan the analysis of multi-laboratory preclinical studies and how to report them. The recommendations focus on key items to include in the statistical analysis plan (SAP), drawing on the experimental design, rigorous research practices, and statistical principles. Ultimately, the goal is to raise awareness of practical and statistical considerations, and to encourage a closer exchange between experimental researchers and statisticians to perform decision-enabling preclinical studies.

        Speaker: María Arroyo Araujo (Berlin Institute of Health, Charité Universitätsmedizin)
      • 11:55
        The Experimental Unit Information Index: Balancing Evidentiary Value and Sample Size of Adaptive Designs 20m

        Reducing the number of experimental units is one of the three pillars of the 3R principles (Replace, Reduce, Refine) in animal research. At the same time, statistical error rates need to be controlled to enable reliable inferences and decisions. This paper proposes a novel measure to quantify the evidentiary value of one experimental unit for a given study design. The experimental unit information index is based on power, Type-I error and sample size, and has attractive interpretations both in terms of frequentist error rates and Bayesian posterior odds. We introduce the EUII in simple statistical test settings and show that its asymptotic value depends only on the assumed relative effect size under the alternative. We then extend the definition to adaptive designs where early stopping for efficacy or futility may cause reductions in sample size. Applications to group-sequential designs and a recently proposed adaptive statistical test procedure show the usefulness of the approach when the goal is to maximize the evidentiary value of one experimental unit.

        Speaker: Leonhard Held (University of Zurich)
    • 10:45 12:15
      Methods in epidemiology 1 Room 12

      Room 12

      Convener: Werner Vach (Basel Academy for Quality and Research in Medicine)
      • 10:45
        Inferring the causal effect direction in genetic association studies: An application to broad depression, obesity, and asthma 18m

        In genetic association studies, Mendelian Randomization (MR) is a popular tool for inferring causal relationships between traits using genetic variants as instrumental variables. Recent methods have been proposed as tools that can infer the causal direction between two phenotypes, including MR Steiger, bidirectional MR, causal direction-ratio, causal direction-Egger, and causal direction-GLS. We conducted a comprehensive simulation study evaluating type I error control and power of 19 such summary-data-based MR approaches to correctly determine the effect direction in the presence of pleiotropy, measurement error, unmeasured confounding, and weak instrumental variables. In addition, we examined the performance of these approaches when there is a longitudinal causal relationship between the two phenotypes. We also applied these 19 methods to the UK Biobank to determine the effect direction between body mass index (BMI) and major depressive disorder (MDD), and between BMI and asthma.

        Speaker: Sharon Lutz (Harvard Medical School)
      • 11:03
        Methodological and Practical Challenges in Developing a Cardiac Allocation Score 18m

        Heart transplantation is widely regarded as the gold standard for the treatment of end-stage heart failure. However, shortages of donor hearts necessitate the implementation of waiting lists and allocation algorithms. The German Transplantation Law stipulates the allocation of donor hearts based on the urgency of and the benefit from a transplantation. This can be summarized into a single score, the Cardiac Allocation Score (CAS). The two components of the CAS are: firstly, urgency, measured by the life expectancy of a patient remaining on the waiting list; and secondly, benefit, the difference in life expectancy after a transplantation compared to remaining on the waiting list. Hence, in order to calculate an individual’s CAS, two counterfactual predictions are necessary: one for the restricted mean survival time within 1 year (RMST1) if the individual were never to receive a heart transplantation, and one for RMST1 if the individual were to receive the heart transplantation.
        The development of these two prediction models gives rise to methodological challenges: 1) In predicting survival on the waitlist, complete observation is obscured by transplantation. The prevailing allocation system in Germany and the Eurotransplant (ET) region primarily relies on urgency and waiting time, so censoring due to transplantation is informative. 2) The positive outcomes associated with heart transplantation are masked during the initial postoperative period, which is largely attributable to the significant impact of the surgical intervention.
        Additionally, varying data vintage and the requirement for instant predictions are further challenges. Data vintage means that urgency (life expectancy when remaining on the waiting list) must be computed using the data of all persons on the waiting list that are available when a donor heart becomes available, but those data may have been updated at different time points in the past. Moreover, predictions have to be provided instantly, but RMST1 often requires numerical integration.
        Several approaches may be useful to consider when training the models, including inverse probability of censoring weighting, parametric accelerated failure time models with various distributions, direct estimation of RMST1 with pseudo-values, landmarking etc. In this talk, we will discuss advantages and disadvantages of possible analysis strategies resulting from combining these elements. Furthermore, we will report on simulations to compare their performances. Overall, our work aims to identify modelling strategies that can most reliably support CAS estimation—and thereby strengthen the fairness and effectiveness of heart allocation systems.
        Schramm, R., Gummert, J.F. Herztransplantation. Chirurgie 95, 101–107 (2024).
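
        To illustrate why RMST1 may require numerical integration, the sketch below integrates a survival function from 0 to 1 year with the trapezoidal rule and takes the difference of two such integrals as the benefit component. It is a sketch under stated assumptions, not the authors' models: the Weibull survival curves, their parameter values and the function names are hypothetical.

```python
import math


def rmst(survival, tau, steps=1000):
    """Restricted mean survival time: integral of S(t) from 0 to tau,
    approximated with the composite trapezoidal rule."""
    h = tau / steps
    total = 0.5 * (survival(0.0) + survival(tau))
    for k in range(1, steps):
        total += survival(k * h)
    return total * h


def weibull_survival(shape, scale):
    """Survival function S(t) = exp(-(t / scale) ** shape)."""
    return lambda t: math.exp(-((t / scale) ** shape))


# Hypothetical counterfactual survival curves for one patient (time in years).
on_waitlist = weibull_survival(shape=1.0, scale=2.0)
after_transplant = weibull_survival(shape=0.7, scale=8.0)

urgency = rmst(on_waitlist, tau=1.0)
benefit = rmst(after_transplant, tau=1.0) - urgency
print(urgency, benefit)
```

        For a shape parameter of 1 the curve is exponential and the integral has the closed form scale * (1 - exp(-tau / scale)), which offers a quick check of the numerical result. Closed forms or precomputed grids of this kind are one possible route to the instant predictions the abstract calls for.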

        Speaker: Leonie Lenz-Seraphin (Medical University Vienna, Center for Medical Data Science, Institute of Clinical Biometrics)
      • 11:21
        Timeliness of Polio Vaccination during 2019-21 in India: A finite mixture modeling analysis 18m

        Background: Childhood immunization directly or indirectly influences fourteen of the seventeen Sustainable Development Goals (SDGs). Timely receipt of vaccines protects children from deadly diseases and increases the overall future productivity of the population. India has the largest and most heterogeneous population of under-five children, yet the delay in receiving polio vaccination there has not been adequately explored with conventional regression models. This study aimed to identify latent subgroups of children according to polio vaccine timeliness, together with the associated risk factors.

        Methods: Unit-level data of 102176, 106386, 98071, and 86773 children were used for the four doses of polio vaccine from the fifth round of the National Family Health Survey (NFHS) in India. Latent subgroups and associated risk factors of the four polio vaccine durations were obtained using C-point finite mixture negative binomial models.

        Results: Among the children with available vaccination dates, about 51.86% received a delayed first dose. The second oral dose showed three latent classes (mean durations: 53, 81, and 127 days), whereas the third and fourth oral doses had three and four classes, respectively. Wealth, geographical zones, and maternal education significantly predicted class membership.

        Discussion: This study is the first to explore the hidden patterns of vaccine delay and subgroup-specific risk factors using a finite mixture modeling technique rather than a “one size fits all” approach, using the world’s largest household survey data. Different latent classes indicate distinct behavior for each dose, supporting dose-specific policy interventions. Unveiling the underlying pattern of delay in receiving polio vaccine and its risk factors would help policymakers design more targeted and efficient immunisation strategies. By identifying dose-specific patterns across hidden subgroups, this study provides insight for assessing progress towards SDG3.

        42858805205

        Speaker: Sumit Das (Scientist - I, All India Institute of Medical Sciences (AIIMS))
      • 11:39
        Advancing mixed-effects random forests to predict BMI development in children and adolescents based on multi-cohort data 18m

        Mixed-effects models (MEMs) are widely used in epidemiology to analyze data that are not independent and identically distributed (i.i.d.), such as longitudinal data. However, MEMs rely on parametric assumptions and require predefined interactions among predictors. In contrast, machine learning (ML) methods such as random forests (RF) assume i.i.d. data but are more flexible in capturing nonlinear relationships and interactions. Mixed-effects ML methods, which combine MEMs with ML, have shown improved prediction performance by leveraging the strengths while mitigating the limitations of both approaches [1]. However, these mixed-effects ML methods remain underexplored in multi-cohort settings, where multiple cohorts spanning different life stages are combined to extend the observation period and improve individual predictions. We propose multi-cohort mixed-effects random forests (MERFmulti-cohort), a new approach combining MEMs and RF to improve the prediction of individual-level health trajectories beyond the actual individual measurement period based on multi-cohort data.
        Mixed-effects random forests (MERF) [2] iteratively estimate fixed effects and random effects, where fixed effects are captured by the RF while the dependencies in the data are accounted for through suitable correlation structures in the MEM. However, the covariance structures of random effects and residuals are oversimplified in previous MERF implementations and fail to accommodate the complexities of multi-cohort data with hierarchical structures. Our MERFmulti-cohort builds on the out-of-the-box (OOB) implementation of the mixed-effects machine learning framework (mixedML) by Kilian et al. [3] and extends MERF to accommodate complex covariance structures in multi-cohort data. We illustrate and evaluate our approach by predicting individual-level BMI z-score trajectories using harmonized data from two cohorts of children.
        We compared our MERFmulti-cohort to previous MERF approaches [4] and three conventional methods: (a) RF, (b) linear regression (LM), (c) MEM. The prediction accuracy was evaluated in four scenarios forecasting BMI z-scores in children based on either single-cohort or multi-cohort data.
        We show that MERFmulti-cohort outperforms RF, LM and previous MERF approaches when predicting BMI z-scores in both single-cohort and multi-cohort settings. When compared with MEM, the improvements pertain exclusively to certain multi-cohort scenarios. We provide guidance on prediction scenarios where MERFmulti-cohort outperforms other methods. We further discuss the benefits of using multi-cohort over single-cohort data to enhance the accuracy of individual-level predictions, particularly in cases where single cohorts are limited in the age range covered.

        96432307368

        Speaker: Jiumeng Zhang (Leibniz-Institute for Prevention Research and Epidemiology - BIPS)
      • 11:57
        Estimating prevalence of micronutrient deficiency across multiple biomarkers: Approaches for generalized linear and linear mixed models 18m

        Assessing micronutrient status is essential in nutritional research (Allen, 2025) and typically involves estimating the population-wide prevalence of micronutrient deficiencies using biomarker data collected across multiple regions. In such studies, several biomarkers are commonly analyzed to estimate the prevalence of any deficiency, defined as the probability that at least one of the biomarkers falls below its threshold. Here, we compare different methods for estimating this prevalence using either dichotomized biomarkers in a generalized linear mixed model (GLMM) or continuous biomarker data in a (multivariate) linear mixed model (LMM).
        We evaluate three approaches: two GLMM-based and one LMM-based. One method, applicable to both GLMMs and LMMs, obtains univariate prevalence estimates for each biomarker and then combines them to estimate the prevalence of any deficiency (Hothorn et al., 2025). For GLMMs, a second approach constructs a composite dichotomized response variable by first dichotomizing each biomarker using biomarker-specific thresholds and then creating the composite response that equals 1 if any biomarker is below its threshold and 0 otherwise. To quantify the uncertainty of the estimated prevalence, we apply both a non-parametric bootstrap and the delta method, the latter either including or excluding the uncertainty of the estimated variance components of the mixed model.
        The three approaches are compared using a database containing multiple biomarkers measured in several population subgroups across three countries. Overall, the agreement between the methods in terms of estimated prevalence was high, with a maximum absolute difference between methods of 0.028 for prevalences ranging from 0.2 to 0.3. Regarding the uncertainty of the estimated prevalence, the approaches that use dichotomized data performed very similarly, while using continuous data led to slightly elevated uncertainties in many cases. Nevertheless, the LMM approach offers practical advantages over the GLMM: in particular, prevalence estimation is simpler for an LMM, as GLMMs require numerical integration (Gory et al., 2021), which becomes challenging for complicated covariance structures.
        Ongoing methodological work, including the use of a multivariate LMM that accounts for correlations between biomarkers, is expected to yield additional results that will be presented.
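
        The composite response described above can be illustrated with a minimal sketch. The biomarker names, thresholds and values below are hypothetical, and the raw proportion computed here is only a descriptive stand-in: it is not the GLMM- or LMM-based prevalence estimator compared in the talk.

```python
# Hypothetical biomarker-specific deficiency thresholds (illustrative values only).
thresholds = {"ferritin": 15.0, "zinc": 65.0}

# Hypothetical measurements for four individuals.
sample = [
    {"ferritin": 12.0, "zinc": 70.0},  # below the ferritin threshold
    {"ferritin": 30.0, "zinc": 80.0},  # no deficiency
    {"ferritin": 20.0, "zinc": 60.0},  # below the zinc threshold
    {"ferritin": 10.0, "zinc": 50.0},  # below both thresholds
]


def any_deficiency(person, thresholds):
    """Composite response: 1 if at least one biomarker is below its threshold, else 0."""
    return int(any(person[name] < cutoff for name, cutoff in thresholds.items()))


def raw_prevalence(sample, thresholds):
    """Unadjusted proportion of individuals with any deficiency (no mixed model involved)."""
    flags = [any_deficiency(person, thresholds) for person in sample]
    return sum(flags) / len(flags)


composite = [any_deficiency(person, thresholds) for person in sample]
naive_prevalence = raw_prevalence(sample, thresholds)
```

        In a model-based analysis, the 0/1 composite would serve as the response of the GLMM rather than being averaged directly.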

        32144107086

        Speaker: Steffen Hadasch (National Institute of Public Health, University of Southern Denmark)
    • 10:45 12:15
      Statistical modelling 2 Room 13 A

      Convener: Kirsten Schorning (TU Dortmund University)
      • 10:45
        Comparison of statistical methods for dealing with deviations in concentration-response curves 18m

        Concentration-response curves model the relationship between the concentration of a compound and the response it elicits in a biological system. Here, the viability of cells is considered as the response. Typically, parametric models are fitted to the data. Modeling this relationship accurately is crucial for understanding the safety and potency of compounds, since one of the applications of these curves is the estimation of the effective concentration (EC), at which a certain pre-defined response is obtained. However, due to measurement errors or biological variability, it is often observed that measurements at individual concentration levels deviate from the typically assumed sigmoidal models. Such deviations in the data can compromise the accuracy of the estimation of EC values. Due to the usually very low number of measured concentrations, identifying these deviations is challenging.
        Here, we simulated data for different scenarios of possible deviations at individual concentration levels. We propose several statistical methods that can mitigate the impact of such deviations. Two methods try to identify the deviations and eliminate the data of the corresponding concentrations completely from the curve fitting process. Two methods use the iterative weighted least squares method to assign lower weights to deviations, and another weighted method uses weights obtained from an exponentially decreasing function. All these methods are then compared to a baseline method with respect to how accurately they estimate the true EC20 value. On average, the methods that eliminate the deviations are the most accurate, but they become considerably worse than any other method when correct data are eliminated instead of the truly deviating data. To mitigate this, we recommend combining one of them with a weighted method. On average, the weighted method gives a less accurate estimate, since it does not completely eliminate the deviating data, but it can help validate the deviations that the other method has identified, reducing the risk of wrongly eliminating correct data from an already small dataset.

        64288205677

        Speaker: Huiying Zhou Zhou (TU Dortmund University)
      • 11:03
        A comparison of variable selection approaches for spline regression 18m

        Multivariable regression models are a powerful statistical tool with innumerable applications in explanatory and predictive settings. One key challenge is variable selection, that is, deciding which variables to include or exclude, particularly when dealing with large numbers of candidate predictors. In biomedical data, non-linear relationships between the candidate predictors and the outcome of interest may be present. Generalized additive models (GAMs), which represent non-linear predictor effects using spline functions, are a popular method for modeling non-linear relationships (Perperoglou et al. 2019). In GAMs, the challenge of choosing an appropriate set of variables is further complicated by the simultaneous estimation of appropriate functional forms for the selected variables.
        In this talk, we present an overview of selection algorithms in the context of GAMs (Marra, Wood 2011; Kovács 2024; Breheny, Huang 2015). The investigated selection methods can be grouped into stepwise (forward selection, backward elimination) and shrinkage approaches. The latter include the group LASSO, the group smoothly clipped absolute deviation (group SCAD), the non-negative garrote, and the double penalty approach in which both the range space and the null space of the penalty are shrunken.
        To the best of our knowledge, many of these have not been used before in the context of GAMs and have not been formally compared. We compare the different approaches in a controlled simulation study, using a plasmode framework based on an open-source dataset to retain potentially complex correlation structures. The focus of the simulation study is the performance of the different resulting models, particularly with respect to calibration, discrimination, selection rates, model size and integrated squared loss (Ullmann et al. 2025). The simulation results provide valuable insights into the properties of the different selection approaches, marking an essential step toward building the evidence base for guidance regarding their use in explanatory and predictive analyses.

        Breheny, P., Huang, J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput 25 (2015).
        Kovács, L. Feature selection algorithms in generalized additive models under concurvity. Comput Stat 39 (2024).
        Marra, G., Wood, S.N. Practical variable selection for generalized additive models. CSDA 55 (2011).
        Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. A review of spline function procedures in R. BMC Med Res Methodol 19 (2019).
        Ullmann, T., Heinze, G., Abrahamowicz, M. et al. A Systematic Categorization of Performance Measures for Estimated Non-Linear Associations Between an Outcome and Continuous Predictors. WIREs Comp Stats 17 (2025).

        21429416749

        Speaker: Franziska Kappenberg (University of Bonn, Medical Faculty, Institute for Medical Biometry, Informatics and Epidemiology)
      • 11:21
        A Clustered-Metric Simulation Study Comparing Flexible Regression Techniques for Non-Linear Associations 18m

        In regression modeling, relationships between continuous predictors and outcomes are often assumed to be linear, yet allowing for non-linear associations can substantially improve model performance. A variety of methods for flexible regression—such as fractional polynomials and spline-based approaches—have been proposed to model non-linear associations. However, comprehensive and systematic simulation studies comparing multiple flexible regression techniques remain scarce. Such comparisons are essential for guiding researchers in selecting appropriate methods in different data settings.
        We present results from a simulation study that systematically compares several flexible regression approaches. A central feature of our study design is the careful selection of performance measures used to assess the curves estimated by the different methods. Different measures may capture different aspects of the curves and therefore favor different methods. This was recently illustrated in our publication on a categorization of performance metrics for evaluating non-linear associations between continuous predictors and outcomes [1], published on behalf of Topic Group 2 of the Strengthening Analytical Thinking in Observational Studies (STRATOS) initiative. Because the categorization includes a wide range of measures, we propose a novel strategy to identify subsets of measures that capture distinct aspects of the estimated curves. Our "clustered-metric strategy" is based on cluster analysis of the performance measures to detect groups of measures that attribute similar performance to the methods. By applying our clustered-metric evaluation to the results of our simulation study, we demonstrate that this approach may reduce redundancies and facilitate a clearer interpretation of the methods' performance while avoiding selective reporting.
        Our proposed clustered-metric evaluation illustrates how different performance measures align – or diverge – in assessing model quality. It is a transparent and concise strategy to report results of a comprehensive simulation study, well suited for comparing flexible regression techniques, but also transferable to other topics.

        [1] Ullmann, T., Heinze, G., Abrahamowicz, M., Perperoglou, A., Sauerbrei, W., Schmid, M., Dunkler, D., for TG2 of the STRATOS Initiative, 2025. A Systematic Categorization of Performance Measures for Estimated Non‐Linear Associations Between an Outcome and Continuous Predictors. Wiley Interdisciplinary Reviews: Computational Statistics, 17(3), e70042.

        64288216029

        Speaker: Theresa Ullmann (Institute of Clinical Biometrics, Center for Medical Data Science, Medical University of Vienna)
      • 11:39
        Detection of changes in time series of preclinical measurements for selecting Virtual Control Groups 18m

        Virtual Control Groups (VCGs) represent an approach in which historical control data (HCD) from previous animal studies are used to replace animals in current control groups. The VICT3R project (Developing and implementing VIrtual Control groups To reducE animal use in toxicology Research), funded by the Innovative Health Initiative (IHI), aims to reduce the use of animals in toxicological research by implementing VCGs.

        When using HCD to replace current control groups with VCGs, it is essential to carefully match the HCD with the characteristics of the current control groups. To achieve comparability between treated and control groups, reducing or even eliminating potential confounding effects is key. Therefore, an HCD pool of suitable control animals that could serve as VCGs must first be created by filtering and matching historical data to the legacy study. Established methods for selecting matching observations include clustering techniques and propensity score matching. VCGs can then be generated by sampling animals from this filtered HCD pool.

        Typically, studies conducted within the last five years are considered when constructing VCGs in toxicological contexts. However, this fixed time window may not always reflect periods with stable observations. We therefore introduce a new approach to determine the time interval during which observations remain stable, using a method for changepoint detection. In the underlying model, a constant function is assumed between changepoints. The most recent estimated changepoint indicates the start of the time interval with stable values, which represent observations that could be used as virtual controls.

        The proposed method is applied to data from the VICT3R database on male Sprague-Dawley rats from 28-day studies. To evaluate its performance, simulation studies are conducted by generating observations across multiple studies over time and introducing artificial changepoints into the data.
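
        The changepoint idea described above (a constant level between changepoints, with the most recent changepoint marking the start of the stable window) can be sketched as follows. This is a generic penalized least-squares segmentation solved by dynamic programming, chosen here for illustration; the abstract does not state which changepoint method is used, and the data and penalty value are made up.

```python
def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the segment mean for observations y[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n


def detect_changepoints(y, penalty):
    """Piecewise-constant mean segmentation; returns the indices where new segments start."""
    n = len(y)
    prefix, prefix_sq = [0.0], [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)
        prefix_sq.append(prefix_sq[-1] + value * value)

    best = [0.0] * (n + 1)   # best[j]: minimal penalized cost of y[:j]
    best[0] = -penalty       # so that the first segment carries no net penalty
    last = [0] * (n + 1)     # last[j]: start index of the final segment of y[:j]
    for j in range(1, n + 1):
        candidates = [
            (best[i] + segment_cost(prefix, prefix_sq, i, j) + penalty, i)
            for i in range(j)
        ]
        best[j], last[j] = min(candidates)

    # Backtrack through the stored segment starts.
    changepoints = []
    j = n
    while last[j] > 0:
        changepoints.append(last[j])
        j = last[j]
    return sorted(changepoints)


# Hypothetical yearly study-level means with a level shift after the sixth value.
y = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 12.1, 11.9, 12.0, 12.2, 11.8, 12.0]
changepoints = detect_changepoints(y, penalty=1.0)
stable_start = changepoints[-1] if changepoints else 0
stable_window = y[stable_start:]
```

        Observations in `stable_window` would be the candidates for virtual controls; the penalty controls how readily a new changepoint is declared and would need tuning in practice.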

        21429408127

        Speaker: Wiebke Dammann (TU Dortmund University - Department of Statistics)
      • 11:57
        Limitations and Challenges of Mixed Models Repeated Measures (MMRM) Analysis 18m

        Repeated measures data are commonly encountered in a wide variety of disciplines including business, agriculture and medicine. They entail collection of multiple measurements from the same unit or subject over time, space or both. The fact that observations from the same unit will not be independent poses particular challenges to the statistical procedures used for the analysis of such data. Longitudinal data are a special case of repeated measures. In a longitudinal context, data are clustered within patients; thus, a random effect remains constant within a patient but changes across patients. Mixed models for repeated measures (MMRM) are suited for modeling continuous outcomes measured at discrete time points or within defined time windows, and are hence applicable in balanced designs such as randomized controlled trials (RCTs), with time treated as a categorical factor. Typically, MMRM specifies no patient-level random effects, but instead models the correlation within the repeated measures over time through an unstructured correlation matrix of the residual errors. With highly unbalanced designs, MMRM may encounter considerable challenges associated with cross-level bias. If measurements occur on a more ad hoc basis, such that times of measurement vary across subjects, it may no longer be feasible to use MMRM. Even with balanced RCT designs, the choice of treating time as a categorical factor or a continuous variable depends on the research goal. If one is interested in studying the functional relationship between the outcome and time, it is appropriate to treat time as a continuous variable, which is not feasible within MMRM. Linear mixed-effects (LME) models consider both fixed and random effects and hence allow considerable modeling flexibility. In our case study, we analyze data from a 2-treatment by 2-period crossover trial within the MMRM and LME modeling frameworks, applying Grizzle's model, the James & Kenward model and a piecewise linear model.

        96432301484

        Speaker: Moses Mwangi (Kenya Medical Research Institute, Kenya; University of Hasselt, Belgium)
    • 10:45 12:15
      YSS1 (ROeS) Room 1 B

      Convener: Andrea Berghold (Medical University of Graz)
      • 10:45
        Navigating Multiplicity in Multiverse Analyses: A Simulation Study and Case Application to Lung Cancer Staging Using SEER Data 18m

        Multiverse analysis offers a powerful framework to assess the robustness of statistical inferences across a spectrum of plausible analytical choices. However, when applied to predictor selection, especially in high-dimensional settings, the issue of multiplicity becomes critical. In this study, we present a comprehensive simulation framework to evaluate the impact of different multiple testing correction strategies—such as Bonferroni, Benjamini-Hochberg, and permutation-based approaches—on the stability and interpretability of predictor selection within multiverse analyses.
        We further demonstrate the practical implications of these findings through a case study on lung cancer staging, using data from the SEER (Surveillance, Epidemiology, and End Results) program. Target variables include clinical stage classifications, with predictors drawn from demographic, tumor-specific, and treatment-related features. Our multiverse approach explores variations in preprocessing, model specification, and selection criteria, revealing how multiplicity corrections influence the perceived robustness of key predictors.
        Results highlight the trade-offs between false discovery control and model generalizability, emphasizing the need for transparent reporting and principled correction strategies in multiverse workflows. This work contributes to the growing discourse on reproducibility and robustness in biomedical data science, offering practical guidance for researchers navigating complex modeling landscapes.
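
        Two of the correction strategies named above, Bonferroni and Benjamini-Hochberg, can be sketched in a few lines. The p-values below are invented for illustration and do not come from the SEER case study; the permutation-based approach is not shown.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H_i if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for eight candidate predictors in one analysis path.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
selected_bonferroni = bonferroni(pvalues)
selected_bh = benjamini_hochberg(pvalues)
```

        In this toy example Benjamini-Hochberg selects one more predictor than Bonferroni, which is the kind of difference that can change how robust a predictor appears across a multiverse of specifications.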

        References
        1. Steegen S, Tuerlinckx F, Gelman A, Vanpaemel W. Increasing Transparency Through a Multiverse Analysis. Perspect Psychol Sci. 2016 Sept 1;11(5):702–12.
        2. Streiner DL. Best (but oft-forgotten) practices: the multiple problems of multiplicity—whether and how to correct for many statistical tests. Am J Clin Nutr. 2015 Oct;102(4):721–8.

        53573505644

        Speaker: Gloria Brigiari (Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova)
      • 11:03
        Modeling Antibody Kinetics in Pregnant and Lactating Women Following COVID-19 and Tdap Vaccination 18m

        Despite the availability of vaccines, infectious diseases such as COVID-19, tetanus, diphtheria, and pertussis remain persistent public health threats, particularly among vulnerable populations including pregnant and lactating women. As most research on protection against infectious diseases to date has focused on antibody-mediated responses, understanding how antibodies behave over time remains an important aspect for elucidating disease dynamics.
        Although both COVID-19 and Tdap (tetanus, diphtheria, acellular pertussis) vaccines are approved and recommended for pregnant and lactating women in Belgium, limited data exist on antibody kinetics and potential interactions when both vaccines are administered in these populations. To address this gap, the MATabMATHics project investigates how Tdap and COVID-19 vaccines influence vaccine-induced antibody kinetics, as well as interactions between these vaccines. More specifically, serum samples from participants in the PREGCOVAC.BE trial were analyzed, in which pregnant and lactating women received a COVID-19 vaccine, and additionally received a Tdap vaccine during pregnancy (for lactating women, during pregnancy preceding the lactation period).
        In order to study the dynamics of vaccine-induced antibodies, a conditional linear mixed modeling approach has been adopted. We modeled SARS-CoV-2- and Tdap-specific antibodies individually, yielding two pathogen-specific modeling frameworks accommodating different vaccination schedules (i.e., three vaccine doses for SARS-CoV-2 and one dose for Tdap), recorded prior and breakthrough COVID-19 infections, and including the timing of vaccination with the other vaccine as a conditional factor.
        The linear mixed model was implemented as a distributional regression model in GAMLSS (Generalized Additive Model for Location, Scale, and Shape). This approach allowed us to address left censoring due to lower limits of detection and to account for observed heteroscedasticity at the distributional level using variance modeling. The impact of (conditional) fixed effects was analyzed using conditional means.
        The two modeling frameworks, which were implemented for the SARS-CoV-2-specific RBD-IgG and the Tdap-specific PT antibodies, suggest that administering one vaccine in temporal proximity to the other may influence antibody kinetics, potentially resulting in lower antibody levels.

        96432311208

        Speaker: Lukas Frank Buchhäusl (Medical University of Graz)
      • 11:21
        Edgington's Method for Random-Effects Meta-Analysis 18m

        Meta-analysis can be formulated as the combination of p-values from multiple studies into a joint p-value function, from which inference for the average effect, including point estimates and confidence intervals, can be derived. We extend Edgington's p-value combination method to random-effects meta-analysis by treating the combined p-value function as a confidence distribution of the average effect and by incorporating uncertainty in heterogeneity estimation via a confidence distribution implied by the generalized heterogeneity statistic (Kronthaler and Held, 2025a). To quantify heterogeneity, another central task of random-effects meta-analysis, we propose constructing predictive distributions by integrating the normal effect distribution over both Edgington’s confidence distribution and the confidence distribution of the heterogeneity parameter (Kronthaler and Held, 2025b). The methods explicitly account for parameter uncertainty and represent it through full confidence and predictive distributions rather than providing only scalar or interval summaries.

        Simulation results indicate that confidence intervals achieve near-nominal coverage for more than three studies and heterogeneity. The point estimator exhibits small bias under model misspecification and substantial heterogeneity. Prediction intervals typically maintain nominal coverage for more than three studies, and both confidence and prediction intervals effectively capture skewness in effect estimates. In contrast, formulations of the methods which ignore parameter uncertainty often exhibit under-coverage. Overall, Edgington’s method, equipped with confidence distribution adjustments for heterogeneity uncertainty, has potential as a viable alternative or complement to classical random-effects meta-analysis.
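
        As a rough illustration of the underlying combination rule, the sketch below sums one-sided study p-values and evaluates the sum against the Irwin-Hall distribution (the distribution of a sum of independent uniforms). It is a simplified common-effect version that assumes normal study estimates with known standard errors and ignores heterogeneity; it is not the random-effects extension presented in the talk, and the function names and data are hypothetical.

```python
from math import comb, erf, factorial, floor, sqrt


def irwin_hall_cdf(s, n):
    """CDF of the sum of n independent Uniform(0, 1) variables, evaluated at s.
    The alternating sum is only numerically reliable for small n."""
    if s <= 0:
        return 0.0
    if s >= n:
        return 1.0
    total = sum((-1) ** k * comb(n, k) * (s - k) ** n for k in range(floor(s) + 1))
    return total / factorial(n)


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def edgington_p(mu, estimates, std_errors):
    """Combined one-sided p-value function at effect value mu (increasing in mu)."""
    pvals = [1.0 - normal_cdf((est - mu) / se) for est, se in zip(estimates, std_errors)]
    return irwin_hall_cdf(sum(pvals), len(pvals))


# Two hypothetical studies with equal standard errors.
estimates = [0.0, 1.0]
std_errors = [0.5, 0.5]
p_at_midpoint = edgington_p(0.5, estimates, std_errors)
```

        The value of mu at which the combined p-value function equals 0.5 serves as a point estimate, and the values at which it equals 0.025 and 0.975 bound a 95% confidence interval.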

        References:
        Kronthaler, D., & Held, L. (2025a). Edgington’s method for random-effects meta-analysis part I: Estimation. arXiv.

        Kronthaler, D., & Held, L. (2025b). Edgington’s method for random-effects meta-analysis part II: Prediction. arXiv.

        32144104964

        Speaker: David Kronthaler (Epidemiology, Biostatistics and Prevention Institute, Department of Biostatistics, University of Zurich)
      • 11:39
        Evaluation of cancer screening programmes: integration of the biological tumour growth model into the MOCCI method 18m

        Breast cancer remains one of the most common cancers among women worldwide. Breast cancer screening programmes aim to catch the disease at an early phase by regularly examining asymptomatic women for signs of cancer. The rationale is straightforward: early detection, before symptom onset, offers patients broader treatment options and improves the chances of recovery. To evaluate cancer screening programmes, we focus on two measures: lead time and overdiagnosis.

        While, in the real world, a person can either be or not be invited to the screening programme, in a counterfactual framework, a person is considered in both potential worlds: the one where they are invited to the screening programme and the one where they are not. Within this framework, lead time is defined as the difference between the times at which a person would have been diagnosed in the two worlds. Another important metric is overdiagnosis, referring to cancers detected by screening that would never have been identified had the person not been invited to the programme. For example, this can happen when a non-progressive tumour is detected at a screening visit, but the disease would never have progressed to the symptomatically detectable phase.

        The MOCCI method [1] was developed to jointly estimate lead time and overdiagnosis. Based on a comparison of cancer incidence between two groups (invited or not invited to the programme), it aims to find the distributions of lead time and overdiagnosis that best explain the difference in incidence, using maximum likelihood estimation (MLE) principles.

        In this work, we extend the MOCCI method by integrating the biological tumour growth model [2]. Tumour size at diagnosis, routinely recorded in cancer registries, offers valuable information that can be used for the lead time estimation. Alongside age and calendar time at diagnosis, tumour size can be introduced to the MOCCI method as a third modelling dimension. With this introduction, the focus of the estimation procedure shifts to the estimation of latent processes (e.g. tumour growth rate), from which the lead time and overdiagnosis can be estimated.

        We outline the extended MOCCI estimation procedure, focusing on the utilization of the tumour growth model, and present preliminary simulation results assessing the feasibility of the proposed approach.

        [1] Vratanar B, Pohar Perme M. Estimating lead time and overdiagnosis in cancer screening programmes: the MOCCI method, under review.
        [2] Isheden G, Humphreys K. Modelling breast cancer tumour growth for a stable disease population. Stat Methods Med Res. 2019 Mar;28(3):681-702.

        64288211608

        Speaker: Ema Požek (Faculty of Medicine, University of Ljubljana)
      • 11:57
        Comparative Analysis of Classification Models for Pharmaceutical Permeability Prediction 18m

        In this study, the PERMY data set, taken from Pharmaceutical Statistics Using SAS: A Practical Guide, is analyzed. It describes the permeability of cell membranes, that is, the ability of a molecule to cross a membrane. Biological structures are a complex layer of molecules and proteins. Substances require a particular structure to pass through the target membrane, and drugs that fail to demonstrate sufficient permeability should be excluded from further testing. For this reason, permeability is important in the early stages of drug development.
        The aim of this study is to compare several methods for binary classification, using 71 molecular properties whose meanings are not explicitly known. Due to possible collinearity and near-singular data matrices, the models were complemented with multicollinearity analysis and principal component analysis for dimension reduction. The following methods were compared: logistic regression (including stepwise, decision tree and cluster-based variable selection), decision trees, neural networks, random forests, gradient boosting trees and bagging trees. The data were split into training and validation subsets. To rank model performance, average squared errors were computed, while the confusion matrix and misclassification rate were used to assess classification accuracy for each algorithm. The statistics mentioned above were compared using the validation data. Additionally, to assess model fit and complexity of the candidate models, metrics such as the Gini index and the area under the ROC curve were evaluated.
        Special attention regarding possible interpretation was given to black box algorithms, primarily because of their robustness to near singular data matrices. These models were further explained using surrogate decision trees, which provide insight into variable importance and internal model structure.
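
        The validation metrics named above can be computed as in the following minimal sketch. The labels and predicted probabilities are invented, the 0.5 classification cut-off is an assumption, and the Gini index is taken here to be the rank-based coefficient 2 × AUC − 1, which may differ from the definition used in the study.

```python
def average_squared_error(labels, probs):
    """Mean squared difference between the 0/1 outcome and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(labels, probs)) / len(labels)


def misclassification_rate(labels, probs, cutoff=0.5):
    """Share of observations whose thresholded prediction differs from the label."""
    errors = sum(int(p >= cutoff) != y for y, p in zip(labels, probs))
    return errors / len(labels)


def auc(labels, probs):
    """Area under the ROC curve as the proportion of (positive, negative) pairs
    ranked correctly, counting ties as one half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels (1 = permeable) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

ase = average_squared_error(labels, probs)
error_rate = misclassification_rate(labels, probs)
area = auc(labels, probs)
gini = 2 * area - 1
```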

        64288210087

        Speaker: Jana Habus-Korbar (Student at University of Zagreb, Faculty of Science, Department of Mathematics)
    • 12:15 13:45
      Lunch break 1h 30m
    • 13:45 15:15
      Clinical trials 2 Room 13 B

      Convener: Sonja Zehetmayer (Medical University of Vienna)
      • 13:45
        A systematic empirical comparison of different statistical approaches for a multi-aspect analysis of clinical trial data in rare diseases 18m

        The servEB project (WISS 2025, federal state of Salzburg, 20102/F2300645-FPR) combines clinical expertise, advanced statistical analyses, and AI-driven image classification technology to improve the assessment of trial outcomes in rare diseases, using Epidermolysis Bullosa research as an example. When defining meaningful endpoints, multiple aspects of the disease have to be considered, including quantitative, validated outcomes (e.g., number of lesions) as well as patient-relevant outcomes (e.g., quality of life, pain, pruritus). Accordingly, the statistical analysis approach should appropriately account for the multi-faceted characteristics of those outcomes. We address this challenge by systematically comparing a range of uni- and multivariate statistical methods with respect to both their empirical performance (i.e., type-I-error rates and power) and the interpretation and properties of the respective estimands. Results indicate type-I-error control of the evaluated nonparametric approaches at the two-sided 5% level and good performance in terms of empirical power in moderate to large sample sizes. Specifically, the R package npmv yielded stable and competitive power at both very small and large sample sizes. Semiparametric MANOVA achieved the highest power but with a highly liberal type-I-error rate. First results look promising with respect to the potential for significant improvements in clinical trial design and patient care.

        Speaker: Martin Geroldinger (Research Program Biomedical Data Science, Paracelsus Medical University Salzburg)
      • 14:03
        Assessing covariate-adjusted risk differences in small-sample trials: A comparative evaluation of statistical methods 18m

        Binary endpoints are commonly used to measure clinical outcomes in randomized controlled trials. In this context, conditional odds ratios (ORs) based on logistic regression have been routinely used as a population-level summary to quantify treatment effects. However, ORs have been criticized for a lack of interpretability, non-collapsibility, and sensitivity to model specification. In response, risk differences (RDs) have gained traction as a more interpretable and clinically relevant measure that better aligns with typical estimands of interest in clinical trials. However, assessing covariate-adjusted RDs, especially in the small-sample settings (N ≤ 150) typical of early-phase trials, remains methodologically challenging. Motivated by recent regulatory guidance and ongoing methodological discussions on covariate adjustment for unconditional estimators, we systematically evaluate a broad set of statistical methods for assessing RDs in a large-scale simulation study, including various g-computation approaches, Mantel-Haenszel methods, and unconditional tests. Our findings reveal that some g-computation variants with parametric variance estimators fail to maintain nominal Type I error rates in small samples. In contrast, bootstrap-based and Mantel-Haenszel methods may offer a more favorable balance between error control and statistical power. Based on our results, we provide practical recommendations to guide practitioners in selecting statistical methods that (1) target a desired estimand, (2) perform reliably under small-sample conditions, and (3) balance robustness, efficiency, and interpretability. Thereby, we hope to support more reliable and clinically meaningful inference from small-sample clinical trials.
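
        As a rough illustration of two of the estimator families named above, the following sketch computes a Mantel-Haenszel risk difference and a standardized (g-computation style) risk difference from stratified 2x2 counts. The counts are invented for illustration and are not from the study. The standardized estimator assumes a single categorical covariate and a saturated outcome model, in which case g-computation reduces to averaging stratum-specific risk differences over the covariate distribution; variance estimation, the focus of the abstract, is not covered.

```python
# Each stratum: (events_treated, n_treated, events_control, n_control).
# Hypothetical counts, chosen only to make the arithmetic easy to follow.
STRATA = [(10, 20, 5, 20), (6, 10, 9, 30)]


def mantel_haenszel_rd(strata):
    """Mantel-Haenszel risk difference with weights n1 * n0 / N per stratum."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        weight = n1 * n0 / (n1 + n0)
        numerator += weight * (a / n1 - c / n0)
        denominator += weight
    return numerator / denominator


def standardized_rd(strata):
    """Stratum-specific risk differences averaged over the pooled
    covariate distribution (g-computation under a saturated model)."""
    total = sum(n1 + n0 for _, n1, _, n0 in strata)
    return sum((n1 + n0) / total * (a / n1 - c / n0) for a, n1, c, n0 in strata)


rd_mh = mantel_haenszel_rd(STRATA)
rd_g = standardized_rd(STRATA)
print(f"Mantel-Haenszel RD: {rd_mh:.4f}")
print(f"Standardized RD:    {rd_g:.4f}")
```

        The two estimators weight the strata differently, so they can disagree whenever the stratum-specific risk differences are not identical.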

        Speaker: Martin Schnuerch (Global Biostatistics & Data Sciences, Boehringer Ingelheim Pharma GmbH & Co. K)
      • 14:21
        Allocation Bias in Rare Diseases Clinical Trials with Multiple Endpoints 18m

        Background:
        Multiple endpoints are a major topic of discussion in rare disease research, particularly with regard to patient-centered outcome measures, as they allow for a more comprehensive assessment of treatment effects. However, a critical challenge in these trials is allocation bias, as they are often unblinded or single-blinded. Allocation bias arises when future treatment allocations can be predicted from prior ones, potentially leading to the preferential assignment of patients with specific characteristics to either the treatment or control group. Despite its potential impact, the effect of allocation bias in clinical trials with multiple endpoints remains insufficiently studied.

        Methods:
        To quantify allocation bias in two-arm parallel group trials with continuous multiple endpoints, we derived a biasing policy based on the convergence strategy of Blackwell and Hodges. We assessed the impact of allocation bias by evaluating type I error rates of various multiple testing approaches, including the Bonferroni correction, all-or-none, and Wei-Lachin methods, in the presence of bias. In a simulation study we computed these type I error rates across various randomization procedures and evaluated whether allocation bias leads to inflated error rates.
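
        The general idea of a biasing policy built on the convergence strategy can be sketched as follows. This is a simplified single-endpoint illustration, not the authors' implementation: the investigator guesses that the next patient goes to the arm with fewer allocations so far and enrols a patient whose expected response is shifted by a hypothetical amount eta in favour of the experimental arm. The z-test with known unit variance, the sample size, and the value of eta are all assumptions made to keep the example short.

```python
import random
from statistics import NormalDist, mean


def permuted_block_sequence(n, block_size, rng):
    """Allocation sequence from permuted blocks with 1:1 allocation."""
    seq = []
    while len(seq) < n:
        block = ["E"] * (block_size // 2) + ["C"] * (block_size // 2)
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n]


def complete_randomization_sequence(n, rng):
    """Allocation sequence from independent fair coin flips."""
    return [rng.choice("EC") for _ in range(n)]


def biased_responses(seq, eta, rng):
    """Responses under H0 with selection driven by the convergence strategy."""
    n_e = n_c = 0
    responses = []
    for arm in seq:
        if n_e < n_c:        # guess: next is experimental, enrol a "better" patient
            shift = eta
        elif n_e > n_c:      # guess: next is control, enrol a "worse" patient
            shift = -eta
        else:                # tie: no guess, no selection
            shift = 0.0
        responses.append(rng.gauss(shift, 1.0))
        if arm == "E":
            n_e += 1
        else:
            n_c += 1
    return responses


def rejects(seq, y, alpha=0.05):
    """Two-sided z-test assuming known unit variance (a simplification)."""
    e = [v for a, v in zip(seq, y) if a == "E"]
    c = [v for a, v in zip(seq, y) if a == "C"]
    if not e or not c:
        return False
    z = (mean(e) - mean(c)) / (1 / len(e) + 1 / len(c)) ** 0.5
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def type_one_error(make_seq, n=24, eta=0.5, reps=4000, seed=1):
    """Empirical rejection rate under H0 in the presence of allocation bias."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        seq = make_seq(n, rng)
        hits += rejects(seq, biased_responses(seq, eta, rng))
    return hits / reps


pbr = type_one_error(lambda n, rng: permuted_block_sequence(n, 4, rng))
cr = type_one_error(complete_randomization_sequence)
print(f"permuted blocks of 4:   {pbr:.3f}")
print(f"complete randomization: {cr:.3f}")
```

        In this toy setting the guesses are right more often than wrong under small permuted blocks, which shifts the difference in means away from zero; under complete randomization the guesses carry no information about the next allocation.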

        Results:
        Simulations show that allocation bias inflates type I error rates, leading to incorrect statistical conclusions. Even small bias effects can cause the nominal 5% significance level to be exceeded. The extent of inflation depends on the chosen randomization procedure and the multiple testing approach used. Less restrictive randomization procedures, such as Complete Randomization and the Big Stick Design, exhibited the lowest type I error inflation, while Permuted Block Randomization resulted in the highest type I error inflation.

        Conclusion:
        Allocation bias threatens the validity of clinical trials and should be minimized through careful study design. In particular, selecting a randomization procedure that reduces susceptibility to allocation bias is crucial. Regardless of the analytical approach, adopting less restrictive randomization procedures, such as the Big Stick Design, can reduce allocation bias and improve the reliability of trial results. Informing the scientific community that the Big Stick Design outperforms Permuted Block Designs in preventing allocation bias effects on test decisions is particularly important, especially given that the randomization section of the ICH E9 guideline still refers to block randomization. The developed methodology provides guidance on selecting bias-mitigating randomization procedures, contributing to more robust trial designs. Moreover, this approach can be extended to enable bias-adjusted testing, offering a way to correct for allocation bias and ensure more valid results in rare disease trials.

        Speaker: Stefanie Schoenen (RWTH Aachen)
      • 14:39
        When randomization is not random: Allocation bias in small sample, group sequential randomized clinical trials 18m

        Even in rare diseases, where the sample size is limited and blinding is less frequently implemented, randomized controlled trials are considered the gold standard to prove efficacy. Randomization is used to mitigate bias, and regulatory guidance recommends investigating the impact of bias on the test decision. We quantified how allocation bias affects the test decision in small-sample two-arm group sequential trials under a biasing policy based on the Blackwell-Hodges convergence strategy. Type I error and power were evaluated under Lan-DeMets spending (Pocock, O'Brien-Fleming, Wang-Tsiatis-type functions), with and without futility (non-binding, binding), varying interim timing, number of looks and stage-wise restarting of randomization. Allocation bias inflated type I error most for more restrictive randomization procedures, especially permuted blocks with small block sizes. Spending more alpha at interim reduced inflation. Non-binding futility reduced type I error, while binding futility increased type I error inflation for more aggressive stopping boundaries. Stage-wise restarting modestly reduced inflation for most procedures. Overall, group sequential choices had a secondary effect and did not rescue a predictable randomization scheme. When allocation bias cannot be ruled out (e.g. open-label trials), we recommend less restrictive randomization procedures (e.g. big stick design) or, if using permuted blocks, large block sizes.

        Speaker: Daniel Bodden (RWTH Aachen University)
      • 14:57
        Multiple Treatment Arms, Multiple Biases? Allocation and chronological biases in platform trials 18m

        In rare diseases, the need for innovative clinical trial designs is increasing. Platform trials are becoming particularly popular, as they allow for flexible adding and dropping of arms and reduce sample size requirements by using a shared control. In a platform trial setting with two experimental arms and one control, we use clinical trial simulations to quantify the impact on operating characteristics such as type I error rate, power, and bias caused by allocation and chronological bias.

        Especially if a trial is not blinded, allocation bias may damage the integrity of the trial. If the researcher could predict the next allocation, it might be tempting to include a “better” patient if the next treatment is more likely to be an experimental treatment, using the information from a prognostic marker. To quantify the allocation bias, we evaluate and compare different allocation biasing policies like the Blackwell-Hodges convergence strategy. We investigate the impact on the type I error rate for different randomization methods such as permuted block randomization or complete randomization. Furthermore, we evaluate both one-step and two-step randomization for these methods and we explore how the error rate changes depending on the entry of the treatment arm of interest.

        A chronological bias can occur if there are time-related changes in the outcome, e.g., caused by changes in the patient population when treatment arms leave or enter the trial. We explore different time trends such as step functions, linear, or seasonal trends.

        We show how the results are impacted when using either only concurrent controls, or enriching the analysis by using non-concurrent control data from patients who were included in the study before the treatment of interest joined the platform trial.

        Results show that allocation bias substantially inflates Type I error when using permuted block randomization, particularly with small block sizes. It is less pronounced when using complete randomization. Chronological bias causes the highest type I error inflation in the presence of monotonically increasing trends, especially when simply pooling all available controls, i.e., both concurrent and non-concurrent. However, using an ANOVA model adjusted for time periods, rather than a standard t-test, corrects for the type I error inflation caused by time trends.
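
        The effect of a time trend on pooled versus period-adjusted analyses can be illustrated with a minimal sketch. The layout is an assumption chosen for simplicity, not the simulation design of the talk: two periods, a step trend between them, controls in both periods, and an experimental arm that enters only in period 2. In this particular layout, the treatment estimate from a linear model with a period factor coincides with the comparison against concurrent controls, so that difference is used as the adjusted estimate.

```python
import random
from statistics import mean


def simulate_once(rng, n=20, step=1.0):
    """One trial under H0 (no treatment effect) with a step time trend.

    Period 1 holds non-concurrent controls only; period 2 holds concurrent
    controls and the experimental arm, both shifted by `step`.
    """
    controls_p1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    controls_p2 = [rng.gauss(step, 1.0) for _ in range(n)]
    treated_p2 = [rng.gauss(step, 1.0) for _ in range(n)]

    # Naive analysis: pool concurrent and non-concurrent controls.
    pooled = mean(treated_p2) - mean(controls_p1 + controls_p2)
    # Period-adjusted analysis: equals the concurrent comparison in this layout.
    adjusted = mean(treated_p2) - mean(controls_p2)
    return pooled, adjusted


rng = random.Random(2026)
results = [simulate_once(rng) for _ in range(2000)]
pooled_bias = mean(p for p, _ in results)
adjusted_bias = mean(a for _, a in results)
print(f"mean pooled estimate under H0:   {pooled_bias:.3f}")
print(f"mean adjusted estimate under H0: {adjusted_bias:.3f}")
```

        With equal group sizes, half of the pooled controls come from before the step, so the pooled estimate is expected to be off by about half the step size, while the adjusted estimate is centred at zero.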

        Speaker: Nico Bruder (Institute of Medical Statistics, Medical University of Vienna)
    • 13:45 15:15
      IS5: Navigating in murky waters: reproducing preclinical findings Room 1 A

      Room 1 A

      Convener: Frank Konietschke (Charité - Universitätsmedizin Berlin Campus Mitte)
      • 13:45
        Reproducibility and Ethics in Nonclinical Statistics: Building Trust 30m

        Scientific integrity is the cornerstone of progress in biomedical research. Nowhere is this more critical than in nonclinical settings. Reproducibility – the ability to consistently replicate findings across studies, laboratories and organisations – is not just a technical requirement. It is a fundamental attribute that underpins trust. As nonclinical research continues to expand in scope and complexity, it is both a scientific and moral responsibility.

        Nonclinical research and manufacturing environments present distinct challenges to reproducibility. Biological variability, evolving experimental models, and the diversity of data sources can introduce uncertainties that complicate both study design and interpretation. Constraints on sample sizes, limitations in animal models, and the pressure for rapid innovation further increase the risk of irreproducible results. In manufacturing, the translation of laboratory findings to scalable, reliable processes demands rigorous validation and ongoing monitoring, all while navigating shifting regulatory landscapes.

        Artificial Intelligence (AI) and Machine Learning (ML) are transforming nonclinical statistics by enabling the analysis of complex, high-dimensional data and automating aspects of experimental design and quality control. These technologies hold immense promise for uncovering novel insights and driving efficiency. However, they also introduce new complexities that can threaten reproducibility if not carefully managed.

        Transparent reporting is essential to enable independent verification, foster collaboration, and accelerate scientific progress. Clear documentation of methodologies, data sources, analytical decisions, and limitations allows others to reproduce results and build upon previous work.

        Nonclinical statisticians have too often been viewed as on-demand specialists. This presentation will focus on how the rise of data-centric roles and powerful technologies opens the door to a more integrated, collaborative and strategic role for statisticians in drug development and manufacturing. In the era of big data and AI-driven research, statisticians can play a pivotal role in safeguarding methodological quality and ensuring that findings are not artifacts of flawed design or analysis.

        Speaker: Helena Geys (Johnson&Johnson Innovative Medicine R&D)
      • 14:15
        The Impact of Methodological Rigor on Reproducibility in Biomedical Research 20m

        Low rates of replicability in early phase biomedical research hinder progress and putatively cause high attrition rates in clinical trials. To improve evidence generation processes, preclinical confirmatory studies and preregistration offer potentially effective strategies. By comparing conduct and outcome of preclinical studies utilizing such strategies, we examined how different degrees of rigor improve evidence generation.
        We evaluated experimental rigor of preclinical studies by extracting measures to reduce risk of bias, sample sizes, and effect sizes of primary outcomes from three different preclinical data sets: (A) preregistered single-laboratory animal studies from two study registries; (B) published multi-laboratory animal studies, extending an existing dataset; and (C) a unique set of confirmatory multi-laboratory projects, including associated single-laboratory exploratory data, from a dedicated German funding call.
        Methodological rigor in reported results increased across all three data sets relative to exploratory experiments. In preregistered studies, time from preregistration to publication was shorter in experiments with high methodological rigor. Reliability qua sample size of preregistered experiments was heterogeneous, with frequently optimistic effect size assumptions. In confirmatory experiments, sample sizes were increased in comparison to the exploratory phase, resulting in higher reliability. Effect sizes were decreased in multi-laboratory and confirmatory studies. In preregistered studies, effect sizes were also lower than preregistered values. The magnitude of this decrease was positively associated with measures of experimental rigor. In the confirmatory studies, approximately 80% of experiments failed to generate hypothesis-confirming evidence on several replication assessment criteria. Similar results were obtained in multi-laboratory studies and preregistrations. A follow-up in-silico investigation into the test severity of diverse replication criteria revealed large differences in sensitivity and specificity between criteria.
        Our preregistered side-by-side analysis of three complementary data sets that employ rigorous research practices such as preregistration and multi-laboratory study design demonstrates how such strategies meaningfully contribute to preclinical evidence generation processes. Decisions to move to clinical trials will benefit particularly from confirmatory multi-laboratory designs. Such trials provide relevant information on efficacy and are well suited to effectively reduce decision uncertainty.

        Speaker: Ulf Toelch (BIH QUEST Center for Responsible Research)
      • 14:35
        Planning animal experiments based on estimation error considerations 20m

        Animal experiments are often purely exploratory, with little to no data available to support the planning phase. Nonetheless, ethical guidelines demand scientifically sound biometric planning. The experimental designs are typically complex, involving numerous experimental groups and adaptive steps, which complicates statistical planning.

        In recent years, statistical aspects of such experimental designs have been increasingly advocated by authorities and the scientific community [1]. However, so far, no statistical approach actually acknowledges the complexity of the experimental designs. Instead, statistical planning usually focuses on a small subset of the design, e.g., a two-group comparison, and applies classical biometrical methodology from clinical trials, i.e., 5% type I error rate, 80% power, and an a priori estimated effect size. Often, with the argument of the experiment being “exploratory” instead of “confirmatory”, this biometric justification for the sample size of the two-group comparison is extrapolated to the rest of the experiment. Even though it is widely known that effect sizes from animal experiments are strongly biased and suffer from poor replicability and translation to clinical trials [2, 3], little emphasis has been put on this remarkable gap between experimental research and statistical planning.

        We demonstrate that common design practices in animal experiments introduce substantial error in effect size estimation, even if properly adjusted for inflated type I error rates and false discovery rates. To address this, we propose a simulation-based approach to quantify the estimation error and to classify its magnitude compared to a reference design, enabling an intuitive assessment of the suitability of a specific experimental design. Additionally, we present and discuss resampling estimators that improve effect estimation and reduce estimation error even in complex experimental designs, consequently contributing to the reproducibility of preclinical findings.

        [1] Piper, Sophie K., et al. "Statistical review of animal trials—A guideline." Biometrical Journal, 65(2): 1-12. 2023.
        [2] Kimmelman, Jonathan, Jeffrey S. Mogil, and Ulrich Dirnagl. "Distinguishing between exploratory and confirmatory preclinical research will improve translation." PLoS biology, 12(5): 1-4. 2014.
        [3] ter Riet, Gerben, et al. "Publication bias in laboratory animal research: a survey on magnitude, drivers, consequences and potential solutions." PLoS ONE, 7(9): e43404. 2012.

        Speaker: Dario Zocholl (University of Bonn, Medical Faculty, Institute for Medical Biometry, Informatics and Epidemiology)
      • 14:55
        Sample Size Minimisation in Preclinical Animal Trials 20m

        In preclinical animal studies, researchers often have a certain degree of freedom when it comes to selecting the exact statistical analysis strategy for their experiment. Ideally, this analysis strategy should be specified prior to the experiment (and preregistered, if possible), with sample size planning conducted in accordance with the chosen analytical approach. Sample size calculations performed for different potential analysis strategies may yield substantially different estimates of the required number of units to achieve a desired level of statistical power. In animal experiments in particular, achieving adequate statistical power with the smallest possible sample size is desirable for ethical, financial, and practical reasons. Consequently, when multiple analysis strategies are possible, researchers may calculate sample sizes for each strategy and select the one that requires the fewest animals to reach the desired statistical power. At first glance, such sample size minimisation appears both reasonable and ethically appealing.

        However, sample size planning is often based on prior data (e.g., from pilot experiments or previously published studies), which may be affected by publication or follow-up bias to an unknown extent. As a result, sample size estimates derived from using these data are often too small to achieve the intended statistical power. Selecting the analysis strategy that yields the smallest of these underestimated sample sizes can further reduce the actual power of the study. This is problematic for reproducing preclinical animal trials because underpowered studies result in a higher number of false negatives, making it less likely that the results of a previously published study with a true effect can be reproduced. Minimising the sample size in this way may thus constitute a questionable research practice and could be more appropriately described as ‘sample size hacking’ (in analogy to ‘p-hacking’). In this project, we formalize and discuss this concept and conduct simulation studies to quantify the impact of ‘sample size hacking’ on statistical power across various scenarios.
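
        The mechanism described above can be sketched with a toy simulation. All settings here are assumptions for illustration and do not reproduce the authors' scenarios: every candidate analysis strategy is treated as a two-sample z-test whose planning effect size is a noisy estimate of the same true standardized effect, and the "hacked" plan simply takes the smallest resulting sample size.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist, mean

Z = NormalDist()


def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample z-test."""
    return ceil(2 * (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2 / d ** 2)


def actual_power(n, d_true, alpha=0.05):
    """Power of the same test when the true standardized effect is d_true."""
    crit = Z.inv_cdf(1 - alpha / 2)
    shift = d_true * sqrt(n / 2)
    return (1 - Z.cdf(crit - shift)) + Z.cdf(-crit - shift)


def simulate(d_true=0.5, noise_sd=0.2, n_strategies=3, reps=5000, seed=7):
    """Average true power for a prespecified strategy versus the strategy
    chosen because it requires the fewest animals."""
    rng = random.Random(seed)
    prespecified, hacked = [], []
    for _ in range(reps):
        # Noisy planning effect sizes, floored to keep sample sizes finite.
        d_hats = [max(0.1, rng.gauss(d_true, noise_sd)) for _ in range(n_strategies)]
        sizes = [n_per_group(d) for d in d_hats]
        prespecified.append(actual_power(sizes[0], d_true))
        hacked.append(actual_power(min(sizes), d_true))
    return mean(prespecified), mean(hacked)


power_fixed, power_hacked = simulate()
print(f"mean true power, prespecified strategy: {power_fixed:.3f}")
print(f"mean true power, smallest sample size:  {power_hacked:.3f}")
```

        Picking the minimum sample size is equivalent to picking the most optimistic effect estimate, so the selected plan can never have higher true power than the prespecified one in this setup.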

        Speaker: Nicole Ellenbach (Institute for Medical Information Processing, Biometry and Epidemiology, Faculty of Medicine, LMU Munich, Germany and Munich Center for Machine Learning, Munich, Germany)
    • 13:45 15:15
      Machine learning and data science 3 Room 14

      Room 14

      Convener: Christian Staerk (IUF)
      • 13:45
        A Comprehensive Comparison of Methods for Quantifying Similarity of Datasets 18m

        Quantifying the similarity between two or more datasets is an important task in statistics and machine learning. In meta-learning, it enables the transfer of knowledge across tasks and datasets. In simulation studies, the similarity between the distributions assumed in the simulation and the distributions of the datasets for which the performance of methods is assessed is crucial. Similarly, in the context of synthetic data, the similarity of the generated data to a real-world dataset is typically evaluated to assess the quality of the data generation. In various applications, statistical two- or k-sample tests can be used to check whether the underlying distributions of two or more datasets coincide.

        Many approaches for quantifying dataset similarity have been proposed in the literature. The choice of a suitable method is, however, difficult due to the abundance of proposed methods and the lack of neutral comparison studies. In previous work, we systematically reviewed 118 methods applicable to multivariate data that make no parametric assumptions and consider the full underlying data distribution. We provided a taxonomy of the methods based on the underlying ideas and compared them theoretically regarding their applicability, interpretability, and theoretical properties.

        Here, we compare the most promising methods identified in the theoretical comparison regarding their performance in practice. We conduct a comprehensive simulation study to assess the practical performance of the methods across diverse scenarios, including two- and multi-sample settings for both categorical and numerical datasets. We evaluate how well the methods detect certain differences, e.g., differences in location, scale, or higher moments, between datasets. Moreover, we analyze computational aspects such as runtime, memory consumption, and numerical stability. Based on the results, we give recommendations for selecting appropriate methods. We propose method combinations that are able to detect a wide spectrum of differences between datasets.

        Speaker: Marieke Stolte (TU Dortmund University, Department of Statistics)
      • 14:03
        A New Approach to Distinctness Testing 18m

        The assessment of crop variety distinctness, uniformity, and stability (DUS) is a fundamental component of plant breeding and registration processes. Traditionally, one-dimensional analysis of variance is conducted separately for each attribute. However, before conducting separate analyses, it would be worthwhile to apply multivariate methods to determine whether a given variety differs from others simultaneously in all examined attributes. Multivariate evaluations are crucial for selection, as they allow for a comprehensive analysis of multiple traits simultaneously, thereby providing a more holistic assessment of a variety's performance. This study introduces a novel approach to distinctness testing based on machine learning, aiming to improve classification accuracy and efficiency in crop variety evaluation. An Artificial Neural Network (ANN) model was developed to classify plant varieties using phenotypic data collected according to DUS guidelines. The network architecture incorporated advanced techniques such as batch normalization and dropout, which enhanced model robustness and reduced overfitting. Furthermore, a new subset division strategy was proposed, ensuring a balanced representation of varieties and trait combinations during model training and validation. The model can effectively recognize both known and previously unseen varieties, demonstrating generalizability and practical value for breeders. The study highlights the utility of machine learning in supporting variety distinctness assessments, offering flexible tools for agricultural research and plant breeding.

        Speaker: Laura Slebioda (Department of Mathematical and Statistical Methods, Poznań University of Life Sciences)
      • 14:21
        Predicting mixture of experts performance by generalized estimating equations 18m

        The aim of this work is to identify characteristics of mixtures of experts that affect the improvement of the combined classifier's performance over the average performance of the base learners. The problem was examined on various high-dimensional genomic data sets.
        Mixtures of experts are useful when the responses of the base classifiers differ. From this point of view, diversity of architectures is key. Different models can respond differently to limited amounts of data, a crucial problem in genomic data, and combining them reduces the risk of overfitting.
        Diversity can be enforced by merging learners built on different ideas. The base classifiers include neural networks with different parameters (decay parameter, starting weights, number of neurons and layers). Additionally, the following machine learning methods were applied: support vector machines with different kernels and regularization parameters, diagonal and non-diagonal shrinkage discriminant analysis, naive Bayes, and random forest. These algorithms use different optimization strategies, which may help to avoid getting stuck in local minima.
        Diversity of the base classifiers is additionally enforced by using variable sets of different predefined sizes and different selection methods (single and combined). We are also interested in the number of classifiers combined from the aforementioned set, because too few models in the expert mix may not provide an advantage, while too many similar models lead to redundancy.
        Generalized estimating equations with an autoregressive correlation structure over the increasing number of genes, an identity link, and a Gaussian variance-to-mean relation were applied to model the improvement of the mixture of experts over the mean performance of the base learners. The explanatory variables in the model were the number of selected genes, the diversity measure, the squares of these variables, and the interaction between diversity and the number of genes, also for squared values. Clusters in the generalized estimating equations were defined by the gene selection method.
        Various diversity and similarity measures based on the results of the constituent classifiers were taken into account: diversities based on entropy, several types of mutual information, and various measures based on diversities averaged over all pairs of classifiers.
        The examined predictors are important in different scenarios. Standardized coefficients in the generalized estimating equations models, together with P-values, indicate the most significant characteristics for improving the performance of the combined classifiers. The results may be useful for predicting the best possible ensemble of classifiers.
        References
        Kamateri, E.; Salampasis, M. An Ensemble Framework for Text Classification. Information
        2025, 16 (2), 85.

        Speaker: Małgorzata Ćwiklińska-Jurkowska (Department of Biostatistics and Biomedical Systems Theory, Nicolaus Copernicus University)
      • 14:39
        Predicting Ordinal Outcome Using Interpretable Artificial Representative Tree with Conformal Uncertainty Quantification 18m

        Background:
        A random forest (RF) is an efficient method for prediction but it is difficult to
        interpret.
        Artificial Representative Trees (ARTs) are a special type of surrogate model
        that approximates the original structure of the RF in a single tree, achieving
        similar predictive accuracy.
        Conformal Predictive Systems (CPS) provide a framework for uncertainty
        quantification by generating prediction intervals. It is also possible to
        calculate the probability of an observation being above a selected threshold.
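
        A minimal sketch of how calibration residuals can yield both a prediction interval and an exceedance probability is given below. It follows the general split-conformal idea and is not the implementation used in this work: the point prediction is taken as given, the tie-breaking constant 0.5 stands in for the randomization normally used in conformal predictive systems, and the residuals are artificial.

```python
from math import ceil, floor


def conformal_interval(y_hat, cal_residuals, alpha=0.1):
    """Prediction interval from order statistics of calibration residuals
    (residual = observed value minus point prediction)."""
    r = sorted(cal_residuals)
    n = len(r)
    k_lo = floor((alpha / 2) * (n + 1))
    k_hi = ceil((1 - alpha / 2) * (n + 1))
    if k_lo < 1 or k_hi > n:
        raise ValueError("too few calibration points for this alpha")
    return y_hat + r[k_lo - 1], y_hat + r[k_hi - 1]


def exceedance_probability(y_hat, cal_residuals, threshold):
    """Estimated probability that the outcome lies above `threshold`,
    read off the empirical distribution of y_hat + residual."""
    n = len(cal_residuals)
    above = sum(1 for r in cal_residuals if y_hat + r > threshold)
    return (above + 0.5) / (n + 1)


# Artificial calibration residuals -49, ..., 49 and a hypothetical prediction.
residuals = [float(i - 50) for i in range(1, 100)]
prediction = 100.0

lower, upper = conformal_interval(prediction, residuals, alpha=0.1)
p_above = exceedance_probability(prediction, residuals, threshold=100.0)
p_above_high = exceedance_probability(prediction, residuals, threshold=126.0)
print(lower, upper, p_above, p_above_high)
```

        Because the threshold is an argument rather than part of the fitted model, the same calibration residuals can be reused for any diagnostic cut-off.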

        Motivation:
        Our aim is to combine the strengths of ARTs and CPS in order to make
        reliable predictions of ordinal outcomes and to generate a stable tree that is
        easy to interpret while still conveying accurate local uncertainty insights per
        leaf. We will use a single ART to predict both the regression value and the
        probability of exceeding a diagnostic threshold. One particular strength of
        the suggested solution is the ability to dynamically adapt the ART to different
        thresholds. For illustration, we use the NHANES dataset containing blood
        glucose levels and related health indicators.

        Methods:
        We compared four modeling strategies: (i) an integrated ART + CPS
        approach, (ii) a decision tree + CPS model, (iii) multiple ARTs, and (iv)
        decision trees. The outputs of (i) and (ii) are joint predictions and
        uncertainty estimates, whereas (iii) and (iv) require separate models for
        continuous and probability outcomes. Prediction performance (mean squared
        error, Brier Score) was evaluated using 10-fold cross-validation. Coverage of
        prediction intervals and model interpretability, measured by tree depth and
        cross-fold similarity, are also evaluated.

        Results:
        The ART + CPS and Decision Tree + CPS models achieve comparable
        predictive performance for both glucose level and probability prediction of
        prediabetes and diabetes. However, the ART + CPS models are substantially
        more interpretable. Their trees are about half the size of decision trees and
        more stable. Additionally, they avoid the need for multiple specialized models
        for each prediction task, which might lead to contradictory interpretations.

        Conclusion:
        This study shows that combining ARTs with CPS produces a single
        interpretable model that balances accuracy, stability, and explainability, while
        providing quantitative predictions with uncertainty estimates and probabilities
        for diagnostic categories.

        Speaker: Lea Kronziel (University of Lübeck)
      • 14:57
        modgo 2.0: an R package for synthetic data generation to mimic original study data 18m

        Sharing of original study data may be restricted by data protection policies. Instead, synthetic data that mimics the original data structure may be shared between research groups. This work introduces modgo 2.0 which may be used for generating synthetic data from existing study data. Simulations may be based either on the combination of the rank inverse normal transformation with simulation from the multivariate normal or on the use of the generalized lambda and/or generalized Poisson distributions. Scales of the variables may be continuous, ordinal, categorical, dichotomous, and/or even survival. We also provide an extension to the simulation of survey data which contains weights. Simulations on real data demonstrate the flexible use of modgo. The R package modgo is useful when existing study data may not be shared. Unique features are the inclusion of survival data and the expansion to simulating survey data. Its novel expansion to the generalized lambda and Poisson distributions permits the sharing of truly anonymized subjects.
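
        The first simulation route mentioned above (rank inverse normal transformation followed by simulation from the multivariate normal) can be sketched for two continuous variables as follows. This is not the modgo code or API: the rank offset of 0.5, the bivariate restriction, the assumption of no ties, and the back-transformation through empirical quantiles are all choices made for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

Z = NormalDist()


def rank_inverse_normal(x):
    """Map values to normal scores via their ranks (assumes no ties)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = Z.inv_cdf((rank - 0.5) / n)
    return scores


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den


def back_transform(z, original):
    """Map a normal score back to the original scale via empirical quantiles."""
    s = sorted(original)
    n = len(s)
    return s[min(int(Z.cdf(z) * n), n - 1)]


def synthesize(x, y, n_new, rng):
    """Draw correlated normal scores and map them back to the original scales."""
    rho = pearson(rank_inverse_normal(x), rank_inverse_normal(y))
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    rows = []
    for _ in range(n_new):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        rows.append((back_transform(z1, x), back_transform(z2, y)))
    return rows


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]
synthetic = synthesize(x, y, 500, rng)
synthetic_corr = pearson([r[0] for r in synthetic], [r[1] for r in synthetic])
print(f"correlation in synthetic data: {synthetic_corr:.2f}")
```

        Note that this empirical back-transformation reuses observed values one variable at a time, so the marginal values of the synthetic rows are still original values; a parametric back-transformation would avoid that.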

        Speaker: Andreas Ziegler (Cardio-CARE)
    • 13:45 15:15
      Methods in epidemiology 2 Room 13 A

      Convener: Sharon-Lise Normand (Harvard Medical School)
      • 13:45
        Detection of Measurement Errors and Heterogeneity During the Collection of Observational Data: A Simulation Study 18m

        In field studies, measurements are often collected over extended periods, during which subtle shifts in data quality or instrument performance can occur. Recognizing and quantifying such measurement heterogeneities over time is essential to ensure the validity of study results and to intervene at an early stage if possible. However, the performance of available statistical approaches for identifying temporal variability that indicates measurement accuracy remains poorly understood.
        We present a simulation study designed to systematically compare seven statistical methods (ARIMA, fused LASSO signal approximator [FLSA], GAM, LOWESS, moving average, PELT, and piecewise regression) for assessing measurement heterogeneity across the data collection period. All methods were implemented using default parameter settings to reflect typical application scenarios. We generated 70,720 datasets covering a wide range of sample sizes, from 30 to 1,000 observations, to evaluate the performance of each method under 136 varying data conditions and systematic error patterns. These include sample size, distribution, signal-to-noise ratio, type and magnitude of systematic change. Four estimands were defined to investigate the ability of each method to detect temporal variability and change points in the simulated scenarios.
        Our results demonstrate that method performance is strongly dependent on both data distribution and sample size. LOWESS and GAM consistently delivered the most stable results across all performance measures and error patterns, making them suitable default choices for routine monitoring. For scenarios without true change, PELT performed best with normally distributed data, while FLSA excelled with log-normal distributions. Change-point detection revealed method-specific strengths: moving average was superior for detecting single jumps, whereas FLSA handled more complex change patterns most effectively. The moving average (for normal data) and PELT (for log-normal data) tend to systematically overestimate the range as the sample size increases. Notably, ARIMA and PELT exhibited considerable instability and sensitivity to sample size, with performance degrading particularly for larger N.
        Based on these findings, we recommend LOWESS and GAM as robust general-purpose methods for detecting measurement heterogeneity across diverse field study conditions. For change-point enumeration specifically, FLSA provides consistent performance with minimal sensitivity to sample size. ARIMA and PELT should be avoided as default approaches due to their inconsistent bias behavior and performance degradation with increasing sample size. These recommendations provide practical guidance for researchers implementing quality control procedures during data collection.
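
        As a rough illustration of one of the compared approaches, the following Python sketch smooths a simulated measurement series with a centered moving average and reports the range of the smoothed curve, with and without a single jump. The window length, jump size, noise model, and function names are assumptions made for this example and are not the settings of the simulation study.

```python
import random


def moving_average(values, window):
    """Centered moving average; window is assumed odd, edges without a full window are dropped."""
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / (2 * half + 1)
        for i in range(half, len(values) - half)
    ]


def smoothed_range(values, window=21):
    """Range (max minus min) of the smoothed series, a simple summary of systematic change."""
    smooth = moving_average(values, window)
    return max(smooth) - min(smooth)


rng = random.Random(42)
n_obs, jump = 400, 2.0

# Scenario without true change: pure noise around a constant level.
stable = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
# Scenario with a single jump halfway through data collection (same noise).
shifted = [v + (jump if i >= n_obs // 2 else 0.0) for i, v in enumerate(stable)]

print("range without change:", round(smoothed_range(stable), 2))
print("range with jump:     ", round(smoothed_range(shifted), 2))
```

        Even without any true change, the smoothed range is greater than zero, which is one reason why such estimands need to be benchmarked against scenarios without change.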

        Speaker: Ronja Foraita (Leibniz Institute for Prevention Research and Epidemiology - BIPS)
      • 14:03
        Detecting Temporal Measurement Heterogeneity in Cohort Studies: Lessons from the Study of Health in Pomerania 18m

        Longitudinal observational studies and clinical trials routinely collect extensive phenotypic data under changing organisational, technical, and environmental conditions. Variations in examiners, devices, protocols, or ambient factors can introduce consequential forms of measurement heterogeneity and measurement error over time. Although these sources of bias are well recognised, systematic and transparent procedures to detect temporal data patterns in this particular context have been insufficiently assessed.
        Using data from the Study of Health in Pomerania (SHIP-START-4, 2019–2021) as a case example, we demonstrate the susceptibility of real-world cohort data to time-related measurement variability. SHIP-START-4 comprised 1182 participants, who underwent interviews, questionnaires, and various clinical examinations, including ultrasound examinations, ECG, blood pressure, spirometry, hand grip, and laboratory assays, among others, resulting in about 1300 metric phenotypic variables with a median number of 877 observations (Q10=251; Q90=1182).
        To address temporal trends, we applied seven commonly used statistical approaches—ARIMA, fused LASSO signal approximator (FLSA), GAM, LOWESS, moving average, PELT, and piecewise regression—to SHIP-START-4 data. Estimands were the range of the systematic change, variance, the mean absolute deviation around the median, and the number of change points. The systematic detection of these findings was also implemented as part of an automated assessment pipeline.
        Applying the statistical methods listed above to the same data yielded markedly different estimates of heterogeneity and error, illustrating the complexity of making the right choice to inform on the presence and magnitude of measurement heterogeneity and measurement error. The resulting diversity corresponds to findings of a parallel simulation study, in which the identical methods were evaluated under controlled conditions representing key types of patterns empirically observed in SHIP.
        This work highlights two key insights. First, empirical cohort data are intrinsically vulnerable to temporal heterogeneity and should routinely be assessed for it. Second, metadata-driven pipelines allow large-scale studies to incorporate trend detection and measurement-error diagnostics into routine data-quality workflows, but the effectiveness of such monitoring depends critically on robust statistical methods.

        Speaker: Carsten Schmidt (University Medicine Greifswald)
      • 14:21
        Navigating Complexities in Assessing Systemic Health Effects of Tattoos in a Population-Based Cohort 18m

        Background: Tattoos and permanent make-up (PMU) are becoming increasingly popular, yet their potential systemic health implications remain poorly understood.
        Methods: To investigate associations between tattoos/PMU and chronic disease outcomes, we analyzed data from the LIFE-Adult Study, a population-based cohort of 10,000 adults recruited in Leipzig, Germany (2011–2014). A dedicated tattoo-specific questionnaire was administered between June 2018 and December 2020 to 4,248 participants, of whom 7.4% (n = 320) reported tattoos or PMU (4.7% tattoos, 3.1% PMU, 14% both). The study was approved by the University of Leipzig Medical Faculty Ethics Board, with written informed consent obtained from all participants.
        We examined liver toxicity and a composite cardiovascular disease (CVD) outcome (myocardial infarction or heart failure), supported by NT-proBNP, a biomarker for cardiac insufficiency. Exposure data included tattoo/PMU characteristics (location, size, color, age). To address confounding, we constructed a directed acyclic graph (DAG) and applied full matching on age, sex, smoking status, body mass index, socioeconomic status, and alcohol consumption, which maximizes data retention and enables estimation of the average treatment effect on the treated (ATT) in the exposed group. Statistical analysis employed weighted logistic regression on matched data, with unweighted regression on unmatched data used for comparison.
        Results: The prevalence of liver toxicity was 11.3% (23/203) among tattooed participants and 10% (382/3,944) among non-tattooed controls, with no association observed between tattoo/PMU size and liver enzyme elevation. We observed a sex-specific pattern: the risk ratio for liver toxicity was 1.35 (95% CI [0.8–2.3]) in men and 0.43 (95% CI [0.2–0.91]) in women, a difference confirmed in unmatched analyses and persisting after excluding participants with hepatitis. The prevalence of CVD was 5.4% (15/278) among tattooed individuals, compared to 5.1% (201/3,972) in non-tattooed controls, with a risk ratio of 1.2 (95% CI [0.7–2.1]) in the matched cohort and 1.1 (95% CI [0.7–1.8]) in unmatched analyses. NT-proBNP levels were similarly elevated in both groups (6.4% in tattooed vs. 6.4% in non-tattooed), yielding a risk ratio of 0.8 (95% CI [0.4–1.6]).
        Discussion: Our findings suggest a potential sex-modulated association between tattoos and liver toxicity. Cardiovascular outcomes showed an increased risk as well. Weighted models provided more stable estimates. Notably, sex-specific patterns emerged, though limited power in the female subgroup constrained inference. Despite the small sample size, elevated risk ratios for cardiovascular disease and liver toxicity suggest a need for further longitudinal studies of the health effects of tattoos.
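
        For orientation, a crude (unmatched, unweighted) risk ratio can be computed directly from the liver toxicity counts reported above (23/203 vs. 382/3,944). The Python sketch below uses the standard Wald interval on the log scale; it is not the matched, weighted analysis of the abstract and is therefore not expected to reproduce the reported sex-specific estimates. The function name is an assumption of this example.

```python
import math


def crude_risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Crude risk ratio with a Wald confidence interval computed on the log scale."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper


# Counts taken from the Results paragraph: liver toxicity, tattooed vs. non-tattooed.
rr, lower, upper = crude_risk_ratio(23, 203, 382, 3944)
print(f"crude RR = {rr:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```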

        Speaker: Narges Ghoreishi (Department Exposure, Unit of Epidemiology statistics and exposure modelling, German Federal Institute for Risk Assessment (BfR))
      • 14:39
        Beyond Case Counts: Simulation Evidence for Probability-Based Pandemic Surveillance 18m

        High-quality data are essential for reliable epidemic surveillance. Traditional systems rely on passive case reporting, which may lead to unreliable prevalence estimates depending on the specific disease. Using the example of the COVID-19 pandemic, we show that once prevalence exceeds moderate levels, conventional reporting becomes biased and unstable. Beyond this point, drawing additional representative samples provides accurate estimates and enables the collection of additional information necessary for deeper insights into the pandemic’s impact. Adaptive surveillance designs incorporating probability sampling are necessary to ensure data quality and enable reliable evidence-based policies.

        Speaker: Inken Siems (Trier University)
    • 13:45 15:15
      Other 3 Room 12

      Convener: Christiana Drake (University of California)
      • 13:45
        Feasibility of Photoplethysmography for Heart Rate Asymmetry Assessment: A Comparative Study using the Autonomic Aging Database 18m

        Background: Heart Rate Asymmetry (HRA) represents a specialized domain of Heart Rate Variability (HRV) analysis, quantifying the unequal contribution of accelerations and decelerations to the overall heart rate variations. While HRA provides unique insight into the nonlinear dynamics of autonomic control, its assessment has traditionally relied on high-resolution Electrocardiography (ECG). With the expansion of wearable technology, there is a critical need to validate whether Pulse-to-Pulse (PP) intervals derived from Photoplethysmography (PPG) can serve as a reliable surrogate for RR intervals in complex HRV analysis.
        Methods: Data were obtained from the Autonomic Aging database (PhysioNet), specifically selecting 617 healthy volunteers recorded with the Task Force Monitor system. This subset provided simultaneous recordings of dual-channel ECG (1000 Hz) and continuous non-invasive blood pressure via finger photoplethysmograph (100 Hz upsampled to 1000 Hz). Two sets of RR intervals were extracted from lead I and II ECG, while PP intervals were derived from the blood pressure signal. The analysis focused specifically on comparative assessment of HRA indices: Guzik's Index (GI), Porta's Index (PI), Ehlers' Index (EI), and Deceleration Input (DI). Additionally, standard spectral HRV metrics were compared.
        Preliminary analysis utilized Pearson’s correlation. To address the limitations of simple correlation, the final study will employ Bland-Altman plots, Linear Mixed Effects Models and Principal Component Analysis (PCA).
        Results: Initial findings indicate a high degree of correspondence between PPG- and ECG-derived HRA metrics. Notably, for Guzik’s Index, the correlation between PPG and ECG1 signals (R=0.865; p<0.0001) exceeded the simultaneous ECG inter-lead ECG1 vs ECG2 correlation (R=0.835; p<0.0001). This suggests that PPG signals may possess sufficient precision to capture subtle dynamics in heart rate.
        Conclusion: Preliminary results support the utility of using PPG for advanced nonlinear HRV analysis. Our study aims to confirm whether consumer-grade optical sensors can provide medical-grade insights into HRA.
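
        The Bland-Altman analysis planned for the final study complements correlation by quantifying agreement between two measurement methods. A minimal Python sketch of the bias and 95% limits of agreement is given below; the index values are hypothetical and only illustrate the calculation, and the function name is an assumption of this example.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b, z=1.96):
    """Return bias and lower/upper limits of agreement for paired measurements."""
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(differences)
    spread = stdev(differences)  # sample standard deviation of the differences
    return bias, bias - z * spread, bias + z * spread


# Hypothetical paired values of an HRA index (in %), one pair per subject.
ecg_index = [49.8, 50.4, 51.1, 48.9, 50.0, 52.3]
ppg_index = [50.1, 50.0, 51.6, 49.5, 49.7, 52.0]

bias, lower_limit, upper_limit = bland_altman(ppg_index, ecg_index)
print(f"bias = {bias:.2f}, limits of agreement = [{lower_limit:.2f}, {upper_limit:.2f}]")
```

        Unlike Pearson's correlation, the limits of agreement are expressed in the units of the index, so they can be compared directly with a clinically acceptable difference.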

        Speaker: Rafał Pawłowski (Collegium Medicum of Nicolaus Copernicus University)
      • 14:03
        Assessing the Reliability of Virtual Control Groups in Preclinical Toxicology 18m

        The principles of Replacement, Reduction, and Refinement (3Rs) have become fundamental to modern biomedical research. In this context, Virtual Control Groups (VCGs) offer a promising strategy to reduce the number of animals used in toxicological and pharmacological studies. Rather than including concurrent control groups (CCGs) in every experiment, VCGs rely on historical control data collected under comparable experimental conditions to provide the necessary reference distributions for statistical evaluation. To ensure scientific credibility and regulatory acceptance, the validation of VCG-based conclusions is essential.
        The VICT3R project was established to advance toxicology research by developing and validating VCGs built from high-quality historical control data. The project integrates standardized CDISC Standard for Exchange of Nonclinical Data (SEND) datasets, comprehensive data curation pipelines, and AI-supported analytical workflows to safeguard data integrity, harmonization, and reproducibility. This unified database forms a regulatory-compliant foundation for robust VCG generation and represents an important step toward data-driven reduction of animal use in preclinical research.
        As part of VICT3R, we systematically assessed the validity of VCGs across multiple species using both empirical data evaluation and simulation-based performance testing. Statistical comparisons included assessments of group means and variances, hypothesis testing, and effect size estimation. Differences between VCGs and CCGs or treatment groups were evaluated using t-tests, Levene’s tests, and Cohen’s d to quantify potential deviations. In addition, simulations were conducted to evaluate false-positive rates, statistical power, and robustness across a range of realistic experimental scenarios and data-matching strategies.
        Our results demonstrate that VCGs can provide statistically and biologically equivalent outcomes to CCGs when stringent data selection, metadata-based matching, and standardized transformation and quality checks are applied. Under these conditions, VCGs can serve as a valid and ethically preferable alternative in decision-making for toxicological studies.

        This standardized validation framework contributes to transparent and reproducible VCG implementation and promotes scientific and regulatory confidence in this approach. By enabling reliable statistical inference without unnecessary concurrent controls, VICT3R supports ethically optimized preclinical research and accelerates practical adoption of VCGs in alignment with the 3Rs principle.
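
        One of the quantities used above to compare virtual and concurrent control groups is Cohen's d. A minimal Python sketch with the pooled standard deviation follows; the body weight values are hypothetical and the function name is an assumption of this example, not part of the VICT3R tooling.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


# Hypothetical terminal body weights (g) for a concurrent and a virtual control group.
concurrent_controls = [312.0, 305.5, 298.0, 320.1, 309.4]
virtual_controls = [310.2, 301.7, 315.9, 307.3, 299.8, 318.4, 304.6]

print(f"Cohen's d = {cohens_d(virtual_controls, concurrent_controls):.2f}")
```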

        Speaker: Timur Tug (Fraunhofer Institute for Toxicology and Experimental Medicine ITEM)
      • 14:21
        Modeling data with values above the upper limit of quantitation 18m

        In pharmaceutical research and preclinical development, data below the lower limit of quantitation are quite common, although they are sometimes not handled properly. Outside time-to-event settings, measured data above a general or even subject-specific upper limit of quantitation are less common.
        Malignant tumor cells can metastasize. When they do, they may cause new tumors, called secondary or metastatic tumors. Effective metastasis treatment should ideally prevent secondary tumors or at least limit their number. Specific animal tumor models are used in which the number of metastases in the lungs is quite easy to determine after the animals are sacrificed. For physiological reasons, this holds only up to a certain number of metastases; above that number, metastases become connected and therefore indistinguishable, resulting in values above a limit of quantitation, i.e. right-censored counts.
        As the main focus of research projects is usually a comparison of the efficacy of a treatment vs. a vehicle control, or between different treatments, to decide which compound to proceed with at which dose, both a proper test strategy and effect estimation are indispensable. Based on such a real-life example, modeling strategies will be discussed along with some thoughts on how to communicate the results.
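
        One possible way to handle such right-censored counts is a likelihood in which exact counts contribute the probability mass and censored records contribute the upper tail probability. The Python sketch below does this for a simple Poisson model with a single mean; the Poisson assumption, the data, and the function names are illustrative choices of this example, and the abstract does not state which models the talk will discuss.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability mass of a Poisson(lam) variable at k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def censored_loglik(lam, counts, censored):
    """Log-likelihood; for censored records, counts holds the limit c and we use P(Y >= c)."""
    total = 0.0
    for y, is_censored in zip(counts, censored):
        if is_censored:
            tail = 1.0 - sum(math.exp(poisson_logpmf(k, lam)) for k in range(y))
            total += math.log(max(tail, 1e-300))
        else:
            total += poisson_logpmf(y, lam)
    return total


def fit_poisson_mean(counts, censored, iterations=200):
    """Maximize the log-likelihood over lam by ternary search (assumes a single maximum)."""
    low, high = 1e-6, 10.0 * max(counts) + 1.0
    for _ in range(iterations):
        left = low + (high - low) / 3
        right = high - (high - low) / 3
        if censored_loglik(left, counts, censored) < censored_loglik(right, counts, censored):
            low = left
        else:
            high = right
    return (low + high) / 2


# Hypothetical lung metastasis counts; True marks animals at the upper limit of 20.
counts = [3, 5, 20, 20]
censored = [False, False, True, True]

naive_mean = sum(counts) / len(counts)  # treats the limit as an exact count
print(f"naive mean = {naive_mean:.2f}, censored MLE = {fit_poisson_mean(counts, censored):.2f}")
```

        Treating values at the limit as exact counts pulls the estimated mean downwards, which is why the censored estimate exceeds the naive mean in this example.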

        Speaker: Ulbrich Hannes-friedrich (Bayer AG, Pharmaceuticals)
      • 14:39
        Consideration of missing values in sample size calculation using multiple imputation 18m

        In many clinical trial analyses, missing data is addressed through multiple imputation (MI) to avoid loss of information and potential bias. However, this approach is not taken into consideration at the planning stage when calculating the sample size. Here, it is common practice to inflate the calculated sample size by an estimated dropout rate in order to maintain the desired power. This results in a discrepancy between the analysis method for which the sample size is calculated and the evaluation method ultimately used.
        MI allows uncertainty of the estimator to be properly represented by filling in missing values with several plausible values. Based mainly on the between-imputation (BI) variance, Zha and Harel [1] proposed a power calculation formula, demonstrating that statistical power can be higher when MI is used, which has a particular impact on sample size planning. Further research is needed to systematically evaluate how much power can be gained in order to give recommendations beyond the commonly used inflation of the required sample size.
        We extend the simulation study by Zha and Harel [1] with a fixed proportion of missing response under the “missing at random” and “missing completely at random” assumptions, whereby we compare different imputation methods such as predictive mean matching and Bayesian linear regression and vary the number of covariates and their respective relationship to the outcome. We conduct power analyses for different sample sizes. In each scenario, we first validate the provided formula with the simulated power, and then compare the results to the power obtained by complete-case analysis. We propose the number of imputations needed to obtain a robust estimate of the BI variance and thus, the power, in our simulations.
        We analyse whether the power gain from multiple imputation in the outcome can be robustly quantified in various settings, particularly in relation to the BI variance. Additionally, we identify scenarios in which MI leads to the most substantial improvements and demonstrate that, under optimal conditions, it is possible to eliminate the inflated part of the sample size entirely.
        This simulation study aims to contribute to the improvement of sample size calculation for clinical trials that use imputation methods in the primary analysis. Future work will expand this framework to include a blinded interim analysis and adaptive sample size adjustment.
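
        The between-imputation variance referred to above enters the analysis through Rubin's rules for pooling results across imputed data sets. The Python sketch below shows only this standard pooling step, not the power formula of Zha and Harel [1], which is not given in the abstract; the estimates are hypothetical and the function name is an assumption of this example.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, within, between, total


# Hypothetical treatment effect estimates and squared standard errors from m = 5 imputations.
effects = [1.92, 2.10, 2.05, 1.88, 2.01]
squared_se = [0.16, 0.15, 0.17, 0.16, 0.15]

estimate, within, between, total = pool_rubin(effects, squared_se)
print(f"pooled estimate = {estimate:.3f}, total variance = {total:.4f}")
```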

        [1] Zha, R., Harel, O. Power calculation in multiply imputed data. Stat Papers 62, 533–559 (2021). DOI: 10.1007/s00362-019-01098-8

        Speaker: Teresa Byczkowski (Institute of Medical Biometry, Heidelberg University)
      • 14:57
        Calibrating machine learning approaches for probability estimation in case of the absence of calibration data 18m

        Statistical prediction models for binary outcomes are becoming increasingly popular. One significant challenge is calibrating these models to suit the characteristics of a target population that is structurally different from the original population. Calibration is especially challenging when there is no training data available from the target population. To address this problem, we propose a novel calibration method, SimCal, which uses synthetic data generated from the model development data in conjunction with marginal statistics from the calibration cohort. We show that expert-judgment modeling (EJM) may be used for calibration if cross-sectional data from the target population are available comprising expert judgments about the potential outcome and the covariates. We describe three alternative calibration approaches when calibration data are lacking: similarity-binning averaging (SBA), adaptive calibration of predictions (ACP), and Elkan calibration. In a simulation study, we compare SBA, ACP, Elkan calibration, and SimCal. R code for applying these methods is provided from the re-analysis of data on coronary artery disease. We illustrate all 5 calibration approaches with a real data set for predicting functional outcome after stroke. None of the approaches performed convincingly in all situations. SimCal performs well when model parameters are correctly specified. EJM failed on the stroke data. Further research is urgently required for calibration in the absence of calibration data.
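
        Of the approaches listed, the base-rate adjustment usually attributed to Elkan (2001) is the simplest to write down: a predicted probability is rescaled on the odds scale when the outcome prevalence differs between the development and the target population. The Python sketch below assumes that this is what "Elkan calibration" refers to and that only the two prevalences are known; the function name is hypothetical and the code is not the R code mentioned in the abstract.

```python
def adjust_for_base_rate(probability, development_rate, target_rate):
    """Shift a predicted probability from the development prevalence to the target prevalence."""
    numerator = probability * target_rate * (1 - development_rate)
    denominator = numerator + (1 - probability) * development_rate * (1 - target_rate)
    return numerator / denominator


# A model developed where 50% had the outcome, applied where only 10% are expected to.
for p in (0.2, 0.5, 0.8):
    print(p, "->", round(adjust_for_base_rate(p, 0.5, 0.1), 3))
```

        The adjustment leaves predictions unchanged when both prevalences agree and assumes that only the outcome prevalence, not the covariate-outcome relationship, differs between populations.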

        Speaker: Eleonora Di Carluccio (Cardio-CARE)
    • 13:45 15:15
      YSS2 (DR & PLR) Room 1 B

      Convener: Maren Hackenberg (University of Freiburg)
      • 13:45
        Hypothesis Testing in Ill-Conditioned Functional Response Models 15m

        In the functional response model (FRM), where a functional response is explained by scalar predictors, inference becomes challenging when the design matrix is not full-rank, leading to an ill-conditioned model (ICFRM). Widely used methods for this problem, such as $L^2$-norm-based tests (Zhang, 2013), suffer from critical flaws such as poor control of the type I error rate, which can invalidate statistical conclusions. To address these gaps, we first introduce two new test statistics for the general linear hypothesis problem in ICFRM: the globalizing pointwise F-test ($G_n$) and the $F_{max}$-test (Smaga and Stefańska, 2025). We employ robust nonparametric and parametric bootstrap techniques to approximate their null distributions. Simulation studies confirm that our proposed tests successfully control the type I error rate and exhibit greater statistical power than existing methods across different scenarios. We apply our methods to the audible noise data. The practical problem behind this data set is the motivation for our studies as the regression model is ill-conditioned in this case. Our ongoing research introduces a novel, alternative methodology: projection-based testing. This approach addresses the same problem by demonstrating that the functional null hypothesis can be equivalently expressed in terms of its projection. Preliminary simulation studies are underway to compare the performance of these projection tests against our validated $G_n$ and $F_{max,n}$ procedures. Initial results suggest the projection methods are promising, but their effectiveness is not consistent across all settings, opening a clear avenue for further research.

        Speaker: Natalia Stefańska (Adam Mickiewicz University)
      • 14:00
        missKnockoffs: A Robust Approach to Variable Selection in Incomplete Omics Data under False Discovery Control 15m

        Over the past two decades, the problem of selecting relevant variables in high-dimensional data analysis has gained particular importance in both statistics and machine learning. Despite substantial advances in modeling techniques and numerous algorithmic proposals, most existing approaches overlook the issue of missing observations — a phenomenon ubiquitous in real-world datasets, especially those derived from omics studies.

        To bridge this methodological gap, we propose missKnockoffs, an extension of the Model-X knockoff framework to the setting of incomplete data. The procedure proceeds in two stages: first, missing values are imputed using selected imputation strategies; subsequently, so-called knockoff variables are generated to enable control of the False Discovery Rate (FDR). To reduce the impact of randomness in the knockoff generation process, we introduce a mechanism of multiple knockoff replications combined with appropriate aggregation techniques.

        Furthermore, we propose a novel approach to aggregating knockoff statistics, which exhibits desirable and well-justified theoretical properties. The effectiveness of the missKnockoffs method is validated through an extensive suite of simulation experiments, primarily evaluating test power and FDR control capability.

        In a comparative analysis, missKnockoffs is benchmarked against several state-of-the-art variable selection algorithms, including SLOBE and ABLAS. Experimental results demonstrate that the proposed method achieves competitive, and in many cases superior, performance in terms of balancing test power and false discovery control. Additionally, the method’s practical utility is confirmed through an application to real omics datasets.
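
        The FDR control in knockoff-based selection comes from a data-dependent threshold on the knockoff statistics W_j, where large positive values favour the original variable over its knockoff. The Python sketch below implements the standard knockoff+ threshold for a single knockoff replication; the imputation step and the aggregation across multiple replications proposed in missKnockoffs are not shown, and the statistics and function names are hypothetical.

```python
def knockoff_threshold(w_stats, q=0.1, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q, or infinity."""
    candidates = sorted({abs(w) for w in w_stats if w != 0})
    for t in candidates:
        negatives = sum(1 for w in w_stats if w <= -t)
        positives = sum(1 for w in w_stats if w >= t)
        if (offset + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")


def select_variables(w_stats, q=0.1):
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    threshold = knockoff_threshold(w_stats, q)
    return [j for j, w in enumerate(w_stats) if w >= threshold]


# Hypothetical knockoff statistics for 13 variables.
w_example = [3.0, 2.5, 2.2, 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.7, 0.5, -0.3]
print(select_variables(w_example, q=0.2))
```

        With offset=1 this is the knockoff+ rule; because of the offset, at least 1/q variables must pass the threshold before anything can be selected.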

        Speaker: Dominik Nowakowski (Medical University of Bialystok, Department of Biostatistics and Medical Informatics)
      • 14:15
        Conditional distribution function-based measure for independence testing of functional data 15m

        In modern data analysis, technological advancements frequently result in the collection of Functional Data (FD), where observations are naturally represented as smooth functions, curves, or surfaces over a continuum (e.g., time or space). Examples include daily stock prices, continuous temperature recordings, or spectroscopic measurements. Functional Data Analysis (FDA) offers a powerful framework for modeling such phenomena, addressing challenges like high dimensionality and irregular sampling better than traditional multivariate methods. On the other hand, a fundamental task in statistics is examining the relationship, or independence, between variables. While widely studied for classical multivariate data, testing independence in the functional setting remains a significant challenge. Existing methods, while valuable (e.g., Krzyśko et al., 2022, 2025), may not always be optimal, necessitating the development of alternative and more robust procedures.

        This work, which is also a master's project, proposes a novel approach to test the independence of functional data by leveraging a basis expansion technique. Functional data is first approximated as a linear combination of basis functions (e.g., Fourier or B-spline) in a Hilbert space. The resulting finite set of basis coefficients effectively translates the functional problem into a standard multivariate one. We adapt and extend the recently proposed multivariate independence measures and tests (Wang et al., 2025) to this functional context.  

        We conduct a comprehensive simulation study to assess the new method's statistical properties, focusing on control of the type I error rate and on power. The proposed methodology has broad practical implications in fields where functional dependence is crucial: finance (determining the independence between two stock price trajectories); environmental science (assessing if the annual temperature curve in one region is independent of the annual rainfall curve in another); biometrics (testing the relationship between two continuous physiological signals, e.g., two types of brain activity traces). We illustrate the new methods on practical problems of this kind.

        References:

        1. Krzyśko, M., Smaga, Ł., Kokoszka, P. (2022). Marginal distance and Hilbert-Schmidt covariances-based independence tests for multivariate functional data. Journal of Artificial Intelligence Research 73, 101613.

        2. Krzyśko, M., Smaga, Ł., Wydra, J. (2025). Distance of mean embedding for testing independence of functional data. Signal Processing 233, 109959.

        3. Wang, L., Zhou, H., Ma, W., Yang, Y. (2025). A conditional distribution function-based measure for independence and K-sample tests in multivariate data. Journal of Multivariate Analysis 205, 105378.

        Speaker: Filip Pieczątkiewicz (Adam Mickiewicz University)
      • 14:30
        Combining machine learning methods for subgroup identification in time-to-event data with approximate Bayesian computation for bias correction 15m

        In clinical development it is essential to identify subgroups of patients who exhibit a beneficial treatment effect, ideally before moving to confirmatory trials. Such subgroups are often defined by predictive biomarkers with corresponding cut-off values. However, data-driven selection of biomarkers or cut-offs introduces selection bias, i.e. the treatment effect within the selected subgroup is overestimated.
        In previous work, the approximate Bayesian computation (ABC) algorithm was used to correct for this selection bias, but it was limited to situations with a reduced number of potential subgroups. Machine learning (ML) methods explore a much wider range of subgroups, but this also increases the risk of bias and thus the challenge for effective bias correction. In this work we investigate how ML-based subgroup selection, specifically model-based partitioning (MOB), can be combined with the ABC algorithm to correct for selection bias. We first set up the methods by adapting MOB for subgroup selection and extending the ABC algorithm to time-to-event settings. Then, we evaluate our approach in terms of bias, overlap with the true subgroup, rate of correct biomarker inclusion and similarity in subgroup size in simulation studies based on the ADEMP framework.
        Results from the simulation study indicate that the ABC approach effectively reduces the bias of treatment effect estimates in subgroups identified by MOB. The root mean squared error (RMSE) of the naïve estimates can be decreased from 0.171 to 0.112 in scenarios with large subgroup effects. Nevertheless, the approach also has some limitations: the ABC algorithm is computationally intensive, its performance depends strongly on the choice of prior distributions, and it is less effective when the true subgroup effect is weak. In summary, our findings highlight the importance of addressing selection bias in ML-based subgroup selection and demonstrate how the ABC framework can provide a reasonable correction strategy.
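
        For readers unfamiliar with ABC, the generic rejection algorithm behind it can be sketched in a few lines of Python: draw a parameter from the prior, simulate data, and keep the draw if a summary of the simulated data is close to the observed summary. The toy model below (a normal mean with a uniform prior) and all names are assumptions of this example; the subgroup selection step and the time-to-event extension of the abstract are not reproduced.

```python
import random
from statistics import mean


def abc_rejection(observed_mean, n_obs, n_draws, tolerance, rng):
    """Generic ABC rejection sampler for the mean of a normal model with unit variance."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-2.0, 2.0)  # draw from a (hypothetical) uniform prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        # Keep theta if the simulated summary statistic is close to the observed one.
        if abs(mean(simulated) - observed_mean) <= tolerance:
            accepted.append(theta)
    return accepted


posterior_draws = abc_rejection(
    observed_mean=0.5, n_obs=50, n_draws=2000, tolerance=0.1, rng=random.Random(0)
)
print(len(posterior_draws), "accepted draws, posterior mean about", round(mean(posterior_draws), 2))
```

        The low acceptance rate in even this toy setting hints at the computational cost noted among the limitations, and the accepted draws depend directly on the chosen prior.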

        Speaker: Henrik Stahl (University of Applied Sciences Darmstadt)
      • 14:45
        Estimating the Causal Effect of a Cumulative Exposure on an Outcome in Studies Prone to Confounding and Irregular Visits 15m

        Non-experimental data, such as electronic medical records, are often used in causal inference to estimate the effect of an exposure on an outcome of interest. However, this type of data can be affected by potential sources of bias in causal analyses. For example, these data do not come from a study design that ensures a balance of patient characteristics between exposure groups, a problem known as confounding. Patients are also observed irregularly over time, which can lead to selection bias. Methods have recently been proposed to address these challenges, but they have mostly focused on acute treatment effects, which occur rapidly and are short-term (Pullenayegum et al., 2023; Coulombe and Yang, 2024).

        In this presentation, we propose a methodology, the Inverse Density Exposure and Monitoring (IDEM) estimator, to consistently estimate the causal effect of a cumulative exposure on an outcome measured repeatedly in non-experimental longitudinal studies. Under certain assumptions, the proposed estimator accounts for delayed treatment effects and allows for causal estimation in the presence of time-fixed confounding and irregular observation times of the outcome. To achieve this, the Inverse Density of Treatment (IDT) and Inverse Intensity of Visits (IIV) weights are combined using generalized estimating equations to derive the IDEM estimator. Using properties of two-step estimators, we present results on its asymptotic distribution, which is valid under a set of conditions.

        In a simulation study with four scenarios in which the exposure and visit models vary, the causal estimates obtained with IDEM were compared with those obtained using the ordinary least squares (OLS) estimator and two other simply weighted estimators, which we refer to as IDT and IIV. Across all scenarios, IDEM showed the smallest bias. The four estimators were then applied to the Phenobarb dataset (Grasela and Donn, 1985) to estimate the causal effect of cumulative phenobarbital administration on its irregularly measured blood concentrations in newborns, with weight and Apgar score available at baseline.

        References:
        Pullenayegum, E. M., Birken, C., Maguire, J., and TARGet Kids! Collaboration. (2023). Causal inference with longitudinal data subject to irregular assessment times. Statistics in Medicine, 42(14), 2361–2393.
        Coulombe, J., and Yang, S. (2024). Multiply robust estimation of marginal structural models in observational studies subject to covariate-driven observations. Biometrics, 80(3).
        Grasela Jr, T. H., and Donn, S. M. (1985). Neonatal population pharmacokinetics of phenobarbital derived from routine clinical data. Developmental Pharmacology and Therapeutics, 8(6), 374–383.

        Speaker: Mathilde Dicaire-Cartier (Institute for Medical Information Processing, Biometry, and Epidemiology, Faculty of Medicine, LMU Munich, Germany; Munich Center for Machine Learning, Munich, Germany; Department of Mathematics and Statistics, Université de Montréal, Montréal, Canada)
      • 15:00
        Variable Selection in Meta-Regression with Suspected Interaction Effects 15m

        Meta-analyses synthesise the results of multiple independent studies to obtain more comprehensive knowledge about a research topic. When study outcomes vary, meta-regression can be used to identify potential sources of heterogeneity across studies. One complication is the typically small number of studies available. For this reason, interaction terms are often omitted from meta-regression models, despite recommendations from previous research to consider them. In the meta-analysis on acute heart failure by Kimmoun et al. (2021), this omission may have led to misleading or incorrect results. This work aims to determine which variable selection methods are able to identify moderator variables with an effect in a meta-regression, particularly in settings where interaction effects are suspected. The comparison includes commonly used methods, such as significance testing and information-theoretic criteria, as well as a tree-based algorithm called meta-CART, introduced by Li et al. (2020). This machine learning approach shows great promise for identifying interaction effects owing to its underlying tree structure. I conducted a simulation study varying the number of studies, the magnitude of heterogeneity, and the number and measurement scale of moderator variables to evaluate each method’s performance under realistic conditions. The methods were also applied to the illustrative example of Kimmoun et al. (2021) and compared with the published results. The results demonstrate that, in comparison to meta-CART, conventional selection methods struggle with a high ratio of moderators to studies, a problem that is magnified when interaction effects are included. Meta-CART is a robust method with comparatively low computational effort. Overall, the findings highlight the strong potential of tree-based methods for variable selection in the presence of interaction effects, while emphasising the continuing need for caution regarding spurious findings in meta-analytic research.

        Speaker: Paula Lorenz (TU Dortmund University)
    • 16:00 19:00
      Annual German Region Meeting Room 1 A

      Room 1 A

    • 16:00 19:00
      Excursions 3h x Outdoor

      x Outdoor

    • 19:30 22:30
      Gala Dinner 3h

      Ale Gloria restaurant, plac Trzech Krzyży 3

    • 09:00 10:00
      P3 Plenary Session: Susanne Strohmaier: Keynote lecture Room 1 A

      Room 1 A

      Convener: Dominik Heinzmann (Roche)
      • 09:00
        Causal inference in practice – One (?) estimand, many (!) analytical decisions 1h

        For many medical research questions, randomization is unethical or infeasible and decisions have to be informed by results based on observational – often routinely collected – data. Such data have enormous potential to inform stakeholders, including health policy makers, health professionals and the general public, about the impact of their decisions on public as well as individual health. However, impact evaluations depend not only on the quality of the underlying data source, but crucially also on the choices made for the (statistical) analysis approach. These choices begin with translating the scientific objectives into well-defined target estimands. Even if a primary research question can be identified in this way, conceptualizing a corresponding statistical analysis plan (SAP) targeting the estimand of choice necessitates numerous follow-up decisions, ranging from data cleaning and missing data handling to the main analysis model, among many others. This can lead to many sensible paths of analysis, where each path is scientifically justifiable yet may yield different results. This variation in results is often perceived to reflect purely individual researchers' opinions or biases rather than an inherent feature of complex analyses, and may have contributed to the evident skepticism toward science in the general public.
        Current research within our group focuses on methods to quantify this variation of effects due to the multiplicity of analysis strategies. Inspired by existing research, we suggest an approach that involves developing a preferred SAP as well as a meta-SAP comprising sensible alternative decisions at each step of an SAP. Ultimately, we want to present the preferred analysis results in the context of a distribution of plausible estimates to demonstrate how and to what extent analytical decisions affect the resulting estimates.
        Based on real-world data applications involving causal questions in time-to-event settings of varying complexity, I will highlight the importance of evaluating the robustness of results to alternative design and analysis decisions and reflect on conceptual challenges as well as practical hurdles. After all, acknowledging and transparently communicating the influence of analytical decisions is essential to strengthening the credibility of evidence based on observational data.

        Speaker: Susanne Strohmaier (Medical University of Vienna, Center for Public Health, Department of Epidemiology, Vienna)
    • 10:00 10:45
      Coffee break / Poster session 45m Foyer

      Foyer

    • 10:00 17:15
      Poster session - continued x Poster display area

      x Poster display area

    • 10:45 12:15
      Clinical trials 3 Room 13 B

      Room 13 B

      Convener: Florian Frommlet (Medical University Vienna)
      • 10:45
        A randomized basket trial design for dose optimization based on Bayesian model averaging using spike-and-slab priors 18m

        The FDA initiated Project Optimus and issued guidance on dose optimization that recommends randomized parallel dose-response cohorts to generate additional data at promising dose levels and implies that different dosages may be needed for different indications. In addition to dose optimization, with recent advancements in precision medicine and cancer biology, the development of cancer treatments has shifted toward the search for agents targeted to specific molecular profiles that may appear in more than one type of cancer. Basket trials are clinical trial designs that enable the simultaneous assessment of a new treatment in multiple indications. In light of the FDA's dose optimization perspective and the recent trend toward basket trials in early-phase clinical development, this paper proposes a dose-ranging basket trial design based on a Bayesian model-averaging approach that considers efficacy and toxicity outcomes, where indications and dose levels define baskets. A key benefit of the proposed approach is that it explicitly accounts for the possible heterogeneity of response rates among baskets. Our simulation study shows that the proposed approach outperforms other methods, offering higher statistical power, better control of Type I error rates, precise optimal dose selection, and sample size savings in various scenarios with heterogeneous treatment effects between baskets.

        Speaker: Belay Birlie Yimer (Astellas Pharma Europe Ltd., Addlestone, United Kingdom)
      • 11:03
        A Basket Trial Design for Unequal Sample Sizes Based on Power Priors 18m

        Basket trials examine the efficacy of a single intervention simultaneously in several patient subgroups. They are currently mostly applied in oncology, where the subgroup assignment is based on medical characteristics such as a common biomarker. This can result in small sample sizes within subgroups that are also likely to differ. Several designs for the analysis of basket trials have been proposed in the literature that share information across subgroups to increase power. Many designs utilise Bayesian methods, such as hierarchical modelling or model averaging. A recently proposed design based on power priors uses empirical Bayes methods to increase computational efficiency compared to fully Bayesian designs. The design incorporates data from all subgroups using a weighted likelihood that shares information according to the similarity of the subgroups. However, if the sample sizes differ, there is a risk that the information from the small subgroups will be overwhelmed by that from the large subgroups.
        We extend the power prior design by applying a weighting method, previously suggested for sharing information from historical data, that accounts for unequal sample sizes by limiting the amount of information shared between subgroups. The new weights take the pairwise ratio of subgroup sample sizes into account, such that the effective sample size that is shared from a subgroup cannot exceed the sample size of the subgroup of interest. Using a simulation study, we systematically compare the power prior design with previously suggested weights and the new information-limiting weighting method to other Bayesian basket trial designs with respect to the expected number of correct decisions, type 1 error rates and power. We consider a range of different scenarios with different true response probabilities and sample sizes across subgroups.
        The results of the simulation study show that the new information-limiting weights improve the results of the original power prior design. In terms of the expected number of correct decisions, the improved power prior design performs slightly better than the competing designs in all sample size scenarios. In scenarios with some active and some inactive baskets, the inflation of the type 1 error rates is less severe than with unlimited sharing.
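
        The information-limiting idea described above can be illustrated with a minimal sketch. It assumes a beta-binomial model with a Beta(a, b) prior and a user-supplied similarity weight in [0, 1] for each pair of baskets; the cap min(1, n_target / n_source) is one simple way to keep the borrowed effective sample size at or below the size of the basket of interest. The function names and the exact form of the cap are illustrative assumptions, not the weights used in the talk.

```python
def capped_weight(similarity, n_target, n_source):
    """Scale a similarity weight so that the borrowed effective sample size
    (weight * n_source) never exceeds n_target. Hypothetical capping rule."""
    cap = min(1.0, n_target / n_source)
    return similarity * cap


def power_prior_posterior(k, responders, sizes, similarity, a=1.0, b=1.0):
    """Return the Beta posterior parameters for basket k.

    Basket k contributes its own data with weight 1; every other basket j
    contributes its data discounted by the capped weight.
    """
    alpha = a + responders[k]
    beta = b + sizes[k] - responders[k]
    for j in range(len(sizes)):
        if j == k:
            continue
        w = capped_weight(similarity[k][j], sizes[k], sizes[j])
        alpha += w * responders[j]
        beta += w * (sizes[j] - responders[j])
    return alpha, beta


# Made-up example: a small basket (n = 10) next to a large one (n = 40),
# with full similarity assumed between them.
sizes = [10, 40]
responders = [3, 12]
similarity = [[1.0, 1.0], [1.0, 1.0]]
small_basket = power_prior_posterior(0, responders, sizes, similarity)
large_basket = power_prior_posterior(1, responders, sizes, similarity)
```

        In this toy setting the small basket borrows at most 10 patients' worth of information from the large one, whereas the large basket may use all of the small basket's data.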

        Speaker: Lukas Baumann (Institute of Medical Biometry, University of Heidelberg)
      • 11:21
        How to optimize dynamic borrowing in basket trials – A utility-based framework 18m

        Modern therapeutic agents in cancer therapy often target specific genetic traits of the tumor. Whenever these traits are independent of the tissue in which the tumor is located, the therapeutic agent may be tissue-agnostic, meaning that it can be applied regardless of location. Clinical trials for such tissue-agnostic therapies often have small sample sizes. Hence, it is efficient to recruit patients regardless of their tumor location in a single trial (e.g. NSCLC, colorectal cancer, and multiple myeloma). Such a trial is called a basket trial, as all subcohorts are “collected in a single basket”.
        Basket trials come with a statistical challenge concerning analysis. On the one hand, a completely separate analysis of the different cohorts is guaranteed to be unbiased at the price of low power due to the small sample size. On the other hand, a pooled analysis will have higher power at the price of potential bias in case of heterogeneous treatment effects in the strata. For this reason, a plethora of borrowing methods have been suggested, i.e. statistical methods which dynamically decide on the amount of information that the different cohorts will share with one another.
        The planning of basket trial designs implementing dynamic borrowing is complicated by the fact that their operating characteristics need to be tuned to the specific trial setting and assumed response scenarios. We suggest a framework for tuning basket trial designs, consisting of the choice of an optimization algorithm, a utility function as optimization target, and performance measures of interest. The presented utility functions aim at defining a trade-off between type I error and power, either locally in the separate cohorts or globally across the trial as a whole. In this way, both Bayesian and frequentist methods for borrowing can be optimized with respect to frequentist performance measures, which allows for easy communication in clinical settings as well as objective comparison between different borrowing methods. In a comprehensive simulation study, we investigated the framework in the optimization of a Bayesian basket trial design suggested by Fujikawa et al. (2020). The simulation results highlighted the benefit of optimizing performance measures across a range of possible outcome scenarios (from no stratum responding to all strata responding) and the need for adapting tuning parameters to the particular trial setting.
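
        The three ingredients named above (an optimizer, a utility function, and performance measures) can be sketched in a deliberately simplified setting. As a stand-in for a borrowing design, the example tunes the success threshold of a single-cohort exact binomial test; the utility (power minus a penalty for exceeding the nominal type I error) and all names and default values are hypothetical and are not taken from the talk.

```python
from math import comb


def tail_prob(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(c, n + 1))


def utility(type1, power, alpha=0.05, penalty=10.0):
    """Hypothetical utility: reward power, penalise type I error above alpha."""
    return power - penalty * max(0.0, type1 - alpha)


def tune_threshold(n, p0, p1, alpha=0.05, penalty=10.0):
    """Grid search over the threshold c ('declare success if X >= c').

    Type I error is evaluated at the null rate p0, power at the target rate p1.
    Returns the threshold with the highest utility and that utility.
    """
    best = None
    for c in range(n + 2):
        u = utility(tail_prob(c, n, p0), tail_prob(c, n, p1), alpha, penalty)
        if best is None or u > best[1]:
            best = (c, u)
    return best


best_c, best_u = tune_threshold(n=20, p0=0.2, p1=0.5)
```

        In a real basket design the grid search would be replaced by the chosen optimization algorithm, and the exact binomial calculations by simulated operating characteristics across the assumed response scenarios.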

        Speaker: Lukas D Sauer (Institute of Medical Biometry, Heidelberg University)
      • 11:39
        AI-Assisted Methodology Validation Before Data Collection: The E-PICOS Framework for Robust Clinical Trials 18m

        Background:
        A substantial proportion of clinical research waste originates from fundamental methodological flaws—improper study design, insufficient power, inappropriate statistical methods, and non-compliance with reporting guidelines. While many AI tools attempt to support data analysis, none address the critical upstream phase: validating methodology before data collection. To address this gap, we developed E-PICOS, an AI-assisted framework that ensures methodological rigor at the earliest stages of a clinical trial.

        Methods:
        E-PICOS integrates three intelligent components:
        (1) Protocol Validation AI, which evaluates PICOS structure, identifies risks of selection bias, assesses sample frame adequacy, and recommends appropriate trial designs and estimands;
        (2) Statistical Guidance Engine, which supports sample size calculation based on the Minimum Clinically Important Difference (MCID), ensuring clinical—not only statistical—meaningfulness;
        (3) Reporting Optimization AI, which checks trial protocols and manuscripts for compliance with CONSORT, SPIRIT, ICH E6(R3), and estimand-based reporting frameworks.
        Importantly, E-PICOS does not perform statistical computation itself; analyses (e.g., t-tests, ANOVA, ROC curves, Kaplan–Meier, Cox regression) are executed using established statistical engines, preserving validity and reproducibility. The AI interprets results in accordance with Good Biostatistical Practices and supports transparent reporting.

        Results:
        Across multiple real-world implementations, E-PICOS identified methodological issues at the protocol stage—including underpowered designs, inappropriate estimands, insufficient justification of effect sizes, and missing bias-mitigation strategies. Early correction of these issues improved protocol quality, reduced anticipated research waste, and enhanced compliance with international trial standards. E-PICOS also supported manuscript preparation by detecting major/minor deficiencies and generating structured compliance reports without storing user data.

        Conclusion:
        E-PICOS represents a novel paradigm in clinical trial methodology: AI-assisted validation before data collection. By combining methodological expertise, MCID-based power planning, and guideline-driven oversight of reporting, E-PICOS enhances rigor, transparency, and reproducibility while maintaining full data sovereignty. This framework offers a scalable and ethical model for improving clinical trial quality in the era of data-intensive research.

        Speaker: Arzu Kanik (AB Health Tech)
      • 11:57
        Using Confidence Distributions in Final and Interim Analyses for Single-Arm Studies or Platform Trials Consisting of Single-Arm Studies 18m

        Confidence distributions are a frequentist alternative to Bayesian posterior distributions. They summarize the knowledge and uncertainty about an unknown model parameter in the form of a probability distribution on the parameter space, just like a posterior distribution, without assuming that the parameter of interest is a random variable. Although confidence distributions are a relatively old concept, they are not well known and have not been used much until recently.

        As part of the EU-PEARL project, two platform-basket trials were developed for neurofibromatosis type I and II, which are rare diseases affecting mainly children. These platform-basket trials were designed as a collection of single-arm proof-of-concept or phase II trials with a binary endpoint, and with the option to include an interim analysis allowing for early stopping in case of projected lack of success.

        In this presentation, we provide statistical analysis strategies based on confidence distributions for single-arm proof-of-concept (PoC) or single-arm phase I or phase II studies, and for master protocol trials that are a series of single-arm studies with a binary endpoint. We present decision rules for the final analysis as well as for interim analyses. For interim analyses, we focus on rules which allow for early stopping because of projected lack of success at the final analysis, and we use a frequentist predictive distribution to define such rules.

        The operating characteristics of our decision rules can be calculated exactly (no simulations required) in the case of a binary endpoint. We show how this can be done and we also compare the performance of these new rules with that of corresponding Bayesian decision rules, or decision rules based on stochastic curtailment.
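
        Why no simulation is needed with a binary endpoint can be seen from a minimal sketch: the number of responders takes only finitely many values, so the probability of any decision can be obtained by summing binomial probabilities. The two-stage rule below (stop for futility if the interim count is at most futility_max, declare success if the final count is at least success_min) is a hypothetical threshold rule used only for illustration; it is not the confidence-distribution rule of the talk.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def prob_success(p, n1, n2, futility_max, success_min):
    """Exact probability of declaring success in a two-stage single-arm trial.

    n1 patients are observed before the interim analysis and n2 afterwards.
    The trial continues only if the interim count exceeds futility_max.
    """
    total = 0.0
    for x1 in range(futility_max + 1, n1 + 1):
        # Responders still needed in stage 2 to reach the success threshold.
        need = max(success_min - x1, 0)
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        total += binom_pmf(x1, n1, p) * tail
    return total


# Made-up design: 10 + 10 patients, stop if at most 2 of the first 10 respond,
# success if at least 8 of 20 respond. Evaluate at two response rates.
type1_error = prob_success(0.2, 10, 10, 2, 8)
power = prob_success(0.5, 10, 10, 2, 8)
```

        Setting futility_max to -1 disables the interim stop, in which case the function reduces to a single-stage binomial tail probability.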

        Reference:

        G. Heimann, P. Jacko, and T. Parke, “Using Confidence Distributions in Final and Interim Analyses for Single-Arm Studies or Platform Trials Consisting of Single-Arm Studies”, Statistics in Medicine 44, no. 20-22 (2025): e70251.

        Speaker: Günter Heimann (Independent Consultant)
    • 10:45 12:15
      Ecological and agricultural statistics 1 Room 12

      Room 12

      Convener: Hans-Peter Piepho (University of Hohenheim)
      • 10:45
        Clustering of indicator species according to soil properties in the regeneration phase of pedunculate oak (Quercus robur L.) forests 18m

        Regeneration of forest ecosystems is crucial for preserving their structure, function and long-term stability. This research analyses the correlation and similarity of the occurrence of 11 species characteristic only of the regeneration phase with respect to selected soil properties. The aim of the research is to group species according to soil properties and to determine differences between clusters through Ellenberg's (1974) indicator values. Soil properties were analysed based on the content of carbon and nutrients (Ntot, Ctotal, Corg, B, Ca, Cu, Fe, K, Mg, Mn, Na, P, Zn) and pH values (pHH2O, pHCaCl) in the soil. Seven categories were used to classify the species' occurrence: 0: 0%; 1: (0, 1)%; 2: [1, 10)%; 3: [10, 25)%; 4: [25, 50)%; 5: [50, 75)%; 6: ≥75%.
        The research was carried out on 10 plots (20 × 20 m) in the regeneration phase (shelterwood cutting) of pedunculate oak forest in the eastern part of Croatia (Spačva area).
        Spearman rank correlation coefficients were obtained between soil chemistry variables and species occurrence. Based on these correlation coefficients, species were clustered. First, hierarchical clustering (Euclidean distance and Ward's method) was used, which grouped the species into two main clusters. We then compared these results with the non-hierarchical k-means procedure, taking the number of clusters from the Ward solution. Cluster 1 consists of the species Calamagrostis epigejos (L.) Roth, Epilobium angustifolium L. and Solidago gigantea Aiton. These species indicate open habitats, rich in nitrogen, and are mostly indifferent to temperature and moisture. With regard to soil reaction, these are indifferent species or indicators of heavily acidic soils.
        Cluster 2 consists of 8 species (Urtica dioica L., Erigeron annuus (L.) Desf., Solanum dulcamara L., Quercus robur L., Acer tataricum L., Cirsium palustre (L.) Scop., Eupatorium cannabinum L., Lapsana communis L.) with lower indicator values for light than the species of Cluster 1, and according to Ellenberg somewhat higher values for moisture and soil reaction.
        As the values of pHH2O, pHCaCl, Ca, Mg and Mn decrease, the species of Cluster 1 become more strongly represented, while those of Cluster 2 become less represented. For increased values of K, the representation of Cluster 1 species increases and that of Cluster 2 species decreases. An increase in C in the soil decreases the representation of species in Cluster 1 and increases it in Cluster 2. B, Fe and Zn do not show any significant relationships with species occurrence.
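
        The first analysis step described above, a Spearman rank correlation between an ordinal occurrence category and a soil variable, can be sketched as follows. Ties, which are unavoidable with seven occurrence categories, are handled with average ranks. The plot data are made up for illustration, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data for 10 plots: occurrence category (0-6) of one species
# and soil pH measured on the same plots.
occurrence = [0, 1, 1, 2, 3, 3, 4, 5, 5, 6]
soil_ph = [6.8, 6.5, 6.6, 6.1, 5.9, 6.0, 5.4, 5.2, 5.0, 4.7]
rho = spearman(occurrence, soil_ph)
```

        Repeating this for every species and soil variable yields one correlation profile per species, which is the kind of input the clustering step operates on.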

        Speaker: Anamarija Jazbec (University of Zagreb Faculty of Forestry and Wood Technology)
      • 11:03
        Quality over Quantity? - The optimised allocation of quality samples in Bavarian post-registration trials in perennial ryegrass 18m

        In Germany, cultivars are tested for regional recommendations in federal state cultivar trials, which take the form of multi-environment trials (METs). Their primary objective is to identify cultivars that are best suited for regional production in agro-ecological zones. For perennial ryegrass, current selection decisions are predominantly based on yield. Incorporating additional quality characteristics could improve the selection process. In Bavarian state cultivar trials, additional quality parameters were recorded for perennial ryegrass to describe cultivar quality. Due to financial constraints, the number of samples sent to the laboratory for analysis each year is limited. Consequently, single-plot samples were partly mixed across replicates within a site and analysed as mixed or composite samples. The decision on which single-plot samples to combine into mixed samples was made intuitively. The objective of the current study was to find an optimal distribution of single-plot and mixed samples for quality parameters in perennial ryegrass under a constrained number of laboratory samples. Data from METs of perennial ryegrass across three sites in Bavaria during the years 2017–2023 were analysed. The analysis comprises two main steps. First, variance components at each site across trial cycles were estimated. In this context, a general strategy for dealing with different year effects in perennial crops is proposed and applied. The second step involves the simulation of data representing alternative sampling designs. Data from the alternative designs were analysed with variance components fixed to the estimates obtained in the first step. The precision, measured as power and standard error of treatment differences, was assessed for each alternative design. The most efficient design in this framework is an even distribution of two mixed samples per cultivar and cut at each site. Where the residual error variance is high, more samples are worthwhile.
        The results enable the implementation of quality aspects into the trialling system without exceeding the given limits of workload or financial constraints.

        Speaker: Anne-Katrin Gorn (Bavarian Research Center for Agriculture)
      • 11:21
        Challenges and Perspectives on Using Environmental Covariates in Multi-Environment Trials: A Case Study in Sugar Beet 18m

        Environmental covariates (ECs) have become increasingly abundant and accessible over the past two decades, driven by advancements in remote sensing, data acquisition technologies, and the declining costs of environmental monitoring. Incorporating ECs into multi-environment trials (METs) has several applications, including improving the understanding of genotype-by-environment interactions, serving as selection criteria, and supporting farmers in variety decisions.
        This study explores the practical and methodological challenges of using ECs in METs of field crops. Approaches are illustrated using commercial sugar beet data on over 4000 genotypes across more than 20 locations spread over several countries, provided by Strube D&S GmbH in collaboration with the German Research Foundation and the University of Hohenheim. The main focus is on strategies for integrating ECs into linear mixed models - both directly and via synthetic approaches. We analyze several EC data sources, including public and private weather stations within Germany and across countries, highlighting issues such as data quality, interpolation uncertainty, and consistency. Additionally, we discuss approaches for averaging ECs over biologically meaningful periods and explore feature engineering techniques to transform raw data into informative predictors. This work aims to support the robust and interpretable use of ECs in METs of field crops conducted across heterogeneous environments.

        Speaker: Maksym Hrachov (University of Hohenheim)
      • 11:39
        Optimizing the allocation of trials to sub-regions in crop variety testing: different conditions in different years 18m

        New crop varieties are extensively tested in multi-environment trials in order to obtain a solid basis for recommendations to farmers. When the target population of environments is large, a division into sub-regions is often advantageous. If the same set of genotypes is tested in each of the sub-regions, a linear mixed model (LMM) may be fitted with random genotype-within-sub-region effects. The first analytical results on optimizing the allocation of trials to sub-regions were obtained by Prus and Piepho (2021, 2024). Prus and Piepho (2021) considered one-year experiments. Prus and Piepho (2024) investigated multi-year experiments, for which the same conditions were assumed for all years. In practice, however, the number of genotypes or even the total number of locations may change from year to year. In this work, a general LMM is considered in which the number of genotypes and the total number of locations may differ between years. These numbers turn out to influence the optimal allocation of trials. The analytical results are illustrated with real data examples.
        Prus, M. and Piepho, H.-P. (2021). Optimizing the allocation of trials to sub-regions in multi-environment crop variety testing. Journal of Agricultural, Biological and Environmental Statistics, 26, 267–288.
        Prus, M. and Piepho, H.-P. (2024). Optimizing the allocation of trials to sub-regions in crop variety testing with multiple years and locations. Journal of Agricultural, Biological and Environmental Statistics.

        Speaker: Maryna Prus (University of Hohenheim)
      • 11:57
        Assessment of a Low-Cost System for 3D Image Acquisition in Beef Cattle 18m

        The adoption of precision agriculture is hampered by the high cost of commercial technologies, particularly for small-scale producers. For this reason, it is necessary to develop low-cost, accessible solutions that can be applied directly in the productive environment. Within this context, this work presents the development of a low-cost system for acquiring 3D images of beef cattle, enabling morphometric measurements without displacing the animals from their environment. The system was validated statistically by comparing manual measurements with automated ones. The planning of data acquisition included control of internal and external factors that could interfere with collection, as well as detailed recording of experimental conditions to ensure reproducibility.
        Tests were conducted to identify capture and processing configurations, including variations in camera positioning, acquisition environment, and the software employed. Image capture was performed using two and three Kinect v2 cameras operating simultaneously, as it was necessary to determine how the number of cameras influenced the 3D reconstruction of the animal, both in terms of shape quality and processing time. The first challenge was to evaluate image quality and identify which conditions directly influenced noise and the accuracy of point clouds. The second challenge involved selecting an efficient method for image fusion, since manual alignment is impractical given the volume of data generated. Subsequently, efforts focused on establishing a processing pipeline capable of operating within reduced timeframes, respecting the practical limitations of farm use.
        Once the acquisition and fusion of point clouds were stabilized, the final stage concentrated on extracting body measurements with the objective of relating them to the animal’s live weight and carcass weight. Statistical methods were applied to establish correlations between the measurements and weight. Measurements were performed in the animals’ natural environment, avoiding stress and behavioral changes during capture. Furthermore, the entire system was built using open-source software, eliminating additional costs and reinforcing the feasibility of the solution for small and medium-scale producers. The combination of low cost, operation in real farm environments, and absence of direct contact with the animals positions this system as a practical alternative for morphometric measurements, with potential to integrate into management routines, improve production decisions, and expand access to precision agriculture technologies.

        Speaker: Milene Figueira (UFRPE)
    • 10:45 12:15
      Evidence synthesis 1 Room 14

      Room 14

      Convener: Willi Sauerbrei (University of Freiburg)
      • 10:45
        Parametric nonlinear dose-response meta regression 18m

        In epidemiology dose-response meta analysis often refers to fitting a meta regression model that describes a linear trend in the outcome ("response") as a function of the exposure ("dose"), based on aggregated data from a number of studies.

        Fixed- and random-effects extensions for handling nonlinear dose-response for odds ratios, relative risks and differences in means through the use of fractional polynomial models and cubic spline models have been proposed (e.g., Crippa & Orsini, 2016). Multiple correlated estimates from the same studies are handled through derivation of explicit formulas for the corresponding variance-covariance matrix. A related approach, which also involves semi-parametric modelling, is due to Xu & Doi (2020), who proposed the use of a robust sandwich-type variance-covariance estimator as an alternative means for handling correlated estimates.

        In some cases it may, however, be desirable to be able to impose more structure on the nonlinear dose-response trend through the use of a parametric nonlinear function. One key advantage of parametric modelling is that interpretable quantities are more readily available, possibly shifting the focus of nonlinear dose-response meta-analysis further away from being mostly used for descriptive and graphical purposes and more towards inference useful for informing public health decision making. There has been surprisingly little methodological work on parametric modelling of nonlinear dose-response trends in a meta-analytic context. In one study, a parametric three-parameter sigmoidal log-logistic dose-response model, often referred to as the Emax model, was fitted by means of four different fixed-effects meta-analytic approaches (Langford et al., 2018).

        The aim of this study is to outline a general methodology for fitting a wide range of parametric nonlinear dose-response meta regression models that also include study-specific random effects. Estimation will involve a combination of nonlinear least squares estimation and a profile likelihood approach. Both simulations and data examples will be used to demonstrate the usefulness of the methodology.
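
        As a minimal sketch of why a parametric curve yields directly interpretable quantities, the following evaluates a three-parameter sigmoidal Emax curve. The parameterization (emax, ed50, hill), the function name and the numbers are assumptions chosen for illustration; the abstract does not specify them, and no estimation step is shown.

```python
def emax_model(dose, emax, ed50, hill):
    """Three-parameter sigmoidal Emax (log-logistic) curve with zero response at dose 0.

    emax: maximal response, ed50: dose giving half of emax, hill: steepness.
    This parameterization is an assumption made for illustration.
    """
    if dose == 0:
        return 0.0
    return emax / (1.0 + (ed50 / dose) ** hill)


# Hypothetical parameter values, not taken from any study.
emax, ed50, hill = 4.0, 10.0, 2.0
half_response = emax_model(ed50, emax, ed50, hill)
curve = [emax_model(d, emax, ed50, hill) for d in (0, 5, 10, 20, 40)]
```

        By construction the response at dose ed50 equals half of emax, which is the kind of quantity a spline or fractional polynomial fit does not report directly.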

        References
        Crippa, A., Orsini, N. (2016). Dose-response meta-analysis of differences in means. BMC Medical Research Methodology, 2, 91.
        doi: 10.1177/0962280218773122

        Langford, O., Aronson, J. K., van Valkenhoef, G., Stevens, R. J. (2018). Methods for meta-analysis of pharmacodynamic dose-response data with application to multi-arm studies of alogliptin. Statistical Methods in Medical Research, 27, 564-578.
        doi: 10.1177/0962280216637093

        Xu, C., Doi, S. A. R. (2020). Dose-Response Meta-Analysis. Chapter 13 in Meta-Analysis: Methods for Health and Experimental Studies (pp. 267-283). Springer: Singapore.
        doi: 10.1007/978-981-15-5032-4_13

        Speaker: Christian Ritz (National Institute of Public Health (SDU))
      • 11:03
        Meta-analyses based on previous meta-analyses 18m

        Updating a meta-analysis (MA) by including additional studies is usually a straightforward exercise, as the relevant data are commonly reported in detail, i.e., effect estimates with standard errors for all studies. Matters are complicated, however, when only the summary of a previous analysis is available, i.e., the overall estimate with standard error. For instance, this is sometimes the case when an individual participant data (IPD) MA only reports limited data. An ad-hoc solution that has sometimes been adopted is to include the previous MA's estimate as a single "study" in the new analysis, but in the commonly adopted hierarchical modelling framework, this will lead to misalignment of hierarchy levels from previous and current analyses. We will discuss the problem in detail, including whether or when the ad-hoc solution may be appropriate, or what adjustments may be made within Bayesian and frequentist frameworks. Approaches are motivated and illustrated using examples from cardiovascular research.

        Speaker: Christian Röver (Department of Medical Statistics, University Medical Center Göttingen)
      • 11:21
        Proper Back-Transformations for the Random-Effects Model in Meta-Analysis 18m

        Meta-analyses often involve transforming bounded effect size measures, such as correlation coefficients or odds ratios, onto a real-valued scale prior to estimation. The results are then back-transformed to the original scale for interpretation purposes. However, in the standard random effects model for meta-analysis, simply applying the inverse transformation function generally does not yield an estimate of the mean but of the median effect size, a phenomenon known as transformation bias. This issue is frequently overlooked in practice, leading to an incorrect definition of the estimate or, equivalently, systematic bias in the mean effect size estimates. Integral back-transformations provide a more accurate approach.

        We give an overview of different types of back-transformations and a general formulation of the integral back-transformation. In addition, methods for deriving corresponding back-transformed confidence intervals (CIs) are presented. Approaches are compared, aiming for CIs for the mean, median, and mode effect size.

        We analyze differences between these back-transformation approaches in a simulation study and visualize the transformation bias analytically and with example data sets. Furthermore, we address an inconsistency that arises when non-symmetrical transformation functions are applied and the CI for the back-transformed mean is interpreted as a hypothesis test. Software implementations of the integral back-transformations in the R package 'metafor' are presented for various effect sizes including correlation coefficients, proportions, odds and risk ratios, and Cronbach's alpha.
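
        The gap between the naive and the integral back-transformation can be sketched numerically. The code below assumes true effects that are normally distributed on the transformed scale with known mean mu and heterogeneity tau (both hypothetical values), and it ignores estimation uncertainty. It is not the 'metafor' implementation; the function name is made up for this illustration.

```python
import math


def integral_back_transform(inverse, mu, tau, n=20001, width=10.0):
    """Approximate E[inverse(theta)] for theta ~ N(mu, tau^2) with the trapezoidal rule."""
    lo, hi = mu - width * tau, mu + width * tau
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        t = lo + i * step
        density = math.exp(-0.5 * ((t - mu) / tau) ** 2) / (tau * math.sqrt(2 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * inverse(t) * density * step
    return total


mu, tau = 0.5, 0.4  # hypothetical values on the transformed scale

# Log odds ratio: exp(mu) is the median odds ratio, not the mean.
naive_or = math.exp(mu)
mean_or = integral_back_transform(math.exp, mu, tau)
closed_form_or = math.exp(mu + tau ** 2 / 2)  # known mean of a log-normal

# Fisher's z: tanh(mu) is the median correlation, not the mean.
naive_r = math.tanh(mu)
mean_r = integral_back_transform(math.tanh, mu, tau)
```

        For the log transformation the integral matches the closed-form log-normal mean and exceeds the naive value; for Fisher's z with these values the mean correlation lies below tanh(mu).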

        Speaker: Jan-Bernd Igelmann (TU Dortmund University)
      • 11:39
        Bayesian conjugate analysis for federated statistical inference 18m

        In many biomedical research settings, sufficiently large sample sizes can only be achieved by combining data from multiple collection sites (e.g., hospitals). However, pooling individual participant data in a central server is often restricted due to privacy and regulatory constraints. Federated inference addresses this challenge by distributing the statistical analysis across local sites, allowing pooled inference in a central server using privacy-preserving summary statistics. Although federated inference methods exist in a frequentist framework, the full potential of Bayesian approaches in this area has not yet been explored. Bayesian methods offer distinct advantages, including the ability to incorporate prior knowledge and to perform predictive checks for model criticism. A recently published Bayesian method for federated inference relies on approximate solutions even in linear regression scenarios where exact solutions are available. We therefore propose a different approach to federated inference using Bayesian conjugate analysis (BCA), which is communication-efficient and mathematically convenient. Moreover, BCA yields exact, lossless inference for linear regression problems, that is, producing the same posterior distribution as if the pooled data had been analyzed. We further show that BCA can also be used as an approximation for more general inference problems where the parameters to be estimated are asymptotically normal, such as generalized linear models. Finally, the BCA approach naturally lends itself to Reverse-Bayes analysis, which can be used for computationally efficient predictive checks and identification of outlier sites. An implementation of BCA is freely available through the confeR package (conjugate federated analysis in R).
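
        The lossless property can be illustrated in a deliberately simplified setting: one regression coefficient without intercept, a known error variance and a normal prior. These simplifications, the simulated data and all function names are assumptions for illustration only; this is not the confeR interface.

```python
import random


def site_summary(xs, ys):
    """Sufficient statistics a site shares: sum of x^2 and sum of x*y."""
    return sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys))


def posterior(sxx, sxy, prior_mean, prior_var, sigma2):
    """Conjugate normal posterior for beta in y = beta * x + noise (noise variance known)."""
    precision = 1.0 / prior_var + sxx / sigma2
    mean = (prior_mean / prior_var + sxy / sigma2) / precision
    return mean, 1.0 / precision


rng = random.Random(1)
sigma2 = 1.0
sites = []
for n in (30, 50, 20):  # three hypothetical sites of different size
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [0.7 * x + rng.gauss(0, sigma2 ** 0.5) for x in xs]
    sites.append((xs, ys))

# Federated analysis: only the two sums per site reach the central server.
summaries = [site_summary(xs, ys) for xs, ys in sites]
fed_sxx = sum(s[0] for s in summaries)
fed_sxy = sum(s[1] for s in summaries)
fed_mean, fed_var = posterior(fed_sxx, fed_sxy, 0.0, 10.0, sigma2)

# Reference analysis: all individual records pooled in one place.
all_x = [x for xs, _ in sites for x in xs]
all_y = [y for _, ys in sites for y in ys]
pooled_mean, pooled_var = posterior(*site_summary(all_x, all_y), 0.0, 10.0, sigma2)
```

        Because the posterior depends on the data only through additive sums, the federated and pooled posteriors coincide up to floating-point rounding.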

        Speaker: Peter Degen (Center for Reproducible Science and Research Synthesis, University of Zurich)
      • 11:57
        Evaluating Nonparametric Combination Methods for Aggregating N-of-1 Trials: A Simulation-Based Comparison with Meta-Analysis 18m

        Aggregating results from multiple N-of-1 trials has become increasingly relevant for evaluating personalized and digital health interventions, where inter-individual heterogeneity and complex temporal structures challenge traditional study designs. In earlier work, we compared the efficiency of three designs—parallel-group randomized controlled trials (RCTs), two-period crossover trials, and meta-analysis of multiple N-of-1 studies—and found that aggregating individual N-of-1 trials through random-effects meta-analysis can achieve comparable power with substantially smaller sample sizes. However, model-based meta-analytic estimators may be sensitive to violations of normality, time dependence, carryover, or incomplete sequences, which frequently arise in digital health applications.
        In this study, we propose a general framework for combining evidence from N-of-1 trials based on the Nonparametric Combination (NPC) methodology. NPC offers a flexible, assumption-light approach that combines p-values from multiple permutation tests without requiring independence or distributional assumptions. We develop a two-level aggregation strategy: (1) at the within-subject level and (2) at the across-subject level.
        To assess the methodological properties of NPC aggregation, we design an extensive simulation study reflecting realistic N-of-1 settings with varying intra- and inter-subject variability, AR(1) autocorrelation, carryover effects, non-Gaussian errors, and missingness. Scenarios are aligned with those used in our previous comparative work, enabling direct evaluation of meta-analysis versus NPC under matched conditions. Performance metrics include type I error control, power, bias and coverage for the overall effect estimate, robustness to misspecification, and computational cost.
        Simulation studies are currently ongoing, and results will be presented at the conference. Preliminary investigations suggest that NPC may offer improved robustness in heterogeneous or highly autocorrelated settings, while retaining competitive power relative to random-effects meta-analysis.
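
        The across-subject level of such an aggregation can be sketched with Fisher's combining function applied to per-subject permutation tests. This is a generic NPC illustration under strong assumptions: periods within a subject are treated as exchangeable (so autocorrelation and carryover are ignored), the test is one-sided, and the data are simulated. It is not the authors' two-level procedure, and the function name is hypothetical.

```python
import math
import random


def npc_fisher(subjects, n_perm=499, seed=0):
    """subjects: list of (outcomes, labels) with labels 0/1 per period.

    Returns (partial p-value per subject, combined p-value).
    """
    rng = random.Random(seed)

    def stat(y, lab):
        treated = [v for v, l in zip(y, lab) if l == 1]
        control = [v for v, l in zip(y, lab) if l == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    # Row 0 holds the observed statistics, the remaining rows the permuted ones.
    table = [[stat(y, lab) for y, lab in subjects]]
    for _ in range(n_perm):
        row = []
        for y, lab in subjects:
            shuffled = lab[:]
            rng.shuffle(shuffled)  # permute treatment labels within the subject
            row.append(stat(y, shuffled))
        table.append(row)

    total = n_perm + 1

    def p_values(row):
        # One-sided p-value per subject: share of rows at least as large as this row.
        return [sum(1 for r in table if r[k] >= row[k]) / total
                for k in range(len(subjects))]

    def fisher(ps):
        return -2.0 * sum(math.log(p) for p in ps)

    # The same permutations give the null distribution of the combined statistic.
    psi = [fisher(p_values(row)) for row in table]
    combined_p = sum(1 for v in psi if v >= psi[0]) / total
    return p_values(table[0]), combined_p


rng = random.Random(42)
subjects = []
for _ in range(5):  # five simulated N-of-1 trials with eight periods each
    labels = [1, 0] * 4
    outcomes = [2.0 * l + rng.gauss(0, 1) for l in labels]
    subjects.append((outcomes, labels))

partial_p, combined_p = npc_fisher(subjects)
```

        The combined p-value is read off the permutation distribution of the combining function rather than a chi-square reference, which is what frees NPC from distributional assumptions.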

        Speaker: Anna Eleonora Carrozzo (Salzburg Research, Austria / Paris Lodron University of Salzburg, Austria)
    • 10:45 12:15
      IS6: Big Data in Biomedicine: Innovations Across Imaging and Omics Room 1 A

      Room 1 A

      Convener: Malgorzata Bogdan (Lund University)
      • 10:45
        Functional Data Analysis of Head Impact Exposure of College Football Athletes 30m

        Sport-related concussions (SRCs) represent a major public health concern, accounting for more than 200,000 annual Emergency Department visits in the United States. Biomechanically, SRCs arise from head impacts that generate high-magnitude linear and rotational accelerations. Increasing evidence from human studies indicates that repetitive head impact exposure (HIE) reduces concussion tolerance among contact-sport athletes. Despite this, prior efforts to quantify the relationship between HIE and incident concussion have often relied on overly simplistic statistical approaches.
        In this study, we analyze data collected from helmet-mounted accelerometers that record instantaneous head accelerations and detect head acceleration events (HAEs). The longitudinal dataset includes HAEs from collegiate football players across multiple playing positions and institutions. We model HAE counts using modern count-data methods and apply functional data analysis techniques to characterize temporal patterns of HAEs across a competitive season. Our approach evaluates how these patterns vary by player position and across schools.
        Specifically, we employ a Tucker tensor decomposition of a matrix of functional observations, yielding interpretable and stable estimates of school- and position-specific mean HAE trajectories. To fit the proposed model, we develop an efficient estimation procedure that integrates an expectation–maximization (EM) algorithm with nonparametric function estimation via penalized splines. Simulation studies demonstrate strong performance and stable recovery of underlying patterns under diverse data-generating scenarios.

        Speaker: Jaroslaw Harezlak (Indiana University)
      • 11:15
        Quantile regression in genomics: a new lens for genetic discovery and phenotype prediction 20m

        Genome-wide association studies (GWAS) for biomarkers and molecular phenotypes can lead to clinically relevant discoveries. Numerous lines of evidence from both model organisms and human studies suggest that genetic associations can be highly heterogeneous, dynamic and context dependent. Despite twenty years of GWAS, most studies are based on statistical models that fail to account for such heterogeneity. In this talk I will present alternative approaches based on quantile regression (QR) models that naturally extend linear regression models to the analysis of the entire conditional distribution of a phenotype of interest. I will introduce novel, computationally efficient tools that enable scalable genetic discovery across large genomic datasets.

        Furthermore, I will discuss how QR can be applied to quantify uncertainty in polygenic score (PGS) predictions. QR shifts the focus from predicting the conditional phenotypic mean in classical PGS to predicting the conditional phenotypic quantiles. When combined with conformal prediction, this framework offers a natural way to construct prediction intervals with correct coverage.
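
        One common way to pair quantile predictions with conformal prediction is conformalized quantile regression, sketched below. The abstract does not say which conformal variant is used, so this choice is an assumption, as are the function name and the toy calibration values; the lower and upper quantile predictions are taken as given rather than fitted here.

```python
import math


def cqr_margin(lower, upper, y, alpha=0.1):
    """Calibration margin for conformalized quantile regression.

    lower/upper: predicted lower and upper quantiles on a held-out calibration set,
    y: observed phenotypes for the same individuals.
    """
    # Conformity score: how far each observation falls outside its interval
    # (negative when it lies inside).
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(lower, upper, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for the requested level
    return scores[k - 1]


# Hypothetical calibration data: nine individuals, all with predicted interval [0, 1].
lower = [0.0] * 9
upper = [1.0] * 9
y = [0.5, 0.2, 0.9, 1.1, -0.3, 0.4, 1.4, 0.7, 0.1]

margin = cqr_margin(lower, upper, y, alpha=0.2)
# A new prediction interval [lo, hi] becomes [lo - margin, hi + margin].
new_interval = (0.0 - margin, 1.0 + margin)
```

        Under exchangeability of calibration and new individuals, widening by this margin gives marginal coverage of at least 1 - alpha regardless of how well the quantile model is specified.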

        Speaker: Iuliana Ionita-Laza (Columbia University)
      • 11:35
        Degradation Graphs Reveal Hidden Proteolytic Activity in Peptidomes 20m

        Protein degradation is a regulated process that reshapes the proteome and generates bioactive peptides. Peptidomics and degradomics enable large-scale measurement of these peptides, yet most data analysis approaches treat peptides as isolated endpoints rather than intermediates produced by sequential cleavage. Here, we introduce degradation graphs, a probabilistic framework that represents proteolysis as a directed acyclic network of cleavage events with explicit absorption. From single-snapshot peptidomes, we infer graph weights by gradient descent or linear-flow optimization, quantify flows through branches and bottlenecks, and correct a core bias in conventional quantification. Across three biological datasets, failure to model downstream trimming leads to ≈3-4-fold underestimation of upstream proteolytic activity. Moreover, degradation graphs provide graph-structured features that enable machine learning models to capture protease-specific signatures from both graph topology and sequence context. Taken together, these findings establish explicit degradation modeling as a practical approach to mechanistic and interpretable peptidomics, bridging the fields of degradomics and peptidomics.

        Speaker: Jonas Wallin (department of statistics)
      • 11:55
        Relative quantification of proteins with shared peptides: a weight-based approach 20m

        Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across various biological conditions. Since the currency of these experiments is peptides, i.e., subsets of protein sequences that carry the quantitative information, conclusions at a different level, e.g., at the level of proteins or of post-translational modifications, must be computationally inferred. The inference is particularly challenging in situations where the peptides are shared by multiple proteins.

        From a statistical perspective, inclusion of shared peptides into the estimates of abundances of proteins induces a data structure in which observations (peptide intensities) may belong to multiple groups defined by proteins. Typically, shared peptides are removed from analysis of MS data, which leads to loss of information. Alternatively, proteins that share peptides are grouped together, eliminating the possibility of estimating their distinct quantitative patterns.

        In this talk, we present a statistical approach for estimating protein abundances based on quantitative information that includes shared peptides. This approach extends the existing MSstatsTMT framework for labeled MS data summarization and differential analysis by treating the quantitative patterns of shared peptides as convex combinations of abundances of individual proteins and estimating the abundance of each source in a sample together with the weights of the combination. We demonstrate the utility of this new summarization method using computer simulations and examples based on data from experiments with diverse biological objectives, including protein degradation, thermal proteome profiling, and modeling post-translational modifications.

        Speaker: Mateusz Staniak (University of Wrocław)
    • 10:45 12:15
      Methods for diagnostics studies 1 Room 13 A

      Room 13 A

      Convener: Werner Vach (Basel Academy for Quality and Research in Medicine)
      • 10:45
        Strategies for dealing with outliers in (semi-)parametric estimation of reference intervals and standard deviation scores 18m

        Reference intervals and standard deviation scores (‘z scores’) are widely used as diagnostic tools in various biomedical fields. They are applied to laboratory parameters in clinical chemistry, psychometric tests in neurology, or parameters of children’s growth in pediatrics. Usually, samples from a ‘normal’ or ‘healthy’ population form the data basis for the estimation of reference distributions.
        (Semi-)parametric estimation of reference distributions is complicated by extreme values that may be outliers relative to the working model, even without dependence on covariables like age. If sample size is moderate, genuine outliers may by chance be over-represented in the sample used for estimating the reference distribution and impair the selection of a suitable model. Second, if the target population is to include unhealthy individuals with their representative share, the reference distribution to be estimated is a mixture of a major ‘healthy’ part plus a ‘contamination’. Finally, the sample may be contaminated by observations that are not members of the target population but remain undetected.
        Often in practice, the origin of the extreme values is unknown, but the extreme tails of the distribution are of interest for diagnostic purposes. In this contribution, Generalized Additive Models for Location, Scale and Shape (GAMLSS) are used. The simple approaches of including, deleting or winsorizing outliers are compared with estimation using a robustified likelihood (Aeberhard et al., Statist. Comput., 2021) and with two correction methods, one of them previously unpublished, the other proposed by the authors of the WHO child growth standards. Both correction methods remove outliers before estimating the reference distribution and then correct estimated standard deviation scores for the removal of outliers by appropriate rescaling.
        A simulation study with contaminated and heavy-tailed distributions shows that the robust method reduces bias in contaminated scenarios and scenarios with genuine outliers, but is inferior to other methods in case of model misspecification. The new correction method represents a good compromise if misspecification cannot be excluded. A second set of simulations evaluates strategies to arrive at an adequate model for the reference distribution, either starting from a simple model and increasing model complexity depending on residual diagnostics, or model selection based on information criteria.
        Data on body mass index and body proportions from a large Austrian pediatric study are used for demonstration.

        Speaker: Andreas Gleiss (Medical University of Vienna, Center for Medical Data Science)
      • 11:03
        A Novel Approach to Diagnostic Evaluation: Prevalence-Corrected Precision-Recall Curves 18m

        Classification plays a pivotal role in medicine for both diagnostic and prognostic purposes. Traditionally, diagnostic efficacy is evaluated using prevalence-independent metrics, such as sensitivity and specificity. For numerical tests, the Area Under the Receiver Operating Characteristic (ROC) curve is the standard for assessing classification success. However, the rising adoption of machine learning in medicine has popularized metrics like precision (positive predictive value), recall (sensitivity), and the F-measure. While Precision-Recall (PR) curves are increasingly used to evaluate binary classification, it is well established that precision is heavily influenced by disease prevalence. Consequently, using standard PR curves without accounting for prevalence can introduce bias into performance assessments. In this study, we propose prevalence-corrected PR curves as a robust alternative to eliminate this bias. Through simulation scenarios designed to reflect real-world medical contexts, we demonstrate that the proposed method provides a more unbiased evaluation of classification performance compared to standard PR curves.
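
        The dependence of precision on prevalence follows from Bayes' theorem and can be sketched as below. This shows only the standard relation between sensitivity, specificity, prevalence and precision; the abstract does not state the authors' correction, so the function name and the example values are assumptions for illustration.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity and specificity at a given prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical test with sensitivity 0.9 and specificity 0.9.
balanced = precision_at_prevalence(0.9, 0.9, 0.50)  # case-control style sample
rare = precision_at_prevalence(0.9, 0.9, 0.05)      # low-prevalence target population
```

        The same operating point has a much lower precision in the low-prevalence population, so a PR curve estimated in a balanced sample overstates performance unless precision is recomputed at the target prevalence.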

        Speaker: Ilker Unal (Cukurova University Faculty of Medicine Dept of Biostatistics)
      • 11:21
        Covariate adjustment, factorial designs and clustered data in diagnostic accuracy studies 18m

        The accuracy of diagnostic tests is commonly evaluated by estimating the area under the receiver operating characteristic curve (AUC), as well as sensitivity and specificity at given diagnostic cut-offs. However, many diagnostic trials use factorial designs. For example, different combinations of readers and methods may be used to diagnose a patient. Furthermore, diagnostic studies may generate clustered data by repeated measurements over time or several lesions, for example different brain regions. Dependencies between a person's observations must be taken into account in the analysis in order to prevent variance deflation. Lange [1] developed a nonparametric mathematical framework to deal with both of these design aspects, and Lange and Brunner generalized the approach from the AUC to sensitivity and specificity [3].
        Additionally, it may be of interest to adjust the estimation procedure of the above mentioned accuracy measures for covariates. For example, it may be the case that age, weight or height influence the diagnostic accuracy of a test. Zapf [2] proposed a nonparametric methodological approach to adjust the AUC for such covariates, while also allowing for factorial designs, but not yet for clustered data. In this talk we present a new, unified method that enables covariate adjustment of the AUC, sensitivity and specificity in studies with factorial designs and clustered data. We will show the properties of the approach using simulated data and illustrate the approach with an example study.

        1) Lange, K. (2011, March 4). Nichtparametrische Analyse diagnostischer Gütemaße bei Clusterdaten. Retrieved February 27, 2023, DOI: 10.53846/goediss-3538
        2) Zapf, A. (2009, October 23). Multivariates nichtparametrisches Behrens-Fisher-Problem mit Kovariablen. Retrieved February 27, 2023, DOI: 10.53846/goediss-2488
        3) Lange, K., & Brunner, E. (2012). Sensitivity, specificity and ROC-curves in multiple reader diagnostic trials—a unified, nonparametric approach. Statistical Methodology, 9(4), 490–500. DOI: 10.1016/j.stamet.2011.12.002

        Speaker: Philipp Weber (Institute of Medical Biometry and Epidemiology)
      • 11:39
        Enhancing Efficiency in Cancer Drug Testing using Nonparametric Approaches 18m

        Background
        Triple-negative breast cancer (TNBC) represents one of the most aggressive and treatment-resistant breast cancer subtypes. Patients with locally advanced unresectable or metastatic TNBC (mTNBC) typically face a median overall survival of only 8 to 13 months, highlighting the urgent need for efficient drug evaluation strategies. Conventional statistical methods often assume normality or require complex variance estimation, which limits their applicability to heterogeneous and non-normal patient data. To accelerate drug development and improve therapeutic decision-making, robust nonparametric methods are essential.

        Methods
        We propose a novel nonparametric test based on ranked-set empirical distribution functions and the concept of power divergence between two empirical distributions. This distribution-free approach eliminates the reliance on normality assumptions and avoids estimation of dispersion matrices, which are prone to instability in complex or small samples. Incorporating the permutation principle further enhances reliability. Monte Carlo simulations were conducted to assess the empirical power of the proposed test under various distributional settings, including heavy-tailed, light-tailed, and elliptically asymmetric populations.

        Results
        Simulation results demonstrate that the proposed test achieves superior statistical power compared with conventional alternatives. It remains robust across heavy-tailed and light-tailed distributions and retains performance under elliptically asymmetric population structures. Unlike Hotelling’s T² and Chatterjee and Sen’s bivariate Wilcoxon test, our method does not require matrix inversion, thereby avoiding computational and implementation challenges. Furthermore, it extends beyond the univariate limitations of the Kolmogorov–Smirnov test by offering a true two-sample multivariate framework. Application to real-world TNBC trial data illustrates the method’s practical utility, enabling reliable efficacy comparisons while reducing susceptibility to distributional misspecification.

        Conclusion
        The proposed nonparametric framework enhances efficiency in cancer drug testing by providing a powerful, assumption-free alternative to conventional hypothesis testing. Its robustness to non-normal, heavy-tailed, and asymmetric data structures makes it particularly well-suited for oncology trials, where heterogeneous populations are common. By improving power, reducing reliance on restrictive assumptions, and enabling broader applicability, this approach has the potential to accelerate drug evaluation, optimize resource use, and improve therapeutic decision-making. Beyond TNBC, its versatility extends to a wide range of biomedical research contexts, supporting more efficient and reliable assessment of treatment efficacy.

        Speaker: Sunil Mathur (Weill Cornell Medical College)
      • 11:57
        Comparison of different methods for the meta-analysis of diagnostic test accuracy studies – a simulation study 18m

        Meta-analysis of diagnostic test accuracy (DTA) studies deals with aggregating information from multiple studies on sensitivity and specificity. Classical approaches to this task select a single pair of sensitivity and specificity per study (single threshold methods, STM), possibly ignoring additional information if studies report results on multiple diagnostic thresholds. In recent years, models have been proposed that consider all available information and enable inference on the optimal diagnostic threshold (multiple threshold methods, MTM). We compare five STM and six MTM to each other in a simulation study to evaluate their performance in various situations. For each generated meta-analysis dataset, we estimate a set of summary sensitivity and specificity (either identified directly or using the maximum unweighted Youden index), and the area under the summary ROC curve (AUC). To cover a broad range of real-life data settings, we vary eight underlying parameter dimensions in the data-generating mechanisms, including continuous or ordinal outcome type, different numbers of diagnostic thresholds per study, and different disease prevalences. Overall, the model performance of STM and MTM is comparable regarding bias in optimal sensitivity, specificity, and AUC, as well as empirical coverage and convergence. However, we observe a binomial generalized linear mixed model with bivariate random effect of the MTM type, which models the sensitivity and specificity with a logit link and an additional covariate for the diagnostic threshold, to be slightly superior to the other models in many situations. Model performance depends most strongly on the outcome type in the data generation, while the number of thresholds has only a minor impact. We thus find that the main advantage of MTM lies in obtaining threshold-dependent estimates of sensitivity and specificity.
        Additionally, we illustrate differences between model estimates in two real-data examples on the diagnosis of type 2 diabetes using the continuous biomarker HbA1c and on the diagnosis of any anxiety disorder using the ordinal questionnaire HADS-A. The applications reveal substantial variations in model estimates within and between STM and MTM, which can be reduced by adjusting for the estimated bias in the simulation settings resembling the real-data situation most closely. Our study highlights the importance of careful model selection when conducting meta-analysis of DTA studies, which should be informed by the observed data structure of the application (e.g., if the diagnostic test is measured on a continuous or ordinal scale).
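
        Selecting an optimal threshold with the unweighted Youden index can be sketched as follows. The threshold-specific summary estimates are made-up numbers, not results from the study, and the function name is hypothetical.

```python
def youden_optimal(estimates):
    """estimates: list of (threshold, sensitivity, specificity) tuples.

    Returns the tuple maximizing the unweighted Youden index J = sens + spec - 1.
    """
    return max(estimates, key=lambda e: e[1] + e[2] - 1.0)


# Hypothetical threshold-dependent summary estimates from a multiple threshold model.
summary = [
    (5.5, 0.95, 0.60),
    (6.0, 0.85, 0.80),
    (6.5, 0.70, 0.92),
    (7.0, 0.50, 0.98),
]

best_threshold, best_sens, best_spec = youden_optimal(summary)
best_j = best_sens + best_spec - 1.0
```

        A single threshold method supplies only one such tuple per study, which is why threshold-dependent estimates are needed before this kind of selection is possible.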

        Speaker: Ferdinand Valentin Stoye (Biostatistics and Medical Biometry, Medical School OWL, Bielefeld University)
    • 10:45 12:15
      TC4: Biometrics in the era of AI Room 1 B

      Room 1 B

      Convener: Anne-Laure Boulesteix (LMU Munich)
      • 10:45
        On the Role of Biometry in AI Projects 18m

        Artificial intelligence (AI) is intended to support clinicians, therapists, patients, hospital managers, and clinical data scientists at all levels. This includes, for example, making clinical diagnoses, understanding the causes of diseases, and planning clinical studies. The enormous increase in the importance of AI in medicine has led to the development of several guidelines (e.g., CONSORT-AI, TRIPOD-AI, and SPIRIT-AI), which, in addition to general aspects of dealing with AI, also cover computer-aided and intelligent decision support systems. In our work, we aim to explain how biostatistical methodology can contribute (and is already contributing) to advancing these developments in a science-driven manner, thereby enhancing the trustworthiness of AI in medicine. In particular, we address current biostatistical topics that have great potential for AI research and application. These include, for example, (re)sampling and study designs in the generation and interpretation of AI-based analysis results, classical and modern sample size planning for AI-supported scientific studies, aspects of the validation and reproducibility of AI methods and their outputs, suitable data infrastructures, the quantification of uncertainty in AI-based analysis results, and a discussion of estimands in the context of AI-based knowledge gain.

        Speaker: Björn-Hergen Laabs (University Medical Center Göttingen)
      • 11:03
        ChatGPT as a Tool for Biostatisticians 18m

        Modern large language models (LLMs) have reshaped workflows of people across countless fields - and biostatistics is no exception. These models offer novel support in drafting study plans, generating software code, or writing reports. However, reliance on LLMs carries the risk of inaccuracies due to potential hallucinations that may produce fabricated "facts", leading to erroneous statistical statements and conclusions. Such errors could compromise the high precision and transparency fundamental to our field.

        In this talk we assess the utilization of ChatGPT for various contemporary biostatistical tasks. We explore both the risks and opportunities presented by this new era of artificial intelligence. We emphasize that advanced applications should only be used in combination with sufficient background knowledge. Regular verifications of LLM outputs may lead to an appropriately calibrated trust in these tools among users.

        Speaker: Dennis Dobler (RWTH Aachen University)
      • 11:21
        Teaching Biostatistics in Times of AI 18m

        Artificial intelligence (AI) is increasingly being used in various disciplines. Examples include medical image processing and complex prediction and decision support models, areas in which AI increasingly overlaps with the field of biostatistics. In this context, integrating AI and machine learning (ML) methods into biostatistics courses taught to students of medicine, health and life sciences offers a unique opportunity to highlight the relevance of statistics in modern AI applications.

        This talk discusses integrating AI topics into biostatistics education. We argue that statistics is fundamental to AI development, ensuring methodological rigor, robustness, fairness, and interpretability while quantifying uncertainty. Without statistics, AI risks being reduced to "black box" methods. Beyond that, key topics concerning the role of biostatistics in AI applications include the planning of studies and study designs, evaluating data quality, managing missing data and outliers, and critical interpretation of the outcome distinguishing causality from correlation.

        We emphasize the need for targeted educational initiatives that blend biostatistical and AI expertise enhancing AI-related statistical literacy in curricula. This is essential for training future academics to recognize the role of biostatistics in AI methods and to support the responsible and safe application of AI.

        Speaker: Ursula Berger (LMU Munich)
      • 11:39
        Panel discussion 36m
        Speakers: Frank Bretz, Presenters, Sarah Friedrich-Welz
    • 12:15 13:45
      Lunch break 1h 30m
    • 13:45 15:15
      Clinical trials 4 Room 13 B

      Room 13 B

      Convener: Annette Kopp-Schneider (DKFZ)
      • 13:45
        Difference-in-difference estimators in randomized trials with external controls 18m

        Randomized trials often utilize a select group of study participants. This group does not typically represent the general population. Furthermore, sample sizes are often small to reduce cost. To improve power and generalizability, external control groups may be added to the randomized study. It is possible to incorporate a suitably selected external control group into a randomized clinical trial to improve efficiency.

        A clinical trial was conducted to study a novel drug treatment for Spinal Muscular Atrophy. The study randomized trial participants at baseline to treatment or control and followed them for one year. The study used an external control group, obtained from a study of another novel drug for the same disease, conducted at a different center. During year one of the follow-up, the external controls served as additional controls to increase power. At the end of year one, the drug was found to be effective. Subsequently, all trial participants were switched to treatment. This group was then followed for another year, and at the end of the trial the effectiveness of the drug was assessed by comparison with the external controls. We discuss how the available data can be used to assess potential biases between external controls and study participants. We develop a flexible model to incorporate dependencies between observations on the same subject and demonstrate how difference-in-difference comparisons can be used to assess the assumption of ignorability in assignment to the trial.
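
        As a rough illustration of the difference-in-difference idea mentioned above, the sketch below computes the textbook two-group, two-period contrast on made-up outcome scores. It is not the flexible model developed in the talk; the function name, the grouping into trial controls and external controls, and all numbers are hypothetical. Under ignorable assignment to the trial, the change from baseline among trial controls and external controls would be expected to be similar, so the contrast should be close to zero.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def diff_in_diff(a_pre, a_post, b_pre, b_post):
    """Change over time in group A minus change over time in group B."""
    change_a = mean(a_post) - mean(a_pre)
    change_b = mean(b_post) - mean(b_pre)
    return change_a - change_b


# Hypothetical scores (assumption, not data from the study):
trial_controls_pre = [10.0, 12.0]
trial_controls_post = [13.0, 15.0]
external_controls_pre = [9.0, 11.0]
external_controls_post = [10.0, 12.0]

contrast = diff_in_diff(trial_controls_pre, trial_controls_post,
                        external_controls_pre, external_controls_post)
print(contrast)  # a value far from zero would hint at non-comparable groups
```

        A contrast of this kind removes time-constant differences in level between the two groups and leaves only differences in their trajectories.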

        Speaker: Christiana Drake (University of California, Davis)
      • 14:03
        Evidence Generation Using External Controls: Opportunities and Challenges – A Regulator’s Perspective 18m

        Randomised controlled trials (RCTs) are the gold standard of evidence to support causal conclusions on the benefits and risks of medicines in regulatory decision making along the lifecycle [1]. However, single-arm trials (SATs) are also frequently used for various reasons during drug development. While RCTs allow adjustment for confounding via design, the contextualization of SATs requires more thoughtful consideration. This creates a need for guidance. In response, the European Medicines Agency (EMA) has begun developing a concept paper on the use of external controls.

        This presentation will provide a regulator’s perspective on the opportunities and challenges associated with using external controls in evidence generation throughout the lifecycle of a drug.

        [1] European Medicines Agency (2025). Draft Concept Paper on the Development of a Reflection Paper on the Use of External Controls for Evidence Generation in Regulatory Decision-Making, EMA/CHMP/225255/2025.

        Speaker: Armin Schüler (BfArM - Federal Institute for Drugs and Medical Devices)
      • 14:21
        Bayesian Methods Integrating Causal Inference Approaches for Borrowing Historical Control Data in RCTs: A Neutral Comparison Study 18m

        Bayesian dynamic borrowing (BDB) methods are popular for incorporating historical data in rare disease or paediatric clinical trials, in particular with regard to control groups. They can be used to leverage the historical information while mitigating the consequences of potential prior-data conflicts to some degree. However, these methods do not consider baseline covariate information that might be prognostic for the outcome and could therefore be relevant to explain discrepancies between the outcomes of the current and historical control groups. To address this, novel methods have been proposed that integrate techniques from the causal inference literature into the BDB framework. They claim to make the borrowing more robust and efficient by, at least partially, relating the discrepancy (agreement) in the outcome distributions to differences (similarities) in baseline characteristics and making corresponding adjustments.

        A number of such methods are now available with propensity score integrated priors [1] forming the largest group. While such methodological developments are desirable, they also pose new challenges, particularly in choosing the appropriate method to apply to a specific clinical trial. The performance of these methods is usually assessed via simulation studies, which can lead to over-optimistic conclusions, making this choice non-trivial. Neutral comparison studies that investigate existing methods can address this issue [2].

        In this work, we apply the idea of neutral comparison studies to Bayesian methods integrating causal inference approaches for borrowing historical control data in clinical trials for continuous outcomes and compare three recently proposed methods, namely: the propensity score integrated commensurate prior [1], propensity score weighted multi-source exchangeability models [3], and Bayesian additive regression trees [4]. We assess their performance in a large simulation study covering, among others, different historical data sample sizes, varying degrees of observed and unobserved confounding, as well as different effect sizes.

        [1] X. Wang, L. Suttner, T. Jemielita, and X. Li, Propensity score-integrated Bayesian prior approaches for augmented control designs: a simulation study, J. Biopharm. Stat. 32, 170 (2022).
        [2] A.-L. Boulesteix, S. Lauer, and M. J. A. Eugster, A Plea for Neutral Comparison Studies in Computational Sciences, PLOS ONE 8, e61562 (2013).
        [3] W. Wei, Y. Zhang, S. Roychoudhury, and the A. D. N. Initiative, Propensity score weighted multi-source exchangeability models for incorporating external control data in randomized clinical trials, Stat. Med. 43, 3815 (2024).
        [4] T. Zhou and Y. Ji, Incorporating external data into the analysis of clinical trials via Bayesian additive regression trees, Stat. Med. 40, 6421 (2021).

        Speaker: David Jesse (F. Hoffmann-La Roche AG, Basel, Switzerland; Department of Medical Statistics, University Medical Center Göttingen, Göttingen, Germany)
      • 14:39
        Identification of subtrial-specific optimal biological dose (OBD) with robust borrowing of information 18m

        Drug development in the era of precision medicine increasingly uses basket trials and other multi-subgroup designs, where targeted therapies are evaluated across biomarker-defined patient subtrials. For many targeted agents and immunotherapies, the objective in early development is no longer the maximum tolerated dose (MTD), but the optimal biological dose (OBD) that achieves the best benefit–risk trade-off between efficacy and toxicity. Identifying an OBD separately within each subtrial is challenging because sample sizes in early-phase trials are small, and toxicity and efficacy need to be considered jointly. At the same time, complete pooling (a one-size-fits-all strategy) across subtrials can yield biased recommendations since patient heterogeneity is ignored. To alleviate these concerns, we are developing a Bayesian framework for subtrial-specific OBD selection that enables robust borrowing of information on two-dimensional parameters representing toxicity and efficacy. Building on the bivariate exchangeable–non-exchangeable (E-BiEXNEX) modelling framework (arXiv:2505.10317), we specify priors on subtrial-specific dose–toxicity and dose–efficacy parameters so that information sharing is data-adaptive and resistant to negative transfer under extreme observations. Choices of prior distributions are calibrated across multiple candidates to promote robust performance over a wide range of plausible scenarios. Our primary setting assumes binary toxicity and continuous efficacy, but the approach can be extended to other endpoint types with minor modifications to the regression components. The dose recommendation is driven by a utility-based decision rule that combines posterior toxicity probabilities and efficacy means under pre-specified safety and futility constraints, yielding subtrial-specific OBD recommendations with quantified uncertainty. 
        Operating characteristics will be evaluated under scenarios with varying degrees of between-subtrial similarity, dose–response shapes, and locations of OBD.

        Speaker: Zhi Cao (MRC Biostatistics Unit, University of Cambridge)
    • 13:45 15:15
      High dimensional data 3 Room 14

      Room 14

      Convener: Sarah Friedrich-Welz (University of Augsburg)
      • 13:45
        Integrating functional motif discovery and statistical learning approaches for advanced blood glucose prediction in real-world conditions 18m

        Functional data analysis has established itself as a powerful framework for analyzing data recorded over continuous domains such as time. Within this context, functional motif discovery refers to the identification of recurrent patterns that appear multiple times across different portions of a single curve and/or within misaligned portions of multiple curves. In this study, we explore the integration of functional motif discovery into statistical learning pipelines to enhance the predictive performance of data-driven models. By identifying recurring and informative temporal patterns within functional data, motif discovery enables the extraction of meaningful features that can improve model accuracy and interpretability. We propose a novel framework that combines functional motif extraction with machine learning algorithms to strengthen forecasting capabilities in predictive tasks. Specifically, we employ two advanced statistical techniques, probKMA (Cremona and Chiaromonte, 2023) and funBIalign (Di Iorio et al., 2025), to uncover recurring motifs in functional data, which are subsequently incorporated as input features in prediction models. The approach is evaluated using continuous glucose monitoring data from individuals with type 1 diabetes in real-world physical activity settings, as collected in the Type 1 Diabetes Exercise Initiative (T1DEXI) study (Jaeb Center for Health Research, 2020). The high variability and complexity of these real-world data can pose substantial challenges for prediction (Neumann et al., 2025), but they also reveal significant potential for improvement through functional motif discovery. Overall, this research investigates how the proposed framework can uncover latent structure in glucose dynamics and support more accurate predictive modeling. 
        More broadly, it highlights the value of integrating functional data analysis, and particularly functional motif discovery, into machine learning workflows to enhance interpretability, robustness, and performance across domains involving complex temporal data.

        References:

        Cremona, M. A. and Chiaromonte, F. (2023). Probabilistic k-means with local alignment for clustering and motif discovery in functional data. Journal of Computational and Graphical Statistics, 32(3):1119-1130.
        Di Iorio, J., Cremona, M. A., and Chiaromonte, F. (2025). funBIalign: a hierarchical algorithm for functional motif discovery based on mean squared residue scores. Statistics and Computing, 35(1):11.
        Jaeb Center for Health Research (2020). Type 1 diabetes exercise initiative: The effect of exercise on glycemic control in type 1 diabetes study.
        Neumann, A., Zghal, Y., Cremona, M. A., Hajji, A., Morin, M., and Rekik, M. (2025). A data-driven personalized approach to predict blood glucose levels in type-1 diabetes patients exercising in free-living conditions. Computers in Biology and Medicine, 190:110015.

        Speaker: Sara Garber (Department of Statistics and Data Science, University of Augsburg)
      • 14:03
        Statistical end-to-end analysis of large-scale microbial growth data with DGrowthR 18m

        Quantitative analysis of microbial growth curves is essential for understanding how bacterial populations respond to environmental cues. Traditional analysis approaches make parametric assumptions about the functional form of these curves, limiting their usefulness for studying conditions that distort standard growth curves. In addition, modern robotics platforms enable the high-throughput collection of large volumes of growth data, thus requiring strategies that can analyze large-scale growth data in a flexible and efficient manner.

        Here, we introduce DGrowthR, a statistical R and standalone app framework for the integrative analysis of large growth experiments. DGrowthR comprises methods for data pre-processing and standardization, exploratory functional data analysis, and non-parametric modeling of growth curves using Gaussian Process regression. Importantly, DGrowthR includes a rigorous statistical testing framework for differential growth analysis. To illustrate the range of application scenarios of DGrowthR, we analyzed three large-scale bacterial growth datasets that tackle distinct scientific questions.

        On an in-house large-scale growth dataset comprising two pathogens that were subjected to a large chemical perturbation screen, DGrowthR enabled the discovery of compounds with significant growth inhibitory effects as well as compounds that induce non-canonical growth dynamics. We also re-analyzed two publicly available datasets and recovered reported adjuvants and antagonists of antibiotic activity, as well as bacterial genetic factors that determine susceptibility to specific antibiotic treatments. We anticipate that DGrowthR will streamline the analysis of modern high-volume growth experiments, enabling researchers to gain novel biological insights in a standardized and reproducible manner.

        Speaker: Medina Feldl (Institute of Computational Biology, Helmholtz Munich)
      • 14:21
        Population Matching to Enhance Back-Translation Between Clinical Trials and Biobank Data for Drug Target Discovery 18m

        Population-scale genomic biobanks provide unique opportunities for data-driven drug target discovery. However, these resources often lack detailed data on clinical phenotypes, whereas clinical trials offer rich phenotypic information but are limited in omics coverage and mostly lack genotyping. This imbalance creates gaps in the mechanistic interpretation of clinical findings.
        To address this, we explore a recently proposed back-translation framework that links clinical trial data with genomic biobank data, leveraging the complementary strengths of both sources. Since biobank data can be considered representative of the general population, important disease or trial-specific genetic signals may end up being diluted. To mitigate this, we apply population matching strategies to obtain a biobank subpopulation comparable to the clinical trial cohort, based on demographic and disease-related baseline markers.
        This framework applies propensity score-based methods, commonly used for external data integration in clinical research, to biobank settings, with a focus on disease-relevant genetic information. We investigate propensity score matching with different matching specifications (e.g. caliper width, a predefined maximum acceptable difference between the matched units) to account for two competing goals: maximizing similarity between matched populations and maintaining sufficient power for carrying out genome-wide association studies. We perform a simulation study to evaluate how different matching designs affect the efficiency and power in detecting true genetic associations between patients’ genotypes and quantitative phenotypes or disease onset labels.
        The matched biobank–trial integration enables the identification of disease-relevant genetic signals that would otherwise remain hidden in heterogeneous populations. Such information can support downstream efforts in target validation, patient stratification, mechanistic studies, and precision medicine development.
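
        To make the role of the caliper concrete, the following sketch performs greedy 1:1 nearest-neighbour matching on propensity scores and discards pairs whose score difference exceeds the caliper. It is an illustration under simplifying assumptions, not the matching procedure used in the talk: the function name, the greedy strategy, and the example scores are all hypothetical, and the propensity scores are taken as given rather than estimated.

```python
def caliper_match(trial_scores, biobank_scores, caliper):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns a dict mapping each matched trial index to a biobank index.
    Trial units with no biobank unit within the caliper stay unmatched.
    """
    available = set(range(len(biobank_scores)))
    matches = {}
    for i, score in enumerate(trial_scores):
        best_index, best_distance = None, None
        for j in sorted(available):  # sorted for deterministic tie-breaking
            distance = abs(score - biobank_scores[j])
            if distance <= caliper and (best_distance is None or distance < best_distance):
                best_index, best_distance = j, distance
        if best_index is not None:
            matches[i] = best_index
            available.remove(best_index)  # each biobank unit is used at most once
    return matches


# Hypothetical propensity scores (assumption):
trial = [0.30, 0.70]
biobank = [0.10, 0.32, 0.69, 0.95]

print(caliper_match(trial, biobank, caliper=0.05))   # wide caliper: more matches
print(caliper_match(trial, biobank, caliper=0.005))  # narrow caliper: fewer matches
```

        Narrowing the caliper makes matched pairs more similar but shrinks the matched sample, which is the trade-off between similarity and power described above.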

        Speaker: Han Chiam (Medical University of Vienna)
      • 14:39
        Entropy Adjusted Graphical Lasso for Sparse Precision Matrix Estimation 18m

        The estimation of a precision matrix is a crucial problem in various research fields, particularly when working with high-dimensional data. In such settings, the most common approach is penalized maximum likelihood. The literature typically employs Lasso, Ridge and Elastic-Net norms, which effectively shrink the entries of the estimated precision matrix. Although these shrinkage approaches provide well-conditioned precision matrix estimates, they do not explicitly address the uncertainty associated with these estimates. In fact, as the matrix becomes sparser, the precision matrix imposes fewer restrictions, leading to greater variability in the distribution and thus to higher entropy. In this paper, we introduce an entropy-adjusted extension of the widely used Graphical Lasso that adds a log-determinant penalty term. The objective of the proposed technique is to impose sparsity on the precision matrix estimate and to adjust the uncertainty through the log-determinant term. The proposed method is evaluated through comprehensive numerical analyses on both simulated and real-world datasets. The results demonstrate its benefits over existing approaches in the literature with respect to several evaluation metrics.
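
        For orientation, the two standard ingredients referred to above can be written down as follows. The first expression is the usual Graphical Lasso estimator and the second is the differential entropy of a p-variate Gaussian with precision matrix Θ. The symbols S (sample covariance matrix) and λ (tuning parameter) are notation chosen here, and the exact form of the additional penalty proposed in the talk is not given in the abstract, so it is not reproduced.

```latex
\hat{\Theta}_{\mathrm{GL}}
  = \arg\max_{\Theta \succ 0}\;
    \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \lVert \Theta \rVert_{1},
\qquad
H\bigl(\mathcal{N}_p(\mu, \Theta^{-1})\bigr)
  = \tfrac{p}{2}\,(1 + \log 2\pi) - \tfrac{1}{2}\log\det\Theta .
```

        The entropy depends on Θ only through log det Θ, which is why a log-determinant term is a natural handle for adjusting it.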

        Speaker: Vahe Avagyan (Wageningen University & Research)
    • 13:45 15:15
      IS7: Methods to improve the practical usefulness of biomedical research Room 1 A

      Room 1 A

      Convener: Leonhard Held (University of Zurich)
      • 13:45
        Systematic Review and Meta-analysis of Animal Studies as Tools for Strengthening Research Integrity 30m

        Growing concerns about the reproducibility, generalisability and (more recently) credibility of biomedical research publications underscore the need for methods that both synthesise evidence and diagnose weaknesses in the research ecosystem. Systematic review and meta-analysis of animal studies is traditionally used to evaluate preclinical efficacy and inform future research in animals or humans. This presentation will explore how this methodology can evolve into an essential tool for detecting integrity-related issues within and across research fields. Drawing on over a decade of experience in conducting and refining preclinical systematic reviews, I will illustrate how structured evidence synthesis reveals challenges in preclinical literature, including incomplete reporting, inadequate implementation of measures to reduce bias, and image duplication and manipulation issues. As case studies, I will present our recently published investigation into image-related issues in 608 animal studies of early brain injury after subarachnoid hemorrhage, of which 243 (40.0%) were identified as problematic, as well as other relevant examples.

        Speaker: Kimberley Wever (Radboud university medical center)
      • 14:15
        Quality assessment of proposals for animal studies and corresponding publications 20m

        Preclinical studies tend to suffer from an unacceptably low rate of replicability, which is highly problematic since unreliable results from animal trials cannot provide a sound foundation for subsequent clinical research. Numerous factors contribute to this issue, including the misapplication of statistical methods, poor study design, and inadequate reporting of results. Although specific guidelines for conducting and reporting animal studies exist, entrenched practices within the preclinical research community have proven difficult to change. To our knowledge, this is the first systematic study to assess the quality of research proposals for preclinical studies. We analyzed more than 300 applications for animal trials submitted to the Medical University of Vienna between 2014 and 2019, systematically extracting key indicators based on the ARRIVE guidelines into a database. In addition, we identified over 100 publications resulting from these proposals and again extracted corresponding indicators. Based on these data, we developed quality scores for both proposals and publications and will report results separately for each. Furthermore, we will discuss strategies to improve the current situation. In future work, we will also conduct a joint analysis to compare the quality of proposals with that of the resulting publications.

        Speaker: Florian Frommlet (Medical University Vienna)
      • 14:35
        From the Classroom to the Clinic: Teaching Reproducibility in Statistics Education 20m

        The reproducibility of research results is a cornerstone of trustworthy science. However, failures to reproduce published findings remain widespread across many disciplines. The way in which data analysis and statistics are taught to students often translates into how they later perform research in labs and clinics. Therefore, improving the reproducibility of biomedical research requires not only better methods, but also a change in the way statistics is taught and practiced. This talk will explore strategies for incorporating reproducible research practices into the statistical education of (bio)statisticians and biomedical researchers. The "Good Statistical Practice" course from the Master Program in Biostatistics at the University of Zurich will serve as a case study in how a course can be designed to equip the next generation of statisticians with competencies in research integrity and reproducible research practices. The talk will also discuss emerging challenges, such as students generating code with artificial intelligence tools.

        Speaker: Samuel Pawel (University of Zurich)
      • 14:55
        Discussant 20m
        Speaker: Ulrich Dirnagl
    • 13:45 15:15
      Missing data 1 Room 13 A

      Room 13 A

      Convener: Łukasz Smaga (Adam Mickiewicz University)
      • 13:45
        Prediction in the presence of missing values. Are there credible alternatives to imputation-based use of the predictive density? 18m

        Prediction in the presence of missing values is a complex and still poorly understood problem, particularly when future records also contain missing values.
        Mertens et al. (2020) demonstrate that with non-linear models (such as logistic regression or Cox survival models) and when using imputations, averaging multiple predictions obtained from distinct models fitted on imputed data should be preferred to the use of pooled models. However, imputation is often regarded as computationally cumbersome. It also tends to be poorly understood by applied researchers utilizing statistical methods. For such reasons, the method is often avoided. This raises the question of whether other approaches could reasonably be used to handle missing values in prediction problems.

        In this talk we contrast predictive averaging with some potential alternatives, such as complete-case-based model calibration (CC) as well as use of missing-indicator (IDX) and Pattern Submodel (PS) approaches. Connections between these methods are discussed. We focus on the problem of risk prediction. Simulations are used to ensure knowledge of the true risk in a comparison of prediction performance between methods. We demonstrate that only predictive averaging guarantees required coverage levels in prediction. Pattern submodeling as well as indicator methods provide poorly calibrated predictions, with no obvious methods to correct for these deficiencies.
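
        The distinction between averaging predictions and predicting from a pooled model can be sketched for logistic regression as follows. This is only an illustration under stated assumptions: the coefficient vectors stand in for models fitted on different imputed datasets, the pooled model is taken to be the simple average of those coefficients, and all names and numbers are hypothetical. Because the inverse logit is non-linear, the two approaches generally give different predicted risks.

```python
import math


def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(coefs, x):
    return sum(b * v for b, v in zip(coefs, x))


def averaged_prediction(coef_sets, x):
    """Predict with each imputation-specific model, then average the risks."""
    risks = [expit(linear_predictor(coefs, x)) for coefs in coef_sets]
    return sum(risks) / len(risks)


def pooled_prediction(coef_sets, x):
    """Average the coefficients first, then predict once."""
    m = len(coef_sets)
    pooled = [sum(column) / m for column in zip(*coef_sets)]
    return expit(linear_predictor(pooled, x))


# Hypothetical coefficients (intercept, slope) from two imputed datasets:
coef_sets = [[0.0, 1.0], [0.0, 3.0]]
x_new = [1.0, 1.0]  # leading 1.0 is the intercept term

print(averaged_prediction(coef_sets, x_new))
print(pooled_prediction(coef_sets, x_new))
```

        With a linear model the two quantities would coincide, which is why the distinction matters specifically for non-linear models.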

        Speaker: Bart Mertens (Leiden University Medical Centre)
      • 14:03
        Impact of imputation on individual cough alert system with incomplete baseline monitoring 18m

        A prospective study of patients with muco-obstructive lung disease aims to develop a cough alert system based on nocturnal cough monitoring that identifies patient-individual thresholds at which cough frequency exceeds normal variability. So far, 92 of the intended 220 patients have been included.
        From a study by den Brinker et al. with 30 COPD patients, it is known that the day-to-day variation in the number of coughs is high and that high counts occur on isolated days (1). Both observations motivate temporal smoothing, which reduces the noise-like variations, using a first-order infinite impulse response (IIR) filter with a pole at 0.75 applied to the cough count on the B-scale. The linear relation between the average and the standard deviation of the cough count was assumed to hold in general and motivated mapping the cough count onto a new scale B, constructed such that a step of 1 unit on the B-scale approximately corresponds to a step of one standard deviation on the original scale. The B-scale is then the natural basis for the intended alert system. The authors noted that the independence of cough counts between consecutive 24 h periods implies that time-averaging could be an effective means of uncertainty (noise) reduction, and that arithmetic averaging is a more appropriate operation on the B-scale than in the original domain where, in view of its logarithmic nature, geometric averaging would be preferred. Furthermore, they heuristically opted to define the baseline as the period of 9 consecutive days (out of at most 90 days of observation) with the minimum average value.
        We observe that our patients tend to monitor their nocturnal cough less consistently than intended. Of the 75/92 patients with at least 9 days of follow-up data, only two thirds have data for 9 consecutive days. Depending on the missingness pattern, different multiple imputation options will be applied for patients with fewer than 9 consecutive days, and their impact on the resulting baseline definitions will be investigated. Potential alert frequencies will be compared between these baseline definitions for patients with continued follow-up after an incomplete “baseline” period. (Data collection and work in progress.)
        (1) den Brinker et al. Alert system design based on experimental findings from long-term unobtrusive monitoring in COPD. Biomed Signal Process Control. 2021 Jan;63:102205.
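
        The smoothing step described above can be sketched as a first-order IIR filter with its pole at 0.75. The abstract does not give the filter's gain or initial condition, so the unit-gain form y[n] = 0.75·y[n-1] + 0.25·x[n] and the initialisation with the first observation are assumptions made here, as are the function name and the example values. The input is taken to be cough counts already mapped to the B-scale; that mapping itself is not reproduced.

```python
def iir_smooth(values, pole=0.75):
    """First-order IIR smoother: y[n] = pole * y[n-1] + (1 - pole) * x[n].

    The first output equals the first input (assumed initial condition).
    """
    smoothed = []
    previous = None
    for x in values:
        if previous is None:
            previous = x
        else:
            previous = pole * previous + (1.0 - pole) * x
        smoothed.append(previous)
    return smoothed


# Hypothetical B-scale values with one isolated spike (assumption):
nightly_values = [0.0, 0.0, 8.0, 0.0]
print(iir_smooth(nightly_values))
```

        An isolated high night is damped to a quarter of its size and then decays geometrically, so a single spike is less likely to trigger an alert than a sustained rise.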

        Speaker: Dörte Huscher (Institute of Biometry and Clinical Epidemiology, and Berlin Institute of Health, Charité - Universitätsmedizin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin)
      • 14:21
        The performance paradox: understanding the discrepancy between the performance of imputation approaches for survival models with missing covariates in simulation studies and real data 18m

        Multiple imputation (MI) continues to be a popular approach to deal with missing-at-random covariate data. For MI to perform well, it is advisable to ensure that the imputation model for a given covariate does not make assumptions that conflict with the substantive/analysis model. In the case of substantive models that assume proportional hazards (e.g., the standard Cox model for a single time-to-event outcome), it is not straightforward to correctly specify an imputation model: even with simple time-constant log-linear effects, the conditional distribution of a partially observed covariate given the remaining covariates and outcomes will usually have a non-linear expectation and non-constant variance. This means that default approaches, such as MI using chained equations (MICE) with predictive mean matching, are prone to bias when estimating quantities such as hazard ratios or survival probabilities.

        In order to draw imputed values from a conditional distribution that is instead consistent with the assumptions made by the specified substantive model, a variant of MICE called substantive-model-compatible fully conditional specification (SMC-FCS) was developed [Bartlett et al., 2015, SMMR]. Over the past decade, this methodology has been adapted to accommodate different kinds of proportional hazards models, such as cause-specific Cox models, the Fine–Gray model, flexible parametric excess hazard models, and more. The relevant simulation studies, which tend to compare SMC-FCS with complete-case analysis (CCA) and the competing MICE approach, all point in the same direction: both MI approaches outperform CCA in terms of efficiency gains, but SMC-FCS is preferable to MICE in terms of bias. In contrast, the results of the applied data examples from these publications paint a more neutral picture: the differences between SMC-FCS and MICE are often negligible.

        In this work, we take a critical look at the reasons behind the disconnect between these simulation studies and their associated real data examples, and reflect on neutral or ‘honest’ ways in which we could more efficiently build up empirical evidence in this setting. Due to the vast parameter space in methodological research about missing data, simulation studies are often restricted to relatively narrow and unrealistic settings. Producing more generalisable evidence may instead require enriching data examples using thorough performance benchmarking (e.g., applying SMC-FCS and competing methods for a range of increasingly complex substantive models, in addition to the original ‘illustrative’ one), or using these datasets as a basis for plasmode simulations (i.e., resampling covariate data, and controlling the outcome-generating process).

        Speaker: Edouard F. Bonneville (LMU Munich, and Munich Center for Machine Learning)
      • 14:39
        Handling Missing Data in Life Science: A Comparative Study of Imputation Methods for Medical Data. 18m

        Handling of missing data is a crucial aspect when preparing data sets for further analyses in several research areas. Previous studies have shown that the choice of imputation method can have a high influence on subsequent analyses, especially in medical research, where missing values often occur due to study design or data collection challenges.
        In this study, we conduct a comparative simulation study of commonly used imputation methods. The study includes methods such as missRanger and mixgb (both tree-based imputation methods), MICE (Multiple Imputation by Chained Equations) and naive imputation (based on the arithmetic mean and mode) as a benchmark.
        We use both, simulated and real-world data sets from medical research. Based on these data sets, we show how to assess the imputation methods based on their imputation accuracy. Since there is no unique definition of the accuracy of an imputation method, we focus on different goals that researchers might have when imputing missing values. We assess the predictive accuracy (reconstruction of the actual values) by using normalized root squared error (NRMSE) and the proportion of false classification/imputation (PFC). To assess how well the original distribution is reconstructed, we use distribution distance measures such as a uni- and multivariate Kolmogorov Smirnov Statistic.
        While previous studies often find tree-based methods to perform “best“ , our results demonstrate that no single method consistently outperforms others. The optimal choice depends on the analysis goal and evaluation criteria.
        This study can be seen as a guide for researchers for selecting imputation methods aligned with different research goals with particular relevance for medical research and beyond.
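
        The two predictive-accuracy criteria named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one common definition of NRMSE (root mean squared error divided by the standard deviation of the true values), and the function names and toy data are hypothetical.

```python
import math

def nrmse(true_vals, imputed_vals):
    """RMSE over the imputed cells, scaled by the SD of the true values (assumed definition)."""
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)

def pfc(true_labels, imputed_labels):
    """Proportion of categorical cells whose imputed label differs from the true label."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)

# Toy data: values that were masked, and their naive (mean / mode) imputations.
true_num = [1.0, 2.0, 3.0, 4.0]
mean_imputed = [2.5, 2.5, 2.5, 2.5]
true_cat = ["a", "b", "a", "c"]
mode_imputed = ["a", "a", "a", "a"]

naive_nrmse = nrmse(true_num, mean_imputed)
naive_pfc = pfc(true_cat, mode_imputed)
```

        Under this definition, imputing the mean of the masked values gives an NRMSE of 1, which is a convenient reference point for the naive benchmark.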

        Speaker: Maria Thurow (TU Dortmund University)
      • 14:57
        Evaluating the Impact of Missing Data Imputation Methods on Bias and Covariate Balance in Propensity Score Analysis: A Simulation Study 18m

        Missing covariate data is a significant source of bias in observational studies that use propensity score (PS) analysis to draw causal inferences. The accuracy of treatment effect estimation is determined not just by how missing data is handled, but also by the method used to estimate propensity scores. A variety of methods for handling missing covariate data in propensity score analyses have been investigated in earlier research. Complete-case analysis and multiple imputation (MI) are frequently used methods. However, the majority of known studies compare traditional MI approaches (e.g., MICE) with logistic regression-based propensity score estimation, leaving unresolved questions concerning the effectiveness of more flexible machine-learning-based algorithms (2). At the same time, new research demonstrates that tree-based models like random forests might enhance propensity score estimation and minimize bias in complicated, nonlinear data structures (3). Nonetheless, few simulation studies have investigated how imputation and PS estimation approaches interact to affect bias, covariate balance, and overlap. In this talk, we present a simulation study comparing three missing-data handling methods (MICE, missForest, and random-forest imputation) with complete-case analysis (CC), each combined with one of two propensity-score estimation techniques: logistic regression and random forest. The performance of each combination of missing-data method and PS estimation technique is evaluated using three criteria: (i) bias in the estimated average treatment effect, (ii) standardized mean differences (SMDs) of variables after weighting, and (iii) overlap between treatment and control propensity score distributions.
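
        Criterion (ii) can be illustrated with a short sketch. It assumes inverse-probability weights for the average treatment effect and a pooled standard deviation built from the weighted group variances; other conventions for the denominator exist, and all names and data below are hypothetical.

```python
import math

def ipw_weights(treated, ps):
    """ATE weights: 1/ps for treated units, 1/(1 - ps) for controls."""
    return [1 / p if t == 1 else 1 / (1 - p) for t, p in zip(treated, ps)]

def _wmean(x, w):
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)

def _wvar(x, w):
    m = _wmean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)

def weighted_smd(x, treated, weights):
    """Weighted mean difference divided by the pooled weighted SD (assumed convention)."""
    xt = [xi for xi, t in zip(x, treated) if t == 1]
    wt = [wi for wi, t in zip(weights, treated) if t == 1]
    xc = [xi for xi, t in zip(x, treated) if t == 0]
    wc = [wi for wi, t in zip(weights, treated) if t == 0]
    pooled_sd = math.sqrt((_wvar(xt, wt) + _wvar(xc, wc)) / 2)
    return (_wmean(xt, wt) - _wmean(xc, wc)) / pooled_sd

# Toy covariate: imbalanced before weighting, balanced after.
x = [1, 3, 1, 3, 3]
treated = [1, 1, 0, 0, 0]
smd_before = weighted_smd(x, treated, [1, 1, 1, 1, 1])
smd_after = weighted_smd(x, treated, [1, 1, 2, 1, 1])
```

        An absolute SMD below 0.1 is a commonly used rule of thumb for acceptable balance; that threshold is not taken from the abstract.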

        Speaker: Saghar Garayemi (Augsburg University)
    • 13:45 15:15
      Other 1 Room 12

      Room 12

      Convener: Chris Jennison (University of Bath)
      • 13:45
        Statistical methods to reduce Selection Bias in Dose-Finding Studies with Binary Endpoints 18m

        In oncology drug development, phase II dose-finding studies are essential to identify the most promising dose levels for confirmatory phase III trials. Traditionally, dose selection is based on the maximum tolerated dose, which does not necessarily correspond to the optimal dose in terms of efficacy and safety. To address the challenge of dose optimization, the Oncology Center of Excellence of the U.S. Food and Drug Administration launched the OPTIMUS project. One challenge that arises in this context is the accurate estimation of therapeutic effects. In practice, the true treatment effect is often overestimated in phase II studies, and results cannot always be confirmed in subsequent phase III trials. While computational bias-correction methods for dose-finding trials have been presented for normally distributed outcomes (e.g., bootstrap-based methods, [1]), none have so far been presented for binary endpoints in this context, which are common in oncological trials (e.g., tumor response or occurrence of toxicities).
        Objective:
        This work aims to evaluate statistical methods for reducing selection bias in dose-finding studies with binary outcomes. Specifically, the performance of bootstrap-based methods, adapted from approaches presented for normally distributed endpoints, as well as additional Bayesian hierarchical approaches is investigated through simulation studies focusing on tumor response as the binary endpoint.
        Methods:
        A simulation study is conducted considering different underlying dose-response relationships (e.g. Emax, linear, logistic, etc.).
        First, single and double (non-)parametric bootstrap methods, originally proposed for normally distributed endpoints, are evaluated. The impact of different numbers of bootstrap repetitions on the bias and mean squared error (MSE) of the true maximal dose is compared across methods and dose–response relationships.
        For the Bayesian hierarchical models, various hyperpriors for the variance parameter are investigated, including Gamma, Half-t, and Uniform distributions with different parameterizations. The final model will be based on the hyperprior that minimizes bias and MSE of the estimated maximum response.
        Based on these results, bootstrap and Bayesian methods will be jointly evaluated and compared with other approaches, such as additive and multiplicative shrinkage methods.
        Outlook:
        The simulation framework enables a systematic comparison of bias-reduction methods for binary endpoints under a range of dose-response patterns. Results will be presented and may provide guidance on selecting appropriate methods to minimize bias in dose selection.
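
        The idea behind a single parametric bootstrap correction for a selected binary endpoint can be sketched as below. This is an illustrative simplification under stated assumptions (equal arm sizes, selection of the dose with the highest observed response rate, no dose-response model), not the method evaluated in the abstract; the function name and inputs are hypothetical.

```python
import random

def bootstrap_corrected_max(responders, n_per_arm, n_boot=1000, seed=1):
    """Return (naive, corrected) estimates of the response rate at the selected dose."""
    rng = random.Random(seed)
    p_hat = [x / n_per_arm for x in responders]
    naive = max(p_hat)
    bias_sum = 0.0
    for _ in range(n_boot):
        # Resample each arm from a binomial model with the observed rate.
        p_star = [
            sum(rng.random() < p for _patient in range(n_per_arm)) / n_per_arm
            for p in p_hat
        ]
        # Repeat the selection step in the bootstrap world.
        j = max(range(len(p_star)), key=p_star.__getitem__)
        # Selection bias there: selected bootstrap rate minus its known "truth".
        bias_sum += p_star[j] - p_hat[j]
    bias = bias_sum / n_boot
    return naive, naive - bias

# Three dose arms with identical observed response rates (10 of 30 responders each).
naive, corrected = bootstrap_corrected_max([10, 10, 10], 30, n_boot=500)
```

        When the arms are truly similar, the naive maximum tends to overestimate the response rate of the selected dose, so the corrected value is pulled downwards.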

        [1] Zhan T. A class of computational methods to reduce selection bias when designing Phase 3 clinical trials. Statistics in Medicine. 2024;43(10):1993-2006. doi: 10.1002/sim.10041

        Speaker: Alexandra Balzer (Institute of Medical Biometry Heidelberg University Hospital)
      • 14:03
        Comparison of ANOVA methods for experiments in the nested block design 18m

        Experimental designs with orthogonal block structures are commonly used in many areas of science in order to control the external sources of variability. The aim of this study is to compare several analysis of variance (ANOVA) methods applicable to such structures. Comparing these approaches is of practical importance, as the choice of the analytical method may influence inference about treatment effects and the estimation of variance components in complex experimental structures. The experiments were conducted using a nested block design, where the treatments are distributed over blocks. Each block consists of a certain number of experimental units (plots), which are further grouped into superblocks. This framework allows the total variation to be decomposed into orthogonal components corresponding to successive strata, which naturally leads to the mixed model representation of treatment and block effects.

        Three analytical methods were compared in this study. The first method is based on decomposing the model into several submodels, in accordance with the stratification of the experimental units. Information from the individual strata is then taken into account in the analysis of variance. This method was thoroughly described by Caliński and Kageyama (2000). The second method is based on the residual maximum likelihood (REML) approach. The Kenward–Roger method for estimating degrees of freedom was applied. Finally, taking advantage of the orthogonal block structure, the analysis of variance can be performed directly, without combining results from intra-block and inter-block analyses, as described by Caliński and Siatkowski (2018).

        The main goal of this research was to assess the effectiveness of the aforementioned methods. They were applied to datasets gathered from several experiments involving different numbers of plots, different block sizes, different numbers of treatments, and various sizes of superblocks. The results of analyses of variance were compared and discrepancies were investigated by identifying issues arising from the estimation of variance components, especially cases where the REML method omitted certain effects. Moreover, the run-times of the software and the numbers of iterations required to obtain variance component estimates were compared.

        References:

        1.    Caliński, T., Kageyama, S. (2000) Block Designs: A Randomization Approach: Volume I: Analysis. Springer New York

        2.    Caliński, T., Siatkowski, I. (2018). On a new approach to the analysis of variance for experiments with orthogonal block structure. II. Experiments in nested block designs. Biometrical Letters, 55(2), 147-178.

        3.    Searle, S. R., Casella, G., & McCulloch, C. E. (2009). Variance components. John Wiley & Sons.

        Speaker: Konrad Banaś (Department of Mathematical and Statistical Methods, Poznań)
      • 14:21
        Rectangular augmented row-column designs generated from contractions 18m

        Row-column designs play an important role in applications where two orthogonal sources of error need to be controlled for by blocking. Field or greenhouse experiments in which experimental units are arranged as a rectangular array are a prominent example. In plant breeding, the amount of seed available for the treatments to be tested may be so limited that only one experimental unit per treatment can be accommodated. In such settings, augmented designs become an interesting option, where a small set of treatments, for which sufficient seed is available, are replicated across the rectangular layout so that row and column effects, as well as the error variance, can be estimated. Here, we consider the use of an auxiliary design, also known as a contraction, to generate an augmented row-column design. We make use of the fact that the efficiency factors of the contraction and the associated augmented design are closely interlinked. A major advantage of this approach is that an efficient contraction can be found by computer search at much higher computational speed than is required for direct search for an efficient augmented design. Two examples are used to illustrate the proposed method.

        Speaker: Hans-Peter Piepho (University of Hohenheim, Stuttgart)
      • 14:39
        Effect of using textbook field plans without randomization 18m

        Fisher (1925) introduced the three principles of experimental design: (i) true replicates, (ii) randomization, and (iii) blocking. The first two are strictly required, while blocking often increases precision. That is what we tell our agricultural students. However, in practice, randomization is often ignored, either in the first replicate (van Santen and West, 2012) or completely. Often, the design is taken directly from textbooks and thus no randomization at all is performed. In these cases, the same design is often used repeatedly across trials and years. The reason for this practice is convenience. Moreover, it allows the first replicate to serve as demonstration plots. A common argument is that the systematic order could have occurred by chance, too. It is currently unclear how large the potential effects of a systematic order in series of experiments are, especially a systematic order in the first replicate. We therefore analysed a series of trials conducted at one location across ten years. Eleven or twelve historical winter wheat cultivars were tested each year in four replicates. Depending on the number of cultivars, two different textbook field plans were used across years. Additionally, some recent cultivars were changed over time. However, nine of the cultivars were constant across all ten years, and the order of these cultivars was identical across all years. The current study investigates how large the effects of not randomizing cultivars are. The study shows that mean estimates are biased. Moreover, the variance of treatment comparisons can be under- or overestimated. The study evaluates the consequences of under- and overestimating the variance of treatment comparisons on the chance of selecting the truly best cultivar.

        Speaker: Jens Hartung (University of Applied Sciences Weihenstephan-Triesdorf)
      • 14:54
        Socio-spatial characterization of sub-sewersheds for wastewater-based epidemiology: Developing and evaluating two estimators for population-related variables 18m

        Wastewater-based epidemiology (WBE) offers a promising approach to assess population health by analysing health-related markers in sewage. Interpreting such data at fine spatial scales requires accurate numbers of the contributing population. However, allocating population information to sewersheds is complicated by the lack of spatial congruence between administrative boundaries and sewer networks. To date, no standardized method exists for resolving this mismatch.

        This study presents and evaluates two novel approaches for estimating sub-sewershed populations: Proportional Building-based Population Estimation (PBPE) and Spatial Grid Population Estimation (SGPE). PBPE redistributes population data from administrative units proportionally to the number of residential buildings intersecting each sub-sewershed, while SGPE applies inverse-distance weighting to interpolate population density across a hexagonal grid informed by residential land-use data.

        Both estimators were implemented for a large German metropolitan area comprising 195 sub-sewersheds. Their performance was assessed by comparing estimated populations to reported reference data and to a simple spatial overlay baseline. Despite differing data requirements and assumptions, both methods produced consistent and plausible results across all parameters.

        PBPE tends to offer higher precision when detailed building data are available, while SGPE provides a flexible and transferable alternative when such data are incomplete or unavailable. The comparison highlights trade-offs between data granularity, computational complexity, and estimation stability. By providing reproducible and scalable estimation frameworks, this study contributes to improving small-scale population inference for WBE and other spatially disaggregated health applications. The proposed methods enhance the interpretability of biometrically relevant indicators derived from wastewater data and support the integration of WBE into public health surveillance and environmental epidemiology.
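
        The proportional redistribution that PBPE is described as performing can be sketched in a few lines. The sketch assumes that building counts per (administrative unit, sub-sewershed) pair are already available from a spatial intersection, and that every residential building of a unit falls into one of the listed sub-sewersheds; the function name, identifiers and numbers are hypothetical.

```python
from collections import defaultdict

def pbpe(unit_population, building_counts):
    """Allocate each unit's population to sub-sewersheds in proportion to building counts.

    unit_population: {unit_id: population}
    building_counts: {(unit_id, sewershed_id): number of residential buildings}
    """
    buildings_per_unit = defaultdict(int)
    for (unit, _shed), n in building_counts.items():
        buildings_per_unit[unit] += n

    estimate = defaultdict(float)
    for (unit, shed), n in building_counts.items():
        if buildings_per_unit[unit] > 0:
            estimate[shed] += unit_population[unit] * n / buildings_per_unit[unit]
    return dict(estimate)

# Two administrative units, two sub-sewersheds; unit U1 straddles both.
population = {"U1": 1000, "U2": 600}
counts = {("U1", "S1"): 30, ("U1", "S2"): 10, ("U2", "S2"): 20}
estimates = pbpe(population, counts)
```

        By construction the estimates sum to the total population of the units involved, which is a useful sanity check on the intersection step.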

        Speaker: Yassine Talleb (TU Dortmund University)
    • 13:45 15:15
      TC5: Statistical issues in health care provider comparisons Room 1 B

      Room 1 B

      Convener: Johannes Rauh (IQTIG)
      • 13:45
        Causal Inference for Healthcare Profiling in Low-Event Settings 18m

        In healthcare provider profiling, accurately assessing hospital performance is crucial for informed decision-making and quality improvement. Traditional approaches rely heavily on parametric regression models for risk adjustment, but these methods often fail to account for between-center heterogeneity and may produce biased estimates, especially in the presence of low event rates or small provider sample sizes. This talk reviews statistical approaches for provider profiling, offering a unified perspective across several approaches. We cast the problem in a causal inference framework and focus on balancing weight methods using constrained optimization algorithms. A case study using a dataset of nearly 43,000 congenital heart surgeries undertaken between 2016 and 2022 examining operative mortality across 115 U.S. centers illustrates issues. We describe a flexible framework with robust estimation of nuisance functions that account for between-center heterogeneity in treatments and patient confounders, particularly when positivity violations and low event rates complicate inferences. Various estimation strategies using the congenital heart data are employed, providing an implementation strategy across the different estimation approaches (funded by Grant R01HL162893 from the U.S. National Institutes of Health).

        Speaker: Sharon-Lise Normand (Harvard Medical School)
      • 14:03
        Measuring performance for end-of-life care 18m

        Although not without controversy, readmission is entrenched as a hospital quality metric with statistical analyses generally based on fitting a logistic-Normal generalized linear mixed model. Such analyses, however, ignore death as a competing risk, although doing so for clinical conditions with high mortality can have profound effects; a hospital’s seemingly good performance for readmission may be an artifact of it having poor performance for mortality. In this paper we propose novel multivariate hospital-level performance measures for readmission and mortality that derive from framing the analysis as one of cluster-correlated semi-competing risks data. We also consider a number of profiling-related goals, including the identification of extreme performers and a bivariate classification of whether the hospital has higher-/lower-than-expected readmission and mortality rates via a Bayesian decision-theoretic approach that characterizes hospitals on the basis of minimizing the posterior expected loss for an appropriate loss function. In some settings, particularly if the number of hospitals is large, the computational burden may be prohibitive. To resolve this, we propose a series of analysis strategies that will be useful in practice. Throughout, the methods are illustrated with data from CMS on N = 17,685 patients diagnosed with pancreatic cancer between 2000 and 2012 at one of J = 264 hospitals in California.

        Speaker: Sebastien Haneuse (Harvard T.H. Chan School of Public Health)
      • 14:21
        Timely yearly assessment of follow-up outcomes using a period-based survival data approach 18m

        Quality assessment in healthcare frequently relies on quality indicators based on follow-up data tracking patient outcomes after treatment. However, conventional cohort-based indicators require complete follow-up, which can result in substantial lag between data collection and analysis. To enable more timely yearly assessment, we propose a period-based approach, in which all data collected within a defined time period (in our case one calendar year) are evaluated jointly. This way we consider each follow-up event in exactly one annual evaluation, ensuring a clear allocation of follow-up events to a single reporting year. This design results in left-truncated and right-censored survival data, which requires the use of specific statistical methods for the analysis. If there are no relevant patient-related risk factors, we use the Kaplan-Meier estimator. Otherwise, we use risk-adjusted rates that take truncation and censoring into account when estimating the risk of each case. If the duration between the treatment and the follow-up event is of interest, we estimate hazard ratios. Finally, we demonstrate how Bayesian models for these indicators can be used to quantify uncertainty and to identify hospital providers with poor performance.
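
        A product-limit estimator for left-truncated, right-censored data differs from the usual Kaplan-Meier estimator only in the risk set: a case counts as at risk at time t only if it has already entered observation. The sketch below assumes time is measured since treatment and that each case enters at its time since treatment at the start of the reporting year; the function name and toy data are hypothetical.

```python
def km_left_truncated(entry, exit_, event):
    """Product-limit survival curve with delayed entry.

    entry: time since treatment at which observation of the case starts
    exit_: time since treatment of the event or of censoring
    event: 1 if the follow-up event was observed at exit_, else 0
    """
    event_times = sorted({x for x, e in zip(exit_, event) if e})
    surv = 1.0
    curve = []
    for t in event_times:
        # At risk at t: already under observation and not yet exited.
        at_risk = sum(1 for a, x in zip(entry, exit_) if a < t <= x)
        events_at_t = sum(1 for x, e in zip(exit_, event) if e and x == t)
        surv *= 1 - events_at_t / at_risk
        curve.append((t, surv))
    return curve

# Four cases; the last two enter late (left truncation), the third is censored.
curve = km_left_truncated(entry=[0, 0, 2, 3], exit_=[1, 4, 5, 6], event=[1, 1, 0, 1])
```

        Early risk sets can be very small under delayed entry, so estimates at short follow-up times may be unstable in practice.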

        Speaker: Lisa Steyer (Federal Institute for Quality Assurance and Transparency in Healthcare)
      • 14:39
        Improved predictions of quality of care indicators in the tail of its distribution 18m

        Standard mixed methods have been popular for evaluating performance across care centers in terms of indicators that summarize residents’ outcomes. Their results tend to lack power, however, for the detection of poor performance [1]. This stems from regression to the mean when estimating center performance based on the usual BLUP estimation of the center-specific random effect. To avoid this, one has turned to fixed effects models that may be Firth corrected.

        Recently, ‘optimally weighted BLUPs’ methods have been proposed [2] that allow for better prediction of extreme (e.g. poor) outcome indicators after fitting the mixed model for continuous outcomes. In this talk, we adapt this approach to find improved standardized QoL measures for diagnosing outlying center performance. We evaluate how these results compare with those of ‘standard random effects models’ or ‘adapted fixed effects models’ for the purpose of diagnosing centers with poor outcomes. We apply the new method to evaluate causal estimands for quality of life in care centers and discuss extensions that allow for death as an intercurrent event when residents are followed over time.

        References
        1. M. Varewyck, E. Goetghebeur, M. Eriksson and S. Vansteelandt. (2014) On shrinkage and model extrapolation in the evaluation of clinical center performance. Biostatistics, 15: 651-664
        2. C. E. McCulloch and J. M. Neuhaus. (2023) Improving predictions when interest focuses on extreme random effects. Journal of the American Statistical Association, 118(541):504–513.

        Speaker: Els Goetghebeur (Ghent University)
      • 14:57
        Comparing the usefulness of patient and clinician reported outcomes measures to compare providers of arthroscopic rotator cuff repair – Methodological considerations 18m

        Background: There is increasing interest in using patient-reported outcome measures for provider comparisons. However, guidance on the choice of outcomes and the selection of variables for case-mix adjustment for specific patient groups is lacking.

        Material: In the ACRF-pred study 973 patients from 19 different clinics were followed after arthroscopic rotator cuff repair for 24 months with several patient reported outcomes and for 12 months with several clinician reported outcomes. Fifty-two potential indicators for provider comparisons could be defined. Five approaches to select variables (out of 55 available) for case-mix adjustment were predefined to study the robustness of the comparison of indicators.

        Methods: The usefulness of an indicator was conceptually defined as the ability to discriminate between clinics. This ability was assessed by the variation explained by the clinics after case-mix adjustment. To judge the need for case-mix adjustment, the reduction in the standard deviation of the case-mix adjusted clinic-specific mean values was considered. The direct impact of the approach to variable selection was measured by the change in position in funnel plots.
        To shift the focus from average performance towards occasional low-level performance, percentile-based transformations of indicators were considered, giving higher weights to unfavourable outcome values. The same analytical steps were performed, focusing on how results shift as increasing weight is given to unfavourable outcomes. The validity of analysing percentile-based transformed variables by mixed models was investigated in a simulation study.

        Results: Both patient-reported and clinician-reported outcomes were able to discriminate between clinics to a substantial degree. The impact of case-mix adjustment was rather moderate, which could be explained by a lack of variables that are both predictive of the outcomes of interest and vary in distribution across clinics.
        Moving the focus from average performance towards occasional low-level performance can be approached in a reliable manner using percentile-based transformation of indicators. It can have an impact on the choice of outcomes and the positioning of single clinics.
        Choice of outcomes for quality assessment and monitoring can benefit from further analyses addressing the question whether different outcomes / transformations address the same or different underlying quality constructs.

        Speaker: Werner Vach (Basel Academy for Quality and Research in Medicine)
    • 15:15 15:45
      Coffee break 30m
    • 15:45 17:15
      Clinical trials 5 Room 13 B

      Room 13 B

      Convener: Tomasz Burzykowski (Hasselt University)
      • 15:45
        Assessment of Global Evidence Against Homogeneity for Exhaustive Subgroup Treatment Effect Plots 18m

        Assessment of treatment effect heterogeneity is a challenging problem in biostatistics, particularly in clinical trials: estimation of treatment effects within subgroups in an exploratory setting is often unreliable due to limited sample sizes and multiplicity issues. Over the past decades, many efforts have been made to address this problem. Among them, Muysers et al. (2020) considered generating a graphical display that presents numerous subgroups in the same figure and could potentially illustrate homogeneity or heterogeneity. This interactive plot has the outcome variable (treatment effect measure) on the y-axis and subgroup size on the x-axis. We refer to this plot as an exhaustive subgroup treatment effect plot. While the original plot purposely avoids inferential statistics, we believe that there could be interest in guiding the interpretation of the observed heterogeneity. For example, one may ask whether the observed heterogeneity is as expected or larger than expected under global homogeneity. In this presentation, we will introduce a computationally efficient method to derive homogeneity regions in such exhaustive subgroup treatment effect plots based on the doubly robust learner approach (Kennedy, 2023). We also conduct a comprehensive simulation study to evaluate the validity of the approach and illustrate the methodology with a real case study.

        References:
        Kennedy, E.H., 2023. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics Vol. 17 (2023) 3008-3049.

        Muysers, C., Dmitrienko, A., Kulmann, H., Kirsch, B., Lippert, S., Schmelter, T., Schulz, A., Mentenich, N., Schmitz, H., Schaefers, M., Meinhardt, G., Keil, T., Roll, S., 2020. A Systematic Approach for Post Hoc Subgroup Analyses With Applications in Clinical Case Studies. Ther Innov Regul Sci 54, 507–518.

        Speaker: Björn Bornkamp (Novartis Pharma AG)
      • 16:03
        Comparing adverse event probabilities in a hypothetical world without consent withdrawals or treatment switches 18m

        A question that arises in the analysis of adverse events is how to account for patients who withdraw their consent or switch treatment. One approach is to consider consent withdrawal and treatment switch as competing events. Alternatively, patients who withdraw from the study or switch treatment could be censored, but this implies that one assumes censoring due to treatment switch or consent withdrawal not to be related to the treatment or any disease stage, i.e., to be random. In other words, this approach assumes that patients who do not withdraw consent and do not switch treatment are representative of those who do, which may be invalid. These two approaches to handling consent withdrawals and treatment switches in the analysis of adverse events were also discussed by the SAVVY project [1]. As an alternative, inverse probability of censoring weighting (IPCW) can be used to answer questions in a hypothetical world in which a treatment switch or consent withdrawal does not occur. For this, patients who do not withdraw their consent or switch treatments will be up-weighted to represent those who do. In this talk, we will discuss the construction of IPCW estimators in competing events analyses and the assumptions under which IPCW can be used to causally analyse the hypothetical scenario in which a competing event does not occur using data from a randomised study in elderly AML patients investigating the effects of valproate and retinoic acid [2]. Our approach distinguishes between ‘hard’ and ‘soft’ competing events in that hypothetical IPCW calculations are not applied to competing mortality.

        References
        [1] Stegherr R, Beyersmann J, Jehl V, Rufibach K, Leverkus F, Schmoor C, and Friede T. Survival analysis for AdVerse events with VarYing follow-up times (SAVVY): Rationale and statistical concept of a meta-analytic study. Biometrical Journal, 63:650–670, 2021.
        [2] Lübbert M, Grishina O, Schmoor C, et al. Valproate and retinoic acid in combination with decitabine in elderly nonfit patients with acute myeloid leukemia: Results of a multicenter, randomized, 2 × 2, phase II trial. Journal of Clinical Oncology, 38:257–270, 2020.

        Speaker: Judith Vilsmeier (Institute of Statistics, Ulm University)
      • 16:21
        Counterfactual Uncertainty Quantification in Personalized Medicine: A Statistical Framework for RCTs 18m

        As medicine enters an era of precision, the challenge for statistics is no longer whether personalized care is possible, but how best to translate its potential into clinical practice. Zhao et al. (2012) formulated the personalized medicine problem as finding the optimal individual treatment rule (ITR) by maximizing the expected clinical responses. More recently, Lei and Candès (2021) developed interval estimates for individual treatment effects using conformal inference in observational studies. There is also a line of research on Digital Health using data from wearable devices.

        In contrast to these lines of research, this presentation contributes to personalized medicine by addressing uncertainty quantification in randomized controlled trials (RCTs). While point estimation of counterfactual efficacy has been understood since Jerzy Neyman's work in 1923, uncertainty quantification in this setting has remained unresolved. We introduce Counterfactual Uncertainty Quantification (CUQ), enabled by a new statistical modeling principle called ETZ, which often yields lower variability than traditional UQ methods in personalized medicine. We also highlight the risks of using predictors measured with error and discuss conditions under which counterfactual estimates remain unbiased. Finally, we emphasize the need for caution when estimating subgroup effects, as bias can arise in both the Real Human approach and the Digital Twin AI technique.

        Speaker: Xingya Wang (Department of Mathematics, The University of Manchester)
      • 16:39
        Continuous Monitoring in Early Phase Oncology: A Standardized, Patient-Centric Approach 18m

        Continuous monitoring (CM) at AstraZeneca is the systematic review and evaluation of accumulating study data to inform timely decisions. Rather than waiting for formal interims or study completion, our CM approach in early phase oncology studies enables earlier data-driven decisions to stop for futility or safety, minimising exposure to ineffective or unsafe treatments, and to accelerate promising therapies. We present a framework that aligns statistical decision-making with operational feasibility, improving patient centricity, consistency, and auditability.
        The proposed Bayesian frameworks utilise posterior or predictive probability decision rules and evaluate operating characteristics to guide when to start CM and how frequently to analyse data, balancing performance and operational feasibility. We assess probabilities of making correct decisions based on prespecified benchmarks, expected sample size, and patient allocation at suboptimal doses across clinically relevant scenarios. Given that dose optimisation is often conducted whilst a compound’s full safety profile emerges, the framework monitors both binomial safety endpoints (frequency of ≥ grade 3 adverse events) and efficacy endpoints (e.g. objective response rate). Optimal statistical considerations are presented alongside potential operational constraints, including data entry lag, transfer timelines, tolerance for unclean/missing data and scope of the outputs to support decision making.
        Health authority feedback has guided our recommendations for when to implement CM, ensuring alignment with regulatory expectations. The framework provides standards and templates for rigorous, reproducible CM plans for early-phase study design and conduct. By standardising CM, we deliver reliable safety and efficacy decisions, reducing time and exposure at non-optimal doses, supporting complex development plans.
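
        The posterior probability decision rules mentioned above can be illustrated with a minimal sketch. It is not the framework presented in the talk: the uniform Beta(1, 1) prior, the benchmark rate p0 = 0.30, the cutoff 0.05 and all function names are assumptions chosen for illustration only.

```python
from math import comb


def posterior_prob_exceeds(responders: int, n: int, p0: float) -> float:
    """P(true rate > p0 | data) under a uniform Beta(1, 1) prior (assumed).

    The posterior is Beta(responders + 1, n - responders + 1). For integer
    parameters its upper tail equals a binomial sum, so only the standard
    library is needed.
    """
    m = n + 1
    return sum(
        comb(m, j) * p0**j * (1 - p0) ** (m - j) for j in range(responders + 1)
    )


def futility_decision(responders: int, n: int, p0: float = 0.30,
                      cutoff: float = 0.05) -> str:
    """Hypothetical rule: stop if the rate is unlikely to exceed the benchmark p0."""
    prob = posterior_prob_exceeds(responders, n, p0)
    return "stop for futility" if prob < cutoff else "continue"


if __name__ == "__main__":
    # Re-evaluate the rule as (hypothetical) responder counts accumulate.
    for responders, n in [(0, 5), (0, 10), (3, 10)]:
        prob = posterior_prob_exceeds(responders, n, 0.30)
        print(n, responders, round(prob, 4), futility_decision(responders, n))
```

        Evaluating such a rule after every patient, or only after every few patients, changes its operating characteristics, which is why the start and frequency of analyses would need to be assessed by simulation.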

        Speaker: Kuzko Aleksandra (AstraZeneca)
      • 16:57
        Blinded continuous monitoring for continuous outcomes 18m

        Continuous monitoring is becoming more popular because of its benefits, including smaller sample sizes and earlier conclusions. In general, it involves monitoring nuisance parameters (e.g., the variance of outcomes) until a specific condition is satisfied. The blinded method, which does not require revealing group assignments, has been recommended because it maintains the integrity of the experiment and mitigates potential bias. Although Friede and Miller (2012) investigated the characteristics of blinded continuous monitoring through simulation studies, its theoretical properties have not been fully explored. In this paper, we aim to fill this gap by presenting the asymptotic and finite-sample properties of blinded continuous monitoring for continuous outcomes. Furthermore, we examine the impact of using blinded versus unblinded variance estimators in the context of continuous monitoring. Simulation results are also provided to evaluate finite-sample performance and to support the theoretical findings.
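
        The contrast between blinded and unblinded variance estimators can be sketched as follows. The specific estimators studied in the talk are not stated in the abstract; this sketch assumes the usual pooled (unblinded) variance and the one-sample (blinded) variance of the lumped data, and the data values are invented.

```python
from statistics import variance


def pooled_variance(group_a, group_b):
    """Unblinded estimator: uses arm labels, so only within-arm spread counts."""
    n_a, n_b = len(group_a), len(group_b)
    ss_within = (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    return ss_within / (n_a + n_b - 2)


def one_sample_variance(group_a, group_b):
    """Blinded estimator: arm labels are ignored and all outcomes are lumped."""
    return variance(list(group_a) + list(group_b))


if __name__ == "__main__":
    control = [0.0, 2.0]      # invented outcomes
    treatment = [4.0, 6.0]    # invented outcomes, shifted by a treatment effect
    print(pooled_variance(control, treatment))      # within-arm variability only
    print(one_sample_variance(control, treatment))  # also absorbs the mean shift
```

        Because the lumped estimator also absorbs the difference between the arm means, it tends to overestimate the within-arm variance when a treatment effect is present (by roughly a quarter of the squared mean difference under 1:1 allocation).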

        Speaker: Longhao Xu (Department of Medical Statistics, University Medical Center Göttingen)
    • 15:45 17:15
      IS8: Regularization methods and their applications in the work of IQWiG and IQTIG Room 1 A

      Convener: Ralf Bender (IQWiG)
      • 15:45
        Regularization methods in clinical biostatistics: State-of-the-art and possibilities for improvement 30m

        A range of regularization approaches have been proposed in the literature to overcome overfitting, to exploit sparsity or to improve prediction. Using a broad definition of regularization, namely controlling model complexity by adding information in order to solve ill-posed problems or to prevent overfitting, we review a range of approaches within this framework including penalization, early stopping, ensembling and model averaging. We investigate the extent to which these methods are applied in clinical medicine, discuss current limitations and point out possibilities for improvement.

        Speaker: Sarah Friedrich-Welz (University of Augsburg)
      • 16:15
        The Firth correction - a recap 20m

        In statistical analyses of binary outcomes for medical procedures performed by multiple health care providers, provider-specific effects are commonly handled using conditional models with random effects or using marginal models with generalized estimating equations (GEEs). While convenient, these models treat provider effects primarily as nuisance parameters, even though they may themselves be of substantive interest—for example, when evaluating performance variation across providers. Moreover, the shrinkage inherent in random-effects estimation typically leads to underestimation of between-provider differences.
        In this talk, we will discuss how the Firth correction could be embedded in such models. The Firth correction is a penalized-likelihood method originally introduced to reduce bias of maximum likelihood estimates which may result from high predictor dimensionality, near-multicollinearity, and sparse outcome events. An extreme form of this bias is separation, caused, e.g., if no events are observed at one level of a categorical predictor variable. Separation leads to non-existence of the maximum likelihood solution, but can be entirely prevented by the Firth correction. This attractive property may have contributed to that method's high popularity. Unlike other penalized-likelihood methods, the Firth correction does not have a tuning parameter. However, a well-known limitation of the Firth correction is its tendency to shrink predicted probabilities toward 0.5. To mitigate this, two straightforward modifications—FLIC and FLAC—have been proposed and will be reviewed in this talk.
        In provider-specific models, the Firth correction may be considered (i) to replace mixed-effects methods by treating providers as fixed effects, or (ii) to solve a separation problem related to fixed effects of providers or other variables that perfectly explain the outcome.
        For case (i), the Firth correction is expected to shrink provider effects less than mixed-effects methods. Moreover, it is transformation-invariant and does not require any distributional assumptions on the provider effects. However, its attractive bias-correcting property theoretically only holds if the number of subjects per health care provider is not too small. For case (ii), extensions of GEEs and mixed-effects logistic regression models incorporating the Firth correction were recently proposed to mitigate separation problems. For mixed models, these novel methods can simultaneously deal with convergence issues caused by either random or fixed effects. We will give an overview of these recent developments.
        Finally, we will provide some illustrative analyses of data examples to compare the operating characteristics of these models in health care provider-specific applications.
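
        How the Firth correction resolves separation can be seen in the simplest possible case. The sketch below assumes a saturated logistic model with a single binary predictor (two providers, say), where the Firth-penalized estimate has a closed form: it equals the ordinary estimate after adding 0.5 to every cell of the 2x2 table. General models need an iterative fit; the counts and function names here are invented for illustration.

```python
from math import inf, log, nan


def ml_log_odds_ratio(events_a, n_a, events_b, n_b):
    """Maximum likelihood log odds ratio (b vs. a); not finite under separation."""
    num = events_b * (n_a - events_a)
    den = events_a * (n_b - events_b)
    if num == 0 and den == 0:
        return nan
    if den == 0:
        return inf
    if num == 0:
        return -inf
    return log(num / den)


def firth_probability(events, n):
    """Firth-penalized event probability for one group of the saturated model."""
    return (events + 0.5) / (n + 1)


def firth_log_odds_ratio(events_a, n_a, events_b, n_b):
    """Firth-penalized log odds ratio (b vs. a) in the saturated 2x2 case."""
    p_a = firth_probability(events_a, n_a)
    p_b = firth_probability(events_b, n_b)
    return log(p_b / (1 - p_b)) - log(p_a / (1 - p_a))


if __name__ == "__main__":
    # Provider a has no events at all: the ML estimate does not exist.
    print(ml_log_odds_ratio(0, 10, 5, 10))
    print(firth_log_odds_ratio(0, 10, 5, 10))
    print(firth_probability(0, 10))
```

        The closed form also makes the shrinkage toward 0.5 visible: the penalized probability (events + 0.5) / (n + 1) always lies between the observed proportion and 0.5.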

        Speaker: Georg Heinze (Medical University of Vienna, Center for Medical Data Science, Institute of Clinical Biometrics)
      • 16:35
        Regularization methods in the evaluation of hospital quality 20m

        The IQTIG measures, compares and evaluates hospital quality using quality indicators. These usually consist of a population and a binary outcome of interest, such as whether complications have occurred after elective knee replacement. For a fair assessment and comparison of hospital quality, we need to adjust for the hospital’s case mix, i.e., for the patient-specific risk factors such as age or previous surgeries on the same knee. We jointly model the effect of these risk factors and the hospital effect on the outcome probability in a regression model based on individual patient data. Since usually there are hospitals with low caseloads and possibly even no outcomes of interest, it is often necessary to penalize the hospital effect to avoid separation issues.
        To this end, the hospital effect is often modeled as a random intercept, but Firth regression has also been proposed. Clearly, the choice of penalization influences the estimated hospital effect. Moreover, it also influences the estimates of other effects, such as patient-specific risk factors. We study the effects of the choice of penalty in two applications: In the first application, we want to estimate how the treatment quality depends on the hospital's caseload ("volume-outcome relationship"). In the second application, we aim to quantify the heterogeneity of hospital effects to estimate the potential for improvement when quality improvement measures decrease heterogeneity in treatment quality. While in the latter application, we are primarily interested in the hospital effects themselves, in the former the main interest lies in the (smooth) effect of the caseload.

        Speaker: Jona Cederbaum (Federal Institute for Quality Assurance and Transparency in Healthcare (IQTIG))
      • 16:55
        Regularization methods in clinical biostatistics: Evaluation of adverse events in early benefit assessment using Firth correction for Cox models in the case of zero events 20m

        For the early benefit assessment of drugs in Germany, the pharmaceutical company must describe the extent of an added benefit of the drug to be assessed compared with an appropriate comparator therapy [1]. The confidence interval of a significant effect must lie completely outside a certain corridor around the null effect for the extent of the effect to be regarded as minor, considerable or major. The corridors are defined by different thresholds depending on outcome category. For endpoints in the category of adverse events, it regularly happens that no events are observed in one of the arms. In that case, the standard Cox proportional hazards regression does not provide valid effect estimates with corresponding confidence intervals, while the log-rank test provides appropriate p-values. Thus, in the case of a statistically significant effect, the extent cannot be determined, which can lead to an inadequate overall assessment of the early benefit.
        Heinze and Schemper proposed an adaptation of the Firth correction to reduce the bias of maximum likelihood estimation in the Cox proportional hazards model [2].
        To assess the applicability of this approach, we performed a simulation study of time-to-event analyses with zero events. The simulations are based on example cases from previous dossier evaluations. We will present results from this study and discuss the situations in which the application of the Firth correction provides reliable estimates. We further describe the impact that the use of the Firth correction can have on endpoint-level benefit assessment.
        1. IQWiG. General Methods 7.0 [online]. (2020)
        2. Heinze and Schemper (2001). Biometrics 57(1):114–119.

        Speaker: Lars Beckmann (IQWiG)
    • 15:45 17:15
      Machine learning and data science 4 Room 14

      Convener: Andreas Ziegler (Cardio-CARE)
      • 15:45
        Generative Adversarial Networks For Mortality Modelling 18m

        Mortality risk modeling and forecasting is one of the key tasks of social security institutions and insurance companies. Traditionally used stochastic mortality models, such as the Lee-Carter model, require fulfillment of formal assumptions that cannot always be met in real-life scenarios (e.g. time independence of age-specific improvement rates). Alternative approaches are based on deep neural networks. Previous work in the field primarily covers recurrent neural networks, typically used in time series forecasting problems, as well as convolutional neural networks and, more recently, transformer-based architectures. This work aims to analyze the effectiveness of generative adversarial networks in generating and forecasting mortality data based on population mortality data from the Human Mortality Database. A time-series-specific generative adversarial network (GAN) architecture adjusted for mortality model mechanics is discussed, with a focus on its ability to generate new, realistic mortality rate trajectories based on source data, taking into account diversity, fidelity and usefulness criteria. As GAN models do not make any initial assumptions about the distribution of the modeled phenomena, they might serve as a compelling alternative to the models currently used by insurance companies for mortality rate simulation and forecasting.

        Speaker: Łukasz Głąb (SGH Warsaw School of Economics)
      • 16:03
        Lesion Network Mapping based Convolutional Neural Networks: Predictive Performance and Interpretability 18m

        Stroke can lead to a wide range of symptoms including acute motor impairment, post-stroke depression or cognitive impairment. Anticipating these outcomes early on and understanding their underlying causes would enable clinicians to initiate appropriate targeted treatments, which could not only help reduce the severity of symptoms but also improve long-term recovery. Convolutional neural networks (CNNs) are often used for these types of tasks and commonly rely on either computed tomography (CT) or magnetic resonance imaging (MRI) images as training data. However, it is increasingly understood that complex neurological symptoms often arise from diaschisis, that is, from disconnections between brain regions, rather than being attributable to a single focal neuroanatomical substrate. Lesion-network mapping (LNM) uses normative rs-MRI data to derive so-called connectivity maps indicating functional connections from brain lesions. In this project, we investigated 1) whether CNNs trained on connectivity maps (LNM-CNN) outperform those trained on lesion masks (Lesion-CNN) in predicting post-stroke motor impairment, and 2) whether layer-wise relevance propagation (LRP) can be employed to extract valuable insights into model decisions. Finally, 3) we compared the regional importance extracted from the LNM-CNN with that of a current standard procedure (permutation-based LNM).
        Both models performed similarly, though the LNM-CNN showed higher recall and F1-Score indicating a higher sensitivity in identifying affected patients as well as better balance between precision and recall. LRP showed that importance often clustered near the lesion area, even in connectivity-based models, providing valuable context for the interpretation of the prediction results. The LNM-CNN revealed broader importance across cortical, subcortical, and white matter regions, while the permutation-based analysis predominantly identified cortical regions as important.
        Overall, our study showed that in predicting acute motor impairments after a stroke, connectivity-based features slightly improved predictive performance compared to lesion masks in CNN models. Additionally, our results show that connectivity-based CNN models may offer complementary insights into the functional impact of brain lesions and highlight the need for further investigation of their clinical relevance.

        Speaker: Matthias Becher (Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institut für Biometrie und klinische Epidemiologie, Germany)
      • 16:21
        Deep Learning Strategies for Rare and Common Brain Disease Diagnostics from Medical Imaging 18m

        This presentation explores the statistical challenges and comparative performance of various deep learning models for the automated detection and classification of neurological diseases from Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) scans. Building upon initial findings that demonstrated the potential of Convolutional Neural Networks (CNNs) in recognizing rare brain pathologies, such as Fahr's disease, this Master's research aims to develop a robust system capable of identifying a spectrum of conditions, ranging from common disorders (e.g., stroke, brain cancer, Alzheimer's disease) to less prevalent ones (e.g., Moyamoya disease).
        A primary focus of this work lies in the statistical rigor of model evaluation, especially in imbalanced data settings characteristic of rare diseases. The study will employ and statistically compare the efficacy of transfer learning approaches with custom-built CNN architectures (e.g., VGG, ResNet) using metrics particularly sensitive to low-prevalence classes, such as F1-score and Precision-Recall curves. Methodological details will be provided on the strategies used for dataset construction, augmentation, and bias mitigation - crucial steps in ensuring model generalizability and reliability.
        The research culminates in a comparative statistical analysis to determine the optimal CNN strategy for this challenging, multi-class classification problem, underscoring the potential of advanced statistical modeling and machine learning in improving diagnostic speed and increasing awareness of challenging neurological conditions in preventive healthcare.

        Speaker: Anna Mamchych (Adam Mickiewicz University)
      • 16:39
        From Data to Decisions: Enhancing the Reliability of Random Forest Predictions with optRF 18m

        Random forest is a widely used machine learning method across the life sciences due to its high predictive performance, minimal assumptions, and flexibility in handling diverse data types. However, a critical yet often overlooked property of random forest is its inherent non-determinism: repeated runs on the same data set can produce different prediction models. This variability can compromise the reproducibility of prediction based decision-making processes, especially in high-stakes applications such as medical, agricultural, or environmental prediction tasks.

        Random forest builds an ensemble of decision or regression trees and aggregates their outputs. A key but underexplored parameter in this process is the number of trees. While widely used implementations employ a fixed default value of 500 trees, this parameter strongly affects model stability and computational efficiency.

        This study provides a measure that describes the prediction stability and the effect that non-determinism can have on prediction-based decision-making processes. Furthermore, we show that the relationship between the number of trees and prediction stability is non-linear: prediction stability increases rapidly with an initial increase in the number of trees and then reaches a plateau, beyond which it increases only marginally while computation time continues to increase linearly. To address this trade-off, we developed the R package optRF, which models the relationship between the prediction stability and the number of trees to determine the optimal number of trees for a given data set.

        By providing a systematic way to quantify and optimise random forest stability, optRF contributes to more reliable and transparent machine learning analyses. While originally developed with genomic selection in mind, the approach is broadly applicable across the life sciences, where reproducibility and computational efficiency are critical for shaping data-driven decisions of the future.

        Speaker: Thomas Martin Lange (Breeding Informatics, Georg August University of Göttingen)
    • 15:45 17:15
      Open and reproducible research 1 Room 13 A

      Convener: Susanne Strohmaier (Medical University of Vienna)
      • 15:45
        Addressing the researchers' degree of freedom using multiple marginal models 18m

        For a given research question and observational dataset, there are often numerous ways to specify the data analysis pipeline that leads from raw data to the result of interest. Data analysts must make a series of choices concerning data preprocessing, variable definitions, and statistical model specifications. For example, analysis pipelines may differ in their inclusion or exclusion criteria (e.g., the exclusion of a small subgroup of patients suspected to behave differently), in preprocessing steps such as data transformation (e.g., log-transformation or collapsing categories of categorical variables), or in the methods used for imputing missing values. They may also vary in the selection of adjustment variables in a multivariable regression model when estimating an effect of interest. In this work, we use the term "researchers’ degrees of freedom" to denote these analytic choices required when specifying a complete data analysis pipeline, and focus on studies that involve hypothesis testing and effect estimation for explanatory purposes.
        We demonstrate how a class of methods known as multiple marginal models (MMM) - originally developed to control for multiple testing in the context of multiple correlated endpoints on different scales - can be adapted to address multiplicity arising from the researchers’ degrees of freedom described above. Specifically, we propose that researchers may explore various analytical specifications and focus on the one yielding the smallest p-value, provided they appropriately adjust for the resulting multiplicity of tests using the MMM framework. This approach allows analytical flexibility and adaptation to the data at hand (as opposed to strict statistical analysis protocols specified in advance), while preserving the nominal Type I error rate (as opposed to the practice known as "p-hacking") and providing a single interpretable result (as opposed to the reporting of the results of a multitude of analysis pipelines). It is illustrated through real data examples for various types of degrees of freedom, including but not limited to tuning parameters of statistical methods, the handling of outlying values, missing values and transformations of variables, and the specification of adjustment covariates in multivariable regression.

        Speaker: Anne-Laure Boulesteix (Ludwig-Maximilian University Munich and Munich Center of Machine Learning)
      • 16:03
        OMOP ETL Pipeline Implementation for Tuberculosis Data Standardisation at Douala General Hospital, Cameroon 18m

        Douala General Hospital, a first-class healthcare facility in Cameroon, serves thousands of patients yearly through its multidisciplinary medical teams. The hospital hosts numerous patient records that hold significant potential for public health research. However, most records remain paper-based, limiting their accessibility and reuse. In departments such as pulmonology, patient data are often stored in heterogeneous data sheets lacking uniform structure or standardisation, which constrains their use for clinical research, care management, and evidence-based decision-making. Moreover, the absence of standardisation hinders data integration within broader health systems, restricting secure sharing and interoperability.
        To address these challenges, we implemented a complete Extract, Transform, and Load (ETL) pipeline aligned with the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) version 5.4, an internationally recognised framework for health data standardisation. The objective was to transform and integrate patient data from the tuberculosis department into a database compliant with FAIR (Findable, Accessible, Interoperable, Reusable) principles, thereby enhancing data quality, interoperability, and reusability for research and clinical monitoring.
        The dataset included over 80 clinical and administrative variables such as sociodemographic data, medical history, symptoms, laboratory results, and diagnoses. These data were extracted from varied paper-based sources, presenting differences in completeness and structure. For standardisation, several Observational Health Data Sciences and Informatics (OHDSI) tools were employed: WhiteRabbit for data profiling, Usagi for vocabulary mapping, and Rabbit-in-a-Hat for defining table mappings to the OMOP CDM structure. The populated tables included person, care_site, measurement, visit_occurrence, condition_occurrence, observation_period and observation. Concept mappings were derived from SNOMED CT, LOINC, and RxNorm, with contextual adaptations to local data.
        The ETL pipeline was developed using SQL scripts generated from Rabbit-in-a-Hat and executed in pgAdmin for PostgreSQL. The OMOP tables were created using scripts from the OHDSI GitHub repository, and the transformed data were loaded accordingly.
        Data quality was evaluated using the Achilles tool, which automatically assessed completeness, conformance, and plausibility, achieving an overall score of 99%, demonstrating the reliability of the pipeline. This work represents a pioneering effort in applying OMOP CDM within the African context, promoting collaboration, interoperability, and data-driven decision-making to strengthen tuberculosis care and research in Cameroon.
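
        The transform-and-load step described above can be sketched for a single table. This is not the pipeline from the talk, which used SQL generated by Rabbit-in-a-Hat and PostgreSQL: the sketch swaps in Python with an in-memory SQLite database, creates only a reduced subset of the OMOP person columns, and assumes hypothetical source fields (patient_code, sex, birth_year). The gender concept IDs 8507 and 8532 are the standard OMOP concepts for male and female.

```python
import sqlite3

# Standard OMOP gender concepts; 0 is used when no concept matches.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def load_person(conn, source_rows):
    """Map hypothetical source records onto a reduced OMOP person table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS person (
               person_id INTEGER PRIMARY KEY,
               gender_concept_id INTEGER NOT NULL,
               year_of_birth INTEGER NOT NULL,
               race_concept_id INTEGER NOT NULL,
               ethnicity_concept_id INTEGER NOT NULL,
               person_source_value TEXT,
               gender_source_value TEXT
           )"""
    )
    for row in source_rows:
        conn.execute(
            """INSERT INTO person (gender_concept_id, year_of_birth,
                                   race_concept_id, ethnicity_concept_id,
                                   person_source_value, gender_source_value)
               VALUES (?, ?, 0, 0, ?, ?)""",
            (
                GENDER_CONCEPTS.get(row["sex"], 0),  # unmapped values become 0
                row["birth_year"],
                row["patient_code"],                 # keep the original identifier
                row["sex"],                          # keep the original coding
            ),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_person(conn, [{"patient_code": "TB-001", "sex": "F", "birth_year": 1987}])
    print(conn.execute("SELECT * FROM person").fetchall())
```

        Keeping the source values next to the mapped concept IDs makes it possible to audit the vocabulary mapping later, which matters when records originate from heterogeneous paper forms.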

        Speaker: Brenda Yankam Mbouamba (Ruhr University Bochum)
      • 16:21
        Living Synthetic Benchmarks: A Neutral and Cumulative Framework for Simulation Studies 18m

        Simulation studies are widely used to evaluate statistical methods. However, new methods are often introduced and evaluated using data-generating mechanisms (DGMs) devised by the same authors. This coupling creates misaligned incentives, e.g., the need to demonstrate the superiority of new methods, potentially compromising the neutrality of simulation studies. Furthermore, results of simulation studies are often difficult to compare due to differences in DGMs, competing methods, and performance measures. This fragmentation can lead to conflicting conclusions, hinder methodological progress, and delay the adoption of effective methods. To address these challenges, we introduce the concept of living synthetic benchmarks. The key idea is to disentangle method and simulation study development and continuously update the benchmark whenever a new DGM, method, or performance measure becomes available. This separation benefits the neutrality of method evaluation, emphasizes the development of both methods and DGMs, and enables systematic comparisons. In this paper, we outline a blueprint for building and maintaining such benchmarks, discuss the technical and organizational challenges of implementation, and demonstrate feasibility with a prototype benchmark for publication bias adjustment methods. We conclude that living synthetic benchmarks have the potential to foster neutral, reproducible, and cumulative evaluation of methods, benefiting both method developers and users.

        Speaker: František Bartoš (University of Amsterdam)
      • 16:39
        How many subgroup analyses are false (-positives or negatives)? – Evidence from p-value distributions of interaction tests and mixture models in diabetes research 18m

        Subgroup analyses are frequently reported results from randomized trials. They help to identify heterogeneity in the average treatment effect, which occurs when this average effect varies across different categories of a subgroup factor, like age, sex, or disease severity. If treatment effects are different across subgroups, this information can help to personalize treatment decisions. However, the limitations of subgroup analyses are also well known. Since each subgroup analysis involves a separate statistical test, performing many of them increases the likelihood of finding false-positive results, that is, statistically significant results for actually true null hypotheses (type I error). On the other hand, subgroups have smaller sample sizes than the overall trial population, which means that they often lack the statistical power to detect a true subgroup effect, leading to a higher risk of false-negative results (type II error). Despite the high risk of false findings in subgroup analyses, there is surprisingly little empirical research quantifying the actual proportion of false subgroup analyses.

        We reiterate the basic idea of these analyses, which was developed more than 20 years ago for microarray analyses but has apparently never been applied to subgroup analyses in clinical research. The basic building blocks for our analyses are the p-values from the subgroups' interaction tests. It is assumed that these p-values originate from two component distributions: the first describes the true null hypotheses (yielding a uniform distribution of p-values), and the second describes the false null hypotheses—that is, the true subgroup effects. These two distributions are combined in a 2-component mixture model. We fit the model to obtain estimates of the proportion of true null hypotheses and the parameters of the second distribution of p-values from the false null hypotheses. Simultaneously, the proportions of false-positive and false-negative results are estimated as predictive values, treating the subgroup interaction test as a standard diagnostic test.

        In our motivating example, we collected 292 p-values of interaction tests from 17 large randomized trials, utilizing data from 141,695 study participants. We further introduce some new distributions with varying numbers of parameters to extend the initially proposed restricted beta-uniform mixture (BUM) model. Depending on the mixture model, the proportion of false-positive results lies between 53% and 60% and the proportion of false-negative results between 13% and 25%, signaling that exaggerating subgroup effects is a more serious problem than missing them in diabetes research.
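
        The step from a fitted two-component mixture to false-positive and false-negative proportions can be sketched as a predictive-value calculation. The sketch assumes the simplest beta-uniform form, in which p-values from false null hypotheses follow a Beta(a, 1) distribution with 0 < a < 1; the parameter values in the example are invented and are not the estimates from the study.

```python
def bum_density(p, pi0, a):
    """Mixture density: uniform for true nulls, Beta(a, 1) for false nulls."""
    return pi0 + (1 - pi0) * a * p ** (a - 1)


def error_proportions(pi0, a, alpha=0.05):
    """Return (false-positive, false-negative) proportions at level alpha.

    False positives are taken among significant tests (1 - PPV), false
    negatives among non-significant tests (1 - NPV), as for a diagnostic test.
    """
    power = alpha**a                      # P(p <= alpha) under Beta(a, 1)
    sig_null = pi0 * alpha                # true nulls declared significant
    sig_alt = (1 - pi0) * power           # true effects declared significant
    nonsig_null = pi0 * (1 - alpha)       # true nulls not significant
    nonsig_alt = (1 - pi0) * (1 - power)  # true effects missed
    false_positive = sig_null / (sig_null + sig_alt)
    false_negative = nonsig_alt / (nonsig_null + nonsig_alt)
    return false_positive, false_negative


if __name__ == "__main__":
    # Invented inputs: 90% true nulls, shape parameter 0.3 for true effects.
    fp, fn = error_proportions(pi0=0.9, a=0.3)
    print(round(fp, 3), round(fn, 3))
```

        When true subgroup effects are rare, even a nominal 5% test yields a large share of false positives among the significant results, which is the mechanism behind findings of this kind.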

        Speaker: Oliver Kuss (German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University Düsseldorf, Institute for Biometrics and Epidemiology)
      • 16:57
        Improving clinical and methodological research in the health sciences – on the crucial role of reporting guidelines and structured reporting 18m

        Background:
        For many years, health research has faced substantial criticism regarding its quality. Appropriate reporting guidelines are available with the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) network acting as an umbrella organization to address reporting issues in health sciences. Nevertheless, many reviews have shown that reporting quality remains poor, which biases the impression conveyed by published literature, and reduces the validity of systematic reviews and meta-analyses.
        The REporting recommendations for tumor MARKer prognostic studies (REMARK) discuss in detail different steps of an analysis and stress the importance of reporting all analyses. The two-part REMARK profile, a structured summary that highlights key aspects of a prognostic marker study, with an emphasis on all analyses performed, was proposed (Altman et al 2012). Related profiles were proposed for other types of studies in clinical and methodological research.
        Methods:
        Using a simple clinical example, we introduce the REMARK profile. Two examples from a review of prognostic factor studies are used to illustrate severe weaknesses in analyses reported in published papers (Sauerbrei et al 2022a). Core principles of the REMARK profile can be used to derive similar profiles for methodological studies. This is illustrated in two examples: a study on the multivariable fractional polynomial (MFP) approach (Sauerbrei et al 2023), and an individual patient data (IPD) meta-analysis investigating treatment interaction with a continuous variable (Sauerbrei et al 2022b). We illustrate and discuss the importance of following reporting guidelines and summarizing key aspects of a study in a structured way.
        Results:
        Structured reporting can be used for clinical and methodological research in the health sciences. It can improve the reporting of research and reveal severe weaknesses in some analyses.
        Conclusions:
        In health and methodological research, good reporting is generally feasible and straightforward to implement. A carefully designed structured profile can improve reporting, help reviewers and readers better understand analysis strategies and their weaknesses, and facilitate the interpretation of results. Structured reporting is a simple and effective way to improve reporting and should be used broadly.

        Altman et al 2012, doi: 10.1186/1741-7015-10-51
        Sauerbrei et al 2022a, doi: 10.1186/s12916-022-02304-5
        Sauerbrei et al 2022b, doi: 10.1186/s12874-022-01516-w
        Sauerbrei et al 2023, doi: 10.1186/s41512-023-00145-1

        Speaker: Willi Sauerbrei (Institute of Medical Biometry and Statistics, Medical Center - University of Freiburg)
    • 17:30 18:00
      Closing Ceremony Room 1 A
