Speaker
Description
Large, high-dimensional biomedical datasets (large n and p), such as genotype data containing hundreds of thousands of genetic variants (SNPs) measured across many individuals, require scalable algorithms for efficient model training. In this work, we address this challenge by leveraging principles of optimal design and informative subsampling.
We investigate the applicability of existing optimal-design-based informative sampling techniques for regression, including the D-optimality-based IBOSS (information-based optimal subdata selection) and the A-optimality-based OSMAC (optimal subsampling motivated by the A-optimality criterion) methods, as a means to reduce training time while maintaining prediction and estimation accuracy. Although these methods perform well in classical low-dimensional settings, we find that their effectiveness tends to diminish in complex real-world scenarios characterized by heterogeneous signal-to-noise ratios and correlation structures. Furthermore, these methods are not directly applicable to high-dimensional settings with more variables than observations (p>n).
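To make the D-optimality-motivated idea concrete, here is a minimal sketch of IBOSS-style selection for linear regression, assuming the commonly described rule: for each covariate in turn, keep the r = k/(2p) not-yet-selected rows with the smallest values and the r with the largest. The function name `iboss_indices` and its arguments are illustrative and not taken from the talk, and the sketch requires p to be small relative to the subsample size k, which is exactly why it does not carry over directly to p>n.

```python
import numpy as np

def iboss_indices(X, k):
    """Return row indices of an IBOSS-style subsample of size about k.

    Assumption: for each column j, take the r smallest and r largest
    values of X[:, j] among rows not selected yet, with r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)
    if r < 1 or k > n:
        raise ValueError("need 2 * p <= k <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)          # rows still eligible
        order = idx[np.argsort(X[idx, j])]       # eligible rows sorted by column j
        picked = np.concatenate([order[:r], order[-r:]])
        available[picked] = False                # never pick a row twice
        chosen.append(picked)
    return np.concatenate(chosen)
```

A model would then be fitted on the selected rows only, for example with `np.linalg.lstsq(X[rows], y[rows], rcond=None)`; the extreme covariate values are what make the subsample informative for the slope estimates.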
To address these limitations, we propose new strategies that integrate optimal-design-based subsampling with variable selection methods for high-dimensional data. In this framework, the subsample is selected based on a reduced (screened) set of variables, while subsequent model training can still leverage the full set of covariates. Finally, because subsampling reduces computation time but often cannot fully match full-data accuracy, we extend these subsampling-based approaches to ensemble methods based on multiple subsamples, demonstrating competitive performance while maintaining low model training times. Our findings also underscore the need for new subsampling strategies that account for linkage disequilibrium (LD) patterns across diverse populations, which is an important direction for future work.
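The pipeline described above (screen, subsample on the screened variables, train on all covariates, then combine several subsamples) could look roughly like the following sketch. Every concrete choice here is an assumption made for illustration and is not stated in the abstract: marginal-correlation screening, an IBOSS-style extreme-value rule on the screened columns, ridge regression as the model, and ensemble members built from disjoint random blocks of the data. All function names are hypothetical.

```python
import numpy as np

def screen_by_correlation(X, y, d):
    """Indices of the d columns with the largest |marginal correlation| with y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return np.argsort(corr)[-d:]

def extreme_rows(Z, k):
    """IBOSS-style pick on screened data Z: per column, the r smallest and
    r largest rows not used yet, with r = k // (2 * number of columns)."""
    n, d = Z.shape
    r = k // (2 * d)
    available = np.ones(n, dtype=bool)
    for j in range(d):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(Z[idx, j])]
        available[np.concatenate([order[:r], order[-r:]])] = False
    return np.flatnonzero(~available)

def ridge_fit(X, y, lam):
    """Ridge with intercept, solved in dual form (an n-by-n system), so it
    also works when there are more columns than rows."""
    mu, ybar = X.mean(axis=0), y.mean()
    Xc = X - mu
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y)), y - ybar)
    beta = Xc.T @ alpha
    return beta, ybar - mu @ beta

def ensemble_predict(X, y, X_new, d=5, k=100, n_members=4, lam=1.0, seed=0):
    """Average the predictions of ridge models trained on several subsamples."""
    rng = np.random.default_rng(seed)
    cols = screen_by_correlation(X, y, d)            # screening on the full data
    blocks = np.array_split(rng.permutation(len(y)), n_members)
    preds = []
    for block in blocks:                             # one subsample per block (assumed)
        rows = block[extreme_rows(X[block][:, cols], k)]
        beta, b0 = ridge_fit(X[rows], y[rows], lam)  # training uses ALL covariates
        preds.append(X_new @ beta + b0)
    return np.mean(preds, axis=0)
```

Only the row selection looks at the screened columns; the fitted model still sees every covariate, which mirrors the separation described in the abstract. The sketch ignores LD structure entirely, so correlated SNPs would be screened and sampled as if they were unrelated.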