Description
Reference intervals and standard deviation scores (‘z scores’) are widely used as diagnostic tools in various biomedical fields. They are applied to laboratory parameters in clinical chemistry, psychometric tests in neurology, or parameters of children’s growth in pediatrics. Usually, samples from a ‘normal’ or ‘healthy’ population form the data basis for the estimation of reference distributions.
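As a minimal illustration of the two quantities named above, the sketch below computes a z score and a central 95% reference interval under a plain normal working model. The sample values and the helper names `z_score` and `reference_interval` are hypothetical and chosen only for illustration; real applications typically need covariate-dependent and non-normal models.

```python
from statistics import NormalDist, mean, stdev

def z_score(x, mu, sigma):
    """Standard deviation score of x relative to a normal reference distribution."""
    return (x - mu) / sigma

def reference_interval(mu, sigma, coverage=0.95):
    """Central reference interval covering the given share of a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mu - z * sigma, mu + z * sigma

# Hypothetical measurements from a 'healthy' sample (illustrative values only).
sample = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.3]
mu, sigma = mean(sample), stdev(sample)
low, high = reference_interval(mu, sigma)
print(f"reference interval: [{low:.2f}, {high:.2f}]")
print(f"z score of 6.8: {z_score(6.8, mu, sigma):.2f}")
```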
(Semi-)parametric estimation of reference distributions is complicated by extreme values that may be outliers relative to the working model, even without dependence on covariables like age. Such values can arise in three ways. First, if the sample size is moderate, genuine outliers may by chance be over-represented in the sample used for estimating the reference distribution and impair the selection of a suitable model. Second, if the target population is to include unhealthy individuals with their representative share, the reference distribution to be estimated is a mixture of a major ‘healthy’ part plus a ‘contamination’. Finally, the sample may be contaminated by observations that are not members of the target population but remain undetected.
In practice, the origin of the extreme values is often unknown, yet the extreme tails of the distribution are of interest for diagnostic purposes. In this contribution, Generalized Additive Models for Location, Scale and Shape (GAMLSS) are used. The simple approaches of including, deleting or winsorizing outliers are compared with estimation using a robustified likelihood (Aeberhard et al., Statist. Comput., 2021) and with two correction methods, one of them previously unpublished, the other proposed by the authors of the WHO Child Growth Standards. Both correction methods remove outliers before estimating the reference distribution and then correct the estimated standard deviation scores for the removal of outliers by appropriate rescaling.
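The simple approaches and the general idea of rescaling after outlier removal can be sketched on a toy example. The code below is not the authors' correction method and not the WHO procedure. It assumes a normal working model without covariates, flags outliers with a median/MAD rule, and undoes the shrinkage of the standard deviation using the known variance of a normal distribution truncated at ±c. The names `truncated_sd_factor`, `robust_center_scale` and `fit_variants`, the cutoff c = 3 and the simulated data are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, median, stdev

def truncated_sd_factor(c):
    """SD of a standard normal distribution truncated to [-c, c]."""
    nd = NormalDist()
    mass = 2 * nd.cdf(c) - 1
    return (1 - 2 * c * nd.pdf(c) / mass) ** 0.5

def robust_center_scale(xs):
    """Median and MAD-based scale (consistent for the SD under normality)."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    return med, 1.4826 * mad

def fit_variants(xs, c=3.0):
    """Mean and SD estimates under four ways of handling extreme values."""
    med, s = robust_center_scale(xs)
    lo, hi = med - c * s, med + c * s
    kept = [x for x in xs if lo <= x <= hi]        # delete outliers
    wins = [min(max(x, lo), hi) for x in xs]       # winsorize outliers
    return {
        "include": (mean(xs), stdev(xs)),
        "delete": (mean(kept), stdev(kept)),
        "winsorize": (mean(wins), stdev(wins)),
        # Assumed correction: rescale the SD to undo the truncation at +/- c.
        "delete+rescale": (mean(kept), stdev(kept) / truncated_sd_factor(c)),
    }

# Simulated data: a standard normal 'healthy' part plus about 2% contamination.
rng = random.Random(1)
clean = [rng.gauss(0, 1) for _ in range(2000)]
contaminated = clean + [rng.gauss(6, 1) for _ in range(40)]
fits = fit_variants(contaminated)
for name, (m, s) in fits.items():
    print(f"{name:15s} mean={m:6.3f} sd={s:6.3f}")
```

In this toy setting, z scores would be computed from the rescaled standard deviation, so that deleting the tails does not systematically inflate the scores of the remaining observations.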
A simulation study with contaminated and heavy-tailed distributions shows that the robust method reduces bias in contaminated scenarios and in scenarios with genuine outliers, but is inferior to the other methods under model misspecification. The new correction method represents a good compromise if misspecification cannot be excluded. A second set of simulations evaluates strategies to arrive at an adequate model for the reference distribution, either by starting from a simple model and increasing model complexity depending on residual diagnostics, or by selecting a model based on information criteria.
Data on body mass index and body proportions from a large Austrian pediatric study are used for demonstration.