18–21 May 2026
Europe/Warsaw timezone

A Comprehensive Comparison of Methods for Quantifying Similarity of Datasets

20 May 2026, 13:45
18m
Room 14


Speaker

Marieke Stolte (TU Dortmund University, Department of Statistics)

Description

Quantifying the similarity between two or more datasets is an important task in statistics and machine learning. In meta-learning, it enables the transfer of knowledge across tasks and datasets. In simulation studies, it is crucial how closely the distributions assumed in the simulation match those of the datasets on which the performance of methods is assessed. Similarly, for synthetic data, the similarity of the generated data to a real-world dataset is typically evaluated to assess the quality of the data generation. In various applications, statistical two- or k-sample tests can be used to check whether the underlying distributions of two or more datasets coincide.

Many approaches for quantifying dataset similarity have been proposed in the literature. The choice of a suitable method is, however, difficult due to the abundance of proposed methods and the lack of neutral comparison studies. In previous work, we systematically reviewed 118 methods applicable to multivariate data that make no parametric assumptions and consider the full underlying data distribution. We provided a taxonomy of the methods based on the underlying ideas and compared them theoretically regarding their applicability, interpretability, and theoretical properties.

Here, we compare the most promising methods identified in the theoretical comparison regarding their performance in practice. We conduct a comprehensive simulation study to assess the practical performance of the methods across diverse scenarios, including two- and multi-sample settings for both categorical and numerical datasets. We evaluate how well the methods detect certain differences, e.g., differences in location, scale, or higher moments, between datasets. Moreover, we analyze computational aspects such as runtime, memory consumption, and numerical stability. Based on the results, we give recommendations for selecting appropriate methods. We propose method combinations that are able to detect a wide spectrum of differences between datasets.
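
As an illustration of what detecting a location or scale difference between two multivariate datasets can look like, the following minimal Python sketch runs a permutation two-sample test based on the energy distance. The energy distance is used here only as one well-known nonparametric example; the abstract does not say whether it is among the methods compared. The function names, sample sizes, and shift and scale values are assumptions made for this illustration.

```python
import numpy as np


def energy_distance(x, y):
    """Sample energy distance between two datasets (rows are observations)."""
    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs (one row from a, one from b).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)


def permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: both datasets come from the same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = len(x)
    observed = energy_distance(x, y)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign observations to the two groups.
        perm = rng.permutation(len(pooled))
        stat = energy_distance(pooled[perm[:n]], pooled[perm[n:]])
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical example data: a reference sample, a location-shifted sample,
# and a sample with inflated scale (all three-dimensional Gaussian).
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
y_shift = rng.normal(loc=1.0, size=(60, 3))
y_scale = rng.normal(scale=2.0, size=(60, 3))

stat_shift, p_shift = permutation_test(x, y_shift)
stat_scale, p_scale = permutation_test(x, y_scale)
print("location shift:", stat_shift, p_shift)
print("scale change:  ", stat_scale, p_scale)
```

Repeating such a test over many simulated dataset pairs and recording the rejection rate is one simple way to estimate how sensitive a method is to a given type of difference.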


Author

Marieke Stolte (TU Dortmund University, Department of Statistics)

Co-authors

Andrea Bommert (TU Dortmund University, Department of Statistics)
Jörg Rahnenführer (TU Dortmund University, Department of Statistics)
