Description
Collocations are a well-covered research area in lexicography. With the advent of evidence-based lexicography and the availability of large text corpora, computational methods for extracting typical co-occurrences from such corpora and for supporting lexicographers in identifying collocations among them became a research focus. In particular, the statistical properties of collocations (i.e. the application of various association measures) have been evaluated for different languages, collocation types, gold standards and corpora (e.g. Evert et al. 2017; Garcia, García Salido, and Alonso-Ramos 2019). In hindsight, however, and despite the undisputed heuristic value of statistical methods for the task at hand, the overall results of such studies do not support clear conclusions, especially with respect to the practical implications for lexicographic work. Taken together, they highlight how strongly the results depend on the available datasets, the collocation types investigated, and the underlying corpora in terms of their composition and the affordable preprocessing (Uhrig, Evert, and Proisl 2018). Some results even indicate that for high-quality, dependency-annotated corpora (in contrast to the large but sparsely annotated web corpora used in previous studies), raw frequency data can be as indicative for extracting collocations as association measures. Consequently, and given recent advances in deep learning, the focus has shifted from the evaluation of association measures to the adaptation of increasingly capable statistical language models for the identification and classification of collocations (Espinosa-Anke, Codina-Filbà, and Wanner 2021; Falk et al. 2021; Ljubešić, Logar, and Kosem 2021).
In this study, we examine a more fundamental question that is addressed only in passing by the aforementioned work. This question becomes more important as the focus shifts from the precision of association measures to the recall required when constructing representative datasets for training classifiers: Which types of corpora are actually suitable for extracting collocation candidates and exemplifying their usage? To this end, we compare several corpora from the extensive corpus collection of the ‘Digitales Wörterbuch der deutschen Sprache’ (DWDS), which comprises more than 70 billion tokens of German texts, including reference corpora, web corpora and high-quality print newspapers. In order to study how well these corpora cover collocations, we assembled a gold standard from three lexical resources of collocations of contemporary German: the collocations described in DWDS entries, a dictionary of German collocations (Quasthoff 2011), and a dataset from a recent dissertation (Strakatova 2024), yielding a total of approximately 350,000 collocations of different syntactic types. We verify the presence of these collocations in various corpora of the DWDS corpus collection. Comparing the coverage of our gold standard datasets by those corpora, we conduct a case study to answer questions such as: a) How well do carefully selected but small reference corpora cover common collocations? b) Are giga-token web corpora sufficient to cover a broad set of collocations as documented in comprehensive reference dictionaries? c) Do high-quality newspapers surpass web corpora, or can they be replaced by well-curated web corpora?
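The coverage comparison described above can be illustrated with a minimal sketch. It is not the study's actual pipeline: the representation of a collocation as a (base lemma, relation, collocate lemma) triple, the relation labels, the corpus names, the function names and the toy data are all assumptions made for illustration. Coverage is taken here to mean the share of gold-standard collocations attested at least once in a corpus.

```python
from typing import Dict, Set, Tuple

# Assumption: a collocation is modelled as
# (base lemma, syntactic relation, collocate lemma).
Collocation = Tuple[str, str, str]


def coverage(gold: Set[Collocation], attested: Set[Collocation]) -> float:
    """Share of gold-standard collocations attested at least once in a corpus."""
    if not gold:
        return 0.0
    return len(gold & attested) / len(gold)


def coverage_by_corpus(
    gold: Set[Collocation], corpora: Dict[str, Set[Collocation]]
) -> Dict[str, float]:
    """Compute the coverage of the same gold standard for several corpora."""
    return {name: coverage(gold, attested) for name, attested in corpora.items()}


# Toy gold standard (hypothetical entries, not taken from the study's data).
gold = {
    ("Entscheidung", "OBJ", "treffen"),
    ("Rolle", "OBJ", "spielen"),
    ("Kritik", "ATTR", "scharf"),
    ("Frage", "OBJ", "stellen"),
}

# Toy sets of co-occurrences attested per corpus (hypothetical).
corpora = {
    "reference": {
        ("Entscheidung", "OBJ", "treffen"),
        ("Rolle", "OBJ", "spielen"),
    },
    "web": {
        ("Entscheidung", "OBJ", "treffen"),
        ("Rolle", "OBJ", "spielen"),
        ("Frage", "OBJ", "stellen"),
        ("Haus", "ATTR", "gross"),  # attested, but not part of the gold standard
    },
    "newspaper": set(gold),
}

results = coverage_by_corpus(gold, corpora)
for name, share in results.items():
    print(f"{name}: {share:.2f}")
```

Because coverage is computed relative to the gold standard only, co-occurrences that a corpus attests but the gold standard does not list leave the score unchanged; the measure reflects recall, not precision.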