Introduction
In 1911, the Berlin missionary Karl Heinrich Julius Endemann published his dictionary of the Sotho language, Wörterbuch der Sotho Sprache. The dictionary has suffered scholarly neglect because of its rare combination of source and target languages (Sotho and German, respectively) and its missionary focus. An obsolete orthography, the high level of skill demanded of the user, and a lack of alignment with modern lexicographic principles have contributed to its marginalization. This paper re-evaluates the dictionary within the context of bilingual Sepedi dictionaries, emphasizing historical and cultural aspects rather than a contemporary comparison. It explores the macro- and microstructures of the dictionary, assesses its accessibility to modern users, and proposes digitization strategies for improved usability, envisioning a multiphase approach with varied electronic features.
Macro- and microstructures of Endemann (1911)
Kosch (2011) criticizes Endemann’s Sotho language dictionary for disregarding some sound lexicographic principles in its compilation. In our reassessment within the context of other bilingual Sepedi dictionaries, we focus on five key issues: the treatment of grammatical formatives, alphabetical categories, high-frequency lemmas, semantically related paradigms, and lemmas with cultural significance. Comparing Endemann’s dictionary with selected Sepedi reference works that span almost five decades (1967 to 2015) establishes a balanced perspective on its lexicographic value and utility.
• Treatment of grammatical formatives
Grammatical formatives are notoriously undertreated in bilingual dictionaries in which the source language is a Bantu language, since these formatives are typically not carriers of lexical meaning. Our investigation shows that Endemann’s treatment of these formatives matches and, in some instances, even surpasses their treatment in modern Sepedi dictionaries.
• Alphabetical categories
Endemann’s data collection, carried out from 1861 to 1873, was not intended expressly for lexicography. Still, it will be demonstrated that he avoided common pitfalls that challenge even contemporary lexicographers, such as the overtreatment of alphabetical categories.
• High-frequency lemmas
The inclusion of lemmas based on their frequency of use is a feature of modern corpus-based lexicography. Even so, experiments show that, generally speaking, Endemann’s dictionary compares very well with existing Sepedi dictionaries with regard to the lemmatization of high-frequency items.
• Semantically related paradigms
Lexical sets, as defined by Atkins and Rundell (2008), are groups of words sharing a common element of meaning, often rooted in sense relations like synonymy or hyponymy. Using the days of the week as a prototypical example, the study investigates whether Endemann was aware of the need to complete semantically related paradigms. Surprisingly, all selected Sepedi dictionaries except Endemann (1911) lemmatize the weekdays, which raises questions about conceptual differences in the Sepedi-speaking community during his data collection from 1861 to 1873. The absence of the word ‘week’ and the likely lack of standardized weekday names before 1930 add historical context to the challenges Endemann faced in compiling the dictionary.
• Lemmas with cultural significance
Endemann’s lexicographic approach is marked by rich sense distinctions and detailed definitions, particularly for culturally bound lemmas. The detailed treatment of such culturally significant lemmas makes Endemann’s dictionary a valuable and distinctive resource that needs to be preserved and made accessible to a new generation of users.
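The comparison of high-frequency lemmas described above can be illustrated with a small sketch. It is not the procedure used in the study: the function name `top_n_coverage`, the tokenized corpus, and the plain-text lemma list are all hypothetical, and the sketch assumes that corpus tokens and dictionary lemmas are already in the same orthography, which would require a conversion step for Endemann (1911).

```python
from collections import Counter


def top_n_coverage(corpus_tokens, lemma_list, n=100):
    """Return the share of the n most frequent corpus items that appear
    in a dictionary's lemma list, plus the items that are missing.

    Assumption: tokens and lemmas share one orthography and can be
    compared after simple lower-casing.
    """
    frequencies = Counter(token.lower() for token in corpus_tokens)
    top_items = [item for item, _ in frequencies.most_common(n)]
    lemmas = {lemma.lower() for lemma in lemma_list}
    missing = [item for item in top_items if item not in lemmas]
    if not top_items:
        return 0.0, missing
    return (len(top_items) - len(missing)) / len(top_items), missing


if __name__ == "__main__":
    # Placeholder data, not real corpus counts or dictionary entries.
    tokens = ["a", "a", "a", "b", "b", "c"]
    lemmas = ["a", "c"]
    coverage, missing = top_n_coverage(tokens, lemmas, n=2)
    print(coverage, missing)
```

Running the same function over the lemma lists of several dictionaries with one shared frequency list would give directly comparable coverage figures.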
Accessibility to modern users
Although Endemann’s dictionary compares favourably with existing Sepedi dictionaries, its accessibility to users is hindered by various factors. The dictionary has been out of print in hard copy for over a century, while the e-version is prohibitively expensive. The publisher, De Gruyter Mouton (Verlag), has however granted us permission to digitize a good part of the dictionary and publish it for free online use. The main challenge lies in Endemann’s orthography and the unconventional ordering of alphabetical categories. Both are explained only in German, in the introduction, and effective use of the dictionary demands a deep understanding of phonetics.
Digitization strategies for enhanced usability
In the last section of our paper, we investigate digitization of the dictionary at various levels of complexity and sophistication, and indicate which resources are necessary for each level. Digitization will include the creation of a manual gold standard and, in parallel, the deployment of OCR4all (Reul et al., 2019). Initial results of the OCR process were surprisingly good, especially considering the numerous diacritical signs found in the dictionary: an accuracy of 99.93% was obtained, as calculated by OCR4all. This figure is based on a comparison of 10 manually transliterated pages with the predictions of the OCR model for the same pages. The accuracy improved iteratively with each round of training. The overall aim is to determine to what extent the challenges outlined in the preceding sections can be addressed by means of selective digitization strategies in order to make the dictionary accessible to modern-day users.
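To make the accuracy figure concrete, the sketch below shows one common way of scoring OCR output against manually transliterated pages: character accuracy as one minus the character error rate, based on edit distance. That OCR4all computes its figure in exactly this way is an assumption. The function names and the sample strings are hypothetical, and the Unicode normalization step is an added precaution for text that is rich in diacritics.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(pages) -> float:
    """pages: iterable of (gold_standard_text, ocr_prediction) pairs.

    Both texts are normalized to NFC so that a precomposed letter and
    the same letter written with a combining diacritic count as equal.
    """
    errors = 0
    total_chars = 0
    for gold, predicted in pages:
        gold = unicodedata.normalize("NFC", gold)
        predicted = unicodedata.normalize("NFC", predicted)
        errors += levenshtein(gold, predicted)
        total_chars += len(gold)
    if total_chars == 0:
        raise ValueError("gold standard contains no characters")
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    # Made-up sample strings; the OCR output drops one diacritic.
    sample = [("mošomo", "mosomo"), ("bona", "bona")]
    print(f"{character_accuracy(sample):.2%}")  # prints 90.00%
```

Under this measure a single lost diacritic counts as one full character error, so a score near 100% on diacritic-heavy pages means that almost all diacritics were recognized correctly.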