Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

Automated Transcription of Mixed-Script Dialectal Materials

Nov 19, 2025, 3:00 PM
30m
Sonce hall

Speaker

Markus Kunzmann

Description

The Dictionary of Bavarian Dialects in Austria ("Wörterbuch der bairischen Mundarten in Österreich", WBÖ) project maintains an archive of approximately 3.6 million handwritten paper slips documenting dialectal evidence. While 2.4 million entries have been manually digitized and converted to TEI format, the remaining 1.2 million paper slips from sections A-C require automated processing. This paper presents a three-stage workflow concept that combines Handwritten Text Recognition (HTR) technology with the existing digitized holdings to overcome the challenges posed by heterogeneous writing systems, multiple scribes, and poor material condition. Initial tests with existing HTR models yielded unsatisfactory results. The proposed solution uses the existing Database of Bavarian Dialects ("Datenbank der bairischen Mundarten in Österreich", DBÖ) to automatically correct HTR transcription errors through similarity-based alignment and N-gram matching algorithms. The corrected transcriptions then serve as approximate ground truth for training a specialized HTR model tailored to historical dialect materials. This methodology enables the creation of substantial training datasets without manual transcription, potentially generating 33.6 million words for model training. The approach promises complete digital access to the WBÖ archive and provides a transferable template for similar lexicographic projects with historical slip collections.
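To illustrate what similarity-based correction with N-gram matching can look like, here is a minimal Python sketch. It is not the project's implementation: the function names, the scoring formula (an average of a character-bigram Dice coefficient and the ratio from Python's `difflib.SequenceMatcher`), the threshold of 0.6, and the three-word lexicon are all assumptions chosen for illustration and do not come from the DBÖ.

```python
from difflib import SequenceMatcher


def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text`, padded with '#' at both ends."""
    padded = "#" * (n - 1) + text.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def correct_token(htr_token, lexicon, threshold=0.6):
    """Replace an HTR token with its best lexicon match.

    The token is kept unchanged if no candidate reaches `threshold`.
    The score is a hypothetical blend of n-gram overlap and sequence similarity.
    """
    best, best_score = htr_token, 0.0
    for candidate in lexicon:
        seq_ratio = SequenceMatcher(None, htr_token.lower(), candidate.lower()).ratio()
        score = (ngram_similarity(htr_token, candidate) + seq_ratio) / 2
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else htr_token


# Illustrative placeholder lexicon (not taken from the DBÖ).
lexicon = ["Apfel", "Acker", "Bach"]
print(correct_token("Apfol", lexicon))  # a plausible HTR misreading of "Apfel"
print(correct_token("xyz", lexicon))    # no close match, so the token is kept
```

In a setting like the one described, the lexicon would presumably be restricted to entries that are plausible for the slip in question, and the threshold would need tuning against manually verified samples so that false corrections do not contaminate the training data.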
