Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

Vision-Enabled Language Models in Lexicographical Digitisation: A Case Study of Anton Thor Helle's 1732 Dictionary

Nov 18, 2025, 11:30 AM
30m
Zrak hall

Speakers

Madis Jürviste, Tiina Paet

Description

Optical character recognition (OCR) of historical texts has traditionally been carried out with specialised software such as Transkribus, eScriptorium, Kraken, and similar tools. To achieve accurate character recognition, these systems require extensive pre-training and the creation of a refined "ground truth" dataset; the comprehensiveness of model pre-training directly correlates with the precision of the results. Large language models (LLMs) promise a potential breakthrough in this domain, offering high-quality output without pre-training through their "zero-shot" capabilities.

Within the framework of a dedicated research programme, "Application of Large Language Models in Lexicography: New Opportunities and Challenges", we have conducted experiments employing untrained language models for the optical character recognition and data structuring of the dictionary section of Anton Thor Helle's 1732 grammar. The recent introduction of vision-capable language models proved decisive, enabling significantly more efficient processing of scanned documents than previously possible.

Preliminary tests demonstrated that Anthropic's Claude 3.5 Sonnet model could generate a structured table from a scanned dictionary file containing Gothic script (Fraktur) based on a simple prompt, recognising the text and appropriately categorising headword entries into relevant columns. Our comparative analysis of various generative language models (Anthropic's Claude, OpenAI's GPT models, Google's Gemini 2.0, and Mistral) revealed that Claude significantly outperforms the other models in processing 17th- and 18th-century Estonian texts printed in Gothic typeface. Following our preliminary experiments, Anthropic released Claude 3.7 Sonnet, with which we conducted a more comprehensive test to digitise Helle's entire dictionary.
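The workflow just described can be sketched in code. The snippet below builds a request in the Anthropic Messages API format, in which a base64-encoded page image is sent alongside a structuring prompt; the model name, prompt wording, and column labels are illustrative assumptions, not the authors' exact setup.

```python
import base64

def build_ocr_request(image_path: str,
                      model: str = "claude-3-5-sonnet-20240620") -> dict:
    """Build a Messages API request asking a vision-capable model to
    transcribe a scanned Fraktur dictionary page into a structured table.
    Prompt wording and column names here are illustrative only."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "Transcribe this 1732 dictionary page printed in Fraktur. "
        "Return a table with columns: Estonian headword | German "
        "equivalent | expressions."
    )
    return {
        "model": model,
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                # Image block first, then the instruction text.
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The resulting dictionary can then be posted via the Anthropic SDK or a plain HTTP client; only the payload construction is shown here, since the essential point is that a single image-plus-prompt message replaces the entire pre-training pipeline of conventional OCR systems.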

Our presentation examines how effectively the language model transforms a scanned dictionary into a structured, editable document. We assess the accuracy of character recognition for Estonian headwords, German equivalents, and expressions at both the character and word level (character error rate, CER, and word error rate, WER, respectively), as well as the precision of data structuring. Additionally, we explore the most common errors made by the model, factors influencing recognition accuracy, and challenges in adhering to the provided prompt instructions.
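CER and WER are standard edit-distance metrics: the Levenshtein distance between the model output and a reference transcription, normalised by the reference length in characters or words. A minimal sketch of how such scores can be computed (not the authors' evaluation code):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

For example, if the model reads a five-character Estonian headword with one wrong character, the CER for that word is 0.2; a one-word substitution in a three-word expression gives a WER of one third.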

Claude achieved the highest recognition accuracy with German translation equivalents, as it possesses substantially more training data for German than for Estonian. With both Estonian headwords and German equivalents, Claude frequently modernised word forms. In some instances, the LLM produced "hallucinations" that appeared plausible but bore no relation to the original text. In essence, the LLM tidied the image according to its own understanding — a tendency also observed in experiments with Stahl, Gutslaff, and Göseken (Author 1, Author 2, Author 3, 2025).

The primary advantage of our approach over conventional OCR methods lies in the significant time savings, considering both character recognition and automatic post-structuring capabilities. Whilst the classical method requires extensive ground truth creation and sometimes manual text segmentation, the language model-based approach delivers excellent results with substantially less preparation. Even paid language models such as Claude 3.7 Sonnet prove highly cost-effective.

LLM-based character recognition (and, when necessary, automatic post-structuring) can be applied to digitising other historical texts where conventional methods would be impractical due to time constraints. This opens new prospects for digitising historical textual heritage and creates the prerequisites for more extensive research into old textual sources.
