Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

The Mangalam Dictionary of Buddhist Sanskrit: automating lexicographic data with generative LLMs

Nov 18, 2025, 3:00 PM
30m
Zrak hall

Speaker

Ligeia Lugli

Description

This paper reports on recent advancements in the development of the Mangalam Dictionary of Buddhist Sanskrit, the first corpus-driven dictionary dedicated to Buddhist Sanskrit. This is a low-resource, historical, and domain-specific language variety instantiated in South Asian Buddhist literature dating from approximately the first millennium CE. The paper focusses on advances in the automation of this dictionary's data with generative Large Language Models (LLMs), with a view to sharing our solutions with scholars working with other low-resource historical languages. Specifically, the paper addresses the effectiveness and viability of leveraging latest-generation LLMs to automate three tasks that are central to our lexicographic work: semantic annotation of corpus sentences, identification of a headword's semantic prosody in different contexts, and comparison of a headword's synonyms. The paper first evaluates the relative performance of different commercially available models (including GPT 4.1, Sonnet 4 and Gemini 2.5) on a semantic tagging task and then details the different approaches we experimented with for enriching our corpus with word-sense and semantic prosody tags using LLMs. It concludes with a brief discussion of commercial LLMs' ability to compare Sanskrit synonyms on the basis of corpus sentences.
