Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

The role of subjectivity in lexicography: Experiments towards data-driven labeling of informality

Nov 19, 2025, 3:00 PM
30m
Arnold hall

Speakers

Lydia Risberg, Eleri Aedmaa, Maria Tuulik, Margit Langemets, Ene Vainik, Esta Prangel, Kristina Koppel, Hanna Pook

Description

Language corpora have long been used in linguistics and lexicography, but recent developments now allow large language models (LLMs) to support or even transform these fields. This study investigates the potential of LLMs for annotating informal language use in Estonian – a language underrepresented in LLM training data yet supported by a large corpus. Focusing on the informal register label used in the Dictionary of Standard Estonian, we explore whether LLMs can assist lexicographers in determining the informal label. This paper describes two experiments that make use of LLMs, including GPT, Gemini, and Claude. The first experiment yielded useful insights but also highlighted necessary improvements. In the second experiment, we evaluated the LLMs’ consistency and accuracy in categorizing words as informal or neutral/formal. Results showed that LLMs achieved around 76% agreement with expert human annotators, significantly above random chance, suggesting their usefulness as a supplementary resource in lexicography. GPT-4o demonstrated high accuracy, stability, and cost-efficiency, making it a reliable candidate for such a lexicographic task. The study highlights the inherent subjectivity in register labeling and the value of combining corpus data, expert judgment, and LLM output. Overall, LLMs represent a promising tool for modern dictionary work.
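
As a rough illustration of what "agreement above random chance" means for a binary informal vs. neutral/formal labeling task, the following minimal sketch computes raw percent agreement and Cohen's kappa between two label sequences. The abstract does not say which chance-correction measure the study used, so the choice of Cohen's kappa is an assumption here; the function names and the eight toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two annotators assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels for illustration only (not results from the study).
expert = ["informal", "neutral", "informal", "neutral",
          "neutral", "informal", "neutral", "neutral"]
llm = ["informal", "neutral", "neutral", "neutral",
       "neutral", "informal", "informal", "neutral"]

agreement = percent_agreement(expert, llm)
kappa = cohens_kappa(expert, llm)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

A kappa clearly above zero indicates agreement beyond what the two label distributions alone would produce, which is one common way to read a raw agreement figure against chance.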
