eLex 2025

Europe/Ljubljana
Bled, Slovenia

Iztok Kosem (Faculty of Computer and Information Science, University of Ljubljana), Karolina Zgaga (University of Ljubljana)
Description

Electronic lexicography in the 21st century: intelligent lexicography

    • 2:00 PM – 4:00 PM
      Workshop: CLASSLA-Express 2.0: Corpora vs. LLMs
      Conveners: Dr Ivana Filipović Petrović, Dr Polona Gantar
      • 2:00 PM
        Opening 15m
      • 2:15 PM
        Large language models and generative artificial intelligence: introduction 25m
        Speaker: Slobodan Beliga
      • 2:40 PM
        Some examples of using AI tools in Slovenian lexicography 25m
        Speaker: Polona Gantar
      • 3:05 PM
        Applying ChatGPT to Croatian phraseology and related lexicographic tasks 25m
        Speaker: Ivana Filipović Petrović (Croatian Academy of Sciences and Arts)
      • 3:30 PM
        Coffee Break 30m
    • 4:00 PM – 7:00 PM
      Workshop: CLASSLA-express 2.0: Hands-on exercises
      • 4:00 PM
        Corpora and AI-driven interfaces: Extracting linguistic data 1h
      • 5:00 PM
        Corpora and AI-driven interfaces: Creating definitions and providing usage examples of phraseological units for dictionaries 1h
      • 6:00 PM
        Corpora and AI-driven interfaces: Distinguishing literal and figurative uses of phraseological units 1h
    • 8:00 AM – 9:00 AM
      Registration
    • 9:00 AM – 9:30 AM
      Opening & Welcome (Arnold hall)

    • 9:30 AM – 10:30 AM
      Keynote: Keynote 1 (Arnold hall)

      Convener: Špela Arhar Holdt
      • 9:30 AM
        Large language models for lexicography 1h

        Currently, large language models (LLMs) are redefining methodological approaches in many scientific areas, including linguistics and lexicography. LLMs are pretrained on huge text corpora by predicting the next tokens and adapted for human interaction with instruction-following datasets. They are nevertheless prone to hallucinations and biases, which requires a human-in-the-loop approach. In the context of lexicography, LLMs can be used to support several tasks. We will present how the information contained in language databases can be utilized to improve LLMs on lexicographic tasks. Our current methodology is based on knowledge graph extraction, continued pretraining of LLMs, prompt engineering, and semi-automatic evaluation.

        Speaker: Marko Robnik-Šikonja
    • 10:30 AM – 11:00 AM
      Coffee break 30m (Lobby)

    • 11:00 AM – 1:00 PM
      Parallel sessions 1 (Arnold hall)

      Convener: Tanara Zingano Kuhn
      • 11:00 AM
        Retention of English words from interaction with dictionaries and GenAI Chatbots 30m

        The public release of ChatGPT in late 2022 made an impact on many professional domains. Notwithstanding the many controversies surrounding Generative Artificial Intelligence (GenAI), such as ethics, copyright, accountability, or ecology, we need to acknowledge an important and relevant feature of Large Language Models and chatbot systems built around them: their ability to produce mostly natural-sounding, smooth English prose. This ability makes AI Chatbots an attractive option in the learning (and teaching) of English, and thus a serious competitor to dictionaries seen as traditional learning (and teaching) aids, especially when it comes to vocabulary: the natural focus of lexicography and dictionaries. Effective use of dictionaries requires specific dictionary skills (e.g. Nesi, 1999), whereas AI Chatbots are generally believed to be straightforward and quick to use. A few recent studies have indeed found that ChatGPT may result in better student performance on English vocabulary tasks compared to traditional bilingual and monolingual dictionaries, at least for production tasks, if not always in reception (Lew et al., 2024; Ptasznik et al., 2024; Rees and Lew, 2024). These studies focused on immediate success, but we are not aware of any studies that would investigate vocabulary retention. It is quite possible that the ease and speed with which Chatbots facilitate the immediate completion of language-related tasks might not promote learning (a concern we in fact often hear from AI critics).

        In our eLex 2025 presentation, we report on two ongoing studies looking beyond immediate success and at delayed retention. Both studies tested the reception and production of infrequent and semantically opaque English phrasal verbs (20 in Study One, 19 in Study Two). Polish students majoring in English were randomized to one of three tools and completed reception and production tasks focused on phrasal verbs. Two to three weeks later they were re-tested, but now without access to any lexical tools. Study One tested the bilingual dictionary bab.la, the monolingual Collins Online Dictionary, and ChatGPT and found modest but significant and similar learning gains with all three tools in a reception task. For delayed production, ChatGPT was the only tool to result in significant learning. Study Two used a larger sample (223 participants) and two different chatbots as well as the bilingual dictionary diki.pl, which had been found effective in an earlier study (Lew et al., 2024). In delayed reception tests, the bilingual dictionary significantly outperformed both MS Copilot and Gemini, whereas for production, no significant differences were found between any of the tools, just an effect of the year of study. Our general tentative conclusion is that completing lexically oriented tasks with the help of AI chatbots does not seriously disadvantage longer-term vocabulary retention, compared to dictionaries.

        Speakers: Mr Robert Lew, Bartosz Ptasznik
      • 11:30 AM
        Automating Adjectival Microstructures in Monolingual Dictionaries: A New Method Combining Embeddings and LLMs 30m

        Recent findings indicate that current large language models (LLMs) face difficulties in generating clear-cut, well-motivated definitions in a consistent way. This shortcoming is the consequence of their reliance on opaque data sources and their inherently unstable, non-deterministic outputs. In response, this research aims to develop an LLM-based methodology for producing adjectival microstructures in monolingual dictionaries in a way that is both more consistent and aligned with lexicographic standards. Building on the hypothesis that prompts enriched with contextual information can enhance definition quality, the study employs a graph-based, interpretable, and unsupervised method starting out from static adjectival embeddings. The approach has previously demonstrated the ability to formalize traditional lexical semantic relations, detect adjectival senses from corpus data, and identify the most salient nominal contexts for each sense. The ultimate goal is to integrate these results into practical lexicographic workflows and assess how LLMs, when properly guided, can support dictionary compilation.

        Speakers: Enikő Héja, László Simon, Veronika Lipp
      • 12:00 PM
        Automatic Non-recorded Sense Detection for Swedish through Word Sense Induction with fine-tuned Word-in-Context models 30m

        Finding non-recorded senses is important for dictionary maintenance, where using automatic methods helps reduce manual efforts. We use automatic Word Sense Induction (WSI) to compare recorded sense numbers among a sample of headwords in a comprehensive Swedish monolingual dictionary with induced sense numbers for the same words in a Swedish corpus. We propose this as a simple technique to find words to prioritize for post-hoc manual checks, which can be done in a simple online user interface, bypassing the need for programming knowledge. We perform a thorough manual evaluation of the proposed methodology, enabling us to show statistically that using automatic WSI increases the odds of finding non-recorded senses compared to a random selection of words. We further (i) evaluate predictions according to potential inclusion in the dictionary, providing strong evidence for usefulness in practical lexicography, and (ii) analyze model predictions in-depth to point towards future improvements. Finally, we integrate lessons learned from our analysis into a large-scale prediction effort, providing the first high-quality large-scale WSI predictions for Swedish. These are a valuable resource for future research in Swedish lexicography.
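
        A minimal sketch of the general idea, assuming off-the-shelf sentence embeddings rather than the fine-tuned Word-in-Context models used in the study: usages of a headword are clustered, each cluster is treated as one induced sense, and words whose induced sense count exceeds the recorded count are queued for manual checking; the encoder name, threshold, and sentences are placeholders.

        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import AgglomerativeClustering

        def induced_sense_count(usages, distance_threshold=1.0):
            """Cluster corpus sentences containing a headword; each cluster approximates one induced sense."""
            model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder, not the paper's WiC model
            embeddings = model.encode(usages)
            clustering = AgglomerativeClustering(
                n_clusters=None, distance_threshold=distance_threshold
            ).fit(embeddings)
            return clustering.n_clusters_

        # Invented usages of a polysemous headword; the recorded sense count would come from the dictionary.
        usages = [
            "The bank approved the loan.",
            "She works at the bank downtown.",
            "They sat on the bank of the river.",
            "The river bank was muddy after the rain.",
        ]
        recorded_senses = 1
        if induced_sense_count(usages) > recorded_senses:
            print("candidate word for a non-recorded sense, queued for manual checking")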

        Speakers: Dominik Schlechtweg, Emma Sköldberg, Shafqat Mumtaz Virk, James White, Simon Hengchen
      • 12:30 PM
        Automatic Detection of Word Sense Shift from Corpus Data 30m

        Language evolves continuously, rendering static dictionaries quickly outdated. While previous research has addressed the automatic detection of new words, identifying subtler semantic changes in existing words remains a challenge. In this work, we propose a robust, language-independent methodology for the automatic detection of word sense shifts using diachronic corpus data. Our approach builds on the Adaptive Skip-Gram algorithm for word sense induction, enabling us to model polysemy directly from raw text without reliance on external sense inventories.

        We calculate the temporal distribution of induced senses and apply trend estimation techniques—specifically linear regression and the Theil–Sen estimator—to detect statistically significant shifts. This two-stage architecture decouples sense induction from trend analysis, increasing overall robustness and interpretability. Unlike traditional methods in lexical semantic change detection, which often target dramatic historical shifts, our method is designed to detect emerging or evolving senses over shorter timescales using large web corpora.
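
        An illustrative sketch of the trend-analysis stage described above, assuming the sense-induction step has already produced yearly proportions for one induced sense; the data, threshold, and decision rule are placeholder assumptions, not the authors' exact configuration.

        import numpy as np
        from scipy.stats import linregress, theilslopes

        def detect_shift(years, sense_share, alpha=0.05):
            """Flag a sense whose relative frequency shows a statistically significant temporal trend."""
            x = np.asarray(years, dtype=float)
            y = np.asarray(sense_share, dtype=float)
            lr = linregress(x, y)                 # ordinary least-squares trend
            slope, _, lo, hi = theilslopes(y, x)  # robust Theil-Sen slope with its confidence band
            return lr.pvalue < alpha and not (lo <= 0.0 <= hi)

        # Invented example: the share of one induced sense rising across yearly corpus slices.
        years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
        share = [0.05, 0.06, 0.09, 0.12, 0.15, 0.19, 0.22]
        print(detect_shift(years, share))  # True -> candidate sense shift for lexicographic review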

        We evaluate our method on Timestamped corpora in English and Czech and present several examples of detected sense shifts. The results demonstrate the feasibility of scalable, automatic sense shift detection and its potential applications in lexicography and linguistic research.

        Speaker: Ondřej Herman
    • 11:00 AM – 1:00 PM
      Parallel sessions 2 (Sonce hall)

      Convener: Mojca Stritar Kučuk
      • 11:00 AM
        Compiling bilingual dictionaries: AI-Assisted translation of Italian Multiword Expressions into English and French 30m

        The present research explores the use of large language models (LLMs) in digital lexicography, specifically for translating Italian multiword expressions (MWEs) into English and French.

        The study aims to assess the capability of contemporary LLMs in providing accurate and reliable translation equivalents, examples and definitions of Italian MWEs into English and French, while also evaluating the need for expert validation in refining AI-generated lexicographic resources. We seek to develop a digital resource tailored for language learners, offering frequently attested translations.

        Methodologically, 120 expressions were evaluated by human experts and compared across two LLMs (Gemini 2.0 Flash and Mistral-Large-2411) using different metrics aimed at assessing correctness, accuracy, and contextual suitability, along with the capacity to produce meaning explanations and usage examples. Results show that English translations received higher expert ratings than French ones, with high correlation between human and AI evaluations in the case of English, and significantly lower agreement in the case of French translations. The findings indicate that LLMs provide generally reliable translations, though expert oversight remains crucial.
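
        As a hedged illustration of the agreement analysis between human and AI evaluations (the abstract does not specify the statistic used), a rank correlation could be computed as follows; the ratings are invented placeholders.

        from scipy.stats import spearmanr

        # Invented 1-5 ratings of the same translation equivalents by human experts and by an LLM judge.
        expert_scores = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
        llm_scores = [5, 4, 5, 3, 4, 2, 4, 5, 2, 4]

        rho, p_value = spearmanr(expert_scores, llm_scores)
        print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # higher rho = closer human-AI agreement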

        Speakers: Annalisa Greco, Matteo Delsanto, Andrea Di Fabio, Lorenzo Mori, Cristina Onesti, Daniele Paolo Radicioni, Calogero Jerik Scozzaro
      • 11:30 AM
        Neology in Practice: Lexicographic and Terminological Approaches to Lexical Innovation 30m

        The COST Action ‘European Network on Lexical Innovation’ (ENEOLI) conducted a comprehensive survey in October–November 2024 regarding the methods, practices, tools, and resources used in the study and documentation of lexical innovations, including neologisms and novel senses. The 249 respondents from 50 countries represented linguists, lexicographers, terminologists, translators, software developers, and educators. Respondents could indicate more than one field of expertise: 169 named linguistics (70%), 107 lexicography (44%), and 105 terminology (43%). In this paper, we focus on the responses of those indicating their field of expertise as lexicography and/or terminology, and analyze their approaches to the identification and documentation of neologisms, the composition of project teams, and the use of corpora and digital tools. Special attention is given to training pathways and professional needs, offering insights into the evolving skills required in the field of lexical innovation.

        Speakers: Jelena Kallas, Kristina Koppel, Kris Heylen, Ilan Kernerman, Ana Ostroški Anić, Federica Vezzani, Špela Arhar Holdt
      • 12:30 PM
        Enhancing Lexicographic Access for Deaf and Hard-of-Hearing Learners: A Digital Greek Sign Language Dictionary with AI-Powered Language Support 30m

        The purpose of the presentation is to explore the design and development of an innovative online pedagogical dictionary of Greek Sign Language, specifically tailored to the linguistic and educational needs of Deaf and Hard-of-Hearing (DHH) learners in Greece. Emphasizing accessibility and pedagogical usability, the dictionary integrates Artificial Intelligence (AI) technologies to support multimodal interaction and facilitate bilingual proficiency in both Greek and Greek Sign Language (GSL).

        Implemented as a web-based platform, the dictionary ensures broad accessibility for secondary and tertiary-level students through an intuitive, learner-centered interface. Key features include:
        • An interactive chatbot, enabling users to ask questions via either spoken/written Greek or sign language, receiving responses in both modalities.
        • AI-assisted exercise generation, which adapts vocabulary and grammar tasks based on individual learner profiles and performance metrics.
        • Neural-network-based text-to-sign translation modules, allowing for real-time rendering of written Greek input into Greek Sign Language.

        The presentation is structured in three main parts: it begins with a discussion on the significance of inclusive lexicography and the imperative to develop language resources that address the accessibility needs of diverse user groups. It then outlines the lexicographic protocol adopted for compiling the dictionary, followed by an in-depth description of the platform’s functionalities. The final section analyzes the integration of AI technologies and their role in enhancing both linguistic accessibility and pedagogical personalization.

        The contribution of the paper is twofold: first, it provides a concrete model of inclusive digital lexicography for sign languages; second, it highlights how AI can be leveraged not merely as a technical enhancement, but as a transformative tool in promoting equitable access to language resources for underrepresented communities.

        Speakers: Isidora Despotidou, Zoe Gavriilidou
    • 11:00 AM – 1:00 PM
      Parallel sessions 3 (Zrak hall)

      Convener: Miloš Jakubíček
      • 11:00 AM
        Parsing of Explanatory dictionary 30m

        The paper outlines technological and methodological ways to arrange the dictionary parsing process. The Spanish Dictionary (Diccionario de la lengua Española 23 ed. – DLE 23) website (https://dle.rae.es/) serves as a basis for the research. First of all, as the most complex multi-parameter lexicographic frameworks, explanatory dictionaries of national languages are of particular interest because they offer the most comprehensive lexicographic description of a language, are produced by top experts (linguists and IT engineers), and offer numerous opportunities to fully utilize contemporary digital technologies.

        Ultimately, our goal is to create a digital version of the Dictionary of Spanish that can be easily adjusted to the user's evolving demands using a built-in research toolbox. To achieve this, we launched a project entitled the Virtual Lexicographic Laboratory of the Dictionary of Spanish (VLL DLE 23).

        The first step was to build a formal model that would serve as a basis for elaborating the parsing algorithm, XML schema, database schema, and interfaces. The formal model of DLE 23 was built by analyzing the structure of dictionary entries in both the online version and the printed variant of DLE 23.

        The second step is to create a lexicographic database. Since the dictionary entries have a strictly defined structure, it makes sense to represent them as classes in object-oriented programming languages with subsequent processing, editing and storage in explicit form. NoSQL databases (document-oriented databases) provide such a possibility. The LiteDB database (http://www.litedb.org/) was chosen for our project.

        The final stage of the trial version was creating a web application to work with the VLL DLE database. The application was built on .NET Core 2.1. A set of HTML and CSS templates and JavaScript Bootstrap scripts was used for convenient construction and modification of the interface elements.

        The DLE 23 VLL project is being realized in two stages: 1) creation of a VLL pilot version to test specific technological solutions and clarify the structure of the dictionary entry; 2) development of a final application with a full-scale interface. Currently, the first stage has been completed. The pilot version demonstrates more possibilities for the user than the original online version of DLE 23. A streaming version of DLE 23 is available at https://svc2.ulif.org.ua/Dics/ResIntSpanish (a captcha is used).

        Further parameterization of dictionary entries was done in order to construct the pilot version of the VLL. A collection of parameters is associated with each headword: 1) headword variations; 2) headword structure; 3) headword type; 4) homonymy; 5) number of meanings; 6) number of word combinations, and some others. Each parameter was identified using the dictionary entry's HTML text as a baseline. To create a selection, the user can enter any combination of these parameters. Entries are shown in a manner akin to the original edition, and the HTML-formatted text is also displayed. Statistics are produced for every selection. Full-text search is an additional option that can be combined with parametric search. Any string of HTML text can be specified as a search string.
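
        An illustrative sketch of turning an entry's HTML into a parameterised record for a document-oriented database (the actual project uses C#/.NET with LiteDB; the CSS class names below are placeholders, not the real DLE 23 markup).

        from bs4 import BeautifulSoup

        def parse_entry(html):
            """Turn one dictionary-entry HTML fragment into a parameterised record (illustrative selectors only)."""
            soup = BeautifulSoup(html, "html.parser")
            headword = soup.find(class_="headword")  # placeholder class names,
            senses = soup.find_all(class_="sense")   # not the real DLE 23 markup
            return {
                "headword": headword.get_text(strip=True) if headword else None,
                "number_of_meanings": len(senses),
                "senses": [s.get_text(" ", strip=True) for s in senses],
                "raw_html": html,  # kept so the entry can be re-rendered as in the original edition
            }

        entry_html = '<div><span class="headword">casa</span><p class="sense">1. Edificio para habitar.</p></div>'
        record = parse_entry(entry_html)
        # Records of this kind can then be stored in a document-oriented (NoSQL) database and queried by parameter.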

        Speakers: Iryna Ostapova, Yevhen Kupriianov, Mykyta Yablochkov
      • 11:30 AM
        Vision-Enabled Language Models in Lexicographical Digitisation: A Case Study of Anton Thor Helle's 1732 Dictionary 30m

        Traditionally, historical texts’ optical character recognition (OCR) has primarily been conducted using specialised software such as Transkribus, eScriptorium, Kraken, and similar tools. To achieve accurate character recognition, these systems require extensive pre-training and the creation of a refined "ground truth" dataset. The comprehensiveness of model pre-training directly correlates with the precision of results. Large language models (LLMs) promise a potential breakthrough in this domain, offering high-quality output without pre-training through their "zero-shot" capabilities.

        Within the framework of a dedicated research programme, "Application of Large Language Models in Lexicography: New Opportunities and Challenges", we have conducted experiments employing untrained language models for the optical character recognition and data structuring of the dictionary section of Anton Thor Helle's 1732 grammar. The recent introduction of vision-capable language models proved decisive, enabling significantly more efficient processing of scanned documents than previously possible.

        Preliminary tests demonstrated that Anthropic's Claude 3.5 Sonnet model could generate a structured table from a scanned dictionary file containing Gothic script (Fraktur) based on a simple prompt, recognising the text and appropriately categorising headword entries into relevant columns. Our comparative analysis of various generative language models (Anthropic's Claude, OpenAI's GPT models, Google's Gemini 2.0, and Mistral) revealed that Claude significantly outperforms other models in processing 17th and 18th-century Estonian texts printed in Gothic typeface. Following our preliminary experiments, Anthropic released Claude Sonnet version 3.7, with which we conducted a more comprehensive test to digitise Helle's entire dictionary.
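
        As a rough illustration of this kind of zero-shot, vision-based processing (not the authors' exact set-up), a scanned page could be sent to a vision-capable model through the Anthropic Python SDK; the file name, prompt wording, and model identifier below are assumptions.

        import base64
        import anthropic

        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        with open("helle_1732_page.png", "rb") as f:  # hypothetical scan of one dictionary page
            page = base64.standard_b64encode(f.read()).decode()

        message = client.messages.create(
            model="claude-3-7-sonnet-20250219",  # assumed model identifier
            max_tokens=4000,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": page}},
                    {"type": "text",
                     "text": "Transcribe the Fraktur text on this page and return a table with the columns: "
                             "Estonian headword | German equivalent | expressions."},
                ],
            }],
        )
        print(message.content[0].text)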

        Our presentation examines how effectively the language model transforms a scanned dictionary into a structured, editable document. We assess the accuracy of character recognition for Estonian headwords, German equivalents, and expressions at both character and word levels (CER and WER, respectively) and the precision of data structuring. Additionally, we explore the most common errors made by the model, factors influencing recognition accuracy, and challenges in adherence to provided prompt instructions.

        Claude achieved the highest recognition accuracy with German translation equivalents, as it possesses substantially more training data for German than for Estonian. With both Estonian headwords and German equivalents, Claude frequently modernised word forms. In some instances, the LLM produced "hallucinations" that appeared plausible but bore no relation to the original text. In essence, the LLM tidied the image according to its own understanding — a tendency also observed in experiments with Stahl, Gutslaff, and Göseken (Author 1, Author 2, Author 3, 2025).

        The primary advantage of our approach over conventional OCR methods lies in the significant time savings, considering both character recognition and automatic post-structuring capabilities. Whilst the classical method requires extensive ground truth creation and sometimes manual text segmentation, the language model-based approach delivers excellent results with substantially less preparation. Even paid language models such as Claude 3.7 Sonnet prove highly cost-effective.

        LLM-based character recognition (and, when necessary, automatic post-structuring) can be applied to digitising other historical texts where prevalent methods would be impractical due to time constraints. This opens new prospects for digitising historical textual heritage and creates prerequisites for more extensive research of old textual sources.

        Speakers: Madis Jürviste, Tiina Paet
      • 12:00 PM
        GramatiKat: A Corpus-Based Tool for Detecting Morphological Anomalies and Paradigm Variation 30m

        GramatiKat is a freely accessible online application designed to support lexicographic and grammatical work on morphologically rich languages. It provides grammatical profiles, i.e. frequency distributions of a lemma's inflected forms, for thousands of Czech nouns, adjectives, and verbs based on large annotated corpora. The concept of grammatical profiling is rooted in the work of Janda and Lyashevskaya (2011), who demonstrated that the distribution of inflected forms can reflect both grammatical structure and semantic properties of lexemes. In GramatiKat, these profiles are compared against a statistically computed Reference Grammatical Profile (RGP), which captures the expected distribution of forms for a given part of speech (Kováříková & Nikolaev, in preparation). This allows users to immediately see whether a given word follows the expected distributional pattern or deviates from it in meaningful ways. Such deviations can signal lexicographically relevant features such as semantic anomalies or collocational behaviour (e.g. participation in multi-word terms, idioms, or other multi-word units).

        The information in GramatiKat is derived from two representative corpora of contemporary written Czech, SYN2015 and SYN2020 (each containing 100 million words). Deviations from the norm, i.e. forms that are unusually frequent, infrequent, or entirely missing, are automatically highlighted using standard boxplot methodology (Kováříková & Kovářík 2023). Such anomalies can point to a wide range of lexicographically relevant information, including semantic constraints, syntactic preference, or idiomatic usage, all of which are valuable both for dictionary authors and for their audiences, particularly language learners.
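
        An illustrative sketch of the boxplot-based flagging described above, assuming a lemma's profile and a reference profile are given as relative frequencies; the thresholding follows the standard 1.5 * IQR whisker rule and the data are invented.

        import numpy as np

        def flag_anomalous_forms(profile, reference, k=1.5):
            """Flag inflected forms whose share in one lemma's profile falls outside the reference boxplot whiskers.

            profile:   {form: share of that form for one lemma}
            reference: {form: list of shares of that form across many lemmas of the same part of speech}
            """
            anomalies = []
            for form, share in profile.items():
                q1, q3 = np.percentile(reference[form], [25, 75])
                iqr = q3 - q1
                if share < q1 - k * iqr or share > q3 + k * iqr:
                    anomalies.append(form)
            return anomalies

        # Invented shares: this lemma uses the instrumental singular far more, and other forms far less, than usual.
        profile = {"ins.sg": 0.85, "nom.sg": 0.02}
        reference = {"ins.sg": [0.04, 0.05, 0.06, 0.05], "nom.sg": [0.20, 0.22, 0.19, 0.21]}
        print(flag_anomalous_forms(profile, reference))  # ['ins.sg', 'nom.sg']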

        The value of the tool for lexicographers is twofold. First, it offers empirical support for deciding whether certain grammatical forms should be included, exemplified, or specially marked in a dictionary entry. For instance, the noun brva ‘eyelash’ appears almost exclusively in the instrumental singular, as part of the idiom nepohnout ani brvou (‘not to bat an eyelash’), which suggests that it is effectively defective in other forms (Kováříková et al. 2024), information that should be included in the dictionary. Second, even when no overt anomaly is present, the grammatical profile provides a reliable picture of how a word behaves in real usage, for example showing its typical grammatical roles (nominative for subject, accusative for object). This supports more nuanced dictionary descriptions in line with corpus-driven approaches that aim to derive linguistic generalizations directly from data (Tognini-Bonelli 2001).

        From a technical perspective, GramatiKat lowers the barrier to corpus-based grammatical analysis by offering fully preprocessed, transparent, and reproducible data visualizations. The interface supports interactive exploration, filtering, and data export, making it accessible even to those without programming skills. The tool has already been successfully adapted to Slovak and Croatian, demonstrating that, given sufficient high-quality corpus data, the approach is transferable to other morphologically rich languages. Its development is grounded in principles of Open Science and reproducible research (Chromý & Cvrček 2021).

        By combining grammatical profiling with robust statistical interpretation, GramatiKat equips lexicographers with a precise and efficient method for exploring morphological behavior across the lexicon. The presentation will illustrate the tool’s functionality through real-world examples, showing both regular and anomalous grammatical profiles, and discussing how these can inform dictionary writing, editing, and revision.

        Speaker: Dominika Kovarikova
      • 12:30 PM
        Modeling and structuring of a bilingual French-Chinese phraseological dictionary: neural automatic approach for ontology and lexicography 30m

        ONLINE PRESENTATION

        The creation of ontologies—traditionally the domain of linguists and knowledge engineers—is undergoing a significant transformation thanks to advances in artificial intelligence and natural language processing (NLP). These developments open new avenues for phraseology, a field where multi-word expressions (MWEs)—often opaque and non-compositional—must be identified, classified, and linked to abstract concepts or discourse contexts (Constant 2012: 6). Despite their linguistic richness, idiomatic expressions remain a major challenge for NLP due to their syntactic variability, semantic ambiguity, and context-dependence (Gross 1996; Mejri 1997; Polguère 2002; Chen 2021).

        This study presents an approach for modeling a bilingual French–Chinese phraseological dictionary by combining lexicographic theory, ontology design, and neural NLP techniques. We focus specifically on idiomatic expressions related to the human body and animals, domains in which words such as main (hand) can carry both literal and figurative meanings—e.g., as symbols of work, strength, or authority (Rey & Chantreau 2003; Rey 2019).

        To overcome the limitations of manual ontology construction tools like Protégé (Kapoor & Sharma, 2010), we follow the principles of the Ontology Layer Cake (Despres & Szulman 2008; Tiwari & Jain 2014) and implement a semi-automatic pipeline. Our methodology includes: (1) statistical extraction of idioms using TF-IDF, PMI, and RAKE; (2) syntactic filtering of candidate MWEs; (3) visualization and annotation through an interactive Streamlit interface; (4) semantic relation modeling using fine-tuned neural models (BilBERT and Sentence-BERT); and (5) export in OWL/RDF format using the OntoLex-Lemon standard, with SKOS for conceptual hierarchies and VarTrans for bilingual alignments.
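
        An illustrative sketch of the statistical extraction step (1), using pointwise mutual information over adjacent token pairs; this is a simplified stand-in for the TF-IDF/PMI/RAKE pipeline described above, and the example tokens are invented.

        import math
        from collections import Counter

        def pmi_bigrams(tokens, min_count=2):
            """Score adjacent token pairs by pointwise mutual information as multiword-expression candidates."""
            unigrams = Counter(tokens)
            bigrams = Counter(zip(tokens, tokens[1:]))
            n = len(tokens)
            scores = {}
            for (w1, w2), c in bigrams.items():
                if c < min_count:
                    continue
                scores[(w1, w2)] = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
            return scores

        # Invented toy "corpus"; high-PMI pairs are passed on to syntactic filtering and manual annotation.
        tokens = "donner un coup de main pour donner un coup de main".split()
        print(sorted(pmi_bigrams(tokens).items(), key=lambda kv: -kv[1]))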

        A central challenge lies in extracting semantic triplets of the form (idiom, keyword, relation)—e.g., donner un coup de main → (main, aide)—which requires addressing the idioms’ non-compositionality, structural variation, and semantic opacity. We rely on syntactic grammars (Tesnière 1959), semantic mapping, and machine learning to formalize these triplets into interpretable ontological structures (Chen & Gasparini 2025).

        The resulting resource is a multilingual, interoperable, and dynamic dictionary of idiomatic expressions, accessible via an interface that supports exploration, sorting, and export to Protégé or SPARQL-compatible systems. This work bridges NLP and lexicography, contributing to AI-enhanced auto-lexicography, semantic modeling, and the generation of context-aware bilingual examples (González-Rey 2002; Mel’čuk 2008, 2011; Mejri 2011; Sułkowska 2016; Chen 2023).

        Our project aims to achieve six interconnected objectives. First, we design a semi-automatic pipeline for extracting and identifying idiomatic expressions from authentic French corpora, with a particular focus on thematic categories such as the human body and animals. Second, we construct semantic triplets that link idioms to keywords and conceptual categories, enabling fine-grained semantic interpretation. Third, we fine-tune a multilingual BERT-based model (BilBERT) to classify the semantic relations between idioms and their components. Fourth, we formally model the extracted data as an ontology using the OntoLex-Lemon framework, enriched with SKOS hierarchies and VarTrans modules to support bilingual alignment with Chinese equivalents. Fifth, we develop an interactive Streamlit interface that allows users to visualize idiomatic relationships, perform manual annotations, and export the data in RDF/OWL format. Finally, our project contributes to ongoing research in multilingual phraseology and AI-assisted lexicography, offering practical tools and resources for Semantic Web applications and advanced NLP tasks.

        The presentation will include several illustrations of the results obtained throughout the project, including visualizations of idiomatic triplets, conceptual mappings, and semantic graphs generated during the modeling and classification phases.

        Speaker: Lian Chen
    • 1:00 PM – 2:30 PM
      Lunch 1h 30m
    • 2:30 PM – 4:00 PM
      Parallel sessions 1 (Arnold hall)

      Convener: Jaka Čibej
      • 2:30 PM
        The Dictionary of Contemporary Serbian Language (RSSJ): Advanced Automation and Other Challenges 30m

        This paper introduces the Dictionary of Contemporary Serbian Language (RSSJ), an ongoing large-scale digital lexicographic project designed to serve both human users via web and mobile applications and machines through APIs. Coordinated by the diaspora association “Gathered around the Language” and the Society for Language Resources and Technologies (JeRTeh), RSSJ aims to produce a dictionary of approximately 50,000 frequently used words, reflecting vocabulary used over the past fifty years across diverse functional styles. The headword list is automatically extracted from corpora (SrpKor2013, SrpKor2021), then manually curated and enriched with data from the LeXimirka database. The project implements advanced automation at multiple stages, employing language models and static embeddings (Word2Vec, FastText, Dict2Vec) to identify synonyms, while large language models assisted in generating draft definitions. Additional methods include automated extraction of collocations, syntactic patterns, and exemplary usage via GDEX algorithms, all managed within a DMLex-inspired PostgreSQL data model. The custom web interface enables seamless integration of dictionary editing and corpus querying. Preliminary results demonstrate that automated drafting accelerates dictionary development to some extent, while requiring lexicographers to adopt more dynamic, data-driven workflows and to redefine traditional lexicographic practices.
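
        A minimal sketch of how static embeddings can propose synonym candidates for lexicographer review, assuming pretrained vectors in word2vec text format; the file name and example lemma are placeholders, not the project's actual resources.

        from gensim.models import KeyedVectors

        # Hypothetical path to pretrained Serbian word vectors stored in word2vec text format.
        vectors = KeyedVectors.load_word2vec_format("srpkor_vectors.vec", binary=False)

        def synonym_candidates(lemma, topn=10):
            """Return nearest neighbours in the embedding space as synonym candidates for lexicographer review."""
            return [word for word, similarity in vectors.most_similar(lemma, topn=topn)]

        print(synonym_candidates("kuća"))  # candidates are manually curated before entering the dictionary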

        Speakers: Ranka Stanković, Rada Stijović, Mihailo Škorić, Cvetana Krstev
      • 3:00 PM
        Lexical-Semantic Resources as a Culture-Aware Basis for Benchmarking and Evaluation of LLMs 30m

        Large Language Models (LLMs) tend to expose severe language and cultural biases when working in medium- and low-resourced languages. In this paper, we present our work on Danish benchmarking and evaluation of LLMs to more precisely diagnose and potentially remedy such bias. To this end, we apply available lexical-semantic resources to compile a set of Natural Language Understanding (NLU) tasks in Danish that reflect the breadth and nuances of the Danish vocabulary, thereby also capturing implicit traits of Danish values and culture. Currently, the benchmark comprises nine NLU tasks, including tasks such as disambiguating words in context, determining semantic outliers, inferencing and interpretation tasks based on semantic relations, as well as selecting the correct explanation of culture-related metaphorical idioms. The large-scale benchmark (currently approx. 8,000 data instances) is supplemented by a much smaller dataset prepared for human evaluation of LLM-generated explanations, thereby enabling a more careful study of the language generation and interpretation abilities of the models from a lexical-semantic perspective.

        Speakers: Nathalie Norman, Sanni Nimb, Sussi Olsen, Nina Schneidermann, Bolette S. Pedersen
      • 3:30 PM
        Do dictionary users prefer definitions by lexicographers or by LLM-s? 30m

        The use of corpora is well established in lexicography, also in Estonia, but since the analysis of corpus data and the post-editing of automatically generated corpus data are labour-intensive, there is growing interest in the use of large language models (LLMs) in lexicography (e.g., Evert et al. 2024; Kosem, Gantar et al. 2024; Tiberius et al. 2024). In 2024, the Institute of the Estonian Language launched a project in which we explore how LLMs can assist in compiling dictionary entries (e.g., definitions, register labels, examples).

        In the first year, we tested whether LLMs can help lexicographers in the task of explaining word meanings in Estonian, a language with around 1 million speakers that is underrepresented in LLMs. The results showed that lexicographers rated 85% of the meaning descriptions generated by GPT-4o (the highest-rated LLM in the study) as useful or somewhat useful for their work. While our first study focused on lexicographers’ preferences and requirements for LLM-generated definitions, in the current study we concentrate on users’ preferences and requirements for both LLM-generated and lexicographer-compiled definitions.

        According to a survey conducted in 2023 (Langemets et al. 2024: 750-751), the Estonian Language Institute's language portal Sõnaveeb (Koppel et al. 2019) is searched most for information on meanings. This coincides with the results of a pan-European study (Kosem et al., 2019), according to which meanings in general are the most searched units in dictionaries. However, both studies were carried out before the wider use of LLMs. No research has been carried out on the Estonian language to investigate whether and how preferences for obtaining information about meanings have changed with the increasing use of LLMs. In the presentation, we will introduce the results of a survey carried out among the users of Sõnaveeb, where LLM-generated definitions were presented side by side with lexicographer-compiled definitions, and users had to mark their preference and list the reasons for it. The evaluation is conducted blindly, with users not being informed which explanation is human-made. The lexicographic meaning descriptions used in the survey are the definitions from the EKI Combined Dictionary (Tavast et al. 2020), which is the backbone of Sõnaveeb and presents a detailed monolingual description of meaning that defines the content of the concept as exhaustively as possible. Words from different parts of speech and with varying degrees of polysemy were included in the study.

        We tested the following LLMs: GPT-4o, o1-mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 2.0, and EuroLLM. Based on expert evaluations, the best-performing model was selected for the final user test. In the presentation, we introduce the tested prompts and examine how users’ dictionary and LLM usage habits relate to their preferences. Above all, how do users rate the LLM-generated definitions, and do they prefer them to the ones lexicographers compiled? What do lexicographers still do better than LLMs, and what, intriguingly, do users believe LLMs do better than lexicographers?

        Speakers: Maria Tuulik, Ene Vainik, Margit Langemets, Eleri Aedmaa, Lydia Risberg, Esta Prangel, Kristina Koppel, Sirli Zupping
    • 2:30 PM – 4:00 PM
      Parallel sessions 2 (Sonce hall)

      Convener: Valeria Caruso
      • 2:30 PM
        Documenting the Final Days of Monolingual English Learners’ Dictionaries Using the Archived Web 30m

        Online dictionaries have many advantages over their physical counterparts. However, the ephemeral nature of web content means that they are often changed without notice and no ostensible record of what came before remains. This makes research on historical online dictionaries difficult and perhaps explains why, while the history of printed monolingual English learners’ dictionaries (MELDs) has been comprehensively explored, studies of online dictionaries have tended to take a cross-sectional rather than longitudinal view. This is not ideal since it means that a large period of MELD history is yet to be explored. Moreover, given recent predictions of the decline of MELDs, as we know them, in light of developments with AI chatbots and other digital tools, this gap is all the more significant. In an attempt to remedy this situation, this study applies Brügger’s (2018) framework for archived web research to explore the feasibility of using the web archive, the Wayback Machine, to trace the development of websites that give, or have given, access to ‘the big five’ MELDs. Some key challenges of using archived web material to conduct lexicographic research are discussed along with suggestions for potential solutions.

        Speaker: Geraint Paul Rees
      • 3:00 PM
        Mapping Slovene Learner Vocabulary to CEFR Scales with AI-assisted Methods 30m

        This paper examines how a learner corpus can support lexicographic work by classifying learner vocabulary according to the CEFR scale. Using a corpus-driven methodology, I explore the potential of AI to complement traditional analysis. The study focuses on a selection of texts from the Slovene learner corpus KOST, balanced according to the pragmatically assigned levels of learners’ language proficiency: non-Slavic beginners, South Slavic beginners, other Slavic beginners, intermediate and advanced learners. Lemma lists were generated using Sketch Engine and compared with the core vocabulary for Slovene as L2 (up to level B1) and other reference sources. Two advanced language models (ChatGPT and Copilot) were then used to automatically assign CEFR levels to the lemmas. The study compares traditional corpus-derived classifications with AI-generated classifications, evaluates their accuracy and bias, and aims to assess the feasibility of using LLMs in corpus-based CEFR annotation and vocabulary profiling in a lesser-resourced language such as Slovene.
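
        An illustrative sketch of the automatic CEFR assignment step, assuming the OpenAI Python SDK and a placeholder model name; the prompt wording is an assumption, not the study's actual prompt.

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def assign_cefr(lemma):
            """Ask an LLM to place a Slovene lemma on the CEFR scale; labels are later checked against corpus data."""
            response = client.chat.completions.create(
                model="gpt-4o",  # placeholder model name
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": f"Assign a CEFR level (A1, A2, B1, B2, C1 or C2) to the Slovene word '{lemma}' "
                               "for learners of Slovene as a second language. Answer with the level only.",
                }],
            )
            return response.choices[0].message.content.strip()

        print(assign_cefr("hiša"))  # e.g. "A1"; such labels are then compared with corpus-derived lemma lists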

        Speaker: Mojca Stritar Kučuk
      • 3:30 PM
        AI- and Corpus-Based Strategies for Identifying Phraseme Constructions: A Pilot Study on Croatian Repetitive Constructions 30m

        The paper introduces a hybrid methodology for cross-linguistic identification of phraseme constructions, developed within the scope of a pilot study on Croatian repetitive constructions. The study explores how artificial intelligence and corpus technologies can be systematically combined to uncover functionally equivalent patterns across languages. The proposed strategy rests on three interdependent layers: (1) the AI layer, which harnesses large language models to generate candidate constructions, paraphrases, and corpus query formulations; (2) the corpus layer, which provides empirical validation through frequency data, authentic usage, and syntactic patterns; (3) and the human expert layer, which supervises prompt engineering, interprets outputs, and ensures linguistic adequacy. These layers operate in an iterative workflow, enabling dynamic interaction between computational and expert insights. The methodology is exemplified through the analysis of the German construction X über X ‘X after X’, for which the Croatian equivalent X za X-om (e.g., dan za danom ‘day after day’) is identified as structurally and semantically appropriate. The study compares outputs of two LLMs (GPT-4o and o3), revealing performance differences in idiomatic sensitivity. It also demonstrates how LLMs can assist in filtering corpus concordances to identify phraseologically valid examples. The study highlights both the strengths (e.g., scalability, reduced expert workload) and limitations (e.g., LLMs’ sensitivity to prompt design and formal syntax) of the approach. It concludes that this layered strategy offers a viable path toward the semi-automatic processing of additional constructions and the development of multilingual phraseological resources.

        Speakers: Slobodan Beliga, Ivana Filipović Petrović
    • 2:30 PM – 4:00 PM
      Parallel sessions 3 (Zrak hall)

      Convener: Margit Langemets
      • 2:30 PM
        Exploring the constructicographic potential of lexicographic data and language models: The case of the Estonian Nominal Quantifier Construction 30m

        Constructicography, or the description of grammatical constructions in a lexicographic format, is an emerging field currently in the stage of developing and automating methods for treating large numbers of (semi-)schematic constructions. This study explores how existing lexicographic data and language models can be used to facilitate the constructicographic workflow. Our results suggest that (1) collocations and semantic relations represented in a lexicographic database can be used to identify the collexemes of constructions, that is, the lexemes occurring in the open slot(s) of schematic constructions, (2) BERT-based language models can be trained to identify instances of constructions in corpora, using collocations as the starting point to create appropriate training data, and (3) commercial large language models can be prompted to identify constructional instances, using a small number of examples. The identification of the collexemes and corpus instances of constructions provides several pieces of information that can be represented in constructicon entries: the meaning, form, frequency and productivity of constructions, the frequency and association strength of particular collexemes, the CEFR-level of the construction, etc.
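
        A minimal sketch of point (2) above, fine-tuning a BERT-based classifier to recognise sentences that instantiate a target construction, using the Hugging Face transformers and datasets libraries; the checkpoint name and the four labelled sentences are placeholder assumptions.

        from datasets import Dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        # Invented placeholder sentences labelled 1 if they instantiate the target construction, else 0.
        data = Dataset.from_dict({
            "text": ["kolm korda päevas", "ta läks koju", "viis eurot tükk", "sadas vihma"],
            "label": [1, 0, 1, 0],
        })

        model_name = "tartuNLP/EstBERT"  # assumed checkpoint name for an Estonian BERT model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="construction-classifier", num_train_epochs=3),
            train_dataset=data.map(tokenize, batched=True),
        )
        trainer.train()  # the fine-tuned classifier is then run over corpus sentences to harvest instances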

        Speakers: Heete Sahkai, Geda Paulsen, Ene Vainik, Jelena Kallas, Ahto Kiil, Katrin Tsepelina, Kertu Saul, Arvi Tavast
      • 3:00 PM
        The Mangalam Dictionary of Buddhist Sanskrit: automating lexicographic data with generative LLMs 30m

        This paper reports on recent advancements in the development of the Mangalam Dictionary of Buddhist Sanskrit, the first corpus-driven dictionary dedicated to Buddhist Sanskrit. This is a low-resource, historical, and domain-specific language variety instantiated in South Asian Buddhist literature dating from approximately the first millennium CE. The paper focusses on advances in the automation of this dictionary's data with generative Large Language Models (LLMs), with a view to sharing our solutions with scholars working with other low-resource historical languages. Specifically, the paper addresses the effectiveness and viability of leveraging latest-generation LLMs to automate three tasks that are central to our lexicographic work: semantic annotation of corpus sentences, identification of a headword's semantic prosody in different contexts, and comparison of a headword's synonyms. The paper first evaluates the relative performance of different commercially available models (including GPT-4.1, Sonnet 4, and Gemini 2.5) on a semantic tagging task and then details different approaches we experimented with for enriching our corpus with word-sense and semantic prosody tags using LLMs. It concludes with a brief discussion of commercial LLMs' ability to compare Sanskrit synonyms on the basis of corpus sentences.

        Speaker: Ligeia Lugli
      • 3:30 PM
        Why a dedicated dictionary device is more appropriate than an app for primary school learners 30m

        South Africa is in a literacy crisis, with learners not progressing in school because they are being taught in a second language when they are not functionally literate in their first language. Fewer than 10% of South Africans have English as a home language, but 90% of learners are being taught in English. Many South African schools are under-resourced and are not able to give learners the support they need. An e-dictionary has been designed to combat illiteracy amongst primary school learners. This dictionary contains audio for the pronunciation of the headword, meaning, and examples; hyperlinks connect semantically related entries; full colour illustrations illustrate every sense of every word; and home language translation equivalents of the headword are presented at each sense. These are some of the features that provide extra support for learners learning in their second language.

        In terms of the medium on which to supply an e-dictionary to learners, there are three options: an online dictionary accessible to anyone with a device and internet access; an app that is accessible to anyone with a smart phone or tablet; and a dedicated dictionary device that does not require electricity or access to the internet. Many people suggest that since almost all adults are in possession of a smart phone, an app would be the most obvious solution. This paper shows that for South African primary school learners living under the circumstances described above, a dedicated dictionary device is the better option. This conclusion is based on research that has been done in under-resourced primary schools in three provinces in South Africa. This research comprised classroom observations of Grade 5 and 6 learners using a model dictionary on a stand-in device; focus group discussions with learners who had been using these devices; interviews with class and language teachers; and interviews with South African literacy experts.

        The reasons given for the preference for a device over an app include, firstly, that it minimises distractions typically associated with smart phones and tablets, such as a camera and other apps. The device would need to be cost-effective, addressing the financial constraints faced by most South African schools, and it would need to be more robust than smart phones and tablets, to ensure durability in diverse and often challenging environments. These reasons were echoed by learners, teachers, and literacy experts. The paper will present the results of the research and show why a dedicated dictionary device is more suitable than an app for primary school learners.

        Speaker: Lorna Morris
    • 4:00 PM – 4:30 PM
      Coffee break 30m
    • 4:30 PM – 5:00 PM
      Group discussions: A - part 1 (Arnold hall)
    • 4:30 PM – 5:00 PM
      Group discussions: B - part 1 (Sonce hall)
    • 4:30 PM – 5:00 PM
      Group discussions: C - part 1 (Zrak hall)
    • 5:00 PM – 5:30 PM
      Group discussions: A - part 2 (Arnold hall)
    • 5:00 PM – 5:30 PM
      Group discussions: B - part 2 (Sonce hall)
    • 5:00 PM – 5:30 PM
      Group discussions: C - part 2 (Zrak hall)
    • 5:30 PM – 6:00 PM
      Group discussions: A - part 3 (Arnold hall)
    • 5:30 PM – 6:00 PM
      Group discussions: B - part 3 (Sonce hall)
    • 5:30 PM – 6:00 PM
      Group discussions: C - part 3 (Zrak hall)

    • 7:00 PM – 8:00 PM
      Reception (sponsored by Cambridge University Press) 1h (Lobby)

    • 9:00 AM – 10:00 AM
      Keynote: Keynote 2 (Arnold hall)

      Convener: Iztok Kosem
      • 9:00 AM
        LLMs and Lexicography at the Dutch Language Institute 1h

        The Dutch Language Institute (INT) has a long tradition of compiling historical and contemporary dictionaries and other types of lexicographic databases, mainly for Dutch but also for some other languages with a relation to Dutch. Lexicographic work at the institute is computer-supported, but there is still a great deal of manual work involved. Therefore, INT is exploring how new technologies (including LLMs) can be used for optimising different parts of lexicographic work without compromising data quality and reliability. After a brief overview of various pilot studies conducted at the institute, we will take a closer look at how we can make the implementation of Hanks’ Corpus Pattern Analysis procedure (as it is used in the context of the project Woordcombinaties) more intelligent. This way, we hope to ultimately realise Patrick Hanks’ vision that “it seems likely that a large part of the work that is currently being carried out by hand will be automated in the not-too-distant future” (Hanks 2013: 247).

        Speakers: Carole Tiberius, Jesse de Does
    • 10:05 AM – 10:30 AM
      Parallel sessions 1 (Arnold hall)

      Convener: Kristina Kocijan
      • 10:05 AM
        The lemma dilemma, Slovene version 25m

        In lexicography, one of the long-standing issues is understanding the nature of its core element of description commonly referred to as the headword (in DMLex and traditional lexicography), canonical form (in OntoLex and the Lexical Markup Framework – LMF), orthographic form (in the Text Encoding Initiative – TEI Lex0), lemma (in Wikidata), or lexical unit. With the transition from paper to digital environments, both the nature of this element and its description have evolved. At the heart of the “lemma dilemma” lies the relationship between form (particularly in logographic writing systems) and sense—the (description of a) concept intended to be meaningful to humans.

        In this paper, we describe how the headword/lemma phenomenon is addressed in the Digital Dictionary Database for Slovene (DDDS). The DDDS includes two types of lexical units: concepts and named entities. The latter are defined lexicographically in the same manner as concepts and are included in the DDDS due to the need to provide information on inflection, pronunciation, normative status, or other linguistic factors.

        Lexical units are mechanically divided into single lexeme units and multiword expressions (MWEs), based on their single-word or multi-word status in the Slovene writing system. Typologically, MWEs (excluding multiword named entities) are further divided into compounds and phrases.

        The ultimate goal of the DDDS is to compile all types of information about the Slovene lexicon in a single database with a unified data model. Like other Slavic languages, Slovene has a very rich morphology, which often presents a dilemma for lexicographers when choosing the most appropriate word form to represent a concept—i.e., the headword. The DDDS includes a vast number of word forms with morphological data, including pronunciation and stress. Currently, this number stands at 9,312,865.

        In the data model, a collection of morphologically linked word forms is defined as a LEXEME. According to this principle, a typical Slovene noun (associated with a unique LEXEME ID) includes 18 word forms, combining three grammatical numbers (singular, dual, plural) and six grammatical cases (nominative, genitive, dative, accusative, locative, instrumental).

        As of now, the DDDS contains 395,613 lexemes. When forming a LEXICAL UNIT—which adds the conceptual or semantic layer of description—one word form must be selected to represent the lexical unit. This selected form is traditionally considered the headword, canonical form, or lemma. Consequently, the same LEXEME ID can be used for multiple LEXICAL UNITS, even if different word forms serve as the "headword" for each.

        A practical example of this situation is a singular–plural noun pair where the same LEXEME ID and two different word forms are used as headwords to define two distinct concepts: "jajce" (Eng. egg, nominative singular) and "jajca" (Eng. testicles, nominative plural).
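
        An illustrative sketch of the lexeme / lexical-unit distinction described above; the field names are invented for illustration and do not reflect the actual DDDS schema.

        from dataclasses import dataclass, field

        @dataclass
        class Lexeme:
            """A collection of morphologically linked word forms sharing one LEXEME ID."""
            lexeme_id: int
            word_forms: dict = field(default_factory=dict)  # e.g. {"nominative singular": "jajce", ...}

        @dataclass
        class LexicalUnit:
            """A concept (or named entity) anchored to a lexeme through one selected headword form."""
            lexeme_id: int
            headword_form: str
            definition: str

        noun = Lexeme(lexeme_id=101, word_forms={"nominative singular": "jajce",
                                                 "nominative plural": "jajca"})
        # Two lexical units can share the same lexeme while selecting different word forms as their headword:
        unit_egg = LexicalUnit(lexeme_id=noun.lexeme_id, headword_form="jajce", definition="egg")
        unit_slang = LexicalUnit(lexeme_id=noun.lexeme_id, headword_form="jajca", definition="testicles")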

        In the paper, we will provide a more detailed explanation of these principles, supported by additional examples.

        Speakers: Polona Gantar, Cyprian Laskowski, Simon Krek
    • 10:05 AM – 10:30 AM
      Parallel sessions 2 (Sonce hall)

      Convener: Philipp Stöckle
      • 10:05 AM
        Lexicography and Generative Artificial Intelligence for contextualised meaning 25m

        The focus of this paper is on Generative Artificial Intelligence (GenAI), chatbots and some implications for lexicography and dictionary use. It has been well documented that chatbots originally tended to “hallucinate” if they did not have an answer to the prompt put to them. Much larger training databases have, however, been developed and chatbots have become more accurate. Multiple iterations of chatbots from a variety of companies have been released, including specialised chatbots for different environments. AI and chatbots have also been frequent topics in recent lexicographic research and have been employed in dictionary compilation and the preparation of writing assistants (cf., e.g., Li et al. 2023; De Schryver 2023; Fuertes-Olivera 2024; Lew 2024; Li & Tarp 2025). From a lexicographic perspective, the importance of linking between dictionaries and other information tools (cf., e.g., Bothma and Gouws 2022; Bothma and Fourie 2024; Bothma and Fourie 2025) also becomes relevant for lexicographic uses of chatbots.

        The use of GenAI as an information tool to provide information to end-users (readers) who have a specific information need when reading a text, i.e., a text reception information need, is discussed in detail. It has been shown that GenAI can provide content similar to a dictionary, but that it cannot provide contextualised answers, i.e., the reader is still dependent on their own evaluation of the GenAI-provided content to determine the meaning of the word or phrase in context. If sufficient context is provided in the prompt, the chatbot often provides only a single meaning / sense. If the chatbot misunderstood the context provided in the prompt, it could easily provide an incorrect meaning. If then queried (through a follow-up prompt) why it chose a specific meaning, it could not provide any explanation. Quite recently, however, this changed, and most chatbots now have two modes, a “search” mode and a “thinking / reasoning” mode, i.e., they are able to argue logically about their different proposed meanings in context and tend to offer a solution. This feature is discussed with reference to a number of examples containing specific keywords that determine the correct interpretation in context, as well as examples with potentially ambiguous part-of-speech and syntactic analyses, using two different chatbots, viz. ChatGPT o3-mini and DeepSeek-V3 (DeepThink-R1). Based on the limited number of examples, it seems that the chatbots can provide the correct contextual meaning and logically motivate their choice, drawing on critical analysis and reasoning skills typically associated with humans. Unfortunately, however, they still “hallucinate” if they have no answer, as will be shown from one non-lexicographic example, and the reader remains responsible for critically evaluating any GenAI responses – “lector caveat”. Nevertheless, in slightly more than two years, tremendous progress has been made, and one can only speculate what the next developments will be.

        These developments raise the question of what the role of dictionaries and of lexicographers will be in the future, in an AI-enhanced world. In conclusion, a few suggestions will be offered about lexicographic databases, appropriate interfaces, access to additional lexicographic and non-lexicographic data, refining dictionary definitions, multifunctional dictionaries, and the reuse of lexicographic information in different applications. The traditional role of dictionaries in documenting the status and history of a language remains a very important function and needs to be encouraged, especially in environments with limited language resources. However, exploring new commercial ventures and incorporating the latest technologies will be essential to the future of the discipline and the industry.

        Speakers: Theo J.D. Bothma, Rufus H. Gouws
    • 10:05 AM 10:30 AM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Markus Kunzmann
      • 10:05 AM
        An Electronic Ukrainian Dictionary as a Derussification Tool 25m

        Due to the policy of Russification in the 20th century, the Ukrainian language underwent an influx of Russianisms, among other forms of interference with its structure. Today, many Ukrainians require guidance regarding non-Russified usage, and a Large Electronic Dictionary of Ukrainian (VESUM, vesum.nlp.net.ua) is designed to meet this need. With a register of over 430,000 lemmas, it is the most comprehensive morphological dictionary of Ukrainian. VESUM contains over 9,300 Russianisms, listed alongside their non-Russified equivalents. The decisions on what counts as a Russified item in need of replacement are based on multiple reputable sources, including dictionaries on the r2u.org.ua dictionary portal.

        VESUM is the centerpiece of Pravopysnyk, the Ukrainian module of the LanguageTool text checker (check.nlp.net.ua, languagetool.org/uk). The role of VESUM is threefold. First, it supplies single-word Russified items and their replacements. Second, as a machine-readable dictionary, it serves as the source of data for lemmatization and morphological tagging, which are necessary for advanced text checking. Finally, VESUM can also be consulted as a stand-alone online dictionary via a web interface with flexible search options. As part of the Pravopysnyk tool, this electronic dictionary provides users with guidance on derussification when and where such advice is needed.
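
        As a purely illustrative sketch (not VESUM or Pravopysnyk code), the following Python fragment shows how a table of single-word Russianisms and their non-Russified equivalents might be used to flag items in a tokenized text; the two word pairs are a tiny invented sample.

        # Minimal illustration only: flag Russianisms in tokenized Ukrainian
        # text and suggest replacements. A real resource such as VESUM holds
        # thousands of lemma-level pairs; these two are sample entries.
        RUSSIANISMS = {
            "міроприємство": "захід",
            "слідуючий": "наступний",
        }

        def flag_russianisms(tokens):
            """Return (position, token, suggestion) triples for flagged tokens."""
            hits = []
            for i, tok in enumerate(tokens):
                suggestion = RUSSIANISMS.get(tok.lower())
                if suggestion:
                    hits.append((i, tok, suggestion))
            return hits

        for pos, tok, repl in flag_russianisms("Слідуючий пункт порядку денного".split()):
            print(f"{pos}: '{tok}' -> consider '{repl}'")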

        Speakers: Vasyl Starko, Andriy Rysin
    • 10:30 AM 11:00 AM
      Coffee break 30m Lobby

      Lobby

    • 11:00 AM 12:00 PM
      Parallel sessions 1 (Arnold hall) Arnold hall

      Arnold hall

      Convener: Kristina Kocijan
      • 11:00 AM
        Making Sense of the Past: AI-Assisted Historical Word Sense Disambiguation and the OED 30m

        This paper presents the Oxford English Dictionary’s (OED) current exploration into the application of artificial intelligence to historical Word Sense Disambiguation (WSD), a fundamental aspect of OED’s core research. Building on a longstanding tradition of technological innovation, the OED is investigating how Large Language Models (LLMs) can support the identification and retrieval of illustrative quotations that accurately reflect word sense usage through time – at present one of the most labour-intensive aspects of entry drafting.

        The quotation paragraph in OED entries provides readers with a curated timeline of usage, illustrating the emergence, evolution, and typical contexts of a word sense. Constructing these paragraphs requires editors to search historical corpora and databases for relevant material, disambiguate search results to isolate the targeted sense, then select quotations that are both representative and informative and meet OED’s selection criteria. This task is particularly complex when searching content from earlier time periods, where historical variation in spelling and inflection can further complicate retrieval. Editors currently construct complex iterative search strategies across databases such as Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), and Google Books, often crafting extensive Boolean queries to find relevant material.

        To address these challenges, the OED is developing an AI-assisted tool that leverages LLMs to retrieve quotations in specified senses from historical corpora. Rather than relying on manually constructed search strings, the tool allows editors to query the model in natural language, with the LLM returning candidate quotations that match the targeted sense. This approach has the potential to reduce reliance on collocational heuristics and to automate the handling of spelling and inflection variants, thus improving the efficiency and accuracy of quotation retrieval.
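
        As a purely illustrative sketch (not the OED's tool), the fragment below shows the general shape of such a natural-language, sense-targeted retrieval request sent to an LLM over a batch of corpus snippets via the OpenAI Python client; the prompt wording, the model name, and the snippet source are assumptions.

        # Illustrative only: ask an LLM which corpus snippets use a headword
        # in a specified sense. Prompt text and model name are placeholders.
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def find_quotations(headword, sense_gloss, snippets):
            """Return the model's list of snippet numbers matching the sense."""
            numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets))
            prompt = (
                f"Target word: {headword}\n"
                f"Target sense: {sense_gloss}\n"
                "Below are numbered snippets from a historical corpus "
                "(spelling may vary). List the numbers of the snippets that "
                "use the target word in the target sense, or answer 'none'. "
                "Do not guess.\n\n" + numbered
            )
            response = client.chat.completions.create(
                model="gpt-4o",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            return response.choices[0].message.content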

        The paper outlines the technical components of this initiative, including model selection and evaluation, data formatting strategies, prompt engineering strategies, and the quotation retrieval mechanism. Prototype applications are under development to test these components, primarily using EEBO as a foundational dataset. Initial testing reveals promising results, though challenges remain, particularly in mitigating LLM overconfidence and ensuring interpretive caution in ambiguous cases.

        In addition to supporting editorial staff, the OED is exploring how this tool can benefit subscribers to OED.com. Survey data from academic users indicates strong interest in expanded access to historical quotations, provided the tool is transparent, trustworthy, and well-cited. The paper gives a preview of how the tool might be accessed online, and discusses how the tool might grow from a “Minimum Viable Product” to something more powerful, whilst maintaining the distinction between viewing quotations that have been selected by editors and those that have been automatically retrieved by the tool. The paper concludes by reflecting on the broader potential of AI-assisted WSD in digital humanities research and lexicography, and outlines future directions for development, including expanded corpus coverage and enhanced user functionality.

        Speakers: Elinor Hawkes, Phoebe Nicholson, Will Rogers
      • 11:30 AM
        Bridging human and AI perspectives: semantic annotation of generic nouns in German 30m

        Generic nouns such as Sache and Ding pose a challenge for semantic annotation due to their referential underspecification and context-dependent meaning. Although frequently classified under categories like {artefact} or {object}, their actual referents often belong to abstract or cognitive domains, as in Der Placeboeffekt ist eines der faszinierendsten Dinge in der Welt der Medizin (‘The placebo effect is one of the most fascinating things in the world of medicine’). Drawing on valency grammar, this study shows that these nouns activate different argument structures depending on their syntagmatic environment, reflecting semantic flexibility and combinatorial variability. Lexical databases such as GalNet or GermaNet frequently assign multiple synsets to these nouns, illustrating their ontological ambiguity. This paper examines whether large language models (LLMs) can replicate this nuanced classification. Using a gold standard corpus annotated by linguists, we implement a two-step prompting strategy (supplying LLMs with predefined semantic tags and contextual windows) to test their performance. The results underscore the limitations of current LLMs in dealing with the lexical underspecification of generic nouns, even when provided with an extended context window. These findings contribute to ongoing discussions on the automation of semantic tagging and point to meaningful ways in which AI systems can complement human expertise in natural language processing tasks.

        Speakers: Iván Arias-Arias, Elena Martín-Cancela
    • 11:00 AM 12:00 PM
      Parallel sessions 2 (Sonce hall) Sonce hall

      Sonce hall

      Convener: Philipp Stöckle
      • 11:00 AM
        Exploring the power of generative artificial intelligence for automatic term extraction from small samples 30m

        This study explores the use of several chatbots based on recent generative large language models for automatic term extraction (ATE) from smaller text samples. The samples were selected from three domains: board games, ice hockey, and kitesurfing; and they cover three languages: English, French, and Portuguese. We used four prompting strategies: zero-shot, one-shot, few-shot, and few-shot with context. A single prompt with placeholders for language, domain and examples (when available) was used for all settings, and, in the case of French and Portuguese, we tested the ATE prompt in English and in the respective language. Results were calculated in terms of f-measure, and we further tested the best models with five consecutive runs to calculate a mean f-measure and a standard deviation. No single best system emerged for the task: each of the domains and languages had a different best system. In terms of prompting strategy, more information did not always lead to better results, as zero-shot and one-shot attempts had the best results in several scenarios. The main contribution of the study is an overview of the ATE capacity of several chatbot systems across multiple scenarios.
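
        As a purely illustrative sketch (not the authors' pipeline), the following Python fragment shows one way to fill a single prompt template with placeholders for language, domain, and optional examples, and to compute the f-measure against a gold term list together with the mean and standard deviation over repeated runs.

        # Illustrative only: a prompt template with placeholders and the usual
        # precision/recall/F1 computation over sets of extracted terms.
        import statistics

        PROMPT_TEMPLATE = (
            "Extract all domain-specific terms from the following {language} "
            "text about {domain}. Return one term per line.{examples}\n\n"
            "Text:\n{text}"
        )

        def build_prompt(language, domain, text, examples=None):
            ex = ""
            if examples:  # one-shot / few-shot settings
                ex = "\nExamples of terms: " + ", ".join(examples)
            return PROMPT_TEMPLATE.format(language=language, domain=domain,
                                          examples=ex, text=text)

        def f_measure(predicted, gold):
            predicted, gold = set(predicted), set(gold)
            if not predicted or not gold:
                return 0.0
            precision = len(predicted & gold) / len(predicted)
            recall = len(predicted & gold) / len(gold)
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        def summarise_runs(run_predictions, gold):
            """Mean and standard deviation of F1 over repeated runs."""
            scores = [f_measure(p, gold) for p in run_predictions]
            return statistics.mean(scores), statistics.stdev(scores)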

        Speakers: Lena De Pourcq, Marie Grégoire, Leonardo Zilio
      • 11:30 AM
        Lexicom at 25: reflections on the changing world of lexicography and language technology 30m

        In this paper we show how the academic content and computational tools featured in Lexicom form a parallel history of the last 25 years of innovation in lexicography. Lexicom is a 5-day intensive workshop offering hands-on training in corpus-based dictionary creation, from collecting and annotating language data to publishing the final product. Since it was launched in 2001 by Sue Atkins, Adam Kilgarriff, and Michael Rundell, Lexicom has adapted (sometimes incrementally, sometimes substantially) to reflect ongoing developments in linguistic theory, corpus tools, and NLP. Lexicom’s curriculum integrates theoretical grounding with practical tasks such as corpus analysis, regular expressions, word sense disambiguation, and definition-writing. It provides an introduction to all of the key components of dictionary creation and to the current state of the art in our field. The lexicographic landscape has seen transformative changes during Lexicom’s 25-year lifetime. In 2001, corpora were relatively small even for well-resourced languages and non-existent for others; querying tools were quite basic; and the end-product was almost invariably a printed book. We now use billion-word corpora and sophisticated software to produce mainly digital dictionaries. Lexicom has mirrored these shifts, most recently incorporating AI and large language models. Amid all these dramatic changes, some constants in the dictionary-making process remain, and Lexicom continues to serve as both a reflection of and a guide through this ongoing evolution.

        Speakers: Michael Rundell, Miloš Jakubíček, Vojtěch Kovář, Ondřej Matuška, Michal Cukr
    • 11:00 AM 12:00 PM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Markus Kunzmann
      • 11:00 AM
        Navigating linguistic diversity: modelling diatopic and bibliographic information with TEI Lex-0 30m

        The Vienna Corpus of Arabic Varieties (VICAV) is a digital research infrastructure for the documentation and analysis of the linguistic diversity of Arabic varieties. Integrating methods from language technology and the digital humanities, VICAV provides a modular, sustainable platform for the creation, management, and publication of heterogeneous language resources within a shared data architecture (Budin et al. 2012; Moerth et al. 2015). At its core lies a commitment to openness, interoperability, and adherence to community standards, in particular the Guidelines of the Text Encoding Initiative (TEI Consortium 2025). Through a text-centered, standards-based design, VICAV enables the representation of diverse types of data—including an extensive bibliography, linguistic profiles, sample texts, and digital dictionaries—within a unified technical framework and a user-friendly web application (https://vicav.acdh.oeaw.ac.at).

        Among VICAV’s key components are dictionaries of four Arabic varieties—Baghdad, Cairo, Damascus, Tunis—next to a dictionary of Modern Standard Arabic which mainly serves as a point of reference for the others (Procházka & Moerth 2015). These compact lexical databases, containing up to 8,000 entries each, provide structured lexicographic information enriched with English translations and, in some cases, also German, French, or Spanish. All are built on a shared TEI-based model ensuring consistent encoding and comparability across varieties.

        The newest addition to the VICAV family of lexicographic resources is the SHAWI Dictionary, developed within the SHAWI Project (The Shawi-type Arabic dialects spoken in South-eastern Anatolia and the Middle Euphrates region, FWF P-33574, 2021–2027). The project investigates the varieties spoken by Bedouin communities in Turkey, Syria, Lebanon, and Iraq—which have so far received little systematic attention in linguistic research. These dialects display internal variation with a significant geographic and sociolinguistic distribution—dimensions that require fine-grained modelling beyond the capabilities of standard TEI constructs. The SHAWI Dictionary, scheduled for a beta release in late 2025, represents the first VICAV dictionary encoded entirely in TEI Lex-0, a refinement of the TEI Dictionary Module developed by the DARIAH Working Group on Lexical Resources which aims at harmonizing the representation of lexical data and facilitating interoperability across projects (Tasovac et al., 2018ff.).

        The adoption of TEI Lex-0 allows for both greater formal consistency and project-specific adaptability. The SHAWI Dictionary extends Lex-0 through the TEI mechanism of ODD chaining (Rahtz 2014), producing a VICAV-wide generic dictionary schema that forms a common backbone for future resources. The SHAWI Dictionary’s project-specific adaptation of this schema introduces several innovations:

        (1) Encoding structures for diatopic and sociocultural variation: The element <usg type="geographic"> serves as a wrapper for embedded <name> elements for places and tribes alike, which are further linked to entities in local reference resources established in the project WIBARAB (What is Bedouin-Type Arabic?, 2021–2026; ERC 101020127-WIBARAB).

        (2) Refined bibliographic integration: While TEI Lex-0 (and TEI P5) support citation of sources at the dictionary level, this is too coarse-grained for the needs of the SHAWI dictionary. To address this, <entry> elements in the SHAWI customization may include a <listBibl> element which contains placeholders for records from the VICAV bibliography. This allows for the addition of context-specific bibliographic details (like page numbers or comments) while at the same time avoiding multiplication of bibliographic information.

        (3) Extended encoding of features specific to Arabic varieties: So far, the TEI Lex-0 specification offers no dedicated mechanism for representing morphological structures characteristic of Semitic languages. The SHAWI customization therefore introduces new attribute values for @type on <gram> to capture phenomena such as root-based derivation, morphological patterns, and verbal stem classes.
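
        As a purely illustrative sketch (not taken from the SHAWI Dictionary), the hypothetical fragment below combines the three kinds of extension described above, using only the element names mentioned in this abstract; all attribute values and content are invented, and the fragment is simply parsed with Python's standard library to show the nesting.

        # Hypothetical TEI Lex-0-style fragment; element names follow the
        # abstract, everything else (attribute values, content) is invented.
        import xml.etree.ElementTree as ET

        entry_xml = """
        <entry>
          <form type="lemma"><orth>...</orth></form>
          <gramGrp>
            <gram type="rootDerivation">...</gram>
          </gramGrp>
          <usg type="geographic">
            <name type="place">...</name>
            <name type="tribe">...</name>
          </usg>
          <listBibl>
            <bibl corresp="#vicav-bibl-record">p. 123</bibl>
          </listBibl>
        </entry>
        """

        root = ET.fromstring(entry_xml)
        for usg in root.iter("usg"):
            # print the usage type and the embedded place/tribe names
            print(usg.get("type"), [n.text for n in usg.iter("name")])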

        By applying the TEI Lex-0 schema in a dialectological context, the SHAWI Dictionary demonstrates the adaptability of community standards to non-Indo-European linguistic data. It contributes both to the ongoing consolidation of digital lexicographic practices and to the sustainable documentation of previously underdescribed Arabic varieties, offering an example of how TEI-based infrastructures can bridge linguistic research, digital humanities, and language technology.

        Speakers: Veronika Engler, Karlheinz Mörth, Stephan Procházka, Michaela Rausch-Supola, Daniel Schopper
      • 11:30 AM
        Artificial intelligence in English dictionary entries compiled in Slovar krajšav 30m

        The article describes the use of artificial intelligence in compiling the English dictionary entries for a dictionary of abbreviations (Slovar krajšav), published in 2025 and financed by the Slovenian Research and Innovation Agency (ARIS). Together with the Slovenian dictionary of abbreviations (Slovenski slovar krajšav), published in 2023, it adopted a pioneering approach to dictionary compilation in Slovenia: the two are the first contemporary Slovenian dictionaries of abbreviations. Slovar krajšav was compiled on the basis of an analysis of the characteristics of English dictionary entries for abbreviations and following the compilation process used for bilingual dictionaries. It comprises entries in over 20 languages, the most frequent being English, Italian, French, etc.

        In the article, we focus on the compilation of the English dictionary entries and on the use of artificial intelligence in that process, namely the Krajšavar algorithm, and we show the need for a dictionary of abbreviations compiled for the Slovenian language. Dictionaries of abbreviations for Slovenian are presented in a synchronic and diachronic framework (cf. Kompara Lukančič 2018): two outdated dictionaries, Kratice (Župančič 1948) and Rečnik jugoslovenskih skraćenica (Zidar 1971); two more recent online dictionary attempts, Slovarček krajšav (Kompara Lukančič 2006) and Slovar krajšav (Kompara Lukančič 2011); and the most recently published Slovenian dictionary of abbreviations, Slovenski slovar krajšav (Kompara Lukančič 2023). The latter led to the compilation of the dictionary of abbreviations Slovar krajšav, a collection of 3,500 alphabetically ordered dictionary entries and over 4,200 expansions gathered in a single volume encompassing over 20 foreign languages.

        The article presents the overall compilation of Slovar krajšav and outlines and discusses examples of dictionary entries for English abbreviations. As the examples show, a dictionary entry is composed following the compilation process used in the previously published dictionaries Slovarček krajšav (Kompara Lukančič 2006), Slovar krajšav (Kompara Lukančič 2011) and Slovenski slovar krajšav (Kompara Lukančič 2023), coupled with the characteristics of a range of English dictionaries of abbreviations (Kompara Lukančič 2009, 2018). The compilation process took almost two decades and included the application of several algorithms for lemmatisation, language detection, and the automatic recognition of abbreviations. In the final stages, the dictionary was compiled manually and with the help of AI, which made it possible to include abbreviations from specialised fields as well as relevant abbreviations obtained from a range of texts following the text typology and the Krajšavar algorithm. Together with Slovenski slovar krajšav (Kompara Lukančič 2023), Slovar krajšav therefore represents an important work in the linguistic treatment of abbreviations for the Slovenian language.

        Speaker: Mojca Kompara Lukančič
    • 12:00 PM 1:00 PM
      Poster and demo session Lobby

      Lobby

      • 12:00 PM
        A Bazaar Among Cathedrals – Leveraging Wikidata as an Open Marketplace for Lexicographic Data 1h

        POSTER

        Eric Raymond’s influential essay (Raymond 1999) on community-based software development as practiced in the Open Source movement, as opposed to the previously dominant, closed, top-down approach mostly preferred in the commercial realm, also proved instructive for the Wikiverse. Its flagship project Wikipedia, with a comparable approach to knowledge production and dissemination, disrupted the market of encyclopedic offerings to the extent that it became the primary source of information in that context, driving previous commercial market leaders out of business. While Wiktionary, the lexicographic equivalent of Wikipedia, did not have the same effect on its established competitors, it has drawn considerable academic interest as a lexical resource, from favorable comparisons with controlled or closed-source resources (Meyer and Gurevych 2010; 2012), through integrations with such resources (McCrae, Montiel-Ponsoda, and Cimiano 2012), to its conversion and augmentation as a comprehensive, multilingual Linked Open Data resource in its own right (Sérasset 2015). The Wikiverse picked up this research-driven development of structured, machine-readable lexical datasets by incorporating lexicographic information in Wikidata (Lindemann 2025), basing the data model in turn on Ontolex Lemon, the lexicon model for ontologies, which originated in a research collaboration.

        The Digitales Wörterbuch der deutschen Sprache (DWDS) wanted to further explore this relationship between the academic realm on the one hand, with its lexicographic projects more akin to Raymond’s cathedrals, and the bazaar-like, dynamic, community-driven approach on the other, which informs the construction of Wikidata’s knowledge graph. In January 2023 the DWDS conducted a data donation of about 185,000 German lexemes to Wikidata. In line with previous studies (Kosem et al. 2021), the facts donated to Wikidata comprised lexical information most likely to be liberally licensed by projects like the DWDS (lexical category, written representations, grammatical features), while other copyrighted information (sense glosses, etymology, etc.) was deliberately excluded. The poster presents the challenges of this data donation, for example impedance mismatches between the different data models, organizing support in the community, and overcoming technical obstacles. It also reports on the first results: since the initial data import two years ago, the German lexeme inventory of Wikidata has grown to over 200,000 entries. By now it registers over 550,000 links from those entries to external lexical resources besides the DWDS, and, last but not least, over 11,000 community-contributed links to concepts on the sense level, which in turn link to about 175,000 lexemes in other languages.

        Speaker: Gregor Middell
      • 12:00 PM
        Accelerating the lexicographic process with automatic methods and AI 1h

        POSTER

        Writing dictionary entries is not only time-consuming but also an expensive process due to the highly specialized knowledge and experience required of the lexicographer. To facilitate the task of compiling the Danish monolingual dictionary DDO (ordnet.dk/ddo), we aim to establish an automatic assistant based on applied language technology (e.g. n-gram analysis, word embeddings, etc.) and generative AI. DDO contains 105,000 lemmas and is continuously updated with new lemmas twice a year. In this presentation, we focus on morphological and phonetic information in the dictionary, on synonyms and finally on an experiment with automatic writing of definitions.

        The assistant, which we have named the Article Accelerator, automatically generates XML-tagged drafts of the subsections of a complete dictionary article in DDO. When the assistant gets a new word for the dictionary as input, it will automatically present suggestions for inflection, phonetic transcription, and synonyms. We assume that most new words in our case are compound nouns. In Danish, these are usually written together as a single word, and we therefore base the suggestions on a compound splitter. If the final part of the compound is already described in the dictionary, the assistant extracts the inflection paradigms from the relevant entry or entries, and the user (i.e. the lexicographer) can then choose the appropriate one. Likewise, the assistant extracts the phonetic transcription for all subparts of a compound word that can be found in the dictionary. Lastly, synonyms are found by using both word embeddings and an LLM to get a list of synonym candidates. If a selected candidate already exists in the dictionary, the assistant can help create the necessary links and ID numbers.
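
        As a purely illustrative sketch (not the Article Accelerator itself), the following Python fragment shows dictionary-based compound splitting and reuse of the final element's inflection paradigm; the headwords, paradigms, and handling of compounds are invented simplifications (linking elements such as -e- or -s- are ignored).

        # Illustrative only: split a Danish compound against known headwords
        # and reuse the inflection paradigm of its final element.
        LEXICON = {
            "kaffe": {"pos": "sb.", "inflection": ["-n"]},
            "maskine": {"pos": "sb.", "inflection": ["-n", "-r", "-rne"]},
        }

        def split_compound(word, lexicon):
            """Return (modifier, head) if both parts are known headwords, else None."""
            for i in range(2, len(word) - 1):
                first, rest = word[:i], word[i:]
                if first in lexicon and rest in lexicon:
                    return first, rest
            return None

        def suggest_paradigm(word, lexicon):
            parts = split_compound(word, lexicon)
            if parts:
                head = parts[1]
                return {"head": head, "inflection": lexicon[head]["inflection"]}
            return None

        print(suggest_paradigm("kaffemaskine", LEXICON))
        # -> {'head': 'maskine', 'inflection': ['-n', '-r', '-rne']}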

        The core of the Article Accelerator, however, is the module that generates suggestions for sense definitions based on existing definitions for semantically similar or related senses in the dictionary. These are found by combining compound splitting with a word embedding model. However, it is the user (i.e. the lexicographer) who selects the final list of senses, which are then included in the input to a generative model.

        The goal is for the model to produce new definitions that reflect the style of the dictionary and require only minimal post-editing by the lexicographer. To find the optimal combination of prompt and generative model, we perform an experiment with fully edited but unpublished monosemous lemmas from DDO. We test two different prompts on three models (ChatGPT 4o, Claude 3.7 Sonnet, Llama 4 Scout) and manually compare each model's output with the definition written by a lexicographer.

        The manual evaluation is carried out by two experienced lexicographers. This gives us knowledge about the quality of the automatic definitions and puts us in the best position to choose the ideal prompt and model.

        Speakers: Nathalie Norman, Nicolai Hartvig Sørensen, Jonas Jensen, Kirsten Appel, Sanni Nimb
      • 12:00 PM
        CJVT Igre: New Word Games Based on the Digital Dictionary Database of Slovene 1h

        DEMO

        CJVT igre (https://igre.cjvt.si/) is a new digital platform offering word games designed to foster lexical awareness and engagement with standard Slovene. Developed by the Centre for Language Resources and Technologies at the University of Ljubljana, the portal currently hosts three games—Cvetka, Besedolov, and Vezalka—with two more in development. Each game utilizes curated lexical data from the Digital Dictionary Database of Slovene, enhanced through targeted lexicographic work to ensure playability, thematic coherence, and age-appropriateness. This includes refining word lists, rating difficulty, and enriching entries with semantic metadata. Cvetka focuses on orthographic guessing tasks with daily thematic prompts, Besedolov on semantic word search challenges within 11x11 grids, and Vezalka on word formation from a constrained letter set. Designed for both educational and general audiences, the games integrate varying levels of difficulty, optional hints, and dynamic scoring. This paper showcases the platform’s interface, gameplay mechanics, and the linguistic and technical adaptations required to transform lexicographic resources into effective digital games.

        Speakers: Špela Arhar Holdt, Iztok Kosem
      • 12:00 PM
        Comparative Analysis of Medical Adjectives in Croatian General Dictionaries 1h

        POSTER

        The representation of medical adjectives in Croatian general dictionaries reveals significant inconsistencies, reflected in uneven lemma inclusion, ambiguous or absent domain labels, and limited definitional precision. This paper analyzes the 80 most frequent adjectives, based on corpus data from the Croatian Medical Corpus (CMC) (Kocijan, Kurolt & Mijić, 2020), in the three major Croatian general dictionaries: Veliki rječnik hrvatskoga standardnog jezika (2015), Hrvatski enciklopedijski rječnik (2002), and Rječnik hrvatskoga jezika (2000). The analysis focuses on lemma status, the presence of domain labels, and the accuracy of definitions.

        To contextualize the Croatian practice, the study includes a brief comparison with the Merriam-Webster Dictionary (2025), which demonstrates better lemma coverage and more terminologically informed definitions, but also exhibits inconsistencies that reflect the broader challenges of systematically representing medical adjectives in general lexicography.

        The paper's findings reveal inconsistencies in Croatian lexicographic practice and highlight the need for more conceptually grounded, corpus-based approaches that integrate terminological precision with lexicographic usability.

        Speakers: Martina Pavić, Daša Farkaš
      • 12:00 PM
        Digitalization of Romanian dictionaries 1h

        POSTER

        Recently, the digitization of resources of any type has become an increasingly discussed topic. In the linguistic field, lexicography is among the most influenced by this process, with digital dictionaries playing an essential role both for online consultation by specialists and for the automatic development of useful resources in natural language processing, as well as downstream applications.

        The first dictionary automatically digitized by the “Iorgu Iordan - Alexandru Rosetti” Institute of Linguistics and made available to the public is the Etymological Dictionary of Romanian (https://delr.lingv.ro). It was parsed only shallowly, to make it possible to search by the headword of a lexical entry, its variants, and words from the same lexical family. It was developed primarily as a proof of concept for the automatic parsing of entries in dictionaries developed traditionally and originally intended only for printing.

        The third edition of the Orthographic, Orthoepic and Morphological Dictionary of the Romanian Language (DOOM3) was produced by the Institute, initially also in printed format. Shortly after the print edition was launched, the idea of making it accessible online to the general public, in a format that meets the current needs of users (i.e., quick access on mobile devices), led to its publication on the Internet (https://doom.lingv.ro), in a manner that allows for regular searches (by the title word) but also advanced ones (for example, by combining the various types of linguistic information represented in the dictionary: parts of speech, grammatical categories, language of origin, register, variants, etc.). The latter was made possible by the deeper parsing of its entries. The entire theoretical apparatus that precedes the dictionary itself in the printed version, i.e. the Introductory Study, is also accessible online, and working with it is facilitated by the possibility of automatically searching its content.

        The online version is a more complex tool than the printed dictionary, because it implements a mechanism for suggesting correct forms when the user enters, in the search bar, a misspelled word or forms that are no longer recommended/accepted by the norm.
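
        As a purely illustrative sketch (not the DOOM3 implementation), the following Python fragment shows how such a suggestion mechanism can be approximated with approximate string matching over the headword list; the headwords are a small invented sample.

        # Illustrative only: suggest the closest headwords when a query does
        # not match any entry exactly, using the standard library.
        from difflib import get_close_matches

        HEADWORDS = ["abandon", "abandona", "abanos", "abataj"]  # sample only

        def suggest(query, headwords, n=3):
            if query in headwords:
                return [query]
            return get_close_matches(query, headwords, n=n, cutoff=0.6)

        print(suggest("abandonn", HEADWORDS, n=2))  # -> ['abandon', 'abandona']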

        Following the success of the digital edition of the Orthographic, Orthoepic and Morphological Dictionary among students, specialists, teachers, and the general public, the Institute invested effort in the digitalization of the new edition of the Romanian Language Dictionary (DLR). A new graphical interface has recently been created. For the moment, searches can only be made by the title word and are of several types: exact search, search with/without diacritics, and search with prefixes or suffixes using the special characters * and ? (for example ab* for prefixes and *tor for suffixes). The dictionary article contains several dynamic elements, especially regarding quotations, which are displayed compactly. Upon request, the user can see all quotations of a meaning or hide them completely for a synthetic view of the semantic tree (see https://dlr-test.lingv.ro/cautare/abandon). It is also possible to browse through the list of all words or download the list of words when searching with prefixes or suffixes.

        In the future, we would like to add an advanced search by criteria such as part of speech and register/usage, and to consider making other lexicographic resources available online.

        The method used to transpose the printable format into the online version is the same for all three dictionaries, despite the fact they have different structures.

        Speaker: Mititelu Catalin
      • 12:00 PM
        Exploring Derivational Families through Intelligent Lexicography 1h

        POSTER

        This paper presents a novel approach to exploring derivational families within the framework of Intelligent Lexicography, using the ŠKOLARAC corpus: a collection of Croatian school essays written by L1 learners (native-speaking students) in grades 5 through 8 and enriched with metadata such as gender, grade level, and region. By combining rule-based linguistic processing in NooJ, a linguistic development environment for formalizing morphological and syntactic patterns, with tailored morphological procedures for Croatian, the study identifies and maps derivational networks of three pedagogically relevant lexical morphemes (CRT, PIS, and RAD) tracing their associated inflected and derived forms as they appear in young learner corpora. The extracted data are visualized using radial graphs, butterfly charts, and hierarchical structures, enabling a multifaceted analysis of morphological productivity and lexical variation. This integrated workflow demonstrates how intelligent tools can enhance lexicographic practice by uncovering deep morphological relationships in authentic learner language. The findings support the development of adaptive, learner-sensitive lexicographic resources with applications in linguistics, language education, and curriculum design, particularly in the context of developing digital dictionaries and vocabulary tools tailored to young learners.

        Speakers: Krešimir Šojat, Kristina Kocijan
      • 12:00 PM
        Handling abstract constructions in a dictionary-based constructicon 1h

        POSTER

        Taking seriously the common construction grammar statement that “it’s constructions all the way down” (Goldberg, 2006: 18), the Hungarian Constructicon aims to encompass the widest possible range of constructions. As it is a dictionary-based constructicon, it naturally contains what a dictionary can provide — from morphemes to words, and to partially schematic multiword constructions containing open slots. What had been missing were the more schematic abstract constructions. In this paper, we have added some important constructions of this kind to the database of the constructicon as an experiment, and have enhanced the integrated analyzer tool to handle them appropriately. Now, the system has the machinery to recognize all types of constructions in text and display them to the user. Thanks to the integration of abstract constructions, it does not present constructions in isolation; instead, it reveals their intertwined nature, their connections and interactions. This results in a fundamentally extended functionality compared to a dictionary. A case study in Section 5 demonstrates the capabilities of the system. The list of integrated abstract constructions is far from complete; expanding it remains future work.

        Speakers: Bálint Sass, Éva Dömötör, Balázs Indig, Mátyás Lagos Cortes, Veronika Lipp, Márton Makrai, Gergely Pethő
      • 12:00 PM
        Implementing Frames in the Phrase-based Active Dictionary: why Frames are needed but FrameNet can only be a partial solution 1h

        POSTER

        This paper explores the differences between the Phrase-based Active Dictionary (PAD) and FrameNet in their approaches to meaning representation, focusing on the verbs agree and follow. The PAD, a component of the PhraseBase project, adopts a splitting-friendly methodology that emphasizes granularity and ontological consistency, ensuring a more comprehensive coverage of polysemy. In contrast, FrameNet prioritizes broader conceptualization, often leaving finer distinctions unaddressed. Through a detailed matching process, this analysis reveals that several senses traced in the PAD are not covered or not distinguished in FrameNet, highlighting the need for an extended concept of Frame. The proposed extension of the system includes increased granularity, the incorporation of encyclopedic knowledge by using ostensive aids, and cultural sensitivity. These enhancements would improve the visual representation of Frames or enhance their representation potential, making them more accessible and informative for users of the PAD. The paper concludes by addressing open questions about the systematic implementation of these extensions and their implications for linguistic analysis and lexicographic practice. By combining theoretical insights with practical applications, the PAD aims to offer a model for deepening meaning representation for advanced language learners and translators.

        Speaker: Laura Rebosio
      • 12:00 PM
        Introducing DigiMet: a Psycholinguistic Database for Croatian Multi-Word Expressions 1h

        POSTER

        The lack of normative resources for the Croatian language has prompted the development of a novel resource that would not only compile normative data for Croatian but also focus on an underrepresented group of linguistic units – figurative multi-word expressions (MWEs). The creation of a normative database for figurative MWEs in Croatian is thus a significant step in the right direction that will address the gap in the availability of such tools for the Croatian language.

        There are currently several normative databases available for Croatian single words, such as the Croatian Psycholinguistic Database (Peti Stantić et al., 2021), psycholinguistic databases of affective norms and emotions (Ćoso et al., 2019; 2023), and the database of norms for non-adapted English words ENGRI CROWD (Bogunović et al., 2024). Given that all of the above sources contain normative data for individual words, a need arises to create a similar tool that would provide norms for multi-word units. There is currently only one such database available: the COMETA database (Citron et al. 2020) of affective and psycholinguistic norms for German conceptual metaphors, an open-access database featuring norms for emotional valence and arousal, imageability, and metaphoricity for conventional metaphors in both sentence and story contexts.

        This is why the DigiMet database is being developed as a tool that will systematically catalog affective and lexico-semantic norms for Croatian metaphors along six dimensions: 1. valence, 2. arousal, 3. concreteness, 4. imageability, 5. metaphoricity, and 6. familiarity. The collection of norms will be carried out on a minimum sample of 500 native Croatian speakers using online distribution platforms such as SurveyMonkey. For this purpose, a combination of contrastive corpus research and manual data checking was carried out in the initial research phase. Using the MetaNet.HR database and corpus searches in SketchEngine (SkE) (hrWaC 2.2, MaCoCu, enTenTen21), metaphors detected in Croatian (L1) and English (L2) and related MWEs were selected (verb-noun collocations were chosen as a representative form of MWEs due to their proven productivity across languages). Lexical-semantic data on metaphorical MWEs were also extracted.

        The DigiMet database, in its final form, will be the first openly accessible repository of metaphor norms for the Croatian language and the first database of affective and lexical-semantic data for Croatian multi-word expressions. This resource will enable further cross-linguistic comparisons and interdisciplinary experimental research.

        Speaker: Jasmina Jelčić Čolakovac
      • 12:00 PM
        lexicographR: R infrastructure to develop and deploy digital dictionaries from scratch 1h

        DEMO

        This demo introduces lexicographR (citation withheld for anonymization), a prototype computer application aimed at facilitating the creation of digital dictionaries for scholars working in low-tech environments, where access to programming skills is severely hindered by lack of funding, institutional support and technical training. Based on recent user-surveys (Lugli 2024b), these scholars are typically domain experts or language teachers without formal training in lexicography and work on specialized dictionaries pertaining to their area of expertise. As such, they are often not aware of best practices and current methods in lexicography. Few use corpora and many have been writing their dictionaries in Word or Excel files, which makes it harder for them to automatically integrate new lexical data from corpora into their existing work. They typically struggle to deploy their lexicographic output as interactive online resources, and perceive existing free-of-charge digital dictionary development solutions, such as Lexonomy and Living Dictionaries (Daigneault and Anderson 2023; Měchura 2017), as insufficiently customisable for their highly specialized dictionaries and the specific needs of target audiences (Lugli 2024). The demo will first discuss the results of our user surveys and user-need identification process. It will then briefly discuss our development philosophy, which, given the ephemeral nature of interfaces and web-technologies, prioritizes lowering the costs and technical barrier to the creation of machine-readable and re-usable dictionary data over the development of digital interfaces. Still, to foster the dissemination of dictionary data among strata of the population who are less used to interacting with data directly, we have also provided a simple way to build flexible and lightweight interfaces to deploy dictionary data online as interactive digital dictionaries.

        The core of the demo will consist of a demonstration of lexicographR's main functionalities, each of which is designed to assist with a specific lexicographic task:
        1. conversion of pre-existing dictionary data from Word, Excel, csv/tsv and FLEx, CoNLL-u and vrt/vert files into JSON.

        2. processing corpus data from CoNLL-u, vrt/vert, csv/tsv, FLEx and plain text and extracting corpus frequencies and distribution information for each dictionary headword (see the sketch after this list).

        3. extracting collocations from the corpus for each dictionary headword.

        4. extracting from the corpus for each dictionary headword.

        5. creating data-visualizations for the information extracted from the corpus as well as for pre-existing dictionary data.

        6. designing a dictionary interface and generating the files necessary to publish the pre-existing dictionary data (potentially augmented with information extracted from the corpus and data-visualization) as either a Shiny app or a Quarto book.

        7. converting the dictionary data published in the digital dictionary to JSON-LD for release in online data repositories, such as Zenodo or figshare.
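
        lexicographR itself is an R application; as a language-agnostic illustration of the corpus-frequency step in point 2 above, the following Python fragment counts how often each dictionary headword occurs as a lemma in a vertical (.vert) corpus file. The column layout and file names are assumptions.

        # Illustrative only: per-headword lemma frequencies from a vertical
        # corpus file (one token per line, tab-separated columns).
        from collections import Counter

        def headword_frequencies(vert_path, headwords, lemma_column=1):
            """Count how often each headword occurs as a lemma in a .vert file."""
            wanted = set(headwords)
            counts = Counter()
            with open(vert_path, encoding="utf-8") as fh:
                for line in fh:
                    if line.startswith("<") or not line.strip():
                        continue  # skip structural tags and blank lines
                    cols = line.rstrip("\n").split("\t")
                    if len(cols) > lemma_column and cols[lemma_column] in wanted:
                        counts[cols[lemma_column]] += 1
            return counts

        # freqs = headword_frequencies("corpus.vert", ["dharma", "sutra"])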

        The paper will conclude with an overview of some of the dictionaries that have been created using the lexicographR app.

        Speaker: Ligeia Lugli
      • 12:00 PM
        Presenting verbal aspect data in a learner’s dictionary: Devices and usage scenarios 1h

        POSTER

        A system of lexicographic presentational devices for data on verbal aspect has been developed that is aimed at providing advanced foreign language learners of English, German or Italian with data for individual verbs and their different readings. It is part of a monolingual, production-oriented electronic dictionary, the Phrase-based Active Dictionary (DiMuccio-Failla, 2025; DiMuccio-Failla & Giacomini, 2022).

        Verbal aspect is understood here as the way in which speakers structure events and situations in language with regard to their boundaries (Sasse, 2002, p. 201). It is a conceptual category that is language-specific (Dessì Schmid, 2014), which means that providing data on verbal aspect can be beneficial for foreign language learners. Verbal aspect is expressed by the verb and its combination with linguistic devices, e.g. adverbials and tense, and it is tied to individual verb readings: every verb reading has its characteristic set of ‘aspectual properties’ from a semantic as well as a syntactic point of view. For analysis, aspectual properties can be subsumed under more general aspectual classes (i.a. Vendler, 1957; Mourelatos, 1978; Croft, 2012).

        The suggested system of presentational devices for verbal aspect consists of: 1) a visual representation of the aspectual class and corresponding semantic properties of the verb reading, 2) combinatorial options (adverbials, verbs and tense), 3) usage notes with explanations on semantic and/or syntactic particularities and 4) aids for disambiguating similar verb readings. The devices provide a range of data for the targeted user group of advanced language learners and are placed in different parts of the dictionary’s article structure: The visual representations and combinatorial options are given alongside every verb reading. The usage notes are tied to the specific items the explanations refer to. The aids for disambiguating similar verb readings contain a link to their similar counterpart. Each type of device is associated with a symbol and the symbols are placed in the dictionary article as buttons to allow users to display the data on demand.

        To illustrate the potential information gain for the target users, the presentational devices are demonstrated and related to usage situations from function theory (Tarp, 2008): text production, (the text production stage of) translation into the foreign language, and the revision of existing texts. We describe how the presentational devices cater to user needs in these situations and how they integrate with other microstructural items. The individual devices cater to different usage- and function-related user needs depending on the usage situation, and the user needs of a given usage situation are covered by different devices. We exemplify the devices as well as different access routes within the dictionary, including aspect-class-based access via the above-mentioned visualisations.

        Speaker: Sarah Piepkorn
      • 12:00 PM
        Project of a Specialized Dictionary Website 1h

        POSTER

        The objective of the research is to develop a technology for converting specialized dictionary text into a website with a fully developed user interface.

        The object of the study was the “Dictionary of Ukrainian biological terminology” (7,342 entries and about 26,000 terms in Ukrainian, Russian and English), which contains definitions, term polysemy, synonymy, stress marks for the Slavic languages, and grammatical information.

        Since the dictionary text was available in a digital publishing format (PDF), no prior digitization was required. Our approach is to transform the linear text of the dictionary into a website step by step. The basic steps are as follows:

        1. Dictionary text normalization: restoration of the text line representing each dictionary entry, stress marking, fixing of font markers, correction of the inevitable publishing errors in the dictionary entry structure, etc. This was the most time-consuming step and required manual processing. The text was converted into .doc format and processed in the MS Word text processor; the result was text in .txt format in which HTML tags marked the substrings set in bold and italics.

        2. Designing a model of the dictionary's lexicographic system. This model serves as a basis for building the parsing algorithm, designing the database schema, and designing the interface elements. The model was designed on the basis of an analysis of the markup of the dictionary entries in the printed version. The lexicographic system model methodology allows us to identify all structural elements that can be recognized automatically and to establish connections between them. Each dictionary entry is assigned one universal structure, i.e., every dictionary entry is treated as a derivative of a single “template” entry.

        3. Construction of an XML schema based on the conceptual lexicographic model.

        4. Automatic conversion of the dictionary text (.txt format) into an XML document, which makes all defined structural elements and the connections between them explicit. To automatically mark the dictionary text with XML tags, a program was developed that highlights the elements of the dictionary entry structure (see the sketch after this list). We consider the XML document a stand-alone product that effectively represents the lexicographic data for further use for various purposes.

        5. Lexicographic database creation. A NoSQL (document-oriented) database was chosen for this. In relational databases, data is stored as a set of multiple tables and links between them, and working with individual tables as a single object requires a powerful software infrastructure. Moreover, the evolutionary potential of such a digital object is limited by the opacity of the database. Since dictionary entries are the basic elements of a lexicographic system with a strictly defined structure, it is logical to represent them as classes in object-oriented programming languages, with subsequent processing, editing and storage in explicit form. The main advantage of NoSQL databases for our project is their ability to store lexicographic objects explicitly, without changing their internal structure, which opens direct access to each element of a lexicographic object and significantly simplifies editing and modifying (extending) it.

        6. Converting the XML file into the database. This was performed automatically.

        7. Designing interface schemes and creating the website (currently in progress).
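
        As a purely illustrative sketch of step 4 (not the project's converter), the following Python fragment turns one normalized dictionary line, in which <b> is assumed to mark the headword and <i> the grammatical information, into a small XML entry; the entry structure, the interpretation of the font tags, and the sample line are invented for this example.

        # Illustrative only: convert one HTML-marked dictionary line into a
        # toy XML entry. Tag semantics and entry structure are assumptions.
        import re
        import xml.etree.ElementTree as ET

        def line_to_entry(line):
            entry = ET.Element("entry")
            headword = re.search(r"<b>(.*?)</b>", line)
            grammar = re.search(r"<i>(.*?)</i>", line)
            if headword:
                ET.SubElement(entry, "headword").text = headword.group(1)
            if grammar:
                ET.SubElement(entry, "gram").text = grammar.group(1)
            # strip the inline tags and remove the headword and grammar strings;
            # what remains is treated as the definition in this toy example
            rest = re.sub(r"<[^>]+>", "", line)
            if headword:
                rest = rest.replace(headword.group(1), "", 1)
            if grammar:
                rest = rest.replace(grammar.group(1), "", 1)
            ET.SubElement(entry, "def").text = rest.strip()
            return entry

        sample = "<b>бактерія</b> <i>ж.</i> одноклітинний мікроорганізм"
        print(ET.tostring(line_to_entry(sample), encoding="unicode"))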

        Speakers: Mykyta Yablochkov, Alona Dorozhynska, Iryna Ostapova, Iuliia Verbynenko
      • 12:00 PM
        Qualitative Evaluation of LLM Translation of MWEs for Developing a Croatian Sense Repository 1h

        POSTER

        As part of the COST Action CA21167 Universality, Diversity and Idiosyncrasy in Language Technology (UniDive), the ELEXIS-WSD Parallel Sense-Annotated Corpus (Martelli et al., 2021; Čibej et al., 2025) is being expanded to include subcorpora in additional languages—among them, Croatian—as well as new annotation layers. Each language subcorpus of ELEXIS-WSD contains the same 2,024 sentences extracted from WikiMatrix (Schwenk et al., 2019).

        The corpus was initially translated from English using two machine translation platforms: Google Translate and Hrvojka (https://hrvojka.gov.hr/). The translations then underwent a two-step manual validation process: first, the more suitable translation for each sentence was selected and errors were corrected; then, the final versions were reviewed in terms of the accuracy of term equivalents and idiomatic expressions. The resulting set was then automatically tokenized, lemmatized, and POS-tagged, and is currently undergoing manual correction.

        The next phase involves creating an open-source sense repository for Croatian, which is being developed based on an existing pedagogical dictionary (Authors, 2025). The repository will be enriched through a combination of manual and automated methods, including the use of large language models (LLMs) to define missing senses. Since domain-specific terms and certain multiword expressions (MWEs) (Odijk, 2013) posed challenges for the tested translation platforms, a new evaluation task was conducted to assess the competence of LLMs in translating MWEs. The underlying hypothesis was that if an LLM could successfully translate MWEs from English into Croatian, it should also be capable of adequately identifying and defining their senses. Some studies have shown that LLMs perform particularly well in the semantic interpretation of MWEs (Gantar, 2024).

        Each English sentence was automatically translated in a separate prompt using an adapted pipeline for two large language models: ChatGPT-4o and the recently developed Slovene GaMS-9B-Instruct (https://huggingface.co/cjvt/GaMS-9B-Instruct). A preliminary evaluation was conducted on the first 200 sentences. As the translations generated by the GaMS-9B-Instruct model contained a significant number of Serbian lexical items (e.g., fudbal, holandski napadač, spoljni stručnjaci instead of nogomet ‘football’, nizozemski napadač ‘Dutch striker’, vanjski stručnjaci ‘outside experts’), this set of translations was excluded from further evaluation. Five linguists then compared the ChatGPT-4o translations with the manually validated automatic translations, and marked differences.

        This paper presents an analysis of the most common differences between the automatic translation of MWEs from English into Croatian by an LLM and the human validation of machine translation. ChatGPT-4o demonstrates a high level of proficiency in handling MWEs in this translation task compared to its predecessors. Differences between the compared translations include: a) wrong terminological equivalents (e.g., medicinski uvjeti / medicinska stanja ‘medical conditions’, Bézierove površine / Bézierove plohe ‘Bézier surfaces’); b) differences at the morphosyntactic level (Otto nagrada / nagrada Otto ‘Otto Award’; riževi nemiri / rižini nemiri ‘rice protest’); c) English-influenced literal translations, mostly in verbal MWEs (uzeti ime / dobiti ime ‘take its name’, častiti kao sveca / štovati kao sveca ‘honour as a saint’); d) the treatment of metaphorical MWEs (pod protestom / u znak protesta ‘under protest’, proces se raspada / proces se urušava ‘the process breaks down’); and e) named entities, which are a challenge in other languages, too (Krstev et al., 2024). The provisional typology will be used in developing templates for defining MWEs in the sense repository for Croatian.

        Speakers: Ana Ostroški Anić, Jaka Čibej, Ivana Filipović Petrović, Martina Pavić, Siniša Runjaić, Robert Sviben
      • 12:00 PM
        The Challenges of Syntactic Descriptions of Multiword Expressions in Electronic Lexicography 1h

        POSTER

        In this paper, we provide a comprehensive overview of the way in which the morpho-syntactic properties of multiword expressions are represented in lexical resources to support Natural Language Processing downstream applications. Starting from an up-to-date and comprehensive overview of the existing lexica dedicated to multiword expressions and containing their syntactic description, we outline the current state of play in encoding syntactic information about multiword expressions (internal structure, argument structure, word order, discontinuity, verb alternations). We also discuss the relevance of the syntactic description of multiword expressions for several Natural Language Processing tasks. Our work contributes to the literature that fosters improvements in both the development and deployment of multiword expression lexica to ensure that they can support future Natural Language Processing innovations more effectively.

        Speakers: Verginica Barbu Mititelu, Voula Giouli, Gražina Korvel, Chaya Liebeskind, Irina Lobzhanidze, Rusudan Makhachashvili, Stella Markantonatou, Alexandra Markovic, Ivelina Stoyanova
      • 12:00 PM
        The dictionary of pluricentric Portuguese project: theoretical aspects 1h

        POSTER

        The dictionary of pluricentric Portuguese project, which is at its initial stage at the University of Coimbra, aims to provide a free, online dictionary that describes Portuguese as it is used in several territories around the globe. The purpose of this poster is to present theoretical questions that need to be answered to guide the methodological decisions for the creation of this dictionary, bearing in mind our alignment with the idea of “socially responsible lexicography” (Calañas Continente & Domínguez Vázquez, 2023) and the socio-political-cultural complexities inherent to the Portuguese language area.

        From an official-status viewpoint, Portuguese is used in nine countries and one territory. Nevertheless, the functional status of the language varies significantly across these regions, ranging from its status as the mother tongue of the majority of the population (Brazil, Sao Tome and Principe, Portugal), to its role as the predominant vehicular language, typically as a second language (Angola, Mozambique), to its status as a minority language (Cabo Verde, Guinea-Bissau, Timor-Leste), to the point of its virtual non-use (Equatorial Guinea, Macao). As to language standards, Brazil and Portugal have traditionally been considered norm-setting centres, having fully-fledged standardizing and codifying instruments such as dictionaries and grammars, with the European variety being adopted as the norm in the other countries. However, this bicentric view has been challenged by researchers who have shown that local varieties of Portuguese have been emerging in other countries. In addition, there is a growing demand in society for the recognition of these varieties as being as valid and legitimate as the dominant varieties, with the compilation of a Dictionary of Mozambican Portuguese currently underway (see Machungo & Firmino, 2022). This highlights the complex relationship between language, power, and identity.

        These complex socio-political-cultural contexts of all these multilingual territories, together with our ideological position to counter what Rizzo (2019: 287) has identified as “homogenizing tendencies in certain language policies that seek to impose a dominant reality”, make the production of a dictionary of pluricentric Portuguese a highly challenging undertaking. One of the greatest challenges is the fact that, in territories where Portuguese was introduced as a result of colonisation, the dominant exonormative view of the language leads to a significant gap between how the language is used on a daily basis and the use imposed by the school and other language regulators. This has several consequences for our lexicographic project, starting with establishing which definition of norm is suitable for our project, which in turn will support decisions regarding the corpus to be used as a source. Taking all of this into account means that prior theoretical research must be carried out in order to inform the decision-making process regarding corpus compilation, the headword candidate list, entry configuration, and entry microstructure, to name but a few. In this poster, we will position ourselves in terms of theoretical references, present crucial questions for the making of the dictionary, and share tentative answers. We hope this paper will promote the exchange of knowledge and experience with fellow lexicographers facing similar challenges in their projects, as well as encourage reflection on the political role of lexicography (Crowley, 1999).

        Speaker: Tanara Zingano Kuhn
      • 12:00 PM
        The Hare and the Tortoise: Pipeline for Latvian Information and Communication Technologies Secondary Term Formation 1h

        POSTER

        The Information and Communication Technologies (ICT) field has evolved rapidly in recent decades. Thus, to describe new devices, activities, and concepts that appear yearly, a vast number of terms are created primarily in English, while other languages rely on secondary term formation (STF) for ICT end-users (ETSI Guide, 2022). Systematic secondary rendering and dissemination of up-to-date terminology in the target language (Chiocchetti and Ralli, 2013; Stefaniak, 2023) are crucial for language development and benefit professionals, students, and the public. We analysed the STF process in Latvian for the ICT domain during the development of the Language Technology (LT) course at the University of Latvia.

        For over 30 years, the Terminology Commission of the Latvian Academy of Sciences (TCLAS, 2025) and its sub-commissions, including the Information and Communication Technologies Sub-Commission (ICTSC), have carried out term formation. The ICTSC comprises ICT professionals, terminologists, and linguists. ICT students also take part in meetings, where newly proposed terms are approbated for the first time. The commission meets twice a month during the academic year. Terms are sourced from higher education, industry, and translation agencies, including the European Commission. They are added to the biweekly agenda, discussed, and, if accepted, recorded in an open-access Academic term database, available on the web since 2005 (ATB, 2025).

        For the LT course, terms were manually extracted from lecture slides. Given the ICTSC’s capacity to produce about 20 high-quality terms during a 2-hour meeting, terms were prioritised based on their relevance to the LT course. The identified terms were reviewed and defined, and supplemented with usage examples and visuals. Possible Latvian term variants were proposed, with ICTSC members conducting preliminary written discussions, and 111 terms were accepted and are available in the Academic term database (ATB, 2025).

        The STF process includes several challenges where AI tools could be applied. As the concept behind a term is usually expressed most precisely in its definition, the most significant challenge is providing a clear definition for terms used in several ICT subdomains. The second is weighing the arguments for and against creating source-language-oriented terms, which can be easily back-translated and will be recognisable, versus creating secondary terms that precisely reflect the definition but might be far from a direct translation of the original term (e.g., Bag of Words). The third challenge is the length of the term and its euphony, i.e. how easily it can be pronounced. As a rule of thumb, the longer the term, the less likely it is to be used in spoken communication, and the more likely a direct calque will be used instead.

        The STF process has been researched (Šostaka et al., 2023), and several approaches were tested to speed up the “mechanical” parts of term creation. The first approach used an AI tool (ChatGPT 4.0) on 140 concepts and terminology units within ISO/IEC 22989:2022(en): suggestions for STF in Latvian were generated and evaluated, and then compared to the terms already approved by the Terminology Commission (Šostaka et al., 2025). Out of 140 concepts, 75 terms had an exact match, 65 had a partial match, while 5 had no match.
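
        The comparison of LLM suggestions with commission-approved terms can be illustrated with a minimal sketch in Python. The matching heuristic (shared tokens count as a partial match) and the example pairs are assumptions for illustration only, not the criteria or data used in the study.

          # Minimal sketch (not the study's code): classify LLM-suggested Latvian
          # terms against commission-approved terms as exact / partial / no match.
          # The partial-match heuristic (any shared token) is a hypothetical stand-in.
          def classify(suggested: str, approved: str) -> str:
              s, a = suggested.lower().strip(), approved.lower().strip()
              if s == a:
                  return "exact"
              if set(s.split()) & set(a.split()):
                  return "partial"
              return "none"

          pairs = [  # toy examples, not real study data
              ("vārdu maiss", "vārdu maiss"),
              ("maiss ar vārdiem", "vārdu maiss"),
              ("vārdu kopa", "vārdu soma"),
          ]
          counts = {"exact": 0, "partial": 0, "none": 0}
          for suggested, approved in pairs:
              counts[classify(suggested, approved)] += 1
          print(counts)   # {'exact': 1, 'partial': 2, 'none': 0}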

        The second approach was checking the time saved using a tool for term extraction from online dictionaries (Šostaka et al., 2024). The tool allows the user to review specified online sources (e.g., the Merriam-Webster dictionary) related to ICT terms; it is scalable, and sources of the user’s choice in other fields and languages can be added. It allowed us to save 74 minutes when searching for 40 terms, compared with the 106 minutes needed for a manual search.

        Speakers: Dace Šostaka, Inguna Skadiņa
      • 12:00 PM
        Up to No Good: Exploiting Word Embeddings for an Automatic Extraction of Candidates for a Lexicon of Slovene Taboo Language 1h

        POSTER

        Lexicons of taboo language are useful language resources that can serve multiple purposes. In addition to their direct use for automatically censoring words deemed inappropriate in a given context (e.g. to help mitigate the problem of online hate speech), they can also help filter out materials not suitable for educational purposes (see Zingano Kuhn et al., 2022) or for games with a purpose (Arhar Holdt et al., 2021), and clean the data used to train general language models (e.g. by removing pornographic content). In addition, taboo language, particularly the part related to hate speech, needs to be well documented in dictionaries, as these are used as authoritative language resources (Gorjanc, 2005). Taboo language lexicons can also be useful for linguistic analyses and contrastive translation studies, since swearing and taboo language are frequently culturally specific – see e.g. Klemenčič (2016) for a contrastive study of swearing in Slovene and Swedish; however, that study focused on a limited set of hand-picked expressions, since no comprehensive list yet exists for Slovene, at least not in a machine-readable format.

        What is included in existing Slovene language resources is either not openly accessible, inaccurately represented (e.g. with pejorative as the only label, even though such words can differ radically in intensity or tabooness: cf. bedak 'fool' vs. peder 'faggot'), or limited in scope (Thesaurus of Modern Slovene; Krek et al., 2023), with material stemming mostly from corpora of standard Slovene, where the usage of offensive vocabulary is limited.

        While similar lexicons have been compiled from existing language resources (e.g. van Huyssteen & Tiberius, 2023), we present an approach for constructing a list of Slovene taboo language candidates using FastText embeddings trained on a number of Slovene corpora (including web crawls). We first extract seed entries, i.e. entries in which at least one of the senses has been assigned a relevant label (hate speech, vulgar/coarse, expresses a negative attitude; see Arhar Holdt et al., 2022), from the Thesaurus of Modern Slovene 2.0 (Krek et al., 2023), which is part of the Digital Dictionary Database of Slovene (DDDS; Kosem et al., 2021). We group them manually (e.g. religion-based, race-based, gender-based, homophobic slurs, words with sexual connotation), then use their embeddings (Terčon et al., 2023) and compare them with the embeddings of other words using cosine similarity to obtain a list of candidate similar words.
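
        The expansion step can be sketched roughly as follows. This is an illustration only: the model file name and seed words are placeholders, not the Slovene embeddings of Terčon et al. (2023) or the actual seed entries.

          # Rough sketch of the candidate-expansion step: nearest neighbours of seed
          # words in a FastText space, ranked by cosine similarity. The model file
          # and seeds below are placeholders, not the resources used in the study.
          from gensim.models.fasttext import load_facebook_vectors

          wv = load_facebook_vectors("sl_fasttext_model.bin")  # placeholder path
          seeds = ["bedak", "tepec"]                            # hypothetical seed group

          candidates = {}
          for seed in seeds:
              # FastText composes vectors from character n-grams, so rare and
              # non-standard spellings still receive usable vectors
              for word, sim in wv.most_similar(seed, topn=50):
                  candidates[word] = max(sim, candidates.get(word, 0.0))

          for word, sim in sorted(candidates.items(), key=lambda kv: -kv[1])[:20]:
              print(f"{word}\t{sim:.3f}")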

        We discuss the results of this extraction as well as the advantages (e.g. the detection of non-standard words or words that are rare in the corpus and might not be detected through a frequency-based approach) and disadvantages of this approach (e.g., it focuses on single-word expressions and is lexeme-focused instead of sense-focused). The resulting lexicon will be made available under an open-access license (CC BY-SA 4.0), also as part of the Sloleks Morphological Lexicon of Slovene (Čibej et al., 2022), which is part of the DDDS. The lexicon can provide a basis for a more detailed lexicographic analysis within DDDS, and the method can be applied to other languages.

        Speaker: Jaka Čibej
    • 1:00 PM 2:30 PM
      Lunch 1h 30m
    • 2:30 PM 4:00 PM
      Parallel sessions 1 (Arnold hall) Arnold hall

      Arnold hall

      Convener: Ivana Filipović Petrović
      • 2:30 PM
        Choosing Suitable Text Corpora for Identifying Collocations – A Case Study of a Large Reference Dictionary of Contemporary German 30m

        Collocations are a well-covered research area in lexicography. With the advent of evidence-based lexicography and the availability of large text corpora, computational methods of extracting typical co-occurrences from such corpora and supporting lexicographers in identifying collocations among them became a research focus. The statistical properties of collocations in particular (i.e. the application of various association measures) have been evaluated for different languages, collocation types, gold standards and corpora (e.g. Evert et al. 2017; Garcia, García Salido, and Alonso-Ramos 2019). In hindsight, though, and despite the undisputed heuristic value of statistical methods for the task at hand, the overall results of such studies do not provide clear conclusions, especially with respect to the practical implications for lexicographic work. Combined, they highlight the dependency of the results on the available datasets, the investigated collocation types, as well as the underlying corpora in terms of their composition and the affordable preprocessing (Uhrig, Evert, and Proisl 2018). Some results even indicate that for high-quality, dependency-annotated corpora – in contrast to large but scarcely annotated web corpora used in previous studies – raw frequency data can be as indicative for extracting collocations as association measures. Consequently, and given recent advances in deep learning, the focus has shifted from the evaluation of association measures to the adaptation of increasingly capable statistical language models for the identification and classification of collocations (Espinosa-Anke, Codina-Filbà, and Wanner 2021; Falk et al. 2021; Ljubešić, Logar, and Kosem 2021).

        In this study, we examine a more fundamental question that is addressed only in passing by the aforementioned work. This question becomes more important as the focus shifts from the precision of association measures to the recall required when constructing representative datasets for training classifiers: Which types of corpora are actually suitable for extracting collocation candidates and exemplifying their usage? To this end, we compare several corpora from the vast corpus collection of the ‘Digitales Wörterbuch der deutschen Sprache’ (DWDS), which comprises more than 70 billion tokens of German texts, including reference corpora, web corpora and high-quality print newspapers. In order to study the coverage of collocations by these corpora, we assembled a gold standard from three lexical resources of collocations of contemporary German: the collocations described in DWDS entries, a dictionary of German collocations (Quasthoff 2011), and a dataset from a recent dissertation (Strakatova 2024), yielding in total approximately 350,000 collocations of different syntactic types. We verify the presence of these collocations in various corpora of the DWDS corpus collection. Comparing the coverage of our gold standard datasets by those corpora, we conduct a case study to answer questions such as: a) How good is the coverage of common collocations by carefully selected but small reference corpora? b) Are giga-token web corpora sufficient to cover a broad set of collocations as documented in comprehensive reference dictionaries? c) Do high-quality newspapers surpass web corpora, or can they be replaced by well-curated web corpora?

        Speakers: Luise Köhler, Gregor Middell, Alexander Geyken
      • 3:00 PM
        The role of subjectivity in lexicography: Experiments towards data-driven labeling of informality 30m

        Language corpora have long been used in linguistics and lexicography, but recent developments now allow large language models (LLMs) to support or even transform these fields. This study investigates the potential of LLMs for annotating informal language use in Estonian – a language underrepresented in LLM training data yet supported by a large corpus. Focusing on the informal register label used in the Dictionary of Standard Estonian, we explore whether LLMs can assist lexicographers in determining the informal label. This paper describes two experiments that make use of LLMs, including GPT, Gemini, and Claude. The first experiment yielded useful insights but also highlighted necessary improvements. In the second experiment, we evaluated the LLMs’ consistency and accuracy in categorizing words as informal or neutral/formal. Results showed that LLMs achieved around 76% agreement with expert human annotators, significantly above random chance, suggesting their usefulness as a supplementary resource in lexicography. GPT-4o demonstrated high accuracy, stability, and cost-efficiency, making it a reliable candidate for such a lexicographic task. The study highlights the inherent subjectivity in register labeling and the value of combining corpus data, expert judgment, and LLM output. Overall, LLMs represent a promising tool for modern dictionary work.

        Speakers: Lydia Risberg, Eleri Aedmaa, Maria Tuulik, Margit Langemets, Ene Vainik, Esta Prangel, Kristina Koppel, Hanna Pook
      • 3:30 PM
        So Close but Still Far: Case Study on Application of LLMs in Idioms Identification, Definition, and Generation of Illustrative Examples 30m

        Automation has revolutionised lexicography, introducing the ‘post-editing lexicography’ model, in which the role of the lexicographer involves refining automatically generated dictionary drafts. Since the launch of ChatGPT in November 2022, numerous papers have explored the potential applications of LLMs in dictionary production. The rapid evolution of LLMs necessitates a re-evaluation of conclusions drawn approximately two years earlier regarding their application in automating dictionary entry creation, particularly in light of the advanced capabilities demonstrated by contemporary models.

        We will present an experiment conducted on a dataset of 400 (397) MWEs with idiomatic meaning, aiming to evaluate the usefulness of LLMs in Serbian descriptive lexicography tasks (idiom generation, word-sense disambiguation of MWEs, definition writing, and generation of illustrative examples). We requested two types of illustrative examples: those in which an MWE has an idiomatic meaning, and examples in which that meaning is paraphrased literally (without the idiom). We will highlight the challenges and issues encountered with several models (ChatGPT-4o and 4.1, Gemini-2.5-Flash and 2.5-Pro) and discuss the differences in their performance depending on the prompts given, using both direct chat and API access via Python scripts.

        Speakers: Aleksandra Marković, Ranka Stanković
    • 2:30 PM 4:00 PM
      Parallel sessions 2 (Sonce hall) Sonce hall

      Sonce hall

      Convener: Slobodan Beliga
      • 2:30 PM
        Using Large Language Models to Generate Distractors for Language Games 30m

        This paper presents two tasks involving large language models (LLMs), Gemini-2.0-flash and GPT-4o, used to generate distractors (i.e., incorrect options) for synonym and collocation questions in a language game. The lexical data for both tasks was sourced from the Digital Dictionary Database of Slovene (DDDS). Prompts were initially tested on a sample dataset with both models, and the better-performing model was selected for each task: Gemini-2.0-flash for synonyms, and GPT-4o for collocations. Evaluation results showed strong performance of the models, with over 80% of the generated distractors rated as appropriate. Common issues included non-existent or rare words and legitimate synonyms in the synonym task, and common collocations or distractors that improperly altered collocational structure in the collocation task. Additional filtering of the data was required to ensure game readiness. Further plans include using LLMs for the production of data for other games, as well as using LLMs in the preparation of lexicographic data in the DDDS.
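
        As a rough illustration of this kind of prompt-based generation (not the prompts actually used for the DDDS data), a distractor request for a synonym question might look as follows; the prompt wording, model name and parameters are assumptions.

          # Illustrative sketch only: ask an LLM for distractors for a synonym
          # question. Prompt wording, model and parameters are assumptions, not
          # the actual setup used to produce the game data.
          from openai import OpenAI

          client = OpenAI()  # expects OPENAI_API_KEY in the environment

          def distractors(headword: str, synonym: str, n: int = 3) -> list[str]:
              prompt = (
                  f"The Slovene word '{headword}' has the synonym '{synonym}'. "
                  f"Suggest {n} Slovene words that are NOT synonyms of '{headword}' "
                  "but would be plausible wrong answers in a synonym quiz. "
                  "Return only the words, separated by commas."
              )
              response = client.chat.completions.create(
                  model="gpt-4o",
                  messages=[{"role": "user", "content": prompt}],
                  temperature=0.7,
              )
              return [w.strip() for w in response.choices[0].message.content.split(",")]

          print(distractors("hiter", "uren"))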

        Speakers: Iztok Kosem, Špela Arhar Holdt
      • 3:00 PM
        Automated Transcription of Mixed-Script Dialectal Materials 30m

        The Dictionary of Bavarian Dialects in Austria ("Wörterbuch der bairischen Mundarten in Österreich", WBÖ) project maintains an archive of approximately 3.6 million handwritten paper slips documenting dialectal evidence. While 2.4 million entries have been manually digitized and converted to TEI format, the remaining 1.2 million paper slips from sections A-C require automated processing. This paper presents a novel three-stage workflow concept combining Handwritten Text Recognition (HTR) technology with existing digitized holdings to overcome the challenges posed by heterogeneous writing systems, multiple scribes, and poor material condition. Initial tests with existing HTR models yielded unsatisfactory results. The proposed solution leverages the existing Database of Bavarian Dialects ("Datenbank der bairischen Mundarten in Österreich", DBÖ) to automatically correct HTR transcription errors through similarity-based alignment and N-gram matching algorithms. The corrected transcriptions serve as a gold standard (a kind of ground truth) for training a specialized HTR model tailored to historical dialect materials. This methodology enables the creation of substantial training datasets without manual transcription, potentially generating 33.6 million words for model training. The approach promises complete digital access to the WBÖ archive and provides a transferable template for similar lexicographic projects with historical slip collections.
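
        The correction idea (aligning noisy HTR output with attested DBÖ forms) can be sketched with a simple character-bigram similarity. The lexicon, threshold and example token below are invented for illustration and do not reproduce the project's actual algorithms.

          # Minimal sketch (not the project's pipeline): correct a noisy HTR token
          # by matching it against a lexicon of attested forms using character
          # bigram Dice similarity. Lexicon, threshold and input are hypothetical.
          def bigrams(word: str) -> set[str]:
              padded = f"#{word.lower()}#"
              return {padded[i:i + 2] for i in range(len(padded) - 1)}

          def dice(a: str, b: str) -> float:
              A, B = bigrams(a), bigrams(b)
              return 2 * len(A & B) / (len(A) + len(B)) if A and B else 0.0

          def correct(htr_token: str, lexicon: list[str], threshold: float = 0.6) -> str:
              best = max(lexicon, key=lambda w: dice(htr_token, w))
              return best if dice(htr_token, best) >= threshold else htr_token

          dboe_forms = ["Erdapfel", "Erdäpfel", "Kraut"]   # hypothetical attested forms
          print(correct("Erdapfe1", dboe_forms))            # -> Erdapfel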

        Speaker: Markus Kunzmann
      • 3:30 PM
        Learner’s reactions to false polysemy 30m

        Studies comparing dictionary entries generated with AI with those of well-established dictionaries edited by lexicographers show that LLMs tend to perform better in some tasks (e.g. writing definitions) than in others (e.g. word-sense disambiguation) (Nichols 2023; Lew 2023; Jakubíček & Rundell 2023; Rees & Lew 2024). One of the problems resulting from the latter is that of “false polysemy” (Jakubíček & Rundell 2023: 525), where the differences between senses listed under a headword are unclear.

        Admittedly, the separation of meanings in dictionaries is artificially drawn by lexicographers, and there are often mismatches in sense distinctions across dictionaries. Yet it is still possible for experts to evaluate whether meaning boundaries are sufficiently clear-cut. What is less known is how learners react to false polysemy. Granted that people rarely read dictionary entries in full (Tono 1984, Nuccorini 1994, Bogaards 1998, Dziemianko 2016), and have been reported to stop reading once they find the information they need (Lew, Grzelak & Leszowicz 2013), we wanted to explore whether false polysemy disrupts the consultation process.

        This study analysed how 98 L2-English undergraduate students reacted to false polysemy. They took an online quiz consisting of 20 unknown lexical items presented in the context of sentences selected from corpora, some of which were shortened or slightly edited to remove contextual clues. For each vocabulary test item, the participants were given two definitions copied from Reverso, a new English dictionary developed with the assistance of LLMs. Example sentences and sense indicators that could give additional cues about meaning were deliberately omitted. For half of the test items, the pair of definitions provided were indisputably different. For the other half, the definitions were not clearly distinct according to two independent experts (i.e., they were exemplars of AI-generated false polysemy). The test items were shown to the participants in a random order, and each time they were asked to select which of the two definitions (also randomly ordered) was a better fit. They were then asked to judge on a Likert scale how confident they were that they had selected the correct sense. We also recorded the time spent on each test item, the order of the definition selected (first or second), and whether it was correct (when senses were distinct). A sample of the participants was then interviewed to gain further insights into their reactions.

        Preliminary results indicate that the participants had little difficulty selecting the correct sense in the true polysemy condition. However, when faced with false polysemy, their confidence dropped and they took longer to decide. Both effects were statistically significant. Our findings suggest that false polysemy can be detrimental to the user experience, and underscore the need for AI-powered systems that acknowledge and address the problem proactively, as recognized by the developers of Reverso, where human expertise, editorial guidelines and built-in feedback loops are key. That said, future user studies on false polysemy require naturalistic observations, as dictionary users may react differently when not explicitly asked to pick one out of two controlled definitions.

        Speakers: Tomasz Michta, Ana Frankenberg-Garcia
    • 2:30 PM 4:00 PM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Kris Heylen
      • 2:30 PM
        Information seeking behavior of the English learner in the AI era 30m

        ONLINE PRESENTATION

        Technology has profoundly affected the way language learners seek information. Digital formats have virtually superseded the paper dictionary (Ptasznik, Wolfer and Lew, 2024), online translators have gained much importance (O’Neill, 2019), and web browsers have become the first port of call (Kosem et al., 2019). Obviously, generative AI systems imitating human-like communication mark another watershed for online information behavior (De Schryver et al., 2023; Qu and Wu, 2024).

        The aim of the study is to investigate English language learners’ information seeking behavior on the web in the AI era. The following research questions are posed:

        RQ1: Which online tools (search engines/browsers, dictionaries, translators or AI assistants) do learners of English access to solve language problems?

        RQ2: How often and in what situations do they turn to these tools?

        RQ3: How do English learners assess their digital literacy needed to solve language problems using the tools?

        RQ4: How do they evaluate the online tools?

        RQ5: Which devices are most often used to find linguistic information online?

        To answer the research questions, an online questionnaire was designed. So far, it has been conducted among 379 B1/B2+ learners of English in Slovenia, out of whom 161 provided valid answers. Preliminary results indicate that in situations of linguistic deficit, online translators are the first port of call (70%, mainly Google Translate, DeepL and Pons), followed by search engines/browsers (60%, mostly Google, less often Safari and Chrome). About 40% of the respondents consult online dictionaries (like the Cambridge Dictionary) and AI assistants (ChatGPT, occasionally DeepSeek and Grok; RQ1). Online translators and search engines/browsers are typically used once or a few times a week, online dictionaries once a week or once a month, and AI assistants every day or once to a few times a week (RQ2). As a rule, all the tools are consulted for both official and unofficial purposes (i.e., to get help with comprehension and production in daily situations both related and unrelated to university/job). Leisure activities (writing creative texts for pleasure or playing word games) are the least important consultation motives (RQ2). The respondents think highly of their digital proficiency. Virtually all of them claim that at least half of their last 10 inquiries assisted by any tool were successful (RQ3). The tools themselves are also held in high regard. Almost all AI users enjoy their chats, and over 83% of learners like turning to the other tools. However, online dictionaries are considered the most trustworthy (91%), followed by search engines/browsers (68%) and AI assistants (61%). Online translators are trusted the least (53%; RQ4). Interestingly enough, smartphones most often serve to search the web, chat with AI and consult online translators, while online dictionaries are usually accessed from computers (RQ5).

        The full paper gives a deeper insight into the tendencies emerging from the collected data, including open-ended questions (e.g., advantages and disadvantages of the investigated tools). The limitations of the study and new avenues of research are also discussed.

        Speakers: Anna Dziemianko, Mojca M. Hočevar
      • 3:00 PM
        Matching meaning: Evaluating ChatGPT’s ability to assign corpus examples to dictionary senses of polysemous sound-related verbs 30m

        ONLINE PRESENTATION

        A major change in dictionary exemplification was brought about by the arrival of corpus data, which replaced lexicographer-made examples with authentic ones from real spoken and written discourse. Monolingual English learners’ dictionaries (MELDs) prefer a third type of example, corpus-based ones, from which unnecessarily complex vocabulary and structures, as well as unclear content, have been removed. However, apart from corpus-based examples, which follow the definitions of senses, online MELDs include sections of unmodified corpus examples, usually placed at the bottom of entries and not matched to any sense.

        The paper aims to explore corpus examples sections accompanying polysemous sound-related verbs and leverage ChatGPT-4 to match corpus examples with the senses already distinguished in the respective dictionary entries. The verbs were selected from the twelve strongest and forty-four strong synonym matches of the verb 'sound' in the sense “produce noise” on Thesaurus.com. Apart from the basic, literal meaning, each of these verbs has a figurative, metaphorical meaning or meanings, e.g. echo “to repeat opinions in agreement”, and resonate “to receive a sympathetic response”. Learners’ dictionaries were chosen for analysis, as exemplification is particularly important in them. The selected MELDs are Longman Dictionary of Contemporary English (LDOCE), Cambridge Advanced Learner’s Dictionary (CALD) and Collins Dictionary (Collins), as they all have sections dedicated to corpus examples. CALD and Collins explicitly inform the user that the examples have been automatically selected, and therefore the editors do not take responsibility for possible sensitive content or mismatches with the entry word.

        The present study demonstrates that ChatGPT is successful at separating literal from metaphorical examples of sound-related verbs, which is not surprising, as current research indicates the capability of Large Language Models (LLMs) for polysemy and metaphor identification and interpretation (e.g. Bond et al. 2024 and Lin et al. 2024). The performance of ChatGPT is then checked in a more challenging task, that of matching corpus examples with the already existing senses in each of the analysed dictionaries. The prompts include the numbered senses that feature in the dictionaries under a certain headword together with the definitions and accompanying examples, which serve as models for ChatGPT.
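
        A prompt of the kind described might be assembled as in the sketch below; the senses and the corpus example are invented paraphrases for illustration, not material copied from LDOCE, CALD or Collins.

          # Illustrative sketch: assemble a sense-matching prompt of the kind
          # described above. The senses and the example are invented paraphrases,
          # not material copied from the dictionaries under analysis.
          senses = {
              1: "if a sound echoes, it is repeated after the original sound stops",
              2: "to repeat someone's opinion or statement because you agree with it",
          }
          corpus_example = "Her warnings were echoed by several board members."

          lines = ["Verb: echo", "Senses:"]
          for number, definition in senses.items():
              lines.append(f"{number}. {definition}")
          lines += [
              f"Corpus example: {corpus_example}",
              "Which sense number fits the example? Answer with the number only, "
              "or 'none' if no sense fits.",
          ]
          prompt = "\n".join(lines)
          print(prompt)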

        The corpus example sections in the dictionaries tend to be rather lengthy, especially in CALD: at the entry for 'resonate', for instance, they amount to 104 examples. The task of assigning corpus examples to separate senses would therefore be drudgery for human lexicographers. In online dictionaries, such corpus examples can be located below corpus-based examples in expandable boxes, a practice already seen in the Oxford Advanced Learner’s Dictionary for corpus-based examples. It was found that ChatGPT sometimes admits it cannot assign any corpus example to a sense, because no example demonstrates it. Such cases will be closely scrutinised, and ChatGPT will be asked to generate the missing examples, a task at which it does not turn out to be impressive, as Lew (2023) observes.

        Speaker: Sylwia Wojciechowska
      • 3:30 PM
        The DICI-A: A Learner Dictionary of Italian Collocations 30m

        ONLINE PRESENTATION

        In this presentation we describe the DICI-A (Dizionario delle collocazioni italiane per apprendenti), a new learner dictionary of Italian collocations.

        The DICI-A includes ca. 11,000 collocations belonging to six syntactic relations: i. Verb + Direct object (mantenere una promessa, ‘to keep a promise’); ii. Adjective + Noun/Noun + Adjective, where the adjective is a modifier before or after a noun (brutta avventura, ‘bad adventure’; tempo libero, ‘free time’); iii. Verb + Adjective (stare zitto, ‘to stay quiet’); iv. Verb + Adverb (fare presto, ‘to hurry up’); v. Adverb + Adjective (altamente positivo, ‘highly positive’); and vi. Noun + Noun (parco divertimenti, ‘amusement park’).

        In the context of Italian phraseological lexicography, in which three different monolingual collocation dictionaries have been published in the last 15 years (Urzì 2009; Tiberii 2012; Lo Cascio 2013), the DICI-A is a lexicographic resource that brings important added value, since none of the existing dictionaries is specifically aimed at L2 learners, and none was created according to strictly corpus-based criteria.

        The presentation will describe the following features of the DICI-A, resulting from methodological choices made during its development:

        • it is a corpus-based dictionary: collocations were extracted from an Italian written and spoken reference corpus (Author et al. under review), by integrating measures of frequency and dispersion with association measures (Gablasova et al. 2017; Gries 2024) of exclusivity (Mutual Information; Evert 2005) and strength of association (LogDice; Rychlý 2008) (see the worked example after this list);

        • the automatically extracted collocations were filtered through a two-step process: a validation against two of the three existing collocation dictionaries, and a human assessment performed by six linguists specialised in phraseology;

        • as a dictionary targeted at learners, each entry of the final list of 11,000 collocations was assigned to a specific proficiency level (A: basic; B: intermediate; and C: advanced) according to the Common European Framework of Reference (Council of Europe 2020), by combining different criteria, such as the rank of collocations in a frequency list, their internal composition, their use by learners at different proficiency levels, as attested in a learner corpus of Italian (Author et al. 2023), and their domain of use (La Russa et al. 2023);

        • definitions and examples for each of the collocational entries were obtained using Generative AI (Ptasznik et al. 2024): a specific prompt provided through the ChatGPT 4o API interface was found to be effective in producing definitions and examples easily understandable by learners, even at low proficiency levels, as demonstrated by two ad hoc tests (Author et al. 2025).
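
        As a worked illustration of the two association measures named in the first bullet point, the standard formulas (Evert 2005; Rychlý 2008) can be computed from corpus counts as follows; the counts are toy values, not figures from the DICI-A corpus.

          # Worked illustration of the two measures named in the first bullet
          # (standard formulas from Evert 2005 and Rychlý 2008); the counts are
          # toy values, not figures from the DICI-A corpus.
          import math

          N    = 50_000_000   # corpus size in tokens
          f_xy = 1_200        # co-occurrence frequency of the candidate collocation
          f_x  = 15_000       # frequency of the first item
          f_y  = 9_000        # frequency of the second item

          mi = math.log2(f_xy * N / (f_x * f_y))              # Mutual Information
          log_dice = 14 + math.log2(2 * f_xy / (f_x + f_y))   # logDice

          print(f"MI = {mi:.2f}, logDice = {log_dice:.2f}")   # MI = 8.80, logDice = 10.68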

        The DICI-A will be publicly available from the end of 2025 in digital format, and searchable through a dedicated web and mobile interface.

        Speakers: Stefania Spina, Fabio Zanda, Irene Fioravanti, Luciana Forti, Damiano Perri, Osvaldo Gervasi
    • 4:00 PM 4:30 PM
      Coffee break 30m Lobby

      Lobby

    • 4:30 PM 6:00 PM
      Parallel sessions 1 (Arnold hall) Arnold hall

      Arnold hall

      Convener: Kristina Koppel
      • 4:30 PM
        How Effective is AI as a Language Consultant? 30m

        This paper explores the applicability of generative artificial intelligence in the field of language consulting, focusing on ChatGPT-4 and the Slovenian language. The analysis is based on an experiment involving 30 real user questions submitted to the Language Consulting Service (LCS) of the Fran Ramovš Institute of the Slovenian Language. The questions cover a range of linguistic categories and were submitted to ChatGPT under controlled conditions. The responses were then compared with expert-produced answers and evaluated in terms of factual accuracy, stylistic appropriateness, terminological correctness, and overall usefulness. The results show that while ChatGPT performs well in terms of clarity, tone, and structure, its output often contains inaccuracies and occasionally misleading information. At this stage, ChatGPT is not suitable as a stand-alone tool for end-users. However, it could serve as a helpful draft generator for human language consultants. The study also outlines ways to improve AI output, including better prompts and access to relevant databases. Although some fundamental limitations of AI remain, its controlled use in language consulting may offer practical support, especially in cases involving repetitive or less complex queries.

        Speaker: Urška Vranjek Ošlak
      • 5:00 PM
        Corpus-Based Vocabulary Profiling for Ukrainian: From Lexical Analysis to the PULS Digital Learning Platform 30m

        While CEFR-aligned vocabulary profiles have been developed for many languages (e.g., English, German, and Swedish), Ukrainian as a foreign language (UFL) still lacks an empirically grounded lexical profile. A foundational issue in creating such profiles is combining lexical frequency data with expert knowledge to assign CEFR-level labels. Existing UFL word lists rely primarily on professional expertise rather than systematic data analysis. The development of a Ukrainian vocabulary profile is further complicated by the prevalence of level-straddling textbooks, significant variability of vocabulary across learning materials, and the inherent inflectional complexity of the language. We aim to bridge these gaps by developing a graded word list for UFL learners (CEFR levels A1–C2), using a comprehensive, data-based approach to vocabulary classification.

        To this end, we have constructed a one-million-word corpus based on 21 UFL textbooks (A1–C2) using Ukrainian NLP tools and resources, namely the NLP-UK toolkit (github.com/brown-uk/nlp_uk) and the VESUM dictionary (vesum.nlp.net.ua), for automatic tokenization, lemmatization, and morphological tagging. The corpus has yielded a word list of 37,087 lemmas for which both frequency and distributional data (across levels and textbooks) were recorded. This dataset has enabled us to analyze lexical frequency, dispersion, and variability across a representative selection of UFL textbooks.

        Another data input was provided by general-language corpora. We have analyzed lemma frequency data from two Ukrainian corpora (GRAC and BRUK). By integrating frequency data from the three corpora with UFL expert analysis, we have assigned CEFR levels to each lexical item and categorized them by part of speech and communicative topic. Crucially, we have applied the significant onset of use approach (Alfter et al., 2016) to address inconsistencies in existing Ukrainian learning materials and achieve a reliable classification.
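
        The significant onset of use idea can be sketched as follows; the per-level counts, corpus sizes and threshold are hypothetical, and the actual procedure of Alfter et al. (2016) and of this project is more elaborate.

          # Minimal sketch of the 'significant onset of use' idea: assign a lemma
          # to the first CEFR level at which its relative frequency in the textbook
          # corpus exceeds a threshold. Counts, sizes and threshold are hypothetical.
          LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

          def onset_level(counts, level_sizes, threshold_per_million=10.0):
              for level in LEVELS:
                  per_million = counts.get(level, 0) / level_sizes[level] * 1_000_000
                  if per_million >= threshold_per_million:
                      return level
              return None  # too rare to be assigned a level

          level_sizes = {"A1": 120_000, "A2": 150_000, "B1": 180_000,
                         "B2": 200_000, "C1": 190_000, "C2": 160_000}
          lemma_counts = {"A1": 0, "A2": 1, "B1": 6, "B2": 14, "C1": 12, "C2": 9}
          print(onset_level(lemma_counts, level_sizes))   # -> B1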

        The paper outlines the methodology for vocabulary extraction, exploration, and profiling. Expert decision-making follows a two-stage CEFR alignment process to ensure accuracy, consistency, and pedagogically relevant progression. In the external alignment stage, experts independently assign proficiency levels to words. In the internal alignment stage, these assignments are refined by analyzing words within semantic and derivational clusters. This approach proves particularly effective for languages with complex morphology like Ukrainian.

        A CEFR-labeled vocabulary profile of 5,891 lexical items, with a target of 10,000 lemmas, developed through in-depth lexical analysis, is published on the PULS platform (puls.peremova.org). It is designed as a digital learning resource with lexical database functionality, allowing word list extraction by CEFR level, thematic group, and part of speech. Currently, A1 and A2 vocabulary items are available, with higher levels in progress. This profile serves as the foundation for the prospective Ukrainian Learner’s Dictionary (ULD), which will include detailed lexical entries with part of speech, CEFR label, thematic group, definition at the level of individual senses, corpus-based examples, pronunciation (audio), English equivalents, pictorial illustrations where relevant, and semantic and derivational relations.

        The PULS platform fills a critical gap in creating a comprehensive learning system for UFL. Its central component, the Ukrainian Learner’s Dictionary, is the first-ever CEFR-labeled corpus-based UFL reference source that will serve the needs of learners, educators, material creators, and proficiency test designers.

        Speakers: Olena Synchak, Vasyl Starko, Mariana Burak, Mykhaylo Svystun
      • 5:30 PM
        Identifying the Most Representative Phraseological Units Using Language Corpora and Artificial Intelligence for Lexicography: The Case of Slovenian Comparative Phrasemes 30m

        In preparing phraseological units for the third edition of the Standard Slovenian Dictionary (eSSKJ), the authors aimed to identify the most relevant comparative phrasemes in the contemporary standard language using objective corpus-based criteria. A key goal was to determine how representative specific phrasemes and their variants are in actual use. Two lists of the hundred most frequent comparative phrasemes with the structure adjective + kot ‘as’ + noun (e.g., bel kot sneg ‘white as snow’) were extracted from the metaFida v1.0 and CLASSLA-web.sl 1.0 corpora. The twenty most frequent were analyzed in greater detail. The results were compared with the Database of Comparative Phrasemes compiled from older dictionaries and collections, as well as with entries in eSSKJ. Artificial intelligence was also used experimentally to identify representative comparative phrasemes, with up to 80% alignment with expert choices.

        Speakers: Matej Meterc, Nataša Jakop
    • 4:30 PM 6:00 PM
      Parallel sessions 2 (Sonce hall) Sonce hall

      Sonce hall

      Convener: David Lindemann
      • 4:30 PM
        Corpus-Based Methods and AI-Assisted Terminography for Contextonym Analysis 30m

        This paper presents contextonym analysis as a hybrid method combining corpus-based techniques and generative artificial intelligence (GenAI) tools to support the writing of precise, context-sensitive terminological definitions. Grounded in the Flexible Terminological Definition Approach, this method is based on the premise that definitions should reflect the most relevant conceptual content activated in specific contexts. Contextonyms (frequent surface co-occurrents within a 50-word window) are extracted in word sketch (WS) form in Sketch Engine and help reveal salient semantic features of a target term without relying on predefined syntactic or semantic relations. The paper outlines strategies for interpreting contextonyms, including filtering concordance lines, consulting WSs, and prompting GenAI tools to assist with interpretation. A typology of contextonyms is proposed, along with a case study illustrating how the method supports the creation of domain-specific definitions. By combining corpus data with AI-assisted interpretation, contextonym analysis offers a robust and user-friendly approach to terminological definition writing.
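
        In the project itself contextonyms are extracted in word-sketch form in Sketch Engine; as a rough approximation, surface co-occurrents within a 50-token window can be counted as in the sketch below, where the file path, stoplist and target term are placeholders.

          # Rough approximation only: counts surface co-occurrents within a
          # 50-token window around a target term in a plain text file. Path,
          # stoplist and target term are placeholders.
          import re
          from collections import Counter

          WINDOW = 50
          STOPLIST = {"the", "of", "and", "a", "to", "in", "is", "that", "for"}

          def contextonyms(path: str, target: str, top: int = 30):
              text = open(path, encoding="utf-8").read().lower()
              tokens = re.findall(r"[a-zà-ÿ]+", text)
              counts = Counter()
              for i, token in enumerate(tokens):
                  if token == target:
                      window = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
                      counts.update(w for w in window if w not in STOPLIST and w != target)
              return counts.most_common(top)

          print(contextonyms("domain_corpus.txt", "biopsy"))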

        Speaker: Antonio San Martín
      • 5:00 PM
        User interaction with assistive technology for a thorough evaluation of a WCAG 2-compliant e-dictionary: assessing the accessibility of the Diccionario de la Lengua Española, v. 23.8 30m

        While the move to the digital design of lexical resources has, in principle, enhanced the physical and sensory accessibility of dictionaries, a lack of adherence to accessibility standards such as WCAG 2 (Web Content Accessibility Guidelines) (Campbell et al. 2023) can introduce significant barriers (NCD 2006; Botelho 2021). These barriers often hinder access to the information and capabilities within those tools or, at the very least, create a user experience that is far from equitable for individuals with disabilities (Lazar et al. 2015: ch. 3, 141; Griffith et al. 2020). However, is formal adherence to standards the only benchmark for actual accessibility to the information, resources and potential knowledge pathways within the e-dictionary?

        This study focuses on the accessibility challenges faced by e-dictionary users with visual disabilities. Their exclusion from intellectual or creative tasks frequently stems from ableist perspectives that unjustly assume all-encompassing disabilities for functionally diverse people (Sierra Martínez et al., 2024). However, research has shown that individuals who lack one sense or function often develop remarkable compensatory or divergent abilities (Occelli et al., 2017; Chebat et al., 2020; Sabourin et al., 2022), offering significant potential for professional and intellectual contributions. Yet they continue to face exclusion from educational and professional contexts due to systemic barriers.

        The Diccionario de la Lengua Española (Real Academia de la Lengua 2025), a key reference for the Spanish language, recently underwent a major redesign to achieve state-of-the-art accessibility by aligning with WCAG 2.2 guidelines, particularly as regards programmatic structure and labelling, visual findability and understandability, and use of WAI-ARIA (Accessible Rich Internet Applications) attributes for dynamic content and advanced user interface controls. But does this redesign thoroughly fulfil its accessibility goals?

        As users and accessibility experts have demonstrated, and as shown in the academic literature, a high score on automated validation tools and strict compliance with guidelines do not necessarily translate into genuine accessibility (Power et al. 2012; Lazar et al. 2015: 153-155). User research is critical in both lexicography (Lew & de Schryver, 2014; Tarp, 2019: 245-246) and accessibility studies (Lazar et al. 2015: ch. 8; Henry et al. 2020). This paper presents an exploratory usability test conducted by a blind user with standard competence in screen reader usage and high academic and professional qualifications, analysed and interpreted by a web accessibility expert. The results identify several areas for improvement in a resource that performs very well in terms of formal accessibility. Examination of actual interaction, however, led us to focus on potential usability problems at the macro- and microstructural levels of the dictionary, in its interaction patterns, and in how this information is conveyed through the assistive technology used, all of which can significantly reduce or even cancel out its effectiveness (Lew 2012).

        Our evaluation methodology combines spontaneous screen reader usability testing, code inspection, and the critical use of automatic validation tools. The results underscore the need for a more user-centred approach to complement existing standards. These findings can contribute to advancements not only in web accessibility standards and practices but also in accessible lexicographic design.

        Speakers: Jesús Torres del Rey, María García Garmendia
      • 5:30 PM
        From Word of the Year to Word of the Week: Daily-updated Monitor Corpora for 25 Languages 30m

        This paper presents a long-term, privately funded programme focusing on the collection of timestamped monitor corpora in a wide range of (currently 25) languages. These corpora are primarily designed for researching linguistic trends (including neology) and language change over time. They are available through the Sketch Engine platform and vary significantly in size, from 3 million tokens for Irish to over 100 billion tokens for English. The languages currently included are Arabic, Catalan, Chinese, Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Irish, Maltese, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Slovene, Spanish, Tamil, and Ukrainian; new languages are continuously being added, with Afrikaans, Amharic, Armenian, Azerbaijani, Georgian, Igbo, Indonesian, Oromo, Urdu, Uzbek and Yoruba being the next candidate set of languages to be added in the coming months.

        The corpora are constructed from news articles published on websites worldwide that offer content via newsfeeds (in RSS and Atom formats). Data coverage ranges from as early as 2014 for the oldest corpora to 2023 for the most recently introduced languages. New data is collected on a daily basis, and an update for each trend corpus is published twice a week. The current work builds on the previously published JSI Newsfeed Corpus (Krek & Herman, 2017), which provided news content only until 2022. Since 2021 for English and 2023 for the other languages, the data collection process has been carried out independently of the previous work, expanding the number of supported languages and incorporating new data sources. Sketch Engine already contains extra functionalities that are available to corpora with diachronic annotation. Our trend corpora offer analysis on a daily, monthly, quarterly or yearly basis, and besides the dedicated Trend function in Sketch Engine (Kilgarriff, 2015), such metadata can be used to refine a lexicographer’s analysis in concordance searches, wordlist discovery, or the study of the collocational behavior of words provided by the Word Sketch feature.

        Nearly 30,000 newsfeeds are queried six times a day, yielding up to 180,000 new articles per day on weekdays and more than 110,000 per day on weekends. The publication date is extracted from the information supplied by the feed, ensuring that time-stamping is as accurate as possible. The processing pipeline includes several web text cleaning procedures, namely main text body extraction, removal of near-duplicates, and enrichment of the data with linguistic annotations, following methodologies similar to those used for the JSI Newsfeed Corpus and the TenTen corpus family (Kilgarriff, 2014).
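
        The harvesting step (polling a feed and keeping the supplied publication timestamp) can be illustrated with the feedparser library; the feed URL is an arbitrary placeholder, and the sketch does not reproduce the actual pipeline.

          # Illustrative sketch of the harvesting step only (not the actual
          # pipeline): poll one RSS/Atom feed and keep the publication timestamp
          # supplied by the feed. The feed URL is an arbitrary placeholder.
          import time
          import feedparser

          feed = feedparser.parse("https://example.com/news/rss.xml")

          for entry in feed.entries:
              published = entry.get("published_parsed")   # time.struct_time or None
              stamp = (time.strftime("%Y-%m-%d %H:%M", published)
                       if published else "unknown")
              print(stamp, entry.get("link", ""), entry.get("title", ""))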

        In addition to corpus construction, the paper details statistics on feed activity – download volumes and the decay rate (how long an existing newsfeed typically remains active) – as well as the most represented websites per language. The paper also showcases examples of functionality offered by the Trend corpora that support corpus lexicography and linguistic research, including neologism detection, word sense shift analysis, and timeline-based analysis of trending words and phrases.

        Speakers: Ondřej Herman, Miloš Jakubíček, Jan Kraus, Vít Suchomel
    • 4:30 PM 6:00 PM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Geraint Rees
      • 4:30 PM
        Compiling a candidate list of taboo constructions for an under-resourced language 30m

        ONLINE PRESENTATION

        Taboo-language resources remain scarce for under-resourced languages like Afrikaans – despite their clear relevance for natural language processing (NLP) and applications in artificial intelligence (AI). Although Afrikaans has a long-standing lexicographic tradition, it still lacks an open-access, reusable lexical database for taboo language. One of the most crucial steps in developing a constructional database for taboo language is to identify a candidate list of taboo constructions for potential lexicographic treatment. This paper outlines and tests a range of procedures to compile and refine such a list, with the goal of establishing a replicable methodology for similar work in other under-resourced languages. The methods draw on existing data of different types and on corpora representing different registers. However, many entries are either false positives or ambiguous and require validation. Hence, we experiment with various semi-automated modelling techniques. These techniques include refining the candidate list through frequency analyses in corpora, expanding the list through partial corpus matching, and comparing the results against an attested, verified subset of taboo terms.

        Speakers: Monique Rabé, Martin J. Puttkammer, Gerhard B. van Huyssteen
      • 5:00 PM
        You get it through lexicography: extracting suppressed language from LLMs using lexicographic scenarios as jailbreaking tools 30m

        ONLINE PRESENTATION

        Taboo words present a challenge for a lexicographer to include and describe in a language resource, as they are forms of verbal violence. However, discarding offensive words from general-purpose lexicographic wordlists disregards the representation of an integral part of the mental lexicon. The present study aims at using lexicographic scenarios to jailbreak four GPT variants into retrieving offensive words that are frequently used yet undocumented in most lexicographic resources. While Large Language Models (LLMs) can be used to document a headword, the presence of taboo items may prevent these systems from providing an answer. Our results reveal that the type of model and the lexicographic framing of the extraction task improved the responses of the models and increased the success rate, with the optimal configuration reaching an 87.5% success rate. The AI-generated lexicon of offensive words currently contains approximately 250 headwords grouped into gender, age, religion and race categories. The words also vary in whether they are inherently or contextually offensive. A searchable, user-friendly version is accessible through https://arabic-studies.com/Elex/index.html. The main contributions of this lexicon are detecting lexicographically undocumented offensive terms, pointing to the negative context of several headwords and discovering new senses of apparently neutral ones. In addition, LLMs provide very useful morphological, semantic and socio-cultural information in the definitions, despite some inconsistencies and overgeneralizations. Although corpus evidence proved the success of LLMs in detecting offensive words and senses, the automatic evaluation of AI-generated example sentences showed their limited value from a pedagogical perspective.

        Speakers: Esra Abdelzaher, Ágoston Tóth
      • 5:30 PM
        A Corpus-Based Dictionary for the Endangered Megrelian Language 30m

        ONLINE PRESENTATION

        This paper presents a corpus-based approach to compiling a bilingual Megrelian-English online dictionary. The Megrelian language belongs to the UNESCO Atlas of the World’s Languages in Danger group of “increasingly endangered” languages, and faces a number of critical challenges, among them a lack of standardised resources, weak intergenerational transmission, and a minimal digital presence. Unlike widely spoken languages equipped with pretrained models and various linguistic tools, “increasingly endangered” languages like Megrelian lack even basic NLP tools such as annotated corpora, PoS taggers, and morphological analysers. Moreover, the complexity of their grammar and phonology requires special approaches that cannot simply be adapted from high-resource languages. To address these gaps, we developed an annotated corpus of contemporary Megrelian, consisting of 97,691 tokens and 60,959 types. It is based on data collected through fieldwork in Samegrelo, Georgia, from 2022 to 2025. The whole process was subdivided into two main stages: fieldwork conceptualization and data collection, followed by laboratory analysis and data processing.

        The bilingual Megrelian-English dictionaries were developed in parallel, using the same dataset processed in Fieldworks Language Explorer (FLEx, 2024). This approach enabled the integration of corpus annotations into the dictionary entries. Following the principles described in Atkins & Rundell (2008) and Gibbon & Van Eynde (2000), we used lexeme-based and root-based configurations, resulting in the creation of two dictionaries, both available online. The first dictionary is oriented toward the translation of individual words, while the second focuses on the translation of individual morphemes. In the first case, each lexical entry is supported by morphosyntactic information, phonetic transcription (IPA), glosses, and semantic descriptions. In the second case, the entries represent individual morphemes, providing not only glosses but also information about their occurrences and links to their use in the corpus. The finalised data is available online through https://xmf.iliauni.edu.ge/.

        The paper is subdivided into several parts: 1. Introduction, outlining the significance of Megrelian as part of the Kartvelian language family and introducing the project dedicated to the documentation of the Megrelian language; 2. Background and Data Collection, providing an overview of the existing Megrelian dictionaries and presenting the data collection stages; 3. Annotation and Corpus Development, describing the data annotation and processing stages and giving information on corpus size, linguistic coverage, etc.; 4. The Dictionaries - Design and Generation, presenting the configurations for both the lexeme-based and morpheme-based dictionaries, thoroughly describing the export and conversion stages, and outlining the linkage between the corpus and the dictionary entries; and 5. Conclusions, Challenges and Future Work, which summarises the corpus-based lexicographic approach to the Megrelian language, provides a short description of the ongoing challenges, and describes future plans concerning the use and potential improvement of the data.

        Speakers: Irina Lobzhanidze, Rusudan Gersamia
    • 7:00 PM 10:00 PM
      Gala dinner 3h Restaurant Špica

      Restaurant Špica

      Cesta svobode 9, 4260 Bled
    • 9:00 AM 10:00 AM
      Keynote: Keynote 3 Arnold hall

      Arnold hall

      Convener: Simon Krek
      • 9:00 AM
        We need to talk about data structures in lexicography 1h

        It has been almost half a century since we started “doing” lexicography on computers. Let’s stop for a minute now and take a critical look at the data models we have been using to represent the structure of dictionaries in dictionary writing systems and other software.

        In this talk, I will trace the history of lexicographic data modelling from its beginnings as text markup for retro-digitised dictionaries, to the present day when most dictionaries are born-digital. I will show that, regardless of which notation we use (XML, JSON or other), the underlying design pattern is almost always a tree structure in which the various content items (headwords, senses, definitions…) are arranged in a parent-child hierarchy.

        I will argue that the tree-structured pattern is not expressive enough to handle some phenomena that occur in dictionaries, such as entry-to-entry cross-references, the placement of multiword subentries, and complex hierarchies of subsenses. These things would be easier to manage in a graph-based data structure, such as a relational database or a Semantic Web-style knowledge graph.
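
        As a minimal illustration of this contrast (not of DMLex itself), the same two entries can be modelled first as a parent-child tree, where a cross-reference is just a string, and then as a small graph, where it is an explicit, traversable relation; all names and values below are invented.

          # Minimal illustration of the contrast discussed above (not DMLex itself).
          # Tree-shaped: the cross-reference can only be a string pointing out of
          # the hierarchy, so it cannot be validated or traversed directly.
          tree_entries = [
              {"headword": "sofa",
               "senses": [{"definition": "a long soft seat", "see_also": "settee"}]},
              {"headword": "settee",
               "senses": [{"definition": "a sofa"}]},
          ]

          # Graph-shaped: entries and senses are nodes; relations, including the
          # entry-to-entry cross-reference, are explicit, queryable edges.
          nodes = {
              "e1": {"type": "entry", "headword": "sofa"},
              "e2": {"type": "entry", "headword": "settee"},
              "s1": {"type": "sense", "definition": "a long soft seat"},
              "s2": {"type": "sense", "definition": "a sofa"},
          }
          edges = [
              ("e1", "hasSense", "s1"),
              ("e2", "hasSense", "s2"),
              ("s1", "seeAlso", "e2"),   # cross-reference as a first-class relation
          ]

          # A traversal the string-valued cross-reference does not offer directly:
          for source, relation, target in edges:
              if relation == "seeAlso":
                  print(nodes[source]["definition"], "->", nodes[target]["headword"])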

        Dictionary projects which insist on a purely tree-structured data model are failing to make full use of the digital medium. But upgrading to a graph-based data model is difficult because tree-structured thinking is entrenched in the minds of lexicographers and dictionary users alike. This talk will conclude with an introduction to DMLex, a recently standardised “Data Model for Lexicography” which aims to ease this transition by being a hybrid model, combining tree structures where possible with graph structures where necessary.

        Speaker: Michal Měchura
    • 10:05 AM 10:30 AM
      Parallel sessions 1 (Arnold hall) Arnold hall

      Arnold hall

      Convener: Carole Tiberius
      • 10:05 AM
        Contrasting a new AI-powered dictionary designed for on-screen reading with electronic dictionaries that have evolved from print editions 25m

        The use of LLMs in lexicography is a hot topic and indeed the focus of eLex 2025. In the past couple of years, several papers have emerged comparing existing dictionary entries with zero-shot chatbot queries (e.g. Nichols 2023) or with dictionary-like content obtained through the dynamic interaction between experts and chatbots (e.g. Lew 2023, Jakubíček & Rundell 2023). However, studies so far do not appear to have contrasted well-established dictionaries compiled and edited by lexicographers with new types of dictionaries conceived with AI support.

        This paper contrasts a new English dictionary created with the assistance of AI that has been designed for on-screen reading with two prestigious electronic dictionaries that have evolved from print editions. The definitions of 39 lexical items from a text on digital well-being published online in The Conversation (Shaleha 2024) were compared in: (a) The Oxford Dictionary of English (ODE), accessed directly from the reading screen by right-clicking on the target item when using an Apple device; (b) the Merriam-Webster Dictionary (MW), accessed via a separate tab from the on-screen reading material; and (c) the new Reverso dictionary, embedded in the reading material through a browser extension.

        To focus on vocabulary that readers of English as an additional language might genuinely want to look up, the lexical items included in the analysis were those marked as “off list” in a vocabulary profiling tool (Cobb, n.d.) and in Oxford 3000.

        The target items consisted of 13 adjectives, 17 nouns (3 plural), 8 verbs (of which 4 were inflected), and 1 adverbial expression. Part of speech was disambiguated contextually where needed (e.g. prolonged was classified as an adjective, not a verb).

        To assess the ease of consulting definitions for these items while reading on screen, the three dictionaries were compared according to the following parameters:

        1. Coverage (was the target sense provided?)
        2. Findability (was the target sense easy to spot?)
        3. Readability (how long were the definitions and what vocabulary did they use?)
        4. Look-up experience (how straightforward was it to access the dictionary while reading?)

        The main differences observed were with regard to the last two of the above. Although Reverso is not immune to known problems of AI in lexicography (Michta & Frankenberg-Garcia, 2025), it outperformed ODE and MW in terms of readability and look-up experience, offering readers short, easy-to-understand definitions that users can consult with minimal disruption while reading electronic texts.

        Speaker: Ana Frankenberg-Garcia
    • 10:05 AM 10:30 AM
      Parallel sessions 2 (Sonce hall) Sonce hall

      Sonce hall

      Convener: Janoš Ježovnik
      • 10:05 AM
        LLM-Assisted Dialect Lexicography: Challenges and Opportunities in Processing Historical Bavarian Dialects 25m

        This paper investigates the potential of LLMs in supporting lexicographic work on non-standard linguistic varieties using data from the Dictionary of Bavarian Dialects in Austria (WBÖ). Based on approx. 2.4 million digitized and TEI-encoded dialect paper slips published via the Lexical Information System Austria (LIÖ), we construct a domain-specific corpus and evaluate LLMs in semantic classification and dictionary entry generation. Key preparatory steps include metadata enrichment, glossary and ontology development, and prompt engineering combined with Retrieval-Augmented Generation (RAG) techniques. Preliminary results suggest that LLMs can assist in organizing dialectal material into coherent semantic groupings. However, challenges persist regarding data preprocessing, structural conformity, and selection of representative examples. We discuss methodological implications and outline future directions, including the integration of agent-based systems and fine-tuning approaches tailored to dialect resources. This study contributes to the broader discourse on AI-assisted lexicography, highlighting both the potential and limitations of current LLM technologies in handling underrepresented language varieties.

        Speakers: Philipp Stöckle, Daniel Elsner, Wolfgang Koppensteiner, Katharina Korecky-Kröll
    • 10:05 AM 10:30 AM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Bálint Sass
      • 10:05 AM
        Toward a corpus-based multilingual terminology database for Intercultural Communication 25m

        This contribution focuses on the methodological aspects of the ICoMuTe project aiming to design a corpus-based multilingual terminology database for Intercultural Communication (ICC). The project seeks to explore how ICC terms relate to each other within six European languages (Dutch, English, German, French, Italian, Spanish), how these terms are connected to their scientific and cultural contexts, and how they can be translated across different languages and cultures while preserving meaning.

        The selected approach is corpus-based, using comparable corpora of ICC handbooks and a parallel corpus of texts produced by the European Parliament dealing with key questions related to ICC. Using text recognition and data mining tools (e.g., Sketch Engine), the most frequent ICC terms per language are extracted and analysed in context. To account for the culturally specific aspects of terms while achieving a high degree of cultural neutrality, a tag-based semantic model has been developed for comparing and linking terms across languages in a neutral manner; natural-language, corpus-based definitions that reflect the cultural load of each term are also provided.

        The main findings suggest that semantic tags are relevant to balance the cultural specificity and neutrality of ICC terms, and that English acts as a reference linguistic and cultural framework for the emergence and development of terms in other languages.

        Speakers: María Iglesias Vázquez, Charlotte Venema, Marie Steffens
    • 10:30 AM 11:00 AM
      Coffee break 30m
    • 11:00 AM 12:30 PM
      Parallel sessions 1 (Arnold hall) Arnold hall

      Arnold hall

      Convener: Jelena Kallas
      • 11:00 AM
        DMLEX on Wikibase: Legacy dictionaries as collaboratively editable dataset 30m

        This paper presents an experimental workflow for converting legacy digitized dictionaries into the DMLex standard and subsequently importing them into a Wikibase instance. DMLex, a serialization-independent model developed by the OASIS LEXIDMA Technical Committee, aims to provide a universal and modular representation of lexicographic data. The study tested whether dictionaries from heterogeneous sources—originally encoded in internal XML formats—could be reliably transformed into DMLex-compliant representations and repurposed for collaborative editing and enrichment on a structured linked data platform. The transformation was achieved through a combination of rule-based scripts, manual refinement, and large language model assistance. While DMLex proved adaptable to a wide range of lexical phenomena, several limitations became apparent during the Wikibase integration phase. These findings suggest that practical deployment of DMLex benefits from clearer conventions and validation strategies when applied beyond theoretical modeling. The results confirm DMLex’s potential for future-proof dictionary modeling, while also highlighting areas where further specification and community consensus are needed to support its application in digital infrastructures and collaborative environments.

        Speakers: Simon Krek, Primož Ponikvar, Andraž Repar, Iztok Kosem, David Lindemann
      • 11:30 AM
        Image-to-Sense Alignment Using AI Tools 30m

        This paper evaluates the results of using GPT-4o mini language model batch processing with image recognition capability to align 1,572 images of 398 polysemous nouns in the Dictionary of the Slovenian Standard Language (second edition) to their specific dictionary senses, and it compares them to the results of the manual image-to-sense alignment process. The images were manually assigned to entries in a previous task, but no sense information was provided at the time. The language model showed relatively high overall agreement with the human annotator (i.e., 85.1%). In cases in which multiple senses were selected per image in both manual and automated annotation, the agreement was even slightly higher (i.e. in 89.4% of all sense evaluations). The agreement rate was higher when the language model evaluated only the matching senses and lower when it also evaluated the non-matching senses within the entry.

        Speakers: Andrej Perdih, Dejan Gabrovšek, Janoš Ježovnik
      • 12:00 PM
        Woordpeiler: A New Tool for Visualizing and Analyzing Lexical Trends in Contemporary Dutch 30m

        Representative monitor corpora with detailed metadata offer a solid empirical basis for documenting lexical innovation and change (Kosem et al. 2021). However, continuously updated time-stamped textual data presents challenges for data management, lexicographic analysis, and visualization. Building on its existing corpus infrastructure, the Dutch Language Institute (INT) has developed Woordpeiler (“Word Pollster”, https://woordpeiler.ivdnt.org/), an online application to (a) visualize and analyze word frequencies over time and (b) support the analysis of neologisms and lexical trends in Dutch since 2000.

        As part of its mission to maintain a sustainable Dutch language infrastructure, INT developed the Corpus Hedendaags Nederlands (CHN), currently (September 2025) containing 4.3 billion tokens across 10.6 million documents. The corpus supports INT’s lexicographic workflow and is available through CLARIN. Daily and yearly data from major Dutch-language newspaper publishers (in the Netherlands, Belgium, Suriname, and the Dutch Caribbean) is processed via an automated workflow. All data is converted into a unified TEI format, enriched with metadata (e.g. language variety) and linguistic annotation. Using INT’s BlackLab system (de Does et al. 2017), the data is indexed and published as weekly (internal) or monthly (external) CHN updates.

        While CHN users could already obtain word frequencies through BlackLab’s query interface, Woordpeiler adds visualization and trend analysis tools. Frequency data for POS-tagged word forms, lemmas, and bigrams are exported to a PostgreSQL database optimized with TimeScaleDB. Through Woordpeiler’s interface (Fig. 1), users can generate interactive graphs for words and bigrams to visualize and compare changes in absolute and relative frequencies across customizable time intervals (day, week, month, year). Wildcards can be used in searches, and graphs can be filtered or split by language variety (Belgium, Netherlands, Suriname, Caribbean), with tooltips providing statistics and links to the underlying corpus data. In advanced search, users can refine searches by lemma, part of speech and newspaper (only internally). Graphs can be downloaded as PNGs or shared through unique URLs.

        A separate pane (Figure 2) offers additional trend analyses (currently only available internally). One function detects “trending” words or bigrams in a given interval using simple maths keyness (Kilgarriff 2009) relative to the preceding period. Users can adjust smoothing and also detect disappearing words via inverse keyness. A second function identifies new words or bigrams in a selected interval, optionally allowing a limited number of earlier nonce occurrences. Results appear as sortable, POS-filterable lists with accompanying frequency graphs.
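
        As a minimal sketch of the “simple maths” keyness score mentioned above (Kilgarriff 2009), the following Python snippet compares per-million frequencies in a focus interval against the preceding period; the function and parameter names are illustrative and not taken from Woordpeiler’s actual code.

        # Illustrative sketch of "simple maths" keyness (Kilgarriff 2009):
        # score = (fpm_focus + k) / (fpm_reference + k), with k as the
        # adjustable smoothing parameter. High scores mark trending items;
        # the inverse ratio highlights disappearing ones.
        def simple_maths_keyness(freq_focus, size_focus, freq_ref, size_ref, k=1.0):
            fpm_focus = freq_focus / size_focus * 1_000_000
            fpm_ref = freq_ref / size_ref * 1_000_000
            return (fpm_focus + k) / (fpm_ref + k)

        # Example: a bigram occurring 120 times in a 10M-token month vs.
        # 15 times in the preceding 50M-token period -> keyness 10.0.
        print(simple_maths_keyness(120, 10_000_000, 15, 50_000_000))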

        Woordpeiler and its database are fully integrated into INT’s corpus-processing workflow, minimizing publication lags and ensuring quality control. The tool will support corpus-lexicographic work by adding validated frequency information to the central lexicon GiGaNT and improving workflows for identifying neologisms and out-of-dictionary words. Additionally, Woordpeiler serves science communication and outreach goals: it underpins a monthly and annual Woordpeiling (“Word Poll”) shared via INT’s website and social media, and it is used in educational materials about language variation and change for secondary school students.

        Speakers: Kris Heylen, Vincent Prins, Katrien Depuydt, Jesse de Does, Laura van Eerten, Thomas Haga
    • 11:00 AM 12:30 PM
      Parallel sessions 2 (Sonce hall) Sonce hall

      Sonce hall

      Convener: Ana Frankenberg
      • 11:00 AM
        Inductive Categorization for Conceptual Analysis with LLMs: A Case Study from the Humanitarian Encyclopedia 30m

        Corpus-based conceptual analysis for the Humanitarian Encyclopedia (HE) grapples with vast amounts of lexical data to describe the meaning of key humanitarian notions and detect conceptual variation among actors (Odlum & Chambó, 2022). By building on Frame-based Terminology (Faber, 2015, 2022), the HE is incorporating qualitative methods necessary to subsume lexical data into manageable semantic triples in a way that ensures the traceability and transparency of modeling decisions.

        While traditional inductive qualitative analysis is labor-intensive, researchers are now replicating these methods using LLM-assisted workflows. Following this trend, our paper presents an observational study with a dataset of 274 spans labeled as causes of forced displacement that were manually annotated on a random sample of 1,000 concordances obtained from an English corpus of humanitarian documents from ReliefWeb (Isaacs et al., 2024). In this initial assessment, we test LLM inductive categorization using four models locally: Magistral Small 1.0 (Mistral-AI et al., 2025) with 24 billion parameters and three DeepSeek R1 models (DeepSeek-AI, 2025), with 8, 32 and 70 billion parameters. They are evaluated against a manual categorization comprising 34 causality groupings produced by two annotators through consensus.

        To assess baseline similarities, we provide models with minimal, zero-shot instruction, while also requiring structured outputs and conducting 40 runs per model (10 runs per text format: lines, CSV rows, JSON dictionary and Python list). We evaluate model fitness by measuring (1) degree of task completion, (2) category assignment similarity to the gold standard and (3) semantic overlap of LLM-generated category labels with those in the gold standard. For category assignment similarity, multiple Jaccard similarity scores were converted into a single normalized measure. Category labels from the top ten runs (those exhibiting the highest degree of category assignment similarity) demonstrated semantic overlap with manual labels. Nevertheless, the results were mixed: some LLM-generated labels were invalid, whereas others, although absent from the gold standard, were considered pertinent by the annotators.
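
        One plausible reading of the category-assignment measure described above is a per-category Jaccard similarity between gold-standard and model assignments, averaged into a single normalized score; the sketch below is illustrative only, and the study’s exact aggregation may differ.

        # Hypothetical sketch: Jaccard similarity per category between the
        # gold-standard span assignments and the LLM's assignments, averaged
        # into one score in [0, 1].
        def jaccard(a: set, b: set) -> float:
            return len(a & b) / len(a | b) if (a | b) else 1.0

        def mean_jaccard(gold: dict, predicted: dict) -> float:
            categories = set(gold) | set(predicted)
            scores = [jaccard(gold.get(c, set()), predicted.get(c, set()))
                      for c in categories]
            return sum(scores) / len(scores)

        # Toy example with span IDs grouped under two causality categories.
        gold = {"armed conflict": {1, 2, 5}, "natural disaster": {3, 4}}
        pred = {"armed conflict": {1, 2}, "natural disaster": {3, 4, 5}}
        print(mean_jaccard(gold, pred))  # ~0.67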

        In conclusion, models displayed low overall similarity scores when given little instruction and hundreds of spans to classify in one batch, consistently omitting spans despite being prompted not to do so. Outlier runs achieved similarity scores comparable to annotators, while revealing useful insights not captured in the manual categorization. The results underscore the complexity of categorizing data for a single, domain-specific concept. However, this also highlights the potential of LLMs as complementary tools for qualitative analysis tasks in the conceptual analysis workflow of the HE. Future work will investigate multi-category tasks, hybrid human-in-the-loop approaches, refined prompting strategies, and additional pre- and post-processing of lexical data.

        Speakers: Loryn Isaacs, Santiago Chambó, Pilar León-Araúz
      • 11:30 AM
        Passive Vocabulary of Czech Native Speakers: A Statistical Approach 30m

        This paper explores the theory of measuring vocabulary size, including the various methods that can be used and the parameters that have to be set. We have examined the experiments carried out on English and Dutch. Goulden et al. (1990) claim the average native speaker knows about 17,000 English base words (non-derived words). Keuleers et al. (2015) and Brysbaert et al. (2016) claim the average native speaker with secondary education knows about 42,000 headwords (lemmas). We have conducted an experiment similar to that of Keuleers and Brysbaert on Czech, with the input of 100,000 letter sequences from the wordlists of large web corpora. We assume the vocabulary size of Czech native speakers (as well as the vocabulary size of native speakers of any language) could be bigger, exceeding 57,000 (Czech) headwords, should we provide the participants with more inputs (150,000 sequences, or even more) or should we count the specialized terminology of their fields of interest.
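
        For orientation, yes/no vocabulary tests of the Keuleers/Brysbaert type commonly correct the “yes” rate on real words for guessing by subtracting the “yes” rate on pseudowords and scaling by the reference headword list; the sketch below illustrates that general idea under those assumptions and is not the authors’ actual procedure.

        # Illustrative guessing-corrected estimate for a yes/no vocabulary test.
        def estimate_vocabulary(yes_on_words, n_words, yes_on_pseudo, n_pseudo, lexicon_size):
            hit_rate = yes_on_words / n_words            # "yes" on real words
            false_alarm_rate = yes_on_pseudo / n_pseudo  # "yes" on pseudowords
            known_proportion = max(0.0, hit_rate - false_alarm_rate)
            return known_proportion * lexicon_size

        # Example: 70% "yes" on real words, 5% on pseudowords, against a
        # hypothetical 90,000-headword reference list -> ~58,500 headwords.
        print(estimate_vocabulary(700, 1000, 10, 200, 90_000))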

        Speakers: Marek Blahuš, Miloš Jakubíček, Vojtěch Kovář, František Kovařík
      • 12:00 PM
        Automatically Updated Corpora of EU National Parliaments with Terminology Extraction in Twenty Languages 30m

        We present a collection of monolingual text corpora derived from the steno protocols of 30 parliamentary chambers across 22 EU member states, covering 20 languages. The corpora are continuously and automatically updated, enabling intralingual and cross-lingual analysis of parliamentary discussions. Each chamber’s protocols are regularly downloaded, processed, and transformed into a unified prevertical text format. A terminology extraction grammar is available for each language, allowing the identification of terms specific to each parliament by comparing the parliamentary debates with a general-language reference corpus (or a custom subset of the debates with the whole collection). The corpora include timestamps, enabling the observation of trending topics across all European national parliaments within a single platform. Corpus quality depends on the availability and format of the source data, which ranges from simple text files, DOCX and HTML to XML and JSON (with documented APIs). A monitoring system ensures ongoing compatibility with any format changes. Currently, the corpora consist of over 2.8 billion words and are managed in Sketch Engine.

        Speakers: Marek Blahuš, Ota Mikušek
    • 11:00 AM 12:30 PM
      Parallel sessions 3 (Zrak hall) Zrak hall

      Zrak hall

      Convener: Tomasz Michta
      • 11:00 AM
        A Pipeline for Automated Dictionary Creation with Optional Human Intervention 30m

        This paper presents a modular pipeline for automated dictionary creation using large language models (LLMs). It addresses the well-known limitations of prompting systems such as ChatGPT to produce entire entries in a single step – outputs that may read fluently but often lack structural consistency, transparency, originality and verifiability. The proposed system overcomes these weaknesses by decomposing the lexicographic process into a sequence of narrowly constrained, XML-validated stages, each guided by custom-crafted prompts and Document Type Definitions (DTDs). Rather than asking an LLM to “write a dictionary entry,” the system treats it as a disciplined assistant performing a defined subtask under strict supervision.

        At each stage – ranging from extracting and shortening corpus examples to grouping, defining, translating and formatting – the output is verified against an XML grammar and preserved for audit. This structure enforces reproducibility and allows human intervention at any point, combining the speed and adaptability of machine generation with the oversight and accountability of traditional lexicography. The process is entirely corpus-grounded: every example can be traced to a verifiable source, and every decision in the pipeline is documented. Errors can be corrected where they occur rather than through repeated prompting, and edited intermediate files can be reintegrated seamlessly into the workflow.
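
        The per-stage validation idea can be sketched in Python with lxml; the tiny DTD, element names and example sentence below are illustrative placeholders, not the pipeline’s actual schemas or data.

        from io import StringIO
        from lxml import etree

        # Toy grammar for one stage's output: corpus examples with a source attribute.
        stage_dtd = etree.DTD(StringIO("""
        <!ELEMENT examples (example+)>
        <!ELEMENT example (#PCDATA)>
        <!ATTLIST example source CDATA #REQUIRED>
        """))

        output = etree.fromstring(
            '<examples><example source="corpus:doc42">Han er lidt nørdet.</example></examples>'
        )

        # Reject (or hand back to a human editor) any stage output that does not
        # conform to the grammar before the next stage runs.
        if not stage_dtd.validate(output):
            raise ValueError(stage_dtd.error_log.filter_from_errors())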

        Technically, the pipeline is implemented in Python and designed to integrate easily with standard dictionary environments such as IDM’s DPS system. It is language-agnostic and domain-independent: prompt files and DTDs can be adapted to any language pair, dictionary type or corpus source. The modular architecture also enables the insertion of new stages – for example, automatic tagging of usage labels, collocations or etymological notes – without altering the underlying structure. The system produces both machine-readable XML output and human-friendly Markdown files for editorial review, ensuring compatibility with established lexicographic and publishing workflows.

        Two sample entries for the Danish adjective nørdet demonstrate that the pipeline achieves consistent formatting, transparent sourcing and idiomatic translations while avoiding plagiarism and hallucination. Evaluation suggests that each complete run (typically five stages) produces a usable draft entry at minimal cost and within seconds. The approach therefore provides a sustainable framework for dictionary production, especially for under-resourced languages or specialised terminologies where editorial time and funding are limited.

        By embedding formal validation and corpus traceability into every step, the system offers a practical model for responsible integration of LLMs into lexicography. It shifts the human role from mechanical compilation to high-level editorial judgement, enabling lexicographers to supervise, refine and extend AI-generated content with full transparency. Released as open source under the MIT Licence, the pipeline invites adaptation, experimentation and community collaboration.

        Speaker: Thomas Widmann
      • 11:30 AM
        Better something than nothing: Analysis of GPT-4 performance in identifying Croatian proverbs 30m

        The task of automatic detection of idiomatic expressions such as proverbs is an established problem in natural language processing. Before the advent of large language models, attempts were made to describe proverbs by modelling their syntactic structure (Rassi et al., 2014). Later, others employed contextual embeddings and neural networks to identify idioms (Škvorc et al., 2022), a task closely related to proverb detection.

        This research effort aims to analyse the performance of the ChatGPT large language model (ChatGPT 4o) in the task of detecting proverbs and proverb-related expressions. As proverbs are often used in political discourse to underscore messages or augment arguments and points of view (Gándara, 2004), the research presented here will use the minutes of the Croatian parliament sessions made available by the Croatian parliamentary corpus ParlaMeter-hr (Dobranić et al., 2019) to build a list of proverbs occurring in contemporary discourse.

        A list of 151 Croatian proverbs used in contemporary speech and texts was obtained from Varga & Matovac (2016) and other sources. Proverbs are mostly used as idiomatic expressions, with little variation. This fact was used to create a custom simple fuzzy search algorithm, which was then applied to a small section of the ParlaMeter-hr corpus to extract sentences which contain proverbs. The extracted list was further manually checked and verified. This simple search technique yielded 126 confirmed occurrences of sentences which contained proverbs.
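
        A hypothetical reconstruction of such a simple fuzzy search is sketched below (the paper’s actual algorithm and threshold are not reproduced here): it slides a window of roughly the proverb’s length over each sentence and keeps matches above a similarity threshold.

        from difflib import SequenceMatcher

        def find_proverb(sentence: str, proverb: str, threshold: float = 0.8):
            words = sentence.lower().split()
            target = " ".join(proverb.lower().split())
            n = len(proverb.split())
            # Compare every n-word window of the sentence against the proverb.
            for i in range(max(1, len(words) - n + 1)):
                window = " ".join(words[i:i + n])
                if SequenceMatcher(None, window, target).ratio() >= threshold:
                    return window
            return None

        print(find_proverb(
            "Kolege, tko rano rani dvije sreće grabi, rekao bih.",
            "Tko rano rani, dvije sreće grabi"
        ))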

        The next step included prompting GPT-4o with a combination of prompts to determine its ability to detect proverbs, using both the chat and API interface. The prompts ranged from a very simple zero-shot to elaborate instructions with accompanying list of proverbs.

        In response to the chat-based zero-shot prompt, GPT-4o produced a list of Croatian proverbs containing only 12 items, and uploading the curated list of proverbs resulted in only 54% accuracy. The API prompts returned better results: the zero-shot prompt reached 79% accuracy in under 5 minutes, while the most elaborate many-shot prompt using the curated list of proverbs reached 94% accuracy, but took over 120 minutes at an increased financial cost.

        Speaker: Nikola Bakarić
    • 12:30 PM 12:45 PM
      Closing
    • 2:00 PM 6:00 PM
      Workshop: Globalex workshop on Lexicography and Neology Sonce hall

      Sonce hall

      • 2:00 PM
        Welcome 10m
      • 2:10 PM
        From Poetry to Lexicon: The Role of Lexicography in Documenting Literary Neologisms 25m
        Speaker: Nikos Mathioudakis
      • 2:35 PM
        A Study into the Translation of Chinese Culture-bound Neologisms in Large Chinese-English Dictionaries 25m
        Speaker: Jinhong Huang
      • 3:00 PM
        Lexicographical treatment of homophonic neologisms in Chinese dictionaries 25m
        Speakers: Jiang Li, Wang Yi
      • 3:25 PM
        The energy crisis in Germany and its impact on general language vocabulary 25m
        Speaker: Alexander Geyken
      • 3:50 PM
        Coffee break 30m
      • 4:20 PM
        ENEOLI Wikibase: A collaborative working platform for the European Network on Lexical Innovation 25m
        Speakers: Ana Salgado, David Lindemann
      • 4:45 PM
        Evaluating Suitability of Open-Weight Large Language Models for Neologism Detection in Lithuanian: Methodological and Practical Issues 25m
        Speaker: Marius Glebus
      • 5:10 PM
        The challenge of AI-generated neologisms 25m
        Speakers: Cécile Poix, Natalya Shevchenko
      • 5:35 PM
        Discussion and closing 25m