The present research explores the use of large language models (LLMs) in digital lexicography, specifically for translating Italian multiword expressions (MWEs) into English and French.
The study aims to assess the capability of contemporary LLMs to provide accurate and reliable English and French translation equivalents, examples and definitions for Italian MWEs, while also...
The paper outlines technological and methodological ways to arrange the dictionary parsing process. The website of the Spanish dictionary (Diccionario de la lengua española, 23rd ed. – DLE 23, https://dle.rae.es/) serves as the basis for the research. First of all, as the most complex multi-parameter lexicographic frameworks, explanatory dictionaries of national languages are of particular interest because...
The public release of ChatGPT in late 2022 made an impact on many professional domains. Notwithstanding the many controversies surrounding Generative Artificial Intelligence (GenAI), such as ethics, copyright, accountability, or ecology, we need to acknowledge an important and relevant feature of Large Language Models and chatbot systems built around them: their ability to produce mostly...
Recent findings indicate that current large language models (LLMs) face difficulties in generating clear-cut, well-motivated definitions in a consistent way. This shortcoming is the consequence of their reliance on opaque data sources and their inherently unstable, non-deterministic outputs. In response, this research aims to develop an LLM-based methodology for producing adjectival...
The COST Action ‘European Network on Lexical Innovation’ (ENEOLI) conducted a comprehensive survey in October–November 2024 regarding the methods, practices, tools, and resources used in the study and documentation of lexical innovations, including neologisms and novel senses. The 249 respondents from 50 countries represented linguists, lexicographers, terminologists, translators, software...
Traditionally, optical character recognition (OCR) of historical texts has primarily been conducted using specialised software such as Transkribus, eScriptorium, Kraken, and similar tools. To achieve accurate character recognition, these systems require extensive pre-training and the creation of a refined "ground truth" dataset. The comprehensiveness of model pre-training directly correlates...
Finding non-recorded senses is important for dictionary maintenance, where using automatic methods helps reduce manual efforts. We use automatic Word Sense Induction (WSI) to compare recorded sense numbers among a sample of headwords in a comprehensive Swedish monolingual dictionary with induced sense numbers for the same words in a Swedish corpus. We propose this as a simple technique to find...
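The comparison described in this abstract can be sketched as follows: induce a sense count for a headword by clustering its context vectors, then flag the word when the induced count exceeds the count recorded in the dictionary. This is only an illustrative sketch under assumed inputs (synthetic vectors, a toy clustering criterion), not the study's actual WSI system.

```python
# Sketch: induce a sense count by clustering context vectors of a
# headword, then flag a gap if the induced count exceeds the number
# of senses recorded in the dictionary. Data below is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def induce_sense_count(context_vectors, max_senses=5):
    """Pick the cluster count with the best silhouette score."""
    best_k, best_score = 1, -1.0
    for k in range(2, max_senses + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(context_vectors)
        score = silhouette_score(context_vectors, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

rng = np.random.default_rng(0)
# Two well-separated clouds of context vectors -> two induced senses.
vectors = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
induced = induce_sense_count(vectors)
recorded = 1  # senses listed in the dictionary for this headword
if induced > recorded:
    print(f"candidate non-recorded sense(s): induced={induced}, recorded={recorded}")
```

In practice the context vectors would come from a contextual language model and the model-selection criterion from the chosen WSI method; the silhouette heuristic here merely stands in for that step.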
The purpose of the presentation is to explore the design and development of an innovative online pedagogical dictionary of Greek Sign Language, specifically tailored to the linguistic and educational needs of Deaf and Hard-of-Hearing (DHH) learners in Greece. Emphasizing accessibility and pedagogical usability, the dictionary integrates Artificial Intelligence (AI) technologies to support...
Constructicography, or the description of grammatical constructions in a lexicographic format, is an emerging field currently in the stage of developing and automating methods for treating large numbers of (semi-)schematic constructions. This study explores how existing lexicographic data and language models can be used to facilitate the constructicographic workflow. Our results suggest that...
This paper introduces the Dictionary of Contemporary Serbian Language (RSSJ), an ongoing large-scale digital lexicographic project designed to serve both human users via web and mobile applications and machines through APIs. Coordinated by the diaspora association “Gathered around the Language” and the Society for Language Resources and Technologies (JeRTeh), RSSJ aims to produce a dictionary...
Large Language Models (LLMs) tend to expose severe language and cultural biases when working in medium- and low-resourced languages. In this paper, we present our work on Danish benchmarking and evaluation of LLMs to more precisely diagnose and potentially remedy such bias. To this aim, we apply available lexical-semantic resources to compile a set of Natural Language Understanding (NLU) tasks...
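One way to see how lexical-semantic resources can be compiled into NLU tasks is a multiple-choice synonym item built from a thesaurus-style entry; the words, helper name, and item format below are invented for illustration and are not taken from the Danish benchmark itself.

```python
# Illustrative sketch (invented data): turning a lexical-semantic entry
# into a multiple-choice NLU item of the kind used for LLM benchmarking.
import random

def make_synonym_item(target, synonyms, distractors, seed=0):
    """Build one shuffled multiple-choice item with a known answer index."""
    rng = random.Random(seed)           # fixed seed -> reproducible items
    options = [synonyms[0]] + distractors[:3]
    rng.shuffle(options)
    return {"question": f"Which word is a synonym of '{target}'?",
            "options": options,
            "answer": options.index(synonyms[0])}

item = make_synonym_item("glad", ["happy"], ["tired", "blue", "loud"])
print(item["options"][item["answer"]])  # → happy
```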
This paper reports on recent advancements in the development of the Mangalam Dictionary of Buddhist Sanskrit, the first corpus-driven dictionary dedicated to Buddhist Sanskrit. This is a low-resource, historical, and domain-specific language variety instantiated in South Asian Buddhist literature dating from approximately the first millennium CE. The paper focusses on advances in the...
The paper introduces a hybrid methodology for cross-linguistic identification of phraseme constructions, developed within the scope of a pilot study on Croatian repetitive constructions. The study explores how artificial intelligence and corpus technologies can be systematically combined to uncover functionally equivalent patterns across languages. The proposed strategy rests on three...
The use of corpora is well established in lexicography, also in Estonia, but since the analysis of corpus data and the post-editing of automatically generated corpus data are labour-intensive, large language models (LLMs) have attracted growing interest in lexicography (e.g., Evert et al. 2024; Kosem, Gantar et al. 2024; Tiberius et al. 2024). In 2024, the Institute of the...
The Dutch Language Institute (INT) has a long tradition of compiling historic and contemporary dictionaries and other types of lexicographic databases, mainly for Dutch but also for some other languages related to Dutch. Lexicographic work at the institute is computer-supported, but a great deal of manual work is still involved. Therefore, INT is exploring how new technologies...
Due to the policy of Russification in the 20th century, the Ukrainian language underwent an influx of Russianisms, among other forms of interference with its structure. Today, many Ukrainians require guidance regarding non-Russified usage, and a Large Electronic Dictionary of Ukrainian (VESUM, vesum.nlp.net.ua) is designed to meet this need. With a register of over 430,000 lemmas, it is the...
The focus of this paper is on Generative Artificial Intelligence (GenAI), chatbots and some implications for lexicography and dictionary use. It has been well documented that chatbots originally tended to “hallucinate” if they did not have an answer to the prompt put to them. Much larger training databases have, however, been developed and chatbots have become more accurate. Multiple...
In lexicography, one of the long-standing issues is understanding the nature of its core element of description commonly referred to as the headword (in DMLex and traditional lexicography), canonical form (in OntoLex and the Lexical Markup Framework – LMF), orthographic form (in the Text Encoding Initiative – TEI Lex0), lemma (in Wikidata), or lexical unit. With the transition from paper to...
This study explores the use of several chatbots based on recent generative large language models for automatic term extraction (ATE) from smaller text samples. The samples were selected from three domains: board games, ice hockey, and kitesurfing; and they cover three languages: English, French, and Portuguese. We used four prompting strategies: zero shot, one shot, few shots, and few shots...
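The difference between the prompting regimes named above comes down to how many worked demonstrations precede the input text. A hypothetical sketch of prompt construction, with invented instructions and example data (the study's actual prompts are not reproduced here):

```python
# Hypothetical sketch of zero-/one-/few-shot prompt construction for
# automatic term extraction (ATE); instructions and examples are invented.
def build_ate_prompt(text, examples=None):
    """Zero-shot if `examples` is empty, otherwise one-/few-shot."""
    parts = ["Extract the domain-specific terms from the text below.",
             "Return one term per line."]
    # Each (text, terms) pair becomes one demonstration.
    for sample_text, terms in (examples or []):
        parts.append(f"\nText: {sample_text}\nTerms:\n" + "\n".join(terms))
    parts.append(f"\nText: {text}\nTerms:")
    return "\n".join(parts)

zero_shot = build_ate_prompt("The goalie made a glove save in overtime.")
one_shot = build_ate_prompt(
    "The kite looped during the water start.",
    examples=[("White wins a meeple on the scoring track.",
               ["meeple", "scoring track"])])
print(zero_shot.count("Terms:"), one_shot.count("Terms:"))
```

Few-shot prompting simply passes several demonstration pairs instead of one; the final `Terms:` slot is where the model's completion is expected.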
This paper presents the Oxford English Dictionary’s (OED) current exploration into the application of artificial intelligence to historical Word Sense Disambiguation (WSD), a fundamental aspect of OED’s core research. Building on a longstanding tradition of technological innovation, the OED is investigating how Large Language Models (LLMs) can support the identification and retrieval of...
The Vienna Corpus of Arabic Varieties (VICAV) is a digital research infrastructure for the documentation and analysis of the linguistic diversity of Arabic varieties. Integrating methods from language technology and the digital humanities, VICAV provides a modular, sustainable platform for the creation, management, and publication of heterogeneous language resources within a shared data...
Generic nouns such as Sache and Ding pose a challenge for semantic annotation due to their referential underspecification and context-dependent meaning. Although frequently classified under categories like {artefact} or {object}, their actual referents often belong to abstract or cognitive domains, as in Der Placeboeffekt ist eines der faszinierendsten Dinge in der Welt der Medizin (‘The placebo effect is one of the most fascinating things in the world of medicine’). Drawing on...
In this paper we show how the academic content and computational tools featured in Lexicom form a parallel history of the last 25 years of innovation in lexicography. Lexicom is a 5-day intensive workshop offering hands-on training in corpus-based dictionary creation, from collecting and annotating language data to publishing the final product. Since it was launched in 2001 by Sue Atkins, Adam...
POSTER
Writing dictionary entries is not only time-consuming but also an expensive process due to the highly specialized knowledge and experience required of the lexicographer. To facilitate the task of compiling the Danish monolingual dictionary DDO (ordnet.dk/ddo), we aim to establish an automatic assistant based on applied language technology (e.g. n-gram analysis, word embeddings, etc.)...
DEMO
CJVT igre (https://igre.cjvt.si/) is a new digital platform offering word games designed to foster lexical awareness and engagement with standard Slovene. Developed by the Centre for Language Resources and Technologies at the University of Ljubljana, the portal currently hosts three games—Cvetka, Besedolov, and Vezalka—with two more in development. Each game utilizes curated lexical...
POSTER
The representation of medical adjectives in Croatian general dictionaries reveals significant inconsistencies, reflected in uneven lemma inclusion, ambiguous or absent domain labels, and limited definitional precision. This paper analyzes the 80 most frequent adjectives, based on corpus data from the Croatian Medical Corpus (CMC) (Kocijan, Kurolt & Mijić, 2020), in the three major...
POSTER
This paper presents a novel approach to exploring derivational families within the framework of Intelligent Lexicography, using the ŠKOLARAC corpus: a collection of Croatian school essays written by L1 learners (native-speaking students) in grades 5 through 8 and enriched with metadata such as gender, grade level, and region. By combining rule-based linguistic processing in NooJ, a...
POSTER
Taking seriously the common construction grammar statement that “it’s constructions all the way down” (Goldberg, 2006: 18), the Hungarian Constructicon aims to encompass the widest possible range of constructions. As it is a dictionary-based constructicon, it naturally contains what a dictionary can provide — from morphemes to words, and to partially schematic multiword constructions...
POSTER
The lack of normative resources for the Croatian language has prompted the development of a novel resource that would not only compile normative data for Croatian but also focus on an underrepresented group of linguistic units – figurative multi-word expressions (MWEs). Thus, the creation of a normative database for figurative MWEs in Croatian is a significant step in the right...
POSTER
The objective of the research is to develop a technology for converting the text of a specialized dictionary into a website with a fully developed user interface.
The object of the study was the “Dictionary of Ukrainian biological terminology” (7,342 entries and about 26,000 terms in Ukrainian, Russian and English), which contains definitions, information on term polysemy and synonymy, and stress marks for the Slavic languages,...
POSTER
As part of the COST Action CA21167 Universality, Diversity and Idiosyncrasy in Language Technology (UniDive), the ELEXIS-WSD Parallel Sense-Annotated Corpus (Martelli et al., 2021; Čibej et al., 2025) is being expanded to include subcorpora in additional languages—among them, Croatian—as well as new annotation layers. Each language subcorpus of ELEXIS-WSD contains the same 2,024...
POSTER
In this paper, we provide a comprehensive overview of the way in which the morpho-syntactic properties of multiword expressions are represented in lexical resources to support Natural Language Processing downstream applications. Starting from an up-to-date and comprehensive overview of the existing lexica dedicated to multiword expressions and containing their syntactic description,...
POSTER
The Information and Communication Technologies (ICT) field has evolved rapidly in recent decades. Thus, to describe new devices, activities, and concepts that appear yearly, a vast number of terms are created primarily in English, while other languages rely on secondary term formation (STF) for ICT end-users (ETSI Guide, 2022). Systematic secondary rendering and dissemination...
Collocations are a well-covered research area in lexicography. With the advent of evidence-based lexicography and the availability of large text corpora, computational methods of extracting typical co-occurrences from such corpora and supporting lexicographers in identifying collocations among them became a research focus. The statistical properties of collocations in particular (i.e. application...
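A classic instance of the statistical approach mentioned above is ranking co-occurrence pairs by an association measure such as pointwise mutual information (PMI). The sketch below uses invented counts; it illustrates the general technique, not any specific system from the abstract.

```python
# Minimal sketch of a standard association measure (PMI) used to rank
# co-occurrences as collocation candidates; the counts are invented.
import math

def pmi(pair_count, w1_count, w2_count, corpus_size):
    """PMI = log2( P(w1,w2) / (P(w1) * P(w2)) )."""
    p_pair = pair_count / corpus_size
    p_w1 = w1_count / corpus_size
    p_w2 = w2_count / corpus_size
    return math.log2(p_pair / (p_w1 * p_w2))

# A pair like "strong tea" co-occurs far more often than chance -> high PMI.
score = pmi(pair_count=30, w1_count=1000, w2_count=300, corpus_size=1_000_000)
print(round(score, 2))  # → 6.64
```

Other common measures (log-likelihood ratio, t-score, logDice) follow the same pattern of comparing observed co-occurrence frequency with what independence would predict.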
ONLINE PRESENTATION
Technology has largely affected the way language learners seek information. Digital formats virtually superseded the paper dictionary (Ptasznik, Wolfer and Lew, 2024), online translators gained much importance (O’Neill, 2019), and web browsers became the first port of call (Kosem et al., 2019). Obviously, generative AI systems imitating human-like communication mark...
This paper presents two tasks involving large language models (LLMs)—Gemini-2.0-flash and GPT-4o—used to generate distractors (i.e., incorrect options) for synonym and collocation questions in a language game. The lexical data for both tasks was sourced from the Digital Dictionary Database of Slovene (DDDS). Prompts were initially tested on a sample dataset with both models, and the...
Language corpora have long been used in linguistics and lexicography, but recent developments now allow large language models (LLMs) to support or even transform these fields. This study investigates the potential of LLMs for annotating informal language use in Estonian – a language underrepresented in LLM training data yet supported by a large corpus. Focusing on the informal register label...
Studies comparing dictionary entries generated with AI with those of well-established dictionaries edited by lexicographers show that LLMs tend to perform better in some tasks (e.g. writing definitions) than in others, such as word-sense disambiguation (e.g. Nichols 2023; Lew 2023; Jakubíček & Rundell 2023; Rees & Lew 2024). One of the problems resulting from the latter is that of “false...
Automation has revolutionised lexicography, introducing the ‘post-editing lexicography’ model, where the role of the lexicographer involves refining automatically generated dictionary drafts. Since the launch of ChatGPT in November 2022, numerous papers have explored the potential applications of LLMs in dictionary production. The rapid evolution of LLMs necessitates a re-evaluation of...
ONLINE PRESENTATION
In this presentation we describe the DICI-A (Dizionario delle collocazioni italiane per apprendenti), a new learner dictionary of Italian collocations.
The DICI-A includes ca. 11,000 collocations belonging to six syntactic relations: i. Verb + Direct object (mantenere una promessa, ‘to keep a promise’); ii. Adjective + Noun/Noun + Adjective, where the adjective is a...
ONLINE PRESENTATION
Taboo-language resources remain scarce for under-resourced languages like Afrikaans – despite their clear relevance for natural language processing (NLP) and applications in artificial intelligence (AI). Although Afrikaans has a long-standing lexicographic tradition, it still lacks an open-access reusable lexical database for the taboo language. One of the most crucial...
While CEFR-aligned vocabulary profiles have been developed for many languages (e.g., English, German, and Swedish), Ukrainian as a foreign language (UFL) still lacks an empirically grounded lexical profile. A foundational issue in creating such profiles is combining lexical frequency data with expert knowledge to assign CEFR-level labels. Existing UFL word lists rely primarily on professional...
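The foundational issue named here, combining frequency data with expert knowledge, can be sketched as a provisional frequency-based band that an expert label may override. The rank thresholds and function name below are invented assumptions, not part of the UFL project's actual methodology.

```python
# Hypothetical sketch: a frequency rank suggests a provisional CEFR
# band, which an expert judgment can confirm or override.
# Thresholds are invented for illustration.
def provisional_cefr(rank, expert_label=None):
    if expert_label:              # expert knowledge takes precedence
        return expert_label
    bands = [(1000, "A1"), (2000, "A2"), (4000, "B1"),
             (8000, "B2"), (16000, "C1")]
    for cutoff, level in bands:
        if rank <= cutoff:
            return level
    return "C2"

print(provisional_cefr(350))        # high-frequency word -> lowest band
print(provisional_cefr(350, "B1"))  # expert override wins
```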
While the move to the digital design of lexical resources has, in principle, enhanced the physical and sensory accessibility of dictionaries, a lack of adherence to accessibility standards such as WCAG 2 (Web Content Accessibility Guidelines) (Campbell et al. 2023) can introduce significant barriers (NCD 2006; Botelho 2021). These barriers often hinder access to the information and...
ONLINE PRESENTATION
Taboo words present a challenge for a lexicographer to include and describe in a language resource, as they are forms of verbal violence. However, discarding offensive words from general-purpose lexicographic wordlists disregards the representation of an integral part of the mental lexicon. The present study aims at using lexicographic scenarios to jailbreak four GPT...
ONLINE PRESENTATION
This paper presents a corpus-based approach to compiling a bilingual Megrelian-English online dictionary. The Megrelian language belongs to the UNESCO Atlas of the World’s Languages in Danger group of “increasingly endangered” languages, and faces a number of critical challenges, among them a lack of standardised resources, weakened intergenerational transmission, and minimal...
This paper presents a long-term privately-funded programme focused on collecting timestamped monitor corpora in a wide range of (currently 25) languages. These corpora are primarily designed for researching linguistic trends (including neology) and language change over time. They are available through the Sketch Engine platform and vary significantly in size — from 3 million tokens for...
In preparing phraseological units for the third edition of the Standard Slovenian Dictionary (eSSKJ), the authors aimed to identify the most relevant comparative phrasemes in the contemporary standard language using objective corpus-based criteria. A key goal was to determine how representative specific phrasemes and their variants are in actual use. Two lists of the hundred most frequent...
This paper investigates the potential of LLMs in supporting lexicographic work on non-standard linguistic varieties using data from the Dictionary of Bavarian Dialects in Austria (WBÖ). Based on approx. 2.4 million digitized and TEI-encoded dialect paper slips published via the Lexical Information System Austria (LIÖ), we construct a domain-specific corpus and evaluate LLMs in semantic...
This contribution focuses on the methodological aspects of the ICoMuTe project aiming to design a corpus-based multilingual terminology database for Intercultural Communication (ICC). The project seeks to explore how ICC terms relate to each other within six European languages (Dutch, English, German, French, Italian, Spanish), how these terms are connected to their scientific and cultural...
This paper presents an experimental workflow for converting legacy digitized dictionaries into the DMLex standard and subsequently importing them into a Wikibase instance. DMLex, a serialization-independent model developed by the OASIS LEXIDMA Technical Committee, aims to provide a universal and modular representation of lexicographic data. The study tested whether dictionaries from...
Corpus-based conceptual analysis for the Humanitarian Encyclopedia (HE) grapples with vast amounts of lexical data to describe the meaning of key humanitarian notions and detect conceptual variation among actors (Odlum & Chambó, 2022). By building on Frame-based Terminology (Faber, 2015, 2022), the HE is incorporating qualitative methods necessary to subsume lexical data into manageable...
This paper evaluates the results of using GPT-4o mini language model batch processing with image recognition capability to align 1,572 images of 398 polysemous nouns in the Dictionary of the Slovenian Standard Language (second edition) to their specific dictionary senses, and it compares them to the results of the manual image-to-sense alignment process. The images were manually assigned to...
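Comparing the model's image-to-sense alignment with the manual one reduces, at its simplest, to per-image agreement against the manual gold standard. The sketch below uses invented image IDs and sense labels; the evaluation in the paper may well use finer-grained metrics.

```python
# Sketch: score an automatic image-to-sense alignment against the
# manual (gold) alignment. Image IDs and sense labels are invented.
def alignment_accuracy(auto, gold):
    """Fraction of images whose auto-assigned sense matches the manual one."""
    matches = sum(1 for img, sense in gold.items() if auto.get(img) == sense)
    return matches / len(gold)

gold = {"img1": "sense1", "img2": "sense2", "img3": "sense1", "img4": "sense3"}
auto = {"img1": "sense1", "img2": "sense1", "img3": "sense1", "img4": "sense3"}
print(alignment_accuracy(auto, gold))  # → 0.75
```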
This paper explores the theory of measuring vocabulary size, including the various methods that can be used and the parameters that have to be set. We have examined the experiments carried out on English and Dutch. Goulden et al. (1990) claim the average native speaker knows about 17,000 English base words (non-derived words). Keuleers et al. (2015) and Brysbaert et al. (2016) claim the...
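The standard sampling logic behind such estimates is to test a random sample of base words from a word list and scale the proportion known up to the whole list. A minimal sketch with invented numbers (the real studies control for sampling bias and guessing, which this omits):

```python
# Sketch of the sampling estimate: proportion of sampled base words
# known, scaled to the full word list. Numbers are invented.
def estimate_vocab_size(dictionary_size, sample_size, known_in_sample):
    """Known proportion in the sample, scaled to the whole word list."""
    return round(dictionary_size * known_in_sample / sample_size)

# e.g. 170 of 200 sampled words known, from a 20,000-word base list:
print(estimate_vocab_size(20_000, 200, 170))  # → 17000
```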
We present a collection of monolingual text corpora derived from the steno protocols of 30 parliamentary chambers across 22 EU member states, covering 20 languages. The corpora are continuously and automatically updated, enabling intralingual and cross-lingual analysis of parliamentary discussions. Each chamber’s protocols are regularly downloaded, processed, and transformed into a unified...
Representative monitor corpora with detailed metadata offer a solid empirical basis for documenting lexical innovation and change (Kosem et al. 2021). However, continuously updated time-stamped textual data presents challenges for data management, lexicographic analysis, and visualization. Building on its existing corpus infrastructure, the Dutch Language Institute (INT) has developed...