Conveners
Parallel sessions 1 (Arnold hall)
- Tanara Zingano Kuhn
Parallel sessions 1 (Arnold hall)
- Jaka Čibej
Parallel sessions 1 (Arnold hall)
- Kristina Kocijan
Parallel sessions 1 (Arnold hall)
- Kristina Kocijan
Parallel sessions 1 (Arnold hall)
- Ivana Filipović Petrović
Parallel sessions 1 (Arnold hall)
- Kristina Koppel
Parallel sessions 1 (Arnold hall)
- Carole Tiberius
Parallel sessions 1 (Arnold hall)
- Jelena Kallas
The public release of ChatGPT in late 2022 made an impact on many professional domains. Notwithstanding the many controversies surrounding Generative Artificial Intelligence (GenAI), such as ethics, copyright, accountability, or ecology, we need to acknowledge an important and relevant feature of Large Language Models and chatbot systems built around them: their ability to produce mostly...
Recent findings indicate that current large language models (LLMs) face difficulties in generating clear-cut, well-motivated definitions in a consistent way. This shortcoming is the consequence of their reliance on opaque data sources and their inherently unstable, non-deterministic outputs. In response, this research aims to develop an LLM-based methodology for producing adjectival...
Finding non-recorded senses is important for dictionary maintenance, where using automatic methods helps reduce manual efforts. We use automatic Word Sense Induction (WSI) to compare recorded sense numbers among a sample of headwords in a comprehensive Swedish monolingual dictionary with induced sense numbers for the same words in a Swedish corpus. We propose this as a simple technique to find...
This paper introduces the Dictionary of Contemporary Serbian Language (RSSJ), an ongoing large-scale digital lexicographic project designed to serve both human users via web and mobile applications and machines through APIs. Coordinated by the diaspora association “Gathered around the Language” and the Society for Language Resources and Technologies (JeRTeh), RSSJ aims to produce a dictionary...
Large Language Models (LLMs) tend to expose severe language and cultural biases when working in medium- and low-resourced languages. In this paper, we present our work on Danish benchmarking and evaluation of LLMs to more precisely diagnose and potentially remedy such bias. To this aim, we apply available lexical-semantic resources to compile a set of Natural Language Understanding (NLU) tasks...
The use of corpora is well established in lexicography, also in Estonia, but since the analysis of corpus data and the post-editing of automatically generated data from the corpus is labour-intensive, the use of large language models (LLMs) has led to growing interest in lexicography (e.g., Evert et al. 2024; Kosem, Gantar et al. 2024; Tiberius et al. 2024). In 2024, the Institute of the...
In lexicography, one of the long-standing issues is understanding the nature of its core element of description commonly referred to as the headword (in DMLex and traditional lexicography), canonical form (in OntoLex and the Lexical Markup Framework – LMF), orthographic form (in the Text Encoding Initiative – TEI Lex0), lemma (in Wikidata), or lexical unit. With the transition from paper to...
This paper presents the Oxford English Dictionary’s (OED) current exploration into the application of artificial intelligence to historical Word Sense Disambiguation (WSD), a fundamental aspect of OED’s core research. Building on a longstanding tradition of technological innovation, the OED is investigating how Large Language Models (LLMs) can support the identification and retrieval of...
Generic nouns such as Sache and Ding pose a challenge for semantic annotation due to their referential underspecification and context-dependent meaning. Although frequently classified under categories like {artefact} or {object}, their actual referents often belong to abstract or cognitive domains, as in Der Placeboeffekt ist eines der faszinierendsten Dinge in der Welt der Medizin. Drawing on...
Collocations are a well-covered research area in lexicography. With the advent of evidence-based lexicography and the availability of large text corpora, computational methods of extracting typical co-occurrences from such corpora and supporting lexicographers in identifying collocations among them became a research focus. Especially the statistical properties of collocations (i.e. application...
Language corpora have long been used in linguistics and lexicography, but recent developments now allow large language models (LLMs) to support or even transform these fields. This study investigates the potential of LLMs for annotating informal language use in Estonian – a language underrepresented in LLM training data yet supported by a large corpus. Focusing on the informal register label...
Automation has revolutionised lexicography, introducing the “post-editing lexicography” model, where the role of the lexicographer involves refining automatically generated dictionary drafts. Since the launch of ChatGPT in November 2022, numerous papers have explored the potential applications of LLMs in dictionary production. The rapid evolution of LLMs necessitates a re-evaluation of...
While CEFR-aligned vocabulary profiles have been developed for many languages (e.g., English, German, and Swedish), Ukrainian as a foreign language (UFL) still lacks an empirically grounded lexical profile. A foundational issue in creating such profiles is combining lexical frequency data with expert knowledge to assign CEFR-level labels. Existing UFL word lists rely primarily on professional...
In preparing phraseological units for the third edition of the Standard Slovenian Dictionary (eSSKJ), the authors aimed to identify the most relevant comparative phrasemes in the contemporary standard language using objective corpus-based criteria. A key goal was to determine how representative specific phrasemes and their variants are in actual use. Two lists of the hundred most frequent...
This paper presents an experimental workflow for converting legacy digitized dictionaries into the DMLex standard and subsequently importing them into a Wikibase instance. DMLex, a serialization-independent model developed by the OASIS LEXIDMA Technical Committee, aims to provide a universal and modular representation of lexicographic data. The study tested whether dictionaries from...
This paper evaluates the results of using GPT-4o mini language model batch processing with image recognition capability to align 1,572 images of 398 polysemous nouns in the Dictionary of the Slovenian Standard Language (second edition) to their specific dictionary senses, and it compares them to the results of the manual image-to-sense alignment process. The images were manually assigned to...
Representative monitor corpora with detailed metadata offer a solid empirical basis for documenting lexical innovation and change (Kosem et al. 2021). However, continuously updated time-stamped textual data presents challenges for data management, lexicographic analysis, and visualization. Building on its existing corpus infrastructure, the Dutch Language Institute (INT) has developed...