Description
In our poster presentation, we will present the results of an experiment that tests the potential of large language models (LLMs) in the semantic analysis of Estonian. We will focus on LLMs' ability to analyse polysemy and create definitions. In 2024, the Institute of the Estonian Language started a new project exploring how LLMs, such as GPT, can help with the presentation of dictionary information.

The representation of polysemy in dictionaries remains one of the most difficult tasks lexicographers are confronted with, even with the availability of modern automatic tools, e.g., Sketch Engine (Kilgarriff et al., 2014). As Adam Kilgarriff (1992) has summarised, there is no consensus among researchers on what constitutes a word sense, how broad or narrow senses should be, or how to determine where one sense ends and another begins. Lexicographers nevertheless try to describe the meanings of a word, although for a highly polysemous word, every lexicographer would probably produce a different sense division, even if the corpus data and the corpus tools used were the same. As the word senses in a dictionary are abstractions and generalisations, it is difficult to establish an objective gold standard for this task.

We will develop a lexicography-specific methodology to evaluate the output of LLMs. First, we will compile a sample that is representative of the relevant units of the Estonian language (including words with different numbers of senses in dictionaries) and prompt LLMs with the related tasks; a minimal sketch of such a prompting step, together with one possible way of aggregating expert judgements, is given at the end of this description. Experts will be included in the evaluation.

LLMs have not yet been systematically included in dictionary work in Estonia. Worldwide, there already exist dictionary systems, e.g., TLex (Joffe et al., 2003), that have artificial intelligence built in and enable lexicographers to compile entries using AI. Since 2019, the Institute of the Estonian Language has been aggregating language data into a single dictionary and terminology database called Ekilex (Tavast et al., 2018). This open-source database would also allow for integration with AI and NLP. As a result of the project, we plan to innovate compilation methods by utilising advanced technology and to create language datasets for lexicographers.

For English, LLMs have been tested on their ability to generate different lexicographic macro- and microstructural components (e.g., Jakubíček & Rundell, 2023). The best results have been achieved using ChatGPT to generate English definitions (de Schryver & Joffe, 2023; Lew, 2023). Therefore, this is one of the tasks we are interested in testing for Estonian as well. Some of the LLMs on the market support Estonian, e.g., GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023). These models are trained on vast datasets collected from the internet, but the inclusion of Estonian texts is not a deliberate effort; rather, it is a byproduct of the data collection. However, the effectiveness of these models in solving lexicography-related tasks, and for the Estonian language in general, has yet to be determined. Our work will contribute to this area by providing results for the following models: GPT-4, GPT-4o and Gemini 1.0 Ultra. The aim is to test these models without adjustments, but depending on the results, further fine-tuning might be needed in future research.
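To illustrate the kind of prompting step the experiment involves, the following minimal Python sketch queries GPT-4o for a sense division and definitions of a single Estonian word via the OpenAI chat completions API. The prompt wording and the example word ("keel", which can mean both "language" and "tongue") are our own illustrative assumptions, not the project's actual experimental protocol.

# Minimal sketch: prompting an LLM for the sense division of an Estonian word.
# Prompt wording and the example word are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

word = "keel"  # Estonian: 'language' or 'tongue', among other senses

prompt = (
    f"Sa oled leksikograaf. Loetle eesti keele sõna '{word}' eri tähendused "
    "ja kirjuta igaühele lühike definitsioon."
)
# In English: "You are a lexicographer. List the distinct senses of the
# Estonian word '<word>' and write a short definition for each."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output eases comparison across models
)

print(response.choices[0].message.content)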
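Since the evaluation methodology is still being developed, the following is only a hedged sketch of one conceivable step: experts rate each model's sense division for a word on a numeric scale, and the ratings are aggregated per model. The rating scale, data layout, and figures are hypothetical.

# Hypothetical sketch: aggregating expert ratings of LLM-generated sense
# divisions. Scale (1-5), data, and expert labels are invented for illustration.
from statistics import mean, stdev

ratings = {
    "GPT-4":            {"expert_a": 4, "expert_b": 5, "expert_c": 4},
    "GPT-4o":           {"expert_a": 5, "expert_b": 4, "expert_c": 5},
    "Gemini 1.0 Ultra": {"expert_a": 3, "expert_b": 4, "expert_c": 3},
}

for model, scores in ratings.items():
    values = list(scores.values())
    print(f"{model}: mean={mean(values):.2f}, sd={stdev(values):.2f}")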