Following the release of ChatGPT at the end of 2022, the past year has been largely dominated by the advent of Large Language Models (LLMs) and their potential benefits and risks. The lexicographic community, for example at the Asialex and eLex conferences, has focused mainly on whether and how LLMs can help (or replace) dictionary compilation, and on the threats they pose to the future of lexicography (e.g., de Schryver, 2023; Jakubíček & Rundell, 2023; Nichols, 2023; McKean & Fitzgerald, forthcoming; Rundell, forthcoming). Relatively little attention, on the other hand, has been devoted to whether and how lexicography can contribute to enhancing LLMs (e.g., Kernerman, Asialex 2023). This paper will review both trends and propose a more active role for the latter, i.e., the lexicographic enhancement of LLMs.
The research carried out by the first group of authors mentioned above has mostly revolved around testing LLM performance on typical lexicographic tasks and evaluating the results against professional human input. These experiments have included suggesting headwords, multiword expressions and inflected forms; sense disambiguation; creating (and recreating) definitions; generating usage examples; and providing citations, labels and pronunciation. The overall conclusion shared by those authors is that the (current) quality of the output is “not up to the standard of human editorial work”, as summarized by McKean & Fitzgerald (forthcoming), who also pointed out the flaws and weaknesses of LLMs in general, and recommended implementing a research program for “developing and testing prompts for common tasks, build an evaluation set to judge outcomes (which might also be useful to judge the output of human editors), and test new models and tools as they become available.”
However, the general concerns about LLMs span a wide range of serious ethical, technical, legal, environmental, and economic issues. To begin with, most of the massive amounts of data needed for language model training stem from web-crawled corpora that are often afflicted by diverse flaws, including inconsistency, bias, “noise” requiring cleanup, and unknown provenance and usage licenses. (Multilingual LLMs, in particular, also risk incorporating data that was generated automatically by machine translation engines and never properly post-edited.) For languages other than English and a few other major ones, relevant data may be scarce. The training process itself is very costly in terms of expertise, time and GPU consumption (which also carries a growing environmental toll). The resulting models require comprehensive post-editing and fine-tuning, yet their outputs remain liable to “hallucinations” that amount to deceptive fluency. As a result, the output can usually not be used for professional purposes, owing to unreliable performance and potential copyright infringement.
On the other hand, quality lexicographic resources offer systematic, high-value and trustworthy linguistic data that can substantially enhance language model development: injecting greater precision, efficiency and reliability, reducing the need for masses of data and for substantial post-editing, and contributing to quality evaluation as well as to savings on all fronts. Detailed components such as sense division, multiword expressions, definitions, usage examples, register and domain classification, syntactic patterns, semantic labels and grammatical categorization depict language meticulously and faithfully, while accurate translation equivalents empower multilinguality and cross-lingual linkage.
For example, the principal concepts introduced by the pioneers of the English learner’s dictionary in the 1930s (cf. Cowie, 1999), and by their second wave since the mid-1980s (cf. Hanks, 2012; Adamska-Sałaciak & Kernerman, 2016), embody many precious linguistic insights that LLMs now attempt to attain with the assistance of enormous amounts of problematic data: close attention to phraseology (multiword expressions), usage examples illustrating typical linguistic patterns, word senses ordered by importance and frequency rather than chronologically, indications of domain, register, synonyms, antonyms and named entities, and corpus-based analysis. Such “ready-made” elements offer invaluable support and facilitation for diverse LLM training, fine-tuning and benchmarking tasks.
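As a minimal sketch of how such “ready-made” elements might feed LLM fine-tuning, the following Python snippet converts a structured dictionary entry, with frequency-ordered senses, definitions, usage examples and register labels, into prompt/response records of the kind commonly used for instruction tuning. The entry schema, field names and helper function here are hypothetical illustrations, not any particular dictionary’s actual format.

```python
import json

# Hypothetical lexicographic entry: senses ordered by frequency,
# each with a definition, a usage example, and a register label.
entry = {
    "headword": "run",
    "senses": [
        {"definition": "to move quickly on foot",
         "example": "She runs every morning.",
         "register": "neutral"},
        {"definition": "to manage or operate something",
         "example": "He runs a small bakery.",
         "register": "neutral"},
    ],
    "multiword_expressions": ["run out of", "in the long run"],
}

def entry_to_records(entry):
    """Convert one entry into prompt/response pairs for fine-tuning."""
    records = []
    for i, sense in enumerate(entry["senses"], start=1):
        records.append({
            "prompt": f"Define sense {i} of '{entry['headword']}' "
                      f"and give a usage example.",
            "response": f"{sense['definition']}. "
                        f"Example: {sense['example']}",
        })
    return records

# Emit one JSON record per line (JSONL), a common fine-tuning format.
for rec in entry_to_records(entry):
    print(json.dumps(rec))
```

The same mapping could be extended to multiword expressions, register labels or translation equivalents, turning each curated lexicographic component into targeted training or benchmark items rather than relying solely on noisy web-crawled text.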
Lexicography can therefore play a vital role in reinforcing both LLM performance and users’ trust in LLMs. We will demonstrate how lexicography in the service of LLMs can complement LLMs for lexicography.