Description
This paper presents the Oxford English Dictionary’s (OED) current exploration into the application of artificial intelligence to historical Word Sense Disambiguation (WSD), a fundamental aspect of OED’s core research. Building on a longstanding tradition of technological innovation, the OED is investigating how Large Language Models (LLMs) can support the identification and retrieval of illustrative quotations that accurately reflect word sense usage through time – at present one of the most labour-intensive aspects of entry drafting.
The quotation paragraph in OED entries provides readers with a curated timeline of usage, illustrating the emergence, evolution, and typical contexts of a word sense. Constructing these paragraphs requires editors to search historical corpora and databases for relevant material, disambiguate search results to isolate the targeted sense, then select quotations that are both representative and informative and meet OED’s selection criteria. This task is particularly complex when searching content from earlier time periods, where historical variation in spelling and inflection can further complicate retrieval. Editors currently construct complex iterative search strategies across databases such as Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), and Google Books, often crafting extensive Boolean queries to find relevant material.
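As a rough illustration of the manual approach described above, the sketch below assembles a Boolean search string from a list of spelling variants and a list of collocates. The function name `build_boolean_query`, the generic `AND`/`OR` syntax, and the example terms are assumptions made for illustration; they do not reproduce the query syntax of EEBO, ECCO, or Google Books, nor any actual OED search.

```python
def build_boolean_query(variants, context_terms):
    """Combine spelling variants and collocates into one Boolean search string.

    Hypothetical helper: the AND/OR syntax is generic, not that of any
    particular database.
    """
    def quote(term):
        # Quote multi-word terms so that they are searched as phrases.
        return f'"{term}"' if " " in term else term

    variant_clause = " OR ".join(quote(v) for v in variants)
    if not context_terms:
        return f"({variant_clause})"
    context_clause = " OR ".join(quote(c) for c in context_terms)
    return f"({variant_clause}) AND ({context_clause})"


# Illustrative terms only: variants of a headword plus collocates that
# hint at one particular sense.
query = build_boolean_query(
    ["husband", "husbande", "housbond"],
    ["plough", "tillage"],
)
print(query)
```

Each additional variant or collocate lengthens the query, which suggests why such strings become unwieldy for earlier periods with highly variable spelling.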
To address these challenges, the OED is developing an AI-assisted tool that leverages LLMs to retrieve quotations in specified senses from historical corpora. Rather than relying on manually constructed search strings, the tool allows editors to query the model in natural language, with the LLM returning candidate quotations that match the targeted sense. This approach has the potential to reduce reliance on collocational heuristics and to automate the handling of spelling and inflection variants, thus improving the efficiency and accuracy of quotation retrieval.
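A sense-targeted natural-language query of this kind might be assembled along the following lines. This is a minimal sketch under stated assumptions, not the OED's prompt or tool: the function `build_sense_prompt`, the wording of the instructions, and the example passages (invented modern sentences, not corpus quotations) are all hypothetical, and no particular model or API is assumed.

```python
def build_sense_prompt(headword, definition, passages):
    """Assemble a plain-text prompt asking a model to judge each passage.

    Hypothetical helper: the instruction wording is illustrative only.
    """
    lines = [
        f"Headword: {headword}",
        f"Target sense: {definition}",
        "For each numbered passage, answer YES, NO, or UNCERTAIN: "
        "does it use the headword in the target sense?",
        # Offering an explicit UNCERTAIN option is one possible way to
        # encourage caution in ambiguous cases.
        "Answer UNCERTAIN when the context is ambiguous.",
        "",
    ]
    for number, passage in enumerate(passages, start=1):
        lines.append(f"{number}. {passage}")
    return "\n".join(lines)


# Invented example sentences, used purely to show the prompt layout.
prompt = build_sense_prompt(
    "husband",
    "a person who tills and cultivates the soil",
    [
        "The husband ploughed his field before the frost.",
        "Her husband travelled to London in the spring.",
    ],
)
print(prompt)
```

The resulting string would then be sent to whichever model is under evaluation; how the responses are parsed and checked is left open here.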
The paper outlines the technical components of this initiative, including model selection and evaluation, data formatting and prompt engineering strategies, and the quotation retrieval mechanism. Prototype applications are under development to test these components, primarily using EEBO as a foundational dataset. Initial testing shows promising results, though challenges remain, particularly in mitigating LLM overconfidence and ensuring interpretive caution in ambiguous cases.
In addition to supporting editorial staff, the OED is exploring how this tool can benefit subscribers to OED.com. Survey data from academic users indicates strong interest in expanded access to historical quotations, provided the tool is transparent, trustworthy, and well-cited. The paper gives a preview of how the tool might be accessed online, and discusses how it might grow from a “Minimum Viable Product” to something more powerful, whilst maintaining the distinction between quotations selected by editors and those retrieved automatically by the tool. The paper concludes by reflecting on the broader potential of AI-assisted WSD in digital humanities research and lexicography, and outlines future directions for development, including expanded corpus coverage and enhanced user functionality.