Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

We need to talk about data structures in lexicography

Nov 20, 2025, 9:00 AM
1h
Arnold hall

Speaker

Michal Měchura

Description

It has been almost half a century since we started “doing” lexicography on computers. Let’s stop for a minute now and take a critical look at the data models we have been using to represent the structure of dictionaries in dictionary writing systems and other software.

In this talk, I will trace the history of lexicographic data modelling from its beginnings as text markup for retro-digitised dictionaries, to the present day when most dictionaries are born-digital. I will show that, regardless of which notation we use (XML, JSON or other), the underlying design pattern is almost always a tree structure in which the various content items (headwords, senses, definitions…) are arranged in a parent-child hierarchy.
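As a minimal illustration of this pattern, a tree-structured entry can be sketched in Python as nested dictionaries and lists, where every content item has exactly one parent. The entry, its field names and the `walk` helper are hypothetical and are not taken from any real dictionary or dictionary writing system.

```python
# Hypothetical entry for illustration only; the field names are assumptions.
entry = {
    "headword": "bank",
    "senses": [
        {
            "definition": "a financial institution",
            "examples": ["She works at a bank."],
        },
        {
            "definition": "the land alongside a river",
            "examples": [],
        },
    ],
}


def walk(node, path="entry"):
    """Yield (path, value) for every leaf; the path is the item's single chain of parents."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, f"{path}[{index}]")
    else:
        yield path, node


leaves = list(walk(entry))
for path, value in leaves:
    print(path, "=", value)
```

Each leaf is reachable by exactly one path from the root, which is what makes the structure a tree. The same shape could equally be written in XML or JSON.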

I will argue that the tree-structured pattern is not expressive enough to handle some phenomena that occur in dictionaries, such as entry-to-entry cross-references, the placement of multiword subentries, and complex hierarchies of subsenses. These phenomena would be easier to manage in a graph-based data structure, such as a relational database or a Semantic Web-style knowledge graph.
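To make the contrast concrete, here is a rough sketch of a graph-style alternative, assuming entries are stored flat and connected by typed relations, much like rows in a relational link table. The entry IDs, the relation kinds (`subentry_of`, `see_also`) and the helper function are all hypothetical. The point is only that a multiword item such as "black hole" can be attached to both "black" and "hole", which a strict parent-child tree cannot express without duplicating it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    kind: str    # hypothetical relation type, e.g. "subentry_of" or "see_also"
    source: str  # ID of the entry the relation starts from
    target: str  # ID of the entry the relation points to


# Entries are stored flat, keyed by ID, rather than nested inside one another.
entries = {
    "black": {"headword": "black"},
    "hole": {"headword": "hole"},
    "black-hole": {"headword": "black hole"},
}

# Relations live outside the entries, so one entry can have several "parents".
relations = [
    Relation("subentry_of", "black-hole", "black"),
    Relation("subentry_of", "black-hole", "hole"),
    Relation("see_also", "black-hole", "hole"),
]


def subentries_of(entry_id):
    """Return the headwords of all entries linked to entry_id as subentries."""
    return [
        entries[rel.source]["headword"]
        for rel in relations
        if rel.kind == "subentry_of" and rel.target == entry_id
    ]


print(subentries_of("black"))
print(subentries_of("hole"))
```

Because the multiword entry exists only once, editing it in one place updates it wherever it is displayed, and cross-references become ordinary relations that can be checked for dangling targets.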

Dictionary projects which insist on a purely tree-structured data model are failing to make full use of the digital medium. But upgrading to a graph-based data model is difficult because tree-structured thinking is entrenched in the minds of lexicographers and dictionary users alike. This talk will conclude with an introduction to DMLex, a recently standardised “Data Model for Lexicography” which aims to ease this transition by being a hybrid model, combining tree structures where possible with graph structures where necessary.
