Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

The lemma dilemma, Slovene version

Nov 19, 2025, 10:05 AM
25m
Arnold hall

Speakers

Polona Gantar, Cyprian Laskowski, Simon Krek

Description

A long-standing issue in lexicography is the nature of its core element of description, variously referred to as the headword (in DMLex and traditional lexicography), canonical form (in OntoLex and the Lexical Markup Framework, LMF), orthographic form (in the Text Encoding Initiative, TEI Lex0), lemma (in Wikidata), or lexical unit. With the transition from paper to digital environments, both the nature of this element and its description have evolved. At the heart of the “lemma dilemma” lies the relationship between form (particularly in logographic writing systems) and sense, that is, the (description of a) concept intended to be meaningful to humans.

In this paper, we describe how the headword/lemma phenomenon is addressed in the Digital Dictionary Database for Slovene (DDDS). The DDDS includes two types of lexical units: concepts and named entities. The latter are defined lexicographically in the same manner as concepts and are included in the DDDS due to the need to provide information on inflection, pronunciation, normative status, or other linguistic factors.

Lexical units are mechanically divided into single-lexeme units and multiword expressions (MWEs), based on whether they are written as one word or as several words in the Slovene writing system. Typologically, MWEs (excluding multiword named entities) are further divided into compounds and phrases.

The ultimate goal of the DDDS is to compile all types of information about the Slovene lexicon in a single database with a unified data model. Like other Slavic languages, Slovene has a very rich morphology, which often presents a dilemma for lexicographers when choosing the most appropriate word form to represent a concept—i.e., the headword. The DDDS includes a vast number of word forms with morphological data, including pronunciation and stress. Currently, this number stands at 9,312,865.

In the data model, a collection of morphologically linked word forms is defined as a LEXEME. According to this principle, a typical Slovene noun (associated with a unique LEXEME ID) includes 18 word forms, combining three grammatical numbers (singular, dual, plural) and six grammatical cases (nominative, genitive, dative, accusative, locative, instrumental).
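
The count of 18 follows from crossing the three numbers with the six cases. The short sketch below only enumerates those paradigm slots; the names `NUMBERS`, `CASES`, and `noun_paradigm_slots` are illustrative assumptions and are not taken from the DDDS data model.

```python
from itertools import product

# Grammatical categories as listed in the abstract.
NUMBERS = ("singular", "dual", "plural")
CASES = ("nominative", "genitive", "dative",
         "accusative", "locative", "instrumental")


def noun_paradigm_slots():
    """Return every (number, case) slot of a full noun paradigm."""
    return list(product(NUMBERS, CASES))


slots = noun_paradigm_slots()
print(len(slots))  # 3 numbers x 6 cases = 18
```

Each slot corresponds to one word form of a typical noun LEXEME; nouns with defective or variant paradigms would deviate from this count.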

As of now, the DDDS contains 395,613 lexemes. When forming a LEXICAL UNIT—which adds the conceptual or semantic layer of description—one word form must be selected to represent the lexical unit. This selected form is traditionally considered the headword, canonical form, or lemma. Consequently, the same LEXEME ID can be used for multiple LEXICAL UNITS, even if different word forms serve as the "headword" for each.

A practical example of this situation is a singular–plural noun pair where the same LEXEME ID and two different word forms are used as headwords to define two distinct concepts: "jajce" (Eng. egg, nominative singular) and "jajca" (Eng. testicles, nominative plural).
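
The relationship between a LEXEME and its LEXICAL UNITS can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the DDDS schema: the class names, field names, and the ID value `1` are hypothetical, and only the two word forms mentioned above are included.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WordForm:
    text: str
    number: str
    case: str


@dataclass
class Lexeme:
    """A collection of morphologically linked word forms."""
    lexeme_id: int  # hypothetical identifier
    forms: list = field(default_factory=list)

    def form(self, number, case):
        # Look up the word form that fills a given paradigm slot.
        for f in self.forms:
            if f.number == number and f.case == case:
                return f
        raise KeyError((number, case))


@dataclass
class LexicalUnit:
    """Adds the semantic layer: one form of a lexeme is chosen as headword."""
    lexeme: Lexeme
    headword: WordForm
    gloss: str

    def __post_init__(self):
        # The headword must be one of the lexeme's own word forms.
        if self.headword not in self.lexeme.forms:
            raise ValueError("headword must belong to the lexeme")


# Only the two forms cited in the abstract; the ID is made up.
jajce = Lexeme(lexeme_id=1, forms=[
    WordForm("jajce", "singular", "nominative"),
    WordForm("jajca", "plural", "nominative"),
])

egg = LexicalUnit(jajce, jajce.form("singular", "nominative"), "egg")
testicles = LexicalUnit(jajce, jajce.form("plural", "nominative"), "testicles")

# Same LEXEME ID, different headwords.
print(egg.lexeme.lexeme_id == testicles.lexeme.lexeme_id)
print(egg.headword.text, testicles.headword.text)
```

In this sketch the headword is a reference to a word form rather than a separate string, so inflection and pronunciation data attached to the lexeme would be shared by both lexical units.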

In the paper, we will provide a more detailed explanation of these principles, supported by additional examples.
