Description
While there have been a number of projects focusing on early medieval Irish lexicography (Griffith et al., 2018), few have aspired to work towards comprehensive interlinking of textual and lexical resources. This is at least in part due to the morphological complexity and variation in Early Irish (c. 600–1200 CE), compounded by the absence of an orthographic standard (Stifter, 2009). The resulting lemmatic variation in legacy resources (compare Early Irish deponent molaithir/molaidir [ˈmoləðʲərʲ] against active molaid [ˈmoləðʲ] ‘praises’) poses substantial challenges for the effective deployment of currently available lexical resources and justifies a unified collection of canonical forms to interconnect resources. Such a collection serves scholars of Early Irish in the first instance, but also benefits, through potential future interlinking of Early Irish and modern-language resources, scholars of more contemporary stages of the language who are interested in diachronic change and etymology.
This paper focuses on the development of a Lemma Bank for Old Irish (c. 600–900 CE) as part of the MSCA-funded MOLOR project (Morphologically Linked Old Irish Resource), which aims to interlink lexical resources for this language period, including the novel lexical resource Goidelex (Anderson et al., 2024) and an inflected lexicon whose paradigms are generated from lemmas in Goidelex. Methodologically, the current work takes inspiration from the project LiLa: Linking Latin (2018–2023), whose objective was to interconnect distributed (lexical and textual) resources and NLP tools for Latin using the Linked Data paradigm: the use of shared ontologies, data categories, communication protocols and technologies such as the Resource Description Framework (RDF) (Wood et al., 2014) to enable federated, semantic querying over heterogeneous resources, ultimately resulting in what Tim Berners-Lee has called the Semantic Web (Berners-Lee et al., 2001). The adoption of the Linked Data paradigm automatically ensures adherence to the so-called FAIR principles of data management (Wilkinson et al., 2016). As part of the LiLa project, a Lemma Bank was developed, which was conceived to interlink lexical resources and texts for Latin (Passarotti et al., 2020).
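To make the interlinking idea concrete, the following is a minimal sketch of how separate resources can point at one shared lemma node and then be queried together. It uses a plain in-memory list of triples rather than an RDF store. The properties `ontolex:canonicalForm` and `ontolex:writtenRep` come from the OntoLex model; every `ex:` identifier, including `ex:hasLemma` and the resource names, is hypothetical and not taken from MOLOR or LiLa.

```python
# Toy triple store: each triple is (subject, predicate, object).
# All ex: identifiers are hypothetical illustrations, not real MOLOR/LiLa URIs.
TRIPLES = [
    ("ex:lemma/molaithir", "ontolex:writtenRep", "molaithir"),
    ("ex:dictA/entry1", "ontolex:canonicalForm", "ex:lemma/molaithir"),
    ("ex:lexiconB/entry7", "ontolex:canonicalForm", "ex:lemma/molaithir"),
    ("ex:corpusC/token42", "ex:hasLemma", "ex:lemma/molaithir"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def resources_for(written_rep, triples=TRIPLES):
    """Find every node, in any resource, that points at a lemma with this written form."""
    lemmas = {s for s, _, _ in match(triples, p="ontolex:writtenRep", o=written_rep)}
    return sorted(s for s, _, o in triples if o in lemmas)


if __name__ == "__main__":
    for node in resources_for("molaithir"):
        print(node)
```

Because the dictionary entry, the lexicon entry and the corpus token all reference the same lemma node, a single lookup by written form reaches all three without the resources needing to know about one another.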
The current contribution reports on the design challenges and choices in populating an Old Irish Lemma Bank with canonical forms from legacy and novel resources, striking a balance between linguistic granularity on the one hand and a workable number of lemmas on the other. To this end, it adopts extensions to the OntoLex model (McCrae et al., 2017) made in LiLa to cater for divergent lemmatisation criteria (i.e., different canonical forms for the same lexeme) arising from morphological/inflectional variation. The launch version of the MOLOR Lemma Bank as a Linguistic Linked Open Data resource is expected to contain at least 599 orthographically normalised Old Irish noun lemmas (out of an estimated total of 4000–4500), extracted from Goidelex, whose principled indexing of inflectional (and, hence, lemmatic) variants allowed for straightforward mappings (Fransen et al., 2024). Since the initial focus in Goidelex has been on nouns from one corpus, verb lemmas were instead comprehensively collected from three (less linguistically granular and structured) legacy resources and manually normalised according to inflectional class, resulting in approximately 1300 lemmas. More POS categories will be added in due course.
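The handling of different canonical forms for the same lexeme can be illustrated with a small sketch. It assumes a simplified model in which each lemma carries one or more written representations and may be linked to variant lemmas of the same lexeme. The class names, field names and identifiers below are hypothetical and do not reproduce the actual MOLOR or LiLa data model.

```python
from dataclasses import dataclass, field


@dataclass
class Lemma:
    lemma_id: str
    written_reps: list  # orthographic variants of one canonical form
    pos: str
    variants: set = field(default_factory=set)  # ids of other lemmas for the same lexeme


class LemmaBank:
    def __init__(self):
        self.lemmas = {}   # lemma_id -> Lemma
        self.by_form = {}  # written representation -> set of lemma ids

    def add(self, lemma):
        self.lemmas[lemma.lemma_id] = lemma
        for rep in lemma.written_reps:
            self.by_form.setdefault(rep, set()).add(lemma.lemma_id)

    def link_variants(self, id_a, id_b):
        """Record that two lemmas are alternative canonical forms of one lexeme."""
        self.lemmas[id_a].variants.add(id_b)
        self.lemmas[id_b].variants.add(id_a)

    def resolve(self, form):
        """Return all lemma ids reachable from a cited form, including linked variants."""
        ids = set(self.by_form.get(form, set()))
        for lemma_id in list(ids):
            ids |= self.lemmas[lemma_id].variants
        return sorted(ids)


def build_demo_bank():
    """Build a two-lemma bank from the deponent/active example in the text."""
    bank = LemmaBank()
    bank.add(Lemma("v1", ["molaithir", "molaidir"], "verb"))  # deponent citation form
    bank.add(Lemma("v2", ["molaid"], "verb"))                 # active citation form
    bank.link_variants("v1", "v2")
    return bank


if __name__ == "__main__":
    print(build_demo_bank().resolve("molaid"))
```

In this sketch a resource that cites the verb as molaid and one that cites it as molaidir both resolve to the same pair of lemma ids, so the two resources can be connected even though they follow different lemmatisation conventions.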