Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

A Corpus-Based Dictionary for the Endangered Megrelian Language

Nov 19, 2025, 5:30 PM
30m
Zrak hall

Speakers

Irina Lobzhanidze, Rusudan Gersamia

Description

ONLINE PRESENTATION

This paper presents a corpus-based approach to compiling a bilingual Megrelian-English online dictionary. Megrelian is listed in the UNESCO Atlas of the World’s Languages in Danger among the “increasingly endangered” languages and faces a number of critical challenges, among them a lack of standardised resources, limited intergenerational transmission, and a minimal digital presence. Unlike widely spoken languages, which are equipped with pretrained models and a range of linguistic tools, “increasingly endangered” languages like Megrelian lack even basic NLP resources such as annotated corpora, PoS taggers, and morphological analysers. Moreover, the complexity of their grammar and phonology requires special approaches that cannot simply be adapted from high-resource languages. To address these gaps, we developed an annotated corpus of contemporary Megrelian consisting of 97,691 tokens and 60,959 types. It is based on data collected through fieldwork in Samegrelo, Georgia, between 2022 and 2025. The process was divided into two main stages: fieldwork conceptualization and data collection, followed by laboratory analysis and data processing.
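
For readers less familiar with corpus statistics, the token and type figures above can be read as follows: tokens are running word forms, types are distinct word forms. The Python sketch below is only an illustration of that counting, not the project's actual pipeline; the function name `token_type_counts` and the placeholder sample are assumptions introduced here, and the sample is not real Megrelian data.

```python
from collections import Counter


def token_type_counts(tokens):
    """Return (number of tokens, number of types) for a list of word forms."""
    counts = Counter(tokens)
    # Tokens: every running word form; types: distinct word forms.
    return sum(counts.values()), len(counts)


# Hypothetical placeholder word forms, not real Megrelian data.
sample = ["a", "b", "a", "c", "b", "a"]
n_tokens, n_types = token_type_counts(sample)
sample_ratio = n_types / n_tokens

# Type-token ratio implied by the figures reported in the abstract.
reported_ratio = 60959 / 97691
```

With the reported figures, the type-token ratio comes to roughly 0.62.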

Two bilingual Megrelian-English dictionaries were developed in parallel, using the same dataset processed in FieldWorks Language Explorer (FLEx, 2024). This approach enabled the integration of corpus annotations into the dictionary entries. Following the principles described in Atkins & Rundell (2008) and Gibbon & Van Eynde (2000), we used lexeme-based and root-based configurations, resulting in two online dictionaries. The first dictionary is oriented toward the translation of individual words, while the second focuses on the translation of individual morphemes. In the first case, each lexical entry is supported by morphosyntactic information, phonetic transcription (IPA), glosses, and semantic descriptions. In the second case, the entries represent individual morphemes, providing not only glosses but also information about their occurrences and links to their use in the corpus. The finalised data is available at https://xmf.iliauni.edu.ge/.
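
The difference between word-oriented and morpheme-oriented entries, and the way each can point back to the corpus, can be sketched in Python. This is a minimal sketch under stated assumptions, not the FLEx export format or the authors' implementation: the classes `Morph` and `AnnotatedToken`, the function `build_indexes`, and the placeholder forms and glosses are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    form: str   # morpheme form (hypothetical placeholder)
    gloss: str  # morpheme gloss


@dataclass(frozen=True)
class AnnotatedToken:
    sentence_id: str  # identifier of the corpus sentence containing the word
    word: str         # word form as it occurs in the corpus
    gloss: str        # word-level gloss
    morphs: tuple     # Morph objects making up the word


def build_indexes(tokens):
    """Build a word-level index and a morpheme-level index.

    Both indexes map an entry key to the corpus sentences it occurs in,
    which is one simple way to link dictionary entries to corpus examples.
    """
    word_index = defaultdict(list)
    morpheme_index = defaultdict(list)
    for tok in tokens:
        word_index[(tok.word, tok.gloss)].append(tok.sentence_id)
        for morph in tok.morphs:
            morpheme_index[(morph.form, morph.gloss)].append(tok.sentence_id)
    return dict(word_index), dict(morpheme_index)


# Placeholder annotations, not real Megrelian data.
tokens = [
    AnnotatedToken("s1", "rootA-sfx", "thing-PL",
                   (Morph("rootA", "thing"), Morph("-sfx", "PL"))),
    AnnotatedToken("s2", "rootB-sfx", "other-PL",
                   (Morph("rootB", "other"), Morph("-sfx", "PL"))),
]
words, morphemes = build_indexes(tokens)
```

In this toy data each word form occurs once, while the shared suffix entry collects both sentence identifiers, which is the kind of occurrence information a morpheme-oriented entry can expose.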

The paper is divided into five parts: 1. Introduction, outlining the significance of Megrelian as part of the Kartvelian language family and introducing the project dedicated to the documentation of the Megrelian language; 2. Background and Data Collection, providing an overview of the existing Megrelian dictionaries and presenting the data collection stages; 3. Annotation and Corpus Development, describing the data annotation and processing stages and giving information on corpus size, linguistic coverage, etc.; 4. The Dictionaries - Design and Generation, presenting the configurations for both the lexeme-based and morpheme-based dictionaries, describing the export and conversion stages in detail, and outlining the linkage between the corpus and the dictionary entries; and 5. Conclusions, Challenges and Future Work, summarising the corpus-based lexicographic approach to the Megrelian language, briefly describing the ongoing challenges, and outlining future plans for the use and potential improvement of the data.
