Nov 17 – 20, 2025
Bled, Slovenia
Europe/Ljubljana timezone

GramatiKat: A Corpus-Based Tool for Detecting Morphological Anomalies and Paradigm Variation

Nov 18, 2025, 12:00 PM
30m
Zrak hall

Speaker

Dominika Kovarikova

Description

GramatiKat is a freely accessible online application designed to support lexicographic and grammatical work on morphologically rich languages. It provides grammatical profiles, i.e. frequency distributions of a lemma’s inflected forms, for thousands of Czech nouns, adjectives, and verbs, based on large annotated corpora. The concept of grammatical profiling is rooted in the work of Janda and Lyashevskaya (2011), who demonstrated that the distribution of inflected forms can reflect both the grammatical structure and the semantic properties of lexemes. In GramatiKat, these profiles are compared against a statistically computed Reference Grammatical Profile (RGP), which captures the expected distribution of forms for a given part of speech (Kováříková & Nikolaev, in preparation). This allows users to see immediately whether a given word follows the expected distributional pattern or deviates from it in meaningful ways. Such deviations can signal lexicographically relevant features such as semantic anomalies or collocational behaviour (e.g. participation in multi-word terms, idioms, or other multi-word units).
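As a rough illustration of what a grammatical profile is, the following minimal Python sketch turns raw corpus counts of a noun's case forms into relative frequencies. The counts, the restriction to singular case cells, and the function name `grammatical_profile` are assumptions made for the example; they are not taken from GramatiKat or from the SYN corpora.

```python
# Hypothetical inventory of cells: the seven Czech cases, singular only.
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]


def grammatical_profile(form_counts):
    """Return the relative frequency of each case form for one lemma."""
    total = sum(form_counts.get(case, 0) for case in CASES)
    if total == 0:
        raise ValueError("lemma has no attested forms")
    # Unattested cells are kept with a proportion of 0.0 so that
    # missing forms stay visible in the profile.
    return {case: form_counts.get(case, 0) / total for case in CASES}


# Invented counts for an imaginary noun (not real corpus data).
counts = {"nom": 10, "gen": 20, "acc": 30, "ins": 40}
profile = grammatical_profile(counts)

for case in CASES:
    print(f"{case}: {profile[case]:.2f}")
```

A profile of this kind can then be set side by side with the expected proportions for the whole part of speech, which is the role the RGP plays in the tool.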

The information in GramatiKat is derived from two representative corpora of contemporary written Czech, SYN2015 and SYN2020 (each containing 100 million words). Deviations from the norm, i.e. forms that are unusually frequent, unusually infrequent, or entirely missing, are automatically highlighted using standard boxplot methodology (Kováříková & Kovářík 2023). Such anomalies can point to a wide range of lexicographically relevant information, including semantic constraints, syntactic preferences, or idiomatic usage, all of which are valuable both for dictionary authors and for their audiences, particularly language learners.
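The boxplot idea can be sketched in a few lines of Python. The example below assumes the common Tukey convention of fences at 1.5 times the interquartile range; the exact thresholds used in GramatiKat are not stated here, and the reference proportions, the labels, and the function names are invented for illustration.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) boxplot fences for a list of proportions."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(value, reference_values, k=1.5):
    """Label one lemma's proportion for a cell against the reference sample."""
    if value == 0.0:
        return "missing"
    low, high = tukey_fences(reference_values, k)
    if value > high:
        return "unusually frequent"
    if value < low:
        return "unusually infrequent"
    return "within expected range"


# Invented proportions of one case form across a sample of nouns.
reference = [0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34, 0.36]

for proportion in (0.95, 0.28, 0.01, 0.0):
    print(proportion, classify(proportion, reference))
```

Under these assumptions, a noun that occurs almost only in one case form lands far above the upper fence for that cell, which is the kind of pattern the tool highlights.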

The value of the tool for lexicographers is twofold. First, it offers empirical support for deciding whether certain grammatical forms should be included, exemplified, or specially marked in a dictionary entry. For instance, the noun brva ‘eyelash’ appears almost exclusively in the instrumental singular, as part of the idiom nepohnout ani brvou (‘not to bat an eyelash’), which suggests that it is effectively defective in other forms (Kováříková et al. 2024). This is information that should be included in the dictionary. Second, even when no overt anomaly is present, the grammatical profile provides a reliable picture of how a word behaves in real usage, for example by showing which grammatical roles it typically fills (nominative for subject, accusative for object). This supports more nuanced dictionary descriptions in line with corpus-driven approaches that aim to derive linguistic generalizations directly from data (Tognini-Bonelli 2001).

From a technical perspective, GramatiKat lowers the barrier to corpus-based grammatical analysis by offering fully preprocessed, transparent, and reproducible data visualizations. The interface supports interactive exploration, filtering, and data export, making it accessible even to those without programming skills. The tool has already been successfully adapted to Slovak and Croatian, demonstrating that, given sufficient high-quality corpus data, the approach is transferable to other morphologically rich languages. Its development is grounded in principles of Open Science and reproducible research (Chromý & Cvrček 2021).

By combining grammatical profiling with robust statistical interpretation, GramatiKat equips lexicographers with a precise and efficient method for exploring morphological behaviour across the lexicon. The presentation will illustrate the tool’s functionality through real-world examples, showing both regular and anomalous grammatical profiles, and discussing how these can inform dictionary writing, editing, and revision.
