8–12 Oct 2024
Hotel Croatia
Europe/Warsaw timezone

Creating the Dataset of Croatian Verbal Idioms: Automatic Identification in a Corpus and Lexicographic Implementation

11 Oct 2024, 14:00
30m
Šipun Hall (Hotel Croatia)

Speakers

Ivana Filipović Petrović (Croatian Academy of Sciences and Arts)
Kristina Kocijan

Description

This research advances the automatic identification and analysis of verbal idioms in Croatian. The NooJ automated text processing tool, combined with the MaCoCu corpus and the Online Dictionary of Croatian Idioms (ODCI), provides a robust framework for recognizing and categorizing these multi-word expressions (MWEs). The research comprises two parts: (a) creation of a dataset, in which a set of 898 verbal idioms was compiled from the ODCI and annotated with linguistic features, including structure, morphological features, and variation patterns; (b) analysis of the extracted data, which provides insights into the lexicographic and linguistic properties of the idioms, such as variability, modification, and frequency of use. The study highlights the challenges posed by idiomatic variation and the verb’s role as the most variable component of an idiom. For instance, the idiom “soliti pamet komu” (to give unsolicited advice) is often modified for expressiveness, as in the phrase “having a big saltshaker to salt everyone’s mind.” The dataset is intended for lexicographic integration into the ODCI and supports the creation of electronic language resources. It also contributes to theoretical and cross-lingual research, with the CLARIN repository expected to enhance the data’s reusability in NLP. The study’s findings offer a deeper understanding of the dynamics of verbal idioms and of their computational processing.
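
To illustrate why variation complicates automatic identification, the following is a minimal Python sketch of lemma-based matching that tolerates inserted words between idiom components. It is not the NooJ grammar used in the study. The `IdiomEntry` record, the `find_idiom` function, the `max_gap` parameter, and the hand-lemmatized example sentence are all hypothetical and only approximate the kind of information a dataset entry might carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdiomEntry:
    """Hypothetical dataset record: citation form plus component lemmas."""
    citation_form: str
    component_lemmas: tuple[str, ...]


def find_idiom(entry, lemmas, max_gap=3):
    """Return (start, end) index pairs where the component lemmas occur
    in order, with at most max_gap other tokens between two components.

    Assumption: the input is already lemmatized, so inflected verb forms
    (the most variable component) are reduced to one lemma.
    """
    matches = []
    first, rest = entry.component_lemmas[0], entry.component_lemmas[1:]
    for start, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        pos = start
        found = True
        for component in rest:
            # Look only in the tokens right after the previous component.
            window = lemmas[pos + 1 : pos + 2 + max_gap]
            if component not in window:
                found = False
                break
            pos = pos + 1 + window.index(component)
        if found:
            matches.append((start, pos))
    return matches


# Toy entry for the idiom cited in the abstract (placeholder "komu" omitted).
entry = IdiomEntry("soliti pamet komu", ("soliti", "pamet"))

# Hand-lemmatized toy sentence "Stalno mi soli pamet." (assumed lemmas).
sentence_lemmas = ["stalno", "ja", "soliti", "pamet"]
print(find_idiom(entry, sentence_lemmas))  # [(2, 3)]
```

This sketch matches components only in a fixed order. Handling Croatian word-order freedom and creative modifications such as the saltshaker example would require richer patterns than shown here.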

Co-authors

Ivana Filipović Petrović (Croatian Academy of Sciences and Arts)
Kristina Kocijan
