Description
POSTER
As part of the COST Action CA21167 Universality, Diversity and Idiosyncrasy in Language Technology (UniDive), the ELEXIS-WSD Parallel Sense-Annotated Corpus (Martelli et al., 2021; Čibej et al., 2025) is being expanded to include subcorpora in additional languages, Croatian among them, as well as new annotation layers. Each language subcorpus of ELEXIS-WSD contains the same 2,024 sentences extracted from WikiMatrix (Schwenk et al., 2019).
The corpus was initially translated from English using two machine translation platforms: Google Translate and Hrvojka (https://hrvojka.gov.hr/). The translations then underwent a two-step manual validation process: first, the more suitable translation was selected for each sentence and its errors were corrected; second, the final versions were reviewed for the accuracy of term equivalents and idiomatic expressions. The resulting set was then automatically tokenized, lemmatized, and POS-tagged, and these automatic annotations are currently undergoing manual correction.
The next phase involves creating an open-source sense repository for Croatian, which is being developed based on an existing pedagogical dictionary (Authors, 2025). The repository will be enriched through a combination of manual and automated methods, including the use of large language models (LLMs) to define missing senses. Since domain-specific terms and certain multiword expressions (MWEs) (Odijk, 2013) posed challenges for the tested translation platforms, a new evaluation task was conducted to assess the competence of LLMs in translating MWEs. The underlying hypothesis was that if an LLM could successfully translate MWEs from English into Croatian, it should also be capable of adequately identifying and defining their senses. Some studies have shown that LLMs perform particularly well in the semantic interpretation of MWEs (Gantar, 2024).
Each English sentence was automatically translated in a separate prompt using an adapted pipeline for two large language models: ChatGPT-4o and the recently developed Slovene GaMS-9B-Instruct (https://huggingface.co/cjvt/GaMS-9B-Instruct). A preliminary evaluation was conducted on the first 200 sentences. As the translations generated by the GaMS-9B-Instruct model contained a significant number of Serbian lexical items (e.g., fudbal, holandski napadač, spoljni stručnjaci instead of nogomet ‘football’, nizozemski napadač ‘Dutch striker’, vanjski stručnjaci ‘outside experts’), this set of translations was excluded from further evaluation. Five linguists then compared the ChatGPT-4o translations with the manually validated automatic translations, and marked differences.
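The screening for Serbian lexical items described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the lookup table contains only the three word pairs quoted in the text, the function name flag_serbian_items is hypothetical, and the check matches exact word forms only, so inflected forms would be missed without lemmatization.

```python
import re

# Illustrative Serbian -> Croatian pairs, limited to the examples quoted in the text.
# A real screening step would need a much larger, curated lexicon (assumption).
SERBIAN_TO_CROATIAN = {
    "fudbal": "nogomet",
    "holandski": "nizozemski",
    "spoljni": "vanjski",
}


def flag_serbian_items(sentence):
    """Return (found_form, expected_croatian_form) pairs for one sentence.

    Matching is done on lowercased surface forms, so inflected variants
    of the listed words are not detected.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    return [
        (token, SERBIAN_TO_CROATIAN[token])
        for token in tokens
        if token in SERBIAN_TO_CROATIAN
    ]


# Constructed example sentence, not taken from the corpus.
print(flag_serbian_items("Holandski napadač igra fudbal."))
```

A translation set in which many sentences return a non-empty list would be a candidate for exclusion, as was done here for the GaMS-9B-Instruct output.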
This paper presents an analysis of the most common differences between the automatic translation of MWEs from English into Croatian by an LLM and the human-validated machine translation. In this translation task, ChatGPT-4o handles MWEs with a high level of proficiency compared with its predecessors. Differences between the compared translations include: a) wrong terminological equivalents (e.g., medicinski uvjeti / medicinska stanja ‘medical conditions’, Bézierove površine / Bézierove plohe ‘Bézier surfaces’); b) differences at the morphosyntactic level (Otto nagrada / nagrada Otto ‘Otto Award’; riževi nemiri / rižini nemiri ‘rice protest’); c) English-influenced literal translations, mostly in verbal MWEs (uzeti ime / dobiti ime ‘take its name’, častiti kao sveca / štovati kao sveca ‘honour as a saint’); d) the treatment of metaphorical MWEs (pod protestom / u znak protesta ‘under protest’, proces se raspada / proces se urušava ‘the process breaks down’); and e) named entities, which pose a challenge in other languages as well (Krstev et al., 2024). The provisional typology will be used in developing templates for defining MWEs in the sense repository for Croatian.
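As an illustration of how differing spans between two translation variants might be pre-marked automatically before a linguist classifies them, the following Python sketch aligns the variants token by token using the standard-library difflib module. It is an assumption-laden sketch, not the comparison method used in the study, where the differences were marked manually; the function name token_differences is hypothetical, and the inputs are the variant pairs quoted above.

```python
from difflib import SequenceMatcher


def token_differences(first, second):
    """Return (span_in_first, span_in_second) pairs where the variants differ.

    Both strings are split on whitespace and aligned token by token;
    an empty string in a pair marks a pure insertion or deletion.
    """
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Variant pairs quoted in the typology above.
print(token_differences("uzeti ime", "dobiti ime"))
print(token_differences("pod protestom", "u znak protesta"))
```

Such a diff only locates where the variants diverge; assigning a difference to one of the categories a) to e) still requires linguistic judgment.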