Description
While CEFR-aligned vocabulary profiles have been developed for many languages (e.g., English, German, and Swedish), Ukrainian as a foreign language (UFL) still lacks an empirically grounded lexical profile. A foundational issue in creating such profiles is combining lexical frequency data with expert knowledge to assign CEFR-level labels. Existing UFL word lists rely primarily on professional expertise rather than systematic data analysis. The development of a Ukrainian vocabulary profile is further complicated by the prevalence of level-straddling textbooks, significant variability of vocabulary across learning materials, and the inherent inflectional complexity of the language. We aim to bridge these gaps by developing a graded word list for UFL learners (CEFR levels A1–C2), using a comprehensive, data-based approach to vocabulary classification.
To this end, we have constructed a one-million-word corpus based on 21 UFL textbooks (A1–C2) using Ukrainian NLP tools and resources, namely the NLP-UK toolkit (github.com/brown-uk/nlp_uk) and the VESUM dictionary (vesum.nlp.net.ua), for automatic tokenization, lemmatization, and morphological tagging. The corpus has yielded a word list of 37,087 lemmas for which both frequency and distributional data (across levels and textbooks) were recorded. This dataset has enabled us to analyze lexical frequency, dispersion, and variability across a representative selection of UFL textbooks.
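As a rough illustration of the kind of record described above, the following minimal sketch tallies, for each lemma, its total frequency, the CEFR levels at which it occurs, and the number of textbooks containing it. The input format (lemma, level, textbook ID triples), the function name `lemma_stats`, and the sample data are assumptions made for illustration; they do not reflect the actual output of the NLP-UK toolkit or the project's data model.

```python
from collections import Counter, defaultdict


def lemma_stats(tokens):
    """Aggregate frequency and distribution data per lemma.

    `tokens` is an iterable of (lemma, level, textbook_id) triples,
    a hypothetical format assumed here for illustration.
    """
    freq = Counter()
    levels = defaultdict(set)
    books = defaultdict(set)
    for lemma, level, book in tokens:
        freq[lemma] += 1
        levels[lemma].add(level)
        books[lemma].add(book)
    return {
        lemma: {
            "freq": freq[lemma],
            # CEFR labels (A1 ... C2) happen to sort correctly as strings.
            "levels": sorted(levels[lemma]),
            "n_books": len(books[lemma]),
        }
        for lemma in freq
    }


# Invented sample data, not taken from the corpus.
sample = [
    ("книга", "A1", "tb01"),
    ("книга", "A1", "tb02"),
    ("книга", "A2", "tb03"),
    ("вплив", "B2", "tb10"),
]
stats = lemma_stats(sample)
print(stats["книга"])
```

The number of textbooks in which a lemma occurs is only the simplest possible dispersion measure; the abstract does not say which measure the project uses.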
A second data source was general-language corpora: we have analyzed lemma frequency data from two Ukrainian corpora (GRAC and BRUK). By integrating frequency data from all three corpora (the textbook corpus, GRAC, and BRUK) with UFL expert analysis, we have assigned a CEFR level to each lexical item and categorized the items by part of speech and communicative topic. Crucially, we have applied the significant onset of use approach (Alfter et al., 2016) to address inconsistencies in existing Ukrainian learning materials and achieve a reliable classification.
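The general idea behind an onset-based label can be sketched as follows: a lemma is assigned the first CEFR level at which its use in the textbook corpus becomes substantial, rather than the level of its first stray occurrence. The sketch below is a simplified stand-in, not the procedure of Alfter et al. (2016): it replaces any significance criterion with a fixed per-million threshold, and the function name `onset_level`, the parameter `min_per_million`, and all numbers are hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def onset_level(counts, level_sizes, min_per_million=50.0):
    """Return the first level at which a lemma's normalized frequency
    reaches the threshold, or None if it never does.

    counts:      raw frequency of the lemma per level, e.g. {"A1": 1}
    level_sizes: total token count of the subcorpus for each level
    The fixed threshold is an assumption standing in for a
    significance criterion.
    """
    for level in LEVELS:
        size = level_sizes.get(level, 0)
        if size == 0:
            continue  # no material at this level
        rate = counts.get(level, 0) / size * 1_000_000
        if rate >= min_per_million:
            return level
    return None


# Invented figures for illustration only.
level_sizes = {"A1": 100_000, "A2": 150_000, "B1": 200_000}
counts = {"A1": 1, "A2": 12, "B1": 40}
print(onset_level(counts, level_sizes))
```

In this example the single A1 occurrence is treated as noise, and the lemma is labeled at the level where it first appears with a non-trivial normalized frequency. This is how an onset criterion can smooth over inconsistencies between level-straddling textbooks.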
The paper outlines the methodology for vocabulary extraction, exploration, and profiling. Expert decision-making follows a two-stage CEFR alignment process to ensure accuracy, consistency, and pedagogically relevant progression. In the external alignment stage, experts independently assign proficiency levels to words. In the internal alignment stage, these assignments are refined by analyzing words within semantic and derivational clusters. This approach proves particularly effective for languages with complex morphology like Ukrainian.
A CEFR-labeled vocabulary profile, developed through in-depth lexical analysis, is published on the PULS platform (puls.peremova.org). It currently comprises 5,891 lexical items, with a target of 10,000 lemmas. The profile is designed as a digital learning resource with lexical database functionality, allowing word list extraction by CEFR level, thematic group, and part of speech. Currently, A1 and A2 vocabulary items are available, with higher levels in progress. This profile serves as the foundation for the prospective Ukrainian Learner’s Dictionary (ULD), which will include detailed lexical entries with part of speech, CEFR label, thematic group, definition at the level of individual senses, corpus-based examples, pronunciation (audio), English equivalents, pictorial illustrations where relevant, and semantic and derivational relations.
The PULS platform fills a critical gap in creating a comprehensive learning system for UFL. Its central component, the Ukrainian Learner’s Dictionary, is the first-ever CEFR-labeled corpus-based UFL reference source that will serve the needs of learners, educators, material creators, and proficiency test designers.