Description
The objective of this paper is to illustrate, through the examination of sample entries, the methodology employed in creating a pilot corpus-based dictionary of Serbian as a second language, drawing on advances made in similar projects for other languages (e.g., François et al., 2014; François et al., 2016; Klemen et al., 2023). While Serbian is the official language across the entire territory of the Republic of Serbia, in specific municipalities with a substantial population of national minorities, languages other than Serbian are officially recognized and spoken as native languages. In municipalities where the majority of the population belongs to a national minority, schooling is conducted in the language of that minority. In such cases, Serbian is taught as a second language, with a curriculum comprising 90 minutes of classes per week. These language classes follow two distinct formats, depending on how much interaction young members of national minorities have with native Serbian speakers. Programs A and B are designed to take these differences into account (Krajišnik & Strižak, 2018). Program A is tailored for members of national minorities residing in homogeneous environments, where students lack direct contact with the Serbian language. In these cases, Serbian is treated and taught as a foreign language. Conversely, Program B is aimed at members of national minorities living in heterogeneous environments, where they are consistently exposed to native Serbian speakers and possess an intermediate or high level of competency in Serbian even at the elementary school level. Additionally, there are guidelines for distinguishing a Program C from Program B (Redli, 2023). Program C would be intended for members of the Croatian and Bosniak minorities with near-native proficiency in Serbian.
The dictionary described here is intended for young students learning Serbian as a second language in accordance with Programs A and B, excluding the final category of speakers in Program B (those who would fall under the proposed Program C). The outlined methodology comprises three key phases. The first phase is the compilation of a receptive electronic corpus of Serbian as a second language (SrbL2Cor 1.0), derived from 24 elementary-school textbooks from two publishers. This corpus is stored in the ParCoLab database (Miletic et al., 2017) in XML format, adhering to the TEI P5 Guidelines, but access is restricted due to ongoing copyright negotiations. The corpus is also lemmatised, morpho-syntactically annotated, and syntactically parsed using Serbian language resources developed within the ParCoLab project. The second phase is the selection of vocabulary lists for lexicographic processing, which entails both automatic extraction from the SrbL2Cor 1.0 corpus and manual revision of the extracted lists against the official non-corpus-based lists recommended for creating Serbian as a second language programs (Krajišnik & Dognar, 2018; 2019). The third phase involves establishing a pilot XML dictionary database that will incorporate 500 lexicographically processed lexical items from the specified vocabulary lists. Besides the lemma entry, the lexicographic processing includes inflection data, usage labels, senses with their indicators, native language equivalents for the intended dictionary users, and typical syntactic behavior demonstrated through slightly modified corpus examples. Upon project completion, part of this database will be accessible for free consultation in the updated multilingual dictionary module of the Serbian verb conjugator, SerboVerb (Marjanović, 2023), developed in conjunction with the ParCoLab project.
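The second phase, in which automatically extracted lists are revised against the official lists, can be pictured with a minimal sketch. It assumes the corpus has already been reduced to a flat sequence of lemmas; the function name `candidate_lemmas`, the `min_freq` threshold, and the toy data are hypothetical and are not taken from SrbL2Cor 1.0 or the project's tooling.

```python
from collections import Counter


def candidate_lemmas(lemmas, official_list, min_freq=2):
    """Sort lemmas into review buckets by corpus frequency and official-list membership.

    All names and the threshold are illustrative assumptions, not the project's pipeline.
    """
    freq = Counter(lemmas)
    frequent = {lemma for lemma, count in freq.items() if count >= min_freq}
    official = set(official_list)
    return {
        "both": sorted(frequent & official),            # attested and officially recommended
        "corpus_only": sorted(frequent - official),     # attested, missing from official list
        "official_only": sorted(official - frequent),   # recommended, rare or absent in corpus
    }


# Toy data, invented for illustration only.
corpus_lemmas = ["kuća", "škola", "kuća", "pas", "škola", "knjiga", "pas", "drvo"]
official = ["kuća", "škola", "mačka"]
buckets = candidate_lemmas(corpus_lemmas, official)
print(buckets)
```

The three buckets correspond to the cases a lexicographer would presumably inspect during manual revision: items confirmed by both sources, items attested only in the textbooks, and items recommended officially but not sufficiently attested.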
The paper compares this novel corpus-based processing method, whose article structure is informed by pedagogical considerations (e.g., Jelaska, 2005; Krajišnik, 2011), with the approaches found in the few existing, outdated bilingual and monolingual paper dictionaries for Serbian as a second language (cf. Jerković & Perinac, 1980; Vasić & Jocić, 1988–1989; Ajdžanović et al., 2016), and highlights its advantages. The innovation in the lexicographic processing of Serbian in this dictionary project lies in manually annotating lemmata, their senses, and their examples with CEFR labels, relying on corpus data rather than intuition. This allows the extraction of specific corpus-based dictionaries tailored to the corresponding CEFR levels and the primary target group. In Program A, a distinction is drawn between levels A1 and A2, guided by frequency of occurrence in SrbL2Cor 1.0 and by lexical relevance for a specific age group of students, as stipulated by the official, non-corpus-based vocabulary lists (Krajišnik & Dognar, 2018; 2019). Likewise, in Program B, levels A2, B1, and B2 are differentiated using the same criteria.
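As an illustration of how CEFR labels on senses could support the extraction of level-specific dictionaries, the following sketch filters a small set of entries by a maximum level. The entry layout (plain Python dictionaries rather than the project's XML), the function name `extract_sublexicon`, and the level assigned to each sense are assumptions made for this example only.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def extract_sublexicon(entries, max_level):
    """Keep senses labelled at or below max_level; drop entries left without senses."""
    cutoff = CEFR_ORDER.index(max_level)
    result = []
    for entry in entries:
        senses = [s for s in entry["senses"] if CEFR_ORDER.index(s["cefr"]) <= cutoff]
        if senses:
            # Copy the entry so the full database is left unchanged.
            result.append({**entry, "senses": senses})
    return result


# Invented sample entries; the CEFR labels are placeholders, not project data.
entries = [
    {"lemma": "kuća", "senses": [{"gloss": "house", "cefr": "A1"},
                                 {"gloss": "household", "cefr": "B1"}]},
    {"lemma": "zakon", "senses": [{"gloss": "law", "cefr": "B2"}]},
]
a2_dictionary = extract_sublexicon(entries, "A2")
print([e["lemma"] for e in a2_dictionary])
```

Filtering at the level of senses rather than whole lemmata reflects the statement above that senses and examples, not only lemmata, carry CEFR labels; a Program A dictionary would then use a cutoff of A2 and a Program B dictionary a cutoff of B2.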